
Amazon Glacier
In short, Amazon Glacier is a service with a very attractive price tag, designed for storing archives and backups. The process of restoring archives, however, is rather complicated and/or expensive. Still, the service is quite suitable for secondary backup.
More about Glacier has already been written on Habré.
What this post is about
I want to share an open source Perl client for synchronizing a local directory with the Glacier service, tell about some nuances of working with Glacier, and describe how the client works.
Features
So, mtglacier (GitHub link). Features of the program:
- The protocol for working with AWS is implemented independently, without third-party libraries: an independent implementation of the Amazon Signature Version 4 signing process and of the Tree Hash calculation (a sketch of the signing-key derivation follows after this list)
- Implemented Multipart Upload
- Multi-threaded upload, download, and deletion of archives
- Combining multi-threaded and multipart upload, i.e. three threads can upload parts of file A, two more parts of file B, another one can initiate the upload of file C, etc.
- Local journal file (opened for writing only in append mode); after all, getting a file listing from Glacier means waiting 4 hours
- Ability to compare checksums of local files with those recorded in the journal
- When downloading archives, the ability to limit the number of files retrieved at a time (necessary because of how archive retrieval is billed)
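Since the Signature Version 4 process is implemented by hand, here is a minimal sketch of just the signing-key derivation step, as described in the AWS documentation, using only Digest::SHA; the key, date and string-to-sign values below are placeholders, not anything taken from mtglacier:
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256 hmac_sha256_hex);

# Placeholder credentials and request data -- not real values.
my $secret         = 'YOURSECRET';
my $date           = '20121201';       # YYYYMMDD from the request timestamp
my $region         = 'us-east-1';
my $service        = 'glacier';
my $string_to_sign = '...';            # canonical string built from the request

# AWS Signature Version 4: derive the signing key via a chain of HMAC-SHA256.
my $k_date    = hmac_sha256($date,          'AWS4' . $secret);
my $k_region  = hmac_sha256($region,        $k_date);
my $k_service = hmac_sha256($service,       $k_region);
my $k_signing = hmac_sha256('aws4_request', $k_service);

# The final signature is the hex HMAC of the string-to-sign with the signing key.
my $signature = hmac_sha256_hex($string_to_sign, $k_signing);
print "$signature\n";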
Objective of the project
Implement four things in one program (each of them exists separately, but not all together):
- Implementation in Perl: I believe the language/technology a program is written in also matters to the end user/administrator, so it is better to have a choice of implementations in different languages.
- Amazon S3 support is definitely planned.
- Multipart operations + multithreaded operations: multipart upload helps avoid the situation where you have uploaded several gigabytes to a remote server and suddenly the connection breaks; multithreading speeds up transfers, and significantly speeds up uploading a heap of small files or deleting a large number of files.
- Own implementation of the protocol: the code is planned to be made reusable and published as separate modules on CPAN
How it works
When synchronizing files to the service, mtglacier creates a journal (a plain text file) in which all file upload operations are recorded: for each operation, the local file name, the upload time, the Tree Hash of the file, and the archive_id received from Glacier.
When restoring files from Glacier to the local disk, the data for recovery is taken from this journal (since a file listing from Glacier can only be obtained with a delay of four hours or more).
When deleting files from Glacier, deletion entries are added to the journal. When re-synchronizing to Glacier, only files that, according to the journal, are not yet there are processed.
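The exact journal format is internal to mtglacier, but the append-only idea can be pictured roughly like this (the field set and names here are purely hypothetical, for illustration only):
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration of appending a journal record;
# the real mtglacier journal format may differ.
sub journal_append {
    my ($journal_file, $operation, $relfilename, $archive_id, $treehash) = @_;

    # Open strictly in append mode so existing records are never rewritten.
    open my $fh, '>>', $journal_file or die "cannot open $journal_file: $!";
    print $fh join("\t", time(), $operation, $relfilename, $archive_id, $treehash), "\n";
    close $fh or die "cannot close $journal_file: $!";
}

# Example: record a finished upload and, later, a deletion of the same file.
journal_append('journal.log', 'CREATED', 'photos/a.jpg', 'EXAMPLE_ARCHIVE_ID', 'EXAMPLE_TREE_HASH');
journal_append('journal.log', 'DELETED', 'photos/a.jpg', 'EXAMPLE_ARCHIVE_ID', '');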
File recovery is a two-pass procedure:
- A job is created to retrieve the files that are present in the journal but not on the local disk
- After waiting four hours or more, a second command downloads these files
Cautions
- Check Amazon Pricing before use.
- Study the Amazon FAQ
- No, really, be sure to check prices.
- Before you play with Glacier, make sure you will be able to delete all the archives afterwards. The fact is that you cannot delete archives or a non-empty vault through the Amazon Console
- When working with Glacier in multiple threads, you need to pick the optimal number of threads. Amazon returns HTTP 408 Timeout if the transfer in an individual thread is too slow (after that the mtglacier thread pauses for one second and retries, but no more than five times; see the sketch after this list). So a multi-threaded transfer may turn out slower than a single-threaded one
- For now this is a Beta version; it is not recommended for use in production
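The retry behaviour mentioned in the cautions above can be sketched roughly like this (a simplified illustration, not the actual mtglacier code; upload_with_retries and the response hash are made up for the example):
#!/usr/bin/perl
use strict;
use warnings;

use constant MAX_ATTEMPTS => 5;

# Simplified sketch of the policy described above: on HTTP 408 Timeout,
# pause for one second and try again, but no more than five times.
sub upload_with_retries {
    my ($do_request) = @_;    # code ref performing one HTTP attempt

    for my $attempt (1 .. MAX_ATTEMPTS) {
        my $response = $do_request->();           # e.g. { code => 200, ... }
        return $response if $response->{code} != 408;
        sleep 1;                                  # too slow -- wait and retry
    }
    die "giving up after " . MAX_ATTEMPTS . " attempts\n";
}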
How to use
- Create a Vault in Amazon Console
- Create glacier.cfg (specify the same region in which the vault was created)
key = YOURKEY
secret = YOURSECRET
region = us-east-1
- Sync local directory to Glacier. Use the concurrency parameter to specify the number of threads.
./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3
- You can add files and sync again
- Check the integrity of the files (they are checked against the journal only)
./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- You can delete some files from /data/backup
- Create a data recovery job. Use the max-number-of-files parameter to specify the number of archives you want to restore. Currently it is not recommended to specify a value greater than a few dozen (fetching more than one page of the current jobs listing is not implemented yet)
./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10
- Wait 4 hours or more
- Recover deleted files
./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- If the backup is no longer needed, delete all files from Glacier
./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
Implementation
- HTTP/HTTPS operations are implemented through LWP::UserAgent
- Interaction with the Amazon API is written from scratch; the only module used is Digest::SHA (a core module)
- Independent implementation of the Amazon Tree Hash (this is their own checksum algorithm; I did not find anything similar among TTH algorithms, correct me if I am wrong; see the sketch after this list)
- Multithreading is implemented using forked processes
- Abnormal process termination is handled using signals
- Inter-process communication goes through unnamed pipes
- The task queue is built from OOP objects resembling finite state machines (FSM)
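For reference: the Tree Hash, as described in the Amazon documentation, splits the file into 1 MiB chunks, hashes each chunk with SHA-256, and then combines the hashes pairwise level by level until a single root hash remains. A minimal sketch using only Digest::SHA (not the actual mtglacier code):
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256);

# Sketch of the Amazon Glacier Tree Hash calculation.
sub tree_hash_hex {
    my ($filename) = @_;

    open my $fh, '<', $filename or die "cannot open $filename: $!";
    binmode $fh;

    # Level 0: SHA-256 of every 1 MiB chunk of the file.
    my (@level, $chunk);
    while (read($fh, $chunk, 1024 * 1024)) {
        push @level, sha256($chunk);
    }
    close $fh;
    @level = (sha256('')) unless @level;    # degenerate case: empty file

    # Combine hashes pairwise until a single root remains;
    # an unpaired hash at the end of a level is carried up unchanged.
    while (@level > 1) {
        my @next;
        while (my ($left, $right) = splice(@level, 0, 2)) {
            push @next, defined $right ? sha256($left . $right) : $left;
        }
        @level = @next;
    }
    return unpack 'H*', $level[0];
}

print tree_hash_hex($ARGV[0]), "\n";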
What is still missing
The main goal for the first beta was a stable (not alpha) version ready by the end of the week, so a lot of things are still missing.
There will definitely be:
- Adjustable chunk size
- Non-multipart upload
- An SNS notifications topic in the config
- Integration with the outside world to receive a signal via SNS notifications
- Internal refactoring
- Unit tests will be published when I put them in order.
- There will be another test suite, perhaps a Glacier server emulator
- Of course there will be a production-ready version (not Beta)