
Amazon Glacier
In short, Amazon Glacier is a service with a very attractive price tag, designed for storing archives and backups. The process of restoring archives, however, is rather complicated and/or expensive. Still, the service is quite suitable for secondary backup.
More about Glacier has already been written on Habré.
What this post is about
I want to share an open source Perl client for synchronizing a local directory with the Glacier service, tell about some nuances of working with Glacier, and describe how the client works.
Features
So, mtglacier (GitHub link). Features of the program:
- The protocol for working with AWS is implemented independently, without third-party libraries: an independent implementation of the Amazon Signature Version 4 signing process and of the Tree Hash calculation (a sketch of the signing-key derivation follows after this list)
- Implemented Multipart Upload
- Multi-threaded upload, download, and deletion of archives
- Combining multi-threaded and multipart upload, i.e. three threads can upload parts of file A, two more parts of file B, another one can initiate the upload of file C, etc.
- Local journal file (opened for writing only in append mode); after all, getting a file listing from Glacier means waiting 4 hours
- Ability to compare checksums of local files with those recorded in the journal
- When downloading archives, the ability to limit the number of files retrieved at a time (necessary because of how archive retrieval is billed)
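Since the Signature Version 4 process is implemented by hand, here is a minimal sketch of just the signing-key derivation step, as described in the AWS documentation, using only Digest::SHA; the key, date and string-to-sign values below are placeholders, not anything taken from mtglacier:
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256 hmac_sha256_hex);

# Placeholder credentials and request data -- not real values.
my $secret         = 'YOURSECRET';
my $date           = '20121201';       # YYYYMMDD from the request timestamp
my $region         = 'us-east-1';
my $service        = 'glacier';
my $string_to_sign = '...';            # canonical string built from the request

# AWS Signature Version 4: derive the signing key via a chain of HMAC-SHA256.
my $k_date    = hmac_sha256($date,          'AWS4' . $secret);
my $k_region  = hmac_sha256($region,        $k_date);
my $k_service = hmac_sha256($service,       $k_region);
my $k_signing = hmac_sha256('aws4_request', $k_service);

# The final signature is the hex HMAC of the string-to-sign with the signing key.
my $signature = hmac_sha256_hex($string_to_sign, $k_signing);
print "$signature\n";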
Objective of the project
Implement four things in one program (each of them exists separately, but not all together):
- Implementation in Perl: I believe the language/technology a program is written in also matters to the end user/administrator, so it is better to have a choice of implementations in different languages.
- Amazon S3 support is definitely planned.
- Multipart operations + multithreaded operations: multipart upload helps avoid the situation where you have uploaded several gigabytes to a remote server and suddenly the connection breaks; multithreading speeds up transfers, and significantly speeds up uploading a heap of small files or deleting a large number of files.
- Own implementation of the protocol: the code is planned to be made reusable and published as separate modules on CPAN
How it works
When synchronizing files to the service, mtglacier creates a journal (a plain text file) in which all file upload operations are recorded: for each operation, the local file name, the upload time, the Tree Hash of the file, and the archive_id received from Glacier.
When restoring files from Glacier to the local disk, the data for recovery is taken from this journal (since a file listing from Glacier can only be obtained with a delay of four hours or more).
When deleting files from Glacier, deletion entries are added to the journal. When re-synchronizing to Glacier, only files that, according to the journal, are not yet there are processed.
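The exact journal format is internal to mtglacier, but the append-only idea can be pictured roughly like this (the field set and names here are purely hypothetical, for illustration only):
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration of appending a journal record;
# the real mtglacier journal format may differ.
sub journal_append {
    my ($journal_file, $operation, $relfilename, $archive_id, $treehash) = @_;

    # Open strictly in append mode so existing records are never rewritten.
    open my $fh, '>>', $journal_file or die "cannot open $journal_file: $!";
    print $fh join("\t", time(), $operation, $relfilename, $archive_id, $treehash), "\n";
    close $fh or die "cannot close $journal_file: $!";
}

# Example: record a finished upload and, later, a deletion of the same file.
journal_append('journal.log', 'CREATED', 'photos/a.jpg', 'EXAMPLE_ARCHIVE_ID', 'EXAMPLE_TREE_HASH');
journal_append('journal.log', 'DELETED', 'photos/a.jpg', 'EXAMPLE_ARCHIVE_ID', '');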
File recovery is a two-pass procedure:
- A job is created to retrieve the files that are present in the journal but not on the local disk
- After waiting four hours or more, a second command downloads these files
Cautions
- Check Amazon Pricing before use.
- Study the Amazon FAQ
- No, really, be sure to check prices.
- Before you play with Glacier, make sure you will be able to delete all the archives afterwards. The fact is that you cannot delete archives or a non-empty vault through the Amazon Console
- When working with Glacier in multiple threads, you need to pick the optimal number of threads. Amazon returns HTTP 408 Timeout if the transfer in an individual thread is too slow (after that the mtglacier thread pauses for one second and retries, but no more than five times; see the sketch after this list). So a multi-threaded transfer may turn out slower than a single-threaded one
- For now this is a Beta version; it is not recommended for use in production
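The retry behaviour mentioned in the cautions above can be sketched roughly like this (a simplified illustration, not the actual mtglacier code; upload_with_retries and the response hash are made up for the example):
#!/usr/bin/perl
use strict;
use warnings;

use constant MAX_ATTEMPTS => 5;

# Simplified sketch of the policy described above: on HTTP 408 Timeout,
# pause for one second and try again, but no more than five times.
sub upload_with_retries {
    my ($do_request) = @_;    # code ref performing one HTTP attempt

    for my $attempt (1 .. MAX_ATTEMPTS) {
        my $response = $do_request->();           # e.g. { code => 200, ... }
        return $response if $response->{code} != 408;
        sleep 1;                                  # too slow -- wait and retry
    }
    die "giving up after " . MAX_ATTEMPTS . " attempts\n";
}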
How to use
- Create a Vault in Amazon Console
- Create glacier.cfg (specify the same region in which the vault was created)
key = YOURKEY
secret = YOURSECRET
region = us-east-1
- Sync local directory to Glacier. Use the concurrency parameter to specify the number of threads.
./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3
- You can add files and sync again
- Check the integrity of the files (they are checked against the journal only)
./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- You can delete some files from /data/backup
- Create a data recovery job. Use the max-number-of-files parameter to specify the number of archives you want to restore. Currently it is not recommended to specify a value greater than a few dozen (fetching more than one page of the current jobs listing is not implemented yet)
./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10
- Wait 4 hours or more
- Recover deleted files
./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- If the backup is no longer needed, delete all files from Glacier
./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
Implementation
- HTTP/HTTPS operations are implemented through LWP::UserAgent
- Interaction with the Amazon API is written from scratch; the only module used is Digest::SHA (a core module)
- Independent implementation of the Amazon Tree Hash (this is their own checksum algorithm; I did not find anything similar among TTH algorithms, correct me if I am wrong; see the sketch after this list)
- Multithreading is implemented using forked processes
- Abnormal process termination is handled using signals
- Inter-process communication goes through unnamed pipes
- The task queue is built from OOP objects resembling finite state machines (FSM)
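For reference: the Tree Hash, as described in the Amazon documentation, splits the file into 1 MiB chunks, hashes each chunk with SHA-256, and then combines the hashes pairwise level by level until a single root hash remains. A minimal sketch using only Digest::SHA (not the actual mtglacier code):
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256);

# Sketch of the Amazon Glacier Tree Hash calculation.
sub tree_hash_hex {
    my ($filename) = @_;

    open my $fh, '<', $filename or die "cannot open $filename: $!";
    binmode $fh;

    # Level 0: SHA-256 of every 1 MiB chunk of the file.
    my (@level, $chunk);
    while (read($fh, $chunk, 1024 * 1024)) {
        push @level, sha256($chunk);
    }
    close $fh;
    @level = (sha256('')) unless @level;    # degenerate case: empty file

    # Combine hashes pairwise until a single root remains;
    # an unpaired hash at the end of a level is carried up unchanged.
    while (@level > 1) {
        my @next;
        while (my ($left, $right) = splice(@level, 0, 2)) {
            push @next, defined $right ? sha256($left . $right) : $left;
        }
        @level = @next;
    }
    return unpack 'H*', $level[0];
}

print tree_hash_hex($ARGV[0]), "\n";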
What is still missing
The main goal for the first beta was a stable (not alpha) version ready by the end of the week, so a lot of things are still missing.
There will definitely be:
- Adjustable chunk size
- Non-multipart upload
- An SNS notifications topic in the config
- Integration with the outside world to receive a signal via SNS notifications
- Internal refactoring
- Unit tests will be published when I put them in order.
- There will be another test suite, perhaps a Glacier server emulator
- Of course there will be a production-ready version (not Beta)