
Extending Git and Mercurial Repositories with Amazon S3

Surely many of you have heard, or know from your own experience, that version control systems do not get along well with binary files, large files, and especially large binary files. Here and below we are talking about modern, popular distributed version control systems like Mercurial and Git.

Often this does not matter. I don't know whether it is a cause or a consequence, but version control systems are mainly used to store relatively small text files, with the occasional image or library.

If a project uses a large number of high-resolution images, sound files, source files of graphics, 3D, or video editors, then there is a problem. All these files are usually large and binary, which means that all the advantages and conveniences of version control systems and repository hosting services, with all their associated tooling, become unavailable.
Next, using an example, we will look at the integration of version control systems and Amazon S3 (cloud file storage) to take advantage of both solutions and compensate for weaknesses.

The solution is written in C#, uses the Amazon Web Services API, and shows an example configuration for a Mercurial repository. The code is open; the link is at the end of the article. Everything is written more or less modularly, so adding support for something other than Amazon S3 should be easy. I assume that setting it up for Git would be just as easy.

So, it all started with an idea: we need a program that, once integrated with the version control system and with the repository itself, would work completely unnoticed, requiring no additional actions from the user. Like magic.

Integration with a version control system can be implemented using so-called hooks: events to which you can attach your own handlers. We are interested in the ones that fire when data is received from or sent to another repository. Mercurial has suitable hooks called incoming and outgoing. Accordingly, one command has to be implemented for each event: one for uploading updated data from the working folder to the cloud, and the other for the reverse process, downloading updates from the cloud to the working folder.

Integration with the repository is carried out via a metadata file, or index file if you prefer. This file describes all monitored files, or at least contains the paths to them, and it is this file that goes under version control. The monitored files themselves go into .hgignore, the list of ignored files; otherwise the whole point of the idea disappears.

Repository Integration


The metadata file looks something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <assets>
      <locations>
        <location>Content\Textures</location>
        <location>Content\Sounds</location>
        <location searchPattern="*.pdf">Docs</location>
        <location>Reference Libraries</location>
      </locations>
      <amazonS3>
        <accesskey>*****************</accesskey>
        <secretkey>****************************************</secretkey>
        <bucketname>mybucket</bucketname>
      </amazonS3>
      <files>
        <file path="Content\Textures\texture1.dds" checksum="BEF94D34F75D2190FF98746D3E73308B1A48ED241B857FFF8F9B642E7BB0322A"/>
        <file path="Content\Textures\texture1.psd" checksum="743391C2C73684AFE8CEB4A60B0317E634B6E54403E018385A85F048CC5925DE"/>
        <!-- ... -->
      </files>
    </assets>

There are three sections in this file: locations, amazonS3 and files. The first two are configured by the user at the very beginning, and the last one is used by the program itself to track the files themselves.

Locations are the paths that will be searched for tracked files, either absolute or relative to this xml settings file. The same paths need to be added to the version control ignore file so that the version control system itself does not try to keep track of them.
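For the locations from the example configuration above, the ignore file might look roughly like this (an illustrative sketch; the exact patterns depend on your layout):

    syntax: glob
    Content/Textures/**
    Content/Sounds/**
    Docs/*.pdf
    Reference Libraries/**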

AmazonS3 is, as is not hard to guess, the configuration of the cloud file storage. The first two keys are Access Keys, which can be generated for any AWS user; they are used to cryptographically sign requests to the Amazon API. Bucketname is the name of the bucket, an entity inside Amazon S3 that can contain files and folders and will be used to store all versions of the monitored files.

Files does not need to be configured, since this section is edited by the program itself while working with the repository. It contains the list of all files of the current version, with their paths and hashes. Thus, when a pull brings in a new version of this xml file, comparing the contents of the files section with the contents of the monitored folders themselves tells us which files were added, which were changed, and which were simply moved or renamed. During push the comparison is performed in the opposite direction.
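As a rough illustration of how the files section can be read, here is a minimal sketch using LINQ to XML; the class and method names are made up for the example and are not the actual types from the project:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    static class AssetsConfig
    {
        // Reads the <files> section of the metadata file into a path -> checksum map.
        public static Dictionary<string, string> LoadTrackedFiles(string configPath)
        {
            XDocument doc = XDocument.Load(configPath);
            return doc.Root
                      .Element("files")
                      .Elements("file")
                      .ToDictionary(
                          f => (string)f.Attribute("path"),
                          f => (string)f.Attribute("checksum"),
                          StringComparer.OrdinalIgnoreCase);
        }
    }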

Integration with the Version Control System


Now for the commands themselves. The program supports three: push, pull and status. The first two are meant to be attached to the corresponding hooks. Status displays information about the monitored files; its output is similar to that of hg status, showing which files were added to the working folder, changed, moved, and which are missing.

The push command works as follows. First, the list of monitored files, with paths and hashes, is read from the xml file; this is the last state recorded in the repository. Next, information about the current state of the working folder is collected: the paths and hashes of all monitored files. Then the two lists are compared.

There can be four different situations:

  1. The working folder contains a new file. This happens when there is no match by path or by hash. As a result, an entry for the new file is added to the xml file, and the file itself is uploaded to S3.
  2. The working folder contains a modified file. This happens when there is a path match but no hash match. As a result, the hash is updated in the corresponding xml entry, and the updated version of the file is uploaded to S3.
  3. The working folder contains a moved or renamed file. This happens when there is a hash match but no path match. As a result, the path is updated in the corresponding xml entry, and nothing needs to be uploaded to S3. The point is that files are stored in S3 keyed by their hash, and the path information is recorded only in the xml file. Here the hash has not changed, so uploading the same file to S3 again makes no sense.
  4. A monitored file has been deleted from the working folder. This happens when an xml entry has no matching local file, by path or by hash. As a result, that entry is removed from the xml file. Nothing is deleted from S3, since its whole purpose is to store all versions of the files so that you can roll back to any revision.

There is also a fifth possible situation: the file has not been changed. This happens when both the path and the hash match, and no action is required.
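Put together, the comparison for push boils down to two lookups, one by path and one by hash. A condensed sketch of how this classification might look (a hypothetical helper, not the actual code from the repository):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class PushDiff
    {
        // tracked: path -> hash read from the xml file; local: path -> hash of the working folder.
        public static void Classify(
            IDictionary<string, string> tracked,
            IDictionary<string, string> local)
        {
            var trackedByHash = tracked.ToLookup(kv => kv.Value, kv => kv.Key);

            foreach (var file in local)
            {
                string trackedHash;
                bool samePath = tracked.TryGetValue(file.Key, out trackedHash);

                if (samePath && trackedHash == file.Value)
                    continue;                                    // unchanged: nothing to do
                if (samePath)
                    Console.WriteLine("modified: " + file.Key);  // upload, update hash in xml
                else if (trackedByHash[file.Value].Any())
                    Console.WriteLine("moved:    " + file.Key);  // update path in xml, no upload
                else
                    Console.WriteLine("added:    " + file.Key);  // upload, add entry to xml
            }

            foreach (var entry in tracked)
                if (!local.ContainsKey(entry.Key) && !local.Values.Contains(entry.Value))
                    Console.WriteLine("deleted:  " + entry.Key); // drop entry from xml, keep in S3
        }
    }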

The pull command also compares the list of files from the xml with the list of local files and works quite similarly, just in the opposite direction. For example, when the xml contains an entry for a new file, i.e. there is no match by path or by hash, that file is downloaded from S3 and written locally at the specified path.

An example of hgrc with customized hooks:

    [hooks]
    postupdate = \path\to\assets.exe pull \path\to\assets.config \path\to\checksum.cache
    prepush = \path\to\assets.exe push \path\to\assets.config \path\to\checksum.cache

Hashing


Calls to S3 are minimized: only two operations are used, GetObject and PutObject. A file is uploaded to or downloaded from S3 only if it is new or modified. This is possible because the file hash is used as the key. That is, physically all versions of all files sit in the S3 bucket without any hierarchy, no folders at all. There is an obvious downside: collisions. If two files happen to have the same hash, the content of one of them simply will not be stored in S3.
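Assuming the AWS SDK for .NET with its synchronous client, the two calls might look roughly like this; the wrapper class and method names are invented for the example:

    using Amazon.S3;
    using Amazon.S3.Model;

    static class S3Storage
    {
        // Uploads a local file under its hash; the content hash is the only "path" in the bucket.
        public static void Upload(IAmazonS3 client, string bucket, string hash, string filePath)
        {
            client.PutObject(new PutObjectRequest
            {
                BucketName = bucket,
                Key = hash,
                FilePath = filePath
            });
        }

        // Downloads the object stored under the given hash and writes it to filePath.
        public static void Download(IAmazonS3 client, string bucket, string hash, string filePath)
        {
            using (GetObjectResponse response = client.GetObject(
                new GetObjectRequest { BucketName = bucket, Key = hash }))
            {
                response.WriteResponseStreamToFile(filePath);
            }
        }
    }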

Still, the convenience of using hashes as keys outweighs the potential danger, so I would not want to give them up. It is only necessary to take the probability of collisions into account, reduce it where possible, and make the consequences less severe.

Reducing the probability is very simple: use a hash function with a longer digest. In my implementation I used SHA256, which is more than enough. However, this still does not rule out collisions entirely, so you must be able to detect them before any changes are made.

This is also not difficult. All local files are hashed anyway before the push and pull commands run, so it is enough to check whether any of the hashes coincide. Doing this check during push is sufficient to keep collisions from being committed to the repository. If a collision is detected, the user is shown a message about the problem and asked to change one of the two files and push again. Given the low probability of such situations, this solution is satisfactory.
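Such a check is essentially one query over the freshly computed local hashes. A sketch (illustrative names, not the project's code):

    using System.Collections.Generic;
    using System.Linq;

    static class CollisionCheck
    {
        // localHashes: path -> SHA256 hex string for every monitored file.
        // Groups with more than one path either contain byte-identical files (harmless,
        // they map to the same S3 key anyway) or a genuine collision worth reporting.
        public static List<List<string>> FindMatchingHashes(IDictionary<string, string> localHashes)
        {
            return localHashes
                .GroupBy(kv => kv.Value, kv => kv.Key)
                .Where(g => g.Count() > 1)
                .Select(g => g.ToList())
                .ToList();
        }
    }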

Optimization


There are no strict performance requirements for a program like this; whether it runs for one second or five is not that important. However, there are obvious places that can and should be addressed, and probably the most obvious one is hashing.

The chosen approach assumes that, when executing any of the commands, the hashes of all monitored files have to be computed. This operation can easily take a minute or more if there are several thousand files or their total size exceeds a gigabyte. Spending a full minute computing hashes is unforgivably long.

If you notice that typical repository usage does not involve changing all the files right before a push, the solution becomes obvious: caching. In my implementation I settled on a pipe-delimited file that sits next to the program and contains information about all previously computed hashes:

    <file path>|<hash calculation time>|<hash>

This file is loaded before a command executes, used during processing, then updated and saved after the command finishes. Thus, if the hash of logo.jpg was last computed one day ago and the file itself was last changed three days ago, there is no point in recomputing its hash.
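A sketch of that lookup logic, under the assumption that each cache entry stores the time the hash was computed alongside the hash itself (names are illustrative):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class HashCacheEntry
    {
        public DateTime CalculatedAtUtc;   // when the hash was last computed
        public string Hash;
    }

    static class HashCache
    {
        // Returns the cached hash when the file has not changed since the hash was computed;
        // otherwise recomputes it via the supplied delegate and refreshes the cache entry.
        public static string GetHash(
            IDictionary<string, HashCacheEntry> cache,
            string path,
            Func<string, string> computeHash)
        {
            DateTime lastWrite = File.GetLastWriteTimeUtc(path);

            HashCacheEntry entry;
            if (cache.TryGetValue(path, out entry) && entry.CalculatedAtUtc >= lastWrite)
                return entry.Hash;                 // hashed after the last modification: reuse

            string hash = computeHash(path);
            cache[path] = new HashCacheEntry { CalculatedAtUtc = DateTime.UtcNow, Hash = hash };
            return hash;
        }
    }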

Using a BufferedStream instead of the raw FileStream for reading files, including reading them to compute hashes, can also, with a bit of a stretch, be called an optimization. Tests showed that using a BufferedStream with a 1 megabyte buffer (instead of FileStream's default 8 kilobytes) to hash 10 thousand files with a total size of more than a gigabyte speeds up the process roughly four times on a standard HDD. If there are fewer files and each is larger than a megabyte, the difference is less significant, around 5-10 percent.
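The hashing itself, with the 1 megabyte buffer mentioned above, can be sketched like this:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class Hashing
    {
        // Computes a SHA256 hex string, reading through a 1 MB BufferedStream
        // instead of FileStream's default 8 KB buffering.
        public static string ComputeSha256(string path)
        {
            using (var file = new FileStream(path, FileMode.Open, FileAccess.Read))
            using (var buffered = new BufferedStream(file, 1024 * 1024))
            using (var sha = SHA256.Create())
            {
                byte[] hash = sha.ComputeHash(buffered);
                return BitConverter.ToString(hash).Replace("-", "");
            }
        }
    }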

Amazon S3


Two points are worth clarifying. The most important is probably the cost. As you know, for new users the first year of use is free, as long as you stay within the limits: 5 gigabytes of storage, 20,000 GetObject requests per month and 2,000 PutObject requests per month. If you pay the full price, it comes to about $1 per month. For that you get replication across several data centers within the region and good transfer speeds.

Also, I dare say that from the very beginning the reader has been nagged by a question: why reinvent the wheel when there is Dropbox? The thing is, using Dropbox directly for collaboration is risky: it does not handle conflicts at all.

But what if it is not used directly? In fact, in the described solution Amazon S3 can easily be replaced with Dropbox, SkyDrive, BitTorrent Sync or other analogues. In that case they act as storage for all versions of the files, and hashes are used as file names. In my solution this is implemented via FileSystemRemoteStorage, an analogue of AmazonS3RemoteStorage.

Promised link to source code: bitbucket.org/openminded/assetsmanager

Source: https://habr.com/ru/post/185700/
