
Backing up a large number of small files

Sooner or later, every self-respecting IT specialist faces the task of setting up backups of their work files. After a series of programmer typos and errors, I finally found the time for it.
The specifics of the web application are such that its working directory occupies more than 50 GB on disk and contains about 900 thousand small files (images, thumbnails, and so on), so a head-on approach with tar and its analogues did not work. I also wanted some versioning of the stored data, and full backups would require a large amount of storage for what is essentially the same data with minor changes. On top of that, it would be nice to duplicate the copies to a remote backup server to reduce the risk of losing critical information in a hardware failure. After carefully going through the search results and discarding methods that were clearly unsuitable for me, I settled on the two options most often recommended in the comments under enthusiasts' home-grown shell scripts.


rdiff-backup

rdiff-backup seemed the more suitable and convenient of the two. Written in Python, it stores data incrementally, letting you retrieve the state of a file or directory at any moment in the foreseeable past (limited by how long the increments are kept). Flexible control from the console gives complete freedom of action and full control over the situation. Automating the backups comes down to adding a couple of commands to the scheduler (the second one prunes old increments that, being long superseded, no longer carry any value).
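A minimal crontab sketch of that pair of commands, assuming the data lives in /var/www/project and the backups go to /backup/project (both paths, and the four-week retention, are my assumptions, not the article's):

    # /etc/crontab fragment (hypothetical paths and schedule)
    # Nightly incremental backup at 02:00
    0 2 * * * root rdiff-backup /var/www/project /backup/project
    # Half an hour later, prune increments older than four weeks
    # (--force is needed when more than one increment is removed at once)
    30 2 * * * root rdiff-backup --remove-older-than 4W --force /backup/project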
But testing showed that the utility is very resource-hungry and copes with my task extremely reluctantly. The point is that only a small amount of data changes per day (about 300 MB), but the changes touch around 30 thousand files, and identifying the modified files is apparently where the program spends most of its time. After an hour of watching iowait climb to an indecent 20% during yet another run of the script, I decided to try other software and compare the two.
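That kind of disk load is easy to observe for yourself while the backup runs; a minimal sketch using standard tools (iostat comes from the sysstat package):

    # CPU summary (including %iowait) and per-device stats every 5 seconds
    iostat -x 5
    # Alternatively, watch the 'wa' column
    vmstat 5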

rsnapshot

rsnapshot, written in Perl, is based on rsync. In the program's working directory (let's call it the place where the backups accumulate), it creates a series of folders with an index that grows on each run up to the value specified in the configuration; the outdated copy is then deleted. If you go into any of the created folders, you will find inside a complete copy of the backed-up data. The folder sizes suggest the same (when viewed with standard Midnight Commander tools, for example): the total appears to be the sum of all the folders. In fact, it is not. The program creates hard links between identical data within the working directory, so only the latest copy is truly “heavy”, and the size of each of the others amounts only to the difference in changed data.
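A minimal sketch of /etc/rsnapshot.conf for such a scheme (the paths and retention counts are my assumptions; note that rsnapshot insists on tabs, not spaces, between a directive and its values):

    # Where the indexed snapshot folders accumulate
    snapshot_root	/backup/snapshots/
    # Keep the last 7 daily and 4 weekly copies; the oldest is rotated out
    # (older rsnapshot versions call this directive "interval")
    retain	daily	7
    retain	weekly	4
    # What to back up, and the subdirectory name inside each snapshot
    backup	/var/www/project/	localhost/

The hard-link trick is easy to verify: du -sh on a single snapshot folder reports the full data size, while one du -shc invocation over all of them counts each linked file only once.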
Testing

Since both options need roughly the same amount of storage space, it was time to compare how fast they handle the backup tasks.

For the tests I took a random project folder 11 GB in size, containing 593 subdirectories of various nesting levels and 230,911 files. File sizes range from 4 to 800 KB; as stated above, this is graphic material. Both utilities were tested in turn, with external factors almost completely excluded (no other users, no load, no heavy processes). Using the time utility, I measured the execution time of each test task, and for comparison also timed copying the whole directory with cp.
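A sketch of how such measurements can be taken (the paths are the same hypothetical ones as above):

    # Baseline: a plain recursive copy
    time cp -a /var/www/project /backup/plain-copy
    # One rsnapshot rotation, as configured in rsnapshot.conf
    time rsnapshot daily
    # One rdiff-backup increment
    time rdiff-backup /var/www/project /backup/project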

The first backup: a full copy into the backup location (11090 MB)

                 real          user         sys
    cp           6m30.885s     0m1.068s     0m24.554s
    rsnapshot    7m53.879s     1m57.299s    1m22.441s
    rdiff-backup 10m50.314s    3m26.073s    1m0.928s


Restart (no changes in the folder)

                 real          user         sys
    rsnapshot    0m10.129s     0m4.936s     0m6.708s
    rdiff-backup 1m3.969s      1m0.616s     0m2.048s


One random folder inside the directory is duplicated (the total size grows to 13267 MB)

                 real          user         sys
    rsnapshot    0m31.175s     0m22.001s    0m17.365s
    rdiff-backup 27m53.517s    1m58.819s    0m19.005s


Restart after the directory grew (no changes since the previous run)

                 real          user         sys
    rsnapshot    0m11.477s     0m5.748s     0m7.368s
    rdiff-backup 1m16.366s     1m13.713s    0m1.912s


The duplicated folder is deleted, returning the directory to its original size

                 real          user         sys
    rsnapshot    0m13.885s     0m6.388s     0m9.077s
    rdiff-backup 52m55.794s    2m1.560s     0m21.941s


Control restart without making changes

                 real          user         sys
    rsnapshot    0m11.250s     0m5.132s     0m7.068s
    rdiff-backup 1m2.380s      1m0.088s     0m1.792s


Results

As the comparison tables show, rdiff-backup barely copes with changes to a large number of small files, so it is more profitable to use rsnapshot and not waste most of the server's time on crawling the file system.
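For completeness, a sketch of the cron entries that would drive the rsnapshot configuration above (the schedule is my assumption; the interval names must match the retain lines in rsnapshot.conf, and the larger interval is conventionally run just before the smaller one):

    # /etc/crontab fragment: weekly rotation on Mondays, daily rotation every night
    0  3 * * 1   root  /usr/bin/rsnapshot weekly
    30 3 * * *   root  /usr/bin/rsnapshot daily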
Perhaps these test results will be useful to someone and will save them the time I spent looking for the best way to back up the files described in this article.

Source: https://habr.com/ru/post/143383/

