
Developing a new file format for site backups

I am currently working on a new PHP script that will back up not only the database, but also all the files of the site.

Originally I planned to use one of the common archive formats, and the first that come to mind are ZIP and TAR. There are many ready-made classes for them, and the ZIP extension is even included in the standard PHP distribution. But after studying the format specifications and trying out the ready-made solutions, I became inclined to reinvent the wheel.

I would ask the wheel-haters to refrain from comments in the style of "we have enough reinvented wheels already." In the end, without reinvented wheels there would be no Google, no Google Chrome, no Facebook, no WinRAR and no 7-Zip.

Why not TAR?


TAR is the most common format on Unix systems, and an administrator will, without hesitation, write a file-backup script himself or copy one of the many ready-made scripts from the Internet that use the tar system utility. But TAR was developed a long time ago and, moreover, was originally intended for creating archives on tape. As a result, it has the following drawbacks.

No index
TAR has no directory of the archive's contents. Therefore, to list the contents you have to scan the archive, walking through all the file headers, which can take considerable time on large archives.
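
To make this concrete, here is a minimal sketch of what listing a TAR archive involves (the field offsets come from the TAR header layout; the archive name is hypothetical):

<?php
// Walk the archive header by header -- the only way to list a TAR.
$tar = fopen('backup.tar', 'rb');
while (($header = fread($tar, 512)) !== false && strlen($header) === 512) {
    $name = rtrim(substr($header, 0, 100), "\0");
    if ($name === '') {
        break;                             // zero-filled block ends the archive
    }
    $size = octdec(trim(substr($header, 124, 12)));      // size is octal text
    echo $name, "\t", $size, "\n";
    fseek($tar, (int)ceil($size / 512) * 512, SEEK_CUR); // skip the data blocks
}
fclose($tar);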

No random access (fseek) inside the container
Since TAR itself does not support compression, the archive is usually compressed as a whole with gzip or bzip2. As a result, to get a file from the end of the archive, you essentially have to decompress all the archive data that precedes it.
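
The seek problem, illustrated: gzseek() on a gzip stream opened for reading is only emulated, by decompressing and discarding everything before the target offset (the archive name and offset here are hypothetical):

<?php
$gz = gzopen('backup.tar.gz', 'rb');
gzseek($gz, 500 * 1024 * 1024);  // silently decompresses ~500 MB to get here
$header = gzread($gz, 512);      // only now can we read the wanted TAR header
gzclose($gz);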

No stream writing to the archive
TAR stores the file size in the file header (without it, the next file header cannot be located), so a stream cannot be written directly into the archive, i.e. a not-yet-existing file whose size is unknown in advance. That is why, when backing up MySQL, a separate database dump is made first, and only then is it packed with TAR together with the rest of the files.

No file integrity control
Checksums are not stored for files.

Why not ZIP?


ZIP is the most common format on Windows and, perhaps, in the world as a whole. Like TAR, it dates from the 1980s, but its specification was last updated in 2007. It has what TAR lacks: an index, the ability to navigate the container quickly, file integrity checks, and even stream writing. But it has flaws of its own.

No support for continuous (solid) archives
This badly hurts the compression ratio when there are many small files.

Time is stored in MS-DOS format
This means that dates are stored with 2-second precision, and in the case of PHP it also means redundant conversions from Unix timestamp to MS-DOS format and back.
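
For illustration, here is a sketch of the Unix-timestamp-to-DOS packing that a ZIP writer in PHP has to perform for every file (and reverse on restore); the final shift of the seconds is where the 2-second precision comes from:

<?php
function unixToDosTime($ts) {
    $d = getdate($ts);
    if ($d['year'] < 1980) {               // DOS dates start at 1980-01-01
        return (1 << 21) | (1 << 16);
    }
    return (($d['year'] - 1980) << 25) | ($d['mon'] << 21) | ($d['mday'] << 16)
         | ($d['hours'] << 11) | ($d['minutes'] << 5) | ($d['seconds'] >> 1);
}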

No encryption of file names in the archive
Only the contents of files are encrypted; the headers and the central directory remain in clear text.

In principle, these flaws are much less significant than TAR's, and I was about to settle on ZIP. But the picture was spoiled by the fact that ready-made libraries, including the standard ZipArchive class, do not support some of the advanced features (stream writing, Unix file attributes). So to get them I would have had to write my own ZIP implementation anyway.
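
A quick illustration of the limitation (the entry names and paths are made up): with ZipArchive, content is added either from a file already on disk or from a complete in-memory string, so there is no way to write a stream of unknown length chunk by chunk, and Unix attributes cannot be set in the PHP 5.4-era extension either.

<?php
$zip = new ZipArchive();
$zip->open('backup.zip', ZipArchive::CREATE);
$zip->addFile('/tmp/dump.sql', 'dump.sql'); // requires a temp file on disk
$zip->addFromString('note.txt', $contents); // requires the whole body in RAM
$zip->close();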

In general, I decided to look at other options. Googling turned up documentation on the RAR, 7Z, DAR and ACE formats, and I also poked around in the rather closed CBU and TIB. In the process, the idea of building my own wheel kept creeping in.

A new file format for site backups


So, the new format (named SXB, from Sypex Backup, without much deliberation) is a compilation of ideas spotted in various formats, plus a few of my own.

The main goals set during the development of the SXB format are covered in the sections that follow; I will describe some of the points in more detail.
The development also takes into account PHP's quirks when working with files larger than 2 GB, as the note below illustrates.
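
A note on the 2 GB issue, with a commonly used workaround (this is general PHP knowledge, not SXB-specific code): on 32-bit PHP builds integers are signed 32-bit, so filesize() and ftell() wrap into negative values on large files.

<?php
// Reinterpret the signed result as unsigned -- correct up to 4 GB.
$size = sprintf('%u', filesize($file));
// Beyond that, sizes have to be tracked as floats or strings, and large
// files processed in chunks using relative fseek(..., SEEK_CUR) calls.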

Block data deduplication


Data deduplication is a technique that excludes duplicate data from an archive, reducing its size. There is file-level and block-level deduplication: in the first case, hashes of whole files are compared; in the second, hashes of individual blocks.

Block-level deduplication was chosen for the SXB format, since it is better suited to site backups. For example, it allows constantly growing files, such as logs or database tables, to be backed up incrementally: not in full, but only as the small block of appended data.
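
A minimal sketch of the idea, assuming a fixed block size and md5 as the block hash (both illustrative, not the actual SXB parameters):

<?php
$blockSize = 64 * 1024;
$seen      = array();  // block hash => index of that block in the archive
$blockList = array();  // the file is stored as a sequence of block indexes
$in = fopen($file, 'rb');
while (!feof($in)) {
    $block = fread($in, $blockSize);
    if ($block === false || $block === '') {
        break;
    }
    $hash = md5($block);
    if (!isset($seen[$hash])) {
        $seen[$hash] = writeBlockToArchive($block); // hypothetical helper
    }
    $blockList[] = $seen[$hash]; // a repeated block costs only an index entry
}
fclose($in);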

Saving streams to the archive

This feature is needed so that the MySQL backup can be saved directly into the archive, without creating temporary files. Besides speeding up the process, this reduces the free-space requirements and also allows deduplication to be applied at the level of individual tables rather than the entire SQL file.
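
One possible way to frame a stream of unknown length (the chunked layout here is my illustration, not necessarily the real SXB on-disk format): each chunk is length-prefixed, and a zero length terminates the entry, so no total size is needed up front.

<?php
function writeStreamEntry($archive, $source) {
    while (!feof($source)) {
        $chunk = fread($source, 65536);
        if ($chunk === false || $chunk === '') {
            break;
        }
        fwrite($archive, pack('V', strlen($chunk)) . $chunk); // length + data
    }
    fwrite($archive, pack('V', 0)); // zero-length chunk = end of the entry
}

With framing like this, a database dump can be piped straight into the archive while it is still being generated.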

Additional data types

It is difficult to imagine a modern site that does not use a database. And since a new format was being developed anyway, the idea arose of moving away from the standard MySQL backup into a single SQL file.

So in the SXB format, in addition to files and directories, special data types were added for database backups (table structure, table contents, triggers, functions, etc.). The structure of MySQL objects is stored as the corresponding CREATE statements, and the contents of tables as text with tabs as separators (this format is more compact and allows data to be dumped and restored much faster).
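
A sketch of such a tab-separated table dump (mysqli, the table name, and the $out handle are assumptions; escaping follows the MySQL LOAD DATA conventions so tabs and newlines inside values keep each row on one line):

<?php
$res = $db->query('SELECT * FROM `posts`', MYSQLI_USE_RESULT); // unbuffered
while ($row = $res->fetch_row()) {
    foreach ($row as $i => $v) {
        $row[$i] = ($v === null) ? '\N' : addcslashes($v, "\\\t\n\r");
    }
    fwrite($out, implode("\t", $row) . "\n");
}
$res->free();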

I will round off here, as I have already written quite a lot; if there is interest, I can add more technical details.

And now, about what this article was actually written for.


I wanted to test deduplication on real sites. Of the large sites at hand, one runs on a custom engine (206 MB) and the other on vBulletin (934 MB); deduplication showed very encouraging results.

WinRAR was run in solid-archive mode (the file is produced faster and is more compact). For 7-Zip, both the old LZMA algorithm and the new LZMA2, which loads all 4 processor cores, were tested. PHP 5.4.10 under Windows 7 x64 was used.

Yes, of course, at maximum compression WinRAR and 7-Zip show a higher compression ratio, but they also eat gigabytes of memory and fully load all 4 cores for a considerable time. In principle, competing with them on compression ratio is not the point; the task here is somewhat different. But the fact that a PHP script can show results comparable to multithreaded programs is encouraging.

Those who want to help with development, or who are simply curious how many duplicates there are on their own site, are welcome to test a special simplified PHP script.

I am interested in the amount of data excluded by deduplication, the resulting sizes, and preferably a comparison with other archivers (size and time, in the fastest mode). The script outputs info like the following:
dirs 162 files 6811 full_size 206.133 dup_size 28.721 archive_size 152.277 duplicates 3119 bigfiles 477 bf_blocks 5220 bf_dup 316 bf_size 154.426 bf_dup_size 8.027 time 13.53622 

And finally, for reference, links to the format specifications.
ZIP File Format Specification
TAR File Format Specification
RAR File Format Specification
7Z File Format Specification
DAR File Format Specification
ACE File Format Specification

Source: https://habr.com/ru/post/165947/

