This is a continuation of the story about the adventures of a single task at ISPsystem, told by Alexander Bryukhanov, head of development. The first part is here.
The best is the enemy of the good
Writing backup code, like installing and configuring software, has always been a thankless task for us. When you install anything from repositories, you can never be completely sure of the result. And even if you do everything perfectly, the maintainers will break something sooner or later. As for backups: people remember them only when problems arise. Everyone is already on edge, and if something else goes not the way they expected ... well, you understand.
There are quite a few approaches to backup, but they all share one goal: to make the process as fast as possible and, at the same time, as cheap as possible.
Attempt to please everyone
It was 2011. Full server backups had long since sunk into oblivion. Backups of virtual servers, though, were made then and are still made now. For example, at WHD.moscow I was told about a truly elegant way of backing up virtual servers through live migration. Still, it does not happen as massively now as it did 10-15 years ago.
We started developing the fifth version of our products on top of our own framework, which implemented a powerful system of events and internal calls.
We decided to implement a truly flexible and versatile approach to backup configuration, so that users could customize the schedule, select the type and contents of backups, and put them into different storages. On top of that, we planned to share this solution across several products.
Besides, backup goals can differ significantly: some people make backups to protect against hardware failure, others insure themselves against data loss caused by an administrator's mistake. Naively, we wanted to please everyone.
From the outside, our attempt to build a flexible system looked like this:
With the toe of your right foot, push the first cigarette butt. We added custom storages. Really, what could be the problem with uploading ready-made archives to two places? In fact, there is one: if the archive cannot be uploaded to one of the storages, can the backup be considered successful?
The second butt you push with the toe of your left foot. We argued endlessly while implementing archive encryption. It seems simple, until you think about what should happen when the user wants to change the password.
And now push both butts together!
Why am I telling this? Insane flexibility generated an infinite number of usage scenarios, and it became almost impossible to test them all. So we decided to take the path of simplification. Why ask the user whether he wants to save metadata if it only takes a few kilobytes? And do you really care which archiver we use?
Another funny mistake: one user limited the backup window to between 4:00 and 8:00. The problem was that the process itself was launched by the scheduler daily at 3:00 (the standard @daily setting). The process started, determined that it was not allowed to run at that time, and quit. No backups were ever made.

Reinventing the wheel: dar
In the mid-2010s, the hype around clusters began to grow, followed by clouds. There was a trend: let's manage not a single server but a group of servers, and call it a cloud :) This affected ISPmanager as well.
And since we now had many servers, the idea of offloading data compression to a separate server was revived. As many years ago, we first tried to find a ready-made solution. Oddly enough, bacula was still alive, but just as complex as ever: to manage it, you would practically have to write a separate control panel. And then dar caught my eye, implementing many of the ideas that had once gone into ispbackup. It seemed like happiness! But no: experience shows there is no ideal solution that lets you manage the backup process exactly the way we would like.
In 2014, a solution based on dar was written. But it had two serious problems: first, dar archives can be unpacked only by the original archiver (that is, by dar itself); second, dar builds the file listing in memory in, heaven help us, XML format.
It is thanks to this utility that I learned that if you allocate memory in C in small blocks (on CentOS 7, a block smaller than about 120 bytes), it is impossible to return it to the system without terminating the process.
But otherwise I liked it a lot. So in 2015 we decided to write our own dar: the isptar bicycle. As you probably guessed, the tar.gz format was chosen, since it is fairly easy to implement, and I had already figured out all sorts of PAX headers back when I wrote ispbackup.
I must say that the documentation on this subject is rather thin, so at the time I had to spend a while studying how tar handles long file names and large file sizes, the restrictions on which were baked into the tar format from the start: 100 bytes for the file name, 155 for the directory prefix, 12 bytes for the octal notation of the file size, and so on. Well, yes, 640 kilobytes ought to be enough for everyone! Ha! Ha! Ha!

That left several problems to solve. The first: a fast file listing without unpacking the archive completely. The second: the ability to extract an arbitrary file, again without full unpacking. The third: the result should still be a tgz that any archiver can unpack. We solved each of these problems!
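Those ustar limits are easy to see with Python's tarfile module (a quick illustration of the format restrictions, not ISPsystem code):

```python
import io
import tarfile

# 150 characters with no slash: cannot be split into ustar's
# 155-byte prefix field + 100-byte name field
long_name = "x" * 150

# Plain ustar refuses such a name outright
try:
    with tarfile.open(fileobj=io.BytesIO(), mode="w",
                      format=tarfile.USTAR_FORMAT) as tf:
        tf.addfile(tarfile.TarInfo(long_name), io.BytesIO(b""))
except ValueError as err:
    print("ustar:", err)

# PAX extended headers lift the name and size limits
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    tf.addfile(tarfile.TarInfo(long_name), io.BytesIO(b""))

# The archive round-trips with the full name intact
with tarfile.open(fileobj=io.BytesIO(buf.getvalue())) as tf:
    assert tf.getnames() == [long_name]
```

The PAX trick is exactly what it sounds like: a fake extra member carrying `path=` and `size=` records precedes the real header, so old extractors still see a valid (if oddly named) tar.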
How do you start unpacking an archive from a given offset? It turns out that gz streams can simply be glued together! A one-liner proves it:
cat 1.gz 2.gz | gunzip -
You get the glued contents of both files without any errors. So if each file is written to the archive as a separate stream, the problem is solved. Of course, this reduces the compression ratio, but not very significantly.
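The same trick, sketched with Python's gzip module (an illustration of the idea, not isptar's actual code): each file becomes its own gz member, the members are concatenated, and any member can be unpacked alone if you know its offset.

```python
import gzip
import io

# Compress two "files" as independent gz streams and glue them together
part1 = gzip.compress(b"contents of file one\n")
part2 = gzip.compress(b"contents of file two\n")
archive = part1 + part2

# A stock decompressor reads the concatenation end to end without errors,
# just like `cat 1.gz 2.gz | gunzip -`
assert gzip.decompress(archive) == (b"contents of file one\n"
                                    b"contents of file two\n")

# Knowing the offset of the second stream, we can unpack just that member
offset = len(part1)
member = gzip.GzipFile(fileobj=io.BytesIO(archive[offset:])).read()
assert member == b"contents of file two\n"
```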
Getting a listing is even easier. Let's put the listing at the end of the archive as a regular file, and in the listing also record each file's offset in the archive (by the way, dar also stores its listing at the end of the archive).
Why at the end? When you make a backup hundreds of gigabytes in size, you may not have enough space to store the entire archive locally. So, as you create it, you upload it to the storage in parts. The great thing is that to extract a single file, you only need the listing and the part that contains the data.
Only one problem remains: how do you find the offset of the listing itself? To solve it, at the end of the listing I added service information about the archive, including the packed size of the listing, and at the very end of the service information, as a separate gz stream, the packed size of the service information itself (that is just a couple of numbers). To get a quick listing, we read the last few bytes and unpack them. Then we read the service information (we now know its offset relative to the end of the file), and then the listing itself (whose offset we take from the service information).
A simple listing example. In the original post, the individual gz streams are highlighted in different colors. First we unpack the trailing stream (just by analyzing the last 20-40 bytes). Then we unpack the 68 bytes containing the packed service information. And finally, we unpack another 6247 bytes to read the listing itself, whose actual size is 33522 bytes.
etc/.billmgr-backup root#0 root#0 488 dir
etc/.billmgr-backup/.backups_cleancache root#0 root#0 420 file 1487234390 0
etc/.billmgr-backup/.backups_imported root#0 root#0 420 file 1488512406 92 0:1:165:0
etc/.billmgr-backup/backups root#0 root#0 488 dir
etc/.billmgr-backup/plans root#0 root#0 488 dir
…
listing_header=512
listing_real_size=33522
listing_size=6247
header_size=68
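The whole scheme can be modelled in a few lines of Python. This is a toy model: the member layout and the single size field in the trailer are my simplifying assumptions, not the real isptar format, which also stores the fields shown above (listing_real_size, listing_size, header_size).

```python
import gzip
import io

def build_archive(files, listing):
    """Data members, then the packed listing, then a tiny final gz stream
    holding the packed size of the listing (the 'service information')."""
    body = b"".join(gzip.compress(f) for f in files)
    listing_gz = gzip.compress(listing.encode())
    trailer = gzip.compress(str(len(listing_gz)).encode())
    return body + listing_gz + trailer

def read_listing(archive, tail=64):
    """Recover the listing by reading only the end of the archive."""
    end = archive[-tail:]
    # Find the last gz stream in the tail: scan backwards for the gzip
    # magic bytes and take the first offset that decompresses cleanly.
    for i in range(len(end) - 2, -1, -1):
        if end[i:i + 2] == b"\x1f\x8b":
            try:
                trailer = gzip.decompress(end[i:])
            except Exception:
                continue  # false positive inside compressed data
            trailer_size = len(end) - i
            listing_size = int(trailer)
            listing_gz = archive[-(trailer_size + listing_size):-trailer_size]
            return gzip.decompress(listing_gz).decode()
    raise ValueError("no trailer stream found")

archive = build_archive([b"aaa", b"bbbb"],
                        "etc/passwd root#0 root#0 420 file\n")
assert read_listing(archive) == "etc/passwd root#0 root#0 420 file\n"
```

The recovery order is the same as in the article: unpack the last bytes first, learn the size of the service block, then jump straight to the listing without touching the data members at all.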
It sounds a bit confusing; I even had to look into the source code to remember how I did it myself. You can also take a look at the isptar source, which, like the ispbackup source, I have put on github.
Well, of course, the story does not end there. You can watch forever as fire burns, as a woman parks a car, and as people try to defeat some crutches with the help of others.