
Experience in building and operating large file storage


Daniil Podolsky (Git in Sky)


A story about the thing every engineer has to do in his life after he has fathered a child, planted a tree and built a house: make his own file storage.

My talk is called "Experience in building and operating a large file storage". We have been building and operating a large file storage for the last three years. Back when I was submitting the abstract, the talk was called "At night through the forest. Experience in building and operating blah blah blah". The program committee asked me to be more serious, but in essence this is still the talk "At night through the forest".


This image was invented by my colleague Chistyakov for the "Strike" conference, where we both gave "At night through the forest" talks: he about Docker, and I about the new SQL databases.

Ever since I started attending conferences, it has been clear to me that what I want to hear is not success stories but horror stories, the nightmares that await all of us on the road to success. Someone else's success gives me nothing. Of course, if someone has already done a thing, the very knowledge that it can be done helps me move forward, but what I really want to know is where the mines and the traps are.

Another image I love is the book "Roadside Picnic". There is a research institute that studies the Zone: they have flying bots, automatic markers, robotic systems, this and that; and there are stalkers who just wander around the Zone, throwing nuts ahead of themselves. It so happens that we work in the segment where stalkers are in demand, not the research institute. We work in the segment where highload matters most. We work in the segment for the poor. Highload is, in fact, for the poor, because the grown-up guys simply do not let their servers go above 30% load. If you have decided that your high watermark is 70%, then, firstly, you have a highload on your hands, and secondly, you are poor.



So. To begin with: what is a file storage, and why does it appear in our lives at all?


A file is a piece of information (that is its official definition) supplied with a name by which this piece of information can be retrieved. But it is not the only named piece of information in the world, so what makes the file different from all the others? The file is too large to be treated as a single piece. Look: suppose you want to support, say, 100 thousand simultaneous connections (that is not so many) and you serve files 1 MB in size. If you treat a file as one piece of information, you have to load all 100 thousand 1 MB files into memory, which is 100 GB of memory. Impossible. So you have to do something: either limit the number of simultaneous connections (for corporate applications that is fine), or treat a file as consisting of separate small pieces of information, of chunks. The word chunk will be used throughout the rest of this talk.
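To make the chunk idea concrete, here is a minimal sketch (my own, not from the talk; the 64 KB chunk size and function name are assumptions) of streaming a file piece by piece instead of holding it in memory whole:

```python
# Minimal illustration: stream a file in fixed-size chunks so memory
# use stays bounded no matter how many connections are open.
CHUNK_SIZE = 64 * 1024  # assumed chunk size

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the file piece by piece instead of reading it as one blob."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Usage: feed a response body from the generator, so 100 thousand
# concurrent 1 MB downloads do not need 100 GB of RAM at once.
for chunk in iter_chunks("example.bin"):
    pass  # write the chunk to the socket / response
```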

"The cornerstone of a healthy diet." That part was crossed out on the slide. The file is the cornerstone of today's information exchange, everyone knows that. We treat everything as a file simply out of habit, because until recently we had no means of storing information other than files on disk. I will come back later to why this approach works poorly today and why it would be better to abandon it, although so far nobody has managed to.

A file storage is the place where files are stored. Actually, something else may be even more important than the fact that files are stored there: it is the place the files are accessed from, the place they are served from.



We understand what a file storage is. Now, what is a large file storage? Operating a large file storage, I found that "large" is not a characteristic of the storage itself. For example, Vasya Pupkin has a 5 PB archive of teenage movies. Is that a large file storage? No, because nobody needs it: Vasya Pupkin cannot watch all 5 PB at once, he watches one little movie at a time. There are a few more characteristics of that storage, for example, if he loses it, he will cry and download it all over again from the Internet.

Lots of files. Can we assume that if a lot of bytes does not make a storage large, then a lot of files does? No. There are storages that hold a huge number of records; we have databases with a billion rows that are nevertheless not large. Why? Because if you have a billion rows in a database, you have convenient and reliable tools for managing those rows. For files there is no such tool. All of the file systems in active use today are hierarchical. That means that to find out what is going on in our file system we have to walk it, open every directory there is and read it. Sometimes there is no index on a directory, in fact usually there is not, so we have to read it from beginning to end to find the file we need; everyone can picture that perfectly well.
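As a rough illustration (my own sketch, not from the talk), even just counting what lives in a hierarchical tree means opening and reading every directory:

```python
import os

def count_files(root):
    """Walk a hierarchical tree: every directory has to be opened and
    read in full, because there is no index to consult instead."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

# On hundreds of millions of files this full traversal takes hours,
# which is exactly how "we do not know what is on our disks" happens.
print(count_files("/var/storage"))
```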

So "large" describes the situation you have ended up in with your file storage, not the storage itself. Very often, and this is my favorite trick, you can turn a large file storage into a normal one simply by moving it to SSD. SSDs give far more IOPS, and the standard information management tools used on file systems start working fast enough that managing such a storage stops being a problem.



The paradox of file storage. From the business point of view, these file storages are not really needed. When you have a fairly large project that accepts files from users, serves files to users and shows users ads, everything is more or less clear. But why does about half of the project budget go to some vague hardware with some vague bytes lying on it? Business, in principle, understands why it is there, yet in essence it does not need file storage. Business requirements will never say "store files". Business requirements will say "serve files". There is one spec that does say "store files": the spec for the backup system. But that is a lie too. When we want a backup system, we do not really want a backup system, we want a disaster recovery system. That is, once again we want to read files, not store them.

Unfortunately, the creators of file storages do not understand this. Maybe only the creators of S3 have thought this simple thing through. Everyone else cares about the files being tucked away safely, about them never getting destroyed; and if the danger of destruction arises, all activity should stop, a file that might be corrupted must not be served under any circumstances, and no new files must be accepted while the existing information is in danger.

That is the traditional approach, but it has nothing to do with what we do. So a file storage is a thing that is both necessary and not: the business does not need it as such, and yet there is no way to do without it, because the files have to live somewhere for you to be able to serve them.



The main source of my experience with file storage is the project Setup.Ru. Setup.Ru is a mass hosting service with some bells and whistles, where sites are generated from templates. There is a template, the user fills it in, clicks "generate", 20 to 200 files are generated, and they all go into the storage. Users also upload pictures and various other binary files. In general, it is an inexhaustible source of all this. At the moment the Setup storage holds 450 million files spread across 1.5 million sites. That is quite a lot. How we came to such a life, how we moved toward what we have there now, is the main content of my talk. 20 million files a day is the volume of Setup updates at today's peak. 20 million by itself is already a lot. We first ran into problems when we had 6.5 million files, but still.



In 2012 the Setup file storage was organized very simply. The content generators published to two servers to ensure fault tolerance. Synchronization: if one of those servers died, we took another identical one, copied the whole mass of files from one to the other with rsync, and everything was fine. For hot content we used SSDs, already in Hetzner back then. All of this is in Hetzner, sorry. That is the point about us being stalkers: Hetzner is such an absurd place, such a Zone, from which we haul out witch's jelly from time to time, sell it on the black market and live off it.

What problems did we see with this scheme back in 2012? When one of the paired servers died, we lived without a backup for a while, and while rsync was running we had to pray and shake with fear. File system statistics were another problem. For the SSDs we could still collect them, 60 GB each (Hetzner only had 64 GB SSDs back then and nothing else), but for the HDDs we realized very quickly, by spring 2012, that we did not know what was on our disks and never would. At the time we did not think that would become a problem.



In the summer of 2012 our next server died. With a practiced motion we ordered a new one, launched rsync, and it never finished. More precisely, the script that ran rsync in a loop until it decided that all files had been copied never finished. It turned out that there were already quite a lot of files, that traversing the tree took six hours, and that copying the files took another 12 on top of those six. And in 18 hours the content had changed so much that our replica was no longer relevant. "At night through the forest": nobody expected that branch to whack us in the eye. Now it is obvious, but back then we were very surprised: how can that be?
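The loop in question looked roughly like this (a hypothetical reconstruction; the real script, hostnames and paths are not in the talk): repeat rsync until a pass copies nothing, which on a fast-changing tree may simply never happen.

```python
import subprocess

# Hypothetical reconstruction of the "rsync in a loop until it
# converges" script mentioned above.
SRC = "oldserver:/storage/"   # assumed source and destination
DST = "/storage/"

while True:
    result = subprocess.run(
        ["rsync", "-a", "--itemize-changes", SRC, DST],
        capture_output=True, text=True, check=True,
    )
    # An empty itemized-changes list means nothing was copied in this
    # pass, i.e. the replica has finally converged. With six hours per
    # tree walk and content changing faster than it copies, this exit
    # condition may never be reached.
    if not result.stdout.strip():
        break
```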

Then your humble servant came up with the idea of putting the files into a database. Where did this stupid idea come from? From the statistics we had, after all, collected: 95% of the files were smaller than 64 KB. 64 KB, even at 100 thousand simultaneous connections, is a perfectly liftable size to treat as a single piece. So the files were tucked into the database, and the large files were tucked into BLOBs. Later I will get to the fact that this was the main mistake of that decision, and say why.
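A minimal sketch of the idea, assuming Postgres and psycopg2; the schema, table and column names are my own invention, not the real ones from the project:

```python
import psycopg2

# Hypothetical schema: small files live inline as bytea, large files
# as Postgres large objects (BLOBs) referenced by oid.
conn = psycopg2.connect("dbname=storage")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS files (
            id         bigserial PRIMARY KEY,  -- auto-increment counter
            path       text   NOT NULL,
            version    bigint NOT NULL,
            small_body bytea,   -- set when the file is under the threshold
            blob_oid   oid      -- set for large files
        )
    """)

def store(conn, path, version, data, threshold=64 * 1024):
    """Store a small file inline, a large one as a large object."""
    with conn, conn.cursor() as cur:
        if len(data) <= threshold:
            cur.execute(
                "INSERT INTO files (path, version, small_body)"
                " VALUES (%s, %s, %s)",
                (path, version, psycopg2.Binary(data)),
            )
        else:
            lobj = conn.lobject(0, "wb")  # create a new large object
            lobj.write(data)
            cur.execute(
                "INSERT INTO files (path, version, blob_oid)"
                " VALUES (%s, %s, %s)",
                (path, version, lobj.oid),
            )
```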

It was all done in Postgres. Back then we believed in master-master replication, and Postgres has no master-master replication (nor, in fact, does any other DBMS), so we wrote our own, which took into account the peculiarities of our content and of its updates, and could function normally.

And in the spring of 2013 we finally ran into the problems of what we had written.



By that time there were 25 million files. It turned out that with the volume of updates the system was then handling, transactions take considerable time. So some shorter transactions managed to finish earlier than longer ones that had started before them. As a result, the auto-increment counter that our master-master replication relied on ended up with holes, i.e. some files our replication never saw at all. That was a big surprise for me personally. I drank for three days.

Then I finally came up with the idea that every time we start the master-master replication we should step back from the last counter value. At first I had to step back 1000, then 2000, then 10,000. When I entered 25 thousand into that field, I realized something had to be done, though at the time I did not know what. We had run into the same problem again: the content was changing faster than we could synchronize it.
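A schematic illustration of both the problem and the workaround (my own sketch; the table, column names and poller are hypothetical):

```python
# Sequence values are assigned at INSERT time, but rows only become
# visible at COMMIT time, so ids can appear out of order:
#   tx A: INSERT ... gets id = 100 (long transaction, commits late)
#   tx B: INSERT ... gets id = 101 (short transaction, commits first)
# A poller that remembers last_seen = 101 will never replicate id 100.

def fetch_new_rows(cur, last_seen, safety_margin=25_000):
    """Re-read a window below the counter (the 'step back' from the
    talk) so late-committing rows get picked up on a later pass."""
    cur.execute(
        "SELECT id, path FROM files WHERE id > %s ORDER BY id",
        (last_seen - safety_margin,),
    )
    return cur.fetchall()
```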

It turned out that this master-master replication of ours works rather slowly. Actually, not the replication itself: insertion into Postgres is slow, and insertion into BLOBs is especially slow. So at night, when the number of publications dropped sharply, the database would converge. But during the day it was always slightly inconsistent, just a little. Our users noticed it like this: they upload a picture, they want to see it right away, and it is not there, because they uploaded it to one server while round-robin sends their request to the other. So we had to teach our servers, change the business logic, to fetch a picture from the same server it had been uploaded to. That later came back to bite us when a server died: when a server is dead, you have to go to wherever that routing lives and change its parameters.



By the autumn of 2013 there were 50 million files. And here we ran into the fact that our database simply could not cope, because serving the user the latest version of the file he had uploaded required a rather heavy join, and our eight cores were no longer enough. We did not know what to do about it, but colleague Chistyakov found a solution: he built us a materialized view on triggers, i.e. on request a file moved from the view with the long join into a separate table, from which it was then served. It is clear how that was arranged, and it all started working perfectly again. The master-master replication was already working very poorly by then, but we wedged this crutch in, and everything seemed fine. Until the spring of 2014.
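A rough sketch of the technique (my own, not the project's actual code; the table and column names are the hypothetical ones from the earlier sketch, and ON CONFLICT needs Postgres 9.5+, newer than what was available in 2013): the expensive work is paid once per change, and reads hit a plain flat table.

```python
import psycopg2

# Trigger-maintained "materialized view": keep a flat table with the
# latest version of every path, refreshed on each insert/update.
DDL = """
CREATE TABLE IF NOT EXISTS latest_files (
    path    text PRIMARY KEY,
    body    bytea,
    version bigint NOT NULL
);

CREATE OR REPLACE FUNCTION refresh_latest() RETURNS trigger AS $$
BEGIN
    INSERT INTO latest_files (path, body, version)
    VALUES (NEW.path, NEW.small_body, NEW.version)
    ON CONFLICT (path) DO UPDATE
        SET body = EXCLUDED.body, version = EXCLUDED.version
        WHERE latest_files.version < EXCLUDED.version;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS files_refresh_latest ON files;
CREATE TRIGGER files_refresh_latest
    AFTER INSERT OR UPDATE ON files
    FOR EACH ROW EXECUTE PROCEDURE refresh_latest();
"""

with psycopg2.connect("dbname=storage") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```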



120 million files. The content no longer fit on one machine in Hetzner. We could not get machines with more than four 3 TB drives. So the following was invented: small files stayed in Postgres, and large files moved to leofs. It immediately turned out that leofs is a rather slow storage. We then tried various other cluster storages, and they are all quite slow. But it also turned out that none of them are transactional. Surprise, right?

A transactional file storage is needed in order to switch an entire site at once. If a user has published a new site, he does not want it to change file by file over half an hour. Even if publishing takes half an hour, the user wants the new site to appear as a whole.

None of the storages we tested gave us that functionality. We could have implemented it on POSIX-compatible cluster file systems using symlinks, the way it is done on an ordinary file system, but it turned out that none of them gave us enough IOPS to serve our 500 requests per second. So small files stayed in Postgres, and instead of large files in BLOBs there were now links to objects in leofs. And everything was good again. Now it is clear that we had wedged one more crutch under our system, but for a while everything was fine.
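For reference, the symlink trick mentioned above: on an ordinary POSIX file system a whole site can be switched atomically by repointing one symlink. A minimal sketch, with hypothetical paths:

```python
import os

def publish_site(site_root, new_version_dir):
    """Atomically switch a site to a freshly generated directory by
    replacing the 'current' symlink via rename(2), which is atomic."""
    tmp_link = os.path.join(site_root, "current.tmp")
    cur_link = os.path.join(site_root, "current")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_version_dir, tmp_link)  # point a temp symlink at the new content
    os.rename(tmp_link, cur_link)          # readers see the old site or the new one,
                                           # never a half-published mix

# publish_site("/srv/sites/example.com", "/srv/sites/example.com/v42")
```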



Early 2015. leofs ran out of space. There were already 400 million files; apparently, our users keep picking up the pace. Space on leofs was running out, so we decided to add a couple of nodes, and we added them. Rebalancing ran for quite a while, then it sort of finished, and it turned out that the master node now did not know where its files were.

Well, we switched those two new nodes off. It turned out that now it knew nothing at all. We connected the two new nodes back, and it turned out that now rebalancing would not run at all. That is how it goes.

The Setup technical team hired a Japanese interpreter and we called Tokyo. At 8 am. To talk to Rakuten. Rakuten talked to us for an hour, and forgive me, I am going to badmouth them, because I have never been treated like that in my life, and I have been in this business for 20 years... Rakuten talked to us for an hour, listened to our problems attentively, read our bug reports in their tracker, and then said: "You know, we have run out of time, thanks, sorry." And here we realized that the furry arctic animal had arrived, because we have 400 million files, we have (with replication factor 3 taken into account) 20 TB of content or more. Actually that is not entirely accurate, the real content is 8 TB. Well, what do you do?

At first we panicked, tried to hang metrics on the leofs code, tried to understand what was going on; colleague Chistyakov did not sleep at night. We tried to find out whether any of the St. Petersburg Erlangists would be willing to poke it with a stick. The St. Petersburg Erlangists turned out to be reasonable people, and they did not poke it with a stick.

Then the idea arose: fine, so be it, but there are the grown-up guys, and the grown-up guys have storage systems. We take a storage system, put all our files on it and behave like grown-ups. We agreed on a budget. We went to Hetzner. They would not give us a storage system. So instead we figured we would take a nine-disk machine, bring up iSCSI on it under FreeBSD, over the internal gigabit network, export it all to our serving servers, and bring back the old Postgres-based storage as it was. We even reckoned that with 128 GB of RAM we would scrape by for another six months. It turned out that those nine-disk machines could not be connected to our cluster over the internal network, because they sit in completely different racks. "At night through the forest": nobody expected that pit to open under our feet. And nobody expected that we were still in the Zone. That is the next trick. In short, over the shared Hetzner network the idea of a quasi-SAN over iSCSI did not fly: the latency was too high and our cluster file system kept falling apart.



Spring 2015. 450 million files. leofs ran out of space completely, i.e. entirely. Rakuten would not even talk to us anymore. Thanks to them. But the real horror, the worst thing, happened to us in the spring of 2015. Everything was fine yesterday, and today disk saturation on the Postgres server rose to 70% and did not go back down. A week later it rose to 80%, and by the time we switched to our latest option it sat between 95 and 99%. That means, first, that the disks will soon die, and second, that both serving speed and publishing speed are in very bad shape. And here we understood that we had to take a step forward. The situation was such that we would be kicked off this project anyway, and after that the project would be closed, because those 450 million files are the whole project. In short, we realized we had to go for broke.

I will talk a little later about the solution we designed in the spring of 2015 and implemented, the one that is running now. In the meantime: what have we learned from operating all of this?



First, fault tolerance. Fault tolerance is not about storing files. Nobody cares that you have securely stashed away 450 million user files if users cannot access them.

The file cache. I used to treat it with a bit of contempt, until I found out that under a really high load there is no way around it: you have to keep the hot set in memory. And the hot set is not that big. In practice the hot set is about 70 thousand sites and about 10 million files. So if it were possible to get a cheap machine with 1 TB of RAM in Hetzner, we would put all of that in memory and once again have no problems; that is, if we were grown-up boys and not stalkers from the dump.
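A toy sketch of the idea (mine, not from the talk): keep the hot set in an in-memory LRU cache in front of the slow storage, evicting by total size.

```python
from collections import OrderedDict

class HotSetCache:
    """Tiny LRU cache: keep the most recently served files in RAM so
    the slow backing storage only ever sees misses."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = OrderedDict()  # path -> file contents

    def get(self, path, load_from_storage):
        if path in self.items:
            self.items.move_to_end(path)    # mark as recently used
            return self.items[path]
        data = load_from_storage(path)      # miss: hit the slow storage
        self.items[path] = data
        self.used += len(data)
        while self.used > self.max_bytes:   # evict least recently used
            _, evicted = self.items.popitem(last=False)
            self.used -= len(evicted)
        return data
```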

Distributed storage systems. If you have the money for a big storage system that synchronizes itself with another such system... When I last checked, the chassis alone cost about 50 thousand euros, never mind the disks you put into it. Maybe it has become more expensive since, maybe cheaper; there are plenty of players in this market: IBM, Hewlett-Packard, EMC... We, however, tried several cheap solutions: Ceph, leofs and a couple of other names I no longer remember.

If your storage is declared as eventually consistent, it means you are in trouble.






PostgreSQL BLOBs.



When we started messing with all this, we knew that the grown-up boys like VKontakte never delete files, they only mark them as deleted. We thought we did not even need to mark them as deleted, everything would be fine as it was. Until Roskomnadzor came to us and asked us to remove this, this and that. We deleted the site; deleting a whole site was something we had provided for. But it turned out that direct links to the infringing images were sitting in search engines, and then the real nightmare began.

And again, greetings to Rakuten: Java is better than Erlang. Not in the sense that Java itself is better than Erlang, but in the sense that Java programmers are not as smart, so Java programmers would have climbed into this cesspool and, maybe, fixed leofs for us.





Source: https://habr.com/ru/post/313364/

