Binary (file) storage: a terrible tale with a gloomy end
Daniil Podolsky (Git in Sky)
My talk is called "Binary, a.k.a. file, storage", but in fact we are dealing with a scary tale. The problem (and this is the thesis of my talk) is that today there is no good, or even merely acceptable, file storage system.
What is a file? A file is a named piece of data. What matters here? Why is a file not simply a row in a database?
A file is too large to be treated as a single piece. Why? Since this is a HighLoad conference, suppose you have a service that holds 100 thousand simultaneous connections. That is not so many, but if we serve a 1 MB file on each of those connections, we need roughly 100 GB of memory just for the buffers.
We cannot afford that. Usually, anyway. So we have to split these pieces of data into smaller pieces, called chunks, and sometimes blocks, and handle the blocks rather than the whole file. That gives us another level of addressing, and a little later it will become clear why this affects the quality of binary storage so much.
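As a rough illustration of chunking, here is a minimal sketch that splits a file into fixed-size pieces identified by a content hash; the chunk size and the hashing scheme are assumptions for the example, not taken from any particular storage system.

```python
# A minimal sketch of splitting a file into fixed-size chunks so it never has
# to be held in memory as one piece. Chunk size and chunk IDs are illustrative.
import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KB per chunk, a plausible order of magnitude

def split_into_chunks(path):
    """Yield (chunk_id, data) pairs; chunk_id is just a content hash here."""
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield hashlib.sha256(data).hexdigest(), data

# Streaming a 1 MB file to a client then needs only one 64 KB buffer at a time
# instead of the whole megabyte per connection.
```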
Nevertheless, even though there are no good binary storage systems in the world, file exchange is the cornerstone of modern mass data exchange. Anything transmitted over the Internet takes the shape of a file, from HTML pages to streaming video: on the far side there is a file with the streaming video, and on this side there is one too. Often not on this side, actually, but it is still a file.
Why is that? Why do we base mass data exchange on file exchange? Because 10 years ago the channels were slow. 10 years ago we could not afford JSON interfaces; we could not query the server all the time, the latency was too high. We had to download all the data needed to show something to the user first, and only then let the user interact with it, because otherwise everything lagged unbearably.
File storage. Actually, the term "storage" is extremely unfortunate, because these systems should really be called "serving systems": their job is to give files out, not to keep them.
Even 20-30 years ago they really were storage: you processed the data, wrote it to tape, and took the tape to the archive. That is storage. Nobody needs that today. Today, if you have 450 million files, they all have to be hot. If you have 20 TB of data, any single byte of those 20 TB will certainly be needed in the very near future by one of your users. Unless you work in a corporate environment, but the word HighLoad is rarely applied to corporate environments anyway.
Business requirements never say "store files", even when that is what is written. What is usually called a backup system is not a backup system, it is a disaster recovery system. Nobody needs to store files; everybody needs to read files. This is important.
Why do files still have to be stored at all? Because they have to be served, and to serve them you have to have them. And even that is not always the case: many projects, for example, do not store HTML pages but generate them on the fly, because storing a huge number of pages is a problem, while generating a huge number of HTML pages is not, it is a task that scales well.
File systems come in two kinds: old-style and journaling. What is an old-style file system? We write data to it, and it puts the data in its proper place more or less immediately.
What is a journaling file system? It is a file system that writes data to disk in two stages: first we write the data to a special area of the disk called the journal, and then, when there is free time or the journal fills up, we move the data from the journal to where it should actually live on the file system. This was invented to speed up startup: we know our file system is always consistent, so after an unclean restart of, say, a server, we do not have to check the whole file system, only the small journal. This will matter later.
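A toy sketch of that two-stage write, under the very simplified assumption that the "disk" and the journal are just in-memory structures; real journals operate on blocks and are far more involved.

```python
# Toy illustration of a journaling write: first append the change to a
# sequential journal, then apply it to its final location at checkpoint time.
# The in-memory "disk" and the journal format are invented for this sketch.
import json

journal = []   # stage 1: sequential log of pending writes (fast to append)
disk = {}      # stage 2: the "real" final location of the data

def journaled_write(block_no, data):
    journal.append(json.dumps({"block": block_no, "data": data}))
    # the actual placement happens later, e.g. when the journal is full

def checkpoint():
    """Replay the journal into its final location, then truncate it."""
    global journal
    for entry in journal:
        rec = json.loads(entry)
        disk[rec["block"]] = rec["data"]
    journal = []  # after a crash, only this small log would need checking

journaled_write(7, "hello")
checkpoint()
```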
File systems also come in flat and hierarchical varieties.
Flat is FAT12. You have most likely never run into it. It had no directories: all files lived in the root directory and were directly accessible by offset in the FAT table.
Hierarchical. In fact, organizing directories on top of a flat file system is not that hard; in the project where we are solving this problem now, we did exactly that. However, every modern file system you have come across is hierarchical, from NTFS to ZFS. They all store directories as files containing a list of that directory's contents, so to get to a file at the 10th nesting level you have to open 10 files in turn, and the 11th is the one you actually want. A SATA disk gives you about 100 IOPS, 100 operations per second, and you have already spent 10 of them. So a file at the 10th nesting level, if none of those directories are in the cache, cannot be opened in less than 0.1 seconds, even if your system is doing nothing else.
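The same estimate as a back-of-the-envelope calculation; 100 IOPS and 10 levels are the speaker's round numbers, not measurements.

```python
# With cold caches, each directory level of a hierarchical file system costs
# roughly one random disk read before the target file can even be opened.
SATA_IOPS = 100          # ~100 random operations per second on a cheap SATA disk
nesting_level = 10       # the file sits 10 directories deep

seconds_to_open = nesting_level / SATA_IOPS
print(seconds_to_open)   # 0.1 s spent before the file itself is touched
```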
All modern file systems support access control and extended attributes, except FAT. FAT, oddly enough, is still in use and supports neither access control nor any extended attributes. Why do I bring this up? Because it is a very important moment for me. The terrible tale of file systems began for me in 1996, when I looked closely at how access control is organized in traditional UNIX. Remember the permission mask? Owner, owner's group, everyone else. I had a task: I needed one group that could read and write a file, another group that could only read it, and everyone else should not be able to do anything with the file at all. And then I realized that the traditional UNIX permission mask cannot express such a pattern.
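A small sketch of why the classic mode bits cannot express that pattern; the mode value below is the closest you can get with owner/group/other alone, and POSIX ACLs (e.g. setfacl) are the usual way out today.

```python
# The traditional UNIX mode has exactly three slots: owner, one group, others.
# Sketch of the pattern from the talk: group A read+write, group B read-only,
# everyone else nothing. There is simply no second group slot for B.
import stat

mode = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP  # rw-rw----
print(oct(mode))  # 0o660: "group" here can only be ONE group,
                  # and "others" is already locked down to nothing.

# On modern Linux this is typically solved with POSIX ACLs, outside the mask:
#   setfacl -m g:groupB:r file
```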
Just a bit of theory. POSIX is a de facto standard, supported today by all the operating systems we use. For our purposes it is simply a list of calls that a file system must support. What matters to us in all this? open. Work with a file in POSIX is done not by file name but through a file handle, which you request from the file system by name. After that you use this handle for all operations. The operations can be as simple as read, write and seek, and it is seek that makes it impossible to build a distributed file system to the POSIX standard.
Why? Because seek is a jump to an arbitrary position in the file, so in reality we do not know what we are reading and we do not know where we are writing. To understand what we are doing, we need the handle that open returned to us and the current position in the file; this is the second level of addressing. A POSIX file system has to support random access through this second addressing, not just sequential patterns like "open the file and read it from beginning to end" or "open it and append every new block to the end". The fact that POSIX demands otherwise is what prevents (more on that later) building a good distributed POSIX file system.
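A minimal sketch of that call sequence using Python's os module; the path is just an example.

```python
# POSIX in miniature: you get a handle from open(), and read/write are only
# meaningful relative to the position that lseek set on that handle.
import os

fd = os.open("/etc/hostname", os.O_RDONLY)  # example path; the handle, not the
                                            # name, is used from here on
os.lseek(fd, 0, os.SEEK_SET)                # random repositioning: the second addressing
chunk = os.read(fd, 4096)                   # what we read depends on the current position
os.close(fd)
```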
What else is there in POSIX file systems? Not every POSIX implementation supports the same set of atomic operations, but in any case a certain number of operations must be atomic. An atomic operation is one that either happens completely or does not happen at all, something like a transaction in a database, though it only resembles one. For example, on ext4, which we should all be familiar with since we have gathered at this conference, creating a directory is an atomic operation.
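Because directory creation is atomic, it is often abused as a crude lock; a minimal sketch, with an arbitrary lock path.

```python
# mkdir either succeeds for exactly one caller or fails: a poor man's lock.
import os

def try_lock(path="/tmp/my-app.lock"):
    try:
        os.mkdir(path)          # atomic: only one process can create it
        return True
    except FileExistsError:
        return False            # somebody else already holds the lock

if try_lock():
    print("got the lock")
```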
The last bit of theory: various things that are not strictly required for a file system to function, but are sometimes useful.
Online compression is when we compress a block as we write it. It is supported, for example, on NTFS.
Encryption. Also supported on NTFS; ext4 supports neither compression nor encryption itself, there it is handled by block devices that support both. In fact, neither is required for a file system to function normally.
Deduplication. A very important point for today's file systems. For example, we have a project with 450 million files but only 200 million chunks; that means roughly half of the files are identical and merely have different names.
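A toy content-addressed store showing how deduplication falls out of hashing chunks; all names here are invented for the example.

```python
# Identical chunks hash to the same key, so they are stored exactly once.
import hashlib

store = {}                                  # hash -> chunk data

def put_chunk(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store.setdefault(key, data)             # a second identical chunk costs nothing
    return key

a = put_chunk(b"same bytes")
b = put_chunk(b"same bytes")
assert a == b and len(store) == 1           # 450M files, 200M chunks: same effect
```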
Snapshots. Why are they important? Because when your file system is 5 TB, a consistent copy of it simply cannot be made, unless you stop all your processes and start reading. Reading 5 TB from a cheap SATA disk takes about 5-6 hours by my estimate, at roughly a terabyte per hour. Can you stop your services for 5 hours? Probably not. So if you need a consistent copy of the file system, you need snapshots.
Today snapshots are supported at the block device level in LVM, and they are supported disgustingly there. Look: we create an LVM snapshot, and our linear reads turn into random ones, because we now have to read partly from the snapshot and partly from the base block device. Writes are even worse: we have to read the old data from the base volume, write it off to the snapshot, and only then perform the write itself. In reality, LVM snapshots are useless.
There are good snapshots on ZFS: you can have many of them and they can be sent over the network, so if you have made a copy of the file system you can ship snapshots to it. In general, snapshots are optional for file storage functionality, but very useful, and later it will turn out that they are in fact mandatory.
The last, but perhaps the most important, part of this theory.
Read caching. Back when the NT4 installer was launched from under MS-DOS, installing NT4 without SmartDrive running (smartdrv, the read cache in MS-DOS) took 6 hours, and with SmartDrive it took 40 minutes. Why? Because if we do not cache directory contents in memory, we are forced to make those 10 hops every single time.
Write caching. Until recently it was considered very bad form to enable write caching on a device unless you had a RAID controller with a battery. Why? Because the battery lets the controller preserve in memory the data that has not yet reached the disks if the server loses power unexpectedly. When the server comes back up, that data can be flushed to disk.
Clearly the operating system alone cannot guarantee anything of the kind; it has to be done at the controller level. Today the relevance of that approach has plummeted, because we now have very cheap "RAM" called an SSD. Caching writes to an SSD is one of the simplest and most effective ways to improve the performance of a local file system.
All of that was about file systems local to your machine.
HighLoad means the network. It is a network also in the sense that your visitors come to you over the network, which means you need horizontal scaling. Network file access protocols fall into two groups. Stateless: NFS, WebDAV and a few other protocols.
Stateless means that each operation is independent: its result does not depend on the result of the previous one. The POSIX file system standard is not like that: the results of read and write depend on the result of seek, which in turn depends on the result of open. Nevertheless, stateless protocols such as NFS are layered on top of POSIX file systems, and that is NFS's main problem. Why is NFS such crap? Because it is stateless on top of stateful.
Stateful. Stateful protocols are used more and more in network exchange today. This is very bad: a stateful protocol over the Internet is bad because the delays are unpredictable. Nevertheless, more and more often some JSON JavaScript interface remembers how its previous exchange with the server ended and requests the next JSON based on that. Among network file systems with a stateful protocol, CIFS, a.k.a. Samba, deserves a mention.
Fault tolerance is a treacherous thing. Traditional file systems are built around data integrity, because their creators were hypnotized by the word "storage": they believed the most important thing was to keep the data and protect it. Today we know this is not so. At RootConf we listened to a talk by a person who does data protection in data centers; he told us firmly that they are abandoning not only hardware disk arrays but software ones as well. They use each disk separately, plus a system that tracks where data is placed on those disks and how it is replicated. Why? Because if you have an array of, say, five 4 TB disks, it holds 20 TB. When one of the disks dies, rebuilding the array means reading, in reality, all 16 surviving terabytes, and 16 TB are read at about a terabyte per hour. So we lost only one disk, but to bring the array back up we need 16 hours of work, and in today's conditions that is unacceptable.
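The rebuild arithmetic from that example, spelled out as a tiny calculation; the disk sizes and throughput are the speaker's round numbers, not measurements.

```python
# Losing one disk of a 5 x 4 TB array forces a read of the surviving 16 TB
# at roughly 1 TB/hour on cheap SATA disks.
disks, disk_tb = 5, 4
read_tb_per_hour = 1

surviving_tb = (disks - 1) * disk_tb          # 16 TB that must be re-read
rebuild_hours = surviving_tb / read_tb_per_hour
print(rebuild_hours)                          # 16.0 hours of degraded operation
```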
Today, fault tolerance for binary storage means, first of all, uninterrupted reading and, oddly enough, writing. Your storage must not turn into a pumpkin at the first sneeze, busy only with guarding the data hidden inside it. If some data is lost, so be it; the main thing is that the show goes on.
What else is important to say about network binary storage? The CAP theorem: pick any two of these three:
either your data is always consistent and always available, but then it lives on a single server;
or your data is always consistent and distributed across several servers, but then access to it will be limited from time to time;
or your data is always available and distributed across servers, but then there is no guarantee at all that you will read the same thing from one server and from another.
The CAP theorem is only called a theorem, nobody has proved it, but in practice this is indeed how things are. Attempts to refute it are made constantly; for example, OCFS2 (Oracle Cluster Filesystem version 2), which I will mention a little later, is an attempt to prove that the CAP theorem is void.
That was all about file systems. So what is the problem with binary storage? Let's see.
The simplest idea that occurs to every system administrator who needs to store terabytes of data and millions of files is to just buy a big dedicated storage system.
Why is a big storage system not the answer? Well, if you have a big storage system and a single server that talks to it, or you can split your data into pieces with one server handling each piece, then you have no problem.
But if you scale horizontally, if you keep adding servers that have to serve these files or, God forbid, first process them and only then serve them, you will discover that you cannot simply put some file system on top of a big shared storage system.
When DRBD first fell into my hands, I thought: great, I will have two servers with DRBD replication between them, and servers will read from both. It quickly turned out that every server caches its reads, which means that if something quietly changes on the block device, a machine that did not make the change has no way of knowing which cache entries to invalidate, and keeps reading data from places where it no longer actually lives.
To overcome this problem there are various file systems that provide cache invalidation; essentially, that is what they spend their time doing on every machine that has the shared storage mounted.
Even so, OCFS2 has a problem: it crawls under concurrent writes. Remember, we talked about atomic operations, the ones that happen as a single piece. In a distributed file system, even if all the data sits on one big storage system, an atomic operation requires the whole crowd of readers and writers to reach a consensus.
Consensus over a network means network delays, so writing to OCFS2 is, frankly, pain. In fact Oracle are no idiots: they made a good file system, they just made it for a completely different task, namely sharing the same database files between several of their Oracle servers. Oracle database files have an access pattern that works fine on OCFS2. For file storage it turned out to be unsuitable; we tried it back in 2008. It also turned out to be unpleasant because of clock skew: the time differs slightly between all the virtual machines, even ones running on the same host, and OCFS2 does not tolerate that. Sooner or later the node responsible for consistency sees time go backwards and falls over at that point, and so on. And why it is so slow I have already explained.
And one more thing: a really big storage system is quite hard to even get your hands on. At Hetzner, for example, nobody will give you a big storage system. I also suspect that the notion that a big storage system is very reliable, very good and very fast is simply the result of correctly sizing the required resources. You cannot just buy a big storage system in a shop; you have to go to an authorized vendor and talk to him. He scratches his head and says: "This is what it will cost you." And he will quote you 50-100 thousand for a single chassis, which you still have to fill with disks, but he will size it correctly. That storage system will then run at 5-10 percent load, and if it turns out your workload has grown, they will advise you to buy another one just like it. That is how it goes in real life.
Fine, so be it. A big storage system is not an option; we found that out through sweat and blood.
So we take some cluster file system. We tried several: Ceph, Lustre, LeoFS.
First, why is it so slow? It is clear why: because of synchronous operations across the whole cluster. What does "rebalancing" mean? HDFS has no automatic rebalancing of data that is already in place. Why not? Because while rebalancing is happening on Ceph, we effectively lose the ability to work with it. Rebalancing is a routine procedure that eats roughly 100% of the disk bandwidth, that is, disk saturation is 100%. And rebalancing, which Ceph triggers at every sneeze, sometimes runs for 10 hours. So the first thing people who work with Ceph learn is how to throttle the intensity of rebalancing.
In any case, on the cluster we are currently using in the project with lots of files and lots of data, we had to crank the rebalancing up instead, and there we really do have 100% disk saturation. Disks fail very quickly under that load.
Why is rebalancing like this? Why does it fire at every sneeze? All these "whys" have so far remained unanswered for me.
And then there is the problem of atomic operations that have to pass through the entire cluster at once. While you have two machines in the cluster you are fine; when you have 40 machines you find that all those 40 machines have to talk to each other, so it is on the order of 40², the number of network packets we have to send. Rebalancing protocols and consistency protocols try to fight this, but so far not very successfully. For now, systems with a single point of failure in the form of a namenode are slightly ahead in this respect, but not by much.
Why not just put all the files into a database? From my point of view that is exactly what should be done, because once the files are in a database we have a large toolbox of good instruments for working with such things. We know how to work with databases of a billion rows, of several billion rows, of dozens of petabytes; all of that we handle well. Take Oracle if you like, take DB2, take some NoSQL. Why does this work? Because a file is a file: a file cannot be treated as an atomic entity, which is why distributed file systems exist so poorly, while distributed databases exist just fine.
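As an illustration of how simple the "file as a row" approach is for small files, here is a minimal sketch using SQLite as a stand-in; the talk's project used its own NoSQL-backed store, so this shows only the idea, not their implementation.

```python
# Small files become single rows with a BLOB column. SQLite is used here only
# because it needs no server; any database with BLOB support would do.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT PRIMARY KEY, body BLOB)")

def put_file(name: str, body: bytes):
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (name, body))

def get_file(name: str) -> bytes:
    return db.execute("SELECT body FROM files WHERE name = ?", (name,)).fetchone()[0]

put_file("avatar.png", b"\x89PNG...")        # hypothetical file name and content
assert get_file("avatar.png").startswith(b"\x89PNG")
```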
What finally buries all these ACFS, Lustre and the rest is the need to back files up. How do you imagine a backup of 20 TB? Remember: a terabyte per hour. And most importantly: where to, how often, and how do you ensure consistency at that volume if you do not have a single file system and cannot take a snapshot? The only way out I personally see is file systems with versioning: you write a new version of a file and the old one does not go anywhere, and you can reach it by specifying the point in time at which you want to see the state of the file system. On top of that there must, of course, be some kind of garbage collection.
Microsoft promised us such a file system back in the 90s, but never delivered it. Yes, there was supposed to be a distributed file system for Windows, they even announced it for Longhorn, but then neither Longhorn nor that file system ever happened.
Why is backup important? Backup is not fault tolerance; it is protection against operator error. I myself once managed to mix up source and destination in an rsync command and ended up (a magical story!) with a server running 16 virtual machines whose image files no longer existed, because I had deleted them. I had to pull the images back out with dd from inside the running virtual machines themselves. That time it worked out. But still: we are obliged to provide versioning in our binary storage, and there is no file system in the world that properly provides versioning, except ZFS, which is not clustered and therefore does not suit us.
So what to do? For a start, study your own task.
Do you need distributed storage at all? If you can put all your files on one storage system and process them on one super-powerful server, do that. Today you can get a server with 2 TB of memory and several hundred cores. If your budget stretches to all of that and all you need is file storage, go that way; it may well work out cheaper for the business as a whole.
POSIX. If you do not need random reads or random writes, that is a big plus: you can get by with the existing options, for example HDFS, mentioned earlier, or CephFS or Lustre. Lustre is an excellent file system for a compute cluster, but for a serving cluster it is no good at all.
Large files: do you need them? If all your files can be considered small (small, I remind you, is a matter of the situation, not a property of the file), if you can afford to treat a file as a single piece of data, you have no problem: put them in a database and you are fine. Why did it work out for us in the project that I keep mentioning but not naming? Because 95% of the files there are under 64 KB, so a file is always a single row in the database, and in that situation everything works fine.
Versioning: do you need it? There are in fact situations where versioning is not required, but then backup is not required either: these are the situations where all your data is generated by your robots. Your file storage is then effectively a cache; there is no room for operator error and nothing to lose.
How big does the storage have to be? If the capacity of a single file system covers your needs, great, very good.
Are we going to delete files? Oddly enough, this matters. There is a legend (in fact, not a legend) that VKontakte never deletes anything: once you upload a picture or a tune there, it stays forever; only the links to it are removed, and the file space is never reused. So they say; I heard such a talk. Why? Because as soon as you try to reuse space, you immediately get serious consistency problems. Why is OCFS2 fine for an Oracle database? Because Oracle does not reuse space: when you write new data to the database, it is simply appended to the end of the file and that is it. If you want to reclaim space, you run a compaction; I do not know how it is in modern Oracle, but that is how it was in 2001. Compaction is an offline operation, and it guarantees consistency by exclusively owning the file it processes. So: are we going to reuse disk space? VKontakte just keeps plugging in new disks and it works, and I believe that is the right call.
What will the load profile be? Reading or writing? Many distributed file systems perform very poorly on writes. Why? Because of consistency, because of atomic operations, because of synchronous operations across the whole cluster. NoSQL databases have only one synchronous cluster-wide operation, usually incrementing a record's version. The data itself may arrive later, but the version of a particular record is something all nodes must agree on. And even that is not true of every NoSQL: Cassandra, for example, does not bother with this; Cassandra has no synchronous cluster-wide operations at all. If you only read, try one of the cluster file systems; it might work for you. Those are the success stories where people come up and say: "Why did you build all this, you should have just taken Lustre." Yes, Lustre worked in your situation, but not in ours.
For some combinations of requirements, for some tasks, a ready-made solution exists; for others it does not, and it genuinely does not. If you searched and did not find one, that means there is none.
So what do you do then? Here is where to start:
you can spend a couple of months begging management for 200 thousand euros and, once you get them, do it properly. Only do not beg for 200 thousand: first go to the vendor, work out with him how much you actually need to ask for, and then ask for about one and a half times more;
or, after all, put all the files into a database. That is the path I took: we put 450 million files into a database, and the trick worked because we do not need any POSIX and 95% of our files are small;
or you can write your own file system. A variety of algorithms exist; we wrote ours on top of our NoSQL database, you can take something else. We wrote the first version on top of the PostgreSQL RDBMS; we ran into problems there, not immediately but after about two years, but still. It is really not that hard; even a POSIX file system is not that hard to write: take FUSE and off you go, there are not that many calls and all of them can be implemented (a minimal sketch of the idea follows below). A file system that works well, though, is still hard to write.
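To give a feel for how little is needed to "take FUSE and go", here is a deliberately tiny read-only sketch using the third-party fusepy package; the in-memory dict and file names are invented for illustration and have nothing to do with the file system described in the talk.

```python
# A minimal read-only FUSE file system: an in-memory dict served as files.
# Requires fusepy (pip install fusepy) and a FUSE-capable OS.
import errno
import stat
import sys

from fuse import FUSE, Operations, FuseOSError

FILES = {"/hello.txt": b"hello from a tiny file system\n"}  # invented content

class DictFS(Operations):
    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path in FILES:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(FILES[path])}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name[1:] for name in FILES]

    def read(self, path, size, offset, fh):
        return FILES[path][offset:offset + size]

if __name__ == "__main__":
    # usage: python dictfs.py /some/mount/point
    FUSE(DictFS(), sys.argv[1], foreground=True)
```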
This report is a transcript of one of the best talks at HighLoad++ Junior, the training conference for developers of high-load systems. We are now actively preparing its big brother, the 2016 conference: this year HighLoad++ will take place in Skolkovo on November 7 and 8.