
Binary (file) storage, a terrible tale with a gloomy end



Daniil Podolsky (Git in Sky)


My talk is called "Binary, a.k.a. file, storage", but in fact we are dealing with a terrible tale. The problem (and this is the thesis of my talk) is that right now there is no good, not even an acceptable, file storage system.

What is a file? A file is a named piece of data. What is important about it? Why isn't a file just a row in a database?

A file is too large to be treated as a single piece. Why? Say you have a service (this is a HighLoad conference, after all) that holds 100 thousand simultaneous connections. That is not so many, but if for each connection we serve a file of 1 MB, we need roughly 100 GB of memory just for the buffers for those files.



We cannot afford that. Usually, anyway. So we have to split these pieces of data into smaller pieces, which are called chunks, or sometimes blocks, and handle the blocks rather than the entire file. That gives us another level of addressing, and a little later it will become clear why this affects the quality of binary storage so much.
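To make the point concrete, here is a minimal sketch in Python of serving a file chunk by chunk instead of buffering it whole; the file name and the 64 KB chunk size are invented for the example, not taken from any particular server.

```python
# A sketch only: stream a file in small chunks so no connection ever needs
# a whole-file buffer. The 64 KB chunk size and file name are arbitrary.
def iter_chunks(path: str, chunk_size: int = 64 * 1024):
    """Yield the file piece by piece instead of loading it in one go."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# With 100,000 connections this needs on the order of 100,000 * 64 KB of
# buffer memory (a few GB) instead of 100,000 * 1 MB (about 100 GB).
for chunk in iter_chunks("big-file.bin"):
    pass  # hand each chunk to the socket / response writer here
```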


Nevertheless, despite the fact that there are no good binary storages in the world, file exchange is the cornerstone of modern mass data exchange. That is, anything transmitted over the Internet exists as a file, from HTML pages to streaming video: on the far side there is a file with the streaming video, and on our side too. Often not on our side, actually, but it is still a file.

Why is that? Why do we base our mass data exchange on file exchange? Because 10 years ago the channels were slow. 10 years ago we could not afford JSON interfaces, we could not query the server all the time, the latency was too high. We had to download all the data needed to show something to the user first, and only then let the user interact with that data, because otherwise everything lagged too much.



File storage. In fact, the term "storage" is extremely unfortunate, because they really ought to be called "giver-outers": their job is to hand data back, not to keep it.

Even 20-30 years ago they really were storages: you processed the data, wrote it to a tape, carried the tape to the archive. That is storage. Today nobody needs that. Today, if you have 450 million files, they all have to be hot. If you have 20 TB of data, any single byte of those 20 TB will certainly be needed in the very near future by some of your users. Unless you work in a corporate environment; but then, the word HighLoad is rarely applied to corporate environments.

Business requirements never say "store files", even when that is literally what is written. What gets called a backup system is not a backup system, it is a disaster recovery system. Nobody needs to store files; everyone needs to read files. This is important.

Why do files still have to be stored? Because they must be served, and to serve them you have to have them. And I must say this is not always the case: many projects, for example, do not store HTML pages but generate them on the fly, because storing a huge number of pages is a problem, while generating a huge number of HTML pages is not; it is a task that scales well.



File systems come in the old type and the journaling type. What is an old-type file system? We wrote data to it, and it put the data in the right place more or less immediately.

What is a journaling file system? It is a file system that writes data to disk in two stages: first we write the data to a special area of the disk called the journal, and then, when there is spare time or when the journal is full, we move the data from the journal to where it should actually live in the file system. This was invented to speed up startup: we know our file system is always consistent, so after an unclean restart of, say, a server we do not need to check the whole file system, only the small journal. This will matter later.
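To illustrate the two-stage write, here is a toy write-ahead journal in Python. It is only a sketch of the idea: the journal file name, the JSON record format and the whole-file writes are invented for the example and have nothing to do with how ext4 or any real journaling file system actually lays out its journal.

```python
# Toy write-ahead journaling: record the intent, apply it, then clear the journal.
import json
import os

JOURNAL = "journal.log"   # invented name for the example

def journaled_write(path: str, data: bytes) -> None:
    # Stage 1: append the intended change to the journal and force it to disk.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"path": path, "data": data.hex()}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # Stage 2: apply the change to its real location.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Stage 3: the change is durable, so the journal entry can be dropped.
    open(JOURNAL, "w").close()

def replay_journal() -> None:
    """After a crash, re-apply whatever is still sitting in the journal."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            entry = json.loads(line)
            with open(entry["path"], "wb") as f:
                f.write(bytes.fromhex(entry["data"]))
    open(JOURNAL, "w").close()
```

The point of the sketch is the recovery path: after a crash you only replay the small journal instead of checking the whole file system.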

File systems come in flat and hierarchical varieties.


All modern file systems support access control and extended attributes, except FAT. FAT, oddly enough, is still in use, and it supports neither access control nor any extended attributes. Why do I mention this here? Because it was a very important moment for me. My terrible tale about file systems began in 1996, when I looked closely at how access control is organized in traditional UNIX. Remember the permission mask? Owner, owner's group, everyone else. I had a task: I needed one group that could read and write a file, another group that could only read it, and everyone else should not be able to do anything with the file at all. And then I realized that the traditional UNIX permission mask cannot express such a pattern.
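A short Python sketch of why that pattern does not fit into plain mode bits (the file name is hypothetical): the mode has exactly one owner, one group and "others", so a second group with different rights simply has nowhere to live; POSIX ACLs were added later precisely for cases like this.

```python
# Classic UNIX mode bits: one owner, one group, everyone else.
import os
import stat

path = "report.dat"                     # hypothetical file
open(path, "w").close()

# rw for the owner and the owning group, nothing for others: 0o660.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP)

st = os.stat(path)
print(oct(st.st_mode & 0o777))          # -> 0o660
print(st.st_gid)                        # only ONE group id fits here, so a
                                        # second, read-only group cannot be
                                        # expressed with the mask alone
```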



Just a bit of theory. POSIX is a de facto standard; it is supported by all the operating systems we use. In reality it is simply a list of calls that a file system must support. What matters to us in all this? Open. Working with a file in POSIX is done not by file name but through a file handle, which you request from the file system by the file name. After that you must use this handle for all operations. The operations are plain read, write and seek, and it is these that make it impossible to create a distributed file system to the POSIX standard.

Why? Because seek is a jump to an arbitrary position in the file, i.e. in reality we do not know what we are reading and we do not know where we are writing. To understand what we are doing, we need the handle that open returned to us, and we need the current position in the file: this is the second level of addressing. A POSIX file system supports random access at this second level, not just sequential access of the kind "open and read the file from beginning to end" or "open and write, with each new block appended to the end of the file". The fact that POSIX requires this is what does not allow (more on that later) a good distributed POSIX file system to be created.
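Here is what that second addressing looks like at the call level, as a small Python sketch using the raw POSIX-style calls; the file name and offsets are arbitrary examples.

```python
# A minimal sketch of POSIX-style access: name -> handle, then a hidden
# current position that every read/write depends on.
import os

fd = os.open("video.bin", os.O_RDONLY)   # first addressing: name -> handle

os.lseek(fd, 4096, os.SEEK_SET)          # second addressing: move the position
chunk = os.read(fd, 1024)                # what this returns depends on the seek

# The result of read() depends on hidden state created by open() and lseek(),
# which is exactly what a stateless network protocol cannot rely on.
os.close(fd)
```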

What else is there in POSIX file systems? In fact, not all of them support the same set of atomic operations, but in any case a certain number of operations must be atomic. An atomic operation is one that either happens entirely or does not happen at all. Something like a transaction in a database, though it only resembles one. For example, on the ext4 file system, which everyone gathered at this conference should be familiar with, creating a directory is an atomic operation.
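A classic illustration of relying on that atomicity is using mkdir as a crude lock: exactly one process succeeds in creating the directory, everyone else gets an error. A minimal Python sketch, with an invented lock path:

```python
# mkdir is atomic on a local file system, so a directory can serve as a lock.
import os
import time

LOCK_DIR = "/tmp/my-app.lock"   # invented path for the example

def acquire_lock(timeout: float = 10.0) -> None:
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.mkdir(LOCK_DIR)   # succeeds for exactly one process at a time
            return
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire lock")
            time.sleep(0.1)

def release_lock() -> None:
    os.rmdir(LOCK_DIR)
```

On a distributed file system this same atomicity is what forces cluster-wide coordination, which is where the trouble described later comes from.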



Finishing up the theory: various things that are not strictly required for a file system to function, but are sometimes useful.


And the last, but perhaps the most important, part of all this theory.

Read caching. Once upon a time, when the NT4 installer was run from under MS-DOS, installing NT4 without smartdrv running (that is the read cache in MS-DOS) took 6 hours, and with smartdrv running it took 40 minutes. Why? Because if we do not cache the contents of directories in memory, we are forced to go through those ten lookup steps every single time.
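As a toy illustration of the same effect, here is a Python sketch that memoizes directory listings so repeated lookups do not hit the disk; the names are illustrative, and a real cache would of course also need invalidation.

```python
# Memoize directory listings: repeated lookups are served from memory.
import os
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_listdir(path: str) -> tuple:
    return tuple(os.listdir(path))

def exists_in(path: str, name: str) -> bool:
    # The first call reads the disk; subsequent calls for the same directory
    # are answered from the in-memory cache.
    return name in cached_listdir(path)
```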

Write caching. Until recently it was considered very bad form; the only case where you could enable write caching on a device was when you had a RAID controller with a battery. Why? Because the battery lets the controller keep in memory the data that has not yet reached the disks if the server loses power unexpectedly. When the server comes back up, that data can be flushed to disk.

It is clear that the operating system cannot guarantee anything of the kind; this has to be done at the controller level. Today the relevance of this approach has plummeted, because today we have very cheap "RAM" called an SSD. Caching writes on an SSD is now one of the simplest and most effective ways to improve the performance of a local file system.

All of that was about file systems local to your computer.



HighLoad means a network. It is a network also in the sense that your visitors come to you over the network, and that means you need horizontal scaling. Accordingly, network file access protocols fall into two groups. Stateless: that is NFS, WebDAV, and a few other protocols.

Stateless means that each operation is independent: its result does not depend on the result of the previous one. The POSIX file system standard is not like that. The results of read and write depend on the result of seek, which in turn depends on the result of open. Nevertheless, stateless protocols such as NFS are layered on top of POSIX file systems, and that is NFS's main problem. Why is NFS such a mess? Because it is stateless on top of stateful.
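The difference is easy to see at the call level. A small Python sketch (file name and offsets are arbitrary): os.read depends on hidden per-handle state, while os.pread carries the offset in every call, which is the access style a stateless protocol actually wants underneath it.

```python
import os

fd = os.open("video.bin", os.O_RDONLY)

# Stateful, POSIX style: the answer depends on where a previous seek left us.
os.lseek(fd, 4096, os.SEEK_SET)
a = os.read(fd, 1024)

# Stateless style: every call names its own offset, so calls can be retried
# or reordered without changing the result.
b = os.pread(fd, 1024, 4096)

assert a == b
os.close(fd)
```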

Stateful. Today stateful protocols are used more and more in network exchange. This is very bad. A stateful protocol over the Internet is very bad, because delays are unpredictable; nevertheless, more and more often some JSON/JavaScript interface remembers how its previous exchange with the server ended and requests the next piece of JSON based on how the previous operation finished. Among network file systems with a stateful protocol, CIFS, a.k.a. Samba, deserves a mention.

Fault tolerance is a double-edged thing. Traditional file systems are built around data integrity, because their creators were fascinated by the word "storage". They thought the most important thing was to keep the data and protect it. Today we know that this is not so. At RootConf we listened to a talk by a person who handles data protection in data centers; he told us firmly that they are abandoning not only hardware disk arrays but software arrays too. They use each disk separately, plus a system that tracks where data lives on those disks and how it is replicated. Why? Because if you have an array of, say, five 4 TB disks, it holds 20 TB. If one of the disks fails and has to be rebuilt, in reality all the remaining 16 TB must be read, and reads run at about a terabyte per hour. That is, we lost only one disk, but to bring the array back we need 16 hours of work, and that is unacceptable in today's situation.

Today, fault tolerance for binary storage means, first of all, uninterrupted reading and, oddly enough, writing. Your storage must not turn into a pumpkin at the first sneeze, busy only with guarding the data hidden inside it. If some data is lost, so be it, it is lost; the main thing is that the show goes on.

What else is important to say about network binary storage? The same CAP theorem, i.e. choose any two of these three:

  1. either your data is always consistent and always available, but then it lives on a single server;
  2. or your data is always consistent and distributed across several servers, but then access to it will be unavailable from time to time;
  3. or your data is always available and distributed across servers, but then nobody guarantees you will read the same thing from one server and from another.

The CAP theorem is only called a theorem; nobody has proved it to the speaker's satisfaction, but in practice this really is how things are. Attempts to defeat it are made constantly; for example, OCFS2 (Oracle Cluster Filesystem version 2), which I will mention a little later, is an attempt to prove that the CAP theorem is void.

That was all about file systems. So what is the problem with binary storage? Let's see.

The simplest thing that comes to mind for any system administrator who needs to store terabytes of data and millions of files is to simply buy a large storage system (a dedicated storage appliance).



Why is a large storage system not an option? If you have a large storage system and a single server that talks to it, or you can split your data into pieces with one server per piece, then you have no problems.

But if you scale horizontally, if you keep adding servers that must serve these files or, God forbid, first process them and only then serve them, you will find that you cannot simply put an ordinary file system on top of a large shared storage system.

When DRBD first fell into my hands, I thought: great, I will have two servers with DRBD replication between them, and servers reading from both. It quickly became clear that every server caches reads. This means that even if something quietly changes on the block device, a machine that did not make the change has no way of knowing which cache entries to invalidate, and it keeps reading data from places where it no longer actually is.

To overcome this problem there are various file systems that provide cache invalidation; in effect they coordinate it across all the computers that have the shared storage mounted.

Even so, OCFS2 has a problem: it slows to a crawl under concurrent writes. Remember, we talked about atomic operations, operations that happen as a single piece. In a distributed file system, even if all the data sits on one large storage system, an atomic operation requires the whole set of readers and writers to reach consensus.

Consensus over the network means network delays, i.e. writing to OCFS2 is actually painful. In fact, Oracle are no idiots: they made a good file system, they just made it for a completely different task, namely sharing the same database files between several of their Oracle servers. Oracle database files have an access pattern that works fine on OCFS2. For file storage it turned out to be unsuitable; we tried it back in 2008. It was also unpleasant because of clock drift: since the time differs slightly on all the virtual machines, even ones running on the same host, OCFS2 does not work properly. Sooner or later the clock on the node that ensures consistency appears to go backwards, it falls over at that point, and so on. And why it is so slow, I have already explained.

And one more thing: a really large storage system is quite hard to get your hands on. At Hetzner, for example, nobody will give you a big storage system. I suspect that the belief that a large storage system is very reliable, very good and very high-performance simply comes from correct capacity planning. You cannot just buy a large storage system in a shop; they are not sold there. You go to an authorized vendor and talk to him. He nods and says: "This is what it will cost you." He will quote you on the order of 50-100 thousand for a single chassis, which you still have to fill with disks, but he will size it correctly. The storage system will run at 5-10 percent load, and if it turns out that your workload has grown, they will advise you to buy another one just like it. That is how it goes in real life.

Okay, so be it. A large storage system is not an option; we found that out with sweat and blood.



So we take some cluster file system. We tried several: Ceph / Lustre / LeoFS.

First, why is it so slow? That much is clear: because operations are synchronous across the whole cluster. What does "rebalancing" mean? HDFS has no automatic rebalancing of data that is already stored. Why not? Because while rebalancing is happening, as on Ceph, we lose the ability to work with the cluster. Rebalancing is a heavyweight procedure that eats roughly 100% of the disk bandwidth, i.e. disk saturation is 100%. Sometimes the rebalancing, which Ceph starts at every sneeze, lasts 10 hours, so the first thing people working with Ceph learn is how to turn the intensity of rebalancing down.

In any case, on the cluster we are currently using in a project with a lot of files and a lot of data, we had to turn rebalancing up, and there we really do have 100% disk saturation. Disks fail very quickly under that load.

Why is rebalancing like this? Why does it kick in at every sneeze? All these "whys" have remained unanswered for me so far.

And the other problem is atomic operations, which have to pass through the entire cluster at once. As long as you have two machines in the cluster you are fine; when you have 40 machines, you discover that all 40 of them... we get on the order of 40² network packets that have to be sent. Rebalancing protocols and consistency protocols try to fight this, but so far not very successfully. So far, in this sense, systems with a single point of failure in the form of a name node are slightly ahead, but not by much.
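As a rough back-of-the-envelope sketch of why this hurts, assuming naive all-to-all coordination for each cluster-wide operation:

```python
# Toy estimate: if every node must notify every other node about an operation,
# the message count grows roughly quadratically with cluster size.
def coordination_messages(n: int) -> int:
    return n * (n - 1)

for n in (2, 10, 40):
    print(n, coordination_messages(n))   # 2 -> 2, 10 -> 90, 40 -> 1560
```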



Why not just put all the files in a database? From my point of view, that is exactly what should be done, because if our files are in a database, we have a large toolbox of good instruments for working with such things. We know how to work with databases of a billion rows and of several billion rows, with petabytes and with dozens of petabytes; all of that is fine for us. Take Oracle if you like, take DB2, take some NoSQL. Why does this work? Because a file is a file. A file cannot be treated as an atomic entity, which is why distributed file systems exist poorly, whereas distributed databases exist just fine.
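A minimal sketch of the "file as rows" idea, using SQLite purely as an illustration; the schema, chunk size and database name are invented for the example, and any real deployment would pick an engine and schema to match its own workload.

```python
# Store a file as fixed-size chunks (rows) in an ordinary relational database.
import sqlite3

CHUNK = 1 << 20  # 1 MB chunks, arbitrary for the example

db = sqlite3.connect("blobs.db")
db.execute("""CREATE TABLE IF NOT EXISTS chunks (
                  name  TEXT,
                  seq   INTEGER,
                  data  BLOB,
                  PRIMARY KEY (name, seq))""")

def put_file(name: str, path: str) -> None:
    with open(path, "rb") as f, db:   # the connection commits on success
        seq = 0
        while chunk := f.read(CHUNK):
            db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
                       (name, seq, chunk))
            seq += 1

def get_file(name: str):
    # Stream the file back chunk by chunk, in order.
    for (data,) in db.execute(
            "SELECT data FROM chunks WHERE name = ? ORDER BY seq", (name,)):
        yield data
```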



And what finally puts a cross on all these ACFS, Lustre and the rest is the need to back files up. How do you imagine a backup of 20 TB? Remember: a terabyte per hour. And most importantly: where to, how often, and how do you ensure consistency at such a volume when you do not have a single file system and cannot take a snapshot? The only way out I personally see is file systems with versioning: you write a new version of a file, the old one does not go anywhere, and you can reach it by specifying the point in time at which you want to see the state of the file system. There must also be some sort of garbage collection.
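A sketch of what versioning with garbage collection could look like at its simplest: every write creates a new version, old versions stay readable by timestamp, and a separate pass removes them later. All paths and the version scheme here are invented for the illustration; this is not how any particular versioned file system works.

```python
# Append-only, versioned storage of whole files keyed by write timestamp.
import os
import time

ROOT = "versioned-store"   # invented directory layout

def write_version(name: str, data: bytes) -> str:
    os.makedirs(os.path.join(ROOT, name), exist_ok=True)
    version = str(time.time_ns())            # version id = nanosecond timestamp
    with open(os.path.join(ROOT, name, version), "wb") as f:
        f.write(data)
    return version

def read_as_of(name: str, as_of_ns: int) -> bytes:
    versions = sorted(int(v) for v in os.listdir(os.path.join(ROOT, name)))
    chosen = max(v for v in versions if v <= as_of_ns)   # latest version at that time
    with open(os.path.join(ROOT, name, str(chosen)), "rb") as f:
        return f.read()

def garbage_collect(name: str, keep_after_ns: int) -> None:
    # Drop versions older than the retention horizon, always keeping the newest.
    d = os.path.join(ROOT, name)
    versions = sorted(int(v) for v in os.listdir(d))
    for v in versions[:-1]:
        if v < keep_after_ns:
            os.remove(os.path.join(d, str(v)))
```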

Microsoft promised us such a file system back in the 90s, but never delivered it. Yes, there was a distributed file system for Windows, they even announced it for Longhorn, but then neither Longhorn nor that file system happened.

Why is backup important? Backup is not about fault tolerance; it is protection against operator errors. I myself once managed to mix up source and destination in an rsync command and got (a magical story!) a server that was running 16 virtual machines whose image files no longer existed, because I had deleted them. I had to pull them back out with dd from inside the running VMs. It worked out in the end. Nevertheless, we are obliged to provide versioning in our binary storage, and there is no file system in the world that provides versioning properly, except ZFS, which is not clustered and therefore does not suit us.

What to do? For a start, study your own task.


For some combinations of requirements, for some tasks, a solution already exists; for others it does not, and it genuinely does not. If you searched and did not find one, that means there is none.

What to do? Here is where to start:


This report is a transcript of one of the best talks at the conference for developers of high-load systems, HighLoad++ Junior. We are now actively preparing its big brother, the 2016 conference: this year HighLoad++ will take place in Skolkovo on November 7 and 8.


Source: https://habr.com/ru/post/313330/

