
DGFS - do-it-yourself fast file system

Sometimes a file system has to store a great deal of data, most of it static. When the files are few and large, that is tolerable. But when the data consists of a huge number of small files accessed in a pseudo-random order, the situation approaches a catastrophe.



I believe that a specialized read-only file system, other things being equal, has advantages over a general-purpose one, because:
  1. there is no free space to manage;
  2. there is no need to pay for journaling;
  3. fragmentation is not a concern, so files can be stored contiguously;
  4. all the meta-information can be collected in one place and cached effectively;
  5. and since the meta-information ends up in one heap, it would be a sin not to compress it.

In this article, we will work out how to organize such a file system, with maximum performance at minimum cost as the objective function.

Task


  1. 100 million small files (~8 KB each).
  2. A three-level hierarchy: 100 top-level directories, each containing 100 mid-level directories, each containing 100 lower-level directories, each containing 100 files.
  3. We optimize the build time and the average time to read a random file.

Hardware


First of all, the drive: all experiments were performed on a dedicated Seagate Barracuda ST31000340NS with a capacity of 1 TB.
Operating system: Windows XP.
Processor: AMD Athlon II X3 445, 3 GHz.
Memory: 4 GB.

Disk characteristics


Before starting the real work, I want to understand what to expect from the disk. So a series of measurements of random positioning and reading was made.



The graph shows histograms of read times from the disk at random positions. The abscissa is the combined positioning-and-read time; the ordinate is the number of hits in a 0.1 msec bucket.

Disk accesses were made in the usual way:
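The original snippet did not survive the translation, so here is a minimal sketch of how such a measurement might look with the raw Win32 API. The device path "\\.\PhysicalDrive1", the 4K block size, and the use of FILE_FLAG_NO_BUFFERING are my assumptions, not the author's exact setup.

    /* Sketch: time random 4K reads from the raw disk (assumed device and size). */
    #include <windows.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long long diskSize = 1000LL * 1000 * 1000 * 1000;  /* ~1 TB */
        const DWORD blockLen = 4096;                             /* one 4K block */

        /* FILE_FLAG_NO_BUFFERING bypasses the OS cache; it requires
           sector-aligned offsets and buffers (and admin rights for a raw disk). */
        HANDLE h = CreateFileA("\\\\.\\PhysicalDrive1", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                               OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        void *buf = _aligned_malloc(blockLen, 4096);

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        for (int i = 0; i < 10000; ++i) {                 /* one series */
            /* rand() alone is too small for a 1 TB range, so combine three calls */
            long long r = ((long long)rand() << 30) ^
                          ((long long)rand() << 15) ^ rand();
            LARGE_INTEGER pos;
            pos.QuadPart = (r % diskSize) / blockLen * blockLen;  /* align to 4K */

            QueryPerformanceCounter(&t0);
            SetFilePointerEx(h, pos, NULL, FILE_BEGIN);
            DWORD got = 0;
            ReadFile(h, buf, blockLen, &got, NULL);
            QueryPerformanceCounter(&t1);

            printf("%.3f\n", (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
        }

        _aligned_free(buf);
        CloseHandle(h);
        return 0;
    }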


Several series of measurements were made, 10,000 reads in each; each series is drawn in its own color and restricts the random positions to a different fraction of the disk (the whole disk, 1/2 of it, 1/4, and so on):




And here are the means and error bars computed from the previous graph; the x-axis is on a logarithmic scale, e.g. 1/16 corresponds to 4 (1/2 ** 4).

What conclusions can be drawn?

  1. In the worst case (random seeks across the whole disk), the expected time of a single read is 13 msec. In other words, this disk cannot do more than about 80 random reads per second.
  2. At 1/256 we see cache hits for the first time.
  3. At 1/1024 a noticeable share of reads, ~1%, starts to hit the cache.
  4. Beyond that, 1/256 no longer differs much from 1/1024: the head travel is already so short that we mostly see only the head's acceleration and settling.
  5. We can scale: it is acceptable to run experiments on a partially filled disk and extrapolate the results with reasonable confidence.

NTFS


Let's try creating a file the size of the disk and measuring the timing characteristics of working with it. To do this, we create a file comparable in size to the whole file system and read it at random positions with ordinary _fseeki64 / fread .
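For reference, a minimal sketch of what such a test might look like; the file name, the ~900 GB size and the 4K block are assumptions, and the per-read timing is omitted for brevity.

    /* Sketch: random 4K reads from one huge NTFS file via plain stdio. */
    #include <stdio.h>
    #include <stdlib.h>

    static long long rand_pos(long long limit)
    {
        /* rand() alone cannot cover a ~900 GB range, so combine three calls */
        long long r = ((long long)rand() << 30) ^ ((long long)rand() << 15) ^ rand();
        return r % limit;
    }

    int main(void)
    {
        const long long fileSize = 900LL * 1000 * 1000 * 1000;  /* ~900 GB */
        char buf[4096];

        FILE *f = fopen("d:\\bigfile.bin", "rb");
        if (!f) return 1;

        for (int i = 0; i < 10000; ++i) {
            _fseeki64(f, rand_pos(fileSize), SEEK_SET);  /* jump to a random offset */
            fread(buf, 1, sizeof buf, f);                /* read one 4K block */
            /* each seek+read pair is timed the same way as in the raw-disk test */
        }
        fclose(f);
        return 0;
    }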



The graph is similar to the histograms shown earlier. Two of the curves, marked (disk), are taken from the previous chart; the paired histograms marked (file) come from measurements of reads from the file. What can be noted:

  1. The times are close.
  2. Although the file occupies 90% of the file system (9E11 bytes), reading from it across its full size is, on average, almost a millisecond slower.
  3. The disk cache works better for the file than for the raw disk.
  4. Tails have appeared at ~6 msec: where the worst full-range read time used to be about 24 msec, it now stretches to almost 30. The same holds for 1/256. Apparently the metadata falls out of the cache and has to be read in addition.
  5. In general, though, reading from a large file behaves very much like working with the bare disk, and in some cases can replace or emulate it. This does NTFS credit, because, for example, the ext2 / ext3 structure requires at least a 4-byte block number for every 4 KB data block:



That is, roughly one extra byte must be spent for every kilobyte of data, which means a gigabyte of metadata per terabyte, and a gigabyte is not so easy to cache. ext4, however, does not seem to have this problem, or it is hidden by extents.

NTFS with hierarchy


Let us recall our test problem and try to solve it head-on with the file system itself. It immediately turns out that:

  1. It takes a lot of time: creating one branch of a million files takes about half an hour, with a tendency to slow down gradually as the disk fills up, in accordance with our first graph.
  2. Creating all 100 million files would therefore take 50+ hours.
  3. So we will take measurements on 7 million files (~1/16) and scale them to the full task.
  4. An attempt to run the measurements under conditions closer to real ones ended with NTFS collapsing at around the 12-millionth file, with roughly the same diagnosis the heroine received in the film "Death Becomes Her": "Actually, a shattered cervical vertebra is a very, VERY bad sign!"

First, consider the histogram in milliseconds:



So, here is a histogram of read times of random files.



And now let's look at the same data expressed in conditional seeks (the read time as a multiple of the average seek time):





Prototype


Let's build a prototype of our read-only filesystem:

  1. It consists of two files: data and metadata.
  2. The data file is simply all the files written one after another.
  3. The metadata file is a B-tree whose key is the file identifier and whose value is the file's address in the data file and its size.
  4. The file identifier is the encoded path to the file. Recall that our data lives in a three-level directory hierarchy where each level is encoded by a number from 0 to 99, and so is the file itself; the whole structure is packed into a single integer (see the sketch after this list).
  5. The tree is written as 4K pages; primitive compression is applied.
  6. To avoid fragmentation, the two files have to be created on different file systems.
  7. Simply appending data to the end of the data file is inefficient: the larger the file, the more its extension costs the file system (NTFS, WinXP). To avoid this non-linearity, a file of the required size (900 GB) has to be pre-created.
  8. This pre-creation takes more than 2 hours, all of which is spent filling the file with zeros, apparently for security reasons.
  9. Filling the file with actual data takes about the same time.
  10. Building the index takes about 2 minutes.
  11. Recall that there are 100 million files, each about 8K in size.
  12. The index file is 521 MB, i.e. ~5 bytes per file, or ~5 bits per kilobyte of data.
  13. A variant of the index with more elaborate compression (Huffman codes) was also tried; that index occupies 325 MB, i.e. 3 bits per kilobyte of data.
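A small sketch of the key/value layout described in items 3-4; the exact field widths are my assumption, the article only says that the path is packed into an integer and that the value holds an address and a size.

    #include <stdint.h>

    /* 100 * 100 * 100 * 100 = 10^8 combinations fit easily into 32 bits */
    static uint32_t make_file_id(unsigned top, unsigned mid, unsigned low, unsigned file)
    {
        return ((top * 100u + mid) * 100u + low) * 100u + file;
    }

    /* One index entry: where the file lives inside the data file */
    typedef struct {
        uint64_t offset;   /* byte offset in the data file */
        uint32_t size;     /* file size in bytes (~8K on average) */
    } IndexEntry;

Within a leaf page the keys are consecutive and the offsets grow monotonically, so storing small deltas instead of absolute values is presumably what lets the "primitive compression" of item 5 bring the index down to ~5 bytes per file.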

The following are histograms of times obtained from these data:




Findings:



  1. The quality of index compression has almost no effect on speed: CPU time is spent unpacking pages, but in return the page cache works better.
  2. Reading the index looks like a single read from a file of the corresponding size (see the first chart). Since 512 MB is 1/2000 of the disk, the average seek should take about 6.5 msec, plus a surcharge for reading from a file rather than the raw disk; apparently that is what we see. The index tree has three levels; the top two levels quickly end up in the cache, while caching the bottom level of pages is pointless. As a result, any search in the tree degenerates into a single read of a leaf page.
  3. If the data file and the index file are on the same physical disk, the total lookup time is two random full-distance seeks: one to read the data and one to get to the index file, which sits either at the very beginning or at the very end of the disk.
  4. If the files are placed on different devices, we get one full seek to read the data and one short seek into the index. Since the index file is relatively small, a natural solution suggests itself: if we need a storage of several (tens of) terabytes, dedicate a separate disk (an SSD) to the index and several disks to the data. The whole thing can be assembled as a JBOD and will look like an ordinary file system.
  5. This scheme has an important property: a guaranteed response time, which is hard to achieve in conventional file systems and is useful, for example, in real-time systems.

Now is the time to ask: fine, we have analyzed a very special case of a regularly organized file system (three levels with fixed names). What about ordinary data, with ordinary paths and names?

The answer is that nothing changes in principle. There will still be one main index, but it will be organized a little differently.

Above, the file identifier (the index key) was the encoded path to the file. It stays that way here: the key is the string containing the full path to the file. Everything else is the value: size, position, owner, group, permissions...

It may turn out that there are not that many distinct (owner, group, permissions) combinations, in which case it pays to store them as Huffman codes pointing into a global table of combinations. But these are details. In principle, the situation is no different from the one we have already tried.

To find a file, you specify its full path. To list the contents of a directory, you specify the directory path and select everything that matches that prefix. To move up the hierarchy, you bite off a piece of the path and again search by prefix.
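Here is a sketch of these operations over a sorted list of full paths standing in for the real index; the sample paths are invented, and in the real index a B-tree lookup and range scan would play the role of the linear scans shown here.

    #include <stdio.h>
    #include <string.h>

    static const char *paths[] = {          /* sorted lexicographically */
        "/photos/2013/a.jpg",
        "/photos/2013/b.jpg",
        "/photos/2014/c.jpg",
    };
    static const int npaths = sizeof paths / sizeof paths[0];

    /* exact lookup: find a file by its full path */
    static int find_file(const char *path)
    {
        for (int i = 0; i < npaths; ++i)
            if (strcmp(paths[i], path) == 0) return i;
        return -1;
    }

    /* directory listing: everything that matches the directory prefix */
    static void list_dir(const char *prefix)
    {
        size_t len = strlen(prefix);
        for (int i = 0; i < npaths; ++i)
            if (strncmp(paths[i], prefix, len) == 0)
                printf("%s\n", paths[i]);
    }

    int main(void)
    {
        printf("found at %d\n", find_file("/photos/2014/c.jpg"));  /* prints 2 */
        list_dir("/photos/2013/");      /* lists the two 2013 files */
        return 0;
    }

Moving up the hierarchy is the same list_dir call with a shortened prefix.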

But don't file paths take up a lot of space? Yes, but files in one directory share a common prefix that only needs to be stored once, and common suffixes help as well. To estimate how much data actually has to be stored, a series of experiments with real file system paths was carried out; they showed that paths compress by about a factor of 10 on average with bzip2 (BWT + suffix sorting + Huffman). So 5-6 bytes per file remain.
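One simple way to exploit that redundancy is front coding: for each path, store how many leading bytes it shares with the previous (sorted) path plus the remaining suffix. This is only an illustration of the idea; the article itself measured bzip2 rather than any particular scheme.

    #include <stdio.h>
    #include <string.h>

    typedef struct {
        unsigned short shared;   /* bytes shared with the previous path */
        const char    *suffix;   /* the rest of this path */
    } FrontCoded;

    static unsigned short common_prefix(const char *a, const char *b)
    {
        unsigned short n = 0;
        while (a[n] && a[n] == b[n]) ++n;
        return n;
    }

    int main(void)
    {
        const char *prev = "/photos/2013/a.jpg";
        const char *cur  = "/photos/2013/b.jpg";

        unsigned short n = common_prefix(prev, cur);
        FrontCoded fc = { n, cur + n };
        printf("(%u, \"%s\")\n", (unsigned)fc.shared, fc.suffix);  /* (13, "b.jpg") */
        return 0;
    }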

In total, we can count on 10-16 bytes per file, or ~2 bytes per kilobyte of data (with an average file size of 8K). A terabyte of data thus needs a 2 GB index, so an ordinary 64 GB SSD can serve 32 terabytes of data.

Analogs


It is hard to point to exact analogues.
For example, SquashFS with compression disabled can be used in a similar way. But it reproduces the traditional structure: it simply compresses the files, inodes and directories. So all the bottlenecks we have been diligently avoiding are inherent in it as well.

An approximate analogue is using a DBMS for storage: the file contents are stored as blobs, and the metadata as table columns. Simple and elegant. The DBMS itself takes care of placing the files efficiently in the available disk space and itself creates and maintains all the necessary indexes. Plus a nice bonus: we can modify our data without restrictions.
But:


Transactionality


Obviously, nobody needs a single-use file system. Filling a file system of several terabytes can take tens of hours, and during that time new changes accumulate, so there must be a way of applying changes that differs from a full rebuild, preferably without stopping the service. Well, this topic deserves a separate story, and in time we intend to tell it.

Summary


So, we have presented a prototype of a read-only file system that is simple and flexible, solves the test problem, and at the same time:

Source: https://habr.com/ru/post/198564/

