
Looking for the perfect file storage


Earlier we considered a prototype of a scalable read-only file system. It was shown that the proposed architecture makes it possible to build a file system of any capacity, with guaranteed access time comparable to that of accessing a file within a single physical disk.
Next, we will try to figure out whether such an approach can be useful in building a general-purpose file system.

It is worth noting up front that the author is not a recognized expert in the field of distributed file systems and does not aim to make mankind happy with yet another ingenious invention. Rather, I want to present the community with an idea that seems sensible and try to shake it out in discussion. Besides, public discussion may protect the idea from being fenced off by patents.

So, let us recall the main conclusions made when building the read-only prototype system:
  1. The contents of the file system can be divided into two parts:
    • The data proper, i.e. the file contents
    • Metadata describing the location of files and related information
  2. Metadata should not be part of the file system structure; it is just data about other data. There should be no special mechanisms responsible for the user-defined structure of the file system.
  3. To describe a file system, a tree representation works well: the key is the path, and the value is the file data (plus the attributes as an appendage). A minimal sketch of this representation follows the list.
  4. Metadata is easily compressed.
  5. Data and metadata can be stored on different media, including media with a different physical structure. The relatively small metadata can thus be placed on faster and more expensive devices, while the actual file data goes onto something simpler.
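
To make points 1–3 concrete, here is a minimal sketch of such a representation, assuming an in-memory sorted map (the class and field names are illustrative, not part of the original design); a real system would of course keep this structure in an on-disk balanced tree:

```python
from dataclasses import dataclass
from typing import Dict, Optional
import bisect

@dataclass
class FileMeta:
    """Metadata entry: attributes plus a reference to the file body,
    a BLOB id that may live in a completely different storage space."""
    size: int
    mode: int
    blob_id: Optional[int]   # None for directories

class FsTree:
    """Toy 'file system as a sorted map': key = full path, value = metadata."""
    def __init__(self):
        self._keys = []                          # sorted list of paths
        self._meta: Dict[str, FileMeta] = {}

    def insert(self, path: str, meta: FileMeta) -> None:
        if path not in self._meta:
            bisect.insort(self._keys, path)
        self._meta[path] = meta

    def lookup(self, path: str) -> Optional[FileMeta]:
        return self._meta.get(path)

    def list_dir(self, prefix: str):
        """A directory listing is just a range scan over the key space."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._meta[self._keys[i]]
            i += 1

fs = FsTree()
fs.insert("/etc/", FileMeta(size=0, mode=0o755, blob_id=None))
fs.insert("/etc/hosts", FileMeta(size=220, mode=0o644, blob_id=42))
print(list(fs.list_dir("/etc/")))
```

Note that the metadata structure knows nothing about where blob 42 actually lives; that is exactly the separation from point 5.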


Now, the goals we want to achieve:


  1. Modularity (scalability) by capacity
  2. Performance modularity
  3. External interface modularity
  4. Logarithmic performance degradation with data growth
By modularity we mean the ability to increase the corresponding capacity block by block whenever it becomes a bottleneck.
To be clear, we are not talking about a distributed network of devices spread over a physical network of obscure topology. It is assumed that the whole thing looks like a single device implementing the functions of a general-purpose file system, and its internal structure is purely our internal affair.

A little about the problems

Modularity is good, but let's start from the end, with the logarithmic performance degradation. What does using a balanced tree imply?

A tree is a well-known and understandable structure that has long been successfully used in DBMS and file systems. The undoubted advantage of balanced trees is the ability to grow and shrink without degradation, while maintaining acceptable utilization of disk space.

In our case, the file system is such a tree, where the key is the path to the file and the value is its attributes plus the file body. The body is an analogue of a BLOB and can live in another address space, for example in a different storage environment.

The initial impulse was exactly that: what if we build a "super-tree" whose "super-pages" are autonomous disk subsystems which, on overflow, logically split in two, just as in a B-tree. On underflow, two super-pages logically merge into one, and the freed disk space goes back to the reserve. The assumption is that transitions within a super-page are relatively cheap, while transitions between super-pages are expensive.
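
A minimal sketch of that split/merge behaviour, with super-pages modeled simply as bounded, sorted key lists (the capacity and names are illustrative; a real super-page would be a whole disk subsystem):

```python
CAPACITY = 4  # illustrative; a real super-page is an autonomous disk subsystem

class SuperPage:
    """A 'super-page' holding a contiguous, sorted slice of the key space."""
    def __init__(self, keys=None):
        self.keys = sorted(keys or [])

    def is_overflowing(self) -> bool:
        return len(self.keys) > CAPACITY

    def is_underfilled(self) -> bool:
        return len(self.keys) < CAPACITY // 2

    def split(self):
        """On overflow: logically fall apart into two halves, as in a B-tree."""
        mid = len(self.keys) // 2
        return SuperPage(self.keys[:mid]), SuperPage(self.keys[mid:])

    @staticmethod
    def merge(left, right):
        """On underflow: two neighbours merge, freeing one unit of storage."""
        return SuperPage(left.keys + right.keys)

page = SuperPage(["/etc/hosts", "/home/a", "/home/b", "/usr/bin/ls", "/var/log"])
if page.is_overflowing():
    left, right = page.split()   # each half keeps a contiguous key range
```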

Unfortunately, implemented head-on, such a scheme is not viable. The explanation is rather tedious, but necessary.

Trees imply a paged organization of disk space. Since pages link to each other, page identifiers must exist. A page identifier is a number that can easily be converted into an offset within a file or device. The identifier space is one-dimensional: even if the pages live on a heap of disks, we must specify in advance (possibly implicitly) how the address space is cut across these devices.
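
For illustration, a sketch of the usual flat scheme, assuming simple striping (the constants are illustrative): pages are numbered in a single one-dimensional space, and a rule agreed on in advance turns a page id into a device and a byte offset:

```python
PAGE_SIZE = 8192          # bytes per page (illustrative)
PAGES_PER_DISK = 1 << 20  # how we agreed in advance to cut the address space

def page_to_location(page_id: int):
    """Flat one-dimensional page id -> (disk index, byte offset on that disk)."""
    disk = page_id // PAGES_PER_DISK
    offset = (page_id % PAGES_PER_DISK) * PAGE_SIZE
    return disk, offset

def location_to_page(disk: int, offset: int) -> int:
    """The inverse mapping, back into the flat id space."""
    return disk * PAGES_PER_DISK + offset // PAGE_SIZE

print(page_to_location(3_500_000))   # page 3 500 000 lands on disk 3
```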

What follows from this? On the one hand, when writing to the file system, we deal with the one-dimensional order of file paths, and we do not control these paths; the user is free to name files and directories as he pleases.
On the other hand, there is a one-dimensional page identifier space whose order is somehow tied to the sequence in which files are created. We do not control this order either.

There is simply no natural connection between the order of traversing the file tree and the order of the pages that have to be read. Of course, correlations can occur, for example when unpacking an archive, but it is hardly reasonable to rely on such a link. A situation where logically related information is physically widely separated (locality of references is broken) is quite normal and very likely.

Here is the phase diagram of page reads when traversing a test B-tree that was built by inserting keys in arbitrary order:
[image]
Fig. 1

On the abscissa is the number of the page being read now; on the ordinate, the page read just before it.
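
For reference, such a phase diagram can be produced from a recorded page-access trace roughly like this (a sketch; the trace here is synthetic random data, and matplotlib is just one way to plot it):

```python
import random
import matplotlib.pyplot as plt

# Synthetic trace of page ids visited while walking a tree that was built by
# inserting keys in random order: locality of references is largely destroyed.
trace = [random.randrange(10_000) for _ in range(5_000)]

# x = page read at step i, y = page read at the previous step
xs, ys = trace[1:], trace[:-1]

plt.scatter(xs, ys, s=1)
plt.xlabel("current page id")
plt.ylabel("previously read page id")
plt.show()
```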

And here is a tree containing the same data, but passed through a "settler", an accumulating buffer which, on overflow, sorts the data before writing it out:
[image]
Fig. 2

Ideally, when the insertion order coincides with the natural order of the key, we would get just a diagonal line.
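
The "settler" itself can be sketched as a write buffer that, on overflow, sorts its contents by key before flushing, so that pages end up allocated in roughly key order (the names and the flush callback are illustrative assumptions):

```python
class SettlerBuffer:
    """Accumulates incoming (key, value) writes; on overflow, sorts them by key
    and flushes in key order, so that page allocation roughly follows the
    natural order of the keys."""
    def __init__(self, capacity, flush_fn):
        self.capacity = capacity
        self.flush_fn = flush_fn       # called with (key, value) in key order
        self.pending = {}

    def put(self, key, value):
        self.pending[key] = value
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        for key in sorted(self.pending):
            self.flush_fn(key, self.pending[key])
        self.pending.clear()

# Writes arrive in arbitrary order, but reach the tree sorted.
settler = SettlerBuffer(capacity=3, flush_fn=lambda k, v: print("write", k))
for path in ["/var/log", "/etc/hosts", "/home/a", "/usr/bin/ls"]:
    settler.put(path, b"...")
settler.flush()
```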

When writing file system metadata, using such a "settler" is impossible, and as the tree fills, the pages will be allocated rather according to the first scenario.

What does this threaten? That when the time comes to split a super-page, we will face a difficult choice:

On the dimensionality of the identifier space


An awkward question: why do we consider the page identifier space to be one-dimensional? Partly for historical reasons, but mainly because it suits everyone and there has been no reason to change anything. Technically, with hundreds or thousands of disks, it is not hard to make page addressing two-dimensional (disk ID, position on the disk) or three-dimensional (shelf ID, disk ID, position on the disk). What to do with such an anisotropic address space is less clear, but the fact that it is technically possible still has to be digested.
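
Purely as an illustration of what a non-flat page address could look like (the field names and the cost model are assumptions, not a worked-out design):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class PageAddr3D:
    """A three-dimensional page identifier: which shelf of disks, which disk,
    which position on that disk."""
    shelf: int
    disk: int
    position: int

def jump_cost(a: PageAddr3D, b: PageAddr3D) -> int:
    """Crude cost model: crossing a shelf is more expensive than switching
    disks, which is more expensive than a seek within one disk."""
    if a.shelf != b.shelf:
        return 100
    if a.disk != b.disk:
        return 10
    return 1 if a.position != b.position else 0

print(jump_cost(PageAddr3D(0, 0, 10), PageAddr3D(0, 3, 10)))   # 10
```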

Funnily enough, the author previously had to puzzle over how to pack a 2–4 dimensional spatial index more efficiently into a one-dimensional page address space. Now we need to construct the page space so as to fit two one-dimensional orders into it: the paths and the order in which files are created. A successful design would be able to play the role of a "settler", increase the locality of the file system and reduce the number of long, expensive transitions within it.

Let's start, perhaps.


Let's start with the description of the basic elements:

Let us analyze how our file system gets filled.



It's time to look at the big picture:


[image]
  1. Front end.
  2. I/O module (IO). Accepts requests from the outside, appoints an executor, waits for the response and sends back the result. By increasing their number, we scale the external bandwidth.
  3. An internal bus (for example, local Ethernet) through which the IO modules and the executive modules communicate.
  4. Another internal bus. Through it, the storage modules communicate with the executive modules.
  5. Executive module. Implements work with the tree and manages distributed transactions. By increasing their number, we scale the ability to execute requests in parallel (a rough sketch of the interaction follows this list).
  6. Storage module. With their help, we scale the capacity of the entire system.
  7. System registry
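
A rough sketch of how these modules could interact (the module names mirror the list above, but the message flow, routing and interfaces here are illustrative assumptions, not the original design):

```python
class StorageModule:
    """Holds one slice ('super-page') of the key space."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def put(self, key, value):
        self.data[key] = value

class ExecutiveModule:
    """Works with the tree: routes a key to the storage module owning it;
    a real system would also coordinate distributed transactions here."""
    def __init__(self, storages):
        self.storages = storages
    def _owner(self, key):
        # placeholder routing; the real system routes via the super-tree
        return self.storages[hash(key) % len(self.storages)]
    def read(self, key):
        return self._owner(key).get(key)
    def write(self, key, value):
        self._owner(key).put(key, value)

class IOModule:
    """Accepts an external request, picks an executor, returns the result."""
    def __init__(self, executors):
        self.executors = executors
    def handle(self, op, key, value=None):
        ex = self.executors[hash(key) % len(self.executors)]
        return ex.read(key) if op == "read" else ex.write(key, value)

# Wiring: capacity scales with storage modules, parallelism with executors,
# external bandwidth with IO modules.
storages = [StorageModule() for _ in range(4)]
executors = [ExecutiveModule(storages) for _ in range(2)]
front = IOModule(executors)
front.handle("write", "/etc/hosts", b"...")
print(front.handle("read", "/etc/hosts"))
```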


It is time to turn our gaze to the real world and see how the problems described are actually solved.


GlusterFS
We will consider its DHT mode as the closest to the topic of this article.

Sources:
http://www.gluster.org/
http://cloudfs.org/index.php/2012/03/glusterfs-algorithms-distribution/
http://people.redhat.com/ndevos/talks/Gluster-data-distribution_20120218.pdf

Swift

Sources:
http://docs.openstack.org/developer/swift/
http://habrahabr.ru/company/mirantis_openstack/blog/176195/
https://julien.danjou.info/blog/2012/openstack-swift-consistency-anysis

CephFS

Sources:
http://ceph.com/
http://ceph.com/papers/weil-ceph-osdi06.pdf
http://ceph.com/papers/weil-mds-sc04.pdf
http://ceph.com/papers/weil-crush-sc06.pdf
http://ceph.com/papers/weil-rados-pdsw07.pdf


Source: https://habr.com/ru/post/244491/

