
Storage System for Billions of Records with Key Access

Even an elephant cannot hold that much data.


Formulation of the problem


On one of my past projects I was tasked with writing a system for storing billions of records. Data is accessed by key: in the general case, a single key corresponds to a set of records (in practice, up to tens of millions) that can be added but never modified or deleted.


The SQL/NoSQL storage systems we tested proved poorly suited to this number of records, so the client suggested developing a specialized solution from scratch.


Selected approach


After a series of experiments, the following approach was chosen. The data in the store is divided into sections, each of which is a file or directory on disk. A section corresponds to a CRC16 hash value, so there can be at most 65,536 sections. Practice shows that modern file systems (ext4 was tested) cope quite effectively with this many items in a single directory. So the added records are hashed by key and distributed into the corresponding sections.
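As an illustration, here is a minimal Go sketch of this key-to-section mapping. Go's standard library has no CRC16 package, so the snippet implements the common CRC-16/ARC variant; the actual CRC16 variant and directory layout used by Mastore are assumptions.

import (
	"fmt"
	"path/filepath"
)

// crc16 implements CRC-16 with the reflected polynomial 0xA001 (CRC-16/ARC).
// Which CRC16 variant Mastore actually uses is an assumption here.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b)
		for i := 0; i < 8; i++ {
			if crc&1 != 0 {
				crc = crc>>1 ^ 0xA001
			} else {
				crc >>= 1
			}
		}
	}
	return crc
}

// sectionDir maps a key to one of at most 65,536 section directories
// under the store path (the directory naming is illustrative).
func sectionDir(storePath, key string) string {
	return filepath.Join(storePath, fmt.Sprintf("%04x", crc16([]byte(key))))
}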


Each section consists of a cache (a file in which unordered records accumulate up to a given size) and an index (a set of compressed files that store records ordered by key). That is, the following scenario is assumed:


  1. Records are accumulated in memory up to a certain limit and are then distributed into sections (see the sketch after this list).
  2. Inside a section, records are written to the cache as long as the cache does not exceed the specified size.
  3. Otherwise, the contents of the index are read in full (with some reservations, discussed below), the contents of the cache are merged in, and then all of this data is sorted and split into new index files (the cache is cleared).
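A rough Go sketch of this write path, reusing the crc16 helper from the previous snippet; the names record, flush and addToSection are illustrative, not Mastore's actual API.

// record is one key/value pair read from the input.
type record struct {
	key, value string
}

// writeLoop accumulates records in memory up to maxAccumBytes, then
// distributes them into sections.
func writeLoop(in <-chan record, maxAccumBytes int) {
	var accum []record
	size := 0
	for r := range in {
		accum = append(accum, r)
		size += len(r.key) + len(r.value)
		if size >= maxAccumBytes {
			flush(accum)
			accum, size = nil, 0
		}
	}
	flush(accum) // flush whatever is left when the input ends
}

// flush groups the accumulated records by CRC16 of the key and hands each
// group to the per-section algorithm described further below.
func flush(records []record) {
	bySection := make(map[uint16][]record)
	for _, r := range records {
		sec := crc16([]byte(r.key))
		bySection[sec] = append(bySection[sec], r)
	}
	for sec, recs := range bySection {
		addToSection(sec, recs) // hypothetical per-section handler
	}
}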

Each index file is named after the key of the first record it stores (in practice, the key is URL-encoded for compatibility with file systems). Thus, when searching for records by key, the index allows reading not the entire section but only the cache and the small subset of index files that may contain the records.
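Based on this naming rule (plus the _XXXX suffix introduced in step 5 below), a lookup only has to open the files whose key range can include the requested key. A hedged Go sketch of that selection follows; the helper names and the exact suffix handling are assumptions.

import (
	"net/url"
	"sort"
	"strings"
)

// candidateIndexFiles returns the index files in one section that may contain
// records for the given key. It assumes each file is named after the
// URL-encoded key of its first record plus a "_XXXX" suffix, so the
// lexicographic order of names matches the order of their first keys.
func candidateIndexFiles(names []string, key string) []string {
	sort.Strings(names)
	want := url.QueryEscape(key)
	var out []string
	for i, name := range names {
		if firstKeyOf(name) > want {
			break // this file and all later ones start past the key
		}
		// A file covers keys from its own first key up to the first key of
		// the next file (records for the boundary key may straddle both).
		if i+1 == len(names) || firstKeyOf(names[i+1]) >= want {
			out = append(out, name)
		}
	}
	return out
}

// firstKeyOf strips the trailing "_XXXX" suffix from an index file name.
// Simplification: it assumes the encoded key contains no literal underscore.
func firstKeyOf(name string) string {
	if i := strings.LastIndex(name, "_"); i >= 0 {
		return name[:i]
	}
	return name
}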


Detailed algorithm for adding records to a section


  1. If the combined size of the added records and the existing cache file is less than the specified maximum, the records are simply appended to the cache file, and processing of the section is complete.
  2. Otherwise, the contents of the cache file and of the index files (excluding files larger than the specified limit) are read and merged with the records being added.
  3. The section's records are sorted by key.
  4. A temporary directory is created in which new index files will be written.
  5. The sorted array of records is split into parts of roughly the given size (without breaking records apart, and splitting only where the first keys of the parts differ; otherwise a file simply grows to the required size, ignoring the limit) and written into gzip-compressed index files; see the naming sketch after this list. Each such file is named <url-encoded key>_XXXX, where key is the key of the first record stored in the file and XXXX is 4 hexadecimal digits (needed to distinguish files that start with the same key while preserving lexicographic naming order). XXXX is 0000 if there is no file with the same name in the section directory, otherwise 0001, and so on.
  6. For all old files whose names collided with the new ones (and forced XXXX to be incremented), hard links are created in the temporary directory so they survive the next step.
  7. The old section directory is deleted and the temporary one is renamed into its place.
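A small Go sketch of the naming rule from step 5, under the assumption that collisions are resolved by scanning for the smallest free suffix; the helper is illustrative, not Mastore's API.

import (
	"fmt"
	"net/url"
)

// indexFileName picks a name for a new index file: the URL-encoded key of its
// first record plus the smallest unused four-hex-digit suffix. "taken" holds
// the names already present in the section directory and those already chosen
// for the temporary directory.
func indexFileName(taken map[string]bool, firstKey string) string {
	base := url.QueryEscape(firstKey)
	for n := 0; n <= 0xffff; n++ {
		name := fmt.Sprintf("%s_%04x", base, n)
		if !taken[name] {
			taken[name] = true
			return name
		}
	}
	panic("more than 65,536 index files with the same first key")
}

Writing the new files into a temporary directory and renaming it over the old one (steps 4-7) keeps the swap close to atomic: an interrupted rebuild leaves either the complete old index or the complete new one, and the hard links preserve the large files that were not rewritten.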

How to use


Mastore (from massive storage) is written in Go and builds into a single executable that runs in read, write, or self-test mode. In write mode, Mastore reads text lines from stdin, each consisting of a key and a value separated by a tab (for binary data an additional encoding such as Base85 can be used):


mastore write [-config=<config>] 

To read records by a given key, use the following command:


 mastore read [-config=<config>] -key=<key> 

And to get a list of all the keys:


 mastore read [-config=<config>] -keys 

Mastore is configured using a JSON file. Here is a default configuration example:


 { "StorePath": "$HOME/$STORE", "MaxAccumSizeMiB": 1024, "MaxCacheSizeKiB": 1024, "MaxIndexBlockSizeKiB": 8192, "MinSingularSizeKiB": 8192, "CompressionLevel": -1, "MaxGoroutines": 1 } 


Source: https://habr.com/ru/post/317874/

