In one of my past projects I was tasked with writing a system for storing billions of records. Data is accessed by key: in general, one key corresponds to a set of records (in practice, up to tens of millions) that can be added but not modified or deleted.
The SQL and NoSQL storage systems we tested proved poorly suited to this number of records, so the client suggested developing a specialized solution from scratch.
After a series of experiments, the following approach was chosen. The data in the database is divided into sections, each of which is a file or directory on disk. A section corresponds to a CRC16 hash value, so there can be at most 65,536 sections. Practice shows that modern file systems (ext4 was tested) cope quite effectively with that many entries in a single directory. Added records are thus hashed by key and distributed into the corresponding sections.
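The key-to-section mapping can be sketched roughly as follows. This is only an illustration: the article does not say which CRC16 variant Mastore uses, so CRC-16/CCITT-FALSE is assumed here, and the helper names crc16CCITT and sectionPath, as well as the four-hex-digit section naming, are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// crc16CCITT computes CRC-16/CCITT-FALSE (polynomial 0x1021, initial value
// 0xFFFF, no bit reflection). The variant is an assumption: the article
// only says "CRC16".
func crc16CCITT(data []byte) uint16 {
	crc := uint16(0xFFFF)
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// sectionPath maps a key to its section under storePath. Naming sections
// by four hex digits is a hypothetical convention, not Mastore's own.
func sectionPath(storePath, key string) string {
	return filepath.Join(storePath, fmt.Sprintf("%04x", crc16CCITT([]byte(key))))
}

func main() {
	fmt.Println(sectionPath("/tmp/store", "example.com"))
}
```

Because the hash is 16 bits wide, the number of distinct section names can never exceed 65,536, whatever the keys look like.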
Each section consists of a cache (a file in which unordered records accumulate up to a given size) and an index (a set of compressed files that store records ordered by key). That is, the following scenario is assumed:
Each index file is named after the key of the first record it stores (in practice, the key is url-encoded for compatibility with file systems). Thus, when searching for records by key, there is no need to read the entire section: only the cache and the small part of the index files in which the records may be located.
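Here is a minimal sketch of how such file names could narrow a lookup. It makes two assumptions that the article does not confirm: that names are produced with url.QueryEscape, and that each key lives in exactly one index file (that is, files are split on key boundaries). The functions indexFileName and candidateFile are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// indexFileName url-encodes the first key of an index file so that it is
// safe to use as a file name. The exact encoding is an assumption.
func indexFileName(firstKey string) string {
	return url.QueryEscape(firstKey)
}

// candidateFile returns the position of the only index file that may hold
// key, or -1 if key sorts before every file. firstKeys must contain the
// decoded first keys in ascending order.
func candidateFile(firstKeys []string, key string) int {
	// Position of the first file whose first key is strictly greater than key.
	next := sort.Search(len(firstKeys), func(i int) bool {
		return firstKeys[i] > key
	})
	// The file just before it is the last one starting at or before key.
	return next - 1
}

func main() {
	firstKeys := []string{"apple", "mango", "pear"}
	fmt.Println(indexFileName("http://a/b"))
	fmt.Println(candidateFile(firstKeys, "banana"))
}
```

The comparison is done on decoded keys rather than on file names, because url-encoding does not preserve the sort order of the original strings.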
Mastore (from massive storage) is written in Go and is built into a single executable that runs in read, write or self-test mode. In write mode, Mastore reads text lines from stdin, each consisting of a key and a value separated by a tab (binary data can be additionally encoded, for example with Base85):
mastore write [-config=<config>]
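The input format for write mode can be illustrated with a small reader. This is a sketch of the format only, not Mastore's actual code: the parseRecord helper, the 1 MiB line limit, and the decision to skip lines without a tab are all assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseRecord splits one input line at the first tab into key and value.
// It reports ok=false when the line contains no tab.
func parseRecord(line string) (key, value string, ok bool) {
	i := strings.IndexByte(line, '\t')
	if i < 0 {
		return "", "", false
	}
	return line[:i], line[i+1:], true
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	// Allow lines up to 1 MiB (the default limit is 64 KiB); the value is arbitrary.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		key, value, ok := parseRecord(sc.Text())
		if !ok {
			fmt.Fprintln(os.Stderr, "skipping line without a tab")
			continue
		}
		fmt.Printf("key=%q value=%q\n", key, value)
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Splitting at the first tab only means that a value may itself contain tabs, while a key may not.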
To read records by a given key, use the following command:
mastore read [-config=<config>] -key=<key>
And to get a list of all the keys:
mastore read [-config=<config>] -keys
Mastore is configured using a JSON file. Here is a default configuration example:
{
    "StorePath": "$HOME/$STORE",
    "MaxAccumSizeMiB": 1024,
    "MaxCacheSizeKiB": 1024,
    "MaxIndexBlockSizeKiB": 8192,
    "MinSingularSizeKiB": 8192,
    "CompressionLevel": -1,
    "MaxGoroutines": 1
}
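A configuration of this shape can be decoded with Go's standard encoding/json package. The Config struct below simply mirrors the field names from the example; the integer field types, the parseConfig helper and the defaultConfig constant are assumptions, not Mastore's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config mirrors the fields of the example configuration. Field types are
// assumed from the values shown in the article.
type Config struct {
	StorePath            string
	MaxAccumSizeMiB      int
	MaxCacheSizeKiB      int
	MaxIndexBlockSizeKiB int
	MinSingularSizeKiB   int
	CompressionLevel     int
	MaxGoroutines        int
}

// defaultConfig reproduces the example from the article.
const defaultConfig = `{
	"StorePath": "$HOME/$STORE",
	"MaxAccumSizeMiB": 1024,
	"MaxCacheSizeKiB": 1024,
	"MaxIndexBlockSizeKiB": 8192,
	"MinSingularSizeKiB": 8192,
	"CompressionLevel": -1,
	"MaxGoroutines": 1
}`

// parseConfig decodes a JSON document into a Config.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(defaultConfig))
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad config:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

No struct tags are needed here, since encoding/json matches JSON object keys to exported field names.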
Source: https://habr.com/ru/post/317874/