Storage is perhaps the most delicate part of such projects. Depending on the task, it has to provide:
- quick access to data;
- fast data update;
- sufficient functionality, with room for expansion.
In queue-driven systems with a heavy flow of requests, short processing time per individual request is the key to the performance of the whole system.
If it is important that new data become accessible quickly (news systems), then the speed of updating the database comes to the fore.
As data volumes grow, combining fast access with fast updates becomes nearly impossible.
A general-purpose DBMS is practically unusable in this situation (anyone who disagrees with this thesis is welcome to try convincing Yandex or Google otherwise). The volumes I process are much more modest (the data, deployed on a separate server, takes up about two hundred gigabytes), but they still forced me to look for other solutions.
Well-known key-value NoSQL stores do not always help either. I tried updating several hundred million records in Berkeley DB; the result did not satisfy me, especially given that the operation has to run daily.
In the end I settled on the following cocktail (the techniques described below are used in real tasks and have proven effective):
- an ordered linear list with hash keys;
- an ordered linear list indexed by a binary tree;
- Berkeley DB with MemcacheDB on top.
The first two options are used to store the main indexes; the last is used when the number of records requiring daily updates or additions does not exceed a few million and individual records need to be quickly added, deleted, or overwritten.
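Since MemcacheDB speaks the ordinary memcached protocol, working with it from code is trivial; here is a minimal sketch using the python-memcached client (the address and key names are placeholders, not anything from the real system):

```python
import memcache   # python-memcached client; MemcacheDB speaks the memcached protocol

# The address is whatever your memcachedb daemon listens on (21201 is a common choice).
db = memcache.Client(["127.0.0.1:21201"])

# Fast add / read / overwrite / delete of individual records.
db.set("record:1001", "some payload")
value = db.get("record:1001")
db.replace("record:1001", "updated payload")
db.delete("record:1001")
```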
The indices are quite large, so storing all the data in RAM is impossible.
The fastest option is, of course, hash keys: given the value, you get the result in a single access. Their only drawback is the extra storage they require. In my case the gaps in the index do not exceed 20% of the total, which is achieved because the hash is not computed from the looked-up value but is in fact the ID assigned to the record. IDs of retired records are reused over time.
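To make the idea concrete, here is a minimal sketch of what such ID-addressed storage could look like, assuming fixed-size records stored at offset ID × record size; the class, constant, and field names are placeholders, not the real code:

```python
import os

RECORD_SIZE = 64                       # assumed fixed record size, bytes
EMPTY = b"\x00" * RECORD_SIZE

class IdStore:
    """Record with ID n lives at offset n * RECORD_SIZE; one seek, no hashing."""

    def __init__(self, path):
        mode = "r+b" if os.path.exists(path) else "w+b"
        self.f = open(path, mode)
        self.next_id = os.path.getsize(path) // RECORD_SIZE
        self.free_ids = []             # IDs of retired records, reused later

    def get(self, rec_id):
        self.f.seek(rec_id * RECORD_SIZE)
        rec = self.f.read(RECORD_SIZE)
        return None if not rec or rec == EMPTY else rec.rstrip(b"\x00")

    def put(self, data):
        assert len(data) <= RECORD_SIZE
        if self.free_ids:              # reuse a slot freed by a retired record
            rec_id = self.free_ids.pop()
        else:
            rec_id = self.next_id
            self.next_id += 1
        self.f.seek(rec_id * RECORD_SIZE)
        self.f.write(data.ljust(RECORD_SIZE, b"\x00"))
        return rec_id

    def delete(self, rec_id):
        self.f.seek(rec_id * RECORD_SIZE)
        self.f.write(EMPTY)
        self.free_ids.append(rec_id)
```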
Unfortunately, such a scheme is not applicable everywhere. Sometimes, for example, the reverse operation is needed: getting an ID by value. In that situation hashing can become too resource-intensive, so in such cases I use a key organized as a binary tree. Its main disadvantage compared to a hash is that finding a single record takes log2(N) disk accesses instead of one; with more than 500 million entries in the index that means at least 28 reads, which is quite expensive. Of course, the system helps me by caching the most frequently requested blocks, and the end result is quite acceptable. But rather than rely on the system's mercy, I decided to make the next storage variant a combined one: the first half of the key is kept in memory as a hash. With a small expenditure of RAM this reduces the number of disk accesses, and since the data on disk is ordered, the hash lets us read the right key block once and then work within it. This variant has not been implemented yet, but I expect it to be nimbler than the binary tree.
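Since this combined variant is not implemented yet, the sketch below is only my rough illustration of the idea: an in-memory map from the first part of the key to the position of its block in the sorted on-disk index, so a lookup costs one block read plus an in-memory binary search. The record layout, key length, and prefix length are assumptions for illustration, not the real format:

```python
import bisect

KEY_LEN, ID_LEN = 16, 4                # assumed fixed-width key and ID
REC_LEN = KEY_LEN + ID_LEN
PREFIX_LEN = 8                         # the "first half" of the key kept in RAM

def build_prefix_map(index_file):
    """One pass over the sorted index; remember where each prefix's block starts."""
    prefix_map = {}
    with open(index_file, "rb") as f:
        pos = 0
        while True:
            rec = f.read(REC_LEN)
            if len(rec) < REC_LEN:
                break
            prefix = rec[:PREFIX_LEN]
            start, count = prefix_map.get(prefix, (pos, 0))
            prefix_map[prefix] = (start, count + 1)
            pos += REC_LEN
    return prefix_map

def lookup(index_file, prefix_map, key):
    """One disk access: read the whole block for the key's prefix, search it in RAM."""
    entry = prefix_map.get(key[:PREFIX_LEN])
    if entry is None:
        return None
    start, count = entry
    with open(index_file, "rb") as f:
        f.seek(start)
        block = f.read(count * REC_LEN)
    keys = [block[i:i + KEY_LEN] for i in range(0, len(block), REC_LEN)]
    i = bisect.bisect_left(keys, key)  # the block is sorted, so binary search in RAM
    if i < len(keys) and keys[i] == key:
        rec = block[i * REC_LEN:(i + 1) * REC_LEN]
        return int.from_bytes(rec[KEY_LEN:], "big")   # the stored ID
    return None
```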
That is how I try to provide fast data access. Now the update problem needs solving.
The daily update involves four tables of half a billion to a billion records each. On top of that, about a hundred and fifty million new records arrive as input and are sifted through these tables in the process.
Take one of the operations as an example. Given:
- dictionary [key, ID], about 700 million entries;
- source data [key, data], about 70 million records.
The keys are text strings. The task is to replace the keys in the second table with IDs, and to add any new keys from the second table to the first. Keys in the second table are not unique. In a relational DBMS this cannot be done with a single query, and on top of that there is the overhead of importing and indexing the second table (the data arrives as a text file). How long the whole process would take is hard even to estimate; runs on smaller volumes gave unsatisfactory results.
Key-value systems do not save the day here either. Although MemcacheDB posted quite tasty benchmarks, judging by the amount of data used in those experiments the results were achieved without leaving RAM, which is impossible in my case (the first table alone is 7 gigabytes of zipped data).
To solve this problem I merge two sorted streams and then split the result (the second table is merged into the first while the replacement results are written out along the way). The algorithm is quite simple, long since described by D. Knuth among others, and quite effective. I use variations of this scheme in many other processing jobs.
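A minimal sketch of such a merge, assuming both tables are already sorted by key and stored as tab-separated text; the file layout and names here are illustrative, not the real format:

```python
# dict file : "key<TAB>id"   sorted by key, keys unique
# data file : "key<TAB>data" sorted by key, keys may repeat
# Produces the data with keys replaced by IDs, plus the extended dictionary.

def merge(dict_path, data_path, data_out, dict_out, next_id):
    with open(dict_path) as d, open(data_path) as s, \
         open(data_out, "w") as out, open(dict_out, "w") as nd:
        dline = d.readline()
        last_key = last_id = None           # repeated data keys are adjacent
        for sline in s:
            if not sline.strip():
                continue
            skey, sdata = sline.rstrip("\n").split("\t", 1)
            if skey != last_key:
                # copy dictionary entries with smaller keys straight through
                while dline and dline.split("\t", 1)[0] < skey:
                    nd.write(dline)
                    dline = d.readline()
                if dline and dline.split("\t", 1)[0] == skey:
                    last_id = dline.rstrip("\n").split("\t", 1)[1]
                else:
                    last_id = str(next_id)  # unseen key: assign a fresh ID
                    next_id += 1
                    nd.write(f"{skey}\t{last_id}\n")
                last_key = skey
            out.write(f"{last_id}\t{sdata}\n")
        while dline:                        # copy the tail of the old dictionary
            nd.write(dline)
            dline = d.readline()
    return next_id
```

Both inputs are read once, sequentially, and both outputs are written sequentially, which is what makes the approach workable on tables that do not fit in RAM.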
The operation described above takes me about 50 minutes (a server with two 2.6 GHz Opteron 2218 CPUs, 16 GB of RAM, and four 73 GB SCSI disks).
So the daily update problem is solved in reasonable time. I will not claim it satisfies me completely, but squeezing out more efficiency would require more complex designs, and there is not enough time for them. I have a few sketches on paper, but they need to be checked "in the metal" first, which is why for now I only mention them.