
A little about database design for a search engine

Such a project is impossible without a database; in fact, without several radically different ones. Therefore, I will devote some time to this topic.

So, at a minimum, you will need a database that serves ordinary "flat" ("2D") data: some ID is mapped to a single data field.
Why do I consider only one data field? Because:

If you do not set yourself the goal of minimizing the number of lines of data-handling code and gaining a little convenience, then almost any task can be reduced to one where a single ID and a single data field are sufficient. And given such high requirements for optimality and speed, in my opinion this is fully justified.

The main operations on such database tables are:

- read a record by its ID;
- update a record in place;
- append new records to the end of the table.

I found it optimal to use "page-type" tables, where data is stored, written, and read from disk page by page. Each page holds a fixed number of records. Naturally, if I know the fixed record size in advance, the table works even faster; however, even with variable-size records, nothing changes significantly in processing. Updates and appends are made within a page in memory, and then the page is written to disk. In the table file, the pages are stored sequentially.
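As an illustration, here is a minimal sketch of such a page-type table, assuming a fixed record size; all names and constants here are mine, not the author's:

```python
# Sketch of a page-type table: records live in fixed-size pages; a page is
# read from disk, modified in memory, and written back as a whole.
import os

RECORDS_PER_PAGE = 1024
RECORD_SIZE = 64                      # fixed record size keeps offsets trivial
PAGE_SIZE = RECORDS_PER_PAGE * RECORD_SIZE

class PagedTable:
    def __init__(self, path):
        self.path = path
        open(path, "ab").close()      # make sure the table file exists

    def _read_page(self, page_no):
        with open(self.path, "rb") as f:
            f.seek(page_no * PAGE_SIZE)
            data = bytearray(f.read(PAGE_SIZE))
        data.extend(b"\x00" * (PAGE_SIZE - len(data)))  # pad a short tail page
        return data

    def _write_page(self, page_no, page):
        with open(self.path, "r+b") as f:
            f.seek(page_no * PAGE_SIZE)
            f.write(page)

    def get(self, record_id):
        page_no, slot = divmod(record_id, RECORDS_PER_PAGE)
        page = self._read_page(page_no)
        return bytes(page[slot * RECORD_SIZE:(slot + 1) * RECORD_SIZE])

    def put(self, record_id, value):
        assert len(value) <= RECORD_SIZE
        page_no, slot = divmod(record_id, RECORDS_PER_PAGE)
        page = self._read_page(page_no)           # modify the page in memory,
        page[slot * RECORD_SIZE:slot * RECORD_SIZE + len(value)] = value
        self._write_page(page_no, page)           # then write it back whole
```

With a fixed record size, finding any record costs one division and one page read, which is the point of the layout.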
The question arises of how to update records in the middle of the table when their size changes: if the entire table is 10-20-200 GB, copying half of it into a temporary file and then copying it back would take hours. I delegated this problem to the file system by splitting the pages into blocks. One block is one file on disk; the number of files per table is unlimited, and each file stores a fixed, limited number of consecutive pages. Then, if I need to update a record in the middle of the table, I only have to rewrite one file of a much smaller and usually bounded size. The responsibility for not asking the file system to do stupid things (like writing first to the beginning of a file, then to the end, then to the beginning again) I took upon myself. To avoid straining the server, I always write in batches; the corresponding functionality is optimized as much as possible, and everything happens in memory. And, of course, the whole system of search engine modules is built on the assumption that writing 1000 records at the end is faster than 1 at the beginning; so for changes near the beginning, it is sometimes easier to rebuild a copy of the table.
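The page-to-file arithmetic this relies on is straightforward; a sketch, with a hypothetical file-naming scheme and limits:

```python
# Hypothetical helper: with pages split into fixed-size block files
# ("table.000", "table.001", ...), locating a page is pure arithmetic,
# so updating a record in the middle touches only one small file.
PAGES_PER_FILE = 256      # assumed limit of pages per block file
PAGE_SIZE = 65536         # assumed page size in bytes

def locate_page(page_no, pages_per_file=PAGES_PER_FILE, page_size=PAGE_SIZE):
    file_no, page_in_file = divmod(page_no, pages_per_file)
    # return which block file to open and the byte offset of the page in it
    return ("table.%03d" % file_no, page_in_file * page_size)
```

Rewriting a record in the middle of a 200 GB table then means rewriting one block file of at most `PAGES_PER_FILE * PAGE_SIZE` bytes.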

OK, ordinary tables are settled. The database described works very well: in particular, during search it processes 35 GB of text fragments with arbitrary access.

But there are limitations. Suppose we want to keep a mapping in such a table where each word is associated with the list of documents in which it occurs (along with additional information). There will be many documents for each word, and accordingly the volume will be huge.

So, what operations must be performed on such a database:

- sequentially read the list of documents for a given word;
- add information about newly indexed documents to these lists.

How do we build such an index? Obviously, if the index is empty and we begin inserting the document lists into it, starting with the first word and ending with the last, we will only ever write to the end of the file. Whether to physically store the block for each word as a separate file on disk is up to the developer; in either case, it is enough simply to remember where each block ends and its length, and save that in the simplest list. Then the procedure for sequential reading is this: seek in the file to the beginning of the list for the desired word, and read sequentially until the list for the next word begins. One seek and the minimum number of reads: a win. (Here I deliberately do not consider the behavior of the file system itself; its optimization can be done separately.)
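A minimal sketch of this scheme, assuming a toy on-disk encoding (comma-separated document IDs; the real format is of course binary):

```python
# Write posting lists for all words sequentially, remembering for each word
# the (offset, length) of its block; then a lookup is one seek + one read.

def write_index(path, postings):
    """postings: {word: [doc_id, ...]} -> {word: (offset, length)}"""
    directory = {}
    with open(path, "wb") as f:
        for word in sorted(postings):                    # first word to last
            block = b",".join(str(d).encode() for d in postings[word])
            directory[word] = (f.tell(), len(block))     # remember the block
            f.write(block)                               # append-only write
    return directory

def read_postings(path, directory, word):
    offset, length = directory[word]
    with open(path, "rb") as f:
        f.seek(offset)              # one seek...
        block = f.read(length)      # ...and one sequential read
    return [int(x) for x in block.split(b",")]
```

The directory of offsets is exactly the "simplest list" mentioned above; it is small enough to keep in memory.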

Now, obviously, when we want to add information about newly indexed pages to the index, we can save the new information separately, open the current index for reading and a new one for writing, and then read and write sequentially, merging in the information to be added, starting with the first word and ending with the last. The question of the order in which to put documents into the list I will consider a little later, when I talk about building the index.
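This update pass can be sketched as a merge of two word-sorted streams; the names and data layout here are assumptions, not the author's code:

```python
# Merge the old index with newly collected postings in one sequential pass.
# Both inputs are streams of (word, sorted_doc_list) pairs, sorted by word.
import heapq

def merge_indexes(old_items, new_items):
    old_items, new_items = iter(old_items), iter(new_items)
    a = next(old_items, None)
    b = next(new_items, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            yield a                                   # word only in old index
            a = next(old_items, None)
        elif a is None or b[0] < a[0]:
            yield b                                   # word only in new batch
            b = next(new_items, None)
        else:                                         # same word: combine lists
            yield (a[0], list(heapq.merge(a[1], b[1])))
            a = next(old_items, None)
            b = next(new_items, None)
```

Since both sides are consumed front to back and the output is written front to back, the whole update is sequential reads plus sequential writes, which is the access pattern the disk layout was designed for.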

The full content and list of my articles on the search engine will be updated here: http://habrahabr.ru/blogs/search_engines/123671/

Source: https://habr.com/ru/post/123884/

