Prehistory
There is a project in which we have to work with a large amount of data. In particular, there is one denormalized table that stores all the current offers from existing customers, as well as outdated offers marked is_deleted = 1 and awaiting deletion.
The number of entries in this table ranged, until recently, from 30 to 50 million. An ordinary OPTIMIZE did not always cope even under those conditions. So the founding father (Yevgeny Vasilyevich) came up with rebuilding the table as follows: all live rows (is_deleted = 0) were copied into a table with an identical structure whose name carried a date-and-time prefix, and once the copy was finished all that remained was to drop the original table and rename the new one to the original name.
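A minimal sketch of that rebuild, assuming a table called offers and a made-up date-and-time prefix (the real names in the project were different):

    CREATE TABLE `20120501_0300_offers` LIKE offers;    -- identical structure
    INSERT INTO `20120501_0300_offers`
        SELECT * FROM offers WHERE is_deleted = 0;      -- copy only the live rows
    DROP TABLE offers;                                   -- drop the bloated original
    RENAME TABLE `20120501_0300_offers` TO offers;       -- the fresh copy takes its place

(A single RENAME TABLE swap would avoid the brief window with no offers table at all, but the sequence above matches the scheme as described.)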
This approach worked reliably until it was necessary to increase the speed of searching for offers. And here begins our little story.
Wind of change
To increase the speed of the search it was decided, oddly enough, to use a search engine. We chose Solr. Why? Because it suits our purposes well. And not only our purposes. If I find the time, I will definitely write a separate article about this search engine.
Everything was fine until, after debugging on the development servers, we rolled out a new version of the site to production. The search engine worked, the client price parsers were launched and ran well enough under the new scheme, apart from some rough edges. But the script that rebuilds the table of all offers began to fail every night.
The problem was that Solr lived on the same server as MySQL. For searching offers on the site this was fine. But Solr, clever beast though it is, needs a lot of resources to do its job, and any comparable solution would split the server's resources roughly in half between MySQL and the search engine. As a result, the offer-rebuild script died with an error about running out of space in tmpfs.
Search for a solution
Option zero, the unrealistic one: dedicate a separate server to the search engine. The trouble is that the service is not yet well-known and visited enough to justify buying another far-from-cheap server.
The first option consisted of trying to somehow optimize the existing script. We had no success with it and abandoned it almost immediately.
The second option was to use HandlerSocket. HS is fast, reliable and, not least, fashionable. However, it turned out that HS is not suited to reading a huge array of data. HS does an excellent job of quickly fetching individual, randomly located records, but when a large data set is read sequentially in chunks via limit and offset, every next step gets slower. And that is not even the biggest problem with HS here: we had to filter by the condition is_deleted = 0, and that field was not indexed; indexing it just for this would have been pointless anyway. So the brave HandlerSocket, which had already proven itself on other tasks, did not live up to expectations this time.
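To make the limit/offset problem concrete, here is what chunked reading looks like when it is driven by an offset. Plain SQL is shown for brevity (HandlerSocket speaks its own protocol, and the table name offers is an assumption), but the cost pattern is the same: each chunk forces the server to walk past and throw away all previously read rows again, so every step is slower than the one before.

    -- Offset-driven chunking: chunk N has to skip over the N * 10000 rows before it
    SELECT * FROM offers WHERE is_deleted = 0 LIMIT 10000 OFFSET 0;
    SELECT * FROM offers WHERE is_deleted = 0 LIMIT 10000 OFFSET 10000;
    SELECT * FROM offers WHERE is_deleted = 0 LIMIT 10000 OFFSET 20000;
    -- ...and so on: tens of millions of rows in, almost all of the work is wasted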
Fortunately, there was a third option, one I personally had never used before: MySQL's native HANDLER. What does it allow and why is it good? It allows records matching a certain condition (even on a non-indexed field) to be read sequentially without the loss of speed that usually comes from evaluating limit, offset or between. All you need to do is open the handler, read the first chunk of data matching the condition (READ FIRST) and then, without changing the condition, keep issuing READ NEXT while any data is still returned. The sequence of actions brings to mind the C-style approach to, say, scanning a directory. And the most pleasant part is that the cursor stays exactly where we left it the last time.
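A minimal sketch of that sequence, assuming a table called offers and a chunk size of 10,000 (the statements themselves are standard MySQL HANDLER syntax):

    -- Open a low-level handler on the table
    HANDLER offers OPEN;

    -- Read the first chunk of rows matching the condition, in natural row order
    HANDLER offers READ FIRST WHERE is_deleted = 0 LIMIT 10000;

    -- Continue from where the previous call stopped, with the same condition;
    -- repeat until a call returns no rows
    HANDLER offers READ NEXT WHERE is_deleted = 0 LIMIT 10000;
    HANDLER offers READ NEXT WHERE is_deleted = 0 LIMIT 10000;

    -- Release the handler when done
    HANDLER offers CLOSE;

In application code this becomes a simple loop on the same connection: issue READ NEXT until the result set comes back empty, then CLOSE.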
As a result we get consistently high read speed and reasonable memory usage when reading data across the whole table, even when it holds 270 million records. That is exactly how many there were by the time we found this solution. Is that a lot or a little? It is all relative. But if the service starts to stumble because of that volume, then it is a lot.
Given our limited resources, this solution turned out to be the best in terms of speed and reliability. Put another way: if you managed to read the first 10,000 records in one iteration, you will read all the subsequent records at the same pace, no matter how many there are.
PS
I think this solution will definitely be useful to someone.
Acknowledgments
I want to thank Yuri Maslennikov, who in fact suggested the idea of using HANDLER.