
Selecting a random record from a table with 700 × 10^6 rows

How many of us, working as web developers at unremarkable companies, have actually run into the buzzword "Big Data" in practice? More likely you, like us, build the same websites day after day on the same CMS, often without even thinking about their performance.


However, in the life of a web developer there comes a day when a customer arrives with an interesting task. You pour yourself a coffee, shoo the cat off the keyboard, and start designing with inspiration.


This is the story of how a couple of ambitious web developers first ran into a big-data processing task.





So, what does the customer want?


There is a finite set of logins. For each new user, a random login must be drawn from this set. Each login must be unique per user and follows the pattern XX999999, where X is a letter of the English alphabet and 9 is a digit from 0 to 9.


The problem had to be solved on an existing site running on Apache (PHP 5.6) with a MySQL DBMS.


The scale of the disaster


To begin with, we had to write an algorithm for generating the logins and assess the scale of the disaster.
The generation algorithm itself is very simple.


 $alphabet = range('A', 'Z');
 $alphabetLength = count($alphabet);

 for ($i = 0; $i < $alphabetLength; $i++) {
     for ($j = 0; $j < $alphabetLength; $j++) {
         $arLogins = [];
         for ($k = 0; $k < 1000000; $k++) {
             // Zero-pad the number to six digits: AA000000 .. AA999999
             $number = str_pad(strval($k), 6, '0', STR_PAD_LEFT);
             $arLogins[] = '("' . $alphabet[$i] . $alphabet[$j] . $number . '")';
         }
         // insert 1 000 000 rows with a single query
         $strSql = "INSERT INTO logins VALUES " . implode(',', $arLogins);
         $DB->Query($strSql);
     }
 }

It turned out that there would be about 700 million logins in total (26 × 26 × 1,000,000 = 676 million). So generating them on the fly was not going to work here.


On-the-fly generation algorithm

The simplest algorithm is to generate a random login from the given pattern and then check whether a user with that login already exists. If one does, generate the next login, and so on until a free one is found.
With this amount of data the algorithm has obvious problems:


  • As the number of users grows, the number of free logins shrinks. This means the number of iterations needed to find a free login increases, and with it the server response time.
  • The number of database queries also grows constantly.
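The retry loop described above can be sketched as follows. This is our illustration, not the production code: the `$taken` array and the `findFreeLogin` helper stand in for the real users table and its SELECT check.

```php
<?php
// Naive on-the-fly generation: retry until an unused login is found.
// $taken stands in for the users table (in production each check is a SELECT).
function findFreeLogin(array $taken)
{
    $alphabet = range('A', 'Z');
    do {
        $login = $alphabet[mt_rand(0, 25)]
               . $alphabet[mt_rand(0, 25)]
               . str_pad(strval(mt_rand(0, 999999)), 6, '0', STR_PAD_LEFT);
    } while (isset($taken[$login])); // the fuller the set, the more retries
    return $login;
}
```

As the set fills up, the expected number of retries grows roughly as one over the fraction of logins still free, which is exactly the problem listed above.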

So the logins need to be stored somewhere, and a separate database table is perfect for that. Without further ado, we created a table, generated the logins into it, and made its single field the PRIMARY key. The result is this simple table.


value
AA000000
AA000001
AA000002
...

And then we remembered that the login should be random.


The first steps


Of course, the first thing we tried was the well-known ORDER BY RAND() LIMIT 1 . The result kept us waiting; you could say goodbye to the server indefinitely. Having an index turned out to be completely useless here.


 SELECT `value` FROM `logins` ORDER BY RAND() LIMIT 1; 

What to do?


It was time to ask Google the question "What to do?".


The first thing Google offers is ways to optimize ORDER BY , but those are not for us: they are fast and productive only when you have a few thousand records in the table. There were also several optimizations using JOIN and subqueries; they failed for the same reason.


All these methods are clearly intended for cases where you want to cut query time from 500 ms to 50 ms; in our case it was more about the server going down for 10 minutes while the query ran.


Still, all of this was honestly tried and Stack Overflow was studied from cover to cover, but we never lived to see a query complete, so we cannot judge the performance gain :)


The very first link suggests moving all the randomization work to the PHP side: select the minimum and maximum id, generate a random number between them, and voilà, we have a random record id.


 SELECT MAX(`id`) FROM `logins`; SELECT `value` FROM `logins` WHERE `id` = <random id>; 

A great option that should work perfectly fast. However, we could not add an integer id to every record: the table had already passed 20 GB, and server resources are not unlimited. And even if we could, the logins must be unique, which means that once a login is handed to a user it must immediately be deleted from the table. What happens when we hit a non-existent login? We are back to the variant with a huge number of iterations.


The next variant we tried was a random OFFSET with LIMIT 1 . The PHP server generates the OFFSET value and inserts it into the query. It would seem there is no sorting or randomization on the MySQL side, but OFFSET itself is not that simple: given a large offset value, MySQL first scans through all the rows before the offset and only then returns the one we need. Slightly better than sorting, but in general you can still wait a very long time for the result. And counting the rows is not a fast operation either.


 SELECT COUNT(*) FROM `logins`; SELECT `value` FROM `logins` LIMIT 1 OFFSET 123456789; 

New approach


Every method described so far would have to run on each new user registration; they slowed registration down and, to put it mildly, completely killed the user experience. It was worth thinking about an approach that would run only once and have no effect on users at all. Selecting the first row of a table without an offset or sorting works instantly, so logically, if the data is stored in the table in random order to begin with, we get a random result, just as the task requires. All that remains is to decide how to shuffle the rows.


The same two options come to mind: on the database side or on the PHP side.


"Well, now MySQL can do the random sorting, the user will not be waiting," we thought. We created an empty copy of the table and ran the query:


 INSERT INTO `new_logins` (SELECT * FROM `logins` ORDER BY RAND()); 

But it was not to be. To sort all the records in random order, MySQL loads them all into RAM first and only then sorts them, and as we found out above, server resources are limited. Besides, we did not want to bring the database down for 8 hours on a production server with a query like that.
Then let us try shuffling in PHP: set_time_limit(0) plus the console, and the job can run as long as it likes. Recall that the generation algorithm inserted 1 million records per database query.


That does not take much memory: we can randomly shuffle a million entries and insert them into the query in that order. But here we gave in to perfectionism: each million-row block would only be shuffled within itself, so the distribution across the whole table would be far from uniform. Well, let us keep looking, leaving this option for a case of complete despair :)


Mr. NoSQL


We started thinking about NoSQL. But as with any commercial project, time for implementation was catastrophically short, and time for testing even shorter. We had no practical experience with NoSQL stores, so whatever performance gain they might give, we could only guess. It would have been unpleasant to discover on deadline day that query time had dropped from 10 minutes to a still-unusable 1 minute. So this idea had to be abandoned.


If you have experience using NoSQL storage for a comparable amount of data, share it in the comments.


A light at the end of the tunnel


After long experiments and searching, a solution was found. It had become obvious that trying to get anywhere with MySQL alone was useless, and using NoSQL was not possible.


But what if we go back to randomizing the values on the PHP server side? We have a varchar(8) field with a PRIMARY index. We cannot simply generate a random login and select it from the database, because of the chance of hitting a "hole" (an already-deleted login) and looping; however, we can compare strings. Why not generate a random login and select the first one greater than it, adding LIMIT 1 ? The index is a great help in speeding up this lookup. We try it, and this time the result does not keep us waiting: in under a second we get the record we need. It only remains to handle one edge case: if the login generated on the PHP side is greater than the last one in the table, we get an empty result and select the first login in the table with one extra query.


 function generateRandomLogin()
 {
     $alphabet = range('A', 'Z');
     $firstLetter = $alphabet[mt_rand(0, count($alphabet) - 1)];
     $secondLetter = $alphabet[mt_rand(0, count($alphabet) - 1)];
     // Zero-pad the number so the login matches the XX999999 pattern
     $number = str_pad(strval(mt_rand(0, 999999)), 6, '0', STR_PAD_LEFT);
     return $firstLetter . $secondLetter . $number;
 }

 function createLogin()
 {
     global $DB;

     $randomLogin = generateRandomLogin();
     // The PRIMARY index makes this range lookup fast
     $newLogin = $DB->Query('SELECT * FROM `logins` WHERE `value` > "' . $randomLogin . '" LIMIT 1')->Fetch();
     if ($newLogin) {
         // If a login was found, delete it from the database so it stays unique
         $DB->Query('DELETE FROM `logins` WHERE `value` = "' . $newLogin['value'] . '"');
         return $newLogin['value'];
     }

     // The generated login was greater than the last one in the table; take the first
     $newLogin = $DB->Query('SELECT * FROM `logins` LIMIT 1')->Fetch();
     if (!$newLogin) {
         throw new \Exception('All logins are already used');
     }
     $DB->Query('DELETE FROM `logins` WHERE `value` = "' . $newLogin['value'] . '"');
     return $newLogin['value'];
 }

Conclusion


As always happens in such cases, the solution now seems more than obvious; to many of you it probably seemed that way from the start. But facing such a task for the first time, this was the path we had to take to reach it.



Source: https://habr.com/ru/post/324772/

