
Such a pain! The crowd vs. Web 2.0, 2:0. Episode One: High-Speed Queuing

Comrade phpdude opened an interesting topic for a wide audience with his "PHP + MySQL Optimization" post. I will tell how Web 2.0 killed two of my servers (partly through my own stupidity) and how it plans to kill more.

It all started in the fall of 2006, when I decided to launch a small site with 70 million pages. And no, they would not be generated with Markov chains; they were meant to be useful. Why 70 million? Because that is how many .com / .net / .org domains I could find at the time.

I will not give you a link to the project: I am tired of cleaning up the profanity from you, comrades of Habr. If you really want it, look at my previous topic about "earning from startups". I am going to tell the story in chronological order, so how memcached saved me from the Heise DDoS will probably have to wait for the second part.
So, I had an idea: find the most rarely used word on the Internet. To do this, you need to crawl 70 million sites (the list was honestly scraped from who.is; I tried to get access to the TLD Zone Access Program, but they would not let me in). Linguists, by the way, have probably already laughed at my idea of finding the rarest word, but back then I did not know much about words.

Now, the question: how do you organize a queue? That is, you start with aaaa.com and finish with zzzz.com, so the progress has to be stored somewhere, right? Well, MySQL, obviously! id, url, status = {'visited'; 'not visited yet'}
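For concreteness, a minimal sketch of that naive queue in PHP with mysqli; the table name, the crawl() helper, and the connection details are my assumptions, not from the original:

<?php
// Naive queue: one row per domain, a status flag marks progress.
// Assumed schema (not from the original post):
//   CREATE TABLE queue (
//     id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//     url    VARCHAR(255) NOT NULL,
//     status ENUM('not visited yet', 'taken', 'visited') DEFAULT 'not visited yet'
//   ) ENGINE=MyISAM;
$db = new mysqli('localhost', 'user', 'password', 'crawler');

$row = $db->query("SELECT id, url FROM queue
                   WHERE status = 'not visited yet' LIMIT 1")->fetch_assoc();
if ($row) {
    // No transactions on MyISAM: between the SELECT above and this UPDATE
    // another bot can grab the very same row -- the race described below.
    $db->query("UPDATE queue SET status = 'taken' WHERE id = " . (int)$row['id']);
    crawl($row['url']);   // hypothetical crawl function
}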

It was not immediately clear to me why my computer was being so slow... MySQL took 72 million records with anything but enthusiasm, transactions or not, InnoDB or MyISAM alike. Another problem was that every subsequent INSERT got slower and slower.

Okay, somehow the data got loaded. We use MyISAM, because InnoDB could not digest that volume at all (maybe I did something wrong, but that does not change the facts). Off we go: SELECT id, url WHERE status = 'not visited yet'; then UPDATE ... SET status = 'taken'.

Yeah, it's not that simple. There are no transactions, so 2, 3, or 10 bots start hammering the same address: the SELECTs run at the same time, everyone gets the same URL, everyone sets the new status at the same time (each overwriting the previous one), and everyone goes off to that address.

This is where I invented (or rather, figured out for myself; I do not claim authorship) my first MySQL trick for high load on MyISAM without transactions.

Add a rand field; each thread generates a random number and does UPDATE ... SET rand = '92803423' WHERE (status = 'not visited yet') AND (rand IS NULL) LIMIT 1, and then SELECT ... WHERE rand = '92803423'. We get a fully ATOMIC claim: exactly one thread is guaranteed to receive the record, without transactions.
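In code the trick looks roughly like this (table and column names carried over from the sketch above, i.e. assumed):

<?php
// The rand-field trick: the single UPDATE is the atomic step. MyISAM takes a
// table-level write lock for it, so exactly one thread stamps any given row.
$db = new mysqli('localhost', 'user', 'password', 'crawler');

$token = (string) mt_rand();   // each thread gets its own random marker

$db->query("UPDATE queue SET `rand` = '$token'
            WHERE status = 'not visited yet' AND `rand` IS NULL
            LIMIT 1");

// Only the thread that owns $token can ever see this row.
$res = $db->query("SELECT id, url FROM queue WHERE `rand` = '$token' LIMIT 1");
if ($row = $res->fetch_assoc()) {
    crawl($row['url']);   // hypothetical crawl function
    $db->query("UPDATE queue SET status = 'visited' WHERE id = " . (int)$row['id']);
}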

Okay, but the speed was lousy: churning through 72 million records with hundreds of threads is no joke. In short, MySQL was crawling.

(not about MySQL) Then I came up with what seemed to me a "brilliant" plan: dump everything into a plain file, one URL per line, and do this: flock the file, fseek to a random position, land somewhere in the middle of a line, read up to the '\n' character, then skip newlines until the first alphanumeric character. Great, read that line, save it, fseek back to the beginning of the line and overwrite it with '\n', release the flock. Then once an hour run grep to strip the empty lines from the file.
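A rough PHP sketch of that file trick, assuming the file path and that a taken slot is simply a run of '\n' characters:

<?php
// File-based queue: one URL per line; a taken line is overwritten with "\n"s.
$path = '/var/spool/crawler/queue.txt';   // assumed location
$fp   = fopen($path, 'r+');
flock($fp, LOCK_EX);                      // one reader/writer at a time

fseek($fp, mt_rand(0, filesize($path) - 1));
fgets($fp);                               // we landed mid-line: skip to the next "\n"

// Skip blanked-out lines until something alphanumeric turns up.
$start = ftell($fp);
while (($line = fgets($fp)) !== false && trim($line) === '') {
    $start = ftell($fp);
}

if ($line !== false) {
    $url = trim($line);                   // the URL we just claimed
    fseek($fp, $start);
    fwrite($fp, str_repeat("\n", strlen($line)));   // blank the line in place
}

flock($fp, LOCK_UN);
fclose($fp);

// Once an hour a cron job rewrites the file without the empty lines, e.g.
// something like:  grep -v '^$' queue.txt > queue.tmp && mv queue.tmp queue.txt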

It seemed to solve everything: all operations take roughly constant time, nothing grows whether there are 100 million entries or a billion, whereas MySQL's operation time kept growing with every new element. What I did not anticipate was how much I/O this would generate. In a month, 100 threads of this technique killed the hard drive in the server. We put in a new one, but I had to come up with something else.


Then I realized that my threads do not really need to go to the database for every single URL: they can take 100 URLs at once, delete them from the queue, process them, and only then come back for a new batch. Another optimization occurred to me right away: instead of id, url, status with SELECT/UPDATE, I could use a temporary table with just id, urls and SELECT/DELETE.

Actually the table looked like this:
id; urls
"one"; "Aaaa.com; aaaab.com; aaaac.com; aaaad.com ...."
"2"; "Aaa ...
"720000"; "zzzzzxxx.com; zzzzy.com; zzzzzzz.com"


This is the first time I used denormalization in MySQL, but there was no choice.

That is, the second field was TEXT, which I then split(';') in PHP. Why 100 and not 1000? Determined empirically. I had to expect that a script which had taken 100 URLs could hang on one of them, segfault, or anything else, or the server could reboot, so the losses had to be limited. Also, if I noticed a bug, I had to stop the whole system, which was done kill -9 style.
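A sketch of the batch fetch, with assumed table and helper names (the original used split(';'), whose modern equivalent is explode(';')):

<?php
// Batched queue: one row = one packet of ~100 URLs joined with ';'.
// The same single-winner guard as before (the rand token) still applies to
// the SELECT; it is omitted here to keep the sketch short.
$db = new mysqli('localhost', 'user', 'password', 'crawler');

$row = $db->query("SELECT id, urls FROM queue_packets LIMIT 1")->fetch_assoc();
if ($row) {
    // Delete the packet right away: if this worker dies mid-way, at most
    // these ~100 URLs are lost, which was deemed acceptable.
    $db->query("DELETE FROM queue_packets WHERE id = " . (int)$row['id']);

    foreach (explode(';', $row['urls']) as $url) {
        crawl(trim($url));   // hypothetical crawl function
    }
}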

100 URLs were processed in about 30 seconds on average, so the losses, if anything went wrong, would be at most 10,000 URLs (100 URLs x 100 threads), and in practice it turned out to be no more than 1,000. For a hobby project, that is fine.

Now you might ask why I cared so much that two threads, God forbid, never grabbed the same URL, yet did not consider the loss of 10k URLs a problem. Because in the first case I would be staging a light DoS attack against the remote site, which I did not want to do even with 2 threads, let alone all 100 at once.

All in all, this queue organization more than justified itself: MySQL handled everything very quickly, and a periodic OPTIMIZE TABLE helped with that.

Word frequencies



Then the next question arose: I needed to store a cache of what I had found, or rather, the list of words encountered on each main page and their counts...

... in order to analyze them and find the Holy Grail: the rarest word, one so rare that it occurred exactly once across all sites and would vanish right after I found it. It turned out that such a "word" is, for example, "yoihj", my nickname. And your nickname, reader, and the nicknames of everyone who registers anywhere. Simply put, the rarest "words" turned out not to be words at all, but nicknames, typos, and so on.


... so, store the word counts. I went the obvious way:
url (varchar); word (varchar); count (int)

It's a pity my school math teacher wasn't around: she would have quickly knocked into my head that 72 million sites x 1,000 words each is 72 billion records, while I had just been screaming that MySQL barely copes with 72 million and slows down constantly.

So, the idea is the same: I began storing url; serialize($words) in the database. Later serialize($words) was replaced by json_encode($words), because the bots had by then been rewritten in Python, and although Python had support for PHP's serialize format, it was slow, so JSON took its place.
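A sketch of that cache, assuming a word_cache table with a url column and a TEXT column for the encoded counts:

<?php
// One JSON blob of word counts per site: 72 million rows instead of 72 billion.
$db = new mysqli('localhost', 'user', 'password', 'crawler');

$words = ['hello' => 12, 'world' => 7, 'yoihj' => 1];    // counts from one front page
$url   = 'aaaa.com';

$db->query(sprintf(
    "REPLACE INTO word_cache (url, words) VALUES ('%s', '%s')",
    $db->real_escape_string($url),
    $db->real_escape_string(json_encode($words))          // formerly serialize($words)
));

// Reading it back (works the same from the Python bots via json.loads):
$res    = $db->query("SELECT words FROM word_cache WHERE url = '" .
                     $db->real_escape_string($url) . "'");
$counts = json_decode($res->fetch_assoc()['words'], true);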

Actually, this is still how it is stored today; it has survived the traffic raids of TechCrunch, Heise.de, and many others.

Of course, this is not everything that had to be done: later came my own take on SphinxSearch and memcached used as a database, which worked quite well, by the way. :) But more about that another time.

UPDATE
I forgot to mention the final version (the one running now).
All sites are in the database: 72 million records, with no index except PRIMARY KEY = url (varchar),
and the query is SELECT ... WHERE url > 'kite.com' LIMIT 100, where kite.com is the last crawled site;
the time of this query is practically constant across the entire table,
no UPDATE is done at all.
We process each hundred in threads and fetch more when the number of live threads drops toward 20. Suppose this SELECT returns 'klara.com' as its last row: we remember 'klara.com', and the next fetch is SELECT ... WHERE url > 'klara.com' LIMIT 100.
Under PHP, threads are processes, and the process count is taken via popen('ps aux | grep processname', 'r').
By now the bots have been rewritten in Python.
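A sketch of that dispatcher loop (kept in PHP here for consistency, though as noted the bots themselves are now in Python; the sites table, the crawler_bot process name, and the launch step are assumptions):

<?php
// Final scheme: no status column, no UPDATEs. PRIMARY KEY (url) is the queue;
// we keyset-paginate past the last URL that was handed out.
$db   = new mysqli('localhost', 'user', 'password', 'crawler');
$last = 'kite.com';                        // last site handed out so far

while (true) {
    // Threads are processes: count the live ones via ps.
    $ps    = popen('ps aux | grep crawler_bot | grep -v grep', 'r');
    $alive = substr_count(stream_get_contents($ps), "\n");
    pclose($ps);

    if ($alive < 20) {
        // The B-tree seek on the primary key makes this effectively constant-time
        // no matter how far into the 72 million rows we are.
        $res = $db->query("SELECT url FROM sites WHERE url > '" .
                          $db->real_escape_string($last) .
                          "' ORDER BY url LIMIT 100");
        while ($row = $res->fetch_assoc()) {
            $last = $row['url'];
            // launch a worker process for $row['url'] here (details omitted)
        }
    }
    sleep(5);
}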


Yoi Haji
view from Habra

Source: https://habr.com/ru/post/39945/

