
Surfingbird, PostgreSQL, and why sites go down

I promised one user that I would write this post back on February 8, and promises must be kept.

What made me give that promise was, of course, not just the desire to explain why surfing (the process of getting recommendations) on our site was returning 500s that evening, but some more general considerations.

Namely: the user insistently advised us to add capacity, because otherwise the site was unusable.
We have enough capacity. The user's unshakable self-confidence saddened me, and that is exactly why I decided to write about why sites actually go down.
Disclaimer: yes, sites can go down for banal reasons like power outages, physical server failures, problems in the data center, bad code, or administrative mistakes. I want to talk about slightly more subtle causes that even programmers may not know or think about if they have never had to build web projects.



So. As usual, if something went down somewhere, it's because something ran out somewhere.

One of the first candidates is a shortage of the processes that handle requests. In our case these are FastCGI workers, but that's not the point.
Moreover, the problem is most likely not that there are simply too few of them; it's that they suddenly start taking unacceptably long and don't free up in time.
Why does this happen? A site under any real load is a highly optimized system. It is built on the assumption that most of the data sits in caches, that database queries execute quickly, and that a bug in the JS doesn't DDoS you :-)
That is exactly the key: when any of this breaks, the dive is fast. And such things can be hard to catch in testing.

For example, on a test server you can easily fail to notice that AJAX requests are now being sent one and a half times as often: the load there is roughly zero (there are five of you and the database fits in a gigabyte), so everything works.
It is even easier to miss that, after a code change, you now occasionally (maybe once in tens or hundreds of thousands of requests) generate a database query that is not covered by an index.
Or you made a small mistake while writing a new function and its result is never taken from the cache; in testing it worked fine anyway. Or just one memcached node in the cluster came up crooked after a careless package update.
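
A hedged sketch of how one might catch an uncovered query before it bites (the table and condition here are made up): run EXPLAIN on anything new that touches the database and look at the plan.

-- EXPLAIN ANALYZE shows whether a query uses an index
-- or falls back to reading the whole table.
EXPLAIN ANALYZE
SELECT * FROM users WHERE lower(email) = 'foo@example.com';

-- With no index on lower(email) the plan starts with something like
--   Seq Scan on users  (cost=0.00..12345.67 rows=1 width=...)
-- instead of an Index Scan. Harmless on a gigabyte test base,
-- deadly under production load.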

What exactly happens in such cases? When the database is not involved, it's simple: there are more requests, or they take longer, the worker processes run out, and the frontend starts handing 500s to users.
With the database and a rare bad query it gets more interesting. At first, right after the deploy, everything is fine. Then one heavy query shows up in the database. You probably still won't notice anything. The process that sent it hangs. Such queries start appearing faster than they finish: now there are two of them, three, five, and soon half the server's cores are busy with them and the disk is at one hundred percent utilization.
Congratulations, you are most likely already down. The database has become too slow to answer the ordinary, light queries, and any process that goes to it for whatever reason effectively hangs. Sooner or later every process ends up stuck there.
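
A hedged illustration of how one might spot such a pile-up while it is happening; this is plain PostgreSQL rather than our actual tooling, and the column names are from 9.2+ (older versions use procpid and current_query):

-- Queries that have been running for more than five seconds, longest first,
-- so the heavy offenders are visible at a glance.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY runtime DESC;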

Even if you have configured timeouts correctly in the application (and you should), some part of your service has still fallen off, and the rest has become much slower, because periodically stalling for the full timeout period is not great either.
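
One hedged way to enforce a cap on the database side as well, in PostgreSQL (the role name is made up):

-- Cancel any single statement from the application role after two seconds
-- instead of letting it pile up behind the others.
ALTER ROLE app_user SET statement_timeout = '2s';

-- Or per session:
SET statement_timeout = '2s';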

You cannot defeat this kind of process shortage by simply adding more processes. First, the new ones will stall just the same, only forming an even bigger crowd; second, they will run into memory limits.
Adding database capacity is not a solution either: a bad query can demand orders of magnitude more resources.

In general, the only cure is to treat it intelligently. And to be prepared for the fact that the more optimized the system, the more fragile it is: the higher it flies, the harder it falls, so to speak :-)

What other unusual causes can lead to a shortage of processes?
Memory leaks, and therefore a slide into swap, at which point everything starts to crawl.
You can end up in swap because of a cron job, too. A treacherous script may need more memory as the database grows, perhaps even quadratically, and the victim machine will feel very bad for the duration. It's especially fun when the OOM killer shows up and kills, say, your database process.
Leaks of open file descriptors (or sockets), and everything dies at once.
An infinite loop that only occurs rarely is a special treat.
Or just a very long loop whose possibility the programmer missed and which is not forcibly cut off by a counter: same result.
That's as far as my imagination goes offhand :-)

What else happens besides a shortage of processes?

You can run out of database connections. The number of servers with frontends and handler processes keeps growing (scaling them out is easy), your handler pools use persistent connections, and over their lifetime they quickly acquire a connection to every database. Oops.
You have hit the limit. Which, by the way, is not sitting in your config just for decoration.
If you haven't prepared in advance for a shortage of connections, it will be unpleasant: implementing connection pooling on the fly is not that easy.
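
A hedged sketch of how one might keep an eye on this in PostgreSQL before it becomes an emergency:

-- The configured ceiling.
SHOW max_connections;

-- How close we are to it, broken down by database and user.
SELECT datname, usename, count(*) AS conns
FROM pg_stat_activity
GROUP BY datname, usename
ORDER BY conns DESC;

When the ceiling does get close, the usual way out is an external pooler such as PgBouncer in front of the database rather than raising max_connections indefinitely.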

The disk may simply run out of space, and processes get stuck trying to write something to the log :-) Which, again, is really about processes; a common cause, yes.

It also happens that the request handlers are up and running, but for some reason requests just don't reach them. That's how an intermittent, elusive bug suddenly shows up in some unexpected place in your technology stack.

Even with smooth growth of the load on the service, a funny thing can happen: your frontend (say, nginx) is configured so that after receiving a certain number Y of timeouts from one backend within a time interval X, it stops sending requests to that backend for a time Z.
At some point you or your admin tuned these parameters and the nginx timeouts so that everything just barely fit, and now a significant share of requests slightly overruns the timeout, and nginx disables one backend after another.
Requests from the disabled backends are redirected to the ones still alive, so the snowball will almost certainly take out every server.



All sorts of things happen, in general. I don't think I can even remember all the causes of our own outages, let alone compile any complete list.
The important point is that raw server capacity is rarely the real problem.
Other great causes I invite you to share in the comments :-)



Now a little about PostgreSQL, as the title hints, and about February 8 itself.

That day we launched a new feature, Surfingbird TV. The idea is simple: the same recommendations of things you might find interesting, but only among videos.
A kind of lazy man's TV :-)

The list of categories of each link is stored as an intarray, and that field, of course, has a GiST index on it. The query accordingly contains a clause along the lines of AND site_cats && user_cat, that is, it requires an intersection with at least one interest. user_cat is one of the user's interests chosen at random. We deliberately don't intersect with all interests at once, so that the output is spread evenly across interests instead of three quarters of it coming from the three most popular interests in the database and the rest getting whatever is left.
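
A hedged sketch of roughly what such a setup looks like; the table and column names are illustrative, not our actual schema:

-- The intarray extension provides && (overlap) and @> (contains)
-- together with GiST operator classes for integer arrays.
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE TABLE links (
    id        serial PRIMARY KEY,
    url       text NOT NULL,
    site_cats int[] NOT NULL
);

CREATE INDEX links_site_cats_gist ON links USING gist (site_cats gist__int_ops);

-- "Links that share at least one category with interest 42."
SELECT id, url FROM links WHERE site_cats && '{42}'::int[];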

All videos had long had an invisible "secret" category set on them automatically, so adapting the query for Surfingbird TV was easy: we just added a "filter" to it, AND site_cats @> filter_cats. That is, the link's categories must include the categories from the filter.
To avoid sprinkling conditionals through the code, the default filter was an empty array.
We have no links without categories, so the default condition was trivially true.
But! For some reason such a query fell off the index and did a Seq Scan over the table (that is, it read the whole table row by row and filtered it "by hand" instead of using the index).
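
A hedged illustration of how the difference shows up in EXPLAIN, using the illustrative table above; the exact plans of course depend on the data and the PostgreSQL version:

-- Uses the GiST index:
EXPLAIN SELECT id FROM links WHERE site_cats && '{42}'::int[];
--   ... Bitmap Index Scan on links_site_cats_gist ...

-- Adding the contains-empty-array filter can push the planner to a full scan:
EXPLAIN SELECT id FROM links
WHERE site_cats && '{42}'::int[] AND site_cats @> '{}'::int[];
--   Seq Scan on links ...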

So there we lay. We rolled back the code, killed the bad queries in the database by hand, wrote a bunch of conditionals, and deployed again.

Everything seems fine. A dozen minutes pass and we are down again. We look at the database: roughly the same picture. What on earth, ..., we wonder, upset.
It turns out:
Plain queries of the form ... site_cats && '{42}'::int[] work fine.
Queries ... site_cats && '{42}'::int[] AND site_cats @> '{}'::int[] go to Seq Scan, as we had already found out a bit earlier.
Queries ... site_cats && '{42}'::int[] AND site_cats @> '{255}'::int[] work fine.
Queries ... site_cats && '{255}'::int[] AND site_cats @> '{255}'::int[] go to Seq Scan again! That is already quite astonishing.
And yes, ... site_cats && '{255}'::int[] works fine.

Well, at first we put in a crutch, then wrote out the conditionals properly, covered it all with sweary comments, and didn't go to bed again. For that reason :-)

By the way, only the surf went down, because the surf, serving our button, and the rest of the site run in separate process pools. Given the very different nature of their load, that helps us a lot.

P.S. Somewhere around the second third of this post (I was writing at about three in the morning) the preview stopped working, and then Habr in general, returning 500s. I wonder what the reason is :-)

Source: https://habr.com/ru/post/171793/

