List of useful ideas for high-load services

In this article I have collected a hodgepodge of tips on developing high-load services, all learned in practice. For each tip I will try to give a short justification without going into detail (otherwise the article would rival War and Peace in size). Since the justifications are brief, do not treat this article as dogma: in any particular case, the advice given here may be harmful. Always think for yourself before doing something.

1. Think with your head and check the facts.

This is the most important point. There should be no absolute authority for you. If someone says complete nonsense, or something that contradicts your own experience, do not follow that advice, no matter how well known and respected the person is. If you are building a large system and it works poorly, you are the one who will be held responsible, and “we followed the best international practices” is not an excuse. What makes you a valuable specialist is the ability to apply the right technology in the right place, not blindly following someone else's advice, which requires no qualification at all.

2. Simple is better than complicated and “correct”

The systems you develop should be understandable first of all to you and your colleagues, not to a “spherical horse in a vacuum”. For example, if the off-the-shelf system that your whole team wants to migrate to is not transparent and understandable to everyone involved, you can hardly expect people to use it effectively. As a rule, all sorts of problems surface under load even in the best and most efficient systems, and the time available for troubleshooting is usually short, because a large number of users are suffering at that moment. Without a good understanding of how everything works, you have little chance of solving the problem quickly. Recall, for example, how Odnoklassniki was down for several days; in a well-designed system, even serious failures are fixed in a few hours, not days.

3. Do not forget about monitoring.

Usually nobody forgets that changes need to be tested before they are deployed. But people think much less often about what happens after deployment and how the system behaves in production. In fact, monitoring is a logical continuation of QA for code. Your infrastructure should have quality assurance too: you should see problems coming in advance, and when they do occur, you should be able to diagnose the causes quickly. Without monitoring this is simply impossible.

4. Write postmortems (each breakdown has a cost)

When something important goes down, write a postmortem that describes:

what happened
why it happened
how many users were affected (and, if it can be counted, how much money the company lost)
what needs to change systematically so that this does not happen again (options like “be more careful” do not work in practice)

Often, while preparing to write the postmortem, you may conclude that nothing terrible actually happened: eliminating the causes of the incident would cost the company more than leaving everything as it is. Do not forget about this option. Sometimes things will break, and that is normal as long as it stays within acceptable limits (which are different for each company).

5. Performance is a feature

Although there is no formal definition of a “high-load project”, more or less everyone understands what is meant. One characteristic of applications under heavy load is that it is often much easier to optimize a piece of code than to spread the logic across multiple servers. So when developing high-load systems you often have to do optimization, and it is useful to know how. If you optimize not only throughput but also response time, you improve the user experience at the same time. Win-win.

6. Two-phase commit is difficult but inevitable

Unless your system fits entirely on one server, you will inevitably run into the need for a “two-phase commit”: atomically updating data in several places, for example in a database and in a separate service. In practice, a two-phase commit is a myth: a truly atomic update across more than one server is impossible. There are different ways to achieve consistency in such cases, but the most common is a queue: in a single transaction you write the data and also add a row to a separate table that serves as the queue of updates for the service. Since this is a transaction on one server, it either commits entirely or not at all. So even if the update to the service fails at first, the data will eventually become consistent.
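The queue approach described above can be sketched roughly as follows. This is a minimal illustration using SQLite from the Python standard library, not a production design; the table names (users, outbox), the functions add_user and drain_outbox, and the send callback that stands in for the external service are all hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical schema: business data plus a queue table in the same database.
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0
        );
    """)


def add_user(conn, user_id, name):
    # One local transaction: both rows are committed, or neither is.
    with conn:
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"user_id": user_id, "name": name}),),
        )


def drain_outbox(conn, send):
    """Deliver pending queue rows in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        try:
            send(json.loads(payload))  # call to the external service
        except Exception:
            break  # leave the row pending; a later run will retry it
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

Note that if the process dies after send succeeds but before the row is marked as processed, the same message is delivered again on the next run, so the receiving service has to tolerate duplicates.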

7. Paginated navigation is challenging.

“What is so difficult about that? You run an ordinary SELECT with LIMIT / OFFSET and that's it, you have pagination!”, you must have thought. You are right, but this approach works well only if the data does not change (otherwise some records will appear twice and others will be skipped). In addition, large OFFSET values usually lead to a serious loss of performance, because the database literally has to select LIMIT + OFFSET rows, discard the first OFFSET of them, and return the remaining LIMIT. The cost is linear in the OFFSET value, so for large values this construction usually slows down significantly.

Depending on the task, the solution may differ, but it is almost never simple and unambiguous. For example, you can use minId instead of the page number: most of the entries are then skipped using the index. Or, if the pages are search results, you need to be able to ignore new changes or to store a snapshot of the data for the corresponding request somewhere.
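As an illustration of the minId idea, here is a small sketch using SQLite from the Python standard library. The posts table and the fetch_page function are made up for the example; the point is only that the next page is requested by the last seen id rather than by an offset.

```python
import sqlite3


def fetch_page(conn, min_id=0, limit=3):
    """Return up to `limit` rows with id > min_id, plus the min_id for the next page."""
    rows = conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (min_id, limit),
    ).fetchall()
    next_min_id = rows[-1][0] if rows else None
    return rows, next_min_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 8)],
)
conn.commit()

page1, next_min_id = fetch_page(conn)  # ids 1, 2, 3

# A row from the first page is deleted between the two requests.
conn.execute("DELETE FROM posts WHERE id = 2")
conn.commit()

page2, _ = fetch_page(conn, min_id=next_min_id)  # ids 4, 5, 6: nothing is skipped

# The same second page via OFFSET now starts at id 5, so id 4 is never shown.
offset_page = conn.execute(
    "SELECT id, title FROM posts ORDER BY id LIMIT 3 OFFSET 3"
).fetchall()
```

Because id is the primary key here, the WHERE id > ? condition is resolved through the index, so the cost of a page does not grow with how deep into the list the user has scrolled.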

8. (Re)sharding is a challenge

In general, there is no way to scale a database out so that you get a near-linear performance boost. You need to choose a sharding strategy that suits your project, and choose it very carefully, because resharding, and even more so changing the sharding scheme, is one of the most difficult tasks when storing large amounts of data, and in the general case it is practically unsolved.
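To get a feel for why resharding hurts, consider the simplest possible scheme, hash(key) modulo the number of shards. This scheme is only an assumption for the sake of the example (the article does not prescribe any particular one), and shard_for and moved_fraction are hypothetical helper names. The sketch counts how many keys end up on a different shard when one shard is added.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard using a stable hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With a uniform hash, a key stays in place only when its hash gives the same remainder modulo 4 and modulo 5, which happens for 4 out of every 20 values, so about 80% of the data is expected to move just to add a single shard. Schemes such as consistent hashing or a lookup table of virtual shards reduce this, at the cost of extra complexity.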

If you think that projects like CockroachDB are a silver bullet that will solve all your problems, you are mistaken. If you want decent performance from the system, you still need to understand, at least in general terms, how sharding works inside the database, so that you can minimize communication between nodes (intensive communication between nodes is the main reason why performance rarely grows linearly when new nodes are added).

9. Good code differs from bad code in how errors are handled.

Contrary to popular belief, good code is not defined only by how neatly everything is organized into packages, whether the code is readable, whether it follows every conceivable and inconceivable style guide, and so on. All of that is important, but the users of the system (meaning not only the end users of the site, but also system administrators, other programmers, etc.) care first of all that everything works as expected, and that when something does not work, it is clear what is happening.

Surprisingly, a lot of software does not handle errors at all, or handles them in the style of “oops, something went wrong”, or even starts to DDoS itself with endless retries. Handling errors is difficult, and responding to them adequately is even more so. Errors vary widely, and not all of them require, for example, terminating the program. For example, if you are developing a file system and it stops working altogether (including not letting you delete data) when the disk is 100% full, then it is a bad file system. You may have followed all the “best practices” from famous people and have a million stars on GitHub, and it still means nothing. Roughly speaking, shit happens, and even the coolest people sometimes run out of space on their servers unexpectedly, and you should be able to handle that situation.
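One common way to avoid the "endless retries" problem is to limit the number of attempts and to wait longer, with some randomness, before each new attempt. The sketch below is only one possible shape of this; RetryableError and call_with_retries are hypothetical names, and deciding which errors are actually worth retrying depends on your system.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); retry only RetryableError, a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and surface the error instead of retrying forever
            # Exponential backoff capped at max_delay, with random jitter so that
            # many clients do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Any other exception (a bug, invalid input, and so on) propagates immediately, because retrying it would only add load without any chance of success. The sleep parameter exists so that the function can be tested without real delays.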

That's all

Thank you for reading this “article”. I would be interested to read your opinion or your own advice based on practical experience; write about it in the comments!

Source: https://habr.com/ru/post/344160/
