
When it comes to high load, the focus is usually on the performance or scalability of the code and architecture.
Meanwhile, the reliability of the code itself is rarely discussed, even though under the harsh conditions of high-load projects its quality becomes especially important. You need truly "bulletproof" code that works correctly even when a large number of simultaneous requests hit the same data. This article presents a set of recommendations that can help you write such code.
"High load" - what is it?
Different people mean different things by the term "high load". Usually it implies a large number of requests per second, but that is a very relative criterion: for many sites even a modest 100 requests per second is already a high load. Load is created not only by the number of requests but also by their nature: some requests are very expensive computationally.
The Badoo site receives more than 40,000 requests per second to PHP-FPM, so for us the issues related to high load are more than relevant.
High loads = high reliability?
Imagine that we have a "regular site" with 10,000 hits per day. If its code contains an error that affects 0.05% of requests, the error will show up 5 times a day. Most likely, those 5 entries per day in error.log will simply be overlooked.
Now imagine the same code working under Badoo's high load. That is 100,000 times more hits than in the previous example, so we will get 5-10 error messages per second, which is much harder to ignore. If the error affects 1% of users, there will be 20 times more messages: hundreds per second. On top of that, we will get thousands of disgruntled visitors for whom something does not work, and a flood of requests to our support service.
Writing "bulletproof" code
It is important to report errors
The first step to solving a problem is admitting that it exists. If some operation could not be performed, that fact must be written to the error log. Reporting critical errors to the user is important, but it is just as important that the developers learn about the problem.
The severity of the problem itself does not matter much: even a timeout that occurs once in 10,000 database connections can be a symptom of a serious network issue that, for now, rarely manifests itself.
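As an illustration, here is a minimal sketch of logging such a rare failure with enough context to investigate it later. The host name and credentials are placeholders, and the explicit mysqli_report() call simply assumes the classic error mode where a failed connect returns false instead of throwing:

<?php
// Minimal sketch: log a failed database connection with enough context
// (timestamp, host, error code and message) to investigate it later.
// Host and credentials are placeholders, not real values.
mysqli_report(MYSQLI_REPORT_OFF); // assume the classic "return false" error mode
$link = mysqli_connect('db-master.local', 'app', 'secret', 'app');
if ($link === false) {
    // Even a timeout that happens once in 10,000 connects deserves a log
    // entry: it may be the first symptom of a growing network problem.
    error_log(sprintf(
        '[%s] connect to db-master.local failed: (%d) %s',
        date('c'),
        mysqli_connect_errno(),
        mysqli_connect_error()
    ));
    exit(1);
}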
Check the result of every operation
There are languages like Java, where all error handling is done through exceptions, so it is very hard to accidentally miss an error. PHP was not designed around exceptions, and standard PHP functions simply return false on error. Either way, in any code where errors matter you have to check the result of every call: even fclose or an SQL COMMIT can fail.
Depending on whether you work in a web environment or in the CLI, the actions taken on error will differ. In most cases it is better for your program to terminate immediately after the first error than to try to continue execution.
On the other hand, errors in responses from internal services are often not critical. When rendering the response, you must take into account that part of the data is unavailable and show the user a message prepared in advance for this case.
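To make this concrete, here is a rough sketch of what "check everything" can look like in PHP. The file path is arbitrary, and terminating immediately is the CLI-style reaction described above; in a web environment you would show a prepared fallback message instead:

<?php
// Sketch: the result of every operation is checked, including the ones
// that "never fail", such as fclose().
$fp = fopen('/tmp/export.csv', 'w');
if ($fp === false) {
    error_log('fopen failed for /tmp/export.csv');
    exit(1); // in a CLI script it is usually safer to stop right away
}

if (fwrite($fp, "id,name\n") === false) {
    error_log('fwrite failed for /tmp/export.csv');
    exit(1);
}

if (!fclose($fp)) {
    // fclose() can fail too, e.g. when the disk is full and buffered
    // data cannot be flushed; ignoring that silently would lose data.
    error_log('fclose failed for /tmp/export.csv');
    exit(1);
}

// The same applies to SQL: a COMMIT can be rejected by the server,
// so the result of mysqli_commit() must be checked like anything else.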
Check data checksums
In large projects, data is usually heavily denormalized for performance, so discrepancies can appear in it. To catch such problems, you need to run data integrity checks in the background.
For example, if you have counters that are stored separately from the data itself, you can periodically check that the number of rows in the database matches the counter. If the numbers differ, the values should be corrected automatically.
In most cases you should also notify the developers about the discrepancies found and investigate the problems they point to. A discrepancy in the data means either that users are being shown incorrect values or, even worse, that something is being stored incorrectly because of bugs in the code.
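A background consistency check of this kind might look roughly like the sketch below. The table and column names (user_counters, messages) are invented for illustration, and the repair query is deliberately simple:

<?php
// Background consistency check (run from cron, not on user requests):
// compare a denormalized counter with the actual number of rows and
// repair it when they diverge. Table and column names are hypothetical.
mysqli_report(MYSQLI_REPORT_OFF);
$link = mysqli_connect('db-host', 'app', 'secret', 'app');
if ($link === false) {
    error_log('consistency check: cannot connect to db');
    exit(1);
}

$res = mysqli_query($link, 'SELECT user_id, messages_count FROM user_counters');
if ($res === false) {
    error_log('consistency check: ' . mysqli_error($link));
    exit(1);
}

while ($row = mysqli_fetch_assoc($res)) {
    $userId = (int)$row['user_id'];
    $cntRes = mysqli_query($link,
        "SELECT COUNT(*) AS cnt FROM messages WHERE user_id = $userId");
    if ($cntRes === false) {
        error_log('consistency check: ' . mysqli_error($link));
        continue;
    }
    $real = (int)mysqli_fetch_assoc($cntRes)['cnt'];
    if ($real !== (int)$row['messages_count']) {
        // Fix the counter automatically and tell the developers about it.
        error_log("counter mismatch for user $userId: " .
            "{$row['messages_count']} stored vs $real actual");
        mysqli_query($link,
            "UPDATE user_counters SET messages_count = $real WHERE user_id = $userId");
    }
}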
Reliable data writes in the eventual consistency model
If database replication is in place, so-called "strong consistency", where all servers hold strictly identical data, is often not required. If you can live with different servers holding data of different "degrees of freshness", writing reliable code becomes much simpler: you only need to make sure the data is reliably written to N servers, after which the remaining servers will be updated as soon as the replication queue reaches them.
In the simplest case, reliable storage in the eventual consistency model means first writing the data reliably (for example, inside a transaction) together with an entry in a replication queue. After that, other scripts can make an unlimited number of attempts to read this data and write it to the remaining servers until they succeed.
This storage model guarantees that the data will eventually reach all the servers that need it, while still allowing high performance and scalability: you can replicate to any number of machines, and reads are never blocked.
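In pseudo-PHP, the scheme described above might look like the sketch below; the table names, shard identifier and connection parameters are assumptions for illustration, not a real Badoo API:

<?php
// Sketch of the "write locally, replicate in the background" scheme.
// Table names, hosts and credentials are invented for illustration.
mysqli_report(MYSQLI_REPORT_OFF);
$link = mysqli_connect('local-db', 'app', 'secret', 'app');
if ($link === false) {
    error_log('cannot connect to local db');
    exit(1);
}

// 1. Primary, reliable write: the data and the replication task go into
//    the same local transaction, so they cannot diverge.
mysqli_begin_transaction($link);
$ok = mysqli_query($link,
        "INSERT INTO events (user_id, payload) VALUES (1, 'hello')")
   && mysqli_query($link,
        "INSERT INTO replication_queue (event_id, target_shard)
         VALUES (LAST_INSERT_ID(), 'shard-2')");
if (!$ok || !mysqli_commit($link)) {
    error_log('local write failed: ' . mysqli_error($link));
    mysqli_rollback($link);
    exit(1);
}

// 2. A separate background script drains replication_queue: it reads a
//    row, tries to write it to the target server, and deletes the row
//    only after the remote write has succeeded. Failures are simply
//    retried on the next pass, which is exactly what the eventual
//    consistency model allows.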
"Memento Mori"
This section partially overlaps with the one about checking the results of all operations.
When you have thousands of servers, the probability of running into errors even in very reliable services increases dramatically. For example, a kernel panic after 7 months of uptime in the Linux kernel, which we were "lucky enough" to encounter, caused a whole series of failures in our internal services.
The main idea is that under high load you will regularly run into the unavailability of some specific service, so any code should use reasonable timeout values when connecting to any service. Suppose the average response time per user request is 100 ms, and some heavily used internal service goes down while the connection timeout to it is 10 seconds: most requests will now be processed 100 times slower than before. If you have a limited number of worker processes (which is quite reasonable), the number of busy workers will also grow roughly 100-fold and quickly hit the limit. After that, your entire site is down simply because some not particularly important internal service is unavailable. And if there were no limit, not only the site but the web machines themselves would go down, having sunk deep into swap.
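A sketch of what explicit, short timeouts can look like in PHP; the host names and the specific millisecond values are illustrative assumptions, not a recommendation for every service:

<?php
// Sketch: always set explicit, short timeouts when talking to internal
// services, so a dead service cannot tie up worker processes.

// Plain socket connection: fail after 0.2 s instead of the OS default.
$conn = @fsockopen('internal-service.local', 8080, $errno, $errstr, 0.2);
if ($conn === false) {
    error_log("internal-service unavailable: ($errno) $errstr");
    // Degrade gracefully: render the page without this block.
} else {
    stream_set_timeout($conn, 0, 200000); // 200 ms for reads as well
    // ... talk to the service ...
    fclose($conn);
}

// HTTP call via curl: separate connect and total timeouts.
$ch = curl_init('http://internal-service.local/api/counters');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 200);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 500);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
if ($body === false) {
    error_log('counters api failed: ' . curl_error($ch));
}
curl_close($ch);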
In short, remember death when you write code that will work under high load.
Why do these recommendations work?
We have described the unpleasant consequences of ignoring the advice above. But how exactly do these recommendations help with debugging and with writing reliable code in general?
Let's go through them point by point.
Error messages
Everything is obvious here: as soon as an error occurs in the code, you learn about it very quickly and can fix it, because every error message is reported and saved in the logs.
Checking the results of operations
How do you even find out that errors occurred while the program was running? Right: by checking the result of every operation. If all of the code is written in this style (and for any self-respecting program it should be), then even if you accidentally miss a few checks, the error will still surface: if a later operation fails, you will catch it there (since you check everything), and the missing check will not lead to fatal consequences.
Checksum check
In a web environment, checksums are usually not verified "on the fly" (to keep response times low), but such checks must run in the background. CLI scripts, as a rule, have no strict requirements on execution time, so in CLI scripts and daemons you can afford to compute checksums during the run itself.
Data integrity checks reveal synchronization errors between services (for example, whether the number of rows in the database matches the value of the corresponding counter). Such checks are needed around any operation that can, in principle, fail, and in any other case where the cost of an error is too high.
Data checks need not be limited to counting rows. For example, in our scripts that deploy code to production machines, before calling
system("rm -rf " . escapeshellarg($dir))
we check the length of $dir. If the path is too short, an error has crept in somewhere, and the removal must not be performed under any circumstances.
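A simplified version of such a sanity check is sketched below; the minimum length and the allowed path prefix are made-up values for illustration:

<?php
// Simplified sanity check before a destructive command. The minimum
// length and the required prefix are made-up values.
function removeDirSafely($dir)
{
    // A suspiciously short path ('/', '/usr', an empty string) almost
    // certainly means a bug somewhere upstream; refuse to delete.
    if (strlen($dir) < 20 || strpos($dir, '/local/deploy/') !== 0) {
        error_log("refusing to rm -rf suspicious path: '$dir'");
        exit(1);
    }
    system('rm -rf ' . escapeshellarg($dir), $exitCode);
    if ($exitCode !== 0) {
        error_log("rm -rf failed with code $exitCode for '$dir'");
        exit(1);
    }
}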
Summary
The recommendations above are obvious, yet in practice they are rarely followed when writing programs. We hope this article has convinced you that "high load" is one of the areas where code quality and good architecture are of the utmost importance.
Remember our recommendations, follow them when writing high-load systems, and your hair will be soft and silky :).
Image source: vbgcity.ru/sites/default/files/krepost-oreshek.jpg
Yuri Nasretdinov, Badoo developer