
5 lessons for developers of high-load systems

Since 2010 we have been developing Pyrus, a service for team collaboration and process management. Today thousands of organizations and tens of thousands of users work in it. Over these four years we have gained solid experience in keeping the service reliable, and we want to share it with you.

1. Everything can break


We try to put a safety net under everything we can. All database servers are fully mirrored. Data centers schedule maintenance windows of 2-3 hours, and our service must keep running through them, so the mirrors sit in different data centers and even in different countries. In addition, servers need regular security updates, which sometimes require a reboot; in those cases a hot switchover to the standby server helps a lot.

The servers use RAID, and we make daily backups. We run several application servers, which provides scaling and lets us update them one at a time without interrupting service. For load balancing we use round-robin DNS. We assumed that DNS is the most reliable system on the Internet, because without it no site opens at all. A surprise was waiting for us.
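To illustrate the idea (this snippet is not from the article): with round-robin DNS one name is published with several A records, and each client that resolves the name lands on one of the application servers. A minimal Python check, with a placeholder hostname you would replace with your own, might look like this:

```python
import socket

# With round-robin DNS, one name maps to several A records; each client
# resolves the name and ends up on one of the application servers.
# "app.example.com" is a placeholder, not the real service hostname.
hostname = "app.example.com"
_, _, addresses = socket.gethostbyname_ex(hostname)
print(addresses)  # several IPs means requests are spread across servers
```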

We hosted our domain zone with register.com, a large registrar that serves more than 3 million domains. As expected, it provides two independent name servers, which protects against the failure of either one. One morning both went down. The register.com management console was unavailable too. Timid user complaints began to appear on Twitter, and an hour later they had grown into an avalanche of screaming, crying, wailing, and promises to leave the provider immediately - just as soon as it turned the servers back on.
Since then we have moved our domain zone to Amazon, which provides four name servers located in different top-level domains: .com, .net, .org, and .uk. This adds another level of reliability: even if the entire .com zone becomes unavailable in DNS for some reason, clients will still be able to reach our service.

Conclusion: design the system knowing that sooner or later any component will fail. Remember Murphy's law: anything that can go wrong, will go wrong.

2. You do not know where your application's bottleneck is


As the load grows, we constantly do two things: buy memory (RAM) and optimize the application. But how do you figure out which function is too slow? Synthetic measurements in tests on a developer's machine tell you little. And running a profiler on a production server is practically impossible: it adds too much overhead and the service starts to lag.

Instead, we insert control points into the code and estimate the application's speed from the execution time between them.
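The article does not show the instrumentation itself, but the idea can be sketched in a few lines. This is only an illustration in Python: the checkpoint names and the toy request handler are invented, and the real code would accumulate timings per request and ship them to monitoring.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per control point (names are illustrative).
timings = defaultdict(float)

@contextmanager
def checkpoint(name):
    """Measure the time spent between two control points in the code."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

def handle_request(task_id):
    # A toy handler: build a response, then serialize it.
    with checkpoint("build_response"):
        response = {"id": task_id, "steps": list(range(1000))}
    with checkpoint("serialize"):
        body = json.dumps(response)
    return body

for i in range(10_000):
    handle_request(i)

# The slowest checkpoints point to the real bottleneck.
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.3f} s")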

That is how we discovered that a third of CPU time was going to... serialization: packing data structures into JSON strings. After studying the alternative serialization libraries, we made an unpopular decision: to write our own. An implementation tailored to our specific tasks turned out to be twice as fast as the fastest alternative on the market.
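The article does not show the serializer, but the gain from a schema-specific implementation over a general-purpose one can be sketched roughly like this. This is a toy Python illustration; the record layout and the "specialized" function are invented for the example, and real numbers depend entirely on the data and the runtime.

```python
import json
import timeit

# A toy "task" record; real objects are of course much richer.
task = {"id": 42, "text": "Prepare report", "done": False}

def generic(t):
    # General-purpose serializer: inspects the structure on every call.
    return json.dumps(t)

def specialized(t):
    # Schema-specific serializer: the field set is known in advance,
    # so the string is assembled directly (illustration only).
    return '{"id":%d,"text":%s,"done":%s}' % (
        t["id"], json.dumps(t["text"]), "true" if t["done"] else "false")

# Both produce equivalent JSON.
assert json.loads(generic(task)) == json.loads(specialized(task))

print("generic:    ", timeit.timeit(lambda: generic(task), number=200_000))
print("specialized:", timeit.timeit(lambda: specialized(task), number=200_000))
```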

By the way, many people mistakenly believe that encryption consumes a lot of CPU. That used to be true: encryption could "eat up" as much as 20% of CPU time. But starting with Westmere, launched in January 2010, AES instructions are part of the Intel instruction set, so switching from HTTP to HTTPS barely changes the processor load.

Conclusion: do not optimize prematurely. Without accurate measurements, your guesses about what needs to be sped up are most likely wrong.

3. Test everything


Once we needed to change the structure of a table in the database. The procedure requires stopping the service, so we scheduled it for the quietest time: at night on a weekend. Our tests showed it would run in under a minute. We stopped the service and started the procedure on the production server, but it had not finished after one minute, or after ten.

It turned out that in some cases the procedure rebuilds the clustered index of the table, which by then was about 1 TB in size. We had missed this because we ran our tests on a small table. We had to bring the service back up without waiting for the procedure to finish. Luckily, all the basic functions worked correctly, if somewhat slower than usual, except for attaching files to tasks. The procedure completed a couple of hours later and full capacity was restored.

Conclusion: test all changes on data volumes close to production. We run about 500 automated tests on each build of the application to make sure there are no fatal errors.

4. Tests must run fast


We release application updates every week. A couple of times a year, despite all the testing, a bug slips into a release - small but unpleasant. Usually such bugs are detected within 10 minutes of release, and then we ship a hotfix.

No one likes rolling back releases, but sometimes you have to. Fixes need to ship quickly; we often find the cause of a bug within half an hour. But before a release reaches the production servers, the source code must pass the automated build and the automated tests. Our 500 tests take a bit over 20 minutes, which is fairly fast, but we plan to cut this time further through more parallelization.
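One possible sketch of such parallelization, not taken from the article: split the suite into groups and run each group in its own process on a separate core. Python and pytest here are only stand-ins for whatever test runner the build actually uses, and the group paths are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split of the suite into independent groups; in a real
# setup the groups would be chosen so their run times are roughly equal.
groups = ["tests/unit", "tests/api", "tests/storage", "tests/ui"]

def run_group(path):
    # Each group runs as its own pytest process, so groups occupy separate cores.
    result = subprocess.run(["pytest", path], capture_output=True)
    return path, result.returncode

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    for path, code in pool.map(run_group, groups):
        print(f"{path}: {'OK' if code == 0 else 'FAILED'}")
```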

With slow tests we could not fix bugs this quickly, and without tests there would be far more bugs.

Conclusion: do not skimp on resources for developers. Buy powerful servers for the automated tests, whose number will only keep growing.

5. Every product feature must be used


Good products take many iterations. New features are added constantly, but it is just as important to cut out rarely used ones. They carry no value: they consume developer time on maintenance and take up space on the screen.

A good gardener prunes young shoots every spring and shapes a healthy, well-proportioned, beautiful crown.

Are there features in your product that nobody uses? In Pyrus we don't know of any.

Empirically we arrived at a rule: every feature must be used by at least 2% of users. It also means that when we retire a feature, dozens or hundreds of people are unhappy about it. We always provide another way to accomplish the same thing, but habit is stronger.

Conclusion: progress requires some sacrifice. Imagine how many people dislike every change in Google and Microsoft products.

Source: https://habr.com/ru/post/227595/

