"Tester's Calendar" for May. Load the service

Load testing is much like a civil defense or emergency drill: it is better to know in advance what a given situation will look like than to try to find your way in a panic. Besides your own tests and the problems collected in production, you can draw on the experience of colleagues in the industry. For the Tester's Calendar project, Dmitry Vorotnikov, a tester at Kontur, used incidents at large IT companies to derive several simple but important rules for testing a service.

Modified load profile

When people talk about load testing, they usually mean capacity testing. Online stores have Black Friday and Cyber Monday, a time of sales and an extreme increase in load on every service. At Kontur, similar traffic spikes occur in the last days of reporting to regulatory authorities. Whatever the reason for the growth in visitors, operations must not become unavailable, start failing with errors, or slow down. Capacity testing lets us make sure that users will not be angrily jerking the mouse or leaving for competitors, but can work comfortably and productively.

If you test with a load profile that replicates the pattern of the last month, year, or two, you may run into the problem that Amazon Simple Storage Service had on February 15, 2008. Access to data in S3 is controlled by the AWS authentication service. Requests to it are encrypted and expensive to process. Amazon kept as many servers as were needed to handle the load of the previous two years. On the day of the incident, at 3:30 am, engineers noticed that the number of authentication requests had grown. This overwhelmed the AWS infrastructure, and it became impossible to process all requests. Handling the increased load required bringing additional capacity online. Until 6:48 am, all projects that used S3 were unavailable.

Allow for possible changes in the load profile both when planning service capacity and when running load tests.
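One rough way to see why a changed profile matters is to recompute capacity for a different request mix at the same total traffic. The sketch below is illustrative only: the request kinds, per-request CPU costs, core count, and utilization target are made-up assumptions, not figures from the S3 incident.

```python
import math


def servers_needed(profile_rps, cpu_ms_per_request,
                   cores_per_server=8, target_utilization=0.7):
    """Estimate how many servers a request mix needs.

    profile_rps maps a request kind to requests per second;
    cpu_ms_per_request maps the same kinds to CPU milliseconds per request.
    All numbers here are hypothetical inputs for illustration.
    """
    # CPU milliseconds demanded per second of wall-clock time.
    demand_ms = sum(rps * cpu_ms_per_request[kind]
                    for kind, rps in profile_rps.items())
    # CPU milliseconds one server can supply per second at the target load.
    supply_ms = cores_per_server * 1000 * target_utilization
    return math.ceil(demand_ms / supply_ms)


# Hypothetical costs: an authenticated request is ten times more expensive.
COST_MS = {"plain": 2, "authenticated": 20}

historical = {"plain": 9000, "authenticated": 1000}  # mix seen in past data
shifted = {"plain": 5000, "authenticated": 5000}     # same total, new mix

baseline = servers_needed(historical, COST_MS)
stressed = servers_needed(shifted, COST_MS)
print(f"historical mix: {baseline} servers, shifted mix: {stressed} servers")
```

With these assumed numbers, total traffic stays at 10,000 requests per second, yet the shifted mix needs almost three times as many servers. A load test that only replays the historical mix would not show this.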

Irregularity

A single test run or occasional testing is not enough, even if you account for possible changes in the load profile. This is especially true if your number of users or integrations keeps growing.

On Wednesday, December 22, 2010, the Skype messenger began to malfunction. Establishing a connection took longer and longer, until the service stopped working altogether. The trigger was an overload of a number of servers that processed instant messages, so processing them took much longer. This slowdown exposed a bug in the Windows client, which caused those clients to crash. A significant share of the clients were supernodes that supported P2P exchange between other clients. When 25–30% of the supernodes failed, the rest became overloaded and failed too, increasing the load even further. As a result of this cascading failure, the Skype network was unavailable for about a day.

Test and review your service regularly.
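The mechanics of a cascading failure like the one in the Skype story can be sketched with a toy model. It assumes that the load is spread evenly over the nodes that are still alive and that a node fails as soon as its share exceeds its capacity. Real systems are far more complex, and all names and numbers below are hypothetical.

```python
def simulate_cascade(total_load, capacities, initially_failed):
    """Toy model of a cascading failure.

    The load is split evenly across live nodes; every node whose share
    exceeds its capacity fails, and the load is then redistributed.
    Returns (surviving node ids, list of node ids that failed per round).
    """
    alive = set(range(len(capacities))) - set(initially_failed)
    rounds = []
    while alive:
        share = total_load / len(alive)
        overloaded = {n for n in alive if share > capacities[n]}
        if not overloaded:
            break  # the remaining nodes can carry the load
        rounds.append(sorted(overloaded))
        alive -= overloaded
    return sorted(alive), rounds


# Ten hypothetical nodes with capacities from 110 to 200 units.
capacities = [110, 120, 130, 140, 150, 160, 170, 180, 190, 200]

# 30% of the nodes crash while the network carries 1000 units of load.
survivors, rounds = simulate_cascade(1000, capacities, initially_failed=[0, 1, 2])
print("survivors:", survivors)
print("failures per round:", rounds)

# The same crash at 900 units of load is absorbed without a cascade.
print(simulate_cascade(900, capacities, initially_failed=[0, 1, 2]))
```

In this model a difference of 10% in load separates a network that absorbs the loss of three nodes from one that collapses completely. The headroom measured a year ago says little about the headroom today, which is why the check has to be repeated as load grows.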

Side effect

When you plan regular testing and develop its scenarios, keep in mind that force majeure can increase the load. Spreading a service across several data centers adds no fault tolerance if, when one of them fails, the others cannot handle its load. Capacity planning should cover such scenarios. Another source of extra load is intentional change to the system, such as software updates or maintenance.
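The data center scenario can be checked with simple arithmetic before any test is run. The sketch below assumes that the capacity of each data center is known from load tests and that traffic from a failed data center can be redistributed freely among the others. The names and numbers are hypothetical.

```python
def single_failure_report(peak_load, capacities):
    """For each data center, tell whether the others could carry the
    peak load if that one data center were lost.

    capacities maps a data center name to its measured capacity
    (same units as peak_load).
    """
    total = sum(capacities.values())
    return {name: total - cap >= peak_load
            for name, cap in capacities.items()}


# Hypothetical capacities in requests per second.
datacenters = {"dc-a": 600, "dc-b": 600, "dc-c": 400}

print(single_failure_report(1000, datacenters))
print(single_failure_report(1100, datacenters))
```

At a peak of 1000 the service survives the loss of any single data center. At 1100 it survives only the loss of the smallest one, even though the three data centers together still have 1600 units of capacity.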

On February 24, 2009, Gmail ran into trouble because of a new feature that stored mail geographically closer to its senders. During maintenance, the users of one data center were redirected to another, which overloaded it. This set off a cascade of failures from data center to data center, each one taking on a growing load. The service was unavailable for two and a half hours. The incident was nicknamed Gfail.

Two related conclusions can be drawn from this post-mortem. The first is that the service and its environment are constantly changing, so you need to test the scenarios of maintenance work and of taking services and servers offline. The second is to take the test results into account during such work and to refresh them before making changes or shutting anything down.

Unreadiness for failure

To avoid downtime and add nines to your availability figure, you need to know how much your service can withstand, refresh this knowledge regularly, keep it accessible, and use it when making changes. When designing and developing a service, we apply a variety of measures that protect against overloads, cascading failures, power outages, equipment failure, and data loss. Unfortunately, this variety is no cure for errors in implementation or configuration, or for the human factor.

On July 19, 2010, the site of the large American online retailer American Eagle Outfitters became unavailable because its primary storage failed. No problem, there is a backup! However, switching to the backup storage caused it to fail as well. Still not a disaster, because there were backups on magnetic tape. Restoring from them took a long time, and when the team then tried to launch the site at the backup site, that failed too. The backup site was not ready, although it should have been prepared in advance. Despite a wide range of protective measures, it took 4 days to restore the ability to take orders and another 4 days to recover fully.

After load testing and establishing what your service can handle, do not forget to test the protection mechanisms, failover, and your backups.
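Testing a backup means actually restoring it and comparing the result with the original. The sketch below is a minimal illustration using only the Python standard library. It archives a directory, restores the archive to a separate location, and compares file checksums. The function names and the sample file are hypothetical, and a real drill would restore from the production backup system rather than from a freshly created archive.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(data_dir):
    """Back up data_dir, restore the archive elsewhere, compare contents.

    Returns True only if every restored file matches the original.
    """
    with tempfile.TemporaryDirectory() as work:
        archive = shutil.make_archive(
            str(Path(work) / "backup"), "gztar", root_dir=data_dir)
        restored = Path(work) / "restored"
        shutil.unpack_archive(archive, restored)
        return checksums(data_dir) == checksums(restored)


# Example run against a throwaway directory with one sample file.
with tempfile.TemporaryDirectory() as src:
    (Path(src) / "orders.csv").write_text("id,total\n1,9.99\n")
    drill_ok = restore_drill(src)
    print("restore drill passed:", drill_ok)
```

Running such a drill on a schedule turns "we have backups" into "we have restored from our backups recently", which is the property that was missing in the story above.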

And finally, keep information about every outage of the service in an accessible place: the chronology of the outage and an analysis of the causes that led to it. This will help the whole team of developers and testers to understand the problem better and to find new approaches to solving it.

List of calendar articles:
Try a different approach
Reasonable pair testing
Feedback: as it happens
Optimize tests
Read the book
Analytics testing
The tester must catch the bug, read Kaner and organize a move.
Load service
Metrics in QA service
Test security
Know your customer
Disassemble backlog

Source: https://habr.com/ru/post/358234/
