
10 obvious steps to prepare the infrastructure of an online store for Black Friday





Although we prefer to write about microservices, Kubernetes and other cloud-native topics, we are well aware of another world, one that is far more common if you look at online stores in bulk (even very successful ones). There is no automatic provisioning and scaling, no sophisticated load balancing or other elegant technical solutions. But there is Black Friday, which is already tomorrow, so there is almost no time left to prepare. Our recipe No. 1 for preparing properly (and gaining a whole set of additional advantages along the way) is, of course, migrating to a microservice architecture and Kubernetes, but let's assume that for some reason this option does not suit you (it certainly won't be finished by tomorrow).



This article is a list of more or less quick actions for optimizing a typical online store infrastructure (the examples cover nginx, Apache, PHP and MySQL) to prepare it for high loads. They may seem obvious to experienced system administrators, but they will certainly be useful to those who have not yet dived deeply into these issues, whose relevance is growing rapidly. So, let's try to squeeze the maximum out of the existing infrastructure, or at least note the main issues to attend to before the next load spikes.


1. Split up the infrastructure and set up monitoring



Without tools that track even basic load indicators (CPU, memory, disk I/O, network channel utilization, etc.), any optimization efforts will be largely blind.



If the entire infrastructure is one very heavy monolith, with the key services (web server, DBMS, auxiliary services) not separated into their own servers / virtual machines, then start with an appropriate reorganization. It will make the collected statistics far more visual and useful: switching between the main graphs of the different virtual machines, you can immediately see which infrastructure components lack CPU or memory.



Note: an example of an additional advantage of such separation is that exhausted resources on the DBMS side will not make the site completely unavailable (with the web server not responding at all) but will instead produce an informative error (or even a minimally functional stub, see below for details). For this, however, you still need to configure the interaction of the application / web server with the DBMS correctly: connecting to the DBMS and fetching data must not take too long, because otherwise it leads to a pile-up of pending web server processes (possibly exceeding their maximum number) that clog up the resources of the corresponding server / VM.
How is this solved?
In the simplest case of Apache, PHP and MySQL, choose a reasonable max_connections value close to the maximum possible number of simultaneous web server processes plus other access sources (cron scripts). If this value is too small, even lightweight queries to the DBMS will stop being executed; if it is too large, RAM can run out in an instant. The legendary mysqltuner.pl script will help in part here.
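As an illustration, the values could be related like this (the exact numbers are assumptions and must be derived from your RAM and measured concurrency):

```ini
# Apache prefork (mpm_prefork.conf): the ceiling for simultaneous workers
#   MaxClients 150            (called MaxRequestWorkers in Apache 2.4)

# MySQL (/etc/mysql/my.cnf): allow all workers plus side clients to connect
[mysqld]
max_connections = 170   # ~ Apache workers + cron/CLI scripts + a small reserve
```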



Additionally, to combat hung processes, pay attention to the mysql.connect_timeout directive in PHP ( /etc/php5/apache2/php.ini ) and the wait_timeout system variable in MySQL. Of course, besides setting timeouts in the services, the application must handle them correctly, so that when they fire, a proper stub is shown or some other action is taken.
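A minimal sketch of such timeouts (the specific values are illustrative starting points, not recommendations):

```ini
; PHP (/etc/php5/apache2/php.ini): give up on the DBMS quickly instead of
; letting web server processes hang waiting for a connection
mysql.connect_timeout = 5

# MySQL (/etc/mysql/my.cnf): drop connections that have been idle too long
[mysqld]
wait_timeout = 60
```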


An obvious and ready-made (quick to deploy, well documented) Open Source solution for statistics and monitoring of key system indicators (at the very least, you need to learn quickly when resources run out) is Zabbix. Like any other product it is not perfect, and a number of alternatives exist, but it handles these tasks excellently (one could even say it has largely established itself as a de facto standard).







Ideally, of course, in addition to graphs of general system indicators, it is useful to set up graphs of business-level values as well (from the number of processed requests for dynamic content to the number of orders placed).



An alternative is ready-made services, which can be simpler and more convenient to install and use, but which you have to pay for (one example is okmeter, developed in Russia).



On top of the basic values collected by the chosen system, it is useful to monitor the availability of key pages (a correct web server response and a reasonable response time) in order to learn about problems with the online store before they start hurting sales. Zabbix, of course, provides such capabilities.
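To make this concrete, here is a minimal sketch of what such a check does (a local Python web server stands in for the real shop; the port and paths are arbitrary):

```shell
# Serve a stand-in page locally, then check status code and response time
# the way a Zabbix web scenario or an external monitor would.
mkdir -p /tmp/shop
echo '<h1>Catalog</h1>' > /tmp/shop/index.html
python3 -m http.server 8099 --directory /tmp/shop >/dev/null 2>&1 &
SRV=$!
sleep 1

# Alert if the key page does not answer 200 within 5 seconds:
code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://127.0.0.1:8099/)
time_total=$(curl -s -o /dev/null -m 5 -w '%{time_total}' http://127.0.0.1:8099/)
echo "status=$code time=${time_total}s"

kill "$SRV"
```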



2. Carry out load testing



The logical continuation of setting up monitoring and statistics is a battle check, i.e. load testing. There are many open source tools for this as well: from the classic and simplest ab to more complex solutions like Apache JMeter (here is a brief overview of some of them). For the simplest and fastest testing you can turn to the already mentioned ab utility, and as a more universal, integrated tool we recommend the Russian development Yandex.Tank (see the usage example from its creators).



Ideally, test everything that is truly critical for the online store's business: the main page, catalog sections, viewing a product, ordering it (with user registration), etc. Testing a single page (for example, the main one) will reveal global problems (heavily used SQL queries, or the fact that overall load is so high that dynamic page rendering stops working at all), but it will hide more specific problems related to SQL queries against particular tables (or simply particular queries), the use of additional mechanisms and services in the code (for example, sending email to the user or fetching data from some third-party service / cache), and so on.



The statistics / monitoring system set up in the previous step will help during load testing: you can watch the resource consumption graphs (and use this data to estimate how much traffic the infrastructure can potentially handle), and also verify that monitoring alerts fire when the situation becomes critical.



The relevance of the following steps will largely depend on the results of load testing.



3. Sort out the static content



If it suddenly turns out that all requests are served by, say, Apache (and practice shows this really happens), put a lightweight web server in front of it (as a frontend), nginx being the obvious choice, for faster delivery of static content (images, JS, CSS, fonts, video, etc.).
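A minimal sketch of such a frontend configuration (the server name, ports and paths here are illustrative):

```nginx
server {
    listen 80;
    server_name shop.example.com;
    root /var/www/shop;

    # Serve static files directly, with client-side caching enabled
    location ~* \.(jpe?g|png|gif|ico|svg|css|js|woff2?|ttf|mp4)$ {
        expires 7d;
        access_log off;
    }

    # Everything else (dynamic pages) is proxied to Apache on :8080
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```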



Even if you think this is already configured, verify it by looking at the access logs of the application's web server (Apache). As a rule, simple grep or egrep constructions are enough: select GET requests with a successful status (not 404) and cut off the URLs that serve dynamic content in your setup (for example, those ending in / or .html , or those not ending in something like \.[^.]{2,4} ). Very often it turns out that not all statics are served by nginx, and that means more constantly busy Apache processes (approaching the MaxClients limit) and extra resources consumed.
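For example (the log contents below are made up for illustration), a check like this lists the static requests that did reach Apache; ideally the list is empty:

```shell
# A hypothetical fragment of an Apache access log (combined format):
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [20/Nov/2017:10:00:01 +0300] "GET /catalog/ HTTP/1.1" 200 5120
1.2.3.4 - - [20/Nov/2017:10:00:02 +0300] "GET /img/logo.png HTTP/1.1" 200 2048
1.2.3.4 - - [20/Nov/2017:10:00:03 +0300] "GET /js/app.js HTTP/1.1" 200 1024
1.2.3.4 - - [20/Nov/2017:10:00:04 +0300] "GET /old/page HTTP/1.1" 404 512
EOF

# Successful GET requests whose URL ends in a 2-4 character extension,
# i.e. static files that should have been handled by nginx instead:
grep ' "GET ' /tmp/access.log | grep -v ' 404 ' \
  | grep -E 'GET [^ ]+\.[^./ ]{2,4} HTTP'
```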



Another nuance with statics is its volume relative to the site's network channel. This is checked against your statistics (the graphs of network traffic). Even if the existing channel should be enough to serve all the statics at the anticipated (high) load, it is good practice to optimize graphic files and minify JS / CSS files, at least on the key pages of the store. Google PageSpeed Insights will help detect and check these and other client-side problems.



If there is a lot of traffic and expanding the channel is impossible (or too laborious / expensive), consider introducing a CDN.



4. Locate bottlenecks



It is time to deal with problem areas in the web application itself. A short path here is to register a trial New Relic account (free for two weeks), install its agent on the web server side, collect initial load data (you can speed this up by running load testing, possibly a less aggressive one), and analyze it.



Pay attention to the slowest transactions (especially the most frequent ones).







What exactly is slow (takes noticeably longer than everything else)? If it is specific parts of the code, how can they be optimized? If it is SQL queries, work on the DBMS (see step 6). If it is external services, reduce their use to a minimum and / or, where possible, cache their results (either directly in the application code on user requests, or with cron scripts that periodically fetch the necessary data from the services and store it locally).



5. Optimize your web server



Overall code performance can be boosted by auxiliary tools such as PHP accelerators, which cache the compiled (opcode / bytecode) form of the executed application files. For older PHP (5.x), APC is still relevant (even though it has officially been declared dead), and for newer versions (7.x) there is the built-in Zend OPcache, which also needs to be configured.
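A possible OPcache starting point (the path and values are illustrative and should be adjusted to your codebase):

```ini
; Zend OPcache for PHP 7.x (e.g. /etc/php/7.0/apache2/conf.d/10-opcache.ini)
opcache.enable = 1
opcache.memory_consumption = 128       ; MB reserved for compiled bytecode
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 10000  ; should exceed the number of PHP files
opcache.validate_timestamps = 1        ; set to 0 in production (and reset the
opcache.revalidate_freq = 60           ; cache on each deploy) for extra speed
```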



Another area of global optimization for PHP is sessions. As before, many people keep the default session.save_handler = files , i.e. saving sessions to files. With a large number of site visitors (and therefore sessions and files), this produces an unjustified disk load, which magically disappears after switching to a NoSQL storage such as memcached, which keeps all sessions in RAM.
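A sketch of the switch, assuming a memcached instance on the standard port and the php-memcached extension installed:

```ini
; php.ini: keep sessions in RAM instead of on disk
session.save_handler = memcached
session.save_path = "127.0.0.1:11211"
```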



The most radical way to optimize web server operation is to cache pages generated by the language interpreter once every n minutes. This is relevant for pages that: 1) can stay unchanged for at least some time (for example, it is acceptable for the "random goods" block to be regenerated by the engine once every 1-2 minutes for all users), and 2) are the same for all users. The second condition is not that critical, in the sense that caching can be configured to simply exclude authenticated users. In that case ready (cached) pages are served to all "guest" visitors (and they are always the majority, usually an overwhelming one), while dynamic pages are "honestly" generated for authenticated users. The cache can be configured in nginx, but there is also a more specialized open source solution for this: Varnish HTTP Cache.
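Here is what such caching might look like in nginx (the cookie name used to detect authenticated users is an assumption; check what your engine actually sets):

```nginx
# Cache generated pages for 2 minutes, but only for guests.
proxy_cache_path /var/cache/nginx keys_zone=pages:50m max_size=1g inactive=10m;

server {
    listen 80;

    location / {
        set $skip_cache 0;
        if ($http_cookie ~* "logged_in") { set $skip_cache 1; }

        proxy_cache pages;
        proxy_cache_key $scheme$host$request_uri;
        proxy_cache_valid 200 2m;         # regenerate at most every 2 minutes
        proxy_cache_bypass $skip_cache;   # authenticated users get fresh pages
        proxy_no_cache $skip_cache;       # ...and their responses are not stored
        proxy_pass http://127.0.0.1:8080; # the application backend (Apache)
    }
}
```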



Competently configured caching of the main groups of pages on a site often produces fantastic results, because responding to the bulk of requests comes down to serving statics: ready data from RAM, without involving the application server (with its language interpreter) or the DBMS. Verify this with load testing after setting up the caching server, but pay special attention to the caching conditions, so that no critical functions of the online store (adding a product to the cart, ordering it, user registration...) get broken.



6. Optimize DBMS



DBMS performance optimization can be divided into two parts: 1) the overall configuration of the DBMS itself, and 2) work on your own schemas and queries.



The first part involves tuning buffers and other global parameters of the DBMS. The most important ones for MySQL are highlighted in a well-known slide from a presentation by Peter Zaitsev (CEO of Percona).
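For instance (a hypothetical dedicated MySQL/InnoDB server with 8 GB of RAM; real values must come from your own measurements):

```ini
[mysqld]
innodb_buffer_pool_size = 6G        # the single most impactful setting: give
                                    # InnoDB most of the RAM on a dedicated host
innodb_log_file_size = 512M         # larger redo logs smooth out write bursts
innodb_flush_log_at_trx_commit = 2  # slightly relaxed durability, much faster
```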



A more detailed analysis of these and some other system variables (critical for MySQL / MariaDB performance) can be found, for example, in this article from MariaDB.



The second area, the database schema and the SQL queries used, usually starts with an analysis of slow queries (MySQL) or with specialized tools for detecting performance problems, such as the already mentioned New Relic. The lion's share of problems is usually solved by adding the right indexes and by a thoughtful review of SQL queries (followed by optimizing them or changing the schema).
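As a sketch of this workflow (the table and column names are invented for illustration):

```sql
-- Enable the slow query log first (in my.cnf):
--   slow_query_log = 1
--   long_query_time = 1
-- Then take a query from the log and ask MySQL how it executes it:
EXPLAIN SELECT id, name, price
FROM products
WHERE category_id = 42 AND in_stock = 1
ORDER BY price;

-- If EXPLAIN reports a full table scan (type: ALL), a composite index
-- covering the filter and the sort usually fixes it:
ALTER TABLE products ADD INDEX idx_cat_stock_price (category_id, in_stock, price);
```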



An additional solution for quickly serving data that still takes a long time to fetch from the DBMS is caching the results of frequent / heavy operations, for example in a NoSQL storage. As in other cases, do not forget to keep an eye on the freshness of the data in the cache.



7. Optimize the engine



If your online store runs on a popular framework or engine, remember that this means a community and documentation exist, from the vendor or from users / enthusiasts. Search the web for best practices for optimizing your particular solution and apply them where possible. The tips in such guides usually cover different areas: from engine / application settings and ready-made tools (a performance panel, optimizing plugins, etc.) to infrastructure changes.



If the engine has secondary features that create noticeable load on the server or the DBMS, disable them at peak times: it is better for users to lose some features than to get nothing at all.



8. Upgrade the hardware



Based on the data from the statistics system, the load testing results and the optimizations performed (for example, analysis of DBMS problems showed that much bigger buffers are needed), add resources: increase the number of CPU cores and the amount of RAM, move disk-intensive components (the DBMS) onto SSDs, move individual components to separate servers / virtual machines, expand the channel (if a CDN turns out to be expensive or does not help, because the problem is not only about statics).



9. Prepare a proper stub page



When nothing else helps (not all optimizations were done, or they proved insufficient, the hardware upgrades are not enough, the traffic exceeded all reasonable expectations), let users see an informative stub with at least some minimal useful information: key products, contact details, etc. Configure it in the web server and, where possible, have the application return it too (for timeouts and other problems).
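In nginx such a stub might be wired up like this (the backend address, timeouts and paths are illustrative):

```nginx
# If the backend is down or too slow, show a static maintenance page
# instead of a bare 502.
server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_connect_timeout 3s;
        proxy_read_timeout 10s;
        error_page 502 503 504 = /maintenance.html;
    }

    location = /maintenance.html {
        root /var/www/static;    # key products, contact details, etc.
        internal;                # not directly requestable from outside
    }
}
```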



Of course, this is much worse than cached pages for unauthenticated users, which can also be served as statics, but it is still much better than a browser message about being unable to connect, or an error page all too familiar to many.







10. Call professionals



If for some reason all of this cannot be implemented well (or does not help as much as you would like), remember that you can step outside the technical side of the problem and turn to those who are more competent in the relevant matters: friends / acquaintances or companies specializing in this field. Sometimes that can be the more effective business decision.



Summary



Despite containing some specific technical instructions, this article is more of an overview that aims to highlight the important points to keep in mind when tackling the difficult task of preparing an online store (or another website) for high loads and traffic spikes caused by Black Friday and similar events. By the way, we talked about some of these problems at the Highload Junior 2017 conference held as part of RIT++ 2017 (the report "TOP Infrastructure Errors That Prevent High Loads" by Andrey Polovov and Andrey Kolashtov).



We hope that at least some of these tips will help you withstand the upcoming test. As popular wisdom says, "forewarned is forearmed." And another proverb recommends preparing your sleigh in summer: if not for this Friday, then you can certainly prepare properly for the New Year and Christmas sales season, so don't put it off!



Hang in there tomorrow! All the best to everyone, and good sales!






Source: https://habr.com/ru/post/342976/


