The site has been written, moved to hosting, and is running as a large online store; creative ideas are lined up for implementation. However, you are seriously concerned about the stability of the solution and about customer satisfaction: even under peak load, customers should not sit in front of a blank white page for 30 seconds and then receive a message like "Unfortunately, the system is overloaded or an internal error has occurred. Pour yourself some coffee and try again."

We will define a simple and effective strategy for achieving and maintaining a high level of customer service in an online store, using free analysis and monitoring tools.
The article will be useful both for projects based on Bitrix (including web cluster ones) and for other solutions built on nginx + apache-php / php-fpm + MySQL.
Recording requests in logs: you can and should
For web projects with traffic on the order of millions of hits per day, you can and should set up logging of every client request, both to pages and to resources (images, etc.).
It is extremely important to understand that the system can save information about every client request, and that enabling this feature is simple (it is a standard tool). If the online store is deployed on a web cluster, it is easy to configure remote logging to a dedicated server, using, for example, syslog-ng.
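For a web cluster, the client-side syslog-ng configuration can be as simple as the following sketch (the file path, host name, and port are assumptions; adapt them to your setup):

```
# follow the nginx access log and forward it to the dedicated log server
source s_nginx { file("/var/log/nginx.access.log" follow-freq(1)); };
destination d_remote { tcp("logs.example.com" port(514)); };
log { source(s_nginx); destination(d_remote); };
```

On the dedicated log server, a matching network source and a file destination collect the logs from all cluster nodes in one place.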
Logging every client request reduces the configuration's performance slightly, by a few percent; in return, you retain complete information about and control over the quality of customer service: almost every error and hang of the online store is recorded and available for further analysis and for adjusting the development and system administration process. Otherwise you take risks, move blindly and, one might say, do not control the situation.
Unfortunately, quite often nobody bothers to enable this capability in production (or does not know about it), and the project manager learns that the web solution is freezing from Twitter, from colleagues, or from the boss :-) And the causes can no longer be found: the errors were never recorded, and "the train has left."
What needs to be logged
First of all, you need to extend the standard log formats of nginx, apache, and php-fpm with the key performance indicators described below. Besides the standard data (request URL, response code, etc.), it is important to record the following for each hit:
Request time
For nginx this is "$request_time". With a two-tier nginx + apache or nginx + php-fpm configuration, it is also useful to record "$upstream_response_time" in the nginx log.
For apache, this is "%D".
For php-fpm, this is "%{mili}d".
Peak memory usage when processing a request
If you use php-fpm, it is useful to save "%{bytes}M" in the log.
HTTP response code
Typically, the HTTP response code is already stored in the default logs. It is important to keep this information: customer-service errors usually have a 40x or 50x response code (we will analyze the response codes in more detail below).
Separate performance logs and log format settings
Sometimes it is convenient to keep separate "performance" logs, preserving the script processing time and peak memory usage, alongside the "standard" logs. We often use the following options:
nginx:
log_format main '$remote_addr - $remote_user [$time_local] "$host" "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" -> $upstream_response_time';
apache:
LogFormat "%t \"%r\" %>s %b child:%P time-> %D" timing
php-fpm:
access.format = "%R # %{HTTP_HOST}e # %{HTTP_USER_AGENT}e # %t # %m # %r # %Q%q # %s # %f # %{mili}d # %{kilo}M # %{user}C+%{system}C"
Offline log analysis
So, you have organized the recording of client-access information in logs (in the case of a web cluster, the logs are centrally collected on a separate machine using, say, syslog-ng). Every error and page hang (even one caused by a slow channel between the client's browser and the web server) has been logged. Now we need to learn how to interpret the logs and to initiate the elimination of the causes behind the errors.
Common Error Types
In case of errors in the PHP code, in the language core itself (a segmentation fault), or in the operating system, status 500 (Internal Server Error) is returned as a rule. In this case most web projects show the visitor the maintenance page (or, if that is not configured, a technical error message :-)).
When apache or php-fpm is overloaded in a two-tier configuration (long database queries, calls to external sockets in the code, or a number of backend workers, i.e. apache / php-fpm processes, that does not match the real load), the frontend (nginx) will record in its logs either an exceeded backend response time or the errors 502 Bad Gateway and 504 Gateway Timeout. In this case, if no load balancer is used, the visitor is usually shown the maintenance page. When a balancer is used (for example, upstream in nginx), the time it takes to return the page to the client simply increases (which is also recorded in the logs).
If a PHP script runs out of memory, error 500 (Internal Server Error) will appear in the log.
Sometimes, however, the application needs to return an error code deliberately; in that case you can simply exclude those entries from the log during processing.
In general, we now know that every case of a client seeing the maintenance page, as well as other errors and freezes, is recorded in the log and, importantly, it has become easier to determine the cause of each error.
Useful log reports
Now we can analyze how our customers were served over the day. Note: the script listings below are tied to a specific log-file format, so they will most likely need to be adapted to your project, which should not cause any difficulty.
Histogram of HTTP responses on the frontend and backend
Total hits: 265532
200 : 264654, 99.67% ***************************************************************************************************
207 : 34, 0.01%
302 : 830, 0.31%
401 : 7, 0.00%
403 : 1, 0.00%
404 : 1, 0.00%
500 : 16, 0.01%
#!/bin/bash
TOTAL=`cat /var/log/nginx.access.log | wc -l`
echo "Total hits:" $TOTAL
cat /var/log/nginx.access.log | awk '{ hits[$14]++; } \
END { for (z in hits) { printf("%6s: %8s, %6.2f%% ", z, hits[z], hits[z]/total*100); s=""; a=0; while (a++ < int(hits[z]/total*100)) s = s "*"; print s } }' \
total="$TOTAL" - | sort -n
A similar tiny bash/awk script can be adapted for the apache / php-fpm logs. We see that over 99% of the hits were served successfully (code 200); however, there were 16 errors that the developers need to investigate (in the PHP logs) and eliminate: these are 16 maintenance pages shown to customers :-). Our goal is zero 50x response codes.
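As an illustration of such an adaptation, here is a sketch of the same histogram for the php-fpm access log shown earlier (with -F'#' the response code %s is field 8; that field number is an assumption tied to our format, adapt it to yours):

```shell
#!/bin/bash
# status_histogram LOG: response-code histogram for a '#'-separated php-fpm log
status_histogram() {
    local total
    total=$(wc -l < "$1")
    awk -F'#' -v total="$total" '{ gsub(/ /, "", $8); hits[$8]++ } \
        END { for (z in hits) printf("%6s: %8s, %6.2f%%\n", z, hits[z], hits[z]/total*100) }' "$1" | sort -n
}
```

Call it as status_histogram /var/log/php.access.log; the output mirrors the nginx report above.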
Histogram of page execution times on the frontend and backend
Total: 386664
0 ms: 276203, 71.43% ***********************************************************************
100 ms: 81466, 21.07% *********************
200 ms: 13155, 3.40% ***
300 ms: 4282, 1.11% *
400 ms: 2183, 0.56%
500 ms: 1373, 0.36%
600 ms: 968, 0.25%
700 ms: 721, 0.19%
800 ms: 586, 0.15%
900 ms: 470, 0.12%
1000 ms: 398, 0.10%
#!/bin/bash
TOTAL=`cat /var/log/nginx.access.log | wc -l`
echo "Total:" $TOTAL
cat /var/log/nginx.access.log | awk -F'->' '{ $2 = $2 * 1000; zone = int($2/100)*100; hits[zone]++; } \
END { for (z in hits) { printf("%8s ms: %8s, %6.2f%% ", z, hits[z], hits[z]/total*100); s=""; a=0; while (a++ < int(hits[z]/total*100)) s = s "*"; print s } }' \
total="$TOTAL" - | sort -n
We see that the overwhelming majority of customer requests were served in under 500 ms. However, the 398 hits that took over a second need to be investigated thoroughly. Visually, it is convenient to look at the columns of asterisks (an asterisk is printed only for values > 1%) and focus on those.
Why pages or services can take several seconds:
- Caching of database query results is not used, or access to the database itself is not optimized.
- A heavy analytical query is being executed, for example counting the orders of certain goods over a certain period. In this case there is no error; it is better to exclude such requests from the analysis (for example, they may come from the admin domain admin.myproject.ru).
- The script code calls an external resource that is hanging (an RSS feed, external authorization, etc.).
The general rule here is to handle such "risky" external resources either in a separate process (cron) or asynchronously from the client's browser via ajax, caching the response results. That is, even if the external resource is unavailable, the web page should still be returned to the visitor quickly, in under a second (with parts of the page loaded carefully afterwards).
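The cron half of this rule can be sketched in a few lines of shell (the URL, cache path, and 5-second timeout below are assumptions for illustration):

```shell
#!/bin/bash
# refresh_cache URL CACHE_FILE: fetch a slow external resource with a hard
# timeout and atomically refresh a local cache that the page scripts read.
refresh_cache() {
    local url="$1" cache="$2" tmp="$2.new"
    if curl -m 5 -s "$url" -o "$tmp"; then
        mv "$tmp" "$cache"    # atomic swap: readers never see a partial file
    fi                        # on failure the stale cached copy keeps serving
}

# called from a cron script every few minutes, e.g.:
# refresh_cache "http://feeds.example.com/news.rss" /var/cache/www/news.rss
```

The page script then only ever reads the local cache file, so a hung feed can never delay the page.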
Histogram of memory consumption by scripts (php-fpm)
Total hits: 265562
0 KB: 25, 0.01%
4000 KB: 16, 0.01%
6000 KB: 67094, 25.26% *************************
7000 KB: 123746, 46.60% **********************************************
8000 KB: 61102, 23.01% ***********************
9000 KB: 3453, 1.30% *
10000 KB: 1263, 0.48%
11000 KB: 890, 0.34%
12000 KB: 826, 0.31%
13000 KB: 917, 0.35%
14000 KB: 1129, 0.43%
15000 KB: 1125, 0.42%
16000 KB: 936, 0.35%
17000 KB: 798, 0.30%
18000 KB: 631, 0.24%
#!/bin/bash
TOTAL=`cat /var/log/php.access.log | wc -l`
echo "Total hits:" $TOTAL
cat /var/log/php.access.log | awk -F'#' '{ zone = int($11/1000)*1000; hits[zone]++; } \
END { for (z in hits) { printf("%8s KB: %8s, %6.2f%% ", z, hits[z], hits[z]/total*100); s=""; a=0; while (a++ < int(hits[z]/total*100)) s = s "*"; print s } }' \
total="$TOTAL" - | sort -n
It can be seen that most scripts of the web solution consume 6-8 MB of memory, which is not much. Often, however, when developing "in a hurry", you end up with scripts that consume 500 MB or even gigabytes of memory, which causes server hangs, swapping, and general destabilization of the system; clients may then "catch" a few maintenance pages. It is important to analyze this histogram and give the developers the task of keeping the memory consumed by pages within, say, 64 MB. Then the system will gradually come to consume a consistently predictable amount of memory, and you can be sure it will not suddenly stall for tens of minutes.
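The corresponding hard ceiling can also be enforced at the PHP level (a php.ini sketch; 64M matches the target above, pick a value suitable for your project):

```ini
; php.ini: hard per-script memory ceiling; a script exceeding it fails
; with a fatal error instead of driving the whole server into swap
memory_limit = 64M
```

Such failures then show up as 500 responses in the histograms above, where they are easy to spot and assign to developers.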
Maximum values
It is also useful to get statistics on:
- the slowest pages of the day
- the most memory-hungry pages of the day
#!/bin/bash
echo "Max today slow times in php access logs: "
cat /var/log/php.access.log | awk -F'#' '{ print $10" "$2" "$1" "$6" "$7" "$8 }' | sort -nr | head -20
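The "most memory-hungry pages" report can be sketched the same way (with -F'#' the peak memory %{kilo}M is field 11 in the format above; the field numbers are assumptions, adapt them to your log):

```shell
#!/bin/bash
# top_memory_pages LOG: the 20 requests with the largest peak memory usage,
# printed as "peak_kb request host"
top_memory_pages() {
    awk -F'#' '{ print $11" "$6" "$2 }' "$1" | sort -nr | head -20
}
```

For the daily report: echo "Max today memory peaks in php access logs: "; top_memory_pages /var/log/php.access.log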
More about hanging PHP scripts
php-fpm has a great feature: it can log scripts that run longer than N seconds, writing a backtrace to the log:
Mar 11 03:42:39 ec2-* [11-Mar-2012 03:42:38] [pool www] pid 26452
Mar 11 03:42:39 ec2-* script_filename = /var/www/html/index.php
Mar 11 03:42:39 ec2-* [0x0000000001971ea0] mysql_query() /var/www/html/bitrix/modules/main/classes/mysql/database.php:173
Mar 11 03:42:39 ec2-* [0x00000000019718a0] Query() /var/www/html/bitrix/modules/tasks/classes/general/task.php:1145
Mar 11 03:42:39 ec2-* [0x000000000196d078] __GetAllowedGroups() /var/www/html/bitrix/modules/tasks/classes/general/task.php:797
Mar 11 03:42:39 ec2-* [0x000000000196cb78] GetFilter() /var/www/html/bitrix/modules/tasks/classes/general/task.php:1442
Mar 11 03:42:39 ec2-* [0x0000000001968348] GetRootSubquery() /var/www/html/bitrix/modules/tasks/classes/general/task.php:730
Mar 11 03:42:39 ec2-* [0x0000000001967a08] GetFilter() /var/www/html/bitrix/modules/tasks/classes/general/task.php:1473
Mar 11 03:42:39 ec2-* [0x00000000019677b8] GetCount() /var/www/html/bitrix/modules/tasks/classes/general/task.php:1505
Mar 11 03:42:39 ec2-* [0x0000000001964998] GetCountInt() /var/www/html/bitrix/components/bitrix/tasks.filter/component.php:105
We often use this functionality, because it cuts the time needed to find the cause of a problem many times over.
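Enabling it is a matter of two pool directives (the directive names are real php-fpm settings; the threshold and path are example values):

```ini
; php-fpm pool configuration: dump a backtrace of any request
; that has been executing longer than the timeout
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm.slow.log
```

The backtraces, like the one above, usually point straight at the slow query or external call.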
Closing the loop between system administrators and developers
Having set up the collection of key statistics on hits by the online store's customers, you need to establish a permanent business process for refining and optimizing the web solution:
- Once a day, the statistics are sent to all participants in the process, and to you
- Whenever the predefined thresholds are exceeded, the search for and elimination of errors and the optimization of the web project's code begin automatically
As target criteria you can take:
- The number of times customers hit the maintenance page (i.e., the number of 50x errors): 0
- The share of hits in the public part served faster than 0.5 seconds: 100%
- The share of hits in the admin panel (where "heavy" analytical queries may run) served faster than 10 seconds: 100%, with an average admin hit time of 0.5 s
- The maximum amount of memory used by a script: under 128 MB; the average: 32 MB
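A daily check of the first two criteria can be sketched as follows (the field positions of the status code and request time depend on your log format, so they are passed as parameters; the function name is ours):

```shell
#!/bin/bash
# check_targets LOG STATUS_FIELD TIME_FIELD:
# prints the number of 50x errors (target: 0) and the share of hits
# served faster than 0.5 s (target: 100%)
check_targets() {
    awk -v sf="$2" -v tf="$3" '
        $sf ~ /^50[0-9]$/ { err++ }
        $tf < 0.5         { fast++ }
        { total++ }
        END { printf "50x: %d, under 0.5s: %.2f%%\n", err, fast/total*100 }
    ' "$1"
}
```

Run it from the same daily cron job that mails out the histograms, and compare the output against the thresholds.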
Results
By spending one or two days setting up the collection and processing of the key statistics of a high-load web solution, defining target indicators for customer service quality, and configuring the statistics mailing, you have created a permanent business process of monitoring and ensuring a consistently high quality of the project, which will certainly have a positive effect on the satisfaction of the online store's visitors and on conversion. If the indicators briefly leave the "green zone" because of a lack of computing power, ongoing development of the project's functionality, errors, and so on, all the services involved in developing and operating the solution will find out about it immediately, and the system will quickly return to a stable state :-), even if you are on vacation.
PS
In the next article we will look at how to organize online collection and analysis of the key indicators of a high-load web project using pinba, nagios, and awk. Sometimes it is simply unacceptable to learn only from the end-of-day statistics that customers saw 50,000 maintenance pages :-); you need to react within 5-10 minutes. Even in that case, however, daily statistics will be very useful. Good luck!