
The better the development process, the less often performance problems appear in a release. On the other hand, they cannot be avoided completely, for the banal reason that assumptions were made during development about the operating conditions of the web service, and life constantly makes its own adjustments.
A lot depends on how often such problems appear and how quickly they are fixed: user satisfaction with the service, the developer's reputation, and so on. So how can you deal with performance problems?
One scenario is a passive reaction, i.e. solving problems as they arrive. In this case, the support service accumulates complaints until a “critical mass” is reached and then brings in the developers. The developers spend some time finding and fixing the problems, after which the web service processes requests quickly again.
The main disadvantage of this option is that users get to fully “enjoy” the performance of the web service before the developers take it up. And even then the problem still has to be found. Another drawback is that the developers have to work to a deadline of “yesterday”, which is not pleasant either.
Another option is a pseudo-active approach. Some utilities are set up to monitor anything and everything, and then the chief shaman with a tambourine looks at the charts from time to time and tries to spot a problem from them.
This option is not much different from the first, because the shaman gets tired of looking at boring graphs and numbers, and more often than not it all comes down to the same first option. But even if the shaman manages to recognize the problem before the avalanche of complaints, time is still needed to find and fix it.
A different scenario is needed, one that lets you keep a finger on the pulse at minimal cost and quickly localize the source of problems.
Proactive monitoring
Proactive monitoring means two things:
- Active notification of problems, whether by phone, SMS, Jabber, ICQ or email;
- A clear action plan for localizing and fixing problems.
It is convenient to build notifications on top of a ready-made monitoring system, of which there are many. We will not consider the homegrown ones, which are best left to enthusiasts. Among the professional ones, I would like to mention cacti, nagios and zabbix. However, only the dearly and mutually beloved Zabbix is suitable for notification (or did I miss something in cacti?), and nagios is not well suited to storing historical data, which is very useful for analyzing problems.
Using alerts will take a serious load off the local shamans, because instead of staring at hundreds of graphs all day long, it is enough to carry a phone in your pocket. The only question is what to monitor and how.
If we are talking about performance, and we are not touching on other topics here, then it is enough to monitor just one parameter: the average response time. If it exceeds the value specified in the requirements for the web service, then we have a problem that must be dealt with, and quickly. To be ready for that, spend a little time in advance planning how you will localize problems.
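As a rough illustration of the single-parameter check described above, here is a minimal Python sketch. The function names, the sample values and the 0.5 second limit are assumptions made up for the example; in practice the samples would come from your monitoring system and the limit from the requirements for the web service.

```python
from statistics import mean


def average_response_time(samples):
    """Return the mean response time in seconds, or 0.0 if there are no samples."""
    return mean(samples) if samples else 0.0


def check_response_time(samples, limit_seconds):
    """Return (is_problem, average): is_problem is True when the average exceeds the limit."""
    avg = average_response_time(samples)
    return avg > limit_seconds, avg


# Hypothetical measurements (seconds) and a hypothetical limit from the requirements.
recent = [0.12, 0.34, 0.95, 1.40]
is_problem, avg = check_response_time(recent, limit_seconds=0.5)
if is_problem:
    # A real setup would send a notification here instead of printing.
    print(f"ALERT: average response time {avg:.3f}s exceeds the 0.5s limit")
```

The point of the sketch is only that the alert condition is a single comparison; everything else (collecting the samples, delivering the notification) is the job of the monitoring system.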
The plan should begin with the preparation of a request processing scheme. An example of such a scheme could be:

There are two types of components in this diagram: mandatory and optional. Mandatory ones, such as the web server and the handler, always take part in processing a request and are drawn with a solid line. Components such as the file system, memcached and MySQL may be optional, so they are drawn with a dashed line.
With such a scheme in hand, you should write down a list of possible problems for each component and the procedure for fixing them, so that developers need to be involved only in exceptional cases. An example of such a list:
| # | Symptom | Cause | Reaction |
|---|---|---|---|
| 1 | The average wait time for a free handler has exceeded the allowed value | Influx of users | Horizontal scaling |
| 2 | The average time to read / send a request has exceeded the allowed value | Network capacity exceeded | Change of tariff plan |
| 3 | The average request processing time has exceeded the allowed value | Problems with the code, database, etc. | Time to involve the developers |
In fact, the third item consists of many sub-items, some of which are resolved administratively, and only in rare cases should developers be involved.
The advantage of having such a plan is obvious: most problems are solved, if not immediately, then within a predictable time. However, it takes real work both to draw up the plan and to adapt the system, but more on that later.
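The table above can be read as a simple lookup: measure the average time of each phase, compare it with its limit, and take the reaction from the matching row. The sketch below shows one possible way to encode that in Python. The metric names (`handler_wait`, `network_io`, `processing`), the numbers and the `diagnose` helper are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    metric: str    # hypothetical name of the measured phase
    cause: str     # likely cause, taken from the table
    reaction: str  # planned reaction, taken from the table


# One entry per table row; the metric names are assumptions.
RUNBOOK = [
    RunbookEntry("handler_wait", "Influx of users", "Horizontal scaling"),
    RunbookEntry("network_io", "Network capacity exceeded", "Change of tariff plan"),
    RunbookEntry("processing", "Problems with the code, database, etc.",
                 "Time to involve the developers"),
]


def diagnose(averages, limits):
    """Return the runbook entries whose average time exceeds its allowed limit."""
    return [
        entry for entry in RUNBOOK
        if averages.get(entry.metric, 0.0) > limits[entry.metric]
    ]


# Hypothetical average times and limits, in seconds.
averages = {"handler_wait": 0.40, "network_io": 0.05, "processing": 0.10}
limits = {"handler_wait": 0.10, "network_io": 0.20, "processing": 0.30}
for entry in diagnose(averages, limits):
    print(f"{entry.metric}: {entry.cause} -> {entry.reaction}")
```

With these made-up numbers only the handler wait time is over its limit, so the planned reaction is horizontal scaling and no developer is needed.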
To be continued...