This talk is about what to do with a project after you have launched it. You have planned the architecture, thought through the infrastructure, worked out how to balance the load, and finally launched. What next? How do you support the project, keep it running, and make sure that nothing eventually falls over?
First, a little about myself. I am from ITSumma; our company does round-the-clock support and administration of websites. We currently employ 50 people, we were founded in 2008, and we support more than 1000 servers, which are visited by more than 100 million people every day.
We support both large and small projects, mostly Russian, and based on the experience of supporting them I would like to talk about what should be done.
If we talk about support, what does it consist of? In my opinion, three components:
project monitoring and response to incoming alerts;
backups and redundancy (we treat these as separate things);
the organization of the support service itself, i.e. the response to what is happening with the project.
If we talk about monitoring, we all understand that it is necessary. What you need to understand is how and why to use it. The use of monitoring can be divided into three stages, which companies typically arrive at one after another.
At the first stage, monitoring is viewed simply as an alert system for the state of the project. Alerting about problems on the server is the easiest step; notifications about problems in the application logic are harder, when we see from indicators on the site itself that something has gone wrong with the project; and hardest of all are alerts on problems at the level of business performance.
If we talk about the requirements for the alert system on any high-load project, then first of all the monitoring system must be independent of the project itself. It must not be hosted where the project is hosted and must not depend on it: if the project goes down, the monitoring system must not go down with it.
The monitoring system must inform you as quickly as possible about what is happening on the project. If an accident has happened, or it looks like one is about to happen, the notification must not arrive after a long delay; you should know as soon as possible. And you must be sure that you reliably receive this information, i.e. that when the critical thresholds you set are reached, or when some accident occurs on the project, you will actually get the corresponding notification.
Now, alert levels. The first level is problems on the server, where we look in a very basic way at physical indicators and indicators of the operating system and basic software. The second is problems at the application level, where we look at how the application's subsystems interact with each other and with external services, and what happens there. And the third level is problems at the level of business logic.
Problems on the server. The simplest things you can monitor there (a minimal sketch follows the list) are:
CPU load statistics, where we look at processor utilization itself and at the indirect indicator of that load, the load average;
statistics on the load on the disk subsystem;
statistics of memory usage and swap.
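For illustration, a minimal sketch of these three basic checks in Python. It assumes the psutil library; the thresholds are arbitrary placeholders, and any monitoring agent (Zabbix's, for instance) collects the same data.

```python
# Sketch of the three basic server checks, using the psutil library
# (an assumed dependency); thresholds are placeholder values.
import psutil

def check_server(cpu_limit=50.0, swap_limit=10.0):
    alerts = []

    # CPU load, plus the indirect indicator: load average.
    cpu = psutil.cpu_percent(interval=1)
    la1, la5, la15 = psutil.getloadavg()
    if cpu > cpu_limit:
        alerts.append(f"CPU at {cpu:.0f}% (limit {cpu_limit:.0f}%)")
    if la5 > psutil.cpu_count():
        alerts.append(f"load average (5 min) {la5:.2f} above core count")

    # Disk subsystem: cumulative I/O counters; real alerting would
    # track deltas between polls.
    io = psutil.disk_io_counters()

    # Memory and swap: growing swap usage is the early warning sign.
    swap = psutil.swap_memory()
    if swap.percent > swap_limit:
        alerts.append(f"swap at {swap.percent:.0f}%")

    return alerts, io

if __name__ == "__main__":
    alerts, _ = check_server()
    for a in alerts:
        print("ALERT:", a)
```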
Of all the indicators on the server that we could watch, these three are the first to tell us that something is wrong. CPU grows with traffic growth or non-optimal code. Load on the disk subsystem most often appears when interaction with the database is suboptimal. And RAM usage grows sharply, with memory going into swap, when traffic grows or, again, when there is some non-optimal code.
Statistics for server software. We set alerts for drops and spikes in the number of requests to the server. We also run external availability checks on the site, and here it is not enough to check only the page response time and the response code. Errors often happen on the development side where, instead of HTTP 502, the server returns HTTP 200 for a page that actually shows an error; for the monitoring system and for search engines it remains a valid, normal page that simply has the error text written in it. Therefore we recommend (a sketch follows the list):
in addition to the response time and response code, look at the size of the page response: it usually does not change much over time, and in a big crash the size often drops sharply;
with tools like CasperJS, which are now available, look at the actual page rendering time. Very often we see rendering on some external service degrade dramatically (over the last year the Cackle comments service sinned with this several times): the pages you check on the server are generated fine, but in the browser people see only a few of the pictures and the page never finishes loading.
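A sketch of such an external check, covering the HTTP-level part (status code, response time, response size); the URL and the expected size band are assumptions, and a rendering check in the CasperJS spirit would additionally need a headless browser.

```python
# Sketch of an external availability check: status code, response time,
# and response size. The URL and the expected size are assumptions.
import requests

URL = "https://example.com/"
EXPECTED_SIZE = 60_000   # bytes, learned from normal operation
SIZE_TOLERANCE = 0.5     # alert if the page shrinks by half

def check_page():
    problems = []
    try:
        r = requests.get(URL, timeout=10)
    except requests.RequestException as e:
        return [f"request failed: {e}"]

    if r.status_code != 200:
        problems.append(f"bad status code: {r.status_code}")
    if r.elapsed.total_seconds() > 3:
        problems.append(f"slow response: {r.elapsed.total_seconds():.1f}s")
    # An error page served with HTTP 200 usually differs sharply in size.
    if len(r.content) < EXPECTED_SIZE * SIZE_TOLERANCE:
        problems.append(f"page shrank to {len(r.content)} bytes")
    return problems
```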
Alerts for problems at the application level. First of all, you should monitor the number of application-level errors. Yes, there are logs that can be read, but on almost any project, sooner or later, there is not enough time to deal with the stream of error messages coming from the application, because new features have to be built. Instead of dealing with that stream line by line, one of the simplest methods is to monitor the number of such messages per minute. I.e. we normally see about 100-200 such messages per minute, and then, after some routine release, this number jumps to 2000-3000; we are not interested in any individual line, we are interested in the overall jump in the number of these alerts.
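A minimal sketch of this "messages per minute" idea; the window and spike factor are arbitrary assumptions:

```python
# Don't parse each error line; alert when the per-minute count jumps
# well above the usual baseline.
from collections import deque

class ErrorRateMonitor:
    def __init__(self, baseline_window=60, spike_factor=5.0):
        self.history = deque(maxlen=baseline_window)  # counts per minute
        self.spike_factor = spike_factor

    def observe_minute(self, count):
        """Feed the number of application errors seen in the last minute."""
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if baseline > 0 and count > baseline * self.spike_factor:
                return f"error spike: {count}/min vs baseline {baseline:.0f}/min"
        self.history.append(count)
        return None
```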
The number of calls to subsystems/nodes. Suppose you have a project that uses its caching system effectively and rarely goes to the database. It is possible and worthwhile to monitor the number of database calls — selects, inserts, updates, deletes — and then watch how they change. Without sharp jumps in traffic, this number stays roughly level. If something happens to the caching system, you will see a spike in such requests and realize that something is wrong and should be investigated. The same should be done for any subsystem, and then, based on the information you collect, you add further alerts.
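For MySQL, one way to get these numbers is the cumulative Com_* status counters; the delta between two polls gives queries per interval. A sketch, assuming the mysql CLI is installed and credentials come from a config file:

```python
# Track selects/inserts/updates/deletes via MySQL's cumulative Com_*
# counters; the delta between two polls is the rate to graph and alert on.
import subprocess, time

COUNTERS = ("Com_select", "Com_insert", "Com_update", "Com_delete")

def read_counters():
    out = subprocess.run(
        ["mysql", "-N", "-e", "SHOW GLOBAL STATUS LIKE 'Com_%'"],
        capture_output=True, text=True, check=True).stdout
    values = dict(line.split("\t") for line in out.splitlines() if line)
    return {name: int(values[name]) for name in COUNTERS}

prev = read_counters()
while True:
    time.sleep(60)
    cur = read_counters()
    rates = {n: cur[n] - prev[n] for n in COUNTERS}
    print(rates)  # feed these into the monitoring system, alert on jumps
    prev = cur
```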
If your site uses any external services — databases and services located on other servers, external APIs — you should monitor the interaction with them. Especially with external APIs. We all assume the problem is on our side and that the big companies do everything well, but our experience of interacting with clients' APIs tells us that bugs happen there very often and response times jump dramatically. You may assume your site has started to slow down, when in reality the interaction with some external API has degraded — one you never even suspected could degrade. Accordingly, we monitor the timing of the important requests we depend on, and when response times in those requests jump, we investigate whether our application has actually gotten worse.
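A sketch of timing such calls; the threshold, function names, and where the measurement is shipped are all assumptions:

```python
# Wrap important external API calls, record how long they take,
# and complain when latency jumps.
import functools, logging, time

log = logging.getLogger("external_api")

def timed_call(threshold_sec=2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                # Ship 'elapsed' to the monitoring system here.
                if elapsed > threshold_sec:
                    log.warning("%s took %.2fs", fn.__name__, elapsed)
        return wrapper
    return decorator

@timed_call(threshold_sec=2.0)
def fetch_partner_prices():
    ...  # the actual HTTP call to the external API
```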
There is a very cool thing that for some reason is done very rarely, which is surprising: monitoring of business logic.
When we monitor the server, we have a million indicators we could watch, and for any one of them we can think: "this one is not important right now, I will not put an alert on it, I cannot see how it could become critical or what I would respond to." And sooner or later, when that indicator fires, it may turn out that you have no alert for exactly that accident, and the accident has happened. In the end, all those millions of indicators boil down to problems with what the application exists for: starting, relatively speaking, from the amount of money earned per day and the number of users who performed some action, down to simple things like the number of purchases on the site, the number of orders placed, the number of posts published. All of these are quite easy to monitor, especially if you are a developer, and to put alerts on. Then, even if we missed something in server and software monitoring, even if the hardware started to slow down and users began dropping off somewhere in their actions, we understand that this drop in traffic or in actions happened for some reason, and we can go to the graphs ourselves, investigate the causes and figure it out.
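A sketch of such a business-logic check: compare orders in the last hour with the same hour a week ago. The `orders` table, its schema, and the drop factor are all assumptions about the project.

```python
# Business-logic check: alert if orders per hour fall well below the
# same hour a week ago. Table name and schema are assumptions.
import sqlite3

def orders_in_window(conn, hours_back_start, hours_back_end):
    row = conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE created_at >= datetime('now', ?) "
        "  AND created_at <  datetime('now', ?)",
        (f"-{hours_back_start} hours", f"-{hours_back_end} hours"),
    ).fetchone()
    return row[0]

def check_orders(conn, drop_factor=0.5):
    now = orders_in_window(conn, 1, 0)
    week_ago = orders_in_window(conn, 169, 168)  # same hour, 7 days back
    if week_ago and now < week_ago * drop_factor:
        return f"orders dropped: {now}/h now vs {week_ago}/h a week ago"
    return None
```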
The second thing you should do is emulate user-level application logic. We often monitor server things, and we monitor individual scripts for what they return. But consider, for example, login on the site: a user comes and registers, fills out a form, receives an email with a confirmation link, clicks the link and gets into the site. That is at least five places where something can break. If we monitor each of these five places separately, we still are not sure that the whole sequence works as a whole. Writing a script that goes to the site, fills out the form and posts the data, then a script that checks the mailbox and clicks the link it finds, then simply verifying that the session cookie was set — technically this is about two days of work, and the control you get out of it significantly exceeds the cost of writing the script. So I consider this thing mandatory for most critical functions of a site.
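A sketch of such an end-to-end check. Every URL, form field, and the test IMAP mailbox here is an assumption about the project; the point is only the shape of the script.

```python
# End-to-end registration-flow emulation: form post, confirmation
# email, link click, cookie check. All names here are assumptions.
import imaplib, re, time, requests

BASE = "https://example.com"

def check_registration_flow():
    s = requests.Session()

    # 1-2. Open the form and register a throwaway address.
    addr = f"monitor+{int(time.time())}@example.com"
    r = s.post(f"{BASE}/register",
               data={"email": addr, "password": "secret"}, timeout=10)
    r.raise_for_status()

    # 3. Wait for the confirmation email in a test mailbox.
    time.sleep(30)
    imap = imaplib.IMAP4_SSL("imap.example.com")
    imap.login("monitor@example.com", "mailbox-password")
    imap.select("INBOX")
    _, nums = imap.search(None, "UNSEEN")
    _, data = imap.fetch(nums[0].split()[-1], "(RFC822)")
    body = data[0][1].decode(errors="replace")

    # 4. Extract the confirmation link and follow it.
    link = re.search(r"https://example\.com/confirm\S+", body).group(0)
    r = s.get(link, timeout=10)
    r.raise_for_status()

    # 5. Check that the session cookie was actually set.
    assert "session" in s.cookies, "login cookie was not set"
```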
Another thing we see done not quite correctly in monitoring is alerts set at critical thresholds that are already close to a terrible disaster. People set alerts on CPU utilization at 99% for 5-10 minutes. At the moment such an alert reaches you, there is practically nothing you can do: the server is already overloaded, the application has already fallen over, and you have to decide in a rush, in the middle of a fire, what to do. Some time after launch, once you can analyze the nature of your traffic, you can understand the nature of the load. For example, if your processor normally runs at 30%, you do not need to set the alert at 90% or 99%; you need to pick key points at which new decisions become necessary. Say, an alert at 50%: you know you are used to a 30% load, and now it is 50%. That is not scary from the point of view of the server's operation, but it is time to think about further growth and about when you will reach 70%, and so on. In such cases you will have enough time to decide what to do: whether you can still live with it, whether it is time to rethink the architecture, maybe buy new hardware, or maybe something has recently happened to the code that made it respond slower and needs fixing.
Monitoring as an analysis system. If earlier we said that monitoring is an alert system, then when choosing a monitoring system we must understand that it is not only an alert system but also an analysis system, which:
a) must store as much data as possible;
b) should provide the ability to quickly select this data for the desired period;
c) be able to quickly display them in the form we need.
Among other things, such a system should be able to compare data of the same type across different servers, if we are talking about a multi-server setup. If we have five servers of the same type and look at each one in monitoring separately, we may not notice that one server has started to deviate from the others by 30%; if we group the statistics across all five servers, we will see it.
The monitoring system should be able to compare current data with historical data, because a slight decline or a slight growth over time can easily go unnoticed when you just eyeball the graph. If we draw two lines — one current and one, for example, from a month ago — we will see the real difference and can understand what is happening.
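If the monitoring system is Graphite, this overlay can be had from its render API with the timeShift function; a sketch, where the host and metric name are assumptions:

```python
# Overlay the current series with the same series shifted back a month,
# via Graphite's render API. Host and metric name are assumptions.
import requests

params = {
    "target": [
        "site.orders_per_min",
        'timeShift(site.orders_per_min, "30d")',  # same line, a month ago
    ],
    "from": "-7d",
    "format": "png",
}
png = requests.get("https://graphite.example.com/render", params=params).content
open("orders_vs_last_month.png", "wb").write(png)
```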
Fast sampling of a large range of historical data. The implication is that a huge amount of data is actually being recorded, and monitoring is itself a high-load application. If the monitoring system stores this data, it must also be able to get it back out.
In a talk at RootConf, a colleague and I looked at examples of the monitoring systems currently available, and if we talk about what can be used, then more or less ready are the classic Zabbix and Graphite. But here is what is interesting about that third point, fast sampling of a large range of historical data: among our clients roughly 70% sit on Zabbix and about 20% on Graphite, and almost every node running Graphite suffers from being nearly 100% loaded on CPU. The system is good at recording data and good at displaying data, but there are regular problems with the monitoring simply being unable to draw what people want from it. I.e. the system is supposed to do this quickly, but you cannot get the data out quickly.
An additional thing that is very useful when choosing a monitoring system is complex aggregation of metrics. Some metrics on a project are fairly noisy, and this noise has to be smoothed somehow: take a moving average, look at percentiles of requests (within what time were 99% of requests served), group requests by minimum time, and so on.
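A minimal sketch of the two aggregations mentioned, using only the standard library:

```python
# A moving average to smooth noise, and a 99th percentile over
# request times.
import statistics
from collections import deque

def moving_average(points, window=10):
    buf, out = deque(maxlen=window), []
    for p in points:
        buf.append(p)
        out.append(sum(buf) / len(buf))
    return out

def p99(request_times):
    # quantiles() returns 99 cut points for n=100; index 98 is the 99th.
    return statistics.quantiles(request_times, n=100)[98]

times = [0.12, 0.10, 0.11, 0.95, 0.13] * 40
print(moving_average(times)[:5], p99(times))
```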
In addition, monitoring should be thought of as a decision-making system.
I.e. we have monitoring as an alert system, and we have monitoring as a system for analyzing why a problem happened and how to avoid it; and for that to succeed, we must use the monitoring system as a tool that allows us to:
a) understand how the accident happened and what to do so that it does not happen again;
b) look at how the system will evolve in the future;
c) see how to avoid the mistakes that have already been made.
If we talk about affordable solutions — ones that genuinely work for a small project and do not cost too much to implement — I believe you should use SaaS services. Three such services are listed on the slide; you do not have to use exactly these, they are all about the same. For a small number of servers they are inexpensive, and most importantly, they will not eat your time on integration.
Complicated data collection. Here I would look at Graphite, but you will have to suffer quite a bit deciding how to store your data in it. If you have a very complex system and very complex data, you can think about building your own system or heavily customizing an open-source one, but you need to understand that this is a very, very large investment in the development of that system.
We use our own system, because historically our infrastructure is very heterogeneous and we have high demands on how it should be maintained and how problems should be distributed within the company. We have been developing it since 2008; two developers work on it constantly, and we get very many requests from our admins on how to evolve it further. The only reason we use our own thing rather than Graphite or Zabbix is that we understand exactly what we can get out of the monitoring system, how to modify it, and whether a given change is possible. Beyond all that, you need to understand that you get used to a monitoring system: from the moment you choose one, the longer you use it, the less you will want to get off of it later, even if it no longer makes you happy.
Now to the second part. Let's talk about backups and redundancy.
First of all, about backups. Among our clients, and among those we see, this very often looks like a simple thing that works by itself and is not even worth thinking about. However, with backups you need to understand that:
Backup creates load on the server.
For your project, the frequency of backups can matter a lot. Suppose a huge number of users constantly perform some action; say it is a service where people process photos for money. If your backups run once a day, the data is lost, and you restore to a state from 20 hours ago, you will still get a huge number of problems, because users will say: "OK, but where is my data for the last 20 hours?"
You need to understand that recovery time from a backup also matters. You can make a dump quickly, but in some cases restoring from that dump can take a very long time.
The backup process is itself a heavy procedure. Because of this, backups are best made from dedicated backup machines, without putting load on the production machines.
You need to understand what exactly you need to back up. A classic pattern: back up the database often, "but we have a lot of user-generated static content, we will just copy that once a day." But what is the point of backing up the database every few hours if, when you restore from it, you will not have the static content, and users will be just as unhappy as before? They will leave, relatively speaking. You have a second Vkontakte: the database went down, the server went down, you restored a backup copy of the database, and you have links to the photos, but no photos themselves...
The classic thing: a backup without a regular recovery procedure is not a backup. Very often we see people who have set up a backup, upload it somewhere, and think that this is enough. In four cases out of ten, the backup copy that exists cannot be restored from without checks. The organization should have a regular recovery procedure, a regular drill plan, according to which at a given date you restore from a backup copy and check how long it takes, how up to date your data turns out to be, and how well the site works after the restore.
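A sketch of such a drill: restore the latest dump into a scratch database and run a sanity query. The paths, the mysql CLI, the credentials source, and the `orders` table are all assumptions.

```python
# Regular restore drill: restore the latest dump into a scratch
# database and sanity-check it. Paths and schema are assumptions.
import subprocess, sys

DUMP = "/backup/db/latest.sql.gz"

def restore_drill():
    subprocess.run(["mysql", "-e",
                    "DROP DATABASE IF EXISTS restore_test; "
                    "CREATE DATABASE restore_test"], check=True)
    # Stream the compressed dump straight into the scratch database.
    with subprocess.Popen(["gunzip", "-c", DUMP], stdout=subprocess.PIPE) as gz:
        subprocess.run(["mysql", "restore_test"], stdin=gz.stdout, check=True)
    # Sanity check: the restored data should not be empty or ancient.
    out = subprocess.run(
        ["mysql", "-N", "restore_test", "-e",
         "SELECT COUNT(*), MAX(created_at) FROM orders"],
        capture_output=True, text=True, check=True).stdout
    print("restore OK:", out.strip())

if __name__ == "__main__":
    try:
        restore_drill()
    except subprocess.CalledProcessError as e:
        sys.exit(f"RESTORE FAILED: {e}")
```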
Backup copies of databases.
Use the slave server only as a reserve in case the master server fails. There is the classic story of the fall of Sberbank a few years ago, when they used the slave server as a backup and did nothing else. There were no backups, there was only the slave; but you need to understand that the slave replays all the master's statements, and if a person comes to the master server and says "delete everything," the slave will receive exactly the same command.
On some databases (I cannot speak for Postgres; I have seen this on MySQL) it is possible to make a slave with delayed replication. This is a small hack that sometimes helps: you make a slave delayed by, say, 30 minutes, and then, when you realize you accidentally ran a DROP on the master database, you have time to abruptly stop the replica and switch to the slave instead of going through a long restore from a backup.
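In MySQL 5.6 and later this is built in via MASTER_DELAY (older setups used the pt-slave-delay tool from Percona Toolkit). A minimal sketch, run against the slave:

```python
# Configure a slave to lag 30 minutes behind the master (MySQL 5.6+).
import subprocess

subprocess.run(
    ["mysql", "-e",
     "STOP SLAVE; "
     "CHANGE MASTER TO MASTER_DELAY = 1800; "  # seconds of delay
     "START SLAVE;"],
    check=True,
)
```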
Hot-backup servers for regular external backups. I.e. we have a slave to protect against accidents, and a hot backup to protect against the human factor.
We store at least one copy within the same site where we are located, so that it can be downloaded quickly, and at least one copy at an external site, for recovery in case of a global accident.
You cannot keep the only copy in the same place where you expect the accident to happen, because when the data center at Hetzner goes down completely, you cannot recover if it stays down for several days. All those days, even with correct backups and correct procedures, you will just be waiting for the hosting to come back up. And you need to understand that accidents at hosting providers are not always a question of 10, 15, 20 minutes. Amazon has had four accidents over the past five years; the longest lasted 48 hours, the next 24 hours, the next 16 hours, the next about 10 hours. Each time they said "we are terribly sorry, we will refund the hosting fees for this period," but clearly that is not comparable to the actual losses. So ideally keep backups somewhere else entirely, where you know that if this platform falls, you can still recover.
Content is a difficult topic, especially if you have a lot of it. If you do not have much, do regular rsync or snapshots. If you have a lot, it is best to implement duplication right at the application level. Say a user uploads or processes a photo or some file. After the content has been served back to the user, make sure the file is duplicated to the backup copy. Then you will not need to regularly copy the entire pile of changes once a day or once an hour. It is one simple action, invisible to the user, because the content has already been served to him; the file is simply pushed in the background to another place, another site, independent of the main one.
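A sketch of this application-level duplication: answer the user first, then copy the file to the backup site in a background worker. The rsync destination is an assumption.

```python
# Answer the user first; copy the uploaded file to the backup site
# in the background. Destination host and path are assumptions.
import queue, subprocess, threading

copy_queue: "queue.Queue[str]" = queue.Queue()

def backup_worker():
    while True:
        path = copy_queue.get()
        # -a preserves attributes; failures here should be logged/retried.
        subprocess.run(["rsync", "-a", path, "backup.example.com:/content/"],
                       check=False)
        copy_queue.task_done()

threading.Thread(target=backup_worker, daemon=True).start()

def handle_upload(path: str):
    # ... save the file and return the response to the user first ...
    copy_queue.put(path)  # duplication happens invisibly, after the response
```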
And a very frequent thing we see: the database is backed up, the content is backed up, and the configs are not. Then the data center crashes, the server crashes, we restore from backups — we have the code, we have the static content, everything is great — but we have a very, very complex web server configuration, and for the next few hours the project team is busy reconstructing the config from memory or trying to revive a config from two years ago.
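A minimal sketch of closing that gap: archive the configs and push them offsite on a schedule. The directories and the destination are assumptions.

```python
# Back up the configs too: archive them and push the archive offsite.
# Directories and destination host are assumptions.
import subprocess, time

CONFIG_DIRS = ["/etc/nginx", "/etc/mysql", "/etc/cron.d"]
archive = f"/backup/configs-{time.strftime('%Y%m%d')}.tar.gz"

subprocess.run(["tar", "-czf", archive] + CONFIG_DIRS, check=True)
subprocess.run(["scp", archive, "backup.example.com:/configs/"], check=True)
```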
There are backups, and there is redundancy: these are different things.
Choosing a site for the reserve.
The reserve site must not be tied to the current data center. We very often see a backup server placed right next to the main one — relatively speaking, in the same Hetzner, with one machine in the rack and a second machine in the same rack acting as the reserve. Yes, that protects against a server failure, but it does not protect against a failure of the data center as a whole. Also, when choosing a reserve site, people usually want to make it cheaper than the main one, because it receives no traffic. You need to understand that after switching to the reserve you may be stuck on it for a long time. And if you can live on it for an hour but could not live on it for a day or more, it is better to think hard about whether to set it up that way.
A thing people often use is the so-called hybrid cloud. You have an iron platform serving production, and in the clouds you keep a minimal configuration that does nothing except accept replication from the production site. If an accident happens on the production site, the cloud configuration is scaled up to match it, and traffic is switched over. During a long outage this will be somewhat more expensive than iron production, but in regular operation, when there are no accidents, you save money on the reserve and at the same time do not run the risk of problems during switchover.
Important things about redundancy. First, you need to know how long the switchover will take you. Second, how consistent the data synchronized to the reserve is. And third, you need to understand the risk of downtime during the switchover itself.
And a very, very unpleasant procedure that people do not like: checking recovery from backups is simple enough in principle, but you also need to regularly test the possibility of switching to the reserve site. You cannot build a reserve site and just wait for the moment when you have to switch to it. There are a million reasons why the reserve may not work; the simplest is that something was simply not planned for when the reserve architecture was designed. If we switch without a crash, at night or at some quiet time, and something goes wrong and the project falls, at least it is a conscious, controlled risk. That is better than discovering, after switching during a real accident, that something is missing: some files are not synchronized, some database is lagging behind, and so on.
And it is important to understand that none of these things work if there is no team that will react to them and do them.
What should this team be like? The team must be inside your company. Parts of it may be external, but there must still be people inside who understand how the whole thing works and what it should look like. They must understand the product from the inside and be able to localize a problem. They must be genuinely involved, simply because, by Murphy's law, accidents most often happen at night, and whatever time zone you are in, when a person wakes up to a call from a manager or a partner, or to an SMS about an accident, he must not blow it off and just hang up; he must understand that trouble is happening and his help is needed.
None of this works without organization. Yes, the team exists, but you need to understand that in the support process the team very often works in chaotic conditions. An accident arrives that has never happened before, and it gets fixed. You then need to understand how this accident happened, how it was fixed, and how to make sure it never happens again. If we fixed the accident but did not understand how, we actually did not fix it and will get another one just like it. And if we had an incident that was not covered by monitoring, the first step is to add the corresponding server or application check to monitoring, so that this kind of accident cannot sneak up again.
Such teams dislike bureaucracy at the moment they are fighting a problem, but afterwards you need to push to formalize what was done, to formalize all these procedures.
How does supporting such a thing work at the launch of a project? At the start, it is simply SMS to key people and phone calls to whoever is awake and watching how it all works. Then it can become a duty rotation within the team, with additional people. A "two days on, two days off" schedule of 12-hour shifts works at first. But people will not last long on it, simply because those working the night shift may well have wives, children, families. And when they work from 8 in the evening to 8 in the morning (I say this from life experience), at 10 or 11 in the morning those same wives, children, families and friends start calling and saying: "Vasya, let's go out." And at that moment it is very hard to sleep and to realize that you simply have no life left.
From experience: everyone who receives alerts must have the phone numbers of at least three key people to call and report that there is a problem. One phone may go unanswered, two phones may go unanswered, but I do not remember a case where nobody picked up out of three phone calls.
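The escalation rule itself fits in a few lines. In this sketch, place_call() is hypothetical; whatever telephony or paging service you actually use goes there, and the numbers are placeholders.

```python
# Call down the contact list until someone answers.
CONTACTS = ["+7-xxx-admin-1", "+7-xxx-admin-2", "+7-xxx-team-lead"]

def place_call(number: str, message: str) -> bool:
    """Hypothetical: dial the number, return True if the call was answered."""
    raise NotImplementedError

def escalate(message: str) -> bool:
    for number in CONTACTS:
        if place_call(number, message):
            return True   # someone picked up and owns the problem now
    return False          # nobody answered: page the next tier
```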
Time spent on servers. You may have 20 servers and be spending much more time on one of them than on the others. The team itself may not see this, you may not see it, but on that server there is clearly something worth looking into.
A thing we started doing recently is recording SSH sessions. This is not for control, but in order to understand, after an accident, what the cause was. Instead of trying to remember what you were doing in a panic for the last half hour while performing a surgical operation on the server, you can simply look back, find what was done, write it down, formalize it and understand it.
More about the work, this time from the category of jokes: if several key people fly on the same plane, the servers will definitely fall. Make sure that at least one of the key people always remains within reach of the servers.
People who receive the SMS should ideally have at least two ways to communicate: take phone calls, carry a laptop, and so on. Cellular coverage does not work everywhere, phones get lost, a phone can break; two SIM cards are already more reliable protection. Ideally, set up forwarding from those SIM cards so that if a person is unreachable, the call goes to someone else who will respond. And last: the life of such people is hard, so love and appreciate them.
Instead of conclusions.
Just making a project and launching it does not work. Launch is only the beginning: after it you need monitoring, backups and redundancy, and an organized team that will respond to what happens — otherwise sooner or later something will fall and stay down.