This text is a transcript of a rather old talk by Nikolai Sivko that has not lost its relevance.

I would like to start with why we monitor at all, and only then move on to the content. The motivation is simple: if our site (HeadHunter) is down for one minute, we estimated that it affects about 30 thousand users on a weekday afternoon. For comparison, that is 15 audiences of this year's HighLoad++ — Oleg said there are about 2,000 people here, so it is 15 such rooms. The job of an ops engineer, in my opinion, is not to optimize monitoring but to make the website work, i.e. to solve a business problem. So what tasks need to be solved?

- The first is to find out that something has broken, i.e. "good" / "bad" — you should be able to tell these two states of the system apart.
- The second is to quickly find out where exactly it broke, so you can run and fix it quickly.
- After you have made changes and brought everything back up, you need to verify that the system works the same way it did before.
- There are also tasks that are not as urgent as digging through incidents. One is capacity planning, i.e. understanding how much resource headroom you have left to serve more and more users.
- Another is planning optimizations to free up those resources. When you see that your hardware is about to run out, you either buy more, or you realize that something can be tightened up, and then you will not have to buy hardware.
- And, no less important, checking that your optimization actually worked.

Let me narrow the scope a little. We will talk about which metrics you need to look at in order to understand your system, and about charts: how to draw them properly so that everything can be understood quickly, because on a chart with 10 thousand lines nothing is understandable.

We will not talk about how monitoring itself is built: how to set it up, how to shard it, what proxies there are, and so on. We will not talk about specific alerts and methodology, i.e. workflow, who is responsible for what, which KPIs are used... I gave a talk about that at the spring conference, there is a link; the slides make it more or less clear.

So let's go. What are we dealing with? Simplified, HH.ru looks like this. At the top is the browser — our user, who comes to look for a job. The request lands on the frontends: nginx serves static content, caching as much as possible, and then the request goes to the server that assembles the page, which we call the front. Since we have SOA, the page has to be assembled from pieces, and that is exactly what the front does: it makes requests to different services, renders templates and returns the page to the user. The front goes to the services, but since everything has to be scalable and fault tolerant, it goes through balancers; our internal balancers are nginx + haproxy. Next, figuratively speaking, I drew three different services — in reality there are about 40-50. These services implement some business logic and go to their own storage: the vast majority is postgres, they also talk to memcached, we even have a little cassandra, and so on.

What do we want to understand? We want to understand how the user sees our site. And since we have many users, we want to see everything at once: who sees what, which percentage sees it badly and which sees it well. We want to see what is happening in each subsystem, because if something breaks, you physically cannot read 40 different logs, each with 3-10-20-30 replicas.
We also need to understand what resources are being spent on, because monitoring the state of an application essentially means understanding where control is inside the application at any given moment. And since we want historical data, we need to mark out some stages in the application, etc. — I will tell you more about this later. A "good" / "bad" light bulb is not enough to understand all of this: we need everything over time, because we must be able to compare with how it was an hour ago, a minute ago, last Monday, last Thursday, two weeks ago.

I promised to go layer by layer, so let's go layer by layer. The first layer is the browser. What can we get from the browser? We can collect all sorts of timings: how long DNS resolution took, how long establishing the network connection took (TCP connect, TLS handshake), how long the request was being sent to the server, how long the server thought, how long the data transfer took, how long pictures and other static assets were loading, and finally how long the code running in the browser took — all of this can be collected.

I am not a front-end developer, so I will only touch on this layer briefly. There is the Navigation Timing API. Practically all front-enders know about it, and I believe there is a talk about it at this conference. The bottom line is that we take the timings from the browser and fire them off with a GET request to our server. Nothing needs to be processed in realtime on the server: just configure nginx to return a 204 status, the whole GET request with its URL ends up in the access log, and you parse it later.
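To make the collection side a bit more concrete, here is a minimal sketch of the offline parsing step in Python. The /stat beacon path, the query parameter names and the access-log layout are assumptions for illustration, not HH's actual setup.

```python
# A minimal sketch: parse browser timing beacons out of an nginx access log.
# Assumes the beacon is a GET to /stat with the Navigation Timing values
# passed as query arguments, e.g.
#   "GET /stat?dns=12&connect=40&ttfb=180&response=320&dom=900 HTTP/1.1"
from urllib.parse import urlparse, parse_qs

def parse_beacon(log_line: str) -> dict:
    """Extract timing fields from one access-log line, or {} if not a beacon."""
    try:
        request = log_line.split('"')[1]          # the quoted request part
        method, url, _ = request.split(' ')
    except (IndexError, ValueError):
        return {}
    parsed = urlparse(url)
    if method != 'GET' or parsed.path != '/stat':
        return {}
    # keep only numeric timing values, in milliseconds
    return {k: float(v[0]) for k, v in parse_qs(parsed.query).items()
            if v and v[0].replace('.', '', 1).isdigit()}

with open('/var/log/nginx/access.log') as log:
    for line in log:
        timings = parse_beacon(line)
        if timings:
            print(timings)   # here you would ship them to your metric storage
```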

As a result, we get something like this. Don't ask me why the green line is at 20 seconds. But at least we see metrics that are interesting to an ops engineer: how the server behaves from the point of view of a particular browser, and how the channel behaves, i.e. how long the transfer takes. We see that everything takes an acceptable time and there are no spikes — that's it, our task here is solved. Beyond that, ask the front-end developers which metrics they need and draw them for them, or they will draw them themselves — that is not the point.

Next come the frontend layers. In the nginx log we see everything that reached our data center. Why might a request not reach us? It might not arrive because your domain does not resolve from some location, or simply because there is no connectivity. In that case something can still be done — again, front-enders know about it: there is a mechanism for reporting such failures to some other host. I will not describe it precisely, but it exists, and knowing that is enough.

What do we want to get at this layer? A complete picture of how the site works. Are there errors or not? If there are, how many, and what share of users do they affect — one user, 10%, 100%? Is the site fast or slow? And again, at what scale: is it slow for one user, for everyone, or only for everyone from Voronezh? How many requests per second do you have? There are cases when bots come and your rps grows — you need to know that, it helps you reason about your system. Or your channel went down — there is a dip, nothing is coming in. Or a third of the traffic dropped because Rostelecom dug something up near the border with Ukraine... And it happens that the channel is degraded, there are delays or losses somewhere — we would like to be able to answer that question too.

All of this is in the nginx log — but not in the standard log format. You need to add request_time there: the time from receiving the first byte of the request from the client to sending the last byte of the response into the client socket. So this metric to some extent also reflects the characteristics of the channel to the client — fast or slow — because we still have users on slow connections. Also, if you want to separate channel problems from backend problems, there is upstream_response_time: the time spent waiting for the backend (one or several, it does not matter). It does not include any delays on the way to the client; it is just the time spent computing the page. There are also various optional useful fields — I will not read them out, they are here just so you see what is available.
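As an illustration, here is a minimal sketch of turning such a log into the "traffic light" buckets described just below. The log_format and field positions are assumptions; adjust them to your own configuration.

```python
# A minimal sketch (the log format is an assumption, not HH's real one):
# bucket nginx access-log lines into the "traffic light" classes:
# green < 500 ms, yellow 500 ms..1 s, red > 1 s, black = HTTP 5xx.
# Assumed log_format ends with:  ... $status $request_time
from collections import Counter

def classify(status: int, request_time: float) -> str:
    if status >= 500:
        return 'black'          # errors, the "oil"
    if request_time < 0.5:
        return 'green'
    if request_time < 1.0:
        return 'yellow'
    return 'red'

buckets = Counter()
with open('/var/log/nginx/access.log') as log:
    for line in log:
        parts = line.split()
        try:
            status, request_time = int(parts[-2]), float(parts[-1])
        except (IndexError, ValueError):
            continue            # skip lines that do not match the format
        buckets[classify(status, request_time)] += 1

print(buckets)   # per-interval counts become the stacked rps chart
```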

How do we look at it? Like this. In our terminology this is called a "traffic light". The Y axis is requests per second — we have a bit more than 2,000 of them (as of when I was preparing this presentation). The green area is the rps served in under 500 ms, yellow is between 500 ms and a second, red is more than a second — that is a long time. Black is the errors we added on top, affectionately called "oil" — this is the rps of 500s. Transfer to the client is included here, because this is the request_time we talked about above.

With upstream_response_time, first of all, the rps is lower, because some responses were served from cache and never reached a backend. And, as we can see, there is less red, because the transfer to the client is gone, i.e. we actually compute pages even faster — but the 500s are still with us.

We can also leave only the 500s on a chart. Then we see that we had two spikes: one of more than 400 rps and a second of a bit more than 100. Very convenient for understanding. So with two charts we have, in principle, answered almost all the questions — and in a situation where we only care about the server side, with one.

We can break the 500s down by URL in realtime. We can see cases where all URLs return 500s at once, in proportion to their rps, and we can see a specific URL — when some handler or service broke, or a particular database query got slow.

There is also a chart that duplicates the 500s chart but is built from different logs; the colors are kept the same so that you can tell in the legend which line is the 500s. We have an API for the mobile application, the main HH site, and the mobile site. Separately we single out bots — requests that hit the per-second rate limit in nginx and get a 429 status from us; that is what is drawn here. We record all the statuses, but here we only plot the 429s, because all sorts of things happen and we do not filter everything out — for example, the Sputnik search engine, when it was young, crawled us heavily until we banned it. But simply seeing the fact that bots have come is very useful.

Then we can look at traffic by URL. This is static traffic only; dynamic pages are not included here. We can see that a release went out, CSS and JS were updated, there was a spike as users started downloading the new versions, then it calmed down and returned to normal. It is great to understand what is happening, because if you see an unfamiliar traffic spike you go: "Damn, what is that?" — and this way everything is immediately clear.

A little about URLs. You can configure log parsing by saying: "I know such and such URLs — vacancy, job search, etc." But this goes stale instantly, because the application evolves: developers rename URLs or rework them, nobody tells you about it, and when you go to look at something for that URL, there is nothing there.
What should you do instead? First, try to normalize the URLs: strip all the arguments, pick the hashes and ids out of the URL and replace them with placeholders — whatever works — and then build a dynamic top. Our dynamic top is by the sum of $upstream_response_time, i.e. the URLs on which the most real time is spent end up at the top. You can rank by rps, or build a separate top for errors. But you definitely need to cut off the tail, because requests from bots, as a rule, collect all your 500s, all your 400s, and so on.
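Here is a minimal sketch of what such normalization and a dynamic top could look like; the regular expressions, placeholder names and sample data are illustrative assumptions, not HH's real rules.

```python
# A minimal sketch of URL normalization and a "dynamic top", assuming a list
# of (url, upstream_response_time) pairs already parsed from the access log.
import re
from collections import defaultdict

ID_RE   = re.compile(r'/\d+(?=/|$)')             # numeric ids in the path
HASH_RE = re.compile(r'/[0-9a-f]{16,}(?=/|$)')   # long hex hashes

def normalize(url: str) -> str:
    path = url.split('?', 1)[0]                  # drop all query arguments
    path = ID_RE.sub('/<id>', path)
    path = HASH_RE.sub('/<hash>', path)
    return path

def dynamic_top(records, n=10):
    """Top-n normalized URLs by total upstream_response_time."""
    totals = defaultdict(float)
    for url, upstream_time in records:
        totals[normalize(url)] += upstream_time
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

records = [('/vacancy/123?from=main', 0.12),
           ('/vacancy/456', 0.30),
           ('/search/vacancy?text=python', 0.25)]
print(dynamic_top(records))
# [('/vacancy/<id>', 0.42), ('/search/vacancy', 0.25)]
```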

Also from the logs: if you have several frontends and you collect metrics from all of them, you can aggregate them and see how balancing works — well or badly, where traffic has flowed. In this example, front2 was taken out of the cluster several times for some maintenance. We see that its load flowed to the remaining servers while the total rps was not affected at all, i.e. we did it painlessly for users. A picture like this is extremely useful.

Let's go further down. We now understand how the site works overall; next we solve the problem of quickly figuring out what exactly broke. The front also has a log.

This is what one kind of log line looks like. Here, for each handler — in this case vacancy.Page — we write down all the stages and how many milliseconds each one took. A stage is some piece of code being executed; all stages run sequentially. Say, session is a trip to the session service: we take the cookie, do something with it, get the session, and now we know more about the user. Then, with this data, we go to page — a parallel trip to all the services this handler needs, i.e. inside this stage there are parallel requests, but we are interested in the total. xsl — we still template with XSL. postprocess is applying translations: the page is assembled with placeholders, then a special pass walks over the page and replaces the placeholders with translations.

Here is a picture built from this log. It is a stack of the 95th percentile of each stage. Percentiles have their pros and cons, but in this case they are convenient. First, we understand what got slow: in the middle of the chart there is a blue spike — the session service is slow. If another stage fires, we will see that immediately. In principle, there is such a picture for every page; you can also aggregate across all pages and all handlers, but that is a matter of taste.
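For illustration, a minimal sketch of computing one point of such a stacked chart, assuming the stage timings have already been parsed out of the front log into per-request dicts (the stage names follow the example above; the percentile implementation is deliberately crude).

```python
# A minimal sketch: per-stage 95th percentile over one time window.
from collections import defaultdict

def percentile(values, q=0.95):
    values = sorted(values)
    if not values:
        return 0.0
    idx = min(int(q * len(values)), len(values) - 1)
    return values[idx]

def stage_p95(requests):
    """requests: iterable of {stage_name: milliseconds} dicts."""
    by_stage = defaultdict(list)
    for stages in requests:
        for name, ms in stages.items():
            by_stage[name].append(ms)
    return {name: percentile(ms_list) for name, ms_list in by_stage.items()}

window = [{'session': 12, 'page': 130, 'xsl': 48, 'postprocess': 7},
          {'session': 95, 'page': 140, 'xsl': 50, 'postprocess': 6}]
print(stage_p95(window))   # one point per stage of the stacked p95 chart
```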

Another convenient trick is to sum up response times. "Sum of response times" sounds like a dumb metric, but it is actually a great one. By summing response times at the topmost layer we get, in certain units — I call them "resource seconds per second" — what the cluster as a whole is busy with. If we break this sum down by handler, we see that 20-30% is blue — that is vacancy.Page — i.e. cumulatively, regardless of the number of requests, that is how much of the cluster is busy serving the vacancy page. Next comes job search, and so on. If there is some nasty request that runs for an hour but in a single thread, we will not see it here — and rightly so. This is the real answer to the question of what the cluster is doing.
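A minimal sketch of the "resource seconds per second" aggregation, assuming (handler, response time) pairs have already been extracted from the top-level log for one time window.

```python
# A minimal sketch: seconds of cluster time spent per handler, per second of
# wall clock, for one window of the top-level log.
from collections import defaultdict

def resource_seconds(requests, window_seconds=60):
    totals = defaultdict(float)
    for handler, response_time in requests:
        totals[handler] += response_time        # seconds of work, summed up
    return {h: t / window_seconds for h, t in totals.items()}

window = [('vacancy.Page', 0.18), ('vacancy.Page', 0.22), ('search.Vacancy', 0.35)]
print(resource_seconds(window, window_seconds=60))
```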

Templates. The front also writes which template is applied and how long it takes.

It makes sense to turn this into a chart: we get how much CPU, in seconds, each template consumes, because templating is a CPU-bound task. Then we can choose what to optimize. We are not guided by rps — how many times a specific template was called or applied — but by how much it contributes in total and how much CPU can, in principle, be freed.

Here is an example. True, I will not tell you which template it was, but we set out to optimize and found that we could optimize the piece that generates the menu, which other templates include. Here it has been optimized — that is why the area collapsed. I.e. if such an optimization goes into production and we see that the picture has not changed, the change should be rolled back and nothing more done, because if there is no effect, the task can be thrown away. Here there is an effect — great, and we measured how big it is. On the strength of this we removed about five machines from the front cluster.

Next, on to the balancers.

There is the same nginx there, and the same pictures, but since we have SOA with many services, the most useful chart for us answers the question "which service is returning 500s?". In this case it is a top-5, and it is dynamic: if a service with 2 rps suddenly starts giving us 10 thousand 500s per second, it will show up here automatically. Here we see green — the session service is producing about 70 500s per second. We have our answer: at the very least, even if we have nothing else on this service, we can already go read its logs.

But what can we pull out of each service?

Stages again. In principle, nothing besides stages is needed, because debugging from logs is hard anyway — those are separate logs. If a programmer needs information about the business logic, he writes it separately, and the agent pulls out of the log only what interests us. We are interested in the stages, in where the time goes. In this case the session request took 10 ms, 6 of which were spent on a trip to the HHid service — a service that knows the user and knows nothing else. HhSession enriches the result of that service's response from the database, and the database query took 3 ms.

So there is the log, and there is the picture. We see that the database latency part of the request is constant, and against this background the HHid response time jumps around. I.e. if the database suddenly gets slow, we will see a spike in the database part — so we get answers to our main questions: where to go next in case of an outage.
My whole story is built around wanting to understand quickly what happened, i.e. it is about optimizing the ops engineer's work.

Now to the database. So, you are told: "the database is slow." What can be done on the database side?

First, there is — trivial as it sounds — the average database response time. We take it from postgres, from pg_stat_statements, because nothing more precise can be extracted from there. So, the average time — but it is meaningful. Here are all four of our databases over a particular slice of time. We see that everything works fine, but at night one of them gets slow. It is a master, and later we will try to understand what happened to it. I.e. if someone comes and tells you that the database is slow, you look at this chart, everything is the same as it was, and you say right away: "No, the database is fine." And that will be true.

If the CPU on the database jumped, then from pg_stat_statements we can, in principle, understand exactly which query was being computed. In this case we see the red spike — it is a query to the hh_test database, and its full text with placeholders is right there. It is truncated here, but you can simply copy it and figure out where it comes from, fix it, remove it, anything.

The database also has a weak point — its disks. Naturally, everything lives there, all the data.

Who loaded the disks? pg_stat_statements has counters for how long queries waited on the disk. So we see the two specific queries that are killing the disk. We go and deal with them if necessary; if it is a normal, regular situation, then at least we know what the spike is. Answering the question "what was that?" is the most important thing: if there are spikes in your system that you do not understand, you need to collect more detailed information about your system.
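For illustration, a minimal sketch of pulling such "tops" out of pg_stat_statements with psycopg2. This is not HH's agent: column names differ between PostgreSQL versions (total_time vs total_exec_time), blk_read_time is only populated when track_io_timing is enabled, and the DSN is hypothetical.

```python
# A minimal sketch: top queries by total time and by disk read wait,
# taken from pg_stat_statements (column names for PostgreSQL 12 and older).
import psycopg2

TOP_BY_TIME = """
    SELECT query, calls, total_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 5
"""

TOP_BY_DISK = """
    SELECT query, calls, blk_read_time
    FROM pg_stat_statements
    ORDER BY blk_read_time DESC
    LIMIT 5
"""

def report(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for title, sql in (('by total time', TOP_BY_TIME),
                           ('by disk read time', TOP_BY_DISK)):
            cur.execute(sql)
            print(f'--- top queries {title} ---')
            for query, calls, value in cur.fetchall():
                print(f'{value:12.1f} ms  {calls:8d} calls  {query[:80]}')

report('dbname=hh_test user=monitoring')   # hypothetical DSN
```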

And everything would be simple if there were no network between all of this. This is one of the reasons we collect all timings from all services, even if they overlap: the measurement error in a web application can be quite substantial. For example, if you have an asynchronous application built on an event loop, and somewhere it measures how long it waits for an answer while also computing something else, then until control returns to its loop it will count a lot of extra time.

A question arises. Suppose we see in the session service log that a call to hhid took 150 ms. We have request_id everywhere, i.e. we can tie log entries together and match two specific requests: the session service saw this much, hhid saw that much. We go to hhid and it says: "I answered in 5 ms, I know nothing." The developers come and say: "Your network is slow." Well, great, thanks! So we started thinking about what to do. Pinging every machine from every other machine is strange, to put it mildly, and before the first outage it is not even clear which machines to check. We needed a way to rule out the network.

There is such a thing as TCP RTT — round-trip time: the time from sending a segment to the TCP network to receiving its acknowledgement. But this time should not be compared with ping; it should be treated as a trend, because TCP is, to put it mildly, not a simple thing.

I stole a slide from a presentation — some networking course at an American university — which says that one TCP connection has only one RTT timer. In the first case, we measured the actual time from sending a segment to receiving the ACK — all is well; if we correct for the number of bytes transferred, we get approximately a normal RTT. In the second case, the RTT was measured across sending 2.5 segments and receiving 2 ACKs, because TCP has selective acknowledgement and so on. There are a lot of bells and whistles there, and on top of that, if the order of segments gets shuffled, the real mess begins.
Still, we decided to try measuring it. Once a minute we take the RTT of all connections on all hosts — take it crudely and aggregate it. We aggregate between pairs of LAN IPs; if an IP is not from the LAN, we lump it into one common bucket and aggregate. If the IP is in the LAN, then for each IP we keep a metric from each host, and we also break it down by listen port when the listening side is ours. Then we try to see what these metrics can tell us.
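A minimal sketch of such a collector, built on parsing the output of ss -nti from iproute2 (which reports the smoothed RTT as rtt:&lt;rtt&gt;/&lt;rttvar&gt; in milliseconds). The aggregation here is simplified compared to the LAN / non-LAN / listen-port breakdown described above, and the two-line output format of ss is assumed.

```python
# A minimal sketch of per-connection TCP RTT collection via `ss -nti`.
import re
import subprocess
from collections import defaultdict

RTT_RE  = re.compile(r'\brtt:([\d.]+)/[\d.]+')
ADDR_RE = re.compile(r'\s(\S+):\d+\s+(\S+):\d+\s*$')   # local:port peer:port

def collect_rtt():
    out = subprocess.run(['ss', '-nti'], capture_output=True, text=True).stdout
    samples = defaultdict(list)
    peer = None
    for line in out.splitlines():
        addr = ADDR_RE.search(line)
        if addr:                      # connection line: remember the peer IP
            peer = addr.group(2)
            continue
        rtt = RTT_RE.search(line)     # detail line of the last connection
        if rtt and peer:
            samples[peer].append(float(rtt.group(1)))
    # aggregate to one value per peer, here simply the average
    return {ip: sum(v) / len(v) for ip, v in samples.items()}

print(collect_rtt())   # run once a minute and ship to your metric storage
```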
Here is an example picture. This is the RTT between a replica and its master; the replica is in a remote data center. On that day there was maintenance on the data centers and the channel was completely taken down twice. First, we see that the TCP RTT is about 7 ms while ping is under a millisecond — but the nice thing is that it is constant, i.e. it still reflects the trend. The picture shows the two outages: at the very least, when someone tells us "the network went away", we look at this and say "it really did". After the recovery we see that the network was degraded for a while — everything rushed to catch up, and so on — and then it returned to normal, which is also good to see. In general, things became clearer. The metric itself is opaque, because we do not know what is really hidden inside this RTT, but there is a trend, and that, I think, is 100 times better than nothing.
Now about the OS. Here everything is standard, except that I want to emphasize: everyone looks at swap usage and not at swap i/o, while swap i/o is much more useful. If something sits in your swap and never reads or writes anything there, let it sit — don't bother it.
But from a total CPU chart alone it is hard to say what a spike is. We come to work at 10 am, look at the chart — what is this?
We also collect system metrics for every process of every user. The most useful are CPU, disk i/o, swap and swap i/o, and on top of that (we got tired of "too many open files" problems) we collect, for each process, how many descriptors are open — sockets or files, it does not matter — and what the limit is.
And we immediately see that at night there was a backup on the master and pbzip was eating the CPU. A simple metric, just with good granularity, and we get all the answers.
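For illustration, a minimal sketch of collecting such per-process metrics straight from /proc: open descriptors against the soft limit, plus read/written bytes. Reading these files for other users' processes generally requires root; this is not the agent described in the talk.

```python
# A minimal sketch of per-process metrics from /proc.
import os

def fd_usage(pid: int):
    """(open_fds, soft_limit) for a process."""
    open_fds = len(os.listdir(f'/proc/{pid}/fd'))
    with open(f'/proc/{pid}/limits') as f:
        for line in f:
            if line.startswith('Max open files'):
                return open_fds, int(line.split()[3])   # soft limit column
    return open_fds, None

def io_counters(pid: int):
    """Bytes actually read from / written to the block layer by the process."""
    counters = {}
    with open(f'/proc/{pid}/io') as f:
        for line in f:
            key, value = line.split(':')
            counters[key.strip()] = int(value)
    return counters['read_bytes'], counters['write_bytes']

for pid in (p for p in os.listdir('/proc') if p.isdigit()):
    try:
        fds, limit = fd_usage(int(pid))
        rd, wr = io_counters(int(pid))
    except (PermissionError, FileNotFoundError):
        continue        # process exited or is not readable by us
    print(pid, fds, limit, rd, wr)
```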
Let's wrap up and summarize. What I wanted to say: think not about how many checks there are, how many hosts to monitor and which ports to probe once a minute, but about making good, correct charts for your system — they can greatly speed up finding the causes of your outages.

And it is important not to stop. Suppose we see the whole picture at the frontend level and could not go further down the logs, but we realized that, after deciding whether there is a problem or not, it still takes us 10 minutes during the day to read the logs. We added monitoring there and saw that this time dropped tenfold. Next, we see database problems. We used to go: "We have postgres consultants, the DBAs are outsourced!" — call them and say: "Guys, take a look, what's up with the database?". Now we at least understand whether the problem is in the database at all, because any communication between people is slow and a computer is faster — if you make the right metrics.

At the very least, make metrics not so much to give you answers as to rule out the most likely problems. And the more detailed the metric, the simpler it is: if everything is averaged, if you only have percentiles, histograms, etc., nothing will be clear and you will still end up digging through the logs. Try to make it so that you do not need to go to the logs at all.

That's all I wanted to tell you. Thanks for your attention!
Contacts
Nikolay Sivko
n.sivko@gmail.com
sivko@hh.ru
HeadHunter company blog
okmeter.io company blog

This report is a transcript of one of the best talks at HighLoad++, the conference for developers of high-load systems. We are now actively preparing the 2017 conference: this year HighLoad++ will take place in Skolkovo on November 7 and 8.
The topic of monitoring comes up at HighLoad++ every year; here, for example, are some of the 2017 talks:
And Nikolai himself is going to give a philosophical, systematizing talk on the topic "Operation of container based infrastructures".
Some of these materials are also used in our online training course on developing high-load systems, HighLoad.Guide — a series of specially selected letters, articles, materials and videos. Our textbook already contains more than 30 unique materials. Get connected!