This note is a translation of an article from Scout's blog. The article gives a simple, visual explanation of what load average is. It is aimed at beginner Linux administrators, but more experienced admins may find it useful too. If you're interested, read on.
You are probably already familiar with the concept of load average: the three numbers displayed in the output of the top and uptime commands. They look like this:
load average: 0.35, 0.32, 0.41
Most people intuitively understand that these three numbers are average values of processor load over progressively longer time intervals (one, five, and fifteen minutes), and that smaller values are better. High numbers mean the server is overloaded. But where is the threshold? Which values are "bad" and which are "good"? When should you merely keep an eye on the load, and when should you drop everything else and fix the problem as fast as possible?
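These same three numbers come straight from the kernel and can also be read directly from /proc/loadavg (the sample values below are illustrative). The first three fields are the one-, five-, and fifteen-minute averages; the fourth is the number of currently runnable scheduling entities over the total number; the last is the PID of the most recently created process:
~$ cat /proc/loadavg
0.35 0.32 0.41 1/211 3728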
To start, let's figure out what load average actually means. Consider the simplest case: a server with a single-core processor.
Traffic flow analogy
A single-core processor is like a single-lane road. Imagine you are an operator directing traffic across a bridge. Sometimes the bridge is so busy that cars have to queue up to cross it. You want to let people know how long they will have to wait to reach the other side of the river. A good way to do that is to report how many cars are waiting at any given moment: if the queue is empty, approaching drivers know they can cross immediately; otherwise, they know they will have to wait their turn.
So, bridge operator, what notation will you use? How about this:
- 0.00 means there is no traffic on the bridge at all. In fact, any value from 0.00 to 1.00 means there is no queue, and an approaching car can cross without waiting;
- 1.00 means there are exactly as many cars as the bridge can hold. Things are still moving, but if traffic picks up, problems are likely;
- values above 1.00 mean there is a queue at the entrance. How big? A value of 2.00 means there are as many cars waiting in the queue as there are crossing the bridge; 3.00 means the bridge is completely full and twice as many cars as it can hold are waiting. And so on.
[The original article illustrates this with three bridge drawings: load average = 1.00, load average = 0.50, and load average = 1.70.]
This is essentially what CPU load is. "Cars" are processes that are either using a slice of processor time ("crossing the bridge") or queued up waiting for one. In Unix this is called the run-queue length: the number of processes currently running plus the number waiting to run.
As the bridge operator, you would like the process-cars never to have to wait in line; likewise, it is preferable for the CPU load to stay below 1.00. Occasional bursts of traffic above 1.00 are fine, but if the load constantly exceeds that value, it is time to worry.
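If you want to watch the run queue directly rather than through the averages, vmstat (from the procps package) shows it live: its first column, r, is the number of runnable processes, i.e. those running or waiting for a CPU. A small sketch, with illustrative output:
~$ vmstat 1 3 | awk 'NR>2 {print "runnable:", $1}'
runnable: 2
runnable: 1
runnable: 3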
So you say 1.00 is the ideal load average?
Not quite. The problem with 1.00 is that it leaves you no headroom. In practice, many system administrators draw the line at 0.70:
- The "needs looking into" rule of thumb: 0.70. If the load average consistently exceeds 0.70, find out why the system is behaving that way before it turns into a real problem;
- The "fix it now!" rule of thumb: 1.00. If the load average consistently exceeds 1.00, urgently find the cause and eliminate it. Otherwise you risk being woken up in the middle of the night, and that will definitely not be fun;
- The "it's 3 a.m., WTF?!" rule of thumb: 5.00. If the load average exceeds 5.00, you are in serious trouble: the server may hang or slow to a crawl, and most likely at the worst possible moment, say in the middle of the night or while you are presenting at a conference. (A minimal shell sketch of these checks follows below.)
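Here is a minimal shell sketch of these rules of thumb, assuming a single-core machine (on a multi-core one the thresholds would need scaling by core count, as discussed below); the script name and messages are mine, not from the article:
~$ cat check_load.sh
#!/bin/sh
# take the one-minute load average (first field of /proc/loadavg)
load=$(cut -d ' ' -f 1 /proc/loadavg)
# compare it with the 0.70 / 1.00 / 5.00 rules of thumb
awk -v l="$load" 'BEGIN {
    if      (l >= 5.00) print "EMERGENCY: load " l
    else if (l >= 1.00) print "FIX NOW: load " l
    else if (l >= 0.70) print "ATTENTION: load " l
    else                print "OK: load " l
}'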
What about multiprocessor systems? My server shows a load of 3.00 and everything is fine!
Do you have a four-processor system? Then a load average of 3.00 is perfectly healthy.
On multiprocessor systems, the load should be interpreted relative to the number of available processor cores. The 100% mark corresponds to 1.00 on a single-core machine, 2.00 on a dual-core, 4.00 on a quad-core, and so on.
If we return to our bridge analogy, 1.00 means “one fully loaded lane”. If there is only one lane on the bridge, 1.00 means that the bridge is 100% loaded, but if there are two lanes, it is only 50% loaded.
The same goes for processors: 1.00 means 100% utilization of a single core, 2.00 means 100% utilization of two cores, and so on.
Multicore vs. multiprocessor
Which is better: one processor with two cores, or two separate single-core processors? For performance purposes, the two are roughly equivalent. Roughly, because there are many nuances involving cache sizes, handing processes off between processors, and so on. Despite this, the only characteristic that matters for interpreting system load is the total number of cores, regardless of how many physical processors they are spread across.
Which leads us to two more practical rules:
- "Number of cores = maximum load". On a multi-core system, the load should not exceed the number of available cores;
- "The cores are the cores in Africa." The way kernels are distributed across processors is unimportant. Two quad cores = four dual cores = eight single core processors. Only the total number of cores matters.
Let's bring it all together
Let's examine the load averages in the output of uptime:
~$ uptime
09:14:44 up 1:20, 5 users, load average: 0.35, 0.32, 0.41
These are the numbers for a system with a quad-core processor, and they show plenty of headroom. I won't even start thinking about the load until it exceeds 3.70.
Which average should I monitor: the one-, five-, or fifteen-minute one?
For the thresholds discussed above (1.00 means fix it now, and so on), look at the five- and fifteen-minute values. If your system exceeds 1.00 only on the one-minute average, everything is fine; it is when the five- or fifteen-minute average goes above 1.00 that you should start taking action (adjusting, of course, for the number of cores in your system).
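A hedged sketch of this advice: pull the five- and fifteen-minute fields out of /proc/loadavg and normalize them by the core count (the sample output is illustrative, matching the quad-core example above):
~$ read one five fifteen rest < /proc/loadavg
~$ cores=$(nproc)
~$ awk -v f="$five" -v q="$fifteen" -v c="$cores" 'BEGIN {
      printf "5-min: %.0f%%, 15-min: %.0f%% of capacity\n", 100*f/c, 100*q/c
      if (f/c > 1.00 || q/c > 1.00) print "sustained overload - time to act"
   }'
5-min: 9%, 15-min: 10% of capacity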
The number of cores matters for interpreting load average correctly. How do I find it out?
The cat /proc/cpuinfo command displays information about every processor in your system. To extract the number of cores, pipe its output through the grep utility:
~$ cat /proc/cpuinfo | grep 'cpu cores'
cpu cores : 4
cpu cores : 4
cpu cores : 4
cpu cores : 4
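Two shorter alternatives, if you only need the total count of logical CPUs the kernel schedules on (keep in mind that with Hyper-Threading these count logical, not physical, cores):
~$ grep -c '^processor' /proc/cpuinfo
4
~$ nproc
4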
Translator's notes
That concludes the translation of the article itself; a lot of interesting information can also be gleaned from its comments. One commenter points out that keeping a safety margin and never letting the load rise above 0.70 is not important for every system: sometimes you want the server to run flat out, and in such cases a load average of 1.00 is just what the doctor ordered.
P.S. Habr user dukelion added a valuable point in the comments: in some scenarios, to squeeze maximum efficiency out of the hardware, it is worth keeping the load average slightly above 1.00 at the expense of the throughput of each individual process.
P.P.S. Habr user enemo notes in the comments that a high load average can also be caused by a large number of processes performing read/write operations. In other words, a load average above 1.00 on a single-core machine does not always mean your system has run out of CPU headroom; a closer study of the cause is required. Incidentally, that would make a good topic for a new post on Habr :-)
P.P.P.S. Habr user esvaf asks in the comments how to interpret load average values on a processor with Hyper-Threading. There is no definite answer at the moment.
This article argues that a processor with two virtual cores on one physical core is 10-30% faster than a plain single-core one. If we take that assumption as true, I believe that when interpreting the load average, only the number of physical cores should be counted.
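If you follow that advice, lscpu (from the util-linux package) makes the physical/logical distinction easy to read off; the numbers below are for a hypothetical quad-core machine with Hyper-Threading, where physical cores = Socket(s) x Core(s) per socket = 4:
~$ lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1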