Monitoring System Delays with jHiccup

About jHiccup

jHiccup is a simple program that lets you measure operating system delays as the application experiences them. It was written by Gil Tene, CTO of Azul.

Why delays are so important

We live in the age of networked applications. Most programs running on our computers regularly access the Internet. If we launch a browser and open google.com, it makes 50–60 requests.

56 requests to open google.com

For more complex sites, the number of requests runs into the hundreds, and a delay in any one of them can delay the rendering of the entire page.

The same reasoning applies to client-server applications and to the popular microservices. If one microservice in a call chain responds later than usual, it can slow down the whole operation. For example, a slow response from the database with the price of a product can slow down the entire purchase in an online store.

Therefore, when we talk about program performance and delays, the worst results matter as much as the best and average ones.

Why do we need OS monitoring

Developers often receive complaints from customers or from other systems that the application took too long to process a request. Unfortunately, such a problem can be difficult to reproduce locally or even to notice in a real environment.

In this case many developers immediately start looking for problems in the code, using, for example, logs, metrics, and a profiler. But a proper performance analysis should go from the bottom up, starting at the level of the hardware and the OS and ending with the program.

Most operating systems are not real-time operating systems, so they cannot guarantee that an operation completes within a certain time. This means that the performance of a program running on such an OS can vary widely over its lifetime. Simply put, at some moment the program may not get any CPU time at all, and then it does not matter what code the program contains.

Here are some reasons why a program can “sleep” without being able to perform useful actions:

The OS can perform internal garbage collection;
Another resource-intensive application may be using the CPU or other resources;
The OS may be running on top of a hypervisor and, without knowing it, may not be the only OS running on this hardware.

Why jHiccup?

There are many utilities and metrics that show the load on different system components at varying levels of detail. The problem is that there are a lot of these metrics, and for each one the programmer has to answer two questions:

Could this metric be the cause of the delay?
Is the specific value of this metric anomalous?

jHiccup lets you look at the system from the application's point of view. It is a small application with a simple function: in an endless loop, a thread goes to sleep and asks the OS to wake it up after a certain period, for example 1 second. If the OS is busy at that moment and cannot wake the thread on time, the application detects this by comparing the actual wake-up time with the expected one (the time it went to sleep + 1 sec). From these measurements we can build a graph of system delays during program execution.

The X axis shows the program execution time in seconds. The Y axis shows the wake-up delay in milliseconds (how long the program waited for the OS to let it run).

Knowing when the client complained about a slow response from our system, and looking at the graph of wake-up delays, we can tell whether the OS was the cause of the delay or not.
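The sleep-and-compare idea described above can be sketched in a few lines of Java. This is only my illustration of the principle, not jHiccup's actual source: the class name HiccupSketch, the method measure, and the 10 ms interval are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the sleep-and-compare idea; not jHiccup's real code.
final class HiccupSketch {

    /**
     * Sleeps for intervalMs in each round and returns, per round, how many
     * milliseconds later than requested the thread was actually woken up.
     */
    static long[] measure(long intervalMs, int rounds) {
        long[] delaysMs = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            try {
                TimeUnit.MILLISECONDS.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and return what was measured so far.
                Thread.currentThread().interrupt();
                return Arrays.copyOf(delaysMs, i);
            }
            long elapsedNs = System.nanoTime() - start;
            long extraNs = elapsedNs - TimeUnit.MILLISECONDS.toNanos(intervalMs);
            // Timer granularity can make the difference slightly negative; clamp to zero.
            delaysMs[i] = Math.max(0, TimeUnit.NANOSECONDS.toMillis(extraNs));
        }
        return delaysMs;
    }

    public static void main(String[] args) {
        long[] delays = measure(10, 20);
        for (int i = 0; i < delays.length; i++) {
            System.out.println("round " + i + ": woke up " + delays[i] + " ms late");
        }
    }
}
```

On an idle machine most rounds should report delays close to zero; a busy or overcommitted host will show occasional larger values, which are exactly the "hiccups" plotted on the graph.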

In the previous graph, we looked at delays relative to the program execution time. In addition, it is convenient to take all the delays and sort them in ascending order. This will give us an idea of the distribution of delays and their likelihood.

The delay in milliseconds on the y axis and its probability on the x axis
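Reading a value off such a sorted distribution can be sketched as follows. The nearest-rank method used here is my choice for the illustration, and the sample delays are made up; jHiccup itself gets its percentiles from a histogram rather than from a sorted list.

```java
import java.util.Arrays;

// Hypothetical sketch: percentile of a small list of delays via sorting.
final class PercentileSketch {

    /**
     * Nearest-rank percentile: the smallest recorded delay such that
     * at least p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] delaysMs, double p) {
        double[] sorted = delaysMs.clone();
        Arrays.sort(sorted); // ascending order, as in the text
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Made-up sample: nine fast wake-ups and one long stall.
        double[] delays = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        System.out.println("p50 = " + percentile(delays, 50) + " ms");
        System.out.println("p99 = " + percentile(delays, 99) + " ms");
    }
}
```

Even in this tiny sample the median looks harmless while the high percentile exposes the stall, which is why the distribution graph is worth reading all the way to the right.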

Some features of jHiccup:

It does not suffer from the Coordinated Omission problem, which Gil Tene describes well in his video.

Internally, jHiccup uses a histogram as its data structure. A regular histogram splits the whole delay range (for example, from 1 ms to 1 sec) into intervals and counts how many delays fall into each one. This represents the delay data more compactly than a plain list of observed values (1.55 ms, 2.6 ms, etc.).
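A regular fixed-width histogram of this kind might look like the sketch below. The class and method names are hypothetical, and for round numbers I assume a range of 0 to 1000 ms split into 10 intervals.

```java
// Hypothetical sketch of a regular (fixed-width) histogram of delays.
final class FixedWidthHistogramSketch {
    private final double minMs;
    private final double bucketWidthMs;
    private final long[] counts;

    FixedWidthHistogramSketch(double minMs, double maxMs, int buckets) {
        this.minMs = minMs;
        this.bucketWidthMs = (maxMs - minMs) / buckets;
        this.counts = new long[buckets];
    }

    /** Counts one observed delay in the interval it falls into. */
    void record(double delayMs) {
        int index = (int) ((delayMs - minMs) / bucketWidthMs);
        // Values outside the configured range are clamped to the edge intervals.
        index = Math.max(0, Math.min(counts.length - 1, index));
        counts[index]++;
    }

    long countAt(int bucket) {
        return counts[bucket];
    }

    public static void main(String[] args) {
        FixedWidthHistogramSketch h = new FixedWidthHistogramSketch(0, 1000, 10);
        h.record(1.55);
        h.record(2.6);
        h.record(250);
        for (int i = 0; i < 10; i++) {
            System.out.println("[" + (i * 100) + ", " + ((i + 1) * 100) + ") ms: " + h.countAt(i));
        }
    }
}
```

The memory used depends only on the number of intervals, not on the number of recorded delays. The price is resolution: 1.55 ms and 2.6 ms end up in the same 100 ms interval.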

In fact, jHiccup uses a special histogram implementation, HdrHistogram, which has the following properties:

High resolution. We can see not only the 95th and 99th percentiles, but also much more detailed data (99.999%).
Compact storage, which makes it possible to measure performance over long periods. To achieve this, the size of the histogram intervals grows exponentially, so anomalous values are stored more compactly.

The HdrHistogram library is widely used, and implementations exist in different languages. Various metrics collection systems have started to support HdrHistogram as one of their internal formats because of its compactness and accuracy.
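The effect of exponentially growing intervals can be shown with a small sketch. It is a simplification of my own, not HdrHistogram's real layout: as far as I know, the library additionally subdivides each interval to keep a configurable number of significant digits. Here each interval is simply twice as wide as the previous one, and the names are hypothetical.

```java
// Hypothetical sketch: interval index when interval width doubles each time.
final class ExponentialBucketsSketch {

    /**
     * Bucket 0 holds delays below 1 ms; bucket k (k >= 1) holds delays
     * in the range [2^(k-1), 2^k) ms.
     */
    static int bucketIndex(long delayMs) {
        if (delayMs < 1) {
            return 0;
        }
        // Number of bits needed to represent the value.
        return 64 - Long.numberOfLeadingZeros(delayMs);
    }

    public static void main(String[] args) {
        long[] samples = {0, 1, 3, 4, 1000, 60_000};
        for (long s : samples) {
            System.out.println(s + " ms -> bucket " + bucketIndex(s));
        }
    }
}
```

With this layout a one-minute stall costs only a handful of extra intervals, whereas a fixed-width histogram with 1 ms intervals would need tens of thousands of them to cover the same range.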

Why such accuracy?

In the graphs above we saw data up to the 99.9999th percentile. Many people ask whether such accuracy is necessary and whether it makes sense to look beyond the 95th or 99th percentile. Let's look at two examples. In both, we take the probability P(A) that a single request hits an anomalous delay as 5% and as 1%. We need to find P(B), the probability that the user sees at least one anomalous request:

We have seen google.com make about 60 requests. Now consider an online store where completing a purchase takes 200 requests. For P(A) = 5%, P(B) = 1 - 0.95^200 = 99.996%. For P(A) = 1%, P(B) = 1 - 0.99^200 = 86.6%.
Suppose we have 10 microservices, and each is called twice during a certain scenario, 20 calls in total. For P(A) = 5%, P(B) = 1 - 0.95^20 = 64.15%. For P(A) = 1%, P(B) = 1 - 0.99^20 = 18.2%.
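The arithmetic in both examples is the same formula, P(B) = 1 - (1 - P(A))^n, which assumes that the requests are slow independently of one another. A small sketch for trying other values of P(A) and n (the class and method names are mine):

```java
import java.util.Locale;

// Hypothetical helper for the formula P(B) = 1 - (1 - P(A))^n.
final class TailProbabilitySketch {

    /**
     * Probability that at least one of n requests is anomalously slow,
     * assuming each request is slow independently with probability pSlow.
     */
    static double atLeastOneSlow(double pSlow, int requests) {
        return 1.0 - Math.pow(1.0 - pSlow, requests);
    }

    public static void main(String[] args) {
        double[] probabilities = {0.05, 0.01};
        int[] requestCounts = {200, 20};
        for (int n : requestCounts) {
            for (double p : probabilities) {
                System.out.println(String.format(Locale.ROOT,
                        "P(A) = %.0f%%, n = %d: P(B) = %.3f%%",
                        p * 100, n, atLeastOneSlow(p, n) * 100));
            }
        }
    }
}
```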

As we can see, it is not enough to look at data only up to the 95th or 99th percentile.

jHiccup usage example

You can download jHiccup from http://www.azul.com/downloads/jhiccup/ or https://github.com/giltene/jHiccup .

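./jHiccup -d 4000 /usr/bin/java org.jhiccup.Idle -t 300000
# Wait 4 seconds before starting the measurements, then run for 300 seconds. Results are logged every 5 seconds (the default for -i).
# Produces hiccup.170617.1120.3461.hlog
./jHiccupLogProcessor -i hiccup.170617.1120.3461.hlog -o hiccup.170617.1120.3461
# Produces hiccup.170617.1120.3461 and hiccup.170617.1120.3461.hgrm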

The file hiccup.170617.1120.3461 can be viewed using the Excel file jHiccupPlotter.xls.

To view hiccup.170617.1120.3461.hgrm, you can use the online application https://hdrhistogram.imtqy.com/HdrHistogram/plotFiles.html . It is also convenient for comparing several hgrm files (for example, recorded under different system loads or on different servers).

By comparing the performance graph of our program (for example, the latency of HTTP responses) with the resulting HdrHistogram plot, we can tell whether the whole system was slow during a certain period or only our program.

We ran jHiccup as a separate process. Another way is to attach it as a Java agent to our program.

 java -javaagent:jHiccup.jar="-d 0 -i 1000 -l hiccuplog -c" MyProgram.jar -a -b -c

In this case, jHiccup keeps waking up and recording delay information for the entire run of the program.

There is one important difference between these two launch methods. In the first case, jHiccup runs in a separate JVM; in the second, it runs in the same JVM as our program. That is, in the second case we will also see delays caused by the JVM on which the main application is running (for example, GC pauses).

In jHiccupPlotter.xls, you can add SLA lines to the chart.

I see two convenient applications for SLA:

Source: https://habr.com/ru/post/331436/
