
Gorilla: fast, scalable in-memory time-series database

This is a translation of a review of the paper “Gorilla: A fast, scalable, in-memory time series database”, Pelkonen et al., VLDB 2015.


The Facebook engineers built a high-performance engine for monitoring data. I enjoyed the review of this paper on the blog "The Morning Paper", especially the part about the compression algorithms, so here is a translation.


The original author's style has been preserved.


The error count on one of Facebook's servers went through the roof. The problem was discovered within minutes of its appearance, when an automatic alert fired in Gorilla, the in-memory time-series database. One group of engineers set about resolving the incident, while another set out to find its root cause. They ran the time-series correlation engine built on top of Gorilla and searched for metrics correlating with the errors. It turned out that copying a release binary to Facebook's web servers (a rather routine operation) caused an anomalous drop in memory usage across the site...
Figure 1. While searching for the root cause of the increased error rate on the site, Gorilla's time-series correlation engine found anomalous events that coincided in time with the errors, namely a sharp drop in memory used while the release binary was being copied.

In the 18 months before the article was published, Gorilla helped Facebook engineers identify and debug several similar problems in production.


“An important requirement for operating such highly loaded services is accurate monitoring of the health and performance of the underlying system, and quick identification and diagnosis of problems as they arise. Facebook uses a time-series database (TSDB) to store system measurement data points and provides quick query functionality on top of it.”

As of spring 2015, Facebook's monitoring systems generated over 2 billion unique time series of metrics (12 million measurements per second, i.e. more than 1 trillion data points per day). Gorilla had to meet a demanding set of performance requirements.



To meet these requirements, Gorilla is built as a TSDB that lives entirely in RAM and works as a write-through cache for monitoring data being written to HBase. To hold 26 hours of data in memory, Gorilla uses a new time-series compression algorithm that achieves a 12x compression ratio. The in-memory data structures allow all data to be scanned quickly and efficiently while still supporting constant-time retrieval of individual time series.


“A string key is used to identify a unique time series. Each time series is pinned to a single Gorilla host by sharding all monitoring data based on these unique string keys. Thus, we can scale Gorilla simply by adding new hosts and tuning the sharding so that new data falls onto the expanded set of hosts. When Gorilla was launched in production 18 months ago, all of our monitoring data for the previous 26 hours fit into 1.3 TB of RAM, evenly distributed across 20 machines. Since then, due to data growth, we have had to double the cluster size twice, and each Gorilla cluster now runs on 80 machines. This process was simple thanks to the architecture of independent nodes and the focus on horizontal scalability.”

The in-memory data structure is based on the unordered map from the C++ standard library. It proved to be efficient and caused no locking problems. Gorilla persists data to GlusterFS, a POSIX-compliant distributed file system with 3x replication. "HDFS or other distributed file systems would also work."
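As a rough illustration of the sharding and in-memory layout described above, here is a minimal Python sketch. It is not Gorilla's actual code: the names shard_for and ShardStore are invented for the example, and a Python dict stands in for the C++ unordered map.

```python
import zlib


def shard_for(key, num_hosts):
    """Route a time-series string key to a Gorilla host by hashing the key."""
    return zlib.crc32(key.encode("utf-8")) % num_hosts


class ShardStore:
    """Toy model of one host's in-memory index: a hash map (the Python
    analogue of C++ std::unordered_map) from series key to its data points."""

    def __init__(self):
        self.series = {}  # key -> list of (timestamp, value) pairs

    def append(self, key, timestamp, value):
        self.series.setdefault(key, []).append((timestamp, value))

    def get(self, key):
        # Constant-time lookup of an individual time series.
        return self.series.get(key, [])
```

Scaling out then amounts to increasing num_hosts and re-routing new data, which matches the horizontal-scaling description quoted above.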


For more details on the data structures and on how Gorilla handles failures, see sections 4.3 and 4.4 of the original article [pdf].


Here I want to focus on the techniques Gorilla uses to compress time series so that all of this data fits in memory!


Compressing time series in Gorilla


“Gorilla compresses data points within a single time series, with no additional compression across different time series. Each data point is a pair of 64-bit values representing a timestamp and its corresponding metric value. Timestamps and values are compressed separately, using information about previous values.”

As for timestamps, the key observation is that most sources write data points at fixed intervals (for example, one point every 60 seconds). From time to time a data point may be recorded a little earlier or later (by a second or two, say), but this jitter is usually bounded. In our case every bit counts, so if we can represent consecutive timestamps in as little space as possible, we save memory... Each data block stores 2 hours of measurements. The block header contains the starting timestamp of the two-hour window. The first timestamp in the block (after the start of the two-hour window) is stored as a delta from the block start, using 14 bits. 14 bits is enough to hold just over 4 hours, so we know we will definitely not need more.


For all subsequent timestamps we compare deltas. Suppose the block start time is 02:00:00 and the first timestamp is 62 seconds later, at 02:01:02. The next data point is at 02:02:02, another 60 seconds later. Comparing these two deltas, the second delta (60 seconds) is 2 seconds shorter than the first (62 seconds). So we record minus 2. How many bits should we use to write "-2"? The fewer the better! We can use control bits to indicate how many bits the value itself is encoded in. The scheme works as follows:


Delta-of-delta calculation: D = (t_n - t_{n-1}) - (t_{n-1} - t_{n-2})
The value of D is then encoded according to a table of ranges and control bits:
[Table: delta-of-delta ranges and their control-bit encodings]


The specific time-interval ranges were chosen by sampling a set of real time series from the production systems and selecting the ranges that gave the best compression ratios.
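As an illustration, here is a minimal Python sketch of this delta-of-delta encoding. It is not Gorilla's actual code: the BitWriter is a toy stand-in, and the specific ranges and control bits follow the table given in the original paper.

```python
class BitWriter:
    """Toy bit writer: accumulates bits as a string of '0'/'1' characters."""
    def __init__(self):
        self.bits = ""

    def write(self, value, nbits):
        # Mask to nbits, which also gives a two's-complement style
        # representation for negative delta-of-delta values.
        self.bits += format(value & ((1 << nbits) - 1), "0{}b".format(nbits))


def encode_timestamps(block_start, timestamps):
    """Delta-of-delta timestamp compression (sketch).

    block_start -- aligned start of the 2-hour block, in seconds
    timestamps  -- data-point timestamps within the block, in seconds
    """
    out = BitWriter()
    # First timestamp: 14-bit delta from the block start.
    prev_delta = timestamps[0] - block_start
    out.write(prev_delta, 14)
    prev_ts = timestamps[0]

    for ts in timestamps[1:]:
        delta = ts - prev_ts
        dod = delta - prev_delta       # D = (t_n - t_{n-1}) - (t_{n-1} - t_{n-2})
        if dod == 0:
            out.write(0b0, 1)          # '0'
        elif -63 <= dod <= 64:
            out.write(0b10, 2)         # '10' + value in 7 bits
            out.write(dod, 7)
        elif -255 <= dod <= 256:
            out.write(0b110, 3)        # '110' + value in 9 bits
            out.write(dod, 9)
        elif -2047 <= dod <= 2048:
            out.write(0b1110, 4)       # '1110' + value in 12 bits
            out.write(dod, 12)
        else:
            out.write(0b1111, 4)       # '1111' + value in 32 bits
            out.write(dod, 32)
        prev_delta, prev_ts = delta, ts
    return out.bits


# The example from the text: 02:00:00 block start, points at 02:01:02 and 02:02:02.
print(encode_timestamps(7200, [7262, 7322]))  # 62 in 14 bits, then '10' + (-2) in 7 bits
```

The exact bit-level layout in Gorilla may differ in details; the sketch is only meant to show the idea of spending one bit on the common case and progressively more bits on rarer, larger deviations.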


So we get the following data stream:
[Figure: example of the resulting compressed timestamp stream]


“Figure 3 shows the results of timestamp compression in Gorilla. We found that 96% of all timestamps can be compressed to a single bit.”


(i.e. 96% of all timestamps occur at regular intervals, so the delta of deltas is zero).
Figure 3. Distribution of compressed timestamps across the ranges, based on a sample of 440,000 real timestamps from Gorilla.


We have spent quite a while on timestamps, but what about the metric values themselves?


“We found that the value in most time series does not change significantly compared to neighboring data points. In addition, many data sources store only integers. This allowed us to replace the expensive prediction scheme of [25] with a simpler implementation that just compares the current value with the previous one. If the values are close to each other, then the sign, the exponent, and the first few bits of the mantissa will be identical. For this we XOR the current and previous values instead of applying a delta encoding scheme.”
Figure 4. A visual illustration of how XORing with the previous value often yields zeros at the beginning and the end, while for many series the non-zero bits are clustered together.


The values are then encoded as follows:


The first value is stored uncompressed. All subsequent values are XORed with the previous value. If the XOR result is zero (i.e. the value is the same), a single '0' bit is stored. If the XOR result is non-zero, there will most likely be a run of zeros before and after the "meaningful bits". We compute the number of leading zeros and the number of trailing zeros. For example, for the result 0x0003200000000000 there are 3 leading zeros and 11 trailing zeros (counting hexadecimal digits), so a control bit '1' is stored, followed by one of the two encodings described below...


If the previously stored value has the same number or fewer leading and trailing zeros, we know that all the meaningful bits of the value we want to store fall within the range of the previous value's meaningful bits:
Continuing the example above, a control bit '0' is stored, followed by the meaningful bits of the XOR result (i.e. 032).


If the meaningful bits of the current value do not fall within the range of the previous value's meaningful bits, a control bit '1' is stored, followed by 5 bits giving the number of leading zeros, 6 bits giving the length of the meaningful part of the XOR result, and finally the meaningful bits of the XOR result themselves.
“Approximately 51% of all values are compressed to a single bit, since the current and previous values are identical. About 30% of the values are compressed with the control bits '10', with an average compressed size of 26.6 bits. The remaining 19% are compressed with the control bits '11', with an average size of 36.9 bits, due to the extra overhead needed to encode the number of leading zero bits and the length of the meaningful bits.”
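To make the value scheme concrete, here is a minimal Python sketch of the XOR-based float compression described above. Again, this is not Gorilla's code: the 5-bit and 6-bit field widths follow the description, write_bits is assumed to be something like the toy BitWriter from the timestamp sketch, and production encoders handle edge cases this sketch skips.

```python
import struct


def float_to_bits(x):
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def encode_values(values, write_bits):
    """XOR-based value compression (sketch).

    write_bits(value, nbits) appends nbits bits to the output,
    e.g. the toy BitWriter.write from the timestamp sketch.
    """
    prev = float_to_bits(values[0])
    write_bits(prev, 64)                    # first value stored uncompressed
    prev_leading, prev_trailing = 65, 65    # sentinel: "no previous window yet"

    for v in values[1:]:
        cur = float_to_bits(v)
        xor = cur ^ prev
        if xor == 0:
            write_bits(0b0, 1)              # identical value: single '0' bit
        else:
            leading = 64 - xor.bit_length()           # leading zero bits
            trailing = (xor & -xor).bit_length() - 1  # trailing zero bits
            if leading >= prev_leading and trailing >= prev_trailing:
                # Meaningful bits fit inside the previous window: '10'.
                write_bits(0b10, 2)
                length = 64 - prev_leading - prev_trailing
                write_bits(xor >> prev_trailing, length)
            else:
                # New window: '11', 5 bits of leading-zero count,
                # 6 bits of meaningful length, then the meaningful bits.
                # (A real encoder also handles the length == 64 edge case.)
                write_bits(0b11, 2)
                length = 64 - leading - trailing
                write_bits(leading, 5)
                write_bits(length, 6)
                write_bits(xor >> trailing, length)
                prev_leading, prev_trailing = leading, trailing
        prev = cur
```

For example, encode_values([42.0, 42.0, 42.5], BitWriter().write) emits the first value in full, a single '0' bit for the repeated 42.0, and a '11'-prefixed record for 42.5, since there is no previous meaningful-bit window yet.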


Systems built on top of Gorilla
The low read latency of Gorilla (more than 70 times faster than the previous system it replaced) allowed the Facebook team to build a number of tools on top of it. These include horizon charts; aggregated roll-ups, which are updated for all completed blocks every two hours; and the correlation search engine we saw in the first example.


“The correlation search engine calculates the Pearson product-moment correlation coefficient (PPMCC), which compares a test time series against a large set of time series. We find that PPMCC's ability to match similarly shaped time series regardless of scale significantly helps automate root-cause analysis and answer the question "What else happened when my service broke?". For us this approach gives satisfactory answers to that question, and it turned out to be simpler to implement than the similar approaches described in the literature [10, 18, 16]. To compute the Pearson correlation, the test time series is distributed to each Gorilla host together with all the time-series keys. Each host then independently computes the N time series most correlated with the test series, orders them by the absolute value of the Pearson correlation, and returns the values of the time series it found. In the future, we hope that Gorilla will let us apply more advanced data-mining techniques to our monitoring data, such as those described in the literature on clustering and anomaly detection [10, 11, 16].”
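A minimal sketch of what such a correlation search could look like is below. It is illustrative only: Gorilla's engine works on its compressed in-memory data and fans the query out across hosts, while this just shows the top-N Pearson ranking over plain lists; the function names are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (PPMCC) of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def top_correlated(query_series, candidates, n=10):
    """Rank candidate series by |Pearson correlation| with the query series.

    candidates: dict mapping time-series key -> list of values aligned with
    query_series (alignment/resampling is out of scope for this sketch).
    """
    scored = [(abs(pearson(query_series, vals)), key)
              for key, vals in candidates.items()
              if len(vals) == len(query_series)]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:n]]
```

In the scheme described in the quote, each host would run something like top_correlated over its own shard of keys and return its best N matches for a coordinator to merge.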


Summing up the lessons learned, the authors highlight three further principles:


- Prioritize recent data over old data. Why something is broken right now is a more pressing question than why it was broken 2 days ago.
- Low read latency for monitoring data is critical; without it, the more advanced tools built on top of Gorilla would not be practical.
- High availability is more important than resource efficiency.


“We found that building a reliable, fault-tolerant system was the most time-consuming part of the project. Although the team built a prototype of a high-performance, compressed, in-memory TSDB in a very short amount of time, it took several more months of hard work to make it fault tolerant. However, the benefits of fault tolerance were evident when the system successfully weathered both real and simulated failures.”



Source: https://habr.com/ru/post/327038/

