
Redis streams as a pure data structure

The new data structure in Redis 5, called “streams”, has aroused great interest in the community. Someday I will talk to people who use streams in production and write about that. But now I want to consider a slightly different topic. It is starting to seem to me that many people perceive streams as some kind of surreal tool for solving terribly difficult tasks. Indeed, this data structure *also* does messaging, but it would be an incredible simplification to assume that the functionality of Redis Streams is limited to that.

Streams are a terrific pattern and “mental model” that can be applied with great success in system design, but in reality streams, like most Redis data structures, are a more general structure and can be used for a heap of other tasks. In this article, we will look at streams as a pure data structure, completely ignoring blocking operations, consumer groups, and all the other messaging functionality.

Streams are CSV on steroids


If you want to record a series of structured data items and think that a database would be overkill here, you can simply open a file in append-only mode and write each line as CSV (Comma Separated Values):

 (open data.csv in append only)
 time=1553096724033,cpu_temp=23.4,load=2.3
 time=1553096725029,cpu_temp=23.2,load=2.1

It looks easy. People did this a long time ago and still do: it is a solid pattern, if you know what you are doing. But what would the equivalent be in memory? In memory much more advanced processing of the data becomes possible, and many limitations of CSV files are automatically removed, such as:
  1. It is difficult (inefficient) to perform range queries.
  2. Too much redundant information: the time is almost the same in each record, and the field names are duplicated. At the same time, removing that redundancy would make the format less flexible if I want to switch to a different set of fields.
  3. Element offsets are simply byte offsets in the file: if we change the file structure, the offsets become wrong, so there is no real concept of a primary identifier. It is essentially impossible to refer to a record unambiguously.
  4. Without garbage collection and without rewriting the log, records cannot be deleted, only marked as invalid. Rewriting the log usually sucks for several reasons, so it is advisable to avoid it.

At the same time, such a CSV log is good in its own way: there is no fixed structure, the fields can vary, it is trivial to generate, and it is rather compact. The idea behind Redis streams was to retain the strengths but overcome the limitations. The result is a hybrid data structure very similar to Redis sorted sets: they *look like* a fundamental data structure, but use several internal representations to achieve this effect.

Introduction to streams (you can skip this if you are already familiar with the basics)


Redis streams are represented as delta-compressed macro nodes linked together by a radix tree. As a result, you can very quickly look up random entries, get ranges, delete old elements, and so on. At the same time, the interface for the programmer is very similar to a CSV file:

 > XADD mystream * cpu-temp 23.4 load 2.3
 "1553097561402-0"
 > XADD mystream * cpu-temp 23.2 load 2.1
 "1553097568315-0"

As you can see from the example, the XADD command automatically generates and returns a record identifier, which increases monotonically and consists of two parts: <time>-<counter>. The time is in milliseconds, and the counter increases for entries added within the same millisecond.

So, the first new abstraction over the idea of an append-only CSV file is to use an asterisk as the ID argument of XADD: this way we get a record identifier from the server for free. This identifier is not only useful for pointing to a specific element in the stream, it is also related to the time the record was added to the stream. In fact, using XRANGE you can perform range queries or fetch individual elements:

 > XRANGE mystream 1553097561402-0 1553097561402-0
 1) 1) "1553097561402-0"
    2) 1) "cpu-temp"
       2) "23.4"
       3) "load"
       4) "2.3"

In this case, I used the same ID for the beginning and the end of the range in order to identify a single element. However, I can use any range, plus a COUNT argument to limit the number of results. Similarly, there is no need to specify full identifiers for the range; I can simply use just the unix time to get elements in a given time range:

 > XRANGE mystream 1553097560000 1553097570000
 1) 1) "1553097561402-0"
    2) 1) "cpu-temp"
       2) "23.4"
       3) "load"
       4) "2.3"
 2) 1) "1553097568315-0"
    2) 1) "cpu-temp"
       2) "23.2"
       3) "load"
       4) "2.1"
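
The COUNT option mentioned above limits how many entries come back, which makes pagination trivial. As a minimal sketch against the same mystream key, the special IDs - and + stand for the smallest and the largest ID possible:

 > XRANGE mystream - + COUNT 2

To fetch the next page, repeat the call using the last returned ID, with its sequence part incremented by one, as the new start of the range.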

At the moment there is no need to show you the other API features; there is documentation for that. For now, let's just focus on this usage pattern: XADD to add, XRANGE (and also XREAD) to extract ranges (depending on what you want to do), and let's see why streams are so powerful as to be called a data structure.
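
As a quick sketch of XREAD used in its non-blocking form (its blocking features are out of scope here), the following reads up to one hundred entries starting from the beginning of the stream, 0 being the smallest possible ID:

 > XREAD COUNT 100 STREAMS mystream 0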

If you want to learn more about streams and their API, be sure to read the tutorial.

Tennis players


A few days ago, a friend who has started studying Redis and I were modeling an application to track local tennis courts, players and matches. The way to model players is completely obvious: a player is a small object, so all we need is a hash with keys like player:<id>. Then you immediately realize that you need a way to track the games played in specific tennis clubs. If player:1 and player:2 played each other and player:1 won, we can send the following record to the stream:

 > XADD club:1234.matches * player-a 1 player-b 2 winner 1
 "1553254144387-0"

Such a simple operation gives us:

  1. A unique identifier for the match: the ID in the stream.
  2. No need to create another object just to identify the match.
  3. Free range queries to paginate through matches, or to view the matches played at a specific date and time (see the sketch after this list).
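
As a minimal sketch of point 3, built on the club:1234.matches key from above, the first “page” of ten matches, or all matches within a given time window, can be fetched like this (the special IDs - and + stand for the smallest and largest possible IDs):

 > XRANGE club:1234.matches - + COUNT 10
 > XRANGE club:1234.matches 1553254140000 1553254150000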

Before the advent of streams, we would have had to create a sorted set scored by time: the elements of the sorted set would be match identifiers, with each match stored as a hash at a separate key. This is not only more work, but also more memory. Much, much more memory (see below).
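
A minimal sketch of that older pattern, with hypothetical key names, would have looked roughly like this: one hash per match, plus a sorted set scored by the match time in milliseconds:

 > HSET match:10 player-a 1 player-b 2 winner 1
 > ZADD club:1234.matches.by-time 1553254144387 match:10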

Now our goal is to show that Redis streams are a sort of append-only sorted set, keyed by time, where each element is a small hash. And in its simplicity this is a real revolution in the context of data modeling.

Memory


The usage example above is not just a handier programming pattern. Memory consumption in streams is so different from the old approach of a sorted set + a hash for each object that some things which previously were impossible to implement at all now start to work.

Here are the numbers for the amount of memory needed to store one million matches in the two configurations presented earlier:

 Sorted set + hash: 220 MB (242 MB RSS)
 Stream:            16.8 MB (18.11 MB RSS)

The difference is more than an order of magnitude (13 times, to be exact). This means being able to tackle tasks that were previously too expensive to keep in memory; now they are perfectly viable. The magic lies in the representation of Redis streams: macro nodes can contain several elements that are very compactly encoded in a data structure called a listpack. This structure takes care, for example, of encoding integers in binary form even when they are semantically strings. On top of that, we apply delta compression and compress repeated fields. Nevertheless it is still possible to seek by ID or time, because the macro nodes are linked in a radix tree, which is itself designed with memory optimization in mind. All of this together explains the economical use of memory, and the interesting part is that, semantically, the user does not see any of the implementation details that make streams so efficient.
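
If you want to check such numbers on your own keys, the MEMORY USAGE command reports an estimate of the bytes a key and its value consume (an estimate because, for large aggregates, it samples elements by default); for instance, against the key from the tennis example:

 > MEMORY USAGE club:1234.matches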

Now let's do the math. If I can store 1 million entries in about 18 MB of memory, then I can store 10 million in 180 MB and 100 million in 1.8 GB. With just 18 GB of memory I can hold 1 billion items.

Time series


It is important to note that the tennis matches example above is semantically *very different from* using Redis streams for time series. Yes, logically we are still recording some kind of event, but there is a fundamental difference. In the first case we log and create records in order to represent objects. In a time series we simply measure something happening outside, which does not really represent an object. You may say this distinction is trivial, but it is not. It is important to grasp the idea that Redis streams can be used to create small objects with a total order, and to assign identifiers to such objects.

But even the simplest time series use case is obviously a huge breakthrough, because before the advent of streams Redis was practically powerless to do anything here. The memory characteristics and flexibility of streams, as well as the ability to cap them (see the XADD options, and the sketch below), are very important tools in the hands of the developer.
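
As a minimal sketch with an illustrative key and field name, a capped stream that keeps roughly the latest million samples can be fed like this (the ~ asks for approximate, cheaper trimming):

 > XADD sensor:1.temperature MAXLEN ~ 1000000 * celsius 23.4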

Conclusions


Streams are flexible and offer many use cases, but I wanted to keep this article very short so as to clearly show the examples and the memory consumption. Perhaps for many readers this use of streams was obvious. However, conversations with developers in recent months have left me with the impression that many people strongly associate streams with streaming data, as if the data structure were only good for that. This is not the case. :-)

Source: https://habr.com/ru/post/444996/

