
Variable Length Data - DataSizeVariable (DSV)

Hello!

I have long wanted to write an article. I myself do not like long texts with a small amount of useful information, so I will try to make this one as rich as possible.

The general topic is efficient data packing, and the serialization and deserialization of objects.
The main goal is to share my thoughts on this and to discuss the structure of DSV data.
Problem:
The binary serialization mechanisms known at the time of writing (2013-09-19 18:09:56) are either not flexible enough or waste space. For example:
QString s1("123"); -> 4 bytes of data size (0x00000003) + 3 bytes of payload ("123"), efficiency = 3/7;
U32 val1(123); -> 4 bytes of data (0x0000007B), of which only 1 byte is significant (0x7B = 123), efficiency = 1/4.

Possible Solution:
Level 1 - natural numbers:
DataSizeVariable (DSV) is variable-length data with minimal memory overhead. The DSV format describes how to store and restore non-negative integers in the range [0; ∞). This gives theoretically unlimited scalability and binary data compatibility under the same rules, from 8-bit controllers to servers and clusters.

The essence of the format is that the most significant bit of each 8-bit byte is a continuation flag, and the remaining 7 bits carry data. By checking this flag we can find where the current number ends: if it is "0", the number ends with this byte; if it is "1", the number continues in the next byte. Data is written from the most significant part to the least significant, left to right (big-endian), one byte at a time. The significant bits of the number are split into groups of 7 bits, one group per DSV byte, and the continuation flag (the most significant, 8th bit) is set on every byte except the last. Examples:

[image: examples of numbers encoded in the DSV format]
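
The rules above can be sketched in C++ roughly as follows. This is only an illustration, not the code of the mini-library: the names dsvEncode and dsvDecode are hypothetical, and the sketch limits values to 64 bits, although the format itself has no upper bound.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append `value` to `out` as a DSV number.
// Assumption: values are limited to 64 bits in this sketch.
void dsvEncode(std::uint64_t value, std::vector<std::uint8_t>& out) {
    std::uint8_t groups[10];  // a 64-bit value needs at most 10 groups of 7 bits
    std::size_t count = 0;
    do {
        groups[count++] = static_cast<std::uint8_t>(value & 0x7F);
        value >>= 7;
    } while (value != 0);
    // Most significant group first; every byte except the last carries the flag.
    while (count > 1) {
        out.push_back(static_cast<std::uint8_t>(groups[--count] | 0x80));
    }
    out.push_back(groups[0]);
}

// Hypothetical helper: read one DSV number from `in` starting at `pos`.
// On success stores the number in `value`, moves `pos` past it and returns true.
// Returns false (leaving `pos` untouched) if the input is truncated
// or the number does not fit into 64 bits.
bool dsvDecode(const std::vector<std::uint8_t>& in, std::size_t& pos,
               std::uint64_t& value) {
    std::uint64_t result = 0;
    for (std::size_t p = pos; p < in.size(); ++p) {
        if ((result >> 57) != 0) {
            return false;  // shifting by 7 more bits would lose data
        }
        result = (result << 7) | (in[p] & 0x7Fu);
        if ((in[p] & 0x80u) == 0) {  // flag is 0: this was the last byte
            value = result;
            pos = p + 1;
            return true;
        }
    }
    return false;  // ran out of bytes while the flag was still set
}
```

With this sketch, 123 is encoded as the single byte 0x7B, and 300 as the two bytes 0x82 0x2C (the groups 0000010 and 0101100, with the flag set on the first byte only).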

Level 2 - objects:
At this level, a number in the DSV format stores the size of an object of arbitrary size and is followed by the object's data:
Object1_SizeDataInDSV, Object1_Data, Object2_SizeDataInDSV, Object2_Data, ...
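
A minimal C++ sketch of this layout, assuming the size is counted in bytes and the payload is an opaque byte string. The names Bytes, dsvEncode, dsvDecode, dsvWriteObject and dsvReadObject are hypothetical and are not taken from the mini-library; sizes are limited to 64 bits here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: append a DSV number (7 data bits per byte, big-endian,
// top bit set on every byte except the last).
void dsvEncode(std::uint64_t value, Bytes& out) {
    Bytes reversed;
    reversed.push_back(static_cast<std::uint8_t>(value & 0x7F));
    while ((value >>= 7) != 0) {
        reversed.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    }
    out.insert(out.end(), reversed.rbegin(), reversed.rend());
}

// Hypothetical helper: read a DSV number at `pos`; false on truncation/overflow.
bool dsvDecode(const Bytes& in, std::size_t& pos, std::uint64_t& value) {
    std::uint64_t result = 0;
    for (std::size_t p = pos; p < in.size(); ++p) {
        if ((result >> 57) != 0) return false;
        result = (result << 7) | (in[p] & 0x7Fu);
        if ((in[p] & 0x80u) == 0) {
            value = result;
            pos = p + 1;
            return true;
        }
    }
    return false;
}

// Append one object: its size as a DSV number, then the payload bytes.
void dsvWriteObject(const std::string& payload, Bytes& out) {
    dsvEncode(payload.size(), out);
    out.insert(out.end(), payload.begin(), payload.end());
}

// Read one object starting at `pos`; on success advance `pos` past it.
bool dsvReadObject(const Bytes& in, std::size_t& pos, std::string& payload) {
    std::size_t p = pos;
    std::uint64_t size = 0;
    if (!dsvDecode(in, p, size)) return false;
    if (size > in.size() - p) return false;  // payload is cut short
    const std::size_t n = static_cast<std::size_t>(size);
    payload.assign(in.begin() + static_cast<std::ptrdiff_t>(p),
                   in.begin() + static_cast<std::ptrdiff_t>(p + n));
    pos = p + n;
    return true;
}
```

For the string "123" from the Problem section, this sketch produces 1 byte of size (0x03) plus 3 bytes of payload, so the efficiency becomes 3/4 instead of 3/7.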

Level 3 - the sequence of objects:
At this level, a number in the DSV format is used to indicate the size of a sequence of objects whose elements are objects in the DSV format.
Sequence1_SizeDataInDSV, Sequence1_Data (Object1_SizeDataInDSV, Object1_Data, Object2_SizeDataInDSV, Object2_Data, ...), Sequence2_SizeDataInDSV, Sequence2_Data (...), ...
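
As a rough sketch of how such a sequence could be assembled in C++, assuming that the sequence size is the total size in bytes of its encoded elements (the text above does not say whether bytes or elements are counted). The names Bytes, dsvNumber, dsvWrap and dsvSequence are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: a DSV number as bytes (limited to 64 bits here).
Bytes dsvNumber(std::uint64_t value) {
    Bytes reversed;
    reversed.push_back(static_cast<std::uint8_t>(value & 0x7F));
    while ((value >>= 7) != 0) {
        reversed.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    }
    return Bytes(reversed.rbegin(), reversed.rend());
}

// Hypothetical helper: prefix a payload with its size in bytes (Level 2 object).
Bytes dsvWrap(const Bytes& payload) {
    Bytes out = dsvNumber(payload.size());
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Hypothetical helper: wrap every element as an object, concatenate them,
// then prefix the whole body with its size in bytes (assumption, see above).
Bytes dsvSequence(const std::vector<Bytes>& objects) {
    Bytes body;
    for (const Bytes& object : objects) {
        const Bytes wrapped = dsvWrap(object);
        body.insert(body.end(), wrapped.begin(), wrapped.end());
    }
    return dsvWrap(body);
}
```

Under these assumptions, a sequence of the two objects "ab" and "c" becomes 05 02 61 62 01 63: the sequence size (5), then each object preceded by its own size. A reader that does not understand the contents can skip the whole sequence by reading only the first number.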

In this way, you can build a hierarchy of objects, as in XML.
Since the DSV format is binary, unlike XML, I estimate that conversion in both directions is 10 ... 1000 times faster and takes 2 ... 5 times less memory, because the data does not have to be converted to text form and back.

If anyone knows of projects with similar functionality, please let me know.
If someone is interested in the implementation, here is a link to the source code of the mini-library.

Source: https://habr.com/ru/post/194492/

