In the first article about the structure of a QVD file, I described the general structure and dealt in some detail with the metadata, in the second , the storage of columns (characters). In this article I will describe the format of storing information about rows, summarize, tell you about plans and achievements.
So (remember) the QVD file corresponds to a relational table, in the QVD file the table is stored as two indirectly related parts:
Character tables (my term) contain the unique values of each column in the source table. I talked about them in the second article.
The row table contains the rows of the source table, each row stores the indices of the values of the column (field) of the row in the corresponding symbol table. This is what this article will be about.
On the example of our tablet (remember - from the first part)
SET NULLINTERPRET =<sym>; tab1: LOAD * INLINE [ ID, NAME 123.12,"Pete" 124,12/31/2018 -2,"Vasya" 1,"John" <sym>,"None" ];
In the table of rows of our QVD file, this label will correspond to 5 lines - always an exact match: how many rows in a table, how many rows in the table of rows of a QVD file.
A string in the string table consists of non-negative integers, each of these numbers is an index into the corresponding symbol table. At the logical level, everything is simple, it remains to clarify the nuances and give an example (to disassemble - our table is represented in QVD).
The string table consists of K * N bytes, where
The table of lines begins with the offset "Offset" (tag metadata) relative to the beginning of the binary part of the file.
Information about the row table (length, row size, offset) is stored in the common part of the metadata.
All rows of the string table have the same format and are a concatenation of "unsigned numbers". The length of the number is minimal enough to represent a specific field: the length depends on the number of unique values of a specific field.
For fields with a single value (as I already wrote), this length will be zero (this value is the same in each row of the source table and is stored in the corresponding symbol table).
For fields with two values, this length will be equal to one (the possible index values in the symbol table are 0 and 1), and so on.
Since the cumulative length of a row of a table of rows must be a multiple of a byte, the length of the "last character" is aligned to the byte boundary (see below when we parse our table).
Information about the format of each field is stored in the metadata section dedicated to this field (let's stop a little more below), the length of the field's bit representation is stored in the "BitWidth" tag.
How to store missing values? Refraining from reasoning on the topic of "why," I will answer as follows: as far as I understand, the following combination corresponds to NULL values
Accordingly, all other indices in the column with NULL values are increased by 2 - we will see in our example a little lower.
The order of the fields in the row of the table of rows corresponds to the bit offset of the field, which is stored in the "BitOffset" tag of the metadata section of the field.
Let us examine our example (see the metadata in the first part of this series).
ID field
NAME field
Let's look at the real "zeroes and ones" - I will give fragments of a QVD file in the form of a binary representation "in hexadecimal format" (so compact).
First, the entire binary part (highlighted in pink, the metadata is cut off - it hurts a lot of them ...)
Compact enough, agree. Let's take a closer look - the symbol tables are located immediately after the metadata (metadata, by the way, ended up with a newline and a null byte in the file - technically this happens, null bytes after the metadata should be skipped ...).
The first symbol table is highlighted in the figure below.
We see:
The first unique ID field value is
The remaining three unique values are of type 5 (integer with string) - the values "124", "-2" and "1" (it is easy to see in the lines).
In the figure below, I selected the second symbol table (for the "NAME" field)
The first unique value of the field "NAME" - type "4" (the first byte allocated) is a string ending in zero.
The other four unique values are also the lines "12/31/2018", "Vaysa", "John" and "None".
Now - a table of rows (highlighted in the figure below)
As expected - 5 bytes (5 lines of one byte).
First line (corresponding to line 123.12, "Pete" of our table)
The string value is byte "02" (binary 000000010).
Separate it (remember the description above)
The second line (124.12 / 31/2018) in the string table
The value is byte "0B" (binary 00001011)
Well and so on, let's look quickly at the last line - there we had it, "None" (i.e. NULL and the string "None"):
The value is byte "20" (binary 0010000)
IMPORTANT I can not find an example confirming this, but I came across files that contained the final index -1 for NULL values. Therefore, in my programs I consider NULLs all fields whose total index is negative.
At the end of the QVD format parsing I will briefly focus on the important nuances - the long lines in the row table store the fields in right-to-left order, where the field with zero bit offset will be the most right (as I described above). But the byte order is reversed, i.e. the first byte will be the rightmost (and will contain the "right" field - the field with zero bit offset), the last - the first (ie, contain the most "left" field - the field with the maximum bit offset).
It is necessary to give an example, but not overload with details. Let's look at such a label (I quote a fragment — to get long rows in the string table, you need to increase the number of unique values).
tab2: LOAD * INLINE [ ID, VAL, NAME, PHONE, SINGLE 1, 100001, "Pete1", "1234567890", "single value" 2, 200002, "Pete2", "2234567890", "single value" ... ];
In summary, the information about the fields (squeeze from metadata):
The row table consists of rows of length 3 bytes, respectively, in a row of the row table the field data will logically decompose as follows:
The logical sequence is converted into physical by permutation of bytes in the reverse order, i.e.
Let's look at examples, here's what the first row of the table of rows looks like (highlighted in pink)
Field values
That is, the first line contains the first characters of the corresponding character tables.
In general, it is convenient to start parsing it from the first line - it usually contains zeros as an index (so the QVD file is built that the values from the first line fall into the symbol table).
Let's take a look at the second line to fix
Field values
Ie the second line contains the second characters from the corresponding symbol tables.
I will share a little experience - how I technically "read" QVD.
The first version was written on python (I will improve it and put it in github).
It quickly became clear the main problems:
Some of these problems can be solved by changing the language (from python to C, for example). The part demanded some additional actions.
The current fast enough implementation looks like this - the general logic is implemented on python, and the most critical operations are moved to separate C programs, launched in parallel.
Short
I don’t want to give figures on the performance - they will require binding to the hardware, at a qualitative level it turns out to copy the QVD file into the ORC table at about the speed of copying data over the network. Or, in other words, to take data from QVD is quite realistic (at the household level).
I also implemented the logic of creating QVD files - it works quite quickly on python (apparently, I haven’t reached large amounts yet - no need. I’ll go - I will rewrite it in a similar way to the “reading” variant).
What's next:
For a long time I was walking around the QVD files, it seemed that "everything is complicated there." It turned out that it was difficult, but not very good, to be served as a good push by the github, which I mentioned in the first part (a kind of catalyst). Next was the case of technology. To yourself and everyone to note (another confirmation) - in programming everything can be done, the question is in time and motivation.
I hope I am not very tired of the details, I am ready to answer questions (in the comments or in any other way). If there is a continuation, I will definitely write.
Source: https://habr.com/ru/post/457102/
All Articles