QVD files - what's inside, part 3

In the first article about the structure of QVD files I described the overall layout and covered the metadata in some detail; in the second, the storage of column values (symbols). In this article I will describe how information about rows is stored, sum up, and talk about plans and results.

So (a reminder): a QVD file corresponds to a relational table. Inside the QVD file the table is stored as two indirectly related parts:

Symbol tables (my term) contain the unique values of each column of the source table. I covered them in the second article.

The row table contains the rows of the source table; each row stores, for every column (field), the index of that field's value in the corresponding symbol table. This is what this article is about.

Using our example table (remember, from the first part):

SET NULLINTERPRET =<sym>;
tab1:
LOAD * INLINE [
ID, NAME
123.12,"Pete"
124,12/31/2018
-2,"Vasya"
1,"John"
<sym>,"None"
];

In the row table of our QVD file this table corresponds to 5 rows. The match is always exact: the row table of a QVD file has as many rows as the source table.

A row in the row table consists of non-negative integers, each of which is an index into the corresponding symbol table. At the logical level everything is simple; it remains to clarify the nuances and walk through an example (how our table is represented in QVD).
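The logical model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the on-disk format: the names `symbols`, `index_rows` and `materialize` are made up for this example, and NULL is modelled simply as `None`.

```python
# Symbol tables: the unique values of each column of tab1.
symbols = {
    "ID": [123.12, 124, -2, 1],
    "NAME": ["Pete", "12/31/2018", "Vasya", "John", "None"],
}

# Logical row table: one symbol-table index per field; None marks a NULL.
# How these indices are packed into bits is a separate question.
index_rows = [
    {"ID": 0, "NAME": 0},
    {"ID": 1, "NAME": 1},
    {"ID": 2, "NAME": 2},
    {"ID": 3, "NAME": 3},
    {"ID": None, "NAME": 4},
]


def materialize(index_row):
    """Replace each index with the value it points to in the symbol table."""
    return {
        field: None if idx is None else symbols[field][idx]
        for field, idx in index_row.items()
    }


table = [materialize(r) for r in index_rows]
for row in table:
    print(row)
```

Each source row is rebuilt purely from indices, which is why a column with few distinct values costs very little space per row.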

Row table format

The row table consists of K * N bytes, where

K is the number of rows in the source table (the value of the "NoOfRecords" metadata tag)
N is the byte length of one row of the row table (the value of the "RecordByteSize" metadata tag)

The row table starts at the offset given by the "Offset" metadata tag, counted from the beginning of the binary part of the file.

Information about the row table (number of rows, row size, offset) is stored in the common part of the metadata.
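As a rough sketch of how these three metadata values locate the rows, the helper below slices a byte buffer into fixed-size rows. The function name `iter_rows` and the sample bytes are hypothetical; only the meaning of "Offset", "RecordByteSize" and "NoOfRecords" is taken from the text.

```python
def iter_rows(binary_part, offset, record_byte_size, no_of_records):
    """Yield each row of the row table as a bytes object.

    binary_part      -- everything in the file after the metadata
    offset           -- value of the "Offset" tag
    record_byte_size -- value of the "RecordByteSize" tag (N)
    no_of_records    -- value of the "NoOfRecords" tag (K)
    """
    end = offset + no_of_records * record_byte_size
    if end > len(binary_part):
        raise ValueError("row table extends past the end of the binary part")
    for i in range(no_of_records):
        start = offset + i * record_byte_size
        yield binary_part[start:start + record_byte_size]


# Illustrative data only: 7 placeholder bytes standing in for the symbol
# tables, followed by a row table of 5 one-byte rows.
binary_part = b"SYMBOLS" + bytes([0x02, 0x0B, 0x14, 0x1D, 0x20])
rows = list(iter_rows(binary_part, offset=7, record_byte_size=1, no_of_records=5))
print(rows)
```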

Format of a row in the row table

All rows of the row table have the same format: a concatenation of unsigned integers. Each number is as short as possible while still able to represent its field: its bit length depends on the number of unique values of that field.

For fields with a single value (as I wrote before) this length is zero: the value is the same in every row of the source table and is stored only in the corresponding symbol table.

For fields with two values this length is one bit (the possible index values in the symbol table are 0 and 1), and so on.

Since the total length of a row of the row table must be a whole number of bytes, the length of the last field is padded to the byte boundary (see below, where we parse our table).
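The width rule can be approximated as follows. This is a guess that happens to reproduce the numbers in this article, not a documented algorithm: the helpers `min_bit_width` and `pad_last_field` are hypothetical, and the handling of a bias of -2 (stored indices shifted up by 2 for fields with NULLs) is an assumption.

```python
def min_bit_width(n_symbols, bias=0):
    """Bits needed for the largest stored index of a field.

    Assumption: stored index = symbol index - bias, so with bias -2 the
    largest stored index is (n_symbols - 1) + 2.
    """
    if n_symbols < 1:
        raise ValueError("a field needs at least one symbol")
    largest_stored = (n_symbols - 1) - bias
    return largest_stored.bit_length()


def pad_last_field(widths):
    """Widen the last field so that the row is a whole number of bytes."""
    if not widths:
        return []
    padding = -sum(widths) % 8
    return widths[:-1] + [widths[-1] + padding]


print(min_bit_width(1))             # single value: no bits at all
print(min_bit_width(2))             # two values: one bit
# tab1: ID has 4 symbols plus NULLs (bias -2), NAME has 5 symbols.
widths = [min_bit_width(4, bias=-2), min_bit_width(5)]
print(widths, pad_last_field(widths))
```

For tab1 this gives widths of 3 and 3 bits before padding and 3 and 5 bits after, which matches the "BitWidth" values discussed later in the article.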

Information about the format of each field is stored in the metadata section dedicated to that field (more on this below); the length of the field's bit representation is stored in the "BitWidth" tag.

Storing NULL values

How are missing values stored? Skipping the discussion of "why", I will answer this way: as far as I understand, NULL values correspond to the following combination:

the "Bias" tag of the field has the value "-2" (in total I have come across two values of this tag: "0" and "-2")
the stored field index in a row where the field is NULL is 0

Accordingly, all other stored indices in a column with NULL values are increased by 2, and adding the bias back gives the real symbol table index. We will see this in our example a little further down.
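A minimal sketch of this rule, assuming the final index is simply the stored index plus the bias. The function name `symbol_index` is hypothetical, and treating every negative result as NULL is slightly broader than the combination listed above.

```python
def symbol_index(stored_index, bias):
    """Return the symbol-table index, or None when the field is NULL.

    stored_index -- the unsigned number read from the row
    bias         -- value of the field's "Bias" tag (0 or -2 in the files
                    described here)
    """
    total = stored_index + bias
    return None if total < 0 else total


print(symbol_index(0, -2))  # stored 0 with bias -2: NULL
print(symbol_index(2, -2))  # stored 2 with bias -2: first symbol (index 0)
print(symbol_index(4, 0))   # no bias: the stored index is the real index
```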

The order of the fields in a row

The position of a field in a row of the row table is determined by its bit offset, which is stored in the "BitOffset" tag of the field's metadata section.

Let us examine our example (see the metadata in the first part of this series).

ID field

bit offset 0 - the field is the "rightmost" one
bit width 3 - the field occupies 3 bits in a row of the row table
Bias is "-2" - the field has NULL values, all stored indices are increased by 2

NAME field

bit offset 3 - the field is located "to the left" of the ID field, starting at bit 3
bit width 5 - the field occupies 5 bits in a row of the row table (padded to the byte boundary)
Bias is "0" - the field has no NULL values, all stored indices are the real ones

Representation of our table

Let's look at the real "zeroes and ones": I will show fragments of the QVD file as a hexadecimal dump (it is more compact).

First, the entire binary part (highlighted in pink; the metadata is cut off, there is too much of it...)

Quite compact, you must agree. Let's take a closer look: the symbol tables are located immediately after the metadata (in this file, by the way, the metadata ends with a newline and a null byte; this does happen, and null bytes after the metadata should be skipped...).
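One possible way to find where the binary part begins is sketched below. It assumes that the XML metadata ends with the closing tag `</QvdTableHeader>` (this tag name does not appear in this article and is an assumption here) and that only line breaks and null bytes separate it from the first symbol table. The function name `binary_part_start` is hypothetical.

```python
END_TAG = b"</QvdTableHeader>"  # assumed closing tag of the XML metadata


def binary_part_start(data):
    """Return the position of the first byte after the metadata.

    Skips the line break and null byte(s) that follow the closing tag.
    This is safe only if the binary part does not itself begin with one
    of those bytes; a symbol begins with a type byte such as 4, 5 or 6.
    """
    pos = data.index(END_TAG) + len(END_TAG)
    while pos < len(data) and data[pos] in b"\r\n\x00":
        pos += 1
    return pos


# Tiny made-up file: empty header, line break, null byte, one string symbol.
sample = b"<QvdTableHeader></QvdTableHeader>\r\n\x00" + b"\x04Pete\x00"
start = binary_part_start(sample)
print(start, sample[start:])
```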

The first symbol table is highlighted in the figure below.

We see:

The first unique value of the ID field:

type "6" (the first highlighted byte) - a floating-point number with a string (see the second article)
the 8 bytes after the type byte are the binary floating-point number
after them comes the string representation, terminated by a zero byte - very convenient (no need to remember how the number was originally written)

The remaining three unique values are of type 5 (integer with a string): the values "124", "-2" and "1" (easy to spot from their strings).

In the figure below I have highlighted the second symbol table (for the "NAME" field).

The first unique value of the "NAME" field is of type "4" (the first highlighted byte): a zero-terminated string.

The other four unique values are also strings: "12/31/2018", "Vasya", "John" and "None".
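The walk through the symbol tables can be sketched as a small sequential parser. Several details are assumptions rather than facts from this article: that integers are 4-byte little-endian signed values, that floats are 8-byte little-endian doubles, that strings are UTF-8, and that types 1 and 2 are the string-less variants of types 5 and 6. The sample bytes are rebuilt from the values listed above, not copied from the real file.

```python
import struct


def parse_symbols(data):
    """Sequentially parse one symbol table into (type, number, text) tuples."""
    symbols = []
    pos = 0
    while pos < len(data):
        kind = data[pos]
        pos += 1
        number = None
        text = None
        if kind in (1, 5):          # integer, optionally followed by a string
            (number,) = struct.unpack_from("<i", data, pos)
            pos += 4
        elif kind in (2, 6):        # double, optionally followed by a string
            (number,) = struct.unpack_from("<d", data, pos)
            pos += 8
        elif kind != 4:             # 4 is a plain string
            raise ValueError(f"unknown symbol type {kind} at byte {pos - 1}")
        if kind in (4, 5, 6):       # zero-terminated string part
            end = data.index(b"\x00", pos)
            text = data[pos:end].decode("utf-8")
            pos = end + 1
        symbols.append((kind, number, text))
    return symbols


# Symbol table of the ID field, reconstructed from the description above.
id_table = (
    b"\x06" + struct.pack("<d", 123.12) + b"123.12\x00"
    + b"\x05" + struct.pack("<i", 124) + b"124\x00"
    + b"\x05" + struct.pack("<i", -2) + b"-2\x00"
    + b"\x05" + struct.pack("<i", 1) + b"1\x00"
)
id_symbols = parse_symbols(id_table)
for symbol in id_symbols:
    print(symbol)
```

Note that the parser has to visit every symbol in order, because the length of each one is known only after reading it.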

Now the row table (highlighted in the figure below)

As expected, 5 bytes (5 rows of one byte each).

The first row (corresponding to the row 123.12, "Pete" of our table)

The row value is the byte "02" (binary 00000010).

Let's split it up (remember the description above):

the right 3 bits (binary 010, decimal 2) are the stored index for the "ID" field
the "ID" field contains NULLs, so stored indices are increased by 2; adding the bias of -2 gives the final index 0, which corresponds to the symbol "123.12"
the next 5 bits (binary 00000, decimal 0) are the index in the symbol table of the "NAME" field; it contains no NULLs, so this is directly the index of "Pete" in the symbol table

The second row (124, 12/31/2018) in the row table

The value is the byte "0B" (binary 00001011)

the right 3 bits (binary 011, decimal 3) are the stored index for the "ID" field
the "ID" field contains NULLs, so stored indices are increased by 2; adding the bias of -2 gives the final index 1, which corresponds to the symbol "124"
the next 5 bits (binary 00001, decimal 1) are the index in the symbol table of the "NAME" field; it contains no NULLs, so this is directly the index of "12/31/2018" in the symbol table

And so on. Let's take a quick look at the last row, where we had a NULL and the string "None":

The value is the byte "20" (binary 00100000)

the right 3 bits (binary 000, decimal 0) are the stored index for the "ID" field
the "ID" field contains NULLs, so stored indices are increased by 2; adding the bias of -2 gives the final index -2, which corresponds to the value NULL
the next 5 bits (binary 00100, decimal 4) are the index in the symbol table of the "NAME" field; it contains no NULLs, so this is directly the index of "None" in the symbol table

IMPORTANT: I cannot find an example to confirm this now, but I have come across files in which the final index for NULL values was -1. Therefore, in my programs I treat as NULL every field whose final index is negative.
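The three rows parsed by hand above can be checked with a short script. The `FIELDS` structure and the `decode_row` function are hypothetical names for this sketch; the offsets, widths, biases, symbol values and row bytes are the ones quoted in this article, and any negative final index is treated as NULL.

```python
# (name, bit offset, bit width, bias, symbol table) for each field of tab1
FIELDS = [
    ("ID", 0, 3, -2, [123.12, 124, -2, 1]),
    ("NAME", 3, 5, 0, ["Pete", "12/31/2018", "Vasya", "John", "None"]),
]


def decode_row(row_byte):
    """Decode a one-byte row of tab1 into a dict of field values."""
    result = {}
    for name, offset, width, bias, symbols in FIELDS:
        mask = (1 << width) - 1                 # e.g. width 3 -> 0b111
        stored = (row_byte >> offset) & mask    # cut the field out of the row
        index = stored + bias                   # apply the bias
        result[name] = None if index < 0 else symbols[index]
    return result


for row_byte in (0x02, 0x0B, 0x20):
    print(f"{row_byte:02X} -> {decode_row(row_byte)}")
```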

Longer rows in the row table

To finish the analysis of the QVD format I will briefly dwell on an important nuance: long rows of the row table store the fields in right-to-left order, where the field with zero bit offset is the rightmost (as described above). But the byte order is reversed, i.e. the first byte is the rightmost one (and contains the "right" field, the one with zero bit offset), and the last byte is the leftmost one (i.e. it contains the most "left" field, the one with the maximum bit offset). In other words, the row is read as a little-endian integer.

An example is needed here, but without overloading it with details. Let's look at the following table (I quote a fragment; to get long rows in the row table you need to increase the number of unique values).

tab2:
LOAD * INLINE [
ID, VAL, NAME, PHONE, SINGLE
1, 100001, "Pete1", "1234567890", "single value"
2, 200002, "Pete2", "2234567890", "single value"
...
];

In summary, the information about the fields (extracted from the metadata):

ID: width 8 bits, bit offset 0, bias 0
VAL: width 5 bits, bit offset 8, bias 0
NAME: width 6 bits, bit offset 18, bias 0
PHONE: width 5 bits, bit offset 13, bias 0
SINGLE: width 0 bits (a single value)

Rows of the row table are 3 bytes long; logically, the field data in a row is laid out as follows:

first 6 bits - the "NAME" field
next 5 bits - the "PHONE" field
next 5 bits - the "VAL" field
last 8 bits - the "ID" field

The logical sequence is converted into the physical one by reversing the order of the bytes, i.e.

the "ID" field completely occupies the first byte (which is the last in logical sequence)
the "VAL" field occupies the lower 5 bits of the second byte
the "PHONE" field occupies the upper 3 bits of the second byte and the lower 2 bits of the third byte
the "NAME" field occupies the upper 6 bits of the third byte

Let's look at examples, here's what the first row of the table of rows looks like (highlighted in pink)

Field values

ID - binary 00000000 decimal 0
VAL - binary 00010, decimal 2, subtract 2 due to bias - we get 0
PHONE - binary 00010, decimal 2, subtract 2 due to bias - we get 0
NAME - binary 000000 decimal 0

That is, the first row refers to the first symbols of the corresponding symbol tables.

In general, it is convenient to start parsing from the first row: it usually contains zero indices (the QVD file is built so that the values from the first row are the first to get into the symbol tables).

Let's take a look at the second row to reinforce this

Field values

ID - binary 00000001 decimal 1
VAL - binary 00011, decimal 3, subtract 2 for bias - we get 1
PHONE - binary 00011, decimal 3, subtract 2 for bias - we get 1
NAME - binary 000001 decimal 1

I.e. the second row refers to the second symbols of the corresponding symbol tables.
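The byte order and bit layout of this section can be sketched as a pack/unpack round trip. The names `LAYOUT`, `pack_row` and `unpack_row` are hypothetical; the offsets and widths are the ones listed for tab2, and the sample uses the stored values (before the bias is applied) quoted for the second row. The bytes are computed by the script, not copied from the real file.

```python
# name: (bit offset, bit width) for the fields of tab2 that occupy bits
LAYOUT = {
    "ID": (0, 8),
    "VAL": (8, 5),
    "PHONE": (13, 5),
    "NAME": (18, 6),
}
ROW_BYTES = 3


def pack_row(stored):
    """Pack stored (pre-bias) indices into the bytes of one row."""
    value = 0
    for name, (offset, width) in LAYOUT.items():
        if stored[name] >> width:
            raise ValueError(f"{name} does not fit into {width} bits")
        value |= stored[name] << offset
    # Little-endian: the byte holding bit offset 0 comes first in the file.
    return value.to_bytes(ROW_BYTES, "little")


def unpack_row(row):
    """Split the bytes of one row back into stored (pre-bias) indices."""
    value = int.from_bytes(row, "little")
    return {
        name: (value >> offset) & ((1 << width) - 1)
        for name, (offset, width) in LAYOUT.items()
    }


second_row = {"ID": 1, "VAL": 3, "PHONE": 3, "NAME": 1}
raw = pack_row(second_row)
print(raw.hex(" "))
print(unpack_row(raw))
```

Reading the whole row as one little-endian integer and then shifting and masking avoids handling fields such as PHONE, which straddle a byte boundary, as a special case.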

Efficient format parsing

I will share a bit of experience: how I technically "read" QVD.

The first version was written in Python (I will polish it and put it on GitHub).

The main problems quickly became clear:

symbol tables can only be read sequentially (it is impossible to read symbol number N without reading all the preceding symbols)
real files "do not fit" into RAM
the slowest operations (apart from file I/O) are the bit operations (unpacking a row of the row table)
performance drops sharply on wide QVD files (with many columns)

Some of these problems can be solved by changing the language (from Python to C, for example). Others required additional measures.

The current, reasonably fast implementation looks like this: the general logic is implemented in Python, and the most critical operations are moved into separate C programs that run in parallel.

In short:

symbol tables are written to separate files, and indexes are additionally created for text fields, which makes it possible to read symbol number N directly
work with the QVD file and with the symbol table files is done through memory-mapped files (this is faster)
first, the files with symbol tables (and indexes) are created in parallel (limited by the number of processors)
then, also in parallel (with the same limit), the rows of the row table are read and CSV files are created (in HDFS)
the final step is converting these files into an ORC table (using Hive)
the creation of the symbol table files and the creation of a CSV file for a range of rows are implemented in C
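The idea of an index over a memory-mapped symbol file can be illustrated as follows. This is not the author's implementation: the intermediate file format (plain zero-terminated UTF-8 strings), the file name and the functions `build_offsets` and `read_symbol` are all invented for this sketch.

```python
import mmap
import os
import tempfile


def build_offsets(buf):
    """Scan the buffer once and record where each string starts."""
    offsets = []
    pos = 0
    size = len(buf)
    while pos < size:
        offsets.append(pos)
        end = buf.find(b"\x00", pos)
        if end == -1:
            raise ValueError("unterminated string at the end of the buffer")
        pos = end + 1
    return offsets


def read_symbol(buf, offsets, n):
    """Read string number n directly, without scanning the preceding ones."""
    start = offsets[n]
    end = buf.find(b"\x00", start)
    return buf[start:end].decode("utf-8")


names = ["Pete", "12/31/2018", "Vasya", "John", "None"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "NAME.sym")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"".join(s.encode("utf-8") + b"\x00" for s in names))
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        offsets = build_offsets(buf)
        fourth = read_symbol(buf, offsets, 3)

print(offsets, fourth)
```

The sequential scan happens once, when the index is built; after that any symbol can be fetched by position, and the operating system decides which pages of the file are kept in memory.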

I do not want to give performance figures, since they would have to be tied to specific hardware; qualitatively, copying a QVD file into an ORC table runs at about the speed of copying the data over the network. In other words, getting data out of QVD is quite realistic (for everyday purposes).

I also implemented the creation of QVD files. It works quite fast in Python (apparently I have not reached large volumes yet, as there was no need; when I do, I will rewrite it along the lines of the "reading" variant).

Future plans

What's next:

I plan to put the Python version of the code on GitHub (this version lets you "explore" a QVD file: view the metadata, read and write symbols and rows. It is as simple as possible and obviously slow: no separate files for symbol tables, sequential reading, standard libraries for bit manipulation, etc.)
I am thinking about doing something for pandas (like read_qvd()); what holds me back is that it will be slow in Python, and that not every QVD will "fit" into memory, therefore
I am thinking about making a QVD file a data source for Spark: the problem of not fitting into memory should not arise there (and the language there, Scala, is closer to the hardware)

Instead of an epilogue

For a long time I circled around QVD files; it seemed that "everything is complicated there". It turned out to be complicated, but not very. The GitHub project I mentioned in the first part gave a good push (a kind of catalyst). The rest was a matter of technique. A note to myself and to everyone (one more confirmation): in programming everything can be done, it is only a question of time and motivation.

I hope I have not tired you too much with the details. I am ready to answer questions (in the comments or in any other way). If there is a continuation, I will definitely write about it.

Source: https://habr.com/ru/post/457102/
