In the first article about the structure of a QVD file, I described the general structure and dealt in some detail with metadata. In this article I will describe the format for storing information about columns, share my experience in interpreting this data.
So (remember) QVD file corresponds to a relational table, which, as you know, consists of rows. Each row of the table, in turn, consists of columns (or fields), and the rows have the same structure, which can be described, for example, by a SQL (create table) operator.
In the QVD file, the table is stored as two indirectly related parts:
Character tables (my term) contain the unique values of each column in the source table. About them will be discussed below.
The row table contains the rows of the source table, each row stores the indices of the values of the column (field) of the row in the corresponding symbol table. I will talk about the row table in more detail in the third part of this series.
On the example of our tablet (remember - from the first part)
SET NULLINTERPRET =<sym>; tab1: LOAD * INLINE [ ID, NAME 123.12,"Pete" 124,12/31/2018 -2,"Vasya" 1,"John" <sym>,"None" ];
In this label:
As a rule, character tables are created for all table fields in a QVD file. But there are nuances.
If the field has one value, then this value is usually stored in the symbol table (in this case, the symbol table will contain one entry). And in the string table the field will be absent (for it is therefore clear what the value of this field should be in each row ...)
If the field has no values at all (always contains NULL), the symbol table is not created for it.
These special cases will be described in the third part, when we get to the lines and the algorithm for their reconstruction.
Each symbol table is stored in the QVD file as a binary block, its offset (relative to the beginning of the binary block) is contained in the Offset field of the metadata section of this field, its length (in bytes) in the Length field of the metadata.
Thus, the first character table will always have an offset of 0.
The character tables follow one after the other and are not separated from each other.
The symbol table contains field values that follow each other without separators, each value is represented as follows:
Strings are stored “as is” (in the encoding specified in the metadata), the string ends in a zero byte. The string can have zero length, i.e. consist of only zero byte.
Binary values are stored according to the rules of the architecture where the QVD file was generated (from my experience, you can simply read them as binary values with an eye to "endian-ness").
The whole variety of data types QVD brought with three basic
There are also combined types
In parentheses, I gave the numerical values of the “types” (the first byte of the field value in the symbol table).
An inquisitive mind asks, “where is the three?”. This is not for me, I have a lot of questions too, as the hero of Khabensky said in the famous film “I will refrain…” from comments here.
In general, that's all, not difficult - right?
The same field can have values of different types in the symbol table (integer, floating, and string). I did not believe it myself, until I conducted a series of experiments ... The only thing that can be “guaranteed” (with the reservation from the first part - nothing can be guaranteed) - there cannot be a mixture of “number” and “number with string” (either this or that ). This is important, an inquiring mind will understand :-).
Values of non-numeric fields (not types 1 and 2 in the notation above) have to be read in succession - it is impossible to position on field number N ... Explainable but inefficient (in terms of processing).
Consider again our above label, the symbol table of the ID field will look like this (I write byte-by-character / character-by-character):
Total 40 bytes (see the previous section - metadata, the value of the Length attribute for the ID field).
The practical task (one of them), as I already wrote, was for me to re-create the table by QVD file. It follows from the above (at least - it should follow, I tried :-)) that it is impossible to unambiguously determine the type of the field (the one that, for example, write in "create table ...") from the column description (metadata plus data) .
As I mentioned in the first part - 90% of the fields have the UNKNOWN type in the metadata, the tags also do not allow to unambiguously determine the type of the field (I won’t load the reader with details - believe me) ...
How to be?
In my work I went along the statistical path - I analyze a certain percentage of the column values and, based on the results, I conclude what type to assign to it. Accuracy turns out to be quite satisfactory, the trouble is that you need to analyze (in general) all the data ... In my practice, I limited myself to the first 5-10% of the field values.
If we are done with this about data types, then an inquiring mind will ask a very reasonable question - the above mentioned "create table" means much more than data types ...
I will say this: in the processed files no data types other than those listed above were found. The files corresponded to quite real tables of real databases and contained the whole range of data types (for example, I even got some blobs ... Why did they write to QVD ??? It would be better to write comments).
Probably, for completeness of the picture with data types, it is necessary to clarify about dates and time stamps (other types are a matter of length).
Dates are represented in QVD as an integer - the number of days from the beginning of an era (click era). QlikView / QlikSense specialists will easily say when it started (although it was December 30, 1899, don’t ask why).
Timestamps in QVD are represented by a floating number containing the date as described above, and the time in the fractional part (where .0 corresponds to the time "00:00:00" and .999999 corresponds to the time "23:59:59" - see in more detail, for example, here ).
I haven't dug deep in this direction yet - my tables, recreated from QVD, contain integer and floating types for fields of type "date" and "datetime". Alternatively, you can use the string representation — for fields of this type, a combined representation is always used (types 5 and 6).
The last (about practice) - when reading large files it is logical to create indexes for string fields, which I did. This significantly reduces the processing time in cases where the size of the symbol table is much smaller than the number of rows (that is, one value is found in the field of the source table more than once).
In this article we looked at storing unique values of fields (columns), saw that columns are stored as a sequence of unique values, realized that the data types are mixed and there are only three types (integer, floating, and string).
Next, it remains for us to get acquainted with how the lines are stored - the third and final part of a series of articles on the structure of QVD will be devoted to this.
Source: https://habr.com/ru/post/455940/
All Articles