The complexity and eeriness of Word files have long been legendary. The format was known to be extremely convoluted and also completely secret, so that one could only guess at half of the fields in it.
I won't hide that these files always interested me, but I never got past the first page of the description. The gestalt, however, remained unclosed.
And now life has forced me (or handed me the opportunity) to finally dig into the guts of these all-too-familiar documents, especially since there is no longer any need to play Stirlitz: you can just download the official specifications from the Microsoft website.
What can I say? I can't help recalling the old vulgar joke: well, horror. Well, just horror, but not horror-horror-horror.
Thank God I parsed these files in Perl and not in some C-like autocode. A high-level language and a pile of ready-made libraries (for example, for reading different code pages) are a gift from God.
All in all the work took a week, and the hardest part was understanding the internal format. Of course, this understanding is not quite complete, because my task was only to extract text from the document without any formatting, but that part I did carefully.
So, how are Word files arranged?
Container
To begin with, these are not Word files at all, but a kind of universal container into which the actual documents are packed. All Office files are put into such a container, and maybe something else besides.
The container format goes by various names: docfile, OLE storage, compound document file. The inconsistency in naming already hints that it is not really needed, because a truly useful thing usually has one clear name. Its main idea is to cram several files into one. The most reasonable approach would have been to pack everything into a ZIP archive, as OpenOffice did (without compression, if it is not needed). That format is understood by a huge variety of programs, is compact, and is easy to grasp. But at Microsoft the “Not Invented Here” syndrome is apparently flourishing. The main thing is to reinvent the bicycle: three-wheeled, perhaps, but their own.
In principle, the OLE storage format is reasonable enough and does not give the impression of something completely idiotic. But... it is absolutely unnecessary. In essence, it is something like the FAT16 file system stuffed inside a single file. This is not even shooting sparrows with a cannon; it is exterminating those same sparrows with nuclear torpedoes. A document containing a handful of files has no need for a file system on the level of the DOS era.
So, CDF files (as the Unix utility file calls them) begin with a header. The header has room reserved for the first hundred or so FAT records (FAT being the file allocation table). The FAT here is the genuine article, very similar to what you would find on DOS diskettes. The remaining FAT records live in separate sectors joined in a linked list. These additional sectors appear only in large (> 7 MB) files.
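For the curious, here is roughly what reading that header can look like. This is a minimal sketch in Python (not the Perl used for the actual work), and the field offsets are taken from the published compound file specification as I understand it, so treat them as assumptions to check against the document. In that layout the header's 109 slots hold the numbers of the sectors where the FAT is stored. The function name parse_header and the dictionary keys are made up for illustration.

```python
import struct

MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature at the start of a compound file
FREESECT = 0xFFFFFFFF                      # marks an unused slot in the header's sector list


def parse_header(header):
    """Pull out the handful of header fields needed to reach the internal files."""
    if len(header) < 512 or header[:8] != MAGIC:
        raise ValueError("not a compound document file")
    # Sector sizes are stored as powers of two (9 means 512 bytes, 6 means 64 bytes).
    sector_shift, mini_shift = struct.unpack_from("<HH", header, 30)
    (num_fat, first_dir, _txn, mini_cutoff,
     first_minifat, num_minifat,
     first_difat, num_difat) = struct.unpack_from("<8I", header, 44)
    # 109 slots with the sector numbers where the FAT itself lives.
    slots = struct.unpack_from("<109I", header, 76)
    return {
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "mini_cutoff": mini_cutoff,          # files smaller than this use the 64-byte blocks
        "first_dir": first_dir,              # first sector of the root directory
        "first_minifat": first_minifat,      # first sector of the mini-FAT
        "fat_sectors": [s for s in slots if s != FREESECT],
        "first_extra_fat_list": first_difat, # linked list used only by large files
    }
```

A real reader would call it as parse_header(data[:512]) on the raw file contents and then go on to collect the FAT from the sectors listed in fat_sectors.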
Using the FAT, you can assemble the contents of any internal file (in MS terminology it is called a stream, which only confuses matters) if you know which block it starts from. The starting blocks of the various structures are, naturally, written in the header. From the table you can then pull out the entire chain of sectors holding the contents of this pseudo-file.
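Pulling out such a chain can be sketched as follows. This is an illustration under stated assumptions, not a full reader: the FAT is assumed to be already loaded into a Python list of integers, the end-of-chain value 0xFFFFFFFE and the rule that sector 0 starts right after a header occupying one sector's worth of space come from the published specification, and read_chain is a name invented here.

```python
ENDOFCHAIN = 0xFFFFFFFE  # FAT value that terminates a chain


def read_chain(data, fat, start, sector_size=512):
    """Concatenate the sectors of one internal file by walking its FAT chain."""
    out = bytearray()
    seen = set()
    sector = start
    while sector != ENDOFCHAIN:
        # A loop or an out-of-range index means the table is damaged.
        if sector in seen or sector >= len(fat):
            raise ValueError("corrupt FAT chain")
        seen.add(sector)
        # Sector 0 sits right after the header, hence the "+ 1".
        offset = (sector + 1) * sector_size
        out += data[offset:offset + sector_size]
        sector = fat[sector]  # the FAT entry names the next sector of the file
    return bytes(out)
```

The result is always a whole number of sectors, so the caller still has to cut it down to the file size recorded in the directory.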
In particular, a CDF has a root directory that is physically spread across a bunch of sectors. It is a real directory, again very reminiscent of good old DOS. True, for efficiency (or for showing off) it is not a plain linear list but a balanced binary tree. This means tens of thousands of individual entries can be written into it without loss of search efficiency. Why this is needed in a file that usually holds about five entries, sometimes twenty, and a hundred at the very most (a presentation with a huge number of pictures) is known only in Redmond. By the way, the file names in the directory are stored in UTF-16, also just in case.
From the directory you can determine the starting sector of any file and use the FAT to pull out its entire chain of sectors.
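Since a typical file has only a handful of entries, a reader can simply ignore the tree and scan the directory linearly. The sketch below does exactly that, which is a deliberate simplification and not how the format intends lookups to work. It assumes 128-byte entries laid out as in the published specification (UTF-16 name in the first 64 bytes, name length at offset 64, entry type at 66, start sector at 116, size at 120); parse_directory is a hypothetical name, and names from nested storages would collide in this flat dictionary.

```python
import struct

ENTRY_SIZE = 128  # assumed fixed size of one directory entry


def parse_directory(raw):
    """Map entry name -> (type, start sector, size), ignoring the tree links."""
    entries = {}
    for pos in range(0, len(raw) - ENTRY_SIZE + 1, ENTRY_SIZE):
        entry = raw[pos:pos + ENTRY_SIZE]
        # name_len counts bytes and includes the two-byte terminating zero.
        name_len, obj_type = struct.unpack_from("<HB", entry, 64)
        if obj_type == 0 or name_len < 2:
            continue  # unused slot
        name = entry[:name_len - 2].decode("utf-16-le")
        # Only the low 32 bits of the size are read, enough for files under 4 GB.
        start, size = struct.unpack_from("<II", entry, 116)
        entries[name] = (obj_type, start, size)
    return entries
```

With the directory in this form, finding the document is a plain lookup such as entries["WordDocument"].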
But that is not all.
Since the block size is fairly large (usually 512 bytes; the specification also allows 4096), storing small pseudo-files could in theory waste a lot of space. So there is a separate storage area divided into blocks of 64 bytes. This storage area is, once again, pulled out along a FAT chain.
To indicate which of these small blocks belong to which file, there is a separate FAT table or, more properly, a mini-FAT.
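Reading a small file out of that storage is the same chain walk again, only with 64-byte blocks and the mini-FAT. The sketch below assumes the storage area and the mini-FAT have already been assembled into a bytes object and a list; the 4096-byte threshold for "small" is the usual value stored in the header, not a constant to rely on, and both function names are invented for illustration.

```python
ENDOFCHAIN = 0xFFFFFFFE  # chain terminator, same value as in the main FAT
MINI_SECTOR = 64         # size of one small block
MINI_CUTOFF = 4096       # assumed usual threshold; the real one is in the header


def lives_in_mini_storage(size, cutoff=MINI_CUTOFF):
    """Files below the cutoff are kept in the 64-byte block storage."""
    return size < cutoff


def read_small_file(mini_storage, minifat, start, size):
    """Assemble a small file from 64-byte blocks by walking the mini-FAT."""
    out = bytearray()
    seen = set()
    block = start
    while block != ENDOFCHAIN and len(out) < size:
        if block in seen or block >= len(minifat):
            raise ValueError("corrupt mini-FAT chain")
        seen.add(block)
        # Blocks are numbered from the start of the storage area, no header to skip.
        out += mini_storage[block * MINI_SECTOR:(block + 1) * MINI_SECTOR]
        block = minifat[block]
    return bytes(out[:size])  # drop the padding in the last block
```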
So, to get to the Word document, you need to do the following:
1. Read the CDF header.
2. Load the FAT (file allocation table) into memory, gathering it along a chain of sectors.
3. Load the mini-FAT table, collecting it along a FAT chain.
4. Load the storage area of small blocks, collecting it along a FAT chain.
5. Load the root directory, collecting it along a FAT chain.
6. Parse the directory and convert it into something readable.
7. Find the directory entry named WordDocument.
8. If it is a small file, assemble it from the small-block storage using the mini-FAT.
9. If it is a large one, pull it from the file's sectors along the FAT chain.
Er, that seems to be everything.
None of the steps is particularly difficult on its own, but taken together they cause nothing but bewilderment. Why so complicated? Why was it impossible to place an ordinary linear directory after the header and then write the internal files contiguously, one after another?
The only thing one can assume is that all this was done so that subfiles could be appended without touching the beginning of the main file. But note that, first, this is not an operation in great demand, since programs usually write documents from beginning to end in one pass. The only exception is MS Word, and then only in the notorious fast save mode, cursed by users. And second, even then it is still impossible to leave the beginning of the main file alone, because the directories, FATs, and headers have to be updated.
In short, Microsoft is true to form. Why do it simply when it can be done in a difficult and confusing way?
WordDocument
The CDF format, for all its monstrosity, is at least logical and not very complicated (compared with the rest of a Word document's contents). Its description takes a mere twenty pages or so, a trifle next to the 300 pages of the Word format.
It is hard to even call the document format a format; “geological record” suits it much better. Picture a rock cliff in which fifty million years of the planet's history are imprinted. Here is the Mesozoic layer, here is the Cenozoic, here is the imprint of a pterodactyl's wing, and on top lie the Tertiary deposits. A document looks much the same from the inside.
Just look at the header, which takes up almost a third of the file. There are actually three headers. First comes a small one, in which half the fields gape with “Reserved” or “Not Used” holes. Back in the Mesozoic something clearly lived there, but it was later thrown into the dustbin of history. It also holds the version of the program that wrote the file, on which the code apparently runs a huge switch/case.
Then comes the second header, made of sixteen-bit words. There is nothing useful in it at all. Its size is spelled out at the start, clearly so that the shells of protozoa can go on being deposited here in the future.
After that comes the third header, this time a modern one made of long (32-bit) words. Its length is not fixed; it too begins with a count of entries with an eye to further expansion, and it is basically a list of where to look for the various tables and pieces of the file, as start/size pairs. The tables themselves, by the way, are not here but in a separate CDF pseudo-file called 0Table or 1Table (variations are possible).
The first header contains the length of the text itself and where it begins. Evidently, back in the days of King Pea, that was how the text could be read: it lay in one big piece. The funny thing is that you can still read it that way now, but... not always! About one file in ten will come out readable but with unintelligible chunks in the middle, footnotes at the end that should not be there, and at the very beginning a large piece of text that was deleted last year. On top of that, half the file will come out as Chinese characters. It is regrettable to note that the well-known catdoc utility by Vitus Wagner produces exactly such results in some cases, from which one can conclude that it does not parse the format correctly.
Life is actually much more complicated. Once upon a time files really did contain only text, but over the years various “features” accumulated under pressure from users and marketing. Separate streams were allocated for them: for ordinary footnotes, for endnotes, for headers and footers, for some kind of textbox (yet another perversion: it looks like text, but is not text, and its purpose is not really clear).
The beginnings of these streams are given in special places in the header, but the very first header for some reason gives the total length, not of the text itself, but of the text plus all these perversions. This is the first reason why inscriptions like “Page 1” end up in the output of many utilities.
Somewhere in the Archean, fast save was added to the editor. The idea is that the file is not rewritten in full; additions and changes are simply appended to its end, which in theory should be faster. This was supposed to please users, but in fact they were unhappy. There is no noticeable difference in saving speed, but a lot of garbage builds up in the file, including pieces that have supposedly been erased. If the document ever held any secret information, you can easily find it just by looking at a dump of the file.
To support fast save, a special table was added that records the beginning of each piece and its address in the file. No length is stored, but it can be calculated by subtracting the beginning of the current piece from the beginning of the next one. Here too one must be careful, however, since pieces from all the streams are listed. Thank God they come in a definite order, so, knowing the total length of the text, it is easy to stop in time.
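The subtraction trick looks like this in practice. The sketch is a hedged illustration: it assumes the table has already been located and cut out of the table pseudo-file, and that it consists of n+1 four-byte character positions followed by n eight-byte descriptors whose address sits at byte offset 2, which is how I read the published specification. The name parse_piece_table is invented here.

```python
import struct


def parse_piece_table(table):
    """Return (first character, length in characters, raw address) for each piece."""
    # n + 1 positions of 4 bytes plus n descriptors of 8 bytes: size = 4 + 12 * n
    n = (len(table) - 4) // 12
    starts = struct.unpack_from("<%dI" % (n + 1), table, 0)
    pieces = []
    for i in range(n):
        # Each descriptor: 2 bytes of flags, 4 bytes of address, 2 bytes of properties.
        (address,) = struct.unpack_from("<I", table, 4 * (n + 1) + 8 * i + 2)
        # No length is stored, so take the distance to the next piece's start.
        pieces.append((starts[i], starts[i + 1] - starts[i], address))
    return pieces
```

To stop in time, a caller would walk the returned list and quit once the piece starts reach the length of the main text taken from the header.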
In theory, this complex format is in play only if the special fComplex flag is set in the header. But... this next “but” is exactly where many converters trip up.
Already in our time, documents gained the ability to be written in Unicode. With it came a problem (contrived, if you ask me): the files would become exactly twice as long. Since the software is developed by Americans, who deep down do not believe other alphabets exist at all and secretly think that strange letters occur only in dissertations on Ancient Greece, and even there only occasionally, the first thing that occurred to them was to separate clean ASCII characters from dirty Unicode. Write the former one byte per character, and the latter however it comes out.
The same idea gave rise, for example, to the elegant UTF-8 encoding, where two-byte characters are encoded with clever sequences in the spirit of Huffman coding. At Microsoft they did the same thing, only not so beautifully. Since we already have a piece table, let us also record in it which pieces of text are written in pure ASCII (actually cp1252) and which are in all sorts of unintelligible alphabets that require Unicode and therefore two bytes per character. As a result, present-day files should always be parsed through the piece table, regardless of any flags. Unicode fragments are taken as is; just bear in mind that the number of bytes to read is twice the number of characters. Single-byte fragments are marked in the address by a set high bit, the second from the left (why not the first?). To get the real address, you need to clear this bit and divide the address by two (!).
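In code, the address trick comes down to a few lines. This is a minimal sketch under assumptions: the whole WordDocument pseudo-file is already in memory, the marker bit is the second from the left of a 32-bit value (0x40000000), and single-byte text is treated as plain cp1252 with undecodable bytes replaced, which is an approximation. The name decode_piece is made up.

```python
SINGLE_BYTE_FLAG = 0x40000000  # second bit from the left in the 32-bit address
ADDRESS_MASK = 0x3FFFFFFF      # the remaining low bits hold the address itself


def decode_piece(stream, address, char_count):
    """Decode one piece of text given its raw address and length in characters."""
    if address & SINGLE_BYTE_FLAG:
        # One byte per character: clear the flag, then divide the address by two.
        offset = (address & ADDRESS_MASK) // 2
        raw = stream[offset:offset + char_count]
        return raw.decode("cp1252", errors="replace")
    # Unicode piece: the address is used as is, two bytes per character.
    offset = address & ADDRESS_MASK
    raw = stream[offset:offset + 2 * char_count]
    return raw.decode("utf-16-le", errors="replace")
```

Joining the decoded pieces in table order gives the text of the document, still with its service codes inside.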
If you consider that this piece table also takes up space, and that even more space in the file goes to the various binary trees and sector chain tables of the CDF format, the savings from not storing text in Unicode will not exactly stagger the imagination, even in dissertations on Ancient Greek. As for files in the great and mighty Russian language, there is nothing to even talk about. They should have put everything in UTF-16 and spared themselves the suffering. Or, if stinginess gnaws at them that badly, compressed the stream.
After all the heroic efforts needed to read it, the text itself, oddly enough, holds nothing complicated. It is plain text (adjusted for encoding) in which some codes below the space character play a service role. For example, 0x9 denotes a tab, as it should, 0xA the end of a page, 0x7 the end of a table cell, and so on. The only subtlety has to do with fields. The beginning of a field is marked by 0x13 and its end by 0x15, and the field's name and parameters are separated by the character 0x14 from what the user actually sees in the text. But... the second part can itself contain an embedded field, which many programs fail to take into account. As a result, stubs like INCLUDEPICTURE or PAGEREF * remain in the text.
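Handling embedded fields takes nothing more than a stack. The sketch below is one possible way to do it, not necessarily what any existing converter does: it keeps a character only when every enclosing field has already passed its 0x14 separator, so field names and parameters vanish while the visible results survive at any depth. The function name strip_fields is hypothetical.

```python
FIELD_BEGIN = "\x13"
FIELD_SEPARATOR = "\x14"
FIELD_END = "\x15"


def strip_fields(text):
    """Drop field names and parameters, keep the visible results, nested or not."""
    out = []
    stack = []  # one flag per open field: True once its separator has been seen
    for ch in text:
        if ch == FIELD_BEGIN:
            stack.append(False)
        elif ch == FIELD_SEPARATOR:
            if stack:
                stack[-1] = True
        elif ch == FIELD_END:
            if stack:
                stack.pop()
        elif all(stack):
            # Visible only when no enclosing field is still in its parameter part.
            out.append(ch)
    return "".join(out)
```

For example, strip_fields("See \x13 PAGEREF _Ref1 \x14page \x13 PAGE \x1412\x15\x15.") yields "See page 12.", with both the outer and the embedded field instructions removed.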
There is, however, one more minor dirty trick. Some characters may mean something else entirely, such as the current date. To tell whether a character is an ordinary one or not, you have to parse the character property tables, about which more below. I confess that I simply cut out all characters with codes below the space, which is not entirely correct, but cheap, fast, and practical.
Having ripped out the text, I did not go any deeper into the format. Sorting out all those tables with such promising names as CHP, PAPX, STSH, PLCF and so on is an exercise for the young and strong in spirit. Reproducing the formatting exactly the way Word itself does it is an exercise strictly for titans.
I will only summarize that everything is stored in special tables indexed by a character's position from the beginning of the stream. Styles live in long lists, and changes to styles live in special exception lists. Local style changes, for example those made when editing a paragraph or a character, are stored in the tables as special commands that modify the parent style. The commands themselves look very much like the commands of a virtual machine from a typical adventure game.
All that remains is to draw the moral, and it is a banal one: whatever one person has devised, another can always crack. That does not make the Word format any less shameful, ugly, and utterly unsuited to mass information exchange between heterogeneous systems.
I think Microsoft kept it closed for so many years not because it was afraid of competition, but simply because it was... embarrassed.
Additional literature:
Official format description:
http://msdn.microsoft.com/en-us/library/cc313153(office.12).aspx
J. Spolsky, “Why are Office formats so complex?”
http://www.joelonsoftware.com/items/2008/02/19.html