
Text at any cost: WCBFF and DOC

A bit later than we planned, but we continue our series on extracting text from different data formats. We have already seen how to work with the natively XML-based formats (docx and odt), how to read text from PDF, and how to convert the contents of RTF to plain text. Now we turn to the tasty and sweet part: the DOC format.



Before the attentive reader asks about the strange abbreviation in the title, I ask you to first take a look at the contents of a typical doc file:





(screenshot: the first bytes of a .doc file opened in a plain-text editor)

I think many of us, at the dawn of our computer literacy, tried to open doc files in Notepad and saw similar gibberish. But let us ask ourselves: what can we get out of this mess of bytes, which is nothing other than the same old "Sail"? The most interesting thing for us here is the first eight bytes, which turn up in file after file: "D0 CF 11 E0 A1 B1 1A E1" in hex, or the garbled "Ў±" you can spot in Notepad.
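
As a trivial illustration, these eight bytes give us a quick way to check whether a file is a compound (CFB) file at all. A minimal sketch (the helper and file names are made up):

    <?php
    // Every CFB container (and hence every old-style .doc) starts with
    // the magic bytes D0 CF 11 E0 A1 B1 1A E1.
    function is_cfb_file($path) {
        $magic = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
        return file_get_contents($path, false, null, 0, 8) === $magic;
    }

    var_dump(is_cfb_file("sail.doc")); // true for a real .doc file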



Now it is worth decoding the second abbreviation in the title. WCBFF is nothing more than the Windows Compound Binary File Format (its clumsy official Russian translation we will leave on the corporation's conscience). Let's think instead about how this format with the terrible name will help us.



So, CFB is the progenitor or, more accurately, the skeleton of all Microsoft Office formats from version 97 through 2007 (when saving in compatibility mode). CFB is used not only to store Word text, but also to save Excel sheets and PowerPoint presentations. As a result, we will first have to read the skeleton that is encoded in CFB, and only then find the text in the data we have read, taking the DOC format into account.



CFB, or a file system in miniature



The first stage, as I said, is reading the CFB. A CFB file is a file system in miniature, with sectors, a root directory and something resembling files. It even has the same problems as a regular file system, such as sector fragmentation. So, without knowing the structure of the format, this file will not be easy to read. Thankfully, a couple of years ago Microsoft opened up the documentation on both CFB and all the formats built on top of it.



Let's try to understand how information is packed into CFB files. The entire file is divided into sectors of 512 bytes each (in the new, fourth version of the format, a sector can be 4096 bytes). The first sector contains the file header, a piece of which we saw in the screenshot above. The header holds all the information about how, what and in what order to read from the file.
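
For illustration, here is a minimal sketch of pulling the sector sizes out of such a header. The function name is hypothetical; the field offsets are those of the CFB specification, which stores power-of-two shifts rather than the sizes themselves:

    <?php
    // $header is the raw 512-byte header read from the start of the file.
    function read_sector_sizes($header) {
        $h = unpack('vsectorShift/vminiShift', substr($header, 30, 4));
        return array(
            'sectorSize'     => 1 << $h['sectorShift'], // 512 in v3, 4096 in v4
            'miniSectorSize' => 1 << $h['miniShift'],   // normally 64
        );
    }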



The data in the file is stored in those very 512-byte sectors (FAT sectors). If the data does not fit into one sector, it continues in the next sector of the chain, and the sectors of a chain may be scattered across the file (i.e. the file may be fragmented, as noted above). To keep the chain together, there is a special table, the FAT, which records for every sector which sector to move to next if not all the data has been read yet. The end of a chain is marked with the special value ENDOFCHAIN = 0xFFFFFFFE.
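
To make the idea of sector chains concrete, here is a minimal sketch of assembling a stream by walking the FAT. The helper is hypothetical; $data is assumed to be the whole file and $fat an already parsed array of "next sector" values:

    <?php
    define('ENDOFCHAIN', 0xFFFFFFFE);

    // Starting from $startSector, keep jumping to $fat[$sector] until the
    // ENDOFCHAIN marker is hit, gluing the sectors together. Sector 0
    // begins right after the 512-byte header, hence the "+ 1".
    function read_chain($data, array $fat, $startSector, $sectorSize = 512) {
        $stream = '';
        for ($sector = $startSector; $sector != ENDOFCHAIN; $sector = $fat[$sector]) {
            $stream .= substr($data, ($sector + 1) * $sectorSize, $sectorSize);
        }
        return $stream;
    }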



Because 512 bytes can be far too much for some data, there are also "miniature" sectors, organized by the mini FAT. A mini FAT sector is 64 bytes long, so 8 such small sectors fit into one 512-byte FAT sector (or 64 into a 4096-byte one). The choice between FAT and mini FAT is made based on the total length of the data in question: if it is less than 4096 bytes (the cutoff is one of the file header parameters), you should use the mini FAT, otherwise the FAT. A sketch of that decision follows below.
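
That decision could look roughly like this (a sketch, assuming $header holds the raw 512-byte header; the cutoff is the DWORD at offset 0x38):

    <?php
    // Streams shorter than the mini stream cutoff (normally 4096 bytes)
    // live in 64-byte mini FAT sectors, everything else in FAT sectors.
    function uses_mini_fat($header, $streamSize) {
        $cutoff = unpack('V', substr($header, 0x38, 4));
        return $streamSize < $cutoff[1];
    }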



The data in a CFB file is not just piled up in a heap: it is organized into a tree structure rooted at the special entry Root Entry. Each such entry is 128 bytes long (4 entries fit into a 512-byte FAT sector, 32 into a 4096-byte one) and is characterized by its name, its type (storage, stream, root storage, or unused), its child and sibling elements, and its color in a red-black tree. In addition, for streams and for the root entry there are parameters such as the offset and the length of their content.
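
As an illustration of these 128-byte entries, here is a rough sketch that unpacks the fields listed above from one raw entry (hypothetical helper; the offsets follow the CFB specification, and the name is stored in UTF-16LE with a length that includes the terminating null):

    <?php
    // $entry is one raw 128-byte directory entry.
    function parse_dir_entry($entry) {
        $fixed = unpack('vnameLen/Ctype/Ccolor/VleftSib/VrightSib/Vchild',
                        substr($entry, 64, 16));
        $tail  = unpack('VstartSector/Vsize', substr($entry, 116, 8));

        return array(
            // nameLen counts bytes including the terminating null
            'name'         => substr($entry, 0, max($fixed['nameLen'] - 2, 0)),
            'type'         => $fixed['type'],   // 1 storage, 2 stream, 5 root, 0 unused
            'color'        => $fixed['color'],  // 0 red, 1 black
            'leftSibling'  => $fixed['leftSib'],
            'rightSibling' => $fixed['rightSib'],
            'child'        => $fixed['child'],
            'startSector'  => $tail['startSector'],
            'size'         => $tail['size'],    // low 32 bits of the stream length
        );
    }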



Thus, each entry in this file system has some "content attached to it". For streams it is the data stored in them; for the root entry it is the mini stream, which holds the mini FAT sectors.



In addition, the file has a structure called the DIFAT, which stores references to the sectors that make up the FAT itself. The first 109 DIFAT entries are located at the end of the file header and are enough to "serve" files up to roughly 7 MB; if that is not enough, the header contains a link to an additional DIFAT sector, which in turn may end with a link to the next DIFAT sector, and so on.
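
A small sketch of reading those 109 header entries (again a hypothetical helper; the header's DIFAT array starts at offset 0x4C, and 0xFFFFFFFF marks an unused slot):

    <?php
    // Each DIFAT entry is the number of a sector that holds a piece of the FAT.
    function read_header_difat($header) {
        $difat = array();
        foreach (unpack('V109', substr($header, 0x4C, 109 * 4)) as $sector) {
            if ($sector != 0xFFFFFFFF) {
                $difat[] = $sector;
            }
        }
        return $difat;
    }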



This briefly describes all the commotion going on inside CFB files. The format is, in principle, fairly well documented (links, as usual, at the end of the article); it is enough to read the manuals carefully and thoughtfully. A complete explanation of how CFB files work was not the goal of this article, so let's move on to the main thing: how to read the DOC out of all this...



DOC, or they stole my offsets



To begin with, I will admit that I got the DOC parsing (together with CFB) working only on the third attempt; before that, something somewhere would always fail to read. The reason is that everything had to be done strictly by the documentation, but... while CFB causes no big problems (apart from the manual being in English), with DOC problems are guaranteed.



Let's start from the point where we have read the file system of our DOC and are eager to find the text data in it. Well, Microsoft, having opened the specification, gave us a gift and made that possible. To do so we only need two entries from the tree of CFB elements: a stream called "WordDocument" and a stream called "0Table" or "1Table", depending on the situation.



The WordDocument stream contains the text of the document, but you cannot simply take it: it is all rather grim binary data, and on top of that the text is stored in Unicode with the little-endian byte order (as everything in CFB files is, it's worth noting). So, to begin with, we read several fields from the FIB, the File Information Block, which lies at the start of the WordDocument stream and keeps growing from version to version (in Word 97 this header occupied about 700 bytes; by 2007 it was already over 2000).



First of all, we read the 16-bit word at offset 0x000A and look at its 0x0200 bit: if it is set, we will be dealing with the 1Table stream, and if it is clear, with 0Table. It is worth noting that I have come across files with each of the two tables, so this bit really does have to be checked.



Next we need to find the CLX, the most painful important part of whichever table stream was selected above. This "CompLeX" structure stores the offsets and lengths of the text sequences inside the WordDocument stream. The offset and length of the CLX itself sit in the DWORDs at 0x01A2 and 0x01A6 of the WordDocument stream. Having obtained them, we read the CLX from the table stream and run straight into a snag...



The thing is that the CLX contains two completely different variable-size data structures: the RgPrc, which we do not need, and the PlcPcd, which is the important one. The length of the RgPrc can be zero or anything else, and unfortunately the documentation does not say how to separate the first structure from the second, so in the final code I had to write a bit of a crutch which, oddly enough, works.



Having obtained the PlcPcd or, to use its more common name, the Piece Table, we can split it into two arrays: cp, the character positions from which the lengths of the text pieces are derived (lcb[i] = cp[i+1] - cp[i]), and pcd, the piece descriptors. Each descriptor contains the offset of its piece in the WordDocument stream and the fCompressed flag, which says whether the piece is stored "compressed" as ANSI (Windows-1252) or as Unicode.



The pieces you get may contain various control characters, marking, for example, an embedded object or an image. My code removes some of them; parsing the remaining special characters is left to the reader.
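
For example, a possible cleanup pass over the resulting text might look like the snippet below. This is only a sketch; the character meanings in the comment are the usual ones for the DOC format, and you may well want to handle them differently:

    // Replace or drop a few of Word's service characters:
    // 0x0D - paragraph mark, 0x0B - hard line break, 0x07 - table cell/row mark,
    // 0x01/0x08 - embedded or drawn object placeholders, 0x13-0x15 - field marks.
    $text = strtr($text, array(
        "\x0D" => "\n",
        "\x0B" => "\n",
        "\x07" => "\n",
        "\x01" => "",
        "\x08" => "",
        "\x13" => "",
        "\x14" => "",
        "\x15" => "",
    ));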



A code variant



Well, as usual, here at the end is a piece of code along with links to the sources:

    class doc extends cfb {
        public function parse() {
            // First read the CFB "file system" itself
            parent::parse();

            // The WordDocument stream holds the FIB and the raw text
            $wdStreamID = $this->getStreamIdByName("WordDocument");
            if ($wdStreamID === false) { return false; }
            $wdStream = $this->getStreamById($wdStreamID);

            // Flags word of the FIB: which table stream to use
            $bytes = $this->getShort(0x000A, $wdStream);
            $fComplex = ($bytes & 0x0004) == 0x0004;
            $fWhichTblStm = ($bytes & 0x0200) == 0x0200;

            // Offset and length of the CLX inside the table stream
            $fcClx = $this->getLong(0x01A2, $wdStream);
            $lcbClx = $this->getLong(0x01A6, $wdStream);

            // Character counts of the individual document parts
            $ccpText = $this->getLong(0x004C, $wdStream);    // main text
            $ccpFtn = $this->getLong(0x0050, $wdStream);     // footnotes
            $ccpHdd = $this->getLong(0x0054, $wdStream);     // headers
            $ccpMcr = $this->getLong(0x0058, $wdStream);     // macros
            $ccpAtn = $this->getLong(0x005C, $wdStream);     // annotations
            $ccpEdn = $this->getLong(0x0060, $wdStream);     // endnotes
            $ccpTxbx = $this->getLong(0x0064, $wdStream);    // text boxes
            $ccpHdrTxbx = $this->getLong(0x0068, $wdStream); // header text boxes

            // The last character position of the whole document
            $lastCP = $ccpFtn + $ccpHdd + $ccpMcr + $ccpAtn + $ccpEdn + $ccpTxbx + $ccpHdrTxbx;
            $lastCP += ($lastCP != 0) + $ccpText;

            // "0Table" or "1Table", depending on the fWhichTblStm bit
            $tStreamID = $this->getStreamIdByName(intval($fWhichTblStm) . "Table");
            if ($tStreamID === false) { return false; }
            $tStream = $this->getStreamById($tStreamID);

            $clx = substr($tStream, $fcClx, $lcbClx);

            // Find the piece table (PlcPcd) inside the CLX: look for the
            // 0x02 marker whose declared length matches the remaining data
            $lcbPieceTable = 0;
            $pieceTable = "";
            $from = 0;
            while (($i = strpos($clx, chr(0x02), $from)) !== false) {
                $lcbPieceTable = $this->getLong($i + 1, $clx);
                $pieceTable = substr($clx, $i + 5);
                if (strlen($pieceTable) != $lcbPieceTable) {
                    $from = $i + 1;
                    continue;
                }
                break;
            }

            // Character positions: read DWORDs until lastCP is reached
            $cp = array(); $i = 0;
            while (($cp[] = $this->getLong($i, $pieceTable)) != $lastCP)
                $i += 4;

            // The rest of the piece table is the 8-byte piece descriptors
            $pcd = str_split(substr($pieceTable, $i + 4), 8);

            $text = "";
            for ($i = 0; $i < count($pcd); $i++) {
                $fcValue = $this->getLong(2, $pcd[$i]);
                // bit 0x40000000 set means the piece is "compressed", i.e. ANSI
                $isANSI = ($fcValue & 0x40000000) == 0x40000000;
                $fc = $fcValue & 0x3FFFFFFF;
                $lcb = $cp[$i + 1] - $cp[$i];
                if (!$isANSI)
                    $lcb *= 2;   // Unicode: two bytes per character
                else
                    $fc /= 2;    // ANSI: the stored offset is doubled
                $part = substr($wdStream, $fc, $lcb);
                if (!$isANSI)
                    $part = $this->unicode_to_utf8($part);
                $text .= $part;
            }
            return $text;
        }
    }
You can get the code with comments on GitHub.
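
One note on the listing: the unicode_to_utf8() method it calls is not shown there. A minimal stand-in, assuming only that the pieces are UTF-16LE (in the class above it would be a method rather than a free function), could be:

    <?php
    // Convert a UTF-16LE chunk (as stored in the WordDocument stream) to UTF-8.
    function unicode_to_utf8($s) {
        if (function_exists('iconv')) {
            return iconv('UTF-16LE', 'UTF-8//IGNORE', $s);
        }
        return mb_convert_encoding($s, 'UTF-8', 'UTF-16LE');
    }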



Literature



How not to do it:

Links to other articles on the topic "Text at any cost":

Source: https://habr.com/ru/post/72745/


