Text at any cost: Miette

Yes, you are not mistaken, and this is not déjà vu. If you are a regular here, you have probably seen this topic before. A lot of time has passed since then, and I still receive letters with questions and requests for advice about reading text from binary data formats. That means the topic is still relevant and interesting to the programming community.

Over this year (in fact, it has been more than a year) I changed jobs, started doing completely different things, and have hardly programmed in PHP. The new project obliged me to improve my Python (and feel its power), so one Sunday evening I decided to rewrite and, most importantly, improve some of my text-reading libraries. Today I present to the public a young open-source project, Miette (“tasty” if translated from French), which is meant, at some point in the future, to read Microsoft Office files.

Miette's main task is to read plain text from office formats, but this time I would like to go further and attempt the impossible: make the parser read formatting as well (at least minimally). The task is difficult but quite feasible, given free evenings and some interest from those who need it (and perhaps some help in the form of testing and joint development). For now, though, these are just plans and, so to speak, a hobby.
Naturally, Python differs from PHP in many ways and, in my opinion, offers somewhat more functionality, so the libraries in this project are built on a somewhat different principle than the old PHP hack. This time I decided to forbid myself, as developer and customer in one person, from loading any large blocks into memory. Miette reads data gradually, on demand, as Word itself does. This keeps it lightweight and undemanding of RAM. Later I will run an initial pass with a profiler and look for bottlenecks that should be optimized further.

Shall we begin?

I advise you to review the old article and source code for cfb and doc in PHP before reading further.

Project structure

The project consists (and will continue to consist) of directories, each containing a reader for a particular file type. For now there is a reader for the Compound File Binary File Format (CFB), which wraps the data of most office files, and one for DOC (Microsoft Word). Support for XLS and PPT is planned next.
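To illustrate what "a wrapper over the data" means in practice, here is a minimal sketch that recognizes a CFB container and reads its sector size while touching only the first 32 bytes. The function name read_cfb_sector_size is hypothetical and is not part of Miette; the signature bytes and the sector shift field at offset 0x1E follow the published CFB header layout.

```python
import io
import struct

# First 8 bytes of every Compound File Binary container.
CFB_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def read_cfb_sector_size(stream):
    """Return the sector size declared in a CFB header (hypothetical helper).

    Only the first 32 bytes of the stream are read.
    """
    stream.seek(0)
    header = stream.read(32)
    if len(header) < 32 or header[:8] != CFB_SIGNATURE:
        raise ValueError("not a Compound File Binary stream")
    # The sector shift is a little-endian uint16 at offset 0x1E;
    # the sector size is 2 ** shift.
    (sector_shift,) = struct.unpack_from("<H", header, 0x1E)
    return 1 << sector_shift


# Synthetic header for demonstration: signature, 22 filler bytes, shift = 9.
fake_header = CFB_SIGNATURE + b"\x00" * 22 + struct.pack("<H", 9)
sector_size = read_cfb_sector_size(io.BytesIO(fake_header + b"\x00" * 480))
print(sector_size)
```

With a real file you would pass open('parus.doc', 'rb') instead of the synthetic buffer.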

The cfb package contains two main objects, Reader and DirectoryEntry, on which the rest of the readers are built. Reader provides an interface to the directory entries that make up a CFB storage: with it you can access the required entry by name or by number. The root entry (“Root Entry”) is exposed as an attribute, which, as can be seen in the DirectoryEntry class, considerably tidies up and standardizes work with the mini FAT.

DirectoryEntry implements a minimal file interface: read([size]), seek(offset, [whence]) and tell(). This again simplifies working with entries and is generally in the spirit of Python. You can still read a whole entry by calling read() without an argument, but when you need only a few bytes, nothing extra is read. In addition, the left and right siblings and the child entry are available through the corresponding attributes, which makes walking the CFB tree convenient and unobtrusive.
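The idea of a file-like entry that reads only what is asked for can be sketched roughly as follows. This is not Miette's DirectoryEntry: the class name SectorChainStream and its constructor arguments are assumptions made for illustration. It presents a chain of fixed-size sectors, possibly stored out of order, as one continuous stream.

```python
import io


class SectorChainStream:
    """File-like view over a chain of sectors (hypothetical sketch)."""

    def __init__(self, source, sector_offsets, sector_size, size):
        self.source = source                  # underlying seekable binary file
        self.sector_offsets = sector_offsets  # absolute offsets, in chain order
        self.sector_size = sector_size
        self.size = size                      # logical length of the entry
        self.position = 0

    def tell(self):
        return self.position

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self.position
        elif whence == io.SEEK_END:
            offset += self.size
        # Clamp to the bounds of the entry.
        self.position = max(0, min(offset, self.size))
        return self.position

    def read(self, size=None):
        remaining = self.size - self.position
        if size is None or size < 0 or size > remaining:
            size = remaining
        chunks = []
        while size > 0:
            # Locate the sector holding the current position and read
            # no further than the end of that sector.
            index, inner = divmod(self.position, self.sector_size)
            step = min(size, self.sector_size - inner)
            self.source.seek(self.sector_offsets[index] + inner)
            chunks.append(self.source.read(step))
            self.position += step
            size -= step
        return b"".join(chunks)


# Three 4-byte sectors stored out of order; the logical content is 10 bytes.
raw = io.BytesIO(b"EFGH" + b"ABCD" + b"IJ\x00\x00")
stream = SectorChainStream(raw, [4, 0, 8], sector_size=4, size=10)
stream.seek(2)
first = stream.read(3)   # crosses a sector boundary
where = stream.tell()
rest = stream.read()     # everything up to the end of the entry
```

The underlying file is touched only inside read(), and only for the bytes requested, which is what keeps memory use low.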

DocTextReader shows how to work with CFB. Unlike the PHP implementation, it tries to read as little data into RAM as possible, constantly moving through the doc file. It is helped by the additional DirectoryEntry methods get_byte, get_short and get_long, which read the corresponding number of bytes at a given position. The main entries, 0Table/1Table and WordDocument, are forwarded as class attributes.
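Helpers of this kind might look roughly like the sketch below. These are free functions over any seekable binary stream, not Miette's actual methods; the argument order and the choice of unsigned little-endian values of 1, 2 and 4 bytes are assumptions.

```python
import io
import struct


def get_byte(stream, offset):
    """Read an unsigned 1-byte integer at the given offset."""
    stream.seek(offset)
    return struct.unpack("<B", stream.read(1))[0]


def get_short(stream, offset):
    """Read an unsigned little-endian 2-byte integer at the given offset."""
    stream.seek(offset)
    return struct.unpack("<H", stream.read(2))[0]


def get_long(stream, offset):
    """Read an unsigned little-endian 4-byte integer at the given offset."""
    stream.seek(offset)
    return struct.unpack("<I", stream.read(4))[0]


data = io.BytesIO(b"\x00\x01\x34\x12\x78\x56\x34\x12")
byte_value = get_byte(data, 1)
short_value = get_short(data, 2)
long_value = get_long(data, 4)
```

Each call moves the stream position, so code that mixes these helpers with sequential read() calls has to seek back explicitly.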

This implementation is experimental; in the future DocTextReader will get standardized methods for reading a given number of bytes from a chosen position, and possibly some other functions of the file class.

Usage example

And finally, an example of using the library.

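from doc.text import DocTextReader

doc = DocTextReader('parus.doc')
root_entry = doc.root_entry
word_document = doc.get_entry_by_name('WordDocument')
# The table stream is reached here by walking the directory tree from the root.
one_table = root_entry.child.left_sibling.left_sibling

# fc_clx is read from offset 0x01a2 of the WordDocument stream
# and used below as an offset into the table stream.
fc_clx = word_document.get_long(0x01a2)

one_table.seek(fc_clx)
print(one_table.read(1))
print(one_table.tell())  # fc_clx + 1

# Read the plain text of the whole document.
print(doc.read())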
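The hard-coded path root_entry.child.left_sibling.left_sibling depends on how the tree happens to be laid out in a particular file. As a rough sketch of a more general walk, the code below visits every entry reachable through the child and sibling attributes and looks one up by name. The Entry class is a stand-in for illustration, the right_sibling attribute name is an assumption, and the tree shape is invented.

```python
class Entry:
    """Stand-in for a directory entry (hypothetical, for illustration)."""

    def __init__(self, name, left_sibling=None, right_sibling=None, child=None):
        self.name = name
        self.left_sibling = left_sibling
        self.right_sibling = right_sibling
        self.child = child


def iter_entries(entry):
    """Yield the entry and everything reachable via siblings and children."""
    stack = [entry]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        yield node
        stack.extend([node.child, node.right_sibling, node.left_sibling])


def find_entry(root, name):
    """Return the first entry with the given name, or None if absent."""
    for entry in iter_entries(root):
        if entry.name == name:
            return entry
    return None


# Invented tree shape, loosely modelled on the usage example.
root = Entry(
    "Root Entry",
    child=Entry(
        "WordDocument",
        left_sibling=Entry("Data", left_sibling=Entry("1Table")),
        right_sibling=Entry("CompObj"),
    ),
)
table = find_entry(root, "1Table")
names = sorted(e.name for e in iter_entries(root))
```

This visits every node, so it does not rely on the ordering rules of the CFB directory tree.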

P.S. I hope Miette and I will not disappoint you. Stay tuned for updates on GitHub :)

Source: https://habr.com/ru/post/109124/
