Text at any cost: PPT

Some time ago we discussed extracting plain text from various data formats, such as PDF and DOC. In one of those discussions, someone suggested that parsing PowerPoint presentations would earn me hemorrhoids or some other terrible ailment of the soft parts. Well, as fate would have it, I had to pull the text out of this "sweet" format. Frankly, I did not manage to earn hemorrhoids, but a class for parsing presentations did come out of it.

A little about the PPT format

Like DOC, PPT is built on top of WCBFF (Windows Compound Binary File Format), which you can read about in my earlier article on MS Word. I will only note that during testing (and I also came across a bunch of "broken" files that I still had to get text from) I found a few errors in my old WCBFF and DOC implementations, so if you use my code, I advise you to update your sources.

But we digress. Let's get back to PowerPoint. Unlike DOC, we will not work with the WordDocument and 1Table "files" as before, but with presentation-specific ones: Current User and PowerPoint Document. Both "files" must be present in the "file system" of a CBF file for it to be a presentation. Their presence tells us that we are looking at a presentation even when the file has the wrong extension.
So, to start getting data out of a PPT, first read the small Current User "file", which consists of a single CurrentUserAtom record. This record contains technical information about who last edited the file, but that is not the important part. It also holds the offset to the first UserEditAtom record, which is discussed below.
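A minimal Python sketch of this step. It assumes the Current User "file" has already been extracted from the CBF container as bytes. The field position (a DWORD named offsetToCurrentEdit at byte 16, after the 8-byte header, a DWORD size and a DWORD headerToken) is my reading of the [MS-PPT] layout, and the function name is hypothetical.

```python
import struct


def read_offset_to_current_edit(current_user: bytes) -> int:
    """Return the offset of the first UserEditAtom inside PowerPoint Document.

    Assumed layout of CurrentUserAtom: 8-byte record header, DWORD size,
    DWORD headerToken, then DWORD offsetToCurrentEdit at byte 16.
    """
    if len(current_user) < 20:
        raise ValueError("Current User stream is too short")
    (offset,) = struct.unpack_from("<I", current_user, 16)
    return offset
```

The returned value is an offset into PowerPoint Document, not into Current User.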

Now for how records are read in PPT. Every record in a presentation starts with a special header, rh, that carries technical information about the record. To get it, read the first 8 bytes of the record. The first WORD usually holds nothing we need, but the next 6 bytes do. The WORD at offset 2 (rh.recType) identifies the record type, which tells us what to do with the record next. The DWORD at offset 4 (rh.recLen) is the record length, not counting the eight-byte header. This layout is quite convenient and helps avoid many errors when parsing a presentation file.
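Reading the header might look roughly like this in Python. The split of the first WORD into a 4-bit recVer and a 12-bit recInstance follows my reading of [MS-PPT]; the class and function names are only illustrative.

```python
import struct
from typing import NamedTuple


class RecordHeader(NamedTuple):
    rec_ver: int       # low 4 bits of the first WORD
    rec_instance: int  # high 12 bits of the first WORD
    rec_type: int      # WORD at offset 2
    rec_len: int       # DWORD at offset 4, header not included


def read_record_header(data: bytes, offset: int = 0) -> RecordHeader:
    """Decode the 8-byte little-endian rh header found at `offset`."""
    ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
    return RecordHeader(ver_inst & 0x000F, ver_inst >> 4, rec_type, rec_len)
```

The body of the record then occupies the rec_len bytes starting at offset + 8.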

What next? Back to UserEditAtom. This record already lives in PowerPoint Document, and from here on we work only with that "file". By reading this record and the ones linked to it, we have to build a marvelous thing called PersistDirectory, an array of offsets that we will use to find the main structure of a PowerPoint document, DocumentContainer. To do that, we read the current UserEditAtom record and take two fields from it: offsetPersistDirectory, the offset to the current "live" version of PersistDirectory, and offsetLastEdit, the offset to the next UserEditAtom in the chain (the previous edit). We keep collecting offsets this way until the DWORD offsetLastEdit is zero.
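The chain walk can be sketched in Python as follows. The field positions (offsetLastEdit at byte 16 and offsetPersistDirectory at byte 20 of the record, counting the header) are my reading of [MS-PPT] and should be checked against the specification; the function name is hypothetical, and the guard against loops is my own addition for broken files.

```python
import struct
from typing import List


def collect_persist_directory_offsets(doc: bytes, first_edit_offset: int) -> List[int]:
    """Follow the UserEditAtom chain and return offsetPersistDirectory values,
    newest edit first.

    Assumed layout: offsetLastEdit at byte 16 and offsetPersistDirectory at
    byte 20, both little-endian DWORDs.
    """
    offsets = []
    seen = set()
    pos = first_edit_offset
    while pos not in seen:  # stop if a damaged file points back at itself
        seen.add(pos)
        offset_last_edit, offset_persist_dir = struct.unpack_from("<II", doc, pos + 16)
        offsets.append(offset_persist_dir)
        if offset_last_edit == 0:  # no older edit left
            break
        pos = offset_last_edit
    return offsets
```

Here doc is the whole PowerPoint Document "file" as bytes, and first_edit_offset is the value taken from Current User.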

Once all the offsetPersistDirectory offsets have been collected, we need to build the PersistDirectory itself. We walk the offsets in reverse order, so that newer edits overwrite older ones, and read a PersistDirectoryAtom record at each one. Each record contains an array of PersistDirectoryEntry entries. Each entry holds persistId, the number of its first object, and cPersist, the count of objects it describes. This is followed by an array of cPersist offsets to the objects themselves. The resulting PersistDirectory is the most important array of all: through it we find references to every object in the presentation.
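A hedged Python sketch of building the directory. It assumes, per my reading of [MS-PPT], that persistId occupies the low 20 bits and cPersist the high 12 bits of one DWORD at the start of each PersistDirectoryEntry. The function names are illustrative, and the directory is kept as a dict rather than an array for simplicity.

```python
import struct
from typing import Dict, List


def parse_persist_directory_atom(doc: bytes, atom_offset: int) -> Dict[int, int]:
    """Map persist ids to object offsets for one PersistDirectoryAtom."""
    _, _, rec_len = struct.unpack_from("<HHI", doc, atom_offset)
    pos = atom_offset + 8
    end = pos + rec_len
    entries = {}
    while pos + 4 <= end:
        (packed,) = struct.unpack_from("<I", doc, pos)
        persist_id = packed & 0xFFFFF  # assumed: low 20 bits
        c_persist = packed >> 20       # assumed: high 12 bits
        pos += 4
        if pos + 4 * c_persist > end:
            raise ValueError("PersistDirectoryEntry runs past the record")
        for i in range(c_persist):
            (obj_offset,) = struct.unpack_from("<I", doc, pos)
            entries[persist_id + i] = obj_offset
            pos += 4
    return entries


def build_persist_directory(doc: bytes, atom_offsets_newest_first: List[int]) -> Dict[int, int]:
    """Apply the atoms oldest first so that newer edits win."""
    directory = {}
    for atom_offset in reversed(atom_offsets_newest_first):
        directory.update(parse_persist_directory_atom(doc, atom_offset))
    return directory
```

Raising on an entry that overruns its record is a design choice for this sketch; real-world "broken" files may call for something more forgiving.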

Now let's go back to the last UserEditAtom we read and find its docPersistIdRef field. This is the PersistDirectory number of the most important object, DocumentContainer. We read it. It stores a whole heap of information about the current presentation: headers and footers, slide notes and, most importantly, the SlideListWithTextContainer record, which holds all sorts of information about the slides.
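Looking up the DocumentContainer could be sketched like this in Python. The position of docPersistIdRef (byte 24 of UserEditAtom) and the record type 0x03E8 for DocumentContainer are my reading of [MS-PPT]; the function name is hypothetical, and the directory is assumed to be a mapping from persist id to offset.

```python
import struct
from typing import Dict

RT_DOCUMENT = 0x03E8  # assumed recType of DocumentContainer


def locate_document_container(doc: bytes, edit_offset: int, directory: Dict[int, int]) -> int:
    """Return the offset of DocumentContainer inside PowerPoint Document.

    `edit_offset` points at the UserEditAtom whose docPersistIdRef we trust;
    the field is assumed to be a DWORD at byte 24 of that record.
    """
    (doc_persist_id_ref,) = struct.unpack_from("<I", doc, edit_offset + 24)
    offset = directory[doc_persist_id_ref]
    _, rec_type, _ = struct.unpack_from("<HHI", doc, offset)
    if rec_type != RT_DOCUMENT:
        raise ValueError("persist object is not a DocumentContainer")
    return offset
```

Checking the record type at the resolved offset is a cheap way to notice that PersistDirectory was built incorrectly.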

Only three record types stored in this main block interest us: TextCharsAtom, TextBytesAtom and SlidePersistAtom. The first two are easy: they hold Unicode text on a slide and plain ANSI text, respectively. It is a different story when, instead of text, we get a SlidePersistAtom reference to a slide. Following it, we have to read a Drawing object, which (sic!) is not a PPT object. Yes, in this case an MS Drawing object is embedded inside the slide, with a rather unpleasant structure of nested records.

When I first learned this, I was honestly upset: another 600 pages of documentation. But ODRAW, as it turned out, is built on the same rh headers with the same recType values as PPT. That made it possible to simplify the task and cheat a little by searching the Drawing object for the same TextCharsAtom and TextBytesAtom records by their recType.
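The "cheat" can be sketched in Python as a recursive scan over nested records. The record types 0x0FA0 (TextCharsAtom) and 0x0FA8 (TextBytesAtom), the rule that a record with recVer equal to 0xF is a container, and the choice of UTF-16LE and Latin-1 for decoding are my assumptions from [MS-PPT], not something taken from the class on GitHub.

```python
import struct
from typing import List, Optional

RT_TEXT_CHARS_ATOM = 0x0FA0  # assumed recType, UTF-16LE text
RT_TEXT_BYTES_ATOM = 0x0FA8  # assumed recType, one byte per character


def extract_text(data: bytes, start: int = 0, end: Optional[int] = None) -> List[str]:
    """Collect text from every text atom found in data[start:end]."""
    if end is None:
        end = len(data)
    texts = []
    pos = start
    while pos + 8 <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, pos)
        body = pos + 8
        body_end = min(body + rec_len, end)  # clamp lengths from broken files
        if rec_type == RT_TEXT_CHARS_ATOM:
            texts.append(data[body:body_end].decode("utf-16-le", errors="replace"))
        elif rec_type == RT_TEXT_BYTES_ATOM:
            texts.append(data[body:body_end].decode("latin-1"))
        elif ver_inst & 0x000F == 0x000F:
            # Assumed: recVer 0xF marks a container, so descend into it.
            texts.extend(extract_text(data, body, body_end))
        pos = body_end
    return texts
```

The same function works on a slide record, on the Drawing object inside it, or on the SlideListWithTextContainer, since all of them share the rh header.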

Implementation

You can get the commented code on GitHub. It is still a little raw, but I think I will track down all the pitfalls in the near future. The main errors pop up precisely because PersistDirectory is read not quite correctly (?). If anyone has clarifications, I will be glad to hear them.

Literature

Text at any cost

Source: https://habr.com/ru/post/76033/
