Getting text out of .doc files: is there a better way?

I was handed a small task: processing assorted files in which careless users send information about themselves. Once I started collecting statistics, a terrible picture emerged: everyone sends whatever they like, in whatever format they like. Everything from plain text files (thank the gods, there are sane people in the world) to PowerPoint or Flash presentations (I did not believe such people existed until I saw it myself). Not being a fool, I naturally decided to bring all this variety into a single format that lends itself to both human and machine processing. Without much hesitation, I chose good old HTML.
Presentations and pictures were dropped from the algorithm almost immediately: there is little point in building anything elaborate for them, since fortunately these marvels of creativity do not turn up that often. Processing them by hand is far less of a problem than it would be for the main flow of files. Text, HTML and similar files could be left untouched, given the choice of target format. The other common formats, of course, took some tinkering.

A brief search turned up the wv package in the repositories: a set of utilities for converting .doc files (the manual claims support for Word 2000, Word 97, Word 95 and Word 6, plus limited support for earlier formats) to HTML, RTF and LaTeX. Judging by my experiments, the conversion quality is not the best: the formatting comes out somewhat mangled. Fortunately, preserving the formatting was not part of my task, so I started using these utilities with a clear conscience. They extracted the text itself completely and almost without errors (more on that "almost" below).
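
As a rough illustration of this conversion step, here is a minimal Python sketch. It assumes the wv package is installed and provides the wvHtml utility, and that wvHtml accepts a --targetdir option followed by the input file and the output file name; check the man page of your wv version before relying on this. The helper names build_wv_command and doc_to_html are made up for the example and are not how the original scripts were necessarily written.

```python
import shutil
import subprocess
from pathlib import Path


def build_wv_command(doc_path, html_path):
    """Build the wvHtml command line (hypothetical helper for this sketch)."""
    html_path = Path(html_path)
    # Assumption: wvHtml writes the output file into --targetdir,
    # so the directory and the file name are passed separately.
    return [
        "wvHtml",
        f"--targetdir={html_path.parent}",
        str(doc_path),
        html_path.name,
    ]


def doc_to_html(doc_path, html_path):
    """Convert one .doc file to HTML by calling wvHtml; return the output path."""
    if shutil.which("wvHtml") is None:
        raise RuntimeError("wvHtml not found; is the wv package installed?")
    Path(html_path).parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the converter exits with an error
    subprocess.run(build_wv_command(doc_path, html_path), check=True)
    return Path(html_path)
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command line without having wv installed.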

For several other common formats I used the corresponding utilities (unrtf, unzip, unrar, etc.). I ran into no difficulties with them, so there is nothing to tell.
Back to the .doc files. As I mentioned, the conversion works quite well, except for one problem that cost me a lot of nerves, so I will describe it in more detail. After conversion to HTML, other scripts are run on the text: they split it into words, search for code phrases and do other useful work, all with regular expressions. Everything worked fine on other files, but text obtained from .doc was stubbornly processed with errors: two words would be counted as one, or a code phrase would go unrecognized. The underlying error is trivial, but to find it I had to bang my head against the wall a few times. It comes down to the fact that the cp1251 encoding, in which I processed these files, contains a character that is hardly ever typed in everyday life: the non-breaking space, code 160 (A0 in hexadecimal). This is what the regular expressions tripped over, treating it not as whitespace but as a perfectly ordinary printable character. Fortunately, I figured out the cause before the thought of drowning myself in the office coffee maker occurred to me.
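
The effect is easy to reproduce. The sketch below is in Python and is only an illustration, not the original scripts: it assumes the text is handled as raw cp1251 bytes, where a byte-level \s matches ASCII whitespace only. The function names normalize_spaces and split_words are invented for the example.

```python
import re

NBSP_BYTE = b"\xa0"  # non-breaking space in cp1251 (decimal 160)


def normalize_spaces(raw):
    """Replace every cp1251 non-breaking space with an ordinary space."""
    # Safe at byte level: cp1251 is single-byte, so 0xA0 is always a whole character.
    return raw.replace(NBSP_BYTE, b" ")


def split_words(raw):
    """Split raw bytes into words on runs of whitespace."""
    # For bytes patterns, \s covers only ASCII whitespace, so 0xA0 is not a separator.
    return [word for word in re.split(rb"\s+", raw) if word]


# Two words joined by a non-breaking space, as Word tends to produce
sample = "кодовая\u00a0фраза".encode("cp1251")

words_before = split_words(sample)                   # one "word"
words_after = split_words(normalize_spaces(sample))  # two words
```

Normalizing the bytes before any regular expression sees them keeps the rest of the processing unchanged. Decoding to Unicode strings first is another option, since Python's str patterns do treat U+00A0 as whitespace.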

That concludes the story of my adventures. If the topic is of interest, I will write about my further torments.

P.S. Most importantly: being confident in my own incompetence, I would like to ask readers whether there are more humane ways of getting text out of Word files that I do not know about. I would also be grateful to hear about ways of converting other formats that may contain text but cannot be decoded offhand. Flash, for example. Surely there are ways.

Source: https://habr.com/ru/post/18910/
