Parsing a Word document to pictures or a story about pre-diploma weekdays

I want to present to your attention a method of extracting data from a word document in the form of pictures. Perhaps the ideas presented will be for some primitive and obvious. But I had to spend a couple of sleepless nights before reaching a normal solution. So, I start.

It was the beginning of 2015. Winter. I rejoiced at the good weather and delightedly thought that I would finally finish the university (I'm lying, now I'm enrolling in magistracy). I recently completed my diploma, so I was even more happy. However, soon, by nature human, the state of serenity smoothly began to be replaced by boredom. And then, as if on purpose, silence was replaced by a phone call.

“Hello, hello, how are you?” A familiar voice sounded.

Intuition immediately determined that the conversation would most likely be on the topic of a “typrogrammer”. So it was. With a sad look, at first I listened about hard times familiar and everything else, but her final request made me interested.
')
“You could not help me with a diploma? In general, it is necessary to make a website-simulator in mathematics, ”she said.

It was interesting. I'm just fascinated by the development of complex frontends. I immediately approved the request.

On the implementation of the simulator took a total of no more than 2 days. Site simulator allows you to take tests in higher mathematics and view theory. Tests can be performed in two modes: in the training mode with the answers highlighted and in the testing mode with the result displayed at the end. The implementation was done on ReactJS and Bootstrap and the process itself was quite enjoyable. But that was only the beginning. It was necessary to fill in the test questions database with data that were not at all in the form of ready-made ordered data.

Formulation of the problem

While I was satisfied with the result of work, a friend called again. She notified me that she had sent me a small archivist with questions in * .doc files to the mail. "If anything, I can help throw questions and answers into the database," she added.

This slightly spoiled my mood, because I did not think that I would have to fill in the base of test questions myself.

Okay. I went to open my GMail and here:

And in each file there are ~ 50 test tasks in the form:

Hmmm, it’s unlikely to work out manually in the database. By the way, each test task in the database was stored as one picture-question, five picture-answers, difficulty level (A, B, C) and the correct answer number (1-5). Frustrated, I decided to postpone this issue for a long time so far, the benefit of the time was enough. But after a few days, another message from a friend fell on the mail, and then another ... As a result, there were 4 sections on higher mathematics, each of the sections consisted of 14-23 subsections, each subsection contained about 30-100 test tasks. And here I was finally convinced that it would definitely not work out to manually drive all this into the database.

By the way about the database. This is a MySQL with three tables: sections, subsections and test questions. Pictures of the quest question and the five answers are stored directly in the database, in the BLOB column. It seems to me more comfortable because there are a lot of these pictures, besides, they weigh a little. And they will all be stored in one place, along with other data.

So what was necessary? In the best case, it was necessary to get ready-made corresponding records in the database from the folder with all word test files, which practically was obtained in the final result. We are also interested in the main thing: directly extracting the images themselves.

Input : word file with test items.
Output data (for example) : a folder with PNG images, where the task has a name like 1.png, and the answers are named 1.1.png, 1.2.png, 1.3.png, 1.4.png, 1.5.png, plus the answers file. txt, inside which the i-th line contains a number from 1 to 5, corresponding to the correct answer of the i-th task.

Implementation

I love Qt Creator. And why should I love him? Most likely because in the university we are trained precisely for it, I do not know. And while using it, I feel some kind of quiet delight. Well, in general, you understand what I began to write a parser program.

At first, I wondered how to interact with this Word at all, in order to somehow pull out the data from there. All sorts of horrible thoughts came to mind, like converting a word file to HTML with subsequent processing. But google immediately sent me on an adequate path, giving information about the language of VBA. I was immediately struck by the abundance of ready-made functions, I learned what paragraphs, positions, etc., are, at the same time, of the complexity of the structure of the word document tree.

However, I was disappointed because I did not understand how to turn a piece of text into a picture. At first I wanted to use something like text2png, after having pulled out the necessary piece of text. But what about the formulas and pictures? There was no built-in function in VBA. At one point, I accidentally flashed a little thought that, like I used to insert into the word document cells from excel in the form of pictures. So it was! This was called a "special insert" and allowed to insert any part of the document in the form of a picture. Suppose we have entered into the clipboard a certain piece of the document, which must be saved as an image. But how to save this picture to disk? Googling also helped find a solution. The code section below saves the contents of the clipboard to disk as a universal EMF vector file.

#include <windows.h> void clipboardDataToEmfFile(QString fileName){ OpenClipboard(0); GetEnhMetaFileBits((HENHMETAFILE)GetClipboardData(14),0,0); HENHMETAFILE returnValue = CopyEnhMetaFileA((HENHMETAFILE)GetClipboardData(14), QDir::toNativeSeparators(fileName).toStdString().c_str()); EmptyClipboard(); CloseClipboard(); DeleteEnhMetaFile(returnValue); }

Fine. However, what kind of beast is this, this EMF? It was necessary to turn it into PNG. I started looking for image converters. After going through a bunch, and did not find adequate. And here again (does anyone believe in intuition?) In my head I began to recall some sophisticated image viewer, which I put in my school years from the disc with the “Golden Software” for fun. But it seems it was not a converter. However, it was necessary to make sure. In my head there was some sort of whether “Ifran”, or “Irfan”, in general, the program was found. Free, with the batch image processing feature, supports command line! And most importantly, it supports EMF. That was what was needed. The IrfanView executable file with the necessary DLL and ini-file of parameters lies in the same folder with the compiled program (I hope this does not violate the license) and is used through the function like this.

 void convertEmfsToPng(QString inFolder, QString outFolder){ QProcess proc; QString exeStr = "\"" + QDir::toNativeSeparators(QDir::currentPath()+"/i_view32.exe") + "\""; QString inFilesStr = "\"" + QDir::toNativeSeparators(inFolder + "*.emf") + "\""; QString outFilesStr = "\"" + QDir::toNativeSeparators(outFolder + "*.png") + "\""; QString iniFolderStr = "\"" + QDir::toNativeSeparators(QDir::currentPath()) + "\""; proc.start(exeStr + " " + inFilesStr + " /advancedbatch /ini=" + iniFolderStr + " /convert=" + outFilesStr); proc.waitForFinished(30*60*1000); }

Now it remains to copy the necessary pieces from the word document to the buffer. To do this, you need to come up with an algorithm for splitting the source text into separate blocks with the task, the answers, the correct answer number and the level of complexity.

The first implementation attempt was the following. We take the original document, replace the text of the form ([1-5]) \) with \ n $ 1 \), i.e. before the beginning of each answer we add a line feed. Replacement lines are written differently on VBA, I don’t remember. Now in the document settings, we set the page width to the maximum, and reduce the font for the entire document. The result is that in the document each task will occupy exactly 8 lines, and:

line 8 * i is the text with the correct answer number and level of complexity
line 8 * i + 1 is the task
line 8 * i + 2 is the answer number 1
...
line 8 * i + 6 is the answer number 5
line 8 * i + 7 - empty

Repeat how tasks look

Now, after this processing, there is nothing left but to walk through the array of a collection of document lines, start counter i, and depending on i% 8 save task / response pictures or retrieve the number of the correct answer with a level of complexity.

But it did not fit. Blame long tasks that are written in one line, look terrible, small and not always fit. In addition, sometimes the replacement of the text "1)" affects other places than the answer numbers. Saddened by the result, I again began to think what could be done in this case. And then I remembered the state machines. I remembered about the state, I remembered about the character input. Remembered the parser. Perhaps this was another decision, but I, as a person far from complex algorithms, was extremely happy with my idea.

Now it is the turn to write and try the parser code based on the finite state machine. We have 7 states:

reading space between jobs, starts with a blank line
Reading the line with the number of the task, in which the number of the correct answer and the level of complexity, begins with "Number"
reading the text of the task, begins with the "Task"
reading the text of the answer number 1, begins with "Answers: 1)."
reading the text of the answer number 2, begins with "2)."
...
reading the text of the answer number 5, begins with "5)."

We implement using the conditions of the beginning of the next state. After testing the first version of the parser, everything went great. Pictures were obtained as in the very word of the document, pretty, large. But here ... from time to time shoals appeared, for example, in one picture an extra piece was captured until the next block of the task. So the parser did not recognize correctly. What is the matter? Everything turned out to be simple - the tasks in the word document were typed by hand and therefore there was a human factor, for example:

instead of "Task" was written "Task"
instead of "Answers" was written "Answer"
instead of "1)." was written "1)." or "1)" or even "1)."

It was a nightmare. Fortunately, the main mistakes were in the writing of numbers of tasks; at least they were taken into account by the parser. The remaining errors after extracting the images were found by a cursory scan of the largest and smallest in size images, followed by correction and re-extraction.

The final piece of the parser is the code below. He is terrible, please do not judge strictly. To store VBA objects, QAxObject is used.

Explanations of variable names, state of the automaton, additional functions used

status - state of the machine:
- -3 - between tasks
- -2 - inside task number
- -1 - inside the job
- 0 - after the word Answers
- 1 - inside answer 1
- 2 - inside answer 2
- 3 - inside answer 3
- 4 - inside answer 4
- 5 - inside answer 5
startind - position of the beginning of the current block (task, response, line with the number of the correct answer and the level of complexity)
n is the ordinal number of the task
nstr - the string of the serial number of the job with leading zeros to three digits
str - the current block string to the current position
lineStart, lineEnd - position numbers of the beginning and end of the current paragraph
lines - the object of the collection of paragraphs of the document
tline - the current paragraph object
line - Range object of the current paragraph
ipar - current paragraph number
tmpObj - the range object of the current character
currChar - current character
outdir - the path of the output folder of the images
getAnswerLine (QString) function - returns a string of two numbers: difficulty level (1-3) and correct answer number (1-5), for example, 24 is a task with difficulty level B and correct answer number 4
the rangeToEmfFile function (QString fname, int start, int end, QAxObject * activeDoc) - saves a piece of the document between the start and end positions of the activeDoc document as an EMF file with the name fname

Awful, long code

 QAxObject *activeDoc = wordApp->querySubObject("ActiveDocument"); int status = -3; int startind = 0; int n=0; QString nstr; QString str = ""; int lineStart, lineEnd; QAxObject *lines = activeDoc->querySubObject("Paragraphs"); if (onlyAsnwers) for (int ipar = 1; ipar <= lines->property("Count").toInt(); ipar++){ QAxObject *tline = lines->querySubObject("Item(QVariant)", ipar); QAxObject *line = tline->querySubObject("Range"); QString str = line->property("Text").toString(); line->clear(); delete line; tline->clear(); delete tline; int ind = str.indexOf(":"); if (ind != -1){ str = str.mid(ind+6); answersTxt << getAnswerLine(str); } } else for (int ipar = 1; ipar <= lines->property("Count").toInt(); ipar++){ QAxObject *tline = lines->querySubObject("Item(QVariant)", ipar); QAxObject *line = tline->querySubObject("Range"); lineStart = line->property("Start").toInt(); lineEnd = line->property("End").toInt(); line->clear(); delete line; tline->clear(); delete tline; str = ""; for (int j=lineStart; j<lineEnd; j++){ QAxObject *tmpObj = activeDoc->querySubObject("Range(QVariant,QVariant)", j, j+1); QString currChar = tmpObj->property("Text").toString(); tmpObj->clear(); delete tmpObj; str += currChar; switch (status){ case -3: if (j>=4 && str.right(5) == ""){ status = -2; startind = j+1; } break; case -2: if (str.right(6) == ""){ n++; nstr = QString::number(n); while (nstr.length() < 3) nstr = "0" + nstr; status = -1; QAxObject *tmpObj = activeDoc->querySubObject("Range(QVariant,QVariant)", startind, j-6); QString tmp = tmpObj->property("Text").toString(); tmpObj->clear(); delete tmpObj; answersTxt << getAnswerLine(tmp); startind = j+2; } else if (str.right(7) == ""){ n++; nstr = QString::number(n); while (nstr.length() < 3) nstr = "0" + nstr; status = -1; QAxObject *tmpObj = activeDoc->querySubObject("Range(QVariant,QVariant)", startind, j-7); QString tmp = tmpObj->property("Text").toString(); tmpObj->clear(); delete tmpObj; answersTxt << getAnswerLine(tmp); startind = j+2; } break; case -1: if (str.right(7) == ":"){ status = 0; rangeToEmfFile(outdir+nstr+".emf", startind, j-7, activeDoc); startind = j+1; } else if (str.right(6) == ":"){ status = 0; rangeToEmfFile(outdir+nstr+".emf", startind, j-6, activeDoc); startind = j+1; } break; case 0: if (str.right(2) == "1)" || str.right(3) == "1 )"){ status = 1; startind = j+2; } break; case 1: if (str.right(2) == "2)"){ rangeToEmfFile(outdir+nstr+".1.emf", startind, j-2, activeDoc); status = 2; startind = j+2; } else if (str.right(3) == "2 )"){ rangeToEmfFile(outdir+nstr+".1.emf", startind, j-3, activeDoc); status = 2; startind = j+2; } break; case 2: if (str.right(2) == "3)"){ rangeToEmfFile(outdir+nstr+".2.emf", startind, j-2, activeDoc); status = 3; startind = j+2; } else if (str.right(3) == "3 )"){ rangeToEmfFile(outdir+nstr+".2.emf", startind, j-3, activeDoc); status = 3; startind = j+2; } break; case 3: if (str.right(2) == "4)"){ rangeToEmfFile(outdir+nstr+".3.emf", startind, j-2, activeDoc); status = 4; startind = j+2; } else if (str.right(3) == "4 )"){ rangeToEmfFile(outdir+nstr+".3.emf", startind, j-3, activeDoc); status = 4; startind = j+2; } break; case 4: if (str.right(2) == "5)"){ rangeToEmfFile(outdir+nstr+".4.emf", startind, j-2, activeDoc); status = 5; startind = j+2; } else if (str.right(3) == "5 )"){ rangeToEmfFile(outdir+nstr+".4.emf", startind, j-3, activeDoc); status = 5; startind = j+2; } break; case 5: if (j>=4 && str.right(5) == ""){ rangeToEmfFile(outdir+nstr+".5.emf", startind, j-5, activeDoc); status = -2; str = ""; } else if (lineEnd-lineStart < 2){ rangeToEmfFile(outdir+nstr+".5.emf", startind, j, activeDoc); status = -3; } break; } } if (status == 5) rangeToEmfFile(outdir+nstr+".5.emf", startind, lineEnd, activeDoc); } lines->clear(); delete lines; activeDoc->clear(); delete activeDoc;

The logic of the above code is slightly different from the one described above. It also uses paragraph breaks. But this does not greatly change the main idea.

This is how it came to “conquer” this Word!

Conclusion

As a result, all tasks in the amount of ~ 4 thousand were retrieved. The necessary parser shell was written. The program for downloading tasks to the remote database and its administration has also been written. The fee was received, her diploma is protected perfectly, mine is also protected perfectly.

Thank you for your attention, I hope this post will help someone with a similar problem. Or maybe someone knows the best implementation?

Update:

A couple of pictures of the result

Source: https://habr.com/ru/post/264443/

All Articles

Parsing a Word document to pictures or a story about pre-diploma weekdays

Formulation of the problem

Implementation

Conclusion

More articles: