
Patience and hard work will extract any text

During exam sessions (May-June and December-January), users ask us to check up to 500 documents per minute for borrowings. The documents arrive as files in a variety of formats, and each format takes a different amount of effort to handle. To check a document for borrowings, we first need to extract its text from the file, dealing with the formatting along the way. The task is to extract half a thousand texts with formatting per minute, at high quality, crashing rarely (better yet, not at all), consuming few resources, and without paying half the galactic budget for developing and operating the final creation.


Yes, yes, we know, of course, that out of the three options (fast, cheap, high-quality) you are supposed to pick any two. But the worst part is that in our case there was nothing we could cross out. The question is how well we managed it...



Image source: Wikipedia


We are often told that the fate of real people depends on the quality of our work, so we have to nurture our inner perfectionist. Naturally, we keep improving the quality of the system (in every respect) as unscrupulous authors invent new ways to get around it. And I hope the day is near when the difficulty of cheating, on the one hand, and the satisfaction of work done honestly, on the other, will persuade the overwhelming majority of students to abandon their beloved urge to cheat. At the same time, we understand that the price of an error on our side may be the suffering of innocent people, should we ever make a mistake.


Why am I saying all this? If we were perfectionists, we would have approached this series of articles about the Antiplagiat system thoughtfully. We would have painstakingly drawn up a publication plan to present everything in the most logical order, the one the reader expects:



The attentive reader has probably noticed that we do not suffer from excessive perfectionism after all, so it is time to look at the first stage: extracting text and formatting from documents. That is what we will do today, musing along the way about the frailty of existence and the light at the end of the tunnel, about nothing being ideal and about striving for the ideal, about having a plan and sticking to it, and about the compromises that life keeps nudging us toward.


In the beginning was the Word


At first we extracted from documents only the bare minimum needed for a borrowing check: the text itself. The main formats were supported: docx, doc, txt, pdf, rtf, html. Later the less common ppt, pptx, odt, epub, fb2 and djvu were added, although support for most of them eventually had to be dropped. Each format was handled in its own way: some by a third-party library, some by our own parser. On average, text extraction took on the order of hundreds of milliseconds. It might seem that the main, almost the only, difficulty of text extraction is "parsing" the format itself, which is especially true of the binary pdf and doc formats (and the proprietary nature of the latter makes it even more troublesome). However, already at this stage, when all we wanted was the text, it became clear that every way of reading the formats we need brings a number of unpleasant quirks. The most significant of them are:




Source of the bottom image: Article


Upper Image Source: Hmm ...


Need more data!


While the text layer of a document used to be enough to analyze it for borrowings, a number of new features are impossible or very hard to implement without extracting additional data from the document. Today, besides the text layer, we also extract the document's formatting and render images of its pages. We use the page images for optical character recognition (OCR) and for detecting certain kinds of cheating tricks.
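Rendering pages with the DevExpress PDF Document API can look roughly like this minimal sketch (file names and the target resolution are illustrative):

```csharp
using System.Drawing;
using DevExpress.Pdf;

class PageRenderer
{
    // Render every page of a pdf to a bitmap for the OCR pipeline.
    static void RenderPages(string pdfPath)
    {
        using (var processor = new PdfDocumentProcessor())
        {
            processor.LoadDocument(pdfPath);

            // Page numbers in this API are 1-based.
            for (int page = 1; page <= processor.Document.Pages.Count; page++)
            {
                // The second argument is the length of the bitmap's largest
                // edge in pixels; 2000 px is an arbitrary value for the sketch.
                using (Bitmap image = processor.CreateBitmap(page, 2000))
                {
                    image.Save($"page_{page}.png");
                }
            }
        }
    }
}
```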


Document formatting here means the geometric placement of all words and characters on the pages, plus the font size of every character. This information allows us to:



To unify document processing and the set of extracted data, we convert documents of all supported formats to pdf. The data extraction procedure is thus performed in two stages:

1. convert the source document to pdf;
2. extract the text, formatting and page images from the resulting pdf.



Converting to pdf. Choosing a library


Since converting a document to pdf is not something you can just sit down and do, we decided not to reinvent the wheel and to survey the ready-made solutions, picking the one that suited us best. That was back in 2017.


Candidate selection criteria:

- support for all the formats we need;
- quality of the resulting conversion;
- speed and stability of processing;
- licensing terms and price;
- quality of technical support.



We analyzed the available solutions and shortlisted the 6 most suitable for our tasks:


| Library | Surface problems |
| --- | --- |
| MS Word Interop | Requires MS Word. Calls Microsoft Word via COM. The approach has many drawbacks: MS Word must be installed, instability, poor performance, licensing restrictions. |
| DevExpress (v17) | Does not support ppt, pptx. |
| GroupDocs | None found. |
| Syncfusion | Does not support ppt, pptx, odt. |
| Neevia Document Converter Pro | Requires MS Word. Does not support odt. |
| DynamicPdf | Requires MS Word and Internet Explorer. Does not support odt. |

MS Word Interop, Neevia Document Converter Pro and DynamicPdf require MS Office installed in production, which would tie us to Windows tightly and for a long time. So we dropped these options from further consideration.


So, three main candidates remain, and only one of them fully supports all the formats we need. Well, time to see what they can do.


To test the libraries, we assembled a sample of 120,000 real user documents, with a format distribution roughly matching what we see in production every day.


So, round one. Let's see what percentage of documents each library can successfully convert to pdf. "Successfully" here means: no exception is thrown, a 3-minute timeout is met, and non-empty text is returned.
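Expressed in code, the success check might look like the sketch below; convertToPdf and extractText are placeholders for the library under test, and in the real benchmark conversions were additionally isolated so that full native process crashes could be counted too:

```csharp
using System;
using System.Threading.Tasks;

enum Outcome { Success, BlankText, Exception, Timeout }

static class ConversionCheck
{
    // A conversion counts as "successful" if it throws nothing, finishes
    // within 3 minutes and the resulting pdf yields non-empty text.
    public static Outcome Run(string inputPath, string pdfPath,
        Action<string, string> convertToPdf, Func<string, string> extractText)
    {
        try
        {
            var conversion = Task.Run(() => convertToPdf(inputPath, pdfPath));
            if (!conversion.Wait(TimeSpan.FromMinutes(3)))
                return Outcome.Timeout; // the runaway task keeps running;
                                        // process isolation handles that case

            return string.IsNullOrWhiteSpace(extractText(pdfPath))
                ? Outcome.BlankText
                : Outcome.Success;
        }
        catch (Exception) // Task.Wait wraps failures in AggregateException
        {
            return Outcome.Exception;
        }
    }
}
```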


| Converter | Success (%) | Failed, total (%) | Blank text (%) | Exception (%) | Timeout, 3 min (%) | Process crash (%) |
| --- | --- | --- | --- | --- | --- | --- |
| GroupDocs | 99.012 | 0.988 | 0.039 | 0.873 | 0.076 | 0 |
| DevExpress | 99.819 | 0.181 | 0.123 | 0.019 | 0.039 | 0 |
| Syncfusion | 98.358 | 1.632 | 0.039 | 0.848 | 0.745 | 0.01 |

Syncfusion immediately stood out: not only did it successfully process the fewest documents, but on some of them it brought down the entire process (throwing things like OutOfMemoryException, or exceptions from native code that could not be caught without considerable contortions).


GroupDocs failed on roughly 5.5 times more documents than DevExpress (the details are in the table above). And that is despite a GroupDocs single-developer license costing about 9 times more than the DevExpress equivalent. Just saying.


The second serious test is conversion time, on the same 120 thousand documents:


| Converter | Mean (sec) | Median (sec) | Std (sec) |
| --- | --- | --- | --- |
| GroupDocs | 1.301966 | 0.328000 | 6.401197 |
| DevExpress | 0.523453 | 0.252000 | 1.781898 |
| Syncfusion | 8.922892 | 4.987000 | 12.929588 |



Note that DevExpress not only processes documents much faster on average but also shows a far more stable processing time.


But stability and processing speed mean nothing if the output is a bad pdf. What if DevExpress drops half the text? Let's check. Same 120 thousand documents; this time we compute the total amount of extracted text and the average share of vocabulary words (the more of the extracted words are found in a dictionary, the less garbage or incorrectly extracted text there is):
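The vocabulary-share metric itself is straightforward. Here is a sketch of one way to compute it, with a deliberately naive tokenizer and an externally supplied dictionary:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class VocabularyShare
{
    // Fraction of extracted word tokens that occur in a known dictionary.
    public static double Compute(string text, HashSet<string> dictionary)
    {
        var words = Regex.Matches(text, @"\p{L}+")   // runs of letters only
                         .Cast<Match>()
                         .Select(m => m.Value.ToLowerInvariant())
                         .ToList();
        if (words.Count == 0)
            return 0.0;

        return (double)words.Count(dictionary.Contains) / words.Count;
    }
}
```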


| Converter | Total amount of text (characters) | Average share of vocabulary words |
| --- | --- | --- |
| GroupDocs | 6,321,145,966 | 0.949172 |
| DevExpress | 6,135,668,416 | 0.950629 |
| Syncfusion | 5,995,008,572 | 0.938693 |

The suspicion turned out to be partly justified. As it happens, GroupDocs, unlike DevExpress, can handle footnotes: DevExpress simply drops them when converting a document to pdf. And by the way, yes, in all cases the text was extracted from the resulting pdfs with DevExpress.


So, we have studied the speed and stability of the libraries in question; now let's take a closer look at the quality of the pdf conversion itself. This time, rather than just measuring the volume of extracted text and the share of vocabulary words in it, we compare the texts extracted from the produced pdfs with the texts of pdfs obtained via MS Word, taking the MS Word conversion result as the reference. For this test we prepared about 4,500 pairs of the form "document, reference pdf".
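The exact formulas are not essential here; as an illustration, the sketch below takes length proximity to be the min/max ratio of the two text lengths and frequency proximity to be the cosine similarity of the word-count vectors. Both definitions are plausible stand-ins rather than our exact production metrics:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TextProximity
{
    // 1.0 when the two texts have equal length, approaching 0 as they diverge.
    public static double ByLength(string reference, string candidate) =>
        Math.Min(reference.Length, candidate.Length) /
        (double)Math.Max(Math.Max(reference.Length, candidate.Length), 1);

    // Cosine similarity between the word-frequency vectors of the two texts.
    public static double ByWordFrequency(IEnumerable<string> refWords,
                                         IEnumerable<string> candWords)
    {
        var refCounts = refWords.GroupBy(w => w)
                                .ToDictionary(g => g.Key, g => (double)g.Count());
        var candCounts = candWords.GroupBy(w => w)
                                  .ToDictionary(g => g.Key, g => (double)g.Count());

        double dot = refCounts.Sum(kv => kv.Value * candCounts.GetValueOrDefault(kv.Key));
        double normRef = Math.Sqrt(refCounts.Values.Sum(v => v * v));
        double normCand = Math.Sqrt(candCounts.Values.Sum(v => v * v));

        return normRef == 0 || normCand == 0 ? 0.0 : dot / (normRef * normCand);
    }
}
```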


| Converter | Text extracted (%) | Length proximity, mean | Length proximity, median | Length proximity, std | Frequency proximity, mean | Frequency proximity, median | Frequency proximity, std |
| --- | --- | --- | --- | --- | --- | --- |
| GroupDocs | 99.131 | 0.985472 | 0.999756 | 0.095304 | 0.979952 | 1.000000 | 0.102316 |
| DevExpress | 99.726 | 0.971326 | 0.996647 | 0.075951 | 0.965686 | 0.996101 | 0.082192 |
| Syncfusion | 89.336 | 0.880229 | 0.996845 | 0.306920 | 0.815760 | 0.998206 | 0.348621 |

For each pair "reference pdf, conversion result" we computed similarity by the length of the extracted text and by the frequencies of the extracted words. Naturally, these metrics exist only where the conversion succeeded, which is why we do not dwell on Syncfusion's numbers here. DevExpress and GroupDocs performed comparably: DevExpress has a noticeably higher share of successful conversions, while GroupDocs handles footnotes correctly.


Given these results, the choice was obvious. We still use the DevExpress solution and plan to upgrade to version 19 soon.
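For reference, the basic conversion via the DevExpress Office File API takes only a few lines. This is a minimal sketch with illustrative paths; word-processing formats go through RichEditDocumentServer (as the comparison table noted, ppt and pptx were not supported in version 17):

```csharp
using DevExpress.XtraRichEdit;

class Converter
{
    // Load a document in any supported word-processing format, save it as pdf.
    static void ConvertToPdf(string inputPath, string pdfPath)
    {
        using (var server = new RichEditDocumentServer())
        {
            server.LoadDocument(inputPath);
            server.ExportToPdf(pdfPath);
        }
    }
}
```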


We have a pdf, now let's extract the text with formatting


So, we can convert documents to pdf. Now the next task: use DevExpress to extract the text while knowing everything we need about every word, namely:

- the number of the page the word is on;
- the word's bounding rectangle on that page;
- the font size of its characters.
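One possible in-memory representation of this per-word data (an illustrative type of our own, not part of the DevExpress API):

```csharp
// Everything we want to know about a single word of the extracted text.
public readonly record struct WordBox(
    string Text,
    int PageNumber,   // 1-based page index
    double X,         // bounding rectangle on the page, in page units
    double Y,
    double Width,
    double Height,
    double FontSize); // font size of the word's characters, in points
```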



The image shows the breakdown of a text into pages and illustrates the correspondence between words of the text and areas of the page.



Image Source: Header Metadata Extraction from Scientific Documents


It would seem everything should be simple. Let's see what API DevExpress offers us:



Okay, everything seems to be there. But how do we obtain the necessary data for every word of the document text that DevExpress returns? Assembling the document text from the words ourselves is unappealing: we would not know, for instance, where two words are separated by a plain space and where by a line break, so we would have to invent heuristics based on word positions... And the text is right there, already assembled for us.



Image source: Eureka!


The obvious solution is to match the words against the document text. We check, and indeed, the words appear in the document text in the same order in which the iterator over the document's words returns them.


We quickly implement a simple algorithm for matching the words to the document text, add checks that everything matched up correctly, and launch it... Roughly, the first version looks like the sketch below.
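A simplified sketch of that matching pass (the real code also has to deal with hyphenation and other details):

```csharp
using System;
using System.Collections.Generic;

static class WordTextMatcher
{
    // Walk through the document text and locate each iterator word at or after
    // the current position, checking that everything skipped in between is
    // whitespace. Returns the start offset of each word, or null on mismatch.
    public static List<int> Match(string text, IReadOnlyList<string> words)
    {
        var offsets = new List<int>(words.Count);
        int pos = 0;

        foreach (var word in words)
        {
            int found = text.IndexOf(word, pos, StringComparison.Ordinal);
            if (found < 0)
                return null; // word order in the text differs from the iterator

            // Everything between two consecutive words must be separators.
            for (int i = pos; i < found; i++)
                if (!char.IsWhiteSpace(text[i]))
                    return null;

            offsets.Add(found);
            pos = found + word.Length;
        }
        return offsets;
    }
}
```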



And indeed, on the vast majority of pages everything works correctly. Unfortunately, not on all of them.



Upper Image Source: Are you sure?


On some documents we see that the words in the text do not come in the order in which the iterator over the document's words returns them. Moreover, the opening square bracket of the text shows up in the word list as a closing bracket, and inside a different "word". The correct rendering of this fragment can be seen by opening the document in MS Word. Even more interesting: if the document is not converted to pdf and the text is extracted directly from the doc file, we get a third version of the fragment, matching neither the correct order nor the two orders returned by the library. In this fragment, as in most others where the problem shows up, the cause is invisible RTL (right-to-left) control characters that change the order of adjacent characters and words.


Here it is worth recalling that the quality of technical support was named as an important criterion when choosing the library. As practice has shown, working with DevExpress is quite effective in this respect: the problem with the document we submitted was fixed promptly after we filed a ticket, as were a number of other issues involving exceptions, high memory consumption and slow document processing.


Still, as long as DevExpress provides no direct way to get the text together with the data we need for every word, we keep on comparing the sometimes incomparable. When an exact match between the words and the text cannot be built, we fall back on a series of heuristics that tolerate small permutations of words (one such heuristic is sketched below). If nothing helps, the document is left without formatting. It happens, though rarely.
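As an illustration, here is a minimal permutation-tolerant pass of our own: it additionally accepts two adjacent words appearing swapped in the text, which is exactly what local RTL reorderings tend to produce. The real heuristics are broader; this only shows the idea:

```csharp
using System;
using System.Collections.Generic;

static class LenientMatcher
{
    // Like the strict pass, but tolerates a swap of two adjacent words.
    public static bool TryMatch(string text, IReadOnlyList<string> words)
    {
        int pos = 0;
        for (int i = 0; i < words.Count; i++)
        {
            if (TryConsume(text, words[i], ref pos))
                continue;

            // Allowed permutation: the two adjacent words appear swapped.
            if (i + 1 < words.Count
                && TryConsume(text, words[i + 1], ref pos)
                && TryConsume(text, words[i], ref pos))
            {
                i++; // both words consumed
                continue;
            }
            return false; // give up: the document stays without formatting
        }
        return true;
    }

    // Skip separators, then require the word to start exactly here.
    static bool TryConsume(string text, string word, ref int pos)
    {
        int p = pos;
        while (p < text.Length && char.IsWhiteSpace(text[p])) p++;
        if (string.CompareOrdinal(text, p, word, 0, word.Length) == 0)
        {
            pos = p + word.Length;
            return true;
        }
        return false;
    }
}
```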


See you soon :)



Source: https://habr.com/ru/post/458842/

