
So hard to find, so easy to lose, and impossible to format properly.

Our rules of life: start article titles with the letter "T" and hunt for text borrowings quickly, accurately and, most importantly, beautifully. For more than a year now we have been successfully finding translated borrowings and paraphrases with the help of neural networks. But sometimes you need to deliberately "shoot yourself in the foot" and, limping, take a different path: that is, not check a piece of text for paraphrase or plagiarism at all, but simply leave it alone. Paradoxical and painful, but necessary. Let's say it right away: we will not touch the bibliography. But how do you find it in the text? Why is it easy to say, yet much harder to do than it seems? All this in the latest installment of the corporate blog of Antiplagiat, the only blog where nobody loves strikethrough text.



Image source: Fandom.com


Why did it take so long to find that one?


First, a little theory. What is a document, and how should we actually handle it? In "The Archaeology of Knowledge", M. Foucault notes: "History now organizes the document, splits it up, arranges, redistributes levels, sets ranks, qualifies them according to importance, isolates elements, defines units, describes relationships." We are certainly no historians of ideas, but we know from our own experience that a document is a patchwork quilt of motley elements sewn together. Which elements these are and how they are interconnected depends on the specific document. If it is, for example, a student paper, it will most likely contain a title page, chapters of the main text, figures, tables, formulas, a list of references and appendices. A scientific article will most likely have an abstract, while the title page may be absent altogether. And a collection of articles or conference proceedings includes a whole variety of papers, each with its own structure. In a word, every element of a document is interesting and self-sufficient, and can tell a lot about which type the document itself belongs to.


Ideally, everyone, both we and the teachers, would like to have the full document structure and to process each element in the way the specific task requires. The first step to success is determining what an element is called. We decided to start not with stack_more_layers, but with the last (but not least) element, namely the "bibliography". This is the segment where text borrowings interest the user least of all. Therefore, we need to show in the report that we "caught" the bibliography and did not search for anything inside it.


Life is a play. It doesn't matter how long it lasts. What matters is having a bibliography at the end.


In a perfect world everything is beautiful, and so is the appearance of a document. The text of an ideal document is well structured and pleasant to read, and finding the bibliography by quickly dragging the scrollbar to the very end is not hard at all. As practice has shown, reality is quite different.


To begin with, many people use the following terms interchangeably for the "bibliography": "list of references", "references", "literature used", "list of sources" and more than a hundred (sic!) other titles. In principle, there are rules for formatting bibliographic references and records, which should make it possible to pull the list of references out of the text layer. What's more, there is even a GOST standard for formatting these records. Here, for example, is the correct formatting of a bibliographic record for a well-known book:
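(For illustration only, our own example rather than the one from the post: a GOST 7.1-style record looks roughly like this.)

    Пушкин, А. С. Евгений Онегин : роман в стихах / А. С. Пушкин. – М. : Художественная литература, 1986. – 255 с.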



True, it is worth bearing in mind that the GOST guidelines for formatting records run to almost 150 pages. For bibliographic references in non-print publications there is a separate GOST of more than 20 pages. A reasonable question arises: how many people will devote time to such entertaining reading just to format a few references properly? As practice shows, few. Of course, there are automatic typesetting systems (for example, LaTeX), but among students (and they are the majority of our "clients") they are not widely used. As a result, what we get as input is text that contains (or perhaps does not contain) a somehow-structured list of references.


Let us clarify one more point. We do not work directly with the uploaded files (pdf, docx, doc, etc.); we first bring them to a unified form, namely, we extract the text layer. This means that any formatting, such as font type or size, is stripped from the text. As a result, we have only "raw" text at our disposal, which, due to various extraction artifacts, often looks pretty bad.
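The post does not show the extraction step itself (the company has its own DocParser for that), but as a rough stand-in sketch, a text layer can be pulled out of a PDF with pdfminer.six and split into the lines the rest of the pipeline works with:

```python
from pdfminer.high_level import extract_text  # pip install pdfminer.six

# A rough stand-in for the DocParser step: one flat text layer, then lines.
# "document.pdf" is a placeholder path, not a file from the post.
text = extract_text("document.pdf")
lines = text.splitlines()
```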


Let us note right away that our algorithm must be fast and dead accurate. Selecting the bibliography block is only an auxiliary "feature" in the overall document-checking process, so it must not consume many resources. Hence, the algorithm should not be overcomplicated.


First, we define the quality metric by which we will evaluate the algorithm's performance. We treat the task as a classification task: each line of the text is assigned to one of two classes, bibliography or non-bibliography. In order not to complicate life with poorly interpretable quality indicators (there are enough of those already!), we will count the proportions of correctly and incorrectly classified lines. We assume that the incoming text layer is split into lines, and, moreover, for the classification to make sense at all, we need a single line never to mix bibliography with extraneous text. This is a rather strong assumption, but almost all texts that pass through our DocParser satisfy it. For two-class classification, which is what we have, the most popular quality metrics are Precision and Recall. What they look like is shown in the picture below:



Image source above: Wikipedia
Image source below: Series: For Dummies


The picture shows how often the algorithm classified a line correctly (or not), namely:
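In terms of our task: a true positive (TP) is a bibliography line selected as such; a false positive (FP) is an ordinary line selected by mistake; a false negative (FN) is a bibliography line the algorithm missed; and a true negative (TN) is an ordinary line correctly left untouched. Then Precision = TP / (TP + FP), the share of selected lines that really are bibliography, and Recall = TP / (TP + FN), the share of the bibliography that got selected.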



Another requirement is that our algorithm must be sufficiently precise (that is, have a sufficiently high Precision). This can be read as: "better to miss something we need than to select something extraneous".


All my dreams come true


What do you think takes the most time when solving a research task? Developing the algorithm? Implementing the solution in an existing system, or testing? However you answered: wrong!


Oddly enough, most of the time is spent collecting and preparing data. So it was in this case: to devise an algorithm and tune its parameters, you must have a sufficient number of marked-up documents, that is, documents for which you know exactly where the bibliographic records are. We could have involved outside assessors, but for small tasks like this one can usually get by "with little blood" and mark up the data oneself. In the end, we processed about 1000 documents between us. Of course, for training, say, neural networks this is not enough. But remember, the algorithm is supposed to be simple, which means not much data is needed to tune its parameters.


However, before developing an algorithm, you need to understand the specifics of the data. Having looked through about 1000 random documents, or rather their text layers, we can draw some conclusions about what distinguishes bibliographic text from ordinary text. One of the most important patterns: a bibliography almost always starts with a keyword heading. Besides the popular "list of references" or "sources used", there are also quite exotic ones, for example, "Textbooks, study guides, monographs".


Another equally important feature is the numbering of bibliographic records. It is worth stipulating again that all these "signs" of a reference list are very imprecise by nature, so it is far from always possible to find all the bibliographic records in a text using them.


However, even such imprecise signs are enough to build a simple algorithm for finding the bibliography in the text layer. Let us describe it a little more formally (a toy sketch in code follows the list):


  1. Look for "treble keys" in the text, i.e. the keyword headings of the bibliography;
  2. Try to find, below them, the numbering of bibliographic records;
  3. If the numbering is there, follow it through the text until it ends.
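A minimal Python sketch of this heuristic, assuming the text layer is already split into lines. The keyword list, regular expressions and gap tolerance below are our own illustrative guesses, not the production values:

```python
import re

# Illustrative heading keywords; the real system matches 100+ variants.
HEADING_RE = re.compile(
    r'^\s*(список литературы|список источников|библиография|'
    r'использованная литература|references|bibliography)\b',
    re.IGNORECASE)
# Numbered records: "1. ...", "2) ...", "[3] ..."
NUMBERED_RE = re.compile(r'^\s*\[?\d{1,3}[.)\]]\s+\S')

def find_numbered_bibliography(lines, max_gap=2):
    """Return the set of line indices covered by a numbered bibliography
    that starts after a heading keyword; empty set if none is found."""
    selected = set()
    for i, line in enumerate(lines):
        if not HEADING_RE.search(line):
            continue
        gap = 0
        for j in range(i + 1, len(lines)):
            if NUMBERED_RE.match(lines[j]):
                selected.add(j)
                gap = 0
            elif not lines[j].strip():
                continue              # blank lines do not break the numbering
            else:
                gap += 1
                if gap > max_gap:     # the numbering has ended
                    break
        if selected:
            break
    return selected
```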

This simple algorithm shows almost 100% precision but very low recall. In other words, it selects only bibliographic lines, yet so selectively that it finds just a small part of the bibliography. The trouble is that a bibliography may easily have no numbering at all, so we will use this algorithm only as an auxiliary one.


Let us now try to build another algorithm that finds the remaining kinds of bibliographic records in the text. To do this, we select features that distinguish ordinary text lines from bibliographic records. It is worth noting that bibliographic text is fairly structured, although each author shapes this structure in their own way. We identified the following distinguishing features of the desired lines (a rough feature-extraction sketch follows the list):


  1. Numbering at the beginning of the line, as already mentioned in the description of the first algorithm;
  2. Year numbers in the line. Not just any four-digit numbers (otherwise there would be too many coincidences), but specifically the years most often cited: from the 1900s to the present;
  3. Lists of names of authors, editors and other people involved in the publication, in various formats;
  4. Page numbers, volume numbers and similar details;
  5. Phrases indicating an issue number;
  6. A URL in the line;
  7. Professional vocabulary in the line. For the most part these are special abbreviations such as 'conf.', 'scientific-pract.' and the like.
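As an illustration, here is how such binary features might be extracted with regular expressions. The patterns are our own rough approximations; the post does not publish the real ones:

```python
import re

YEAR_RE    = re.compile(r'\b(19\d\d|20\d\d)\b')             # plausible citation years
AUTHORS_RE = re.compile(r'\b[А-ЯA-Z][а-яa-z]+\s+[А-ЯA-Z]\.\s?[А-ЯA-Z]?\.?')  # "Иванов И. И."
PAGES_RE   = re.compile(r'\b(с|pp?)\.\s*\d+', re.IGNORECASE)        # "С. 15", "pp. 37"
ISSUE_RE   = re.compile(r'(вып\.|№|no\.|vol\.|т\.)\s*\d+', re.IGNORECASE)
URL_RE     = re.compile(r'https?://|www\.')
VOCAB_RE   = re.compile(r'\b(конф|изд|сб|науч|практ|учеб)\.', re.IGNORECASE)
NUMBER_RE  = re.compile(r'^\s*\[?\d{1,3}[.)\]]\s')

def line_features(line):
    """Binary feature vector for one line of the text layer."""
    return [
        int(bool(NUMBER_RE.match(line))),    # 1. numbering at line start
        int(bool(YEAR_RE.search(line))),     # 2. a plausible publication year
        int(bool(AUTHORS_RE.search(line))),  # 3. author-like name patterns
        int(bool(PAGES_RE.search(line))),    # 4. page/volume references
        int(bool(ISSUE_RE.search(line))),    # 5. issue-number phrases
        int(bool(URL_RE.search(line))),      # 6. a URL
        int(bool(VOCAB_RE.search(line))),    # 7. professional abbreviations
    ]
```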

We encode these features as binary values and train one of the simplest yet quite effective classifiers on them: Random Forest. Random Forest is an ensemble classification method. It consists of a set (usually about 100) of simple decision trees, each of which makes its own decision about which class the object in question belongs to. The answer of the whole algorithm is formed very simply: the class voted for by the majority of the decision trees wins:
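A minimal training sketch with scikit-learn, reusing line_features from the snippet above. The toy data, tree count and precision-oriented decision threshold are illustrative assumptions, not the values from the post:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy markup: in reality the examples come from ~1000 hand-labelled documents.
lines = [
    "1. Иванов И. И. Основы анализа текста. М.: Наука, 2005. С. 12-20.",
    "Обычная строка основного текста без каких-либо признаков.",
]
labels = [1, 0]   # 1 = bibliography line, 0 = ordinary text

X = [line_features(line) for line in lines]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Bias the decision towards precision: flag a line only when a large
# majority of the trees vote "bibliography".
proba = clf.predict_proba(X)[:, 1]
mask = [int(p > 0.8) for p in proba]   # illustrative threshold
```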



Image source: www.researchgate.net


As mentioned above, we tune the algorithm's parameters so as to maximize its precision. Let's apply the algorithm to some document and look at the result:



In the picture above, the lines highlighted in red are those the algorithm considers bibliographic. As you can see, the algorithm copes with its task rather well: almost nothing extraneous is selected in the whole text, but the bibliography itself is picked out in "chunks". This is easy to explain: since the algorithm is tuned for high precision, it selects only those lines that are bibliographic with high probability. The unselected lines, in the algorithm's opinion, look like pieces of ordinary text.


Let's try to "comb" the result. We need to eliminate two problems: random single selections inside the main text, and the discontinuous selection of the bibliography itself. A quick and effective solution is a pair of "gluing" and "thinning" operations. The names speak for themselves: we delete isolated bibliographic lines and glue together neighbouring selections separated by a few unselected lines. Most likely, several gluing-thinning iterations will be needed, because in a single pass isolated non-bibliographic lines may get glued together instead of removed. We tune the parameters of the gluing and thinning operations (number of passes, gluing width, removal parameters) on a separate subsample (if you don't know what "overfitting" is, we recommend reading about it here).
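A hedged sketch of what gluing and thinning might look like on the binary per-line mask from the classifier. This is our own toy implementation; the real pass counts and widths were tuned on the held-out subsample:

```python
def glue(mask, width=2):
    """Fill runs of at most `width` zeros lying strictly between selected lines."""
    out = mask[:]
    i = 0
    while i < len(out):
        if out[i] == 0:
            j = i
            while j < len(out) and out[j] == 0:
                j += 1
            if 0 < i and j < len(out) and (j - i) <= width:
                out[i:j] = [1] * (j - i)   # a short gap between selections
            i = j
        else:
            i += 1
    return out

def thin(mask, min_run=2):
    """Drop selected runs shorter than `min_run` lines (random single hits)."""
    out = mask[:]
    i = 0
    while i < len(out):
        if out[i] == 1:
            j = i
            while j < len(out) and out[j] == 1:
                j += 1
            if (j - i) < min_run:
                out[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return out

# Several glue/thin passes, as described in the post; the pass count and
# widths here are placeholders for the tuned parameters.
for _ in range(3):
    mask = thin(glue(mask, width=2), min_run=2)
```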


What happened after our improvements? Looking through several more documents, we noticed bibliographies with the following "quirks":


Fortunately, we have the simple but effective algorithm that handles precisely such cases. And since that simple algorithm selects nothing except genuine bibliography lines, we can combine the results of the two algorithms without any loss of quality.
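The post does not show how the results are merged; a plausible reading is a simple union of the two per-line selections, sketched here with the helpers from the earlier snippets:

```python
# Union of the classifier's smoothed mask and the numbered-block heuristic.
bib_line_ids = {i for i, m in enumerate(mask) if m}
bib_line_ids |= find_numbered_bibliography(lines)
```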



Looks pretty good. Of course, since the algorithm is probabilistic, there is a chance it will find no bibliography in a text at all. We deliberately modified a reference list slightly so that the algorithm would not "notice" it:



But such a bibliography, from a purely subjective point of view, is not much different from ordinary text.


So what have we ended up with? We have implemented a module for selecting bibliographic records in uploaded documents. The module consists of two algorithms, each tailored to a specific kind of reference list. One algorithm selects a numbered bibliography block that follows a keyword heading. The second selects lines that are highly likely to be bibliographic and then performs several "gluing" and "thinning" passes. The module's output is the combination of the two algorithms' results.


It is also worth noting that the algorithm is quite fast, even on large documents. So it fully meets the requirements for an "auxiliary feature" in the checking process.


Conclusion


And so, out of matchsticks and acorns ...

As a result, we built a simple but effective procedure for selecting bibliographic records in user texts. And even if this is only a small part of the task of identifying document structure, it is nevertheless a big step towards improving the quality of the Antiplagiat service. By the way, the results of our work can already be seen in the system's user reports. See for yourself!



Source: https://habr.com/ru/post/449124/

