
Modern aspects of text representation in natural language analysis: classical and alternative approaches

Introduction


In computer science, natural language processing is becoming more popular every year. Given the huge number of tasks that require such analysis, it is difficult to overestimate the need for automatic processing of text documents.

In this article we will try to describe, as simply as possible, the most popular modern approaches to representing text documents for computer processing. We will dwell in more detail on one of them, which is not yet widespread but has every chance of becoming so, because we use this method at SlickJump when developing algorithms, for example, for contextual advertising targeting.

Note that the approaches presented are applicable not only to texts but, in general, to any objects that can be represented as symbol sequences, for example, certain macromolecules (DNA, RNA, proteins) in genetics. In total, we consider four methods:

  1. Feature description.
  2. Pairwise overlay (alignment) of texts.
  3. Forming a profile and a hidden Markov model.
  4. Representation by fragments.

So let's get started.

Feature Description


The feature description is the approach that can be called classical in natural language analysis. A certain number of features are extracted from the analyzed text documents: for example, words, or pairs of consecutive words (the so-called bigrams). Features can be filtered, for example, by frequency of occurrence in the document set, or by removing those of no interest, such as prepositions and conjunctions ("stop words"). Each document is then characterized by the number of occurrences of these features in it (or, for example, by their presence/absence), which is conveniently represented as a vector where each element corresponds to a specific feature. The order of the features must, of course, be fixed in advance.

An important task here is to recognize the same feature in different word forms: for example, it is convenient to treat the different case forms of the name "Katya" as one and the same feature. For this, either stemming is used, which keeps only the stems of words (most often the roots, discarding prefixes, suffixes and endings), or lemmatization, which reduces every word to its "normal form": nouns in Russian, for instance, are conveniently reduced to the nominative singular. The second process is almost always more difficult: it often requires a database of word forms (without one, the word "people", for example, cannot be mapped to its normal form "person"), whereas for stemming a list of transformation rules is sufficient for many languages.

For example, suppose we are given a text taken from N. Gumilyov's beautiful poem "The Giraffe":
Listen: far, far away, on Lake Chad
An exquisite giraffe wanders.
Extracting features from it (we will use only individual words) and lemmatizing them, we get the following set:

(listen, far, lake, chad, exquisite, wander, giraffe)

The preposition "on" is not taken as a stop word. If we now present the first line of the text given to us in a vector form based on the above list, we get the following vector:

(1, 2, 1, 1, 0, 0, 0)

The resulting vectors can be combined into a matrix, various distances (of which there are a great many) can be measured between them, and frequency weighting techniques such as TF-IDF can be applied to them (in the latter case, however, the number of features and documents should be large enough for it to work adequately).
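As an illustration, below is a minimal Python sketch of such a count-based vectorization for the example above; the feature order and the lemmatized tokens are taken from the text, while the function itself is a toy implementation rather than any particular library's API.

```python
from collections import Counter

def vectorize(tokens, feature_list):
    """Count how many times each feature occurs in a list of tokens."""
    counts = Counter(tokens)
    return [counts[f] for f in feature_list]

# Features extracted and lemmatized from the two verse lines, in a fixed order.
features = ["listen", "far", "lake", "chad", "exquisite", "wander", "giraffe"]

# Tokens of the first line after lemmatization and stop-word removal,
# consistent with the vector given in the text.
first_line = ["listen", "far", "far", "lake", "chad"]

print(vectorize(first_line, features))  # -> [1, 2, 1, 1, 0, 0, 0]
```

For real corpora one would normally not count by hand but use an existing vectorizer (for example, CountVectorizer or TfidfVectorizer from scikit-learn), which also takes care of TF-IDF weighting.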

Aligning sequences and forming profiles


In some tasks, an approach based on pairwise superposition (alignment) of symbol sequences is used. The essence of the method is to maximize the number of positions at which matching characters of the two sequences line up, while the sequences themselves may be "broken" by gaps. This approach is actively used in bioinformatics, where the symbol sequences are macromolecules such as DNA, usually ones where some sequences are the results of changes in others (historical modifications, for example); it is often assumed that all the compared sequences have a common ancestor. Several proximity measures have been developed for this approach; they produce distance matrices with which further work is carried out.
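The article gives no code for alignment, but as a rough sketch of what pairwise alignment optimizes, here is a toy Needleman-Wunsch scoring routine in Python; the unit match/mismatch/gap weights are arbitrary assumptions, and real bioinformatics tools use far more elaborate scoring schemes.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score of two strings."""
    n, m = len(a), len(b)
    # dp[i][j] = best score of aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

print(alignment_score("JUMP", "JMP"))  # -> 2: three matches, one gap
```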

In addition, there is a method that involves forming a profile and a hidden Markov model for text documents. This approach is again based on superimposing sequences, but not pairwise: many sequences are aligned at once, and the positions in which all the sequences participate are singled out. The hidden Markov model is a development of this approach: it assumes that the sequences are generated by transitions between its states, with symbols emitted with certain probabilities at each transition. This method is not widely used for texts, so we will limit ourselves to this brief description.
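A profile here can be thought of as a table of per-position symbol frequencies over a multiple alignment. The sketch below builds such a table for already-aligned toy strings; the strings and the gap symbol '-' are illustrative assumptions, not data from the article.

```python
from collections import Counter

def build_profile(aligned):
    """Per-position character frequencies for a set of aligned strings.

    `aligned` is a list of equal-length strings, with '-' marking gaps.
    """
    length = len(aligned[0])
    profile = []
    for pos in range(length):
        column = [s[pos] for s in aligned]
        counts = Counter(c for c in column if c != "-")
        profile.append({ch: n / len(column) for ch, n in counts.items()})
    return profile

aligned = ["JUMP-", "JUMPS", "J-MPS"]
for pos, freqs in enumerate(build_profile(aligned)):
    print(pos, freqs)  # e.g. position 0 -> {'J': 1.0}
```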

Representation by fragments


And finally, the last approach, on which we will focus, is the fragment-based description of texts. This approach uses so-called annotated suffix trees. An annotated suffix tree (AST) is a data structure used to compute and store all fragments of a text together with their frequencies. It is defined as a rooted tree (that is, one vertex in it is designated as the root) in which each node corresponds to one character and is annotated with the frequency of the text fragment encoded by the path from the root to that node.
When an AST is used, the whole text is split into strings, i.e. sequences of characters. As a rule, one string is formed from 2-4 consecutive words. An example of an AST for a short string is shown in the image below.



A suffix of length $k$ of a string $x = x_1 x_2 \ldots x_N$ (where $k \le N$) is the substring $x_{N-k+1} \ldots x_N$ consisting of its last $k$ characters; for instance, the suffix of length 2 of the string "JUMP" is the substring "MP". We now give an informal description of the algorithm for constructing an AST for a given string $x$.

  1. Initialize an empty structure T for storing the AST. This structure must be able to store a tree with characters and numbers (their frequencies) in its nodes.
  2. Find all suffixes of the string: $x_1 x_2 \ldots x_N$, $x_2 \ldots x_N$, ..., $x_N$.
  3. For each suffix, look in T for a path from the root whose coincidence with the beginning of the suffix is as long as possible. For all matched characters, increase the frequencies of the corresponding nodes by 1.
  4. If the match is shorter than the suffix, the suffix has not been matched completely. Then new nodes for the remaining characters must be created in T, each assigned a frequency of 1.

When an AST is constructed for two or more strings, the suffixes of all the strings are successively superimposed on a common tree. Thus, having represented a document as a set of strings, one can construct a corresponding AST for it (a minimal code sketch of this procedure is given below).
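To make the construction procedure concrete, here is a minimal Python sketch of it; the class and function names are ours, and no attempt is made at the optimizations discussed further below.

```python
class Node:
    """AST node: children indexed by character; `freq` is the fragment frequency."""
    __slots__ = ("children", "freq")

    def __init__(self):
        self.children = {}
        self.freq = 0

def add_string(root, s):
    """Overlay every suffix of `s` onto the tree, updating node frequencies."""
    for k in range(len(s)):
        node = root
        for ch in s[k:]:
            if ch not in node.children:        # suffix not fully matched:
                node.children[ch] = Node()     # create a new node for this character
            node = node.children[ch]
            node.freq += 1                     # matched or newly created: count it
    return root

def build_ast(strings):
    """Build an annotated suffix tree for a document given as a set of strings."""
    root = Node()
    for s in strings:
        add_string(root, s.lower())
    return root
```

The scoring sketch given after the formulas below builds on these definitions.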
An AST built for a document makes it possible to solve such a problem as measuring the proximity of a string to the document. This task is called estimating the degree of occurrence of a string in a tree. We present its solution (proposed and described in detail in [3]).

Let us introduce the conditional probability of a node $u$ in an AST $T$ with root $r$. Denote the frequency of node $u$ by $f(u)$. Then the conditional probability of the node is

$$\hat{p}(u) = \frac{f(u)}{f(\mathrm{parent}(u))},$$

where $\mathrm{parent}(u)$ is the ancestor of node $u$ in the tree $T$. For those vertices whose ancestor is the root, the formula takes the form

$$\hat{p}(u) = \frac{f(u)}{\sum_{v \in V_1} f(v)},$$

where the set $V_1$ is nothing other than the set of all nodes on the first level of the tree.

For each suffix $x_k$ of a string $x$, the estimate of the degree of its occurrence in the tree is expressed as the sum of the conditional probabilities of the nodes belonging to its maximal match, whose length we denote by $m_k$:

$$\mathrm{score}(x_k, T) = \sum_{i=1}^{m_k} \hat{p}(u_i),$$

where $u_1, \ldots, u_{m_k}$ are the nodes along the matched path.
Then the estimate of the degree of occurrence of the whole string in the tree is defined as

$$\mathrm{SCORE}(x, T) = \sum_{k=1}^{N} \mathrm{score}(x_k, T),$$

where $N$ is the number of suffixes of $x$.
When the degree of occurrence of a string in two or more ASTs is compared, the formulas must be modified by dividing by the length of the match found:

$$\mathrm{SCORE}'(x, T) = \sum_{k=1}^{N} \frac{\mathrm{score}(x_k, T)}{m_k},$$

with terms for which no match was found ($m_k = 0$) taken to be zero.
If, for practice, we compute by this formula the degree of occurrence of a few strings in the tree shown in the figure above, we get the following table:
String          Degree of occurrence of the string in the tree
JAVASCRIPT      0.083
Jeep            0.125
SLICKJUMP       0.111
Jk              0.291
Juke            0.125
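Putting the formulas together, a sketch of the scoring might look as follows. It reuses the Node/build_ast sketch above; the handling of empty and partial matches is our reading of the formulas, so treat it as illustrative rather than as the reference implementation from [3], and the example strings are hypothetical.

```python
def suffix_score(root, suffix, first_level_total):
    """Sum of conditional probabilities along the longest match of `suffix`,
    returned together with the length of that match."""
    node, score, matched = root, 0.0, 0
    parent_freq = None                          # None: the parent is the root
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break
        denom = first_level_total if parent_freq is None else parent_freq
        score += child.freq / denom             # conditional probability of this node
        parent_freq = child.freq
        node = child
        matched += 1
    return score, matched

def degree_of_occurrence(root, query, normalize=True):
    """Degree of occurrence of `query` in the tree (cf. the formulas above)."""
    first_level_total = sum(child.freq for child in root.children.values())
    query = query.lower()
    total = 0.0
    for k in range(len(query)):                 # one term per suffix of the query
        score, matched = suffix_score(root, query[k:], first_level_total)
        if matched:
            total += score / matched if normalize else score
    return total

# Toy usage:
tree = build_ast(["slick jump"])
print(degree_of_occurrence(tree, "jump"))       # every suffix matches fully
print(degree_of_occurrence(tree, "jumb"))       # a typo still yields a partial score
```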
The algorithm for constructing an AST given above is not optimal. Nowadays, ASTs are more often built using algorithms based on classical string-processing methods, for example Ukkonen's algorithm for constructing suffix trees (about which you can read in [4]). This reduces both the time spent building the AST and the time needed to compute the estimate of occurrence of a string in it. Another direction is reducing the required memory, which can be achieved by using different data structures to store the trees, for example, special arrays.

Benefits


Let us note the advantages of this approach compared with the feature-based one. Feature-based approaches are much worse at sensing the deep "structure" of a text, even when bigrams are used as features (or even longer sequences of words, trigrams and so on, which, incidentally, greatly increases the dimensionality of the vectors). In addition, the fragment-based approach does not require reducing words to normal forms (at least where normalization can be done by transformation rules alone, without a word-form database): if a word occurs in the AST, then its stem will obviously also lie in the AST along some of its branches. But most importantly, the fragment-based approach is tolerant of minor errors in texts. Of course, there are special methods that can teach feature-based approaches to cope with this as well, but they often complicate the algorithms considerably. Let us explain in more detail with a specific example from our company.

The technology for contextual targeting of partner-store products at SlickJump is based on the probabilistic topic model LDA (latent Dirichlet allocation), which works with features. Probabilistic topic modeling in general is a separate, extensive and interesting subject that we will not go into in this article. We only note that this approach lets us solve the problem of measuring the degree of proximity between documents (i.e., between the context and the offered products), and to solve it for a huge number of documents: our company's databases contain more than 2 million products from partner stores.

But for contextual targeting of banners, whose number is incomparably smaller than the number of products in partner online stores, we use a specially developed model based on the AST method. This makes it possible to deliver good-quality contextual targeting even where people write like chicken scratch, for example, on forums. Consider an example.

Suppose we have banners of an electronics store offering fashionable smartphone accessories at a discount, and a forum about mobile technology as the advertising platform. The advertiser specifies a list of keywords for his banners, or hands this task over to us (which is convenient when the number of banners runs into the thousands; we have the corresponding functionality). We need to select the banners relevant to a given context (usually a web page, a forum post, etc.). Let us compare two contextual targeting models, LDA (feature-based) and our AST-based model, and see whether relevant banners are found in two different situations: when people write correctly and when they make mistakes.
If everyone wrote without errors, there would never be any problems with contextual targeting. If there is a keyword "charger" and a message mentions "charging" somewhere, targeting works fine thanks to feature normalization. The screenshot below shows that both models found relevant banners.



But if someone writes "hts" instead of "htc" (a primitive example, but a classic of the genre), things get much more interesting. Let's try to find banners relevant to this word. The LDA model, based on the feature approach without special adaptations, will find nothing (or nothing adequate, which is just as bad). The AST-based method gives an excellent result, despite the word containing a whole one error in just three letters. This is shown in the screenshot below, which lists the corresponding product names found for the query.
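As a toy illustration of this error tolerance, one can score the misspelled query against the AST sketch given earlier; the banner keyword strings below are hypothetical.

```python
# Assumes the build_ast() and degree_of_occurrence() sketches from above.
banner_keywords = build_ast(["htc one case", "htc charger"])   # hypothetical keyword lines

print(degree_of_occurrence(banner_keywords, "htc"))   # exact keyword: a high score
print(degree_of_occurrence(banner_keywords, "hts"))   # one typo: still a non-zero score

# A plain feature lookup, by contrast, treats the misspelled token as unseen:
print("hts" in {"htc", "one", "case", "charger"})     # -> False
```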



Conclusion


Of course, the AST-based approach will not solve the problems of synonymy and homonymy, which are hard for natural language in general, so it is no panacea. In addition, the degree of occurrence of a string in a tree, which is needed to compute relevance, is calculated in time that depends on the length of the string: the longer the string, the more time it takes. And with a naive approach, such a check would have to be performed against the ASTs of all documents. But at SlickJump we use special higher-order data structures that let us determine a list of candidate documents and check the degree of occurrence only against them, without loss of quality, so as not to search where there is obviously nothing to find. We are actively working in this direction; there is much that can be improved.

We have no doubt that the fragment-based method of representing texts will be actively developed in many directions in the coming years. In future articles we will review modern algorithms for constructing and representing annotated suffix trees.

Further reading on the topic


  1. Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval. Moscow: Williams, 2011.
  2. Mirkin B. G. Cluster Analysis Methods for Decision Support: An Overview. Moscow: Publishing House of the National Research University Higher School of Economics, 2011.
  3. Mirkin B. G., Chernyak E. L., Chugunova O. N. An annotated suffix tree method for estimating the degree of occurrence of strings in text documents. Business Informatics. 2012. Vol. 3, No. 21. P. 31-41.
  4. Dubov M. S., Chernyak E. L. Annotated suffix trees: implementation features. In: Proceedings of the All-Russian Scientific Conference AIST-2013 / Ed. E. L. Chernyak; scientific eds.: D. I. Ignatov, M. Yu. Khachay, O. Barinova. Moscow: National Open University "INTUIT", 2013. P. 49-57.
  5. Mirkin B. G. Core Concepts in Data Analysis. Springer, 2012.
  6. Korshunov A., Gomzin A. Topic modeling of texts in natural language. Proceedings of ISP RAS. 2012. P. 215-244.
  7. Bird S., Klein E., Loper E. Natural Language Processing with Python. O'Reilly Media, 2009.

Source: https://habr.com/ru/post/227199/

