Intelligent Text Processing

Working with natural language is one of the key tasks in building artificial intelligence, and its difficulty was badly underestimated for a long time. One reason for the early optimism was Noam Chomsky’s pioneering work on generative grammars. In his book "Syntactic Structures" and other works, Chomsky proposed an idea that now seems completely ordinary but was revolutionary at the time: a natural-language sentence can be transformed into a tree that shows how the words in the sentence relate to one another.

An example of a parse tree is shown in the figure above (a: a parse based on immediate constituents; b: a parse based on a dependency grammar). A generative grammar is a set of rules of the form S → NP VP or VP → V NP that can generate such trees. Fairly rigorous constructions can be built on top of parse trees; one can try, for example, to define a logic of natural language with real axioms and inference rules.
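As a minimal sketch of how rules such as S → NP VP and VP → V NP generate trees, the Python snippet below expands a toy grammar top-down. Only those two structural rules come from the text; the lexical rules (the actual words) and the names `GRAMMAR`, `generate`, and `leaves` are assumptions made for illustration.

```python
import random

# Toy generative grammar. Only S -> NP VP and VP -> V NP come from the text;
# the lexical rules below are illustrative assumptions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["mom"], ["the frame"]],
    "V": [["washed"], ["saw"]],
}

def generate(symbol, rng):
    """Expand a symbol into a parse tree: (label, children) or a terminal string."""
    if symbol not in GRAMMAR:  # terminal: an actual word or phrase
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return (symbol, [generate(s, rng) for s in production])

def leaves(tree):
    """Read the generated sentence off the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]

tree = generate("S", random.Random(0))
print(tree)
print(" ".join(leaves(tree)))
```

Whatever words are chosen, every generated tree has the same shape: an S node with NP and VP children, and a VP node with V and NP children. This is what makes such trees convenient objects for formal reasoning.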

This approach to parsing is now called parsing based on immediate constituents, or phrase structure parsing. Chomsky, of course, put forward other hypotheses along the way. In particular, one of his central ideas was a “universal grammar” of human language that is, at least in part, laid down at the genetic level, before a child is even born. What matters for us here, though, is this new connection between natural language and mathematics, which eventually turned linguistics into one of the most “exact” of the humanities.

For artificial intelligence, this linguistic breakthrough at first looked like a license for unrestrained optimism: since natural language could be represented by such rigorous mathematical constructions, it seemed that we would soon formalize it completely, hand it over to a computer, and the computer would soon be able to talk to us. In practice, however, this program ran into significant difficulties, to put it mildly. Natural language turned out to be far less formal than it had seemed and, most importantly, it depends heavily on implicit assumptions that are not easy to formalize. Moreover, understanding natural language often requires not just correctly “parsing” a well-defined formal object such as a sequence of letters or words, but also having some “common sense”, some notion of the surrounding world, and computers are still bad at this.

A simple and clear example of such an extremely hard task is anaphora resolution, that is, working out what a particular pronoun in a text refers to. Compare two sentences: "Mom washed the frame, and now it glitters" and "Mom washed the frame, and now she is tired." In the Russian original both nouns are feminine, so both sentences use the same pronoun "она" and are structurally identical. Now imagine how much a computer needs to know and understand in order to determine correctly what that pronoun refers to in each of these sentences!
And this is not some specially contrived, twisted example but the everyday reality of our language: we constantly rely on things that people understand “naturally” but that make no sense at all to a computer model. It is “common sense” (commonsense reasoning) that is the main stumbling block for modern natural language processing. Specialists in natural language processing have, incidentally, been working specifically on this for a long time: a regular seminar on “common sense”, the International Symposium on Logical Formalizations of Commonsense Reasoning, has been held for more than ten years, and recently a common-sense competition was launched, called the Winograd Schema Challenge in honor of Terry Winograd.

The tasks there look roughly like this: “The cup did not fit in the suitcase because it was too large; what exactly was too large, the suitcase or the cup?”

So, although people are working on intelligent text processing and even making significant progress, computers have not yet learned to talk. Understanding written text is still in poor shape as well, even though deep learning is being applied to it, as it is to speech recognition and synthesis. But before we start applying neural networks to natural language, we need to discuss another question the reader has probably already asked: what does it actually mean to “understand a text”? Machine learning has taught us that the first step is to define the task, the objective function we want to optimize. How do we optimize "understanding"?

Of course, intelligent text processing is not one task but many, and all of them are, in one way or another, within human abilities and connected to the “holy grail” of understanding text. Let us list and briefly comment on the main easily quantifiable text processing tasks; some of them will be discussed later in this chapter. We will try to go from simple to complex and divide them, somewhat arbitrarily, into three classes.

1. Tasks of the first class can be called syntactic. As a rule, they are very well defined and amount to classification problems or problems of generating discrete objects, and many of them are already solved quite well, for example:

(i) part-of-speech tagging: label each word in a given text with its part of speech (noun, verb, adjective ...) and, possibly, with morphological features (gender, case ...);

(ii) morphological segmentation: split the words of a given text into morphemes, that is, units such as prefixes, suffixes, and endings; for some languages (English, for example) this is not very relevant, but Russian has a lot of morphology;

(iii) another version of the morphology problem for individual words is stemming, in which the stem of a word must be extracted, or lemmatization, in which the word must be reduced to its base form (for example, the masculine singular);

(iv) sentence boundary disambiguation: split a given text into sentences; it may seem that sentences are separated by periods and other punctuation marks and begin with a capital letter, but recall, for example, that “in 1995 T. Winograd became the adviser of L. Page,” and you will see that the task is not an easy one; and in languages like Chinese even word segmentation becomes quite nontrivial, because a stream of characters without spaces can be divided into words in different ways;

(v) named entity recognition: find the proper names of people, geographic objects, and other entities in the text and label them by entity type (personal names, toponyms, etc.);

(vi) word sense disambiguation: choose which of several homonyms, which of the different meanings of the same word, is used in a given passage;

(vii) syntactic parsing: given a sentence (and possibly its context), construct its syntax tree, exactly as in Chomsky;

(viii) coreference resolution: determine which objects or other parts of the text particular words and phrases refer to; a special case of this problem is anaphora resolution, which we discussed above.
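To see why even a "simple" task such as sentence boundary disambiguation is harder than it looks, here is a minimal sketch of the naive rule mentioned above: a sentence ends at a punctuation mark that is followed by whitespace and a capital letter. The function name `naive_split` and the exact regular expression are assumptions made for illustration, not a recommended method.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# Two real sentences; the initials make the naive rule over-split.
text = "In 1995 T. Winograd became the adviser of L. Page. The task is not an easy one."
for piece in naive_split(text):
    print(repr(piece))
```

The rule cuts the text after each initial as well, so two sentences come out as four fragments. A practical splitter therefore has to treat the period as ambiguous and use context to decide, which turns the task into a classification problem.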

2. The second class consists of tasks that do, generally speaking, require understanding the text but are still well defined in form, with correct answers (classification tasks, for example), so that uncontroversial quality metrics are easy to devise. These tasks include, in particular:

(i) language models: given a passage of text, predict the next word or character; this task is very important, for example, for speech recognition (see just below);

(ii) information retrieval, the central task solved by Google and Yandex: given a query and a huge collection of documents, find the documents most relevant to the query;

(iii) sentiment analysis: determine the sentiment of a text, that is, whether the attitude it expresses is positive or negative; sentiment analysis is used in e-commerce to analyze user reviews, in finance and trading to analyze press articles, company reports, and similar texts, and so on;

(iv) relationship extraction or fact extraction: extract well-defined relationships or facts about the entities mentioned in the text; for example, who is related to whom, in what year the company mentioned in the text was founded, and so on;

(v) question answering: answer a given question; depending on the formulation, this can be pure classification (choosing among answer options, as in a test), classification with a very large number of classes (answers to factual questions such as “who?” or “in what year?”), or even text generation (if questions must be answered in a natural dialogue).
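The language modeling task from item (i) has a particularly simple objective, which a count-based bigram model illustrates well. The sketch below is not a neural model; it only shows what "predict the next word" means as a well-defined problem. The toy corpus and the names `train_bigram` and `predict_next` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation of `word` and its estimated probability."""
    followers = counts.get(word)
    if not followers:  # word never seen with a continuation
        return None, 0.0
    best, freq = followers.most_common(1)[0]
    return best, freq / sum(followers.values())

# Toy corpus, invented for illustration.
corpus = "mom washed the frame and now the frame glitters".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

Because the correct next word is known for every position in a text, the quality of such a model can be measured without any human judgment, which is exactly what places language modeling in the second class.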

3. Finally, the third class contains tasks that require not only understanding existing text but also generating new text. Quality metrics are not always obvious here, and we will discuss this issue below. These tasks include, for example:

(i) text generation proper;

(ii) automatic summarization: given a text, generate its summary, an abstract of sorts; this can be treated as a classification task, if the model is asked to select existing sentences from the text that best reflect its overall meaning, or as a generation task, if the summary must be written from scratch;

(iii) machine translation: given a text in one language, generate the corresponding text in another language;

(iv) dialog and conversational models: keep up a conversation with a person; the first chat bots began to appear as early as the 1970s, and today this is a big industry; and although it is not yet possible to hold a full-fledged dialogue and pass the Turing test, dialog models are already hard at work (for example, the first line of "online consultants" on various e-commerce sites is almost always a chat bot).

An important problem for models of the last class is quality assessment. We may have a set of parallel translations that we consider good, but how do we evaluate a new translation produced by the model? Or, more interesting still, how do we evaluate a dialog model's reply in a conversation? One possible answer is BLEU (Bilingual Evaluation Understudy) [48], a class of metrics designed for machine translation but also used for other tasks. BLEU is a modification of precision computed between the model's response and the “correct answer”, re-weighted so that a response consisting of a single correct word does not get a perfect score. Over the entire test set, BLEU is computed as follows:

$$\mathrm{BLEU} = \min\left(1, e^{1 - r/c}\right) \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),$$

where r is the total length of the correct answer, c is the length of the model's response, p_n is the modified n-gram precision, and w_n are positive weights that sum to one. There are other similar metrics: METEOR [298] is a harmonic mean of unigram precision and recall; TER (translation edit rate) [513] computes the relative number of edits needed to turn the model output into the reference output; ROUGE [326] measures the share of overlapping n-grams between the reference and the generated text; and LEPOR [204] combines several different metrics with different weights, which can also be learned (many of the authors of these papers are French, hence the French-sounding abbreviations).
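As a hedged illustration, here is a minimal single-reference sketch of BLEU as it is commonly defined: a brevity penalty min(1, exp(1 − r/c)) multiplied by the weighted geometric mean of the modified (clipped) n-gram precisions p_n. The uniform weights w_n = 1/N, the default N = 2, and the function names are assumptions made for illustration; production implementations add smoothing and support multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped precision p_n: a candidate n-gram is credited at most as many
    times as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=2):
    """BLEU with uniform weights w_n = 1/max_n and a single reference (assumption)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:  # log(0) is undefined; the score is zero
        return 0.0
    r, c = len(reference), len(candidate)
    brevity_penalty = min(1.0, math.exp(1 - r / c))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))                   # identical output
print(modified_precision(["the"] * 4, reference, 1))  # clipping at work
print(bleu(["the"], reference, max_n=1))            # one correct word
```

The last two calls show the two corrections to plain precision. Clipping stops a response from earning credit by repeating a correct word, and the brevity penalty keeps a one-word response, whose unigram precision is perfect, from receiving a high score.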

It is curious, however, that although metrics like BLEU and METEOR are still widely used, it is far from clear that they are the best choice. First, BLEU takes a discrete set of values, so it cannot be optimized directly by gradient descent. Even more interesting are the very surprising results reported in [232] on using various similar quality metrics to assess model responses in a dialogue. That work computes correlations (both ordinary and rank) between human assessments of answer quality and the scores given by different metrics, and it turns out that these correlations are almost always close to zero and sometimes even negative! The best BLEU variant reached a correlation with human judgments of about 0.35 on one dataset and only 0.12 on another (try publishing a scientific result with such a correlation!). Moreover, such poor results do not mean that a correct answer does not exist at all: the assessments of different people always correlated with each other at 0.95 and higher, so a “gold standard” of quality assessment certainly exists, but we do not yet understand how to formalize it. This criticism has already led to new constructions of automatically trained quality metrics [537], and we hope that new results in this direction will follow. Nevertheless, as long as there are no easily applicable alternatives to BLEU-type metrics, these metrics are what is usually used.
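For readers who want to see what such a comparison involves, here is a minimal sketch of the two kinds of correlation mentioned above: the ordinary (Pearson) correlation and the rank (Spearman) correlation, the latter computed as the Pearson correlation of average ranks. The scores below are invented toy numbers, not data from [232], and the function names are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented toy scores for five dialogue responses (NOT from the cited work).
human_scores = [5, 4, 2, 1, 3]
metric_scores = [0.2, 0.5, 0.4, 0.1, 0.3]
print(pearson(human_scores, metric_scores))
print(spearman(human_scores, metric_scores))
```

A metric that tracked human judgment well would give values close to 1 on both measures; values near zero mean that the metric's ordering of responses tells us almost nothing about how people would order them.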

In addition, there is a wide class of tasks that are related to text but take as input not a sequence of characters but data of a different nature. For example, without understanding the language it is almost impossible to learn to recognize speech perfectly: although speech recognition may seem to be just a matter of classifying phonemes by sound, in reality a person misses a lot of sounds and fills in a significant part of what they hear from their understanding of the language. Back in the 1990s, speech recognition systems reached the human level in recognizing individual phonemes: if you sit people down at a tape recorder, let them listen to sounds without context, and ask them to tell "a" from "o", the results will not be outstanding at all. So to write down, for example, what is being dictated to you, you need to know the language of the dictation well. Another class is the recognition of handwritten or printed text.

We will return to many of these tasks later. The main purpose of this chapter, however, is not to solve any specific natural language processing task but to describe the construction on which almost all modern neural network approaches to such tasks are based: distributed representations of words.

This is an excerpt from the book “Deep Learning” by Sergey Nikolenko, Arthur Kadurin, and Ekaterina Arkhangelskaya.

Source: https://habr.com/ru/post/351732/
