
Hi, Habr!
We hope many of you had a great rest over the New Year holidays. But the holidays are over, and it's time to get back to machine learning and data analysis. On January 25 we are launching the third cohort of the Beeline Data School.
In the last post, we promised to describe in more detail what we cover in our text analysis classes. In this post, we are keeping that promise.
By the way, if you are already actively analyzing and processing texts and want to test yourself, we recommend playing with The Allen AI Science Challenge on Kaggle =) and, while you are at it, taking part in DeepHack, a hackathon on text analysis and building question answering systems.
Read on to learn what we teach in our text processing classes.
Automatic text processing is an area with a high barrier to entry: to put together an interesting business proposal or to take part in a text analysis competition such as SemEval or Dialogue, you need to understand machine learning methods, be able to use specialized text processing libraries (so that you do not have to program routine operations from scratch), and have a basic understanding of linguistics.
Since it is impossible to cover all text processing tasks and the subtleties of solving them in a few classes, we concentrate on the most fundamental ones: tokenization, morphological analysis, keyword and key phrase extraction, and measuring the similarity between texts.
Tokenization is the process of splitting text into sentences and words. At first glance, this task may mistakenly seem trivial.
In fact, the notion of a word, or token (an element of the text), is fuzzy: for example, the name of the city of New York formally consists of two separate words. For any reasonable processing, of course, these two words should be treated as a single token rather than processed one by one.
In addition, a period is not always the end of a sentence, unlike question and exclamation marks: periods can be part of an abbreviation or a number.
If you do not want to dig into theory and instead treat tokenization from a practical point of view, as an absolutely necessary but not the most fascinating task, the most reasonable approach is a combined one: split the text into words with rules, and split the text into sentences with a binary classification algorithm. Each period is labeled by the classifier either as the positive class (the period ends a sentence) or as the negative class (the period does not end a sentence). Exclamation and question marks are handled by the classifier in the same way.
Fortunately, students did not have to implement these algorithms from scratch: they are all available in the Natural Language Toolkit (NLTK), a library for text processing in Python.
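For example, here is a minimal sketch of sentence and word splitting with NLTK (assuming the library and its punkt model are installed; the sample text is ours):

```python
import nltk

# The punkt model is a pre-trained sentence-boundary classifier
# of exactly the kind described above.
nltk.download('punkt')

text = "Dr. Smith moved to New York in 1999. He loves it there!"

# Sentence splitting: the classifier knows "Dr." does not end a sentence.
sentences = nltk.sent_tokenize(text)
print(sentences)
# ['Dr. Smith moved to New York in 1999.', 'He loves it there!']

# Rule-based word tokenization within each sentence.
for sentence in sentences:
    print(nltk.word_tokenize(sentence))
```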
Morphological analysis (part-of-speech tagging) consists in determining the morphological properties of each word: which part of speech the word belongs to; if the word is a noun, what number it is in (and what its gender and case are, if we are talking about Russian); if the word is a verb, what its tense, aspect, person, voice, and so on are.
Determining the part of speech is not so simple because of morphological homonymy: different words can have coinciding forms, that is, be homonyms. For example, in the sentence "He was surprised by a simple soldier" (a Russian example) there are two homonyms at once: "simple" and "soldier".
NLTK implements both smart morphological analyzers, which determine the part of speech from context, and simple morphological dictionaries, which return the most frequent parse for each word.
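A short sketch of the context-aware variant (the example sentence is a classic one, chosen because it contains homonyms):

```python
import nltk

# Model used by nltk.pos_tag, a context-aware tagger.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# "refuse" and "permit" are homonyms: the tagger uses context to label
# the first occurrences as verbs and the second occurrences as nouns.
```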
The task of extracting keywords and key phrases is weakly formalized: keywords and key phrases are usually understood as words and phrases that reflect the thematic specificity of a text.
Because of the vagueness of this definition, there are dozens, if not hundreds, of approaches to extracting keywords and phrases. We consider some of them:
- Extracting keywords and phrases with machine learning. It is assumed that there is a collection of texts in which keywords and phrases have been marked up by experts. We can then train a classifier on such a labeled collection to extract keywords and phrases.
- Extracting keywords and phrases by morphological patterns. If there is no labeled collection of texts, we can assume that keywords and phrases should be grammatically well-formed: for example, keywords are nouns, and key phrases are word pairs of the form noun + noun or adjective + noun.
- Extracting key bigrams (word pairs) by statistical measures of association. Statistical measures of association usually mean the pointwise mutual information of a word pair w1, w2 and its derivatives, borrowed from information theory, or statistical tests of the independence of the two events "w1 occurred" and "w2 occurred". A significant limitation of this approach is that it applies only to pairs of words (see the sketch after this list).
- Extracting keywords and phrases by contrastive measures. Suppose we have collected 10 Wikipedia articles about different pythons: programming languages, movies, amusement rides. The task is to find the words most specific to a given text rather than to the collection as a whole. To do this, the salience of a word or phrase in the text is contrasted with the rest of the collection. One of the most popular contrastive measures is tf–idf. It consists of two parts: tf is the frequency of the word in the text under consideration, and idf (inverse document frequency) penalizes words that occur in many documents; it is typically computed as the logarithm of the total number of documents divided by the number of documents containing the word. The tf–idf measure lets you discover, for example, that the most salient word in the text about the Python amusement ride is "slides".
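To make the bigram approach concrete, here is a minimal sketch using NLTK's collocation tools (the file name alice.txt is a placeholder for any plain-text corpus):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download('punkt')

text = open('alice.txt').read()  # placeholder: any plain-text corpus
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)  # drop rare pairs: PMI overrates them

# Top word pairs by pointwise mutual information.
print(finder.nbest(BigramAssocMeasures.pmi, 10))
```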

A cloud of keywords and phrases built from "Alice in Wonderland" using tf–idf

tf–idf is not limited to the task of extracting keywords and phrases. Another use for tf–idf is computing the similarity between texts. If each text in the collection is represented by the tf–idf vector of its words, then the similarity between two texts can be defined as the cosine between the two corresponding vectors in the multidimensional word space. This cosine similarity between two texts can also be used as a distance for cluster analysis.
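A sketch of this computation with scikit-learn (our choice of library; any tf–idf implementation will do):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Python is a programming language with clear syntax.",
    "The python is a large snake found in Asia and Africa.",
    "Java is a programming language that runs on the JVM.",
]

# Each text becomes a tf-idf vector in the space of all words.
vectors = TfidfVectorizer().fit_transform(texts)

# Pairwise cosine similarities: the two texts about programming
# languages end up closer to each other than to the snake text.
print(cosine_similarity(vectors))
```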
A Deep Learning approach to text processing: word2vec

In addition to the classic text processing methods described above, word2vec, a model developed at Google that uses a Deep Learning approach, has recently gained popularity. With it, you can have a little fun, namely, map words (in any language) to vectors in a linear space, and then subtract, add, and take dot products of those vectors. Internally the model is rather complicated, but to try it in action you only need to learn how to tune its training parameters, and we devote time to this in our classes. From a black-box point of view, the model is quite simple: we just "feed" it a large number of texts, without any preprocessing, and at the output we get a vector (of a dimension fixed in advance) for every word in our texts.
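As a black-box sketch, training with gensim (a popular open-source implementation; the toy corpus and parameter values here are only illustrative, and parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; in practice you would tokenize
# a large corpus, e.g. with NLTK as shown above.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    # ... millions more sentences
]

# vector_size is the dimension of the word vectors, fixed in advance.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["king"]                        # the vector for a word
print(model.wv.most_similar("king", topn=3))  # nearest words by cosine
```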
The reader immediately has a question: what do we do with it now? Here is a list of the simplest tasks that can be solved almost at the snap of your fingers:
- For example, in a text classification task, you can compute for each document the average vector of its words, obtaining a decent (though not the best) feature representation. Then, by solving the classical machine learning problem of classification, we learn to classify documents.
- Detecting typos in text, done with the help of the "dot product"; exactly how, we invite readers to think through at their leisure
- Word clustering (which, by the way, can also be applied to the task of text classification)
- Solving the "find the odd word out" problem (see the sketch after this list)
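To illustrate the last point, here is a sketch using the pre-trained Google News word2vec vectors available through gensim's downloader (note the model is large, around 1.6 GB):

```python
import gensim.downloader

# Pre-trained word2vec vectors trained on Google News.
wv = gensim.downloader.load("word2vec-google-news-300")

# Vector arithmetic: king - man + woman is close to queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# "Find the odd word out": the word whose vector is farthest
# from the mean of the others.
print(wv.doesnt_match(["breakfast", "lunch", "dinner", "python"]))
```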
And this is far from a complete list of tasks that can be solved with word2vec. We cover all the applications and the subtleties of tuning in more detail in our classes.
This was only an introduction to such a complex and multifaceted field as Natural Language Processing. In our classes, we also consider many applied tasks of automatic text processing, including on Russian corpora (Russian texts).
This is especially relevant because at the moment many vendors on the market offer solutions that have only been tested on English corpora, and there is no off-the-shelf buy-and-install solution: you have to develop a product from scratch for every task.
Examples of text analytics

Among today's applications of natural language analysis, one of the best-known examples is monitoring mentions of particular objects and analyzing attitudes toward them in the media and social networks. It is no secret that the Internet generates a huge amount of textual information every day: news, press releases, blogs, not to mention social networks. So even the mere task of collecting and storing such information, let alone processing it, is a separate business. Of particular value are text processing methods, in particular monitoring the attitude expressed in texts toward a particular object or company.
Surely many of you have noticed that if you write something on a social network about a well-known brand, or better yet mention it with a tag, you will quite probably receive a reply. Of course, the reply itself is still written by people (although natural language analysis is making great progress here too), but the classification of the tone of your message (negative, neutral, positive) is done by clever algorithms. This is the classic problem of Sentiment Analysis, which we analyze in detail in class and which has a huge number of pitfalls.
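As a toy sketch of such a classifier with scikit-learn (the training examples are made up; a real system needs thousands of labeled messages and careful feature engineering):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data; in practice the labels come from human annotators.
texts = ["great service, thank you", "terrible support, never again",
         "my order arrived today", "love this brand", "worst experience ever"]
labels = ["positive", "negative", "neutral", "positive", "negative"]

# tf-idf features fed into a standard linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["thanks for the quick reply"]))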
Nevertheless, even solving essentially just this one task lets some companies run a good business. For example, there are companies that analyze messages on social networks and find opinion leaders (a rather popular task from graph theory, which we also cover in our lectures).
The task of building bots, conversational robots, and messaging applications is also gaining popularity. Such bots can answer customer questions and let customers manage the company's services available to them or buy new products and services. You can read more about bots in messengers, for example, here.
As you have probably noticed, there are a great many problems in natural language processing today, and by solving even one of them you can build a whole business.
For anyone interested in text analysis, we recommend the Coursera course of the same name, in our opinion one of the most useful text processing courses. And if you want more practice solving problems, welcome to the Beeline Data School, whose third cohort starts on January 25th.