There is a huge number of clustering algorithms. The core idea of most of them is to merge similar objects into a single class, or cluster, based on some measure of similarity. As a rule, the choice of algorithm is determined by the task. For textual data, the objects being compared are sequences of words and their attributes (for example, the weight of a word in the text, the type of named entity, sentiment, etc.). Texts are therefore first converted into vectors, which are then manipulated in various ways. At this point a number of problems usually arise: choosing the initial clusters, the dependence of clustering quality on text length, determining the total number of clusters, and so on. But the hardest problem is the lack of any link between texts that are similar in meaning yet use different vocabulary. In such cases documents should be grouped not only on the basis of lexical similarity, but also on the basis of semantic contiguity or associativity.

For example,
London has decided not to declare a new cold war on Russia.
Boris Johnson: The West is not in a state of a new cold war with Russia.
The British Foreign Ministry said that the West does not want a new cold war with Russia.
All three headlines report the same news story, yet the vocabulary they use is different. In such cases, clustering algorithms based on lexical similarity no longer help.
Problems of this kind are solved in two ways: 1) by compiling thesauri and 2) by using various kinds of "clever" algorithms that establish associative-semantic links between words, such as latent semantic analysis (LSA), probabilistic latent semantic analysis (pLSA), latent Dirichlet allocation (LDA), etc.
The first way, building thesauri, is rather laborious and entirely task-specific, which means that creating a universal thesaurus is not yet feasible.
The second, algorithmic way also has its drawbacks. First of all, there is the opacity of the methods themselves and the non-obviousness of applying them to textual data. LDA, for instance, rests on distributional assumptions that are not always satisfied in linguistic problems. As a rule, all of these algorithms have a large number of parameters that must be determined empirically and can significantly affect the quality of the solution (for example, truncating the singular values of the diagonal matrix in LSA affects the result in a highly non-linear way).
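To illustrate this kind of sensitivity, here is a minimal LSA sketch (using scikit-learn, which is our own choice of tooling, not that of the experiments discussed here); the toy corpus and the n_components value are purely illustrative:

# Minimal LSA sketch; the corpus and n_components are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "London has decided not to declare a new cold war on Russia",
    "Boris Johnson says the West is not in a new cold war with Russia",
    "The British Foreign Ministry said the West does not want a cold war",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)   # rank of the truncation
reduced = lsa.fit_transform(tfidf)                   # documents in the latent space

# Changing n_components (how many singular values are kept) can alter
# document similarities in a non-linear, hard-to-predict way.
print(reduced.shape)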
We tried to circumvent a number of the above problems using the results of the Word2Vec algorithm.
Word2vec
A description of the algorithm can be found on Wikipedia or other resources. The algorithm was created primarily with search tasks in mind, so applying it directly to other linguistic problems should be done with care.
The algorithm has two training modes and several ways of presenting the results. We are particularly interested in one of them: obtaining associative-semantic classes of words (built, incidentally, with k-means).
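A rough sketch of this mode can be reproduced with gensim and scikit-learn; the toy corpus and all parameter values below are our own illustrative assumptions, not the settings of the experiments described in this article:

# Sketch: derive associative-semantic word classes by clustering
# word2vec vectors with k-means. All values are illustrative.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy corpus of tokenized sentences (a real model needs millions of tokens).
sentences = [
    ["mother", "loves", "her", "daughter"],
    ["the", "daughter", "visits", "her", "grandmother"],
    ["the", "minister", "spoke", "about", "the", "cold", "war"],
    ["london", "does", "not", "want", "a", "new", "cold", "war"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

n_classes = 2  # the number of semantic classes is fixed at training time
words = model.wv.index_to_key
kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
labels = kmeans.fit_predict(model.wv[words])

word2class = dict(zip(words, labels))  # word -> semantic class id
print(word2class)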
As experiments show, even small collections of a few million word usages yield more or less "meaningful" word classes. But how can this result be used if the words carry no weight (in such a model, prepositions and "key" words are treated identically)? Another problem is that the classes are not perfect and, for example, function words can end up in a semantically meaningful class. Use a stop list? Then there is a risk of throwing the baby out with the bathwater. In general, stop lists have significant drawbacks: by discarding pronouns we lose the Russian youth political movement Nashi ("Ours"); by discarding single-letter tokens we will not find information about "Ъ", the abbreviation of the Kommersant newspaper, and so on.
Finally, how many classes are optimal for text clustering? Does this depend on the size of the training or test sample, on the subject matter of the text, or on its length? We will answer these key questions in the second part of the publication; for now, let us briefly describe the algorithm.
Algorithm Description
The idea of the algorithm is to compare not the words themselves, nor the sequences composed of them (so-called n-grams), but the semantic classes into which they fall.
Training requires a large amount of text (hundreds of thousands, or better yet tens of millions, of word usages; the larger the sample, the less the model is tied to a particular topic and the less often it has to be adjusted to new texts). Training produces a list in which almost every word of the text is assigned a class (the number of classes is specified at the training stage). This list is then converted, based on the frequency distributions of words in the text and in their corresponding classes, into the format word - class - weight, smoothed by a dedicated algorithm. The important parameters here are the total number of classes and the smoothing factors (in more detail in the second part of the publication).
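The exact smoothing scheme is left for the second part; as a placeholder, here is a sketch in which the weight of a word is simply its corpus frequency relative to the total frequency of its class (one of many possible choices, not the actual formula used):

# Sketch of the word - class - weight table. The weighting used here
# (relative frequency of a word inside its class) is an illustrative
# placeholder, not the smoothing algorithm from the article.
from collections import Counter

def build_table(tokens, word2class):
    """tokens: flat list of words from the training text,
    word2class: mapping word -> semantic class id (e.g. from k-means)."""
    word_freq = Counter(t for t in tokens if t in word2class)
    class_freq = Counter()
    for word, freq in word_freq.items():
        class_freq[word2class[word]] += freq

    table = {}
    for word, freq in word_freq.items():
        cls = word2class[word]
        # Words that dominate their class get a higher weight; overly
        # frequent "outliers" can later be damped by further smoothing.
        table[word] = (cls, freq / class_freq[cls])
    return table

Applied to the toy example above, this would be called as build_table([t for s in sentences for t in s], word2class).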
Here is an example of a small semantic class.
baby, daughter, mummy, mother-in-law, teacher, docha, baby, relatives, teacher, younger, little sister, wife, baby, father's, sister, the former, aunt, grandmother, granny, the eldest, girlfriend, girlfriend, mother, mistress, spouse, citizen, aunt, family, granddaughter, girl, maiden, mother's, her
The pronoun "her" falls into this more or less homogeneous associative-semantic class, which is, in principle, appropriate but not informative. It can interfere with clustering the material, since its high frequency lets it pull the whole class toward itself. Such "outliers" can be removed by further smoothing of the model.
Using the obtained model, all words in the test sample are converted into the numbers of their semantic classes, and all further manipulation is done with numbers only. For clustering, a simple algorithm is used that compares the incoming documents with one another. Since integer types (int) are being compared, the speed is quite high. The comparison is subject to a number of filters, such as a limit on the number of semantic classes in one document, a minimum number of documents per cluster, and the proximity measure (the number of matching classes). The logic of the algorithm allows the same document to fall into different clusters (since the primary clusters are not defined in advance). Because of this "fuzziness", the primary result is a fairly large set of thematically close clusters. Post-clustering is therefore required, which simply merges close clusters by comparing the number of identical documents they contain (by document id).
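To make this logic concrete, here is a simplified sketch of the comparison and merging steps; the function names and all threshold values are our own illustrative assumptions, not the real system's settings:

# Each document is represented as a set of semantic class ids (integers),
# documents are compared by the number of shared classes, and overlapping
# primary clusters are then merged by document id.

def primary_clusters(docs, min_shared_classes=3, min_cluster_size=2,
                     max_classes_per_doc=50):
    """docs: dict mapping doc_id -> set of semantic class ids."""
    clusters = []
    for seed_id, seed_classes in docs.items():
        if len(seed_classes) > max_classes_per_doc:
            continue  # filter: too many semantic classes in one document
        members = {
            doc_id
            for doc_id, classes in docs.items()
            if len(seed_classes & classes) >= min_shared_classes
        }
        if len(members) >= min_cluster_size:
            clusters.append(members)  # the same document may enter several clusters
    return clusters

def merge_clusters(clusters, min_shared_docs=2):
    """Post-clustering: greedily merge primary clusters that share enough documents."""
    merged = []
    for cluster in clusters:
        for target in merged:
            if len(cluster & target) >= min_shared_docs:
                target |= cluster
                break
        else:
            merged.append(set(cluster))
    return merged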
Thus, the logic described above makes it possible not only to achieve high clustering speed, but also to avoid having to define the primary clusters or their total number.
Results and conclusions
For clustering, it is difficult to talk about the quality of the result, since it depends heavily on the material being clustered. In this sense, there is little point in building a "gold" standard. But it does make sense to evaluate the models on different corpora via classification (which, again, will be discussed in the second part of the article).
As tests of document classification with the vector method on unigrams (cosine comparison) showed, the clustering models are almost as good as the "word" models. This means that unsupervised classification (the models are universal in this sense) performs only slightly worse than full supervised training: the degradation is 1-10%, depending on the test material. But this is not the goal! It simply means that the clustering models obtained with the Word2Vec algorithm are quite valid for use on any type of material with any number of classes (clusters, in this case).
The quality of the result is governed by the filtering thresholds: if we only need to detect near-duplicates, stricter parameters should be set; conversely, if we only need to see the main topics of the incoming material, softer parameters will do.
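For example, reusing the primary_clusters sketch above (all numbers here are invented purely for illustration):

docs = {
    1: {10, 11, 12, 13, 14, 15, 16, 17},
    2: {10, 11, 12, 13, 14, 15, 16, 18},   # near-duplicate of document 1
    3: {10, 11, 12, 40, 41, 42, 43, 44},   # same broad topic, different details
}
# Strict thresholds: only near-duplicates are grouped together.
near_duplicates = primary_clusters(docs, min_shared_classes=7, min_cluster_size=2)
# Soft thresholds: a broad overview of the main topics.
broad_topics = primary_clusters(docs, min_shared_classes=3, min_cluster_size=2)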
In general, the algorithm is universal: it can work on any volume of material, although for large volumes it is better to process the stream in windows of 10-100 thousand documents; the primary clusters obtained this way are then merged during post-clustering. The algorithm is also only weakly sensitive to document length; it is desirable, though, that "long" documents be semantically homogeneous (belong to a single topic).
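A windowed version of the same pipeline might look like this (again reusing the helpers from the sketch above; the window size is an illustrative assumption):

# Cluster a large stream window by window, then merge the primary clusters
# from all windows in a single post-clustering pass.
def cluster_stream(doc_items, window_size=10_000):
    """doc_items: iterable of (doc_id, set of semantic class ids)."""
    all_primary = []
    window = {}
    for doc_id, classes in doc_items:
        window[doc_id] = classes
        if len(window) == window_size:
            all_primary.extend(primary_clusters(window))
            window = {}
    if window:  # flush the last, possibly incomplete window
        all_primary.extend(primary_clusters(window))
    return merge_clusters(all_primary)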
However, the algorithm does have a number of problems. First of all, there is the dependence of the result on the number of classes in the model: it can become noticeable, especially for "weak" clusters, i.e. clusters with a small number of documents. The second problem stems from the imperfection of the semantic classes themselves, because of which documents that are close in meaning sometimes end up in different clusters (differing, say, in a couple of classes). Such problems are eliminated by an additional post-merge step based on document factorization: extracting named entities and the relations between them. This problem is easier to solve, since the number of clusters is small relative to the initial volume of documents and most of them have already been merged.
Thus, the presented algorithm has several advantages:
- clustering is performed not on words but on semantic classes;
- the result does not depend on the length of the documents;
- the result does not depend on the volume of the material;
- the number of clusters is detected automatically;
- there is no need to define primary clusters;
- high speed, which allows the algorithm to be used on large streams of text messages.
Although the algorithm is not sensitive to the volume of the incoming material, it is best suited to clustering large streams of unstructured information, where not only the quality of the clustering matters but also the speed of obtaining the result. For small volumes of thematically narrow material (for example, pharmaceuticals, gaming chats, etc.) it is better to use thesauri, since a Word2Vec model may not contain the relevant professional terminology or highly specialized jargon.