The well-known distributional semantics utility Word2Vec shows impressive results and consistently earns the specialists who use it top places in computational linguistics competitions. Its advantage, shared by its analogues GloVe and AdaGram, is the low cost of training and of preparing the training texts. But there is a drawback: the vector representation works well on single words, tolerably on short word combinations, poorly on phrases, and not at all on long texts.
This article proposes for discussion an approach that represents a text of any length as a vector and supports comparison (distance calculation), addition, and subtraction operations on texts.
From vector representations to a semantic vector
Vector representations of words produced by Word2Vec have an interesting property: only the distances between the vectors are meaningful, not the vectors themselves. In other words, decomposing the vector representation of a specific word into its components and studying them is an intractable task. First, because training starts from random initial vectors; second, because the training process itself is random. Its randomness comes from the principle of stochastic learning, where parallel training threads do not synchronize their updates with one another, producing a data race in its pure form. This race barely reduces training quality, while the training speed increases very noticeably. Because of this randomness in the algorithm and the data, the vector representation of a word cannot be decomposed into meaningful components and can only be used as a whole.
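A minimal sketch of this property, assuming gensim and a pre-trained model in word2vec format (the file name and the example words are hypothetical):

```python
from gensim.models import KeyedVectors

# Load pre-trained word vectors (the file name is hypothetical).
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# The components of a single vector carry no interpretable meaning:
# they come out different on every training run.
print(kv["car"][:5])

# The distances between vectors, however, are stable and meaningful.
print(kv.similarity("car", "automobile"))  # high cosine similarity
print(kv.similarity("car", "banana"))      # low cosine similarity
```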
A negative consequence of this property is the rapid degradation of vectors under arithmetic operations. Adding the vectors of two words usually captures what the words have in common (if the words are actually related in the real world), but increasing the number of summands very quickly destroys any practically valuable result. Summing the words of a single phrase is still feasible; summing several phrases no longer is. A different approach is needed.
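The degradation is easy to observe with gensim's most_similar, which averages the given word vectors before searching for neighbours (continuing the sketch above; the word lists are hypothetical):

```python
# Two related words still yield a sensible common neighbourhood...
print(kv.most_similar(positive=["car", "sport"], topn=5))

# ...but summing many words quickly washes the result out:
# the neighbours become generic and practically useless.
many = ["car", "sport", "engine", "race", "driver", "track", "fuel", "wheel"]
print(kv.most_similar(positive=many, topn=5))
```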
From the standpoint of everyday logic, how can any text be described? Try to name its subject and perhaps say a few words about its style. Texts about cars will obviously contain a fairly large number of occurrences of the word “car” and words close to it, may contain the word “sport”, the names of car brands, and so on. Texts on other topics, on the other hand, will contain such words far less often or not at all. Thus, having listed a sufficient number of possible topics, we can compute statistics of the presence of topic-specific words in the text and obtain a semantic text vector: a vector each element of which indicates how strongly the text relates to the topic encoded by that element.
The style of the text is, in turn, also determined by statistical methods: the author's characteristic words, filler words and pet expressions, the specifics of how phrases begin, and the placement of punctuation. Since we preserve the distinction between uppercase and lowercase letters during training and do not strip punctuation from the text, the Word2Vec dictionary is full of tokens like “text,” with the comma attached. And it is precisely such tokens that can be used to capture an author's style. Of course, reliably isolating a style requires truly huge text corpora, or at least a very distinctive author, but it is nevertheless easy to tell a newspaper article from a forum post or a tweet.
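Under such a setup the tokenizer can be as trivial as whitespace splitting, precisely so that case and attached punctuation survive (a hypothetical sketch, not necessarily the author's actual preprocessing):

```python
def tokenize(text):
    """Split on whitespace only, keeping case and attached punctuation,
    so that 'text,' and 'text' remain distinct dictionary entries."""
    return text.split()

print(tokenize("So, the text, as it is."))
# ['So,', 'the', 'text,', 'as', 'it', 'is.']
```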
Thus, to build a semantic vector of a text, we need to describe a sufficient number of stable clusters reflecting the subject and style of the text. The Word2Vec utility itself has a built-in clusterizer based on kMeans, and we will use it. The clusterizer divides all the words of the dictionary into a given number of clusters, and if the number of clusters is large enough, each cluster can be expected to indicate a fairly narrow subject of the text, or, more precisely, a narrow marker of subject or style. In my task I used two thousand clusters; a sketch of this step follows below. That is, the semantic vector of a text is two thousand elements long, and each element of the vector can be explained through the words belonging to the corresponding cluster.
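In the original word2vec tool this step corresponds to the -classes option; the sketch below reproduces it in Python with scikit-learn instead (MiniBatchKMeans is my substitution for speed; the cluster count is the one from the article, and kv comes from the earlier sketch):

```python
from sklearn.cluster import MiniBatchKMeans

N_CLUSTERS = 2000  # cluster count used in this article

# kv.vectors is the (vocab_size, dim) matrix of all word vectors.
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=0)
labels = kmeans.fit_predict(kv.vectors)
centers = kmeans.cluster_centers_  # (2000, dim) cluster centers

# Each cluster is explainable through the words assigned to it.
cluster_id = 0  # hypothetical cluster to inspect
print([w for w, l in zip(kv.index_to_key, labels) if l == cluster_id][:10])
```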
The relative density of words from each cluster describes the text well. Of course, each specific word is related to many clusters, to some more strongly, to others less so. Therefore, we first compute the semantic vector of a word: a vector whose elements are the distances from the word to the centers of the corresponding clusters in the Word2Vec vector space. Then, by adding up the semantic vectors of the individual words making up the text, we obtain the semantic vector of the entire text.
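A sketch of these two steps, continuing the code above. The article speaks of "distance" without specifying the exact measure; cosine similarity to the cluster centers and per-word normalization are my assumptions:

```python
import numpy as np

# Unit-normalize the cluster centers once (centers from the sketch above).
unit_centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)

def word_semantic_vector(word):
    """Semantic vector of a word: its cosine similarity to every cluster
    center (an assumption; the article says 'distance')."""
    v = kv[word]
    return unit_centers @ (v / np.linalg.norm(v))  # shape: (2000,)

def text_semantic_vector(tokens):
    """Semantic vector of a text: the sum of the semantic vectors of all
    in-vocabulary words, divided by the word count to get the relative
    density the article describes."""
    vecs = [word_semantic_vector(t) for t in tokens if t in kv]
    return np.sum(vecs, axis=0) / max(len(vecs), 1)
```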
The described algorithm, based on the relative frequency of words that mark the relevant topics, is good because it suits texts of any length, from a single word to infinity. At the same time, as we know, it is hard to find a sufficiently long text on a single subject: the subject of a text often drifts from its beginning to its end. A short text or message, on the contrary, cannot cover many topics precisely because of its brevity. As a result, the semantic vector of a long text is distinguished by markers of several topics, while a short text carries far fewer markers, but they are expressed much more strongly. Text length is not explicitly taken into account; nevertheless, the algorithm reliably spreads short and long texts apart in vector space.
How to use the semantic text vector?
Since each text is mapped to a vector in the semantic space, we can compute the distance between any two texts as the cosine measure between their vectors. Having a distance between texts, we can apply kMeans for clustering, or classify directly, this time in the vector space of texts rather than of individual words. For example, if the task is to filter a text stream (news, forums, tweets, etc.) down to only the topics that interest us, we can prepare a base of pre-labeled texts and, for each incoming text, compute the best-matching class (the maximum of the cosine measure averaged over the several best matches of each class, essentially a nearest-neighbour vote).
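A sketch of such a classifier (the number of best matches per class, k, and the data layout are my assumptions):

```python
from collections import defaultdict
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(text_vec, labeled_vecs, k=5):
    """labeled_vecs: iterable of (class_label, semantic_vector) pairs from
    the pre-labeled base. Score each class by the mean cosine measure over
    its k best matches and return the highest-scoring class."""
    sims = defaultdict(list)
    for label, vec in labeled_vecs:
        sims[label].append(cosine(text_vec, vec))
    scores = {label: np.mean(sorted(s, reverse=True)[:k])
              for label, s in sims.items()}
    return max(scores, key=scores.get)
```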
In this way, a rather difficult task was successfully solved: classifying texts into a large number of classes (several hundred), with the texts differing significantly in style (different sources, lengths, even languages of the messages) and with thematically related classes (one text can often belong to several classes). Unfortunately, the specific figures are under NDA, but the overall effectiveness of the approach is as follows: 76% accuracy within the top 3% of classes, 90% accuracy within the top 9%, and 99% accuracy within the top 44%. These results should be read like this: the classifier ranks all several hundred target classes by how well the text matches each class; if we then take the top 3% of classes, the target class appears in that list with 76% probability, and within the top 9% the probability already exceeds 90%. Without exaggeration, this is a remarkable result with great practical benefit for the customer.
I invite you to hear a more detailed report, with a full description of the algorithm, formulas, charts, and results, at the next Dialogue conference.
How else can the semantic vector be used?
As already mentioned, the semantic vector of a text consists of interpretable elements (no one is going to ponder all two thousand elements of the vector, but it is possible). Yes, they are not independent, but they nevertheless form a ready-to-use feature vector that can be fed into your favourite general-purpose classifier: SVM, trees, or deep networks.
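For example, a matrix of semantic vectors drops straight into scikit-learn (stand-in random data below; real rows would come from text_semantic_vector above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((500, 2000))   # stand-in for 500 semantic text vectors
y = rng.integers(0, 10, 500)  # stand-in class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # near chance on random data, of course
```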
Findings
The method of transforming a text of arbitrary length into a vector based on Word2Vec word representations really works and gives good results in text clustering and classification problems. The text features encoded by the semantic vector do not degrade as the text grows longer; on the contrary, they allow finer differentiation between long texts and clearly separate texts of substantially different lengths. The total amount of computation is modest: about one month on an ordinary server.
We are happy to answer your questions in the comments.