We use Wikipedia, a large structured source of multilingual texts, to improve the quality of text classification. The appeal of the approach is its high degree of automation and its independence from the particular classification problem being solved. The greatest effect, however, is expected on topic classification problems.
The basic idea is to extract from Wikipedia only those texts that help us solve our classification problem and to ignore the rest. If we classify texts about cats, we hardly need texts on quantum physics, although texts on other kinds of animals may be useful. Automatically separating the useful texts from the rest is the essence of the described approach.
Wikipedia, as you know, is a collection of articles on a wide variety of areas of knowledge and interests. A significant portion of the articles link to articles on the same subject in other languages; these are not translations, but independent articles on a shared topic. Most articles also fall into one or more categories, and the categories themselves are largely organized into a hierarchical tree. So the task of grouping Wikipedia articles by the topics we care about is solvable.
We use DBpedia, a pre-parsed and structured version of Wikipedia. DBpedia gives us all the information we need: article titles, their abstracts, article categories, and parent categories for each category. We start with the most widely represented language in Wikipedia, English. If your task has no English texts, or very few, start with whichever language you have plenty of documents in.
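As an illustration, here is a minimal sketch of pulling article-to-category edges out of a DBpedia dump. It assumes an N-Triples file with dcterms:subject links; the exact file name and layout depend on the DBpedia release you download, and the function name is illustrative.

```python
# A minimal sketch, assuming an N-Triples DBpedia file with article->category links,
# i.e. lines like:
# <.../resource/Cat> <http://purl.org/dc/terms/subject> <.../resource/Category:Felines> .
def read_article_categories(path):
    edges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and "purl.org/dc/terms/subject" in parts[1]:
                article = parts[0].strip("<>").rsplit("/", 1)[-1]
                category = parts[2].strip("<>").rsplit("/", 1)[-1]
                edges.append((article, category))
    return edges

# Category->parent edges can be read the same way from the SKOS hierarchy file,
# matching the skos:broader predicate instead.
```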
Step 1. Clustering Wikipedia
We concentrate on the categories of the articles and for now ignore their content. The categories form a graph, mostly tree-like, but with cycles as well. Articles are the end points of the graph (leaves), connected to one or several category nodes. We use the Node2Vec tool to get a vector representation of each category and each article; articles on similar subjects end up close to each other in the vector space.
We then cluster the articles by any convenient method into a rather large number of clusters (on the order of hundreds).
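A minimal sketch of step 1, assuming the edge lists have already been extracted from DBpedia as above. The tiny placeholder data and parameter values are illustrative; in practice the graph has millions of nodes and we cluster into hundreds of clusters.

```python
import networkx as nx
from node2vec import Node2Vec          # pip install node2vec
from sklearn.cluster import KMeans

# Placeholder edges; in practice these come from the DBpedia dumps.
article_to_category_edges = [("Cat", "Category:Felines"), ("Lion", "Category:Felines"),
                             ("Photon", "Category:Quantum_mechanics")]
category_to_parent_edges = [("Category:Felines", "Category:Carnivorans")]
article_titles = ["Cat", "Lion", "Photon"]

# Build the category graph: articles are leaves attached to their categories,
# categories are linked to their parent categories.
G = nx.Graph()
G.add_edges_from(article_to_category_edges)
G.add_edges_from(category_to_parent_edges)

# Learn a vector for every node (articles and categories alike).
n2v = Node2Vec(G, dimensions=64, walk_length=20, num_walks=10, workers=2)
emb = n2v.fit(window=5, min_count=1)

# Keep the article vectors and cluster them; hundreds of clusters in practice.
vectors = [emb.wv[title] for title in article_titles]
n_clusters = min(300, len(article_titles))
labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(vectors)
article_cluster = dict(zip(article_titles, labels))
```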
Step 2. Training a classifier on Wikipedia
Replace the article titles in the resulting clusters with their abstracts (Long Abstract and Short Abstract, roughly one paragraph of text per article). Now we have hundreds of clusters given as sets of texts. We take any convenient model and build a classifier for the multi-class problem: one cluster, one class. We used FastText.
The output is a model that takes a text as input and returns a vector of scores for how strongly the text belongs to each of our hundreds of cluster classes.
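A sketch of step 2 with FastText, assuming `cluster_abstracts` maps each cluster id from step 1 to the abstracts of its articles; the data here is a toy placeholder.

```python
import fasttext                         # pip install fasttext

# Toy placeholder: cluster id -> article abstracts collected in step 1.
cluster_abstracts = {
    0: ["The cat is a small domesticated carnivorous mammal.",
        "The lion is a large cat of the genus Panthera."],
    1: ["A photon is an elementary particle, the quantum of the electromagnetic field.",
        "Quantum mechanics describes nature at the scale of atoms."],
}

# FastText supervised format: one "__label__<cluster> <text>" line per example.
with open("wiki_clusters.train", "w", encoding="utf-8") as f:
    for cluster_id, abstracts in cluster_abstracts.items():
        for text in abstracts:
            f.write(f"__label__{cluster_id} {text.lower()}\n")

model = fasttext.train_supervised("wiki_clusters.train", epoch=25, wordNgrams=2)

# k=-1 returns scores for all cluster classes; this whole vector is what we use later.
labels, scores = model.predict("a text about cats and other animals", k=-1)
```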
If in the first step we clustered the Wikipedia articles by their content rather than by their categories, then, first, we would lose the category information, which is important, and second, we would get a degenerate system in which the same texts are both clustered and used to build the text classifier. The final quality would probably be worse than with the separate approach, although I did not check.
Step 3. Building a model on our own production data
We take a sample of our production data and feed each document to the model from step 2. The model returns a vector of scores, which we use as the feature vector of that document. Having processed our entire training set of production documents, we get a table in the standard form for machine learning: a class label and a set of numerical features. We call this table the training sample.
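Continuing the sketch, reusing the FastText `model` from the previous snippet: a hypothetical `cluster_score_vector` helper turns the FastText scores into a fixed-order feature vector, and the production documents here are toy placeholders.

```python
import numpy as np

def cluster_score_vector(ft_model, text):
    """Score vector over all Wikipedia-cluster classes, in a fixed label order."""
    labels, scores = ft_model.predict(text.replace("\n", " ").lower(), k=-1)
    by_label = dict(zip(labels, scores))
    return np.array([by_label[label] for label in sorted(by_label)])

# Toy placeholders for our own labeled production documents.
docs = ["invoice for cat food and veterinary services",
        "lecture notes on the quantum theory of light"]
doc_labels = ["pets", "science"]

X = np.vstack([cluster_score_vector(model, d) for d in docs])   # numerical features
y = np.array(doc_labels)                                        # our own class labels
```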
On this training sample we build a classifier that can evaluate the informativeness of individual features; decision trees and any of their random-forest variations are well suited. The most informative features turn out to be those clusters of Wikipedia articles whose topics are not only similar to the topics of our production documents but, most importantly, allow us to separate our own classes well. In the first iterations the histogram of feature informativeness is usually quite flat: a few informative clusters and a long tail of the remaining hundreds of features with almost equal informativeness.
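A sketch of the informativeness analysis on that training sample, reusing `X` and `y` from the previous snippet; the 20% cut-off below is just one example of the empirical threshold.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Each feature is one Wikipedia cluster; its importance shows how well that
# cluster's topic separates our own classes.
ranked = sorted(enumerate(forest.feature_importances_), key=lambda p: p[1], reverse=True)

# Keep roughly the top 10-30% of clusters for the next iteration (empirical cut-off).
keep = [cluster_id for cluster_id, _ in ranked[:max(1, len(ranked) // 5)]]
```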
Looking at this histogram, we determined the cut-off point empirically each time, and roughly 10 to 30% of the clusters passed to the next iteration. The essence of an iteration is that the articles from the selected informative clusters are pooled together and run through steps 1-3 again: they are re-clustered, the two classifiers are rebuilt, and everything ends with another analysis of the informativeness histogram. It usually takes 3-4 iterations.
It turned out that on our data, numeric tokens, especially year numbers, carry a very strong weight and pull the informativeness of an entire cluster onto themselves. The logical result was that clusters dedicated to annual sporting events became the most informative: masses of numbers and dates, and a narrow vocabulary. We had to remove all the numbers from the article abstracts (in the second step). Things got noticeably better: clusters of articles with genuinely relevant topics (as we imagined them) began to stand out. At the same time, there were also unexpected clusters that logically matched our production task and had the right vocabulary, but whose usefulness would have been very hard to guess a priori.
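The digit-stripping itself is trivial; a sketch of the kind of preprocessing applied to the abstracts before training in step 2:

```python
import re

def strip_numbers(text):
    """Remove all digit sequences (years, scores, dates) from an abstract."""
    return re.sub(r"\d+", " ", text)

strip_numbers("The 1998 FIFA World Cup was held in France.")
# -> "The      FIFA World Cup was held in France."
```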
Step 4. Finalizing the model
After several iterations of steps 1-3, we have a reasonable number of articles selected from Wikipedia whose topics help us separate our production documents. We extend this sample with the corresponding articles in the other languages of interest to us and build the final clusters, this time a few dozen. These clusters can be used in two ways: either build a classifier as in step 2 and use it to extend the numerical feature vector in your production task, or use these text sets as a source of additional vocabulary and embed them directly in your production classifier. We took the second path.
Our production classifier is an ensemble of two models: a stripped-down naive Bayes and xgboost. Naive Bayes works on long grams, from 1 to 16 tokens, and each gram found in a text tilts the overall score toward one of the classes; Bayes itself does not make the final decision, it only outputs the sum of gram weights for each class. Xgboost takes the Bayes output, the outputs of other classifiers, and some numerical features computed from the text directly, and it is xgboost that produces the final assessment. This approach makes it easy to plug any set of texts into the gram-based Bayes model, including the sets of Wikipedia articles we obtained, while xgboost then looks for patterns in the typical reactions of the Wikipedia clusters to production texts.
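A very rough sketch of the ensemble shape, not our actual implementation: the gram weights, texts, and labels are toy placeholders, and the real gram-based Bayes model is considerably more involved.

```python
import numpy as np
import xgboost as xgb                   # pip install xgboost

classes = ["pets", "science"]
# Toy gram weights: one weight per class for every known gram (1 to 16 tokens long).
gram_weights = {"domestic cat": np.array([2.0, -1.0]),
                "quantum field": np.array([-1.5, 2.5])}

def gram_bayes_scores(text):
    """Sum the per-class weights of every known gram in the text; no final decision here."""
    tokens = text.lower().split()
    scores = np.zeros(len(classes))
    for n in range(1, 17):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in gram_weights:
                scores += gram_weights[gram]
    return scores

# xgboost takes the Bayes score vectors (plus any other numerical features)
# and makes the final decision.
texts = ["my domestic cat sleeps all day", "notes on quantum field theory"]
labels = [0, 1]
X_ens = np.vstack([gram_bayes_scores(t) for t in texts])
booster = xgb.XGBClassifier(n_estimators=50, max_depth=2).fit(X_ens, labels)
```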
Results and conclusions
The first result was an increase from a notional 60% accuracy to 62%. When, in step 4, we replaced the abstracts of the Wikipedia articles with the downloaded full articles, accuracy rose to 66%. The result is logical: an abstract is two or three sentences, while a full article is orders of magnitude larger. More linguistic material, stronger effect.
One would expect that running the whole procedure on the full article texts rather than the abstracts would give an even greater quality gain, but there we hit a technical problem of scale: it is hard to download and process the whole of Wikipedia, or a significant part of it (at least if you start from the first iteration). Likewise, if from the start you use not only English but all languages of interest, you can gain a bit more; in that case the processed volume grows by a few times rather than by orders of magnitude, as in the first case.
Semantic Document Vector
For each document we build a vector of how it relates to the chosen topics, based on the Wikipedia categories. The vector is constructed either by the method described in step 3 or by our gram-based Bayes model. Production documents can then be clustered by these vectors, giving a grouping of production documents by topic. All that remains is to assign hashtags, and every new document enters the database already tagged, so users can search by tags. That is, if you attach the tags explicitly and visibly to the user; it looks fashionable, although I am not a fan.
Adaptive search
A more interesting use of semantic document vectors is adaptive search. By observing a user's activity, which documents they linger on and which they do not even read, we can outline the user's interests, both in the long term (after all, users have a division of responsibilities too, and each mostly searches for their own things) and within the current search session.
Documents with similar topics have similar semantic vectors with a high cosine similarity, which makes it possible to score documents in the search results on the fly by how well they are expected to match the user's interests, and to boost the relevant documents in the ranking.
As a result, even with identical search queries, the results can be personalized for each user: depending on which documents interested them at the previous step, the next search step adjusts to the user's needs, even if the query itself has not changed.
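A minimal sketch of the re-ranking idea, assuming each document and the user's current interest profile are already represented by semantic vectors; the vectors below are toy placeholders.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# The user profile could be, e.g., a running average of the vectors of documents
# the user dwelt on.
user_profile = np.array([0.9, 0.1, 0.0])

search_results = {"doc_about_cats":  np.array([0.8, 0.2, 0.1]),
                  "doc_about_banks": np.array([0.1, 0.1, 0.9])}

# Boost documents whose semantic vector is close to the user's current interests.
reranked = sorted(search_results,
                  key=lambda d: cosine(search_results[d], user_profile),
                  reverse=True)
```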
We are currently working on the problem of adaptive search.
Business Hypothesis Testing
The business side periodically comes up with bright ideas that are very hard to implement, such as: learn to find documents from a description of them, without a labeled training sample and without the option of handing assessors a set of documents to label. This usually happens when the target documents are rare relative to the overall document flow, so that if you give assessors a pool of 10 thousand unfiltered documents, you get 1-2 relevant ones back, or even fewer.
Our approach is to build an iterative learning process based on semantic vectors. In the first step we find several texts that define our target topics; these could be Wikipedia articles or texts from other sources. For each text we compute its semantic vector. If the target topic is complex, set algebra applies: unions, intersections, and the exclusion of some topics from others. For example, given Wikipedia articles about “Research and Development” and about “Cosmetics”, the intersection of the sets gives “R&D about cosmetics”.
All documents in the database can be ranked by how well they match the specified topics, and the set algebra then works on the documents themselves as follows: a document is considered relevant to a topic if its semantic vector is closer to the vector of the Wikipedia articles of that topic than the average over the base. Intersection means the document's semantic vector is simultaneously closer to both topics than the average over the base; the other operations are analogous.
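A sketch of this set algebra on semantic vectors; the document and topic vectors are toy placeholders, and the hypothetical `matches` helper implements the "closer than the average over the base" rule.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy semantic vectors for the documents in the base and for two topics.
doc_vectors = {"d1": np.array([0.7, 0.6]),
               "d2": np.array([0.1, 0.9]),
               "d3": np.array([0.9, 0.1])}
topic_rnd = np.array([1.0, 0.2])         # "Research and Development"
topic_cosmetics = np.array([0.2, 1.0])   # "Cosmetics"

def matches(topic_vec):
    """Documents whose vector is closer to the topic than the average document in the base."""
    sims = {d: cosine(v, topic_vec) for d, v in doc_vectors.items()}
    avg = sum(sims.values()) / len(sims)
    return {d for d, s in sims.items() if s > avg}

# Intersection: "R&D about cosmetics" = documents matching both topics at once;
# union and exclusion work the same way with | and -.
rnd_about_cosmetics = matches(topic_rnd) & matches(topic_cosmetics)
```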
We then select a set of a few hundred documents that are closest to all the positive topics and, at the same time, furthest from all the negative topics (if we are not interested in financial research, we take articles from the “Finance” category as a negative example). We give these documents to the assessors, who find some positive examples among them; based on those examples we look for other documents with similar semantic vectors, label those too, and in the end we have enough positive-class documents to build any convenient classifier. Several iterations may be needed.
Summary
The described approach lets you automatically, without manual analysis, select from Wikipedia or another source the sets of texts that help solve your classification problem. Simply plugging the Wikipedia clusters into a working classifier can yield a significant quality gain without requiring any adaptation of the classifier itself.
Well, adaptive search is interesting.