About three years ago I had occasion to build a text classifier. In this article I will talk about how it worked and, more generally, about some aspects of implementing and testing such algorithms.
Classification
Classification, according to Wikipedia, is
one of the tasks of information retrieval, which consists in assigning a document to one of several categories based on the content of the document.
This is what we will do.
I don't think I need to explain why this is useful: document classification has long been used in information retrieval, plagiarism detection, and so on.
Collection
To train a classifier (assuming ours is a learning one), you need some collection of documents. The same goes for testing it.
For this there are various test collections of texts, most of which are easy to find on the Internet. I used the RCV1 collection (Reuters Corpus Volume 1). This is a collection of manually categorized Reuters news stories covering the period from August 1996 to August 1997. It contains about 810,000 news articles in English and is constantly used for all kinds of research in information retrieval and text processing.
The collection's categories form a hierarchy (one document can belong to several categories at once, for example, economics and oil). Each category has its own code, and the news items come already tagged with these codes (there are about eighty categories in total). For example, four categories head the hierarchy: CCAT (Corporate / Industrial), ECAT (Economics), GCAT (Government / Social), and MCAT (Markets).
A full list of the categories can be found here. Now let's deal with a couple more important points. First, before training the classifier it is worth stemming the texts and throwing out the stop words.
- Stemming. For classification it does not matter in which form the word "oil" occurs ("oil", "oils", "oiled"): we will still understand that it is about oil. Reducing words to a common form is called stemming. For stemming English texts there are ready-made solutions, for example this implementation of the Porter algorithm: a Java library written at Indiana University and distributed freely. I did not look for a Russian one; perhaps something exists there too.
- Stop words. Everything is clear here: words like "and", "but", "by", "from", "under" and so on do not affect classification at all, since they occur everywhere. In practice, one usually takes a certain number of the most frequent words in the collection and throws them out (numbers can be thrown out as well). You can simply take a list from here.
Such preprocessing is an established way to reduce the dimensionality of the problem and improve performance.
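To make this step concrete, here is a minimal sketch in Python. NLTK's Porter stemmer stands in for the Java library mentioned above, and the tiny stop list is purely illustrative - in practice you would take the most frequent words of the collection:

```python
# A minimal preprocessing sketch: stop-word removal plus stemming.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Illustrative stop list; in practice, use the collection's most frequent words.
stop_words = {"and", "but", "by", "from", "under", "the", "of"}

def preprocess(text):
    # Lowercase and keep only alphabetic tokens (this drops numbers too),
    # filter out stop words, and reduce the rest to their stems.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Oil prices and the oils market"))
# -> ['oil', 'price', 'oil', 'market']
```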
Support Vector Machine
Now let's abstract away from text documents and consider one of the existing classification algorithms.
I will not burden you with details and mathematical derivations here; if you are interested, they can be found, for example, on Wikipedia or on Habré. I'll describe the idea briefly.
Suppose we have a set of points in n-dimensional space, each of which belongs to exactly one of two classes. If we can somehow separate them into two sets with a hyperplane of dimension n-1, the problem is solved at once. This does not always happen, but if the set is linearly separable, we are in luck:

The greater the distance from the separating hyperplane to the nearest point of the set, the better the classifier works. Out of all possible hyperplanes, we need to find exactly this one. It is called the optimal separating hyperplane:

The search is based on a simple idea: construct two parallel hyperplanes with no points of the set between them, and maximize the distance between these hyperplanes. The points closest to the parallel hyperplanes are called support vectors; they gave the method its name.
And what if the sets are not linearly separable? Most likely, that will be the case. Then we simply allow the classifier to split the sets so that some points may fall between the hyperplanes (this is the classifier's error, which we try to minimize). Once such a hyperplane is found, classification becomes a matter of technique: we just check on which side of the hyperplane the vector whose class we need to determine lies.
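For reference, the standard soft-margin formulation behind this picture (the problem that implementations such as SVM light solve) can be written as follows; the slack variables ξᵢ allow points to fall on the wrong side of the margin, and the parameter C trades off margin width against classification error:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to} \qquad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m.
```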
Ready-made implementation
Of course, you can implement the algorithm yourself. But when good ready-made solutions exist, that can waste a lot of time, and there is no guarantee your implementation will turn out better, or even as good.
There is a great implementation of the support vector machine called SVM light. It was created at the Technical University of Dortmund and is distributed free of charge (for research purposes only). There are several variants of the implementation (for example, one that can learn from relational databases), but you can take the simplest one, which works with text files. This classifier consists of two executables: a training module and a classification module.
Training
The training module accepts a text file with training data. The file must contain one vector per line, labeled +1 or -1 depending on which of the two classes it belongs to.
It looks like this:
-1 0 0 0 0 0 6 6 0 0 0 6.56 0 0 0 0 0 0 0 45.56 0 0 0 0 0.5 0 0 0 0 0 0
The module will also understand this entry:
-1 7:6 11:6.56 19:45.56 25:0.5 (that is, you can list only the non-zero values together with their indices).
Classification
The second module takes as input a text file with the data to be classified; the format is the same as for training. After classification, a file res.txt is created in which the results are recorded.
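Usage is roughly as follows (the file names here are mine; the training module produces a model file that the classification module then reads):

```
svm_learn train.dat model.dat
svm_classify test.dat model.dat res.txt
```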
SVM for classifying documents
Now let's see how SVM can be applied to document classification. Where do the vectors come from? And how do we determine which categories a document belongs to if there are 80 of them, while SVM can only decide between two classes?
A document is a vector
Representing a document as a vector is not difficult. First you need to build a dictionary of the collection: a list of all the words that occur in it (for the RCV1 collection, that is about 47,000 words). The number of words becomes the dimensionality of the vectors. The document is then represented as a vector whose i-th element is a measure of the occurrence of the i-th dictionary word in the document. This can be +1 or -1 depending on whether the word occurs in the document, or the number of occurrences, or the word's "share" of the document. There are many options. The result is a vector of large dimension in which most of the elements are zeros.
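Here is a sketch of this step in Python, writing vectors directly in the sparse format shown earlier. The function names are mine, and the "share" measure is just one of the options mentioned above:

```python
# A sketch of representing documents as vectors in SVM light's format.
# `docs` holds preprocessed documents (lists of stemmed tokens).

def build_dictionary(docs):
    # The dictionary maps every word of the collection to a 1-based
    # index; the number of words is the dimensionality of the vectors.
    words = sorted({w for doc in docs for w in doc})
    return {w: i + 1 for i, w in enumerate(words)}

def to_svmlight_line(label, doc, dictionary):
    # Count occurrences, divide by document length ("share" measure),
    # and emit only non-zero features as index:value pairs in
    # ascending index order.
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    feats = sorted((dictionary[w], c / len(doc)) for w, c in counts.items())
    return " ".join([f"{label:+d}"] + [f"{i}:{v:g}" for i, v in feats])

docs = [["oil", "price", "oil"], ["bank", "rate"]]
dictionary = build_dictionary(docs)
print(to_svmlight_line(-1, docs[0], dictionary))
# -> "-1 2:0.666667 3:0.333333"
```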
80 categories - 80 classifiers
If we have 80 categories, the classifier must be run 80 times. More precisely, we need 80 trained classifiers, one per category, each deciding whether a document belongs to its category or not.
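A sketch of that one-vs-rest setup, reusing to_svmlight_line() from above (the file names and the assumption that svm_learn is on the PATH are mine):

```python
# One-vs-rest: one training file and one model per category. A document
# gets label +1 for a category's classifier if it belongs to that
# category and -1 otherwise; at classification time a new document is
# run through all 80 models to collect its categories.
import subprocess

def train_per_category(docs, doc_categories, categories, dictionary):
    for cat in categories:
        train_file = f"train_{cat}.dat"
        with open(train_file, "w") as f:
            for doc, cats in zip(docs, doc_categories):
                label = 1 if cat in cats else -1
                f.write(to_svmlight_line(label, doc, dictionary) + "\n")
        # Train one SVM light model per category.
        subprocess.run(["svm_learn", train_file, f"model_{cat}.dat"],
                       check=True)
```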
Tests and metrics
To test the classifier, you can use part of the same collection (as long as that part was not used for training), comparing the classifier's results with the categories assigned manually.
To evaluate the performance of classifiers, several metrics are used:
- Accuracy = 1 - ((r - kr) + kw) / N
In essence, the share of correct decisions: the subtracted fraction counts the classifier's mistakes, that is, the documents it missed (r - kr) plus the documents it assigned to the category wrongly (kw), over all N test documents.
- Precision = kr / n
- Recall = kr / r
Here
kw is the number of documents that the classifier incorrectly marked as belonging to the desired category.
kr is the number of documents that the classifier correctly marked as belonging to the desired category.
n is the total number of documents that the classifier marked as belonging to the desired category (so n = kr + kw).
r is the total number of documents that actually belong to the desired category.
N is the total number of documents in the test set.
Of these metrics, Precision and Recall are the most important: they show how often we are wrong when assigning a category to a document, and how many of the category's documents we manage to find. The closer both metrics are to one, the better.
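As a sketch, the same quantities can be computed from the classifier's +1/-1 decisions and the true labels (the variable names follow the definitions above):

```python
# Precision and Recall from true labels and predictions (+1/-1 per doc).
def precision_recall(true_labels, predictions):
    kr = sum(1 for t, p in zip(true_labels, predictions)
             if t == 1 and p == 1)                # correctly assigned
    n = sum(1 for p in predictions if p == 1)     # assigned by the classifier
    r = sum(1 for t in true_labels if t == 1)     # actually in the category
    precision = kr / n if n else 0.0
    recall = kr / r if r else 0.0
    return precision, recall

print(precision_recall([1, 1, -1, -1, 1, -1], [1, -1, -1, 1, 1, -1]))
# -> (0.6666666666666666, 0.6666666666666666)
```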
And a small example. Building the vectors with each word's "share" of the document as the measure (the number of occurrences of the word divided by the total number of words in the document), training the classifier on 32,000 documents, and classifying 3,000, I obtained the following results:
Precision = 0.95
Recall = 0.75
Not bad results; with numbers like these, the classifier can be used to process new data.
Links
About RCV1:
- About RCV1
- Here you can find the collection
About SVM:
- Wikipedia
- A great article