I want to share my experience of taking part in a Kaggle competition for the first time (the Bag of Words tutorial competition). Although I did not achieve amazing results, I will describe how I looked for and found ways to improve on the tutorial examples (and, for that purpose, I will also briefly describe those examples), and then focus on analyzing my mistakes. I should warn you that the article will mostly interest newcomers to text mining. I describe most of the methods briefly and simply, with links to more precise definitions, since my goal is to review practice rather than theory. Unfortunately, the competition has already ended, but its materials are still worth reading. The code for this article is available
here .
Competition Overview
The task itself is to analyze the sentiment of text. To do this, we take movie reviews and ratings from the site IMDb. Reviews rated >= 7 are considered positive, those with a lower rating negative. The task: having trained a model on the training data, where each text is labelled (negative/positive), predict that label for the texts in the test set. Prediction quality is measured by the area under the
ROC curve (AUC). You can follow the link for details; in short, the closer this value is to 1, the more accurate the prediction.
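For reference, this metric can be computed with scikit-learn's roc_auc_score. Here is a toy example of my own with made-up labels and scores (not data from the competition):

from sklearn.metrics import roc_auc_score

# Toy data: true labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_scores))  # 0.75; 1.0 would mean a perfect ranking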
All the examples are written in Python and use the
scikit-learn library, which provides ready-made implementations of all the classifiers and vectorizers we need.
Methods for solving the problem
We have plain text at our disposal, while all data mining classifiers require numerical vectors as input. So the first task is to decide how to convert a block of text into a vector (vectorization).
The simplest method is Bag of Words, with which the first tutorial example begins. The method consists in building a common pool of the words used, each of which is assigned its own index. Suppose we have two simple texts:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games.
For each unique word, we assign the following index:
"John": 1,
"likes": 2,
"to": 3,
"watch": 4,
"movies": 5,
"also": 6,
"football": 7,
"games": 8,
"Mary": 9,
"too": 10
Now each of these sentences can be represented as a vector of dimension 10, in which the number x in the i-th position means that word number i occurs in the text x times:
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
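As a quick illustration (my own snippet, not part of the tutorial), scikit-learn's CountVectorizer builds exactly this kind of representation; note that it orders the columns by its own sorted vocabulary, so the positions differ from the manual numbering above, but the counts are the same (get_feature_names_out requires a recent scikit-learn; older versions call it get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["John likes to watch movies. Mary likes movies too.",
         "John also likes to watch football games."]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # the learned vocabulary (lowercased, sorted)
print(counts.toarray())                    # one count vector per sentence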
You can read more on
Wikipedia or in this review
article dedicated to the same competition.
In general, everything is simple, but the devil is in the details. First, the example removes stop words
(a, the, am, i, is, ...) that carry no semantic meaning on their own. Second, operations on this matrix are performed in RAM, so the amount of memory limits the allowable size of the matrix. To avoid a
"MemoryError", I had to reduce the pool of words to the 7000 most frequent ones.
Random Forest is used as the classifier in all of the tutorial examples.
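For concreteness, here is a rough sketch of that setup: a count vectorizer that drops English stop words and caps the vocabulary at 7000 features, feeding a Random Forest. The variable clean_train_reviews (the preprocessed review texts) is a placeholder, and the exact parameters are my own choice, not necessarily the tutorial's:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# clean_train_reviews: list of preprocessed review strings (placeholder name)
count_vectorizer = CountVectorizer(analyzer="word",
                                   stop_words="english",  # drop a, the, is, ...
                                   max_features=7000)     # keep only the 7000 most frequent words
train_data_features = count_vectorizer.fit_transform(clean_train_reviews).toarray()

forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_data_features, train["sentiment"])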
The tutorial then encourages us to experiment with various parameters, which we will do. The first obvious idea is to add
lemmatization, i.e. to reduce all words to their dictionary forms. For this we use a function from the nltk library:
from nltk import WordNetLemmatizer  # requires the WordNet corpus: nltk.download('wordnet')

# Reduce every remaining word to its dictionary form (lemma)
wnl = WordNetLemmatizer()
meaningful_words = [wnl.lemmatize(w) for w in meaningful_words]
Another good idea is to slightly change the text vectorization method. Instead of the simple feature "how many times a word occurs in a text", we can use a slightly more complex but also well-known one:
tf-idf (it weights words according to how rare they are across the collection of documents).
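With scikit-learn this is essentially a one-line change to the sketch above: swap CountVectorizer for TfidfVectorizer (again my own sketch, with the same 7000-column cap and the same placeholder clean_train_reviews):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same stop-word and vocabulary settings, but tf-idf weights instead of raw counts
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=7000)
train_data_features = tfidf_vectorizer.fit_transform(clean_train_reviews).toarray()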
Submitting the results of the original and the modified program, we get an improvement from 0.843 to 0.844. This is not much. Using this example as a basis, you could experiment thoroughly and get much better results, but I did not have much time and, consequently, not many attempts (submissions are limited to 5 per day). So I moved on to the next parts.
The next parts of the tutorial are built around a tool called
Word2vec, which gives us numerical vector representations of words. These vectors also have interesting properties: for example, the most similar words are the ones whose vectors lie closest together.
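Roughly, training such a model looks like this with the gensim implementation (a sketch, assuming a recent gensim 4.x; older versions use size= instead of vector_size=). Here tokenized_reviews is a placeholder for the tokenized review texts, i.e. lists of lowercase words:

from gensim.models import Word2Vec

# tokenized_reviews: e.g. [["john", "likes", "movies", ...], ...]
model = Word2Vec(tokenized_reviews, vector_size=300, window=10, min_count=40, workers=4)

# Similar words end up close to each other in vector space
print(model.wv.most_similar("awful", topn=5))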
So, after transforming all the words, we get a list of vectors for each review. How do we turn it into a single vector? The first option is simply to take the arithmetic mean (the
average vector). The result is even worse than Bag of Words (0.829).
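A minimal sketch of this averaging, assuming model is the Word2Vec model from above and tokenized_reviews is the same placeholder list of tokenized reviews:

import numpy as np

def average_vector(words, model):
    # Mean of the vectors of all words the model knows; zero vector if none are known
    vecs = [model.wv[w] for w in words if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

train_vectors = np.array([average_vector(review, model) for review in tokenized_reviews])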
How could this method be improved? Clearly there is no point in averaging over all the words: too many of them are noise that does not affect the sentiment of the text. Intuitively, it seems that evaluative adjectives and perhaps a few other words should have the most influence. Fortunately, there is a family of methods known as
feature selection that let us estimate how strongly a given parameter (in our case, a word) correlates with the target variable (the sentiment). Let's apply one of these methods and look at the selected words:
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# Pick the 50 words whose counts correlate most strongly with the label
select = SelectKBest(chi2, k=50)
X_new = select.fit_transform(train_data_features, train["sentiment"])

names = count_vectorizer.get_feature_names()  # in recent scikit-learn: get_feature_names_out()
selected_words = np.asarray(names)[select.get_support()]
print(', '.join(selected_words))
The result is a list of words that confirms the theory:
acting, amazing, annoying, avoid, awful, bad, badly, beautiful, best, boring, brilliant, crap, dull, even, excellent, fantastic, favorite, love, loved, mess, minutes, money, no, nothing, oh, pathetic, perfect, plot, pointless, poor, ridiculous, save, script, stupid, superb, supposed, terrible, waste, wasted, why, wonderful, worse, worst
If we now compute the average vector but take into account only the words from this top list (which, after a couple of experiments, I expanded to 500 words), we get a better result (0.846), which even slightly beats the bag of centroids from the next example of this competition. In this solution (let's call it "average of top words"), Random Forest was again used as the classifier.
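The only change relative to the plain average vector is the filter on the selected words (a sketch; selected_words comes from the feature-selection snippet above, run with k=500 instead of 50):

import numpy as np

top_words = set(selected_words)

def average_top_vector(words, model, top_words):
    # Average only the words that made it into the feature-selection top list
    vecs = [model.wv[w] for w in words if w in top_words and w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)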
Working on my mistakes
At this point my attempts, and the competition itself, came to an end, and I went to the forum to find out how more experienced participants had solved the problem. I will not touch on the
solutions with truly excellent results (above 0.96), because they are usually quite complex and multi-stage. But I will point out a few approaches that achieved high accuracy with simple methods.
For example, a
post mentioning that a good result had been achieved with simple tf-idf and
logistic regression prompted me to explore other classifiers. Other things being equal (TfidfVectorizer limited to 7000 columns), LogisticRegression gives 0.88,
LinearRegression 0.91, and
Ridge regression 0.92.
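A sketch of that comparison (X_train, X_val, y_train, y_val are placeholders for a train/validation split of the tf-idf features and labels; the raw predicted scores are fed to roc_auc_score):

from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.metrics import roc_auc_score

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear regression": LinearRegression(),
    "ridge regression": Ridge(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # Regressors rank by their raw predictions; logistic regression by its decision function
    scores = (model.decision_function(X_val) if hasattr(model, "decision_function")
              else model.predict(X_val))
    print(name, roc_auc_score(y_val, scores))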
If I had used linear regression in my solution (average of top words) instead of Random Forest, I would have gotten 0.93 instead of 0.84. So my first mistake was assuming that the vectorization method mattered more than the choice of classifier. The tutorial material nudged me toward that mistaken belief, but I should have checked everything myself.
I picked up the second idea by looking more closely at the code of that example. The key is exactly how the TfidfVectorizer was used: it was fit on the combined test and training data, no limit was placed on the maximum number of columns, and features were formed not only from individual words but also from
word pairs (the parameter ngram_range=(1, 2)). If your program does not crash with a MemoryError at this volume, this significantly improves the prediction accuracy (the author reported a result of 0.95). Conclusion number two: accuracy can be bought with a larger amount of computation rather than with especially clever methods. For this you can, for example, turn to a cloud computing service if your own computer is not very powerful.
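A sketch of that recipe (train_texts and test_texts are placeholders for the raw review texts; the key points are fitting the vectorizer on both sets, not capping max_features, using ngram_range=(1, 2), and keeping the matrix sparse to stay within memory):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams, no feature cap
vectorizer.fit(list(train_texts) + list(test_texts))

X_train = vectorizer.transform(train_texts)  # keep it sparse: no .toarray()
X_test = vectorizer.transform(test_texts)

model = Ridge().fit(X_train, train["sentiment"])
predictions = model.predict(X_test)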
In conclusion, I want to say that participating in a Kaggle competition was extremely interesting, and I encourage those who for whatever reason have not yet decided to take the plunge. Of course, there are much harder contests on Kaggle; for your first one, choose a task that matches your abilities. And one last tip: read the forum. Even while a competition is running, people publish useful tips, ideas, and sometimes even entire solutions there.