
Kaggle: determining the sentiment of texts

Hi, Habr!



# Data Science for newbies
My name is Gleb Morozov, and you may already know me from my previous articles. By popular request, I continue to describe my experience with the educational projects of MLClass.ru (by the way, if you haven't had time yet, I recommend downloading the materials while they are still available).

Data


The data for this work is provided as part of the Bag of Words competition held on Kaggle and consists of a training sample of 25,000 reviews from the IMDB website, each belonging to one of two classes: negative or positive. The task is to predict which class each review from the test sample belongs to.

 library(magrittr)
 library(tm)
 require(plyr)
 require(dplyr)
 library(ggplot2)
 library(randomForest)

Load the data into RAM.

 data_train <- read.delim("labeledTrainData.tsv",header = TRUE, sep = "\t", quote = "", stringsAsFactors = F) 

The resulting table consists of three columns: id, sentiment and review. The last column is the one we will be working with. Let's take a look at an actual review (since reviews are quite long, I will show only the first 700 characters).

 paste(substr(data_train[1,3],1,700),"...")
 ## [1] "\"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely lik ..."

You can see that the text contains garbage in the form of HTML tags (for example, <br />).
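
The pipeline below does not remove these tags explicitly, so, as a small optional sketch (my own addition, not part of the original solution), they could be stripped with a regular expression before building the corpus:

 # optional: strip HTML tags such as <br /> from the reviews before any other preprocessing
 data_train$review <- gsub("<[^>]+>", " ", data_train$review)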

Bag of Words


Bag of Words is a model often used in text processing that represents a text as an unordered collection of its words. The model is usually presented as a matrix in which rows correspond to individual texts and columns correspond to the words they contain; each cell holds the number of occurrences of that word in the corresponding document. This model is convenient because it translates the human language of words into a computer-friendly language of numbers.
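
To make this more concrete, here is a tiny toy illustration (my own, not part of the competition code) of what such a matrix looks like for two short "documents":

 toy_corpus <- Corpus(VectorSource(c("good movie very good", "bad movie")))
 as.matrix(DocumentTermMatrix(toy_corpus))
 # the result has one row per document and one column per term;
 # the first row should count "good" twice and "movie" and "very" once each,
 # the second row should count "bad" and "movie" once each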

Data processing


For data processing, I will use the capabilities of the tm package. In the following code block, the following actions are performed:

- the reviews are converted into a corpus;
- all text is converted to lower case;
- each document is converted to plain text;
- punctuation is removed;
- English stop words and the word "movie" are removed;
- the remaining words are stemmed.

 train_corpus <- data_train$review %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeWords, c("movie", stopwords("english"))) %>%
   tm_map(., stemDocument)

Now create a frequency matrix.

 frequencies <- DocumentTermMatrix(train_corpus)
 frequencies
 ## <<DocumentTermMatrix (documents: 25000, terms: 92244)>>
 ## Non-/sparse entries: 2387851/2303712149
 ## Sparsity : 100%
 ## Maximal term length: 64
 ## Weighting : term frequency (tf)

Our matrix contains more than 90,000 terms, which means a model built on it would have more than 90,000 features! We need to shrink it, and for this we use the fact that reviews contain many rarely occurring words, i.e. the matrix is sparse. I decided to trim it quite aggressively (so that the model fits into RAM, given the 25,000 objects in the training set) and keep only the words that occur in at least 5% of the reviews.

 sparse <- removeSparseTerms(frequencies, 0.95)
 sparse
 ## <<DocumentTermMatrix (documents: 25000, terms: 373)>>
 ## Non-/sparse entries: 1046871/8278129
 ## Sparsity : 89%
 ## Maximal term length: 10
 ## Weighting : term frequency (tf)

As a result, 373 terms remain in the matrix. Let's transform the matrix into a data frame and add a column with the target attribute.

 reviewSparse <- as.data.frame(as.matrix(sparse))
 vocab <- names(reviewSparse)
 reviewSparse$sentiment <- data_train$sentiment %>%
   as.factor(.) %>%
   revalue(., c("0" = "neg", "1" = "pos"))
 row.names(reviewSparse) <- NULL

Now let's train a Random Forest model on the resulting data. I use 100 trees due to memory limitations.

 model_rf <- randomForest(sentiment ~ ., data = reviewSparse, ntree = 100) 

Using the trained model, we will generate predictions for the test data.

 data_test <- read.delim("testData.tsv", header = TRUE, sep = "\t", quote = "", stringsAsFactors = F)
 test_corpus <- data_test$review %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeWords, c("movie", stopwords("english"))) %>%
   tm_map(., stemDocument)
 test_frequencies <- DocumentTermMatrix(test_corpus, control = list(dictionary = vocab))
 reviewSparse_test <- as.data.frame(as.matrix(test_frequencies))
 row.names(reviewSparse_test) <- NULL
 sentiment_test <- predict(model_rf, newdata = reviewSparse_test)
 pred_test <- as.data.frame(cbind(data_test$id, sentiment_test))
 colnames(pred_test) <- c("id", "sentiment")
 pred_test$sentiment %<>% revalue(., c("1" = "0", "2" = "1"))
 write.csv(pred_test, file = "Submission.csv", quote = FALSE, row.names = FALSE)

After uploading the submission to Kaggle and evaluating it, the model received an AUC score of 0.73184.

Let's try to approach the problem from the other side. When building and trimming the frequency matrix, we keep the most common words, but most likely many of them occur often in movie reviews without reflecting the sentiment of the review, for example, words like movie, film, etc. However, since we have a training sample with labeled sentiment, we can identify the words whose frequencies differ significantly between negative and positive reviews.

To begin with, we will create a frequency matrix for negative reviews.

 freq_neg <- data_train %>%
   filter(sentiment == 0) %>%
   select(review) %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeNumbers) %>%
   tm_map(., removeWords, c(stopwords("english"))) %>%
   tm_map(., stemDocument) %>%
   DocumentTermMatrix(.) %>%
   removeSparseTerms(., 0.999) %>%
   as.matrix(.)
 freq_df_neg <- colSums(freq_neg)
 freq_df_neg <- data.frame(word = names(freq_df_neg), freq = freq_df_neg)
 rownames(freq_df_neg) <- NULL
 head(arrange(freq_df_neg, desc(freq)))
 ##   word  freq
 ## 1 movi 27800
 ## 2 film 21900
 ## 3  one 12959
 ## 4 like 12001
 ## 5 just 10539
 ## 6 make  7846

And for positive reviews.

 freq_pos <- data_train %>%
   filter(sentiment == 1) %>%
   select(review) %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeNumbers) %>%
   tm_map(., removeWords, c(stopwords("english"))) %>%
   tm_map(., stemDocument) %>%
   DocumentTermMatrix(.) %>%
   removeSparseTerms(., 0.999) %>%
   as.matrix(.)
 freq_df_pos <- colSums(freq_pos)
 freq_df_pos <- data.frame(word = names(freq_df_pos), freq = freq_df_pos)
 rownames(freq_df_pos) <- NULL
 head(arrange(freq_df_pos, desc(freq)))
 ##   word  freq
 ## 1 film 24398
 ## 2 movi 21796
 ## 3  one 13706
 ## 4 like 10138
 ## 5 time  7889
 ## 6 good  7508

Let us combine the resulting tables and calculate the difference between the frequencies.

 freq_all <- merge(freq_df_neg, freq_df_pos, by = "word", all = T)
 freq_all$freq.x[is.na(freq_all$freq.x)] <- 0
 freq_all$freq.y[is.na(freq_all$freq.y)] <- 0
 freq_all$diff <- abs(freq_all$freq.x - freq_all$freq.y)
 head(arrange(freq_all, desc(diff)))
 ##    word freq.x freq.y diff
 ## 1  movi  27800  21796 6004
 ## 2   bad   7660   1931 5729
 ## 3 great   2692   6459 3767
 ## 4  just  10539   7109 3430
 ## 5  love   2767   5988 3221
 ## 6  even   7707   5056 2651

Excellent! As expected, among the words with the largest difference we see terms like bad , great and love . But there are also ordinary common words like movie : for frequent words, even a small relative difference produces a large absolute difference. To fix this, we normalize the difference by dividing it by the sum of the frequencies. The resulting metric lies between 0 and 1 , and the higher its value, the more important the word is for distinguishing positive from negative reviews. But what do we do with words that occur in only one class of reviews and at the same time have a low frequency? To reduce their importance, we add a smoothing coefficient to the denominator.
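
In symbols (my own notation, matching the code below): diff_norm(w) = |f_neg(w) - f_pos(w)| / (f_neg(w) + f_pos(w) + k), where f_neg(w) and f_pos(w) are the frequencies of word w in negative and positive reviews, and k is the smoothing coefficient (300 in the code below).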

 freq_all$diff_norm <- abs(freq_all$freq.x - freq_all$freq.y) /
   (freq_all$freq.x + freq_all$freq.y + 300)
 head(arrange(freq_all, desc(diff_norm)))
 ##      word freq.x freq.y diff diff_norm
 ## 1   worst   2436    246 2190 0.7344064
 ## 2    wast   1996    192 1804 0.7250804
 ## 3 horribl   1189    194  995 0.5912062
 ## 4  stupid   1525    293 1232 0.5816808
 ## 5     bad   7660   1931 5729 0.5792134
 ## 6    wors   1183    207  976 0.5775148

Select the 500 words with the highest difference coefficient.

 freq_word <- arrange(freq_all, desc(diff_norm)) %>% select(word) %>% slice(1:500) 

We use the resulting dictionary to create a frequency matrix on which we will train the Random Forest model.

 vocab <- as.character(freq_word$word)
 frequencies <- DocumentTermMatrix(train_corpus, control = list(dictionary = vocab))
 reviewSparse_train <- as.data.frame(as.matrix(frequencies))
 row.names(reviewSparse_train) <- NULL
 reviewSparse_train$sentiment <- data_train$sentiment %>%
   as.factor(.) %>%
   revalue(., c("0" = "neg", "1" = "pos"))
 model_rf <- randomForest(sentiment ~ ., data = reviewSparse_train, ntree = 100)

After uploading the submission to Kaggle and evaluating it, the model received an AUC score of 0.83120, i.e. this work on the features gave us an improvement of about 0.1 in AUC!

TF-IDF


When creating the document-term matrix, we simply used the frequency of a word in the review as the measure of its importance. The tm package also supports another weighting called tf-idf . TF-IDF (from TF - term frequency, IDF - inverse document frequency) is a statistical measure used to assess the importance of a word in a document that is part of a collection or corpus. The weight of a word is proportional to its frequency in the document and inversely proportional to how many other documents in the collection contain it.
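
For reference, here is a rough sketch (my own illustration, not the article's code) of how the unnormalized tf-idf weights could be computed by hand for a document-term matrix; this roughly corresponds to what weightTfIdf(x, normalize = F) does below:

 # tf_idf(t, d) = tf(t, d) * log2(N / df(t)),
 # where tf(t, d) is the count of term t in document d, N is the number of documents
 # in the collection and df(t) is the number of documents containing t
 tf_idf_manual <- function(dtm) {
   m <- as.matrix(dtm)                    # raw term frequencies
   idf <- log2(nrow(m) / colSums(m > 0))  # inverse document frequency per term
   sweep(m, 2, idf, `*`)                  # multiply each column by its idf
 }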

Using tf-idf , we will create a dictionary of the 500 terms with the highest value of this metric. To make the dictionary reflect word importance more reliably, we will also use the additional, unlabeled training sample in which review sentiment is not marked. Based on this dictionary, we create a document-term matrix and train the model.

 data_train_un <- read.delim("unlabeledTrainData.tsv", header = TRUE, sep = "\t", quote = "", stringsAsFactors = F)
 train_review <- c(data_train$review, data_train_un$review)
 train_corpus <- train_review %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeNumbers) %>%
   tm_map(., removeWords, c(stopwords("english"))) %>%
   tm_map(., stemDocument)
 tdm <- TermDocumentMatrix(train_corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = F)))
 library(slam)
 # sum the tf-idf weights of each term over all documents
 freq <- rollup(tdm, 2, FUN = sum)
 freq <- as.matrix(freq)
 freq_df <- data.frame(word = row.names(freq), tfidf = freq)
 names(freq_df) <- c("word", "tf_idf")
 row.names(freq_df) <- NULL
 freq_df %<>% arrange(desc(tf_idf))
 vocab <- as.character(freq_df$word)[1:500]
 # rebuild the corpus from the labeled training reviews only
 train_corpus <- data_train$review %>%
   VectorSource(.) %>%
   Corpus(.) %>%
   tm_map(., tolower) %>%
   tm_map(., PlainTextDocument) %>%
   tm_map(., removePunctuation) %>%
   tm_map(., removeNumbers) %>%
   tm_map(., removeWords, c(stopwords("english"))) %>%
   tm_map(., stemDocument)
 frequencies <- DocumentTermMatrix(train_corpus, control = list(dictionary = vocab, weighting = function(x) weightTfIdf(x, normalize = F)))
 reviewSparse_train <- as.data.frame(as.matrix(frequencies))
 rm(data_train_un, tdm, train_review)
 row.names(reviewSparse_train) <- NULL
 colnames(reviewSparse_train) <- make.names(colnames(reviewSparse_train))
 reviewSparse_train$sentiment <- data_train$sentiment %>%
   as.factor(.) %>%
   revalue(., c("0" = "neg", "1" = "pos"))
 rm(data_train, train_corpus, freq, freq_df)
 model_rf <- randomForest(sentiment ~ ., data = reviewSparse_train, ntree = 100)


We apply this model to the test sample and get an AUC of 0.81584.

Conclusion


This work shows one possible approach to building a predictive model based on text data. One way to improve the quality of the model would be to increase the number of terms taken from the document-term matrix, but this requires a significant increase in machine resources. Much better results can also be achieved by working not with word frequencies but with word meanings and the connections between them; for that, one can turn to the word2vec model. In addition, a large field for research is the consideration of terms in the context of the document.

Source: https://habr.com/ru/post/270591/

