
Deep learning in R: training word2vec

Word2vec is practically the only deep learning algorithm that can be run relatively easily on an ordinary PC (rather than on video cards) and that builds a distributed representation of words in a reasonable time, at least that is the opinion on Kaggle. After reading about the tricks you can do with a trained model, I realized I simply had to try it. There is just one problem: I mostly work in the R language, and I could not find an official implementation of word2vec for R; I think it simply does not exist.

But the word2vec sources in C and a description are available from Google, and R can use external libraries written in C, C++ and Fortran; in fact, the fastest R libraries are written in C and C++. There is also an R wrapper, tmcn.word2vec, which is still under development. Its author,
Jian Li (his site is in Chinese), made something like a demo for the Chinese language (it also works with English; I have not tried it with Russian yet). This version has its problems, though: for one thing, the model parameters are hard-coded in the C code.

Having weighed all this "wealth", I decided to make my own version of an R interface to word2vec. To tell the truth, I do not know C very well, I have only had to write simple programs in it, so I decided to take Jian Li's source code as a basis: it definitely compiles under Windows, otherwise there would be no package. And if something does not work, it can always be checked against the original.

Preparation


To compile C code for R under Windows, you additionally need to install Rtools. This toolkit contains the gcc compiler, which runs under Cygwin. After installing Rtools, check the PATH variable. It should contain something like:
 D:\Rtools\bin;D:\Rtools\gcc-4.6.3\bin;D:\R\bin

Under OS X no Rtools is needed. You only need an installed compiler, whose presence is checked with the gcc --version command. If it is missing, install Xcode and, through Xcode, the Command Line Tools.

About calling C libraries from R you need to know the following:
  1. When a function is called, all values are passed as pointers, and you must take care to state their type explicitly. The most reliable way is to pass parameters as char and convert them to the required type already in the C code;
  2. The called function does not return a value, i.e. it must be of type void;
  3. The C code needs the directive #include <R.h>, and if there is any serious math, also #include <Rmath.h>;
  4. To print something to the R console, it is better to use Rprintf() instead of printf(). That said, plain printf() also worked for me.

To begin with, I decided to do something very simple, a sort of Hello, World!, but such that some value gets passed into it. RStudio, which I usually use, lets you write C and C++ code and highlights everything correctly. After writing and saving the code in hello.c, I opened the command line, changed to the right directory and ran the compiler with the following command:
 > R --arch x64 CMD SHLIB hello.c

Under win32, the architecture key is not needed:
 > R CMD SHLIB hello.c

As a result, two files appeared in the directory: hello.o (it can safely be deleted) and the hello.dll library. (On OS X, instead of a dll you get a file with the .so extension.) The resulting hello function is called from R with the following code:
dyn.load("hello.dll") hellof <- function(n) { .C("hello", as.integer(n)) } hellof(5) 

The test showed that everything works correctly, and it remained to prepare the data for experimenting with word2vec. I decided to take it from the Kaggle competition "Bag of Words Meets Bags of Popcorn". It provides a training set, a test set and an unlabeled set, which together contain a hundred thousand movie reviews from IMDB. After downloading these files, I removed HTML tags, special characters, numbers, punctuation marks and stop words, and tokenized the text. I omit the details of the processing; I have already written about them.
Word2vec accepts training data as a text file with one long line containing words separated by spaces (I found this out by studying the examples of working with word2vec in the official documentation). I glued the data sets into one line and saved it to a text file.
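This gluing step takes a couple of lines in R; a minimal sketch, assuming reviews is a character vector with one cleaned, tokenized review per element (the variable name is my own):

 # glue all reviews into a single space-separated line and write it out
 train_line <- paste(reviews, collapse = " ")
 writeLines(train_line, "train_data.txt")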

Model


In Jian Li's variant there are two files, word2vec.h and word2vec.c. The first contains the main code, which largely coincides with the original word2vec.c; the second contains a wrapper for calling the TrainModel() function. The first thing I decided to do was expose all of the model's parameters in the R code. That required editing the R script and the wrapper in word2vec.c, and the result was the following construction:
 dyn.load("word2vec.dll") word2vec <- function(train_file, output_file, binary, cbow, num_threads, num_features, window, min_count, sample) { //...    ... OUT <- .C("CWrapper_word2vec", train_file = as.character(train_file), output_file = as.character(output_file), binary = as.character(binary), //...    ) //...      OUT... } word2vec("train_data.txt", "model.bin", binary=1, # output format, 1-binary, 0-txt cbow=0, # skip-gram (0) or continuous bag of words (1) num_threads = 1, # num of workers num_features = 300, # word vector dimensionality window = 10, # context / window size min_count = 40, # minimum word count sample = 1e-3 # downsampling of frequent words ) 

A few words about the parameters:
binary - model output format;
cbow - which algorithm to use for training, skip-gram or continuous bag of words (cbow). Skip-gram is slower but gives better results for rare words;
num_threads - the number of processor threads involved in building the model;
num_features - the dimensionality of the word space (of the vector for each word); recommended values are from tens to hundreds;
window - how many words of context the learning algorithm should take into account;
min_count - limits the dictionary to meaningful words: words that occur in the text fewer times than this value are ignored. The recommended value is from ten to one hundred;
sample - the threshold for downsampling frequent words in the text; recommended values are from 0.00001 to 0.01.

I compiled it with the following command, using the flags recommended in the makefile:
 > R --arch x64 CMD SHLIB -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result word2vec.c

The compiler issued a number of warnings, but nothing serious, and the cherished word2vec.dll appeared in the working directory. I loaded it into R with dyn.load("word2vec.dll") without any problems and called the function of the same name. I think only the -pthread flag is actually useful; you can do without the rest (some of them are already set in the Rtools configuration).

Result:
My file contained 11.5 million words in total, the dictionary held 19,133 words, and building the model took 6 minutes on a computer with an Intel Core i7. To check whether my parameters actually had an effect, I changed num_threads from one to six. There was no need to even look at the resource monitor: the model-building time dropped to a minute and a half. In other words, this thing chews through eleven million words in a matter of minutes.

Similarity score


In the distance code I changed practically nothing, I just exposed the parameter for the number of returned values. Then I compiled the library, loaded it into R and checked it on the two words "bad" and "good", given that I am dealing with positive and negative reviews:
 Word: bad Position in vocabulary: 15
          Word CosDist
 1 terrible 0.5778409
 2 horrible 0.5541780
 3 lousy 0.5527389
 4 awful 0.5206609
 5 laughably 0.4910716
 6 atrocious 0.4841466
 7 horrid 0.4808238
 8 good 0.4805901
 9 worse 0.4726501
 10 horrendous 0.4579800

 Word: good Position in vocabulary: 6
         Word CosDist
 1 decent 0.5678578
 2 nice 0.5364762
 3 great 0.5197815
 4 bad 0.4805902
 5 excellent 0.4554003
 6 ok 0.4365533
 7 alright 0.4361723
 8 really 0.4153538
 9 liked 0.4061105
 10 fine 0.4004776

Everything worked out again. It is interesting that the distance from bad to good is greater than from good to bad, if you count in words: good is the 8th neighbour of bad, while bad is only the 4th neighbour of good. Well, as they say, "from love to hate..." is closer than the other way around. The algorithm calculates the similarity as the cosine of the angle between the word vectors, using the standard formula (shown in the original as a picture from the wiki):

 cos(A, B) = (A . B) / (||A|| * ||B||) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))

So, having a trained model, you can calculate the distance without any C at all, and instead of similarity evaluate, for example, difference, i.e. find the odd word out. To do this, you need to build the model in text format (binary = 0), load it into R with read.table() and write a certain amount of code, which I did. The code, without exception handling:
 similarity <- function(word1, word2, model) {
   size <- ncol(model) - 1
   vec1 <- model[model$word == word1, 2:size]
   vec2 <- model[model$word == word2, 2:size]
   sim <- sum(vec1 * vec2)
   sim <- sim / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
   return(sim)
 }

 difference <- function(string, model) {
   words <- tokenize(string)
   num_words <- length(words)
   diff_mx <- matrix(rep(0, num_words^2), nrow = num_words, ncol = num_words)
   for (i in 1:num_words) {
     for (j in 1:num_words) {
       sim <- similarity(words[i], words[j], model)
       if (i != j) {
         diff_mx[i, j] <- sim
       }
     }
   }
   return(words[which.min(rowSums(diff_mx))])
 }
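For completeness, here is one way the text-format model could be loaded before calling these functions; a sketch, keeping in mind that the first line of word2vec's text output holds the vocabulary size and the dimensionality, and that the column names below are my own:

 # skip the header line "vocab_size vector_size" and name the first column word
 model <- read.table("model.txt", skip = 1, header = FALSE,
                     stringsAsFactors = FALSE, quote = "", comment.char = "")
 names(model) <- c("word", paste0("v", seq_len(ncol(model) - 1)))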

Here a square matrix of size number-of-words-in-the-query by number-of-words is built. Then a similarity is calculated for each pair of distinct words. The values are then summed over the rows, and the row with the minimum sum is found; its number corresponds to the position of the "odd" word in the query. The work can be sped up by computing only half of the matrix. A couple of examples:
 > difference ("squirrel deer human dog cat", model)
 [1] "human"
 > difference ("bad red good nice awful", model)
 [1] "red"

Analogies


Finding analogies lets you solve puzzles like "man is to woman as king is to ?". A dedicated word-analogy tool exists only in the original Google code, so I had to tinker with it: I wrote a wrapper to call the function from R, removed the infinite loop from the code, and replaced the standard I/O streams with parameter passing. Then I compiled it into a library and ran some experiments. The king - queen trick did not work out for me; apparently eleven million words is not enough (the authors of word2vec recommend around a billion). A few good examples:
 > analogy ("model300.bin", "man woman king", 3)
       Word CosDist
 1 throne 0.4466286
 2 lear 0.4268206
 3 princess 0.4251665

 > analogy ("model300.bin", "man woman husband", 3)
         Word CosDist
 1 wife 0.6323696
 2 unfaithful 0.5626401
 3 married 0.5268299

 > analogy ("model300.bin", "man woman boy", 3)
      Word CosDist
 1 girl 0.6313665
 2 mother 0.4309490
 3 teenage 0.4272232
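Out of curiosity, the same kind of query can be reproduced in pure R from a text-format model, without the C code. A rough sketch, assuming model was loaded with read.table() as in the previous section (a word column followed by the vector components):

 # vector arithmetic for analogies: b - a + c, then the nearest words by cosine
 analogy_r <- function(a, b, c, model, top_n = 3) {
   vecs <- as.matrix(model[, -1])
   rownames(vecs) <- model$word
   vecs <- vecs / sqrt(rowSums(vecs^2))        # unit-length rows: dot product = cosine
   target <- vecs[b, ] - vecs[a, ] + vecs[c, ]
   target <- target / sqrt(sum(target^2))
   sims <- drop(vecs %*% target)               # cosine similarity to every word
   sims <- sims[!names(sims) %in% c(a, b, c)]  # drop the query words themselves
   head(sort(sims, decreasing = TRUE), top_n)
 }
 # analogy_r("man", "woman", "king", model)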

Clustering


After reading the documentation, I found out that clustering of the word vectors (k-means) is built right into word2vec. To use it, it is enough to expose one more parameter in R, classes: the number of clusters. If it is greater than zero, word2vec produces a text file in the format "word - cluster number". Three hundred clusters were not enough to get anything sensible. The developers' heuristic is the dictionary size divided by 5, so I chose 3000 accordingly. Here are a few successful clusters (successful in the sense that I understand why these words ended up next to each other):
            word id
 335 humor 2952
 489 serious 2952
 872 clever 2952
 1035 humor 2952
 1796 references 2952
 1916 satire 2952
 2061 slapstick 2952
 2367 quirky 2952
 2810 crude 2952
 2953 irony 2952
 3125 outrageous 2952
 3296 farce 2952
 3594 broad 2952
 4870 silliness 2952
 4979 edgy 2952

         word id
 1025 cat 241
 3242 mouse 241
 11189 minnie 241

            word id
 1089 army 322
 1127 military 322
 1556 mission 322
 1558 soldier 322
 3254 navy 322
 3323 combat 322
 3902 command 322
 3975 unit 322
 4270 colonel 322
 4277 commander 322
 7821 platoon 322
 7853 marines 322
 8691 naval 322
 9762 pow 322
 10391 gi 322
 12452 corps 322
 15839 infantry 322
 16697 diver 322

With clustering it is easy to do sentiment analysis. To do this, you need to build a "bag of clusters": a matrix of size number-of-reviews by number-of-clusters, where each cell holds the number of times words from the given review fall into the given cluster. I have not tried it, but I do not see any problems here (a sketch follows below). It is said that the accuracy on the IMDB reviews is the same or slightly lower than with the "Bag of Words" approach.
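A possible sketch of such a matrix; the file name and variable names are my assumptions (reviews is, as before, a character vector with one cleaned review per element, and the classes file is the word - cluster output described above):

 # read the "word cluster_id" pairs and turn them into a lookup vector
 classes <- read.table("classes.txt", col.names = c("word", "id"),
                       stringsAsFactors = FALSE)
 clusters <- setNames(classes$id, classes$word)

 bag_of_clusters <- function(reviews, clusters) {
   # assuming word2vec numbers clusters from 0, shift ids by one to use them as columns
   mx <- matrix(0, nrow = length(reviews), ncol = max(clusters) + 1)
   tokens <- strsplit(reviews, " ", fixed = TRUE)
   for (i in seq_along(tokens)) {
     ids <- clusters[tokens[[i]]]   # cluster id of every word in the review
     ids <- ids[!is.na(ids)]        # skip words that did not make it into the dictionary
     counts <- table(ids)
     mx[i, as.integer(names(counts)) + 1] <- as.vector(counts)
   }
   mx
 }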

Phrases


Word2vec can also work with phrases, or rather with stable combinations of words. For this, the original code has a word2phrase procedure. Its task is to find frequently co-occurring pairs of words and replace the space between them with an underscore. The file produced after the first pass contains two-word phrases; if you feed it through word2phrase again, triples and quadruples appear. The result can then be used to train word2vec.
I made the call to this procedure from R by analogy with word2vec:
 word2phrase("train_data.txt", "train_phrase.txt", min_count=5, threshold=100) 

The min_count parameter ignores phrases that occur fewer times than the specified value; threshold controls the sensitivity of the algorithm: the larger the value, the fewer phrases will be found. After the second pass I got about six thousand combinations. To look at the phrases themselves, I first built a model in text format, pulled out the column of words and filtered it by the underscore. Here is an example fragment:
 [5887] "works_perfectly" "four_year_old" "multi_million_dollar"               
 [5890] "fresh_faced" "return_living_dead" "seemed_forced"                      
 [5893] "freddie_prinze_jr" "re_lucky" "puerto_rico"                        
 [5896] "every_sentence" "living_hell" "went_straight"                      
 [5899] "supporting_cast_including" "action_set_pieces" "space_shuttle"     

I selected several phrases for distance ():
 > distance ("p_model300_2.bin", "crouching_tiger_hidden_dragon", 10)
 Word: crouching_tiger_hidden_dragon Position in vocabulary: 15492
                  Word CosDist
 1 tsui_hark 0.6041993
 2 ang_lee 0.5996884
 3 martial_arts_films 0.5541546
 4 kung_fu_hustle 0.5381692
 5 blockbusters 0.5305687
 6 kill_bill 0.5279162
 7 grindhouse 0.5242150
 8 churned 0.5224440
 9 budgets 0.5141657
 10 john_woo 0.5046486

 > distance ("p_model300_2.bin", "academy_award_winning", 10)
 Word: academy_award_winning Position in vocabulary: 15780
                    Word CosDist
 1 nominations 0.4570983
 2 ever_produced 0.4558123
 3 francis_ford_coppola 0.4547777
 4 producer_director 0.4545878
 5 set_standard 0.4512480
 6 participation 0.4503479
 7 won_academy_award 0.4477891
 8 michael_mann 0.4464636
 9 huge_budget 0.4424854
 10 directorial_debut 0.4406852


At this point I finished the experiments. One important note: word2vec works with memory directly, as a result of which R can become unstable and the session can crash. Sometimes this is caused by diagnostic messages from the OS that R cannot handle correctly. If there are no errors in the code, restarting the interpreter or RStudio helps.

The R code, the C sources and the Windows dll compiled under x64 are in my repository.

UPD:
As a result of a discussion with ServPonomarev and the subsequent analysis of the word2vec code, it turned out that the algorithm trains on lines of 1000 words, within which the window moves plus/minus 5 words. When an EOL character is encountered, which the algorithm converts into a special word with index zero in the dictionary, the movement of the window stops and continues on the new line. The representation of words separated by EOL will therefore differ in the model from the representation of the same words separated by a space. Conclusion: if the source text is a collection of documents, phrases or paragraphs separated by newlines, you should not get rid of this extra information, i.e. leave the EOL characters in the training set. Unfortunately, this is very hard to illustrate with examples.
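In practice, keeping the newlines simply means skipping the gluing step from the data preparation above; a sketch under the same assumption that reviews holds one cleaned review per element:

 # one review per line: the EOL becomes a boundary token for word2vec,
 # so the context window does not slide across review borders
 writeLines(reviews, "train_data.txt")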

Source: https://habr.com/ru/post/258983/

