Word2vec is practically the only deep learning algorithm that can be run easily on an ordinary PC (rather than on video cards) and that builds a distributed representation of words in reasonable time, at least that is the opinion on Kaggle. After reading here about the tricks you can do with a trained model, I realized I simply had to try it. There was only one problem: I mostly work in R, and I could not find an official word2vec implementation for R; I suspect it simply does not exist.
However, there are word2vec sources in C and a description on Google, and R can call external libraries written in C, C++, and Fortran. Incidentally, the fastest R libraries are written precisely in C and C++. There is also an R wrapper, tmcn.word2vec, which is still under development. Its author, Jian Li (site in Chinese), made something like a demo for the Chinese language (it also works with English; I have not tried it with Russian yet). The problems with this version are as follows:
- First, all parameters are hard-coded in the C code;
- Second, the author provided only one function for working with the trained model, distance, which evaluates the similarity of words and outputs the 20 variants with the highest values;
- Third, I could not build the package for x64 Windows. On win32 the package installs without problems.
Having sized up all this "wealth", I decided to make my own version of an R interface to word2vec. To tell the truth, I do not know C very well and have only written simple programs in it, so I took Jian Li's source code as a basis: it definitely compiles under Windows, otherwise there would be no package, and if something does not work, it can always be checked against the original.
Training
In order to compile C code for R under Windows, you additionally need to install
Rtools. This toolkit contains the gcc compiler, which runs under Cygwin. After installing Rtools, check the PATH variable. It should contain something like:
D:\Rtools\bin;D:\Rtools\gcc-4.6.3\bin;D:\R\bin
Under OS X, no Rtools are required. You need an installed compiler; its presence can be checked with the gcc --version command. If there is none, install
Xcode and, through Xcode, the Command Line Tools.
About calling C libraries from R you need to know the following:
- When calling a function, all values are passed as pointers, and care must be taken to state their types explicitly. The most reliable way is to pass parameters as char and convert them to the required type already in the C code;
- The called function does not return a value, i.e. it must be of type void;
- The C code needs the directive #include <R.h>, and if there is complicated math, also #include <Rmath.h>;
- If you need to print something to the R console, it is better to use Rprintf() instead of printf(). That said, plain printf() also worked for me.
To begin with, I decided to do something very simple, like Hello, World!, but with some value passed in. RStudio, which I usually use, lets you write C and C++ code and highlights everything correctly. After writing and saving the code in hello.c, I opened the command line, went to the right directory, and started the compiler with the following command:
> R --arch x64 CMD SHLIB hello.c
Under win32, the architecture flag is not needed:
> R CMD SHLIB hello.c
As a result, two files appeared in the directory: hello.o (it can safely be deleted) and the hello.dll library. (On OS X, instead of a dll you get a file with the .so extension.) Calling the resulting hello function from R is done with the following code:
dyn.load("hello.dll") hellof <- function(n) { .C("hello", as.integer(n)) } hellof(5)
The test showed that everything works correctly, and it remained to prepare data for experimenting with word2vec. I decided to take it from Kaggle, from the "Bag of Words Meets Bags of Popcorn" competition. It has a training set, a test set, and an unlabeled set, which together contain a hundred thousand film reviews from IMDB. After downloading these files, I removed HTML tags, special characters, numbers, punctuation marks, and stop words, and tokenized the text. I omit the details of the processing; I have already
written about them.
Word2vec accepts training data as a text file containing one long line of words separated by spaces (I found this out by analyzing the examples of working with word2vec from the official documentation). I glued the data sets into one line and saved it to a text file.
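In R that step might look roughly like this (a sketch, assuming the cleaned, tokenized reviews are stored in a character vector reviews, one review per element; the object names are illustrative):

# Glue all reviews into a single space-separated line and save it
one_line <- paste(reviews, collapse = " ")
writeLines(one_line, "train_data.txt")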
Model
In Jian Li's variant these are two files, word2vec.h and word2vec.c. The first contains the main code, which largely coincides with the original word2vec.c; the second is a wrapper for calling the TrainModel() function. The first thing I decided to do was expose all the model parameters in the R code. This required editing the R script and the wrapper in word2vec.c, and the result was the following construction:
dyn.load("word2vec.dll") word2vec <- function(train_file, output_file, binary, cbow, num_threads, num_features, window, min_count, sample) { //... ... OUT <- .C("CWrapper_word2vec", train_file = as.character(train_file), output_file = as.character(output_file), binary = as.character(binary), //... ) //... OUT... } word2vec("train_data.txt", "model.bin", binary=1, # output format, 1-binary, 0-txt cbow=0, # skip-gram (0) or continuous bag of words (1) num_threads = 1, # num of workers num_features = 300, # word vector dimensionality window = 10, # context / window size min_count = 40, # minimum word count sample = 1e-3 # downsampling of frequent words )
A few words about the parameters:
- binary - the model output format;
- cbow - which training algorithm to use: skip-gram (0) or continuous bag of words (1). Skip-gram is slower but gives better results for rare words;
- num_threads - the number of processor threads involved in building the model;
- num_features - the dimensionality of the word space (the vector for each word), recommended from tens to hundreds;
- window - how many words of context the learning algorithm should take into account;
- min_count - limits the dictionary to meaningful words: words that occur in the text fewer times than the specified value are ignored. The recommended value is from ten to one hundred;
- sample - the threshold for downsampling frequent words; recommended values are from .00001 to .01.
I compiled with the following command, using the flags recommended in the
makefile:
> R --arch x64 CMD SHLIB -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result word2vec.c
The compiler issued a number of warnings, but nothing serious, and the cherished word2vec.dll appeared in the working directory. I loaded it into R without problems with dyn.load("word2vec.dll") and launched the function of the same name. I think only the -pthread flag is really useful; you can do without the rest (some of them are already set in the Rtools configuration).
Result:
There were 11.5 million words in total in my file, the dictionary was 19,133 words, and building the model took 6 minutes on a computer with an Intel Core i7. To check whether my parameters actually work, I changed num_threads from one to six. I did not even need to look at the resource monitor: the model building time dropped to a minute and a half. In other words, this thing handles eleven million words in minutes.
Similarity score
In distance I changed practically nothing, just exposed the parameter for the number of returned values. Then I compiled the library, loaded it into R and checked it on the two words "bad" and "good", given that I am dealing with positive and negative reviews:
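The calls looked roughly like this (the model file name here is illustrative):

distance("model.bin", "bad", 10)
distance("model.bin", "good", 10)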
Word: bad Position in vocabulary: 15
Word CosDist
1 terrible 0.5778409
2 horrible 0.5541780
3 lousy 0.5527389
4 awful 0.5206609
5 laughably 0.4910716
6 atrocious 0.4841466
7 horrid 0.4808238
8 good 0.4805901
9 worse 0.4726501
10 horrendous 0.4579800
Word: good Position in vocabulary: 6
Word CosDist
1 decent 0.5678578
2 nice 0.5364762
3 great 0.5197815
4 bad 0.4805902
5 excellent 0.4554003
6 ok 0.4365533
7 alright 0.4361723
8 really 0.4153538
9 liked 0.4061105
10 fine 0.4004776
Everything worked again. It is interesting that, counted in word ranks, the distance from bad to good is greater than from good to bad: good is only eighth in the list for bad, while bad is fourth in the list for good. Well, as they say, "from love to hate..." is closer than the other way around. The algorithm computes the similarity as the cosine of the angle between the word vectors (the picture from the wiki showed the standard formula):

similarity = cos(θ) = (A · B) / (‖A‖ · ‖B‖)
So, having a trained model, you can compute the distance without C, and instead of similarity you can evaluate, for example, difference. To do this, you need to build the model in text format (binary = 0), load it into R with read.table(), and write a bit of code, which I did. The code, without exception handling:
# `model` is the text-format model loaded with read.table();
# the first column (named word) is the word itself, the rest is its vector
similarity <- function(word1, word2, model) {
  size <- ncol(model) - 1
  vec1 <- model[model$word == word1, 2:size]
  vec2 <- model[model$word == word2, 2:size]
  # cosine of the angle between the two word vectors
  sim <- sum(vec1 * vec2)
  sim <- sim / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
  return(sim)
}

difference <- function(string, model) {
  words <- tokenize(string)
  num_words <- length(words)
  diff_mx <- matrix(rep(0, num_words^2), nrow = num_words, ncol = num_words)
  for (i in 1:num_words) {
    for (j in 1:num_words) {
      sim <- similarity(words[i], words[j], model)
      if (i != j) {
        diff_mx[i, j] = sim
      }
    }
  }
  # the word whose row of similarities sums to the smallest value is the "odd one out"
  return(words[which.min(rowSums(diff_mx))])
}
Here a square matrix is built, with dimensions equal to the number of words in the query. Then the similarity is computed for each pair of non-identical words. The values are summed by rows, and the row with the minimum sum is found; its number corresponds to the position of the "odd" word in the query. The work could be sped up by computing only half of the matrix. A couple of examples:
> difference ("squirrel deer human dog cat", model)
[1] "human"
> difference ("bad red good nice awful", model)
[1] "red"
Analogies
Finding analogies lets you solve puzzles like "man is to woman as king is to ...?". A special word-analogy function exists only in the original Google code, so I had to tinker with it. I wrote a wrapper to call the function from R, removed the infinite loop from the code, and replaced the standard I/O streams with parameter passing. Then I compiled it into a library and ran a few experiments. The king - queen example did not work for me; apparently eleven million words is not enough (the authors of word2vec recommend around a billion). A few good examples:
> analogy ("model300.bin", "man woman king", 3)
Word CosDist
1 throne 0.4466286
2 lear 0.4268206
3 princess 0.4251665
> analogy ("model300.bin", "man woman husband", 3)
Word CosDist
1 wife 0.6323696
2 unfaithful 0.5626401
3 married 0.5268299
> analogy ("model300.bin", "man woman boy", 3)
Word CosDist
1 girl 0.6313665
2 mother 0.4309490
3 teenage 0.4272232
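For reference, the same analogy arithmetic (the vector offset trick) can be reproduced in pure R on a text-format model, in the spirit of the similarity() function above. This is only a sketch, not the C wrapper I actually used; the column name word and the layout of model are assumptions:

# Sketch: "word1 is to word2 as word3 is to ?" on a text-format model
analogy_r <- function(word1, word2, word3, model, n = 3) {
  get_vec <- function(w) as.numeric(model[model$word == w, -1])
  target <- get_vec(word2) - get_vec(word1) + get_vec(word3)
  mx <- as.matrix(model[, -1])
  # cosine similarity of the target vector to every word in the vocabulary
  sims <- as.numeric(mx %*% target) /
          (sqrt(rowSums(mx^2)) * sqrt(sum(target^2)))
  res <- data.frame(Word = model$word, CosDist = sims)
  res <- res[!(res$Word %in% c(word1, word2, word3)), ]
  head(res[order(-res$CosDist), ], n)
}

# analogy_r("man", "woman", "king", model)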
Clustering
After reading the documentation, I realized that word clustering is built right into word2vec. To use it, it is enough to "pull out" one more parameter into R: classes, the number of clusters. If it is greater than zero, word2vec will produce a text file in the format word - cluster number. Three hundred clusters were not enough to get anything sane. The developers' heuristic is the vocabulary size divided by 5, so I chose 3000 accordingly. Let me show a few successful clusters (successful in the sense that I understand why these words ended up side by side):
word id
335 humor 2952
489 serious 2952
872 clever 2952
1035 humor 2952
1796 references 2952
1916 satire 2952
2061 slapstick 2952
2367 quirky 2952
2810 crude 2952
2953 irony 2952
3125 outrageous 2952
3296 farce 2952
3594 broad 2952
4870 silliness 2952
4979 edgy 2952
word id
1025 cat 241
3242 mouse 241
11189 minnie 241
word id
1089 army 322
1127 military 322
1556 mission 322
1558 soldier 322
3254 navy 322
3323 combat 322
3902 command 322
3975 unit 322
4270 colonel 322
4277 commander 322
7821 platoon 322
7853 marines 322
8691 naval 322
9762 pow 322
10391 gi 322
12452 corps 322
15839 infantry 322
16697 diver 322
Using the clustering, it is easy to do sentiment analysis. To do this, you need to build a "bag of clusters": a matrix sized number of reviews by number of clusters. Each cell of such a matrix should contain the number of times words from that review fall into the given cluster. I have not tried it, but I see no problems here.
It is said that the accuracy on the IMDB reviews is the same or slightly lower than with the "Bag of Words" approach.
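A rough sketch of how such a matrix could be built, assuming the word - cluster file produced with classes = 3000 has been read into a data frame clusters with columns word and id, and each review is already a vector of tokens (all names here are illustrative):

# Build a "bag of clusters" matrix: one row per review, one column per cluster
bag_of_clusters <- function(reviews_tokens, clusters, num_clusters = 3000) {
  mx <- matrix(0, nrow = length(reviews_tokens), ncol = num_clusters)
  for (i in seq_along(reviews_tokens)) {
    ids <- clusters$id[match(reviews_tokens[[i]], clusters$word)]
    ids <- ids[!is.na(ids)]                  # drop words missing from the vocabulary
    counts <- table(ids)
    mx[i, as.integer(names(counts)) + 1] <- as.integer(counts)  # cluster ids start at 0
  }
  mx
}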
Phrases
Word2vec can work with phrases, or rather with stable combinations of words. For this, the original code has the word2phrase procedure. Its job is to find frequently co-occurring combinations of words and replace the space between them with an underscore. The file obtained after the first pass contains two-word combinations; if you feed it to word2phrase again, threes and fours appear. The result can then be used to train word2vec.
I made a call to this procedure from R by analogy with word2vec:
word2phrase("train_data.txt", "train_phrase.txt", min_count=5, threshold=100)
The min_count parameter allows phrases occurring fewer times than the specified value to be ignored, and threshold controls the sensitivity of the algorithm: the larger the value, the fewer phrases will be found. After the second pass I got about six thousand combinations. To look at the phrases themselves, I first built a model in text format, pulled out the column of words, and filtered it by underscore. Here is an example snippet:
[5887] "works_perfectly" "four_year_old" "multi_million_dollar"
[5890] "fresh_faced" "return_living_dead" "seemed_forced"
[5893] "freddie_prinze_jr" "re_lucky" "puerto_rico"
[5896] "every_sentence" "living_hell" "went_straight"
[5899] "supporting_cast_including" "action_set_pieces" "space_shuttle"
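The filtering step mentioned above fits into a single line (a sketch, assuming the text-format model has been loaded as model with the word column named word):

# Keep only vocabulary entries that contain an underscore, i.e. the phrases
phrases <- model$word[grepl("_", model$word)]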
I selected several phrases for distance():
> distance ("p_model300_2.bin", "crouching_tiger_hidden_dragon", 10)
Word: crouching_tiger_hidden_dragon Position in vocabulary: 15492
Word CosDist
1 tsui_hark 0.6041993
2 ang_lee 0.5996884
3 martial_arts_films 0.5541546
4 kung_fu_hustle 0.5381692
5 blockbusters 0.5305687
6 kill_bill 0.5279162
7 grindhouse 0.5242150
8 churned 0.5224440
9 budgets 0.5141657
10 john_woo 0.5046486
> distance ("p_model300_2.bin", "academy_award_winning", 10)
Word: academy_award_winning Position in vocabulary: 15780
Word CosDist
1 nominations 0.4570983
2 ever_produced 0.4558123
3 francis_ford_coppola 0.4547777
4 producer_director 0.4545878
5 set_standard 0.4512480
6 participation 0.4503479
7 won_academy_award 0.4477891
8 michael_mann 0.4464636
9 huge_budget 0.4424854
10 directorial_debut 0.4406852
At this point I finished the experiments. One important note: word2vec works with memory directly, as a result of which R may become unstable and crash the session. Sometimes this is caused by diagnostic messages from the OS that R cannot process correctly. If there are no errors in the code, restarting the interpreter or RStudio helps.
The R code, the C sources, and the Windows dll compiled for x64 are in my
repository.
UPD: As a result of a dispute with
ServPonomarev and the subsequent analysis of the word2vec code, we found out that the algorithm trains on lines of 1000 words, within which the window moves plus/minus 5 words. When an end-of-line character is encountered, the algorithm converts it into a special word with index zero in the dictionary; the window stops moving and continues on the new line. So the representation of words separated by an EOL will differ in the model from the representation of the same words separated by a space. Conclusion: if the source text is a collection of documents, phrases, or paragraphs separated by newlines, you should not throw this additional information away, i.e. leave the EOL characters in the training set. Unfortunately, this is very hard to illustrate with examples.
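In practice that simply means writing each document on its own line instead of gluing everything into one long line, for example (a sketch using the same illustrative reviews vector as above):

# One review per line: the EOL characters are preserved in the training file
writeLines(reviews, "train_data.txt")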