
Everything you know about word2vec is not true

The classic explanation of word2vec as a skip-gram architecture with negative sampling, found in the original paper and in countless blog posts, looks like this:

    while(1) {
        1. vf = vector of focus word
        2. vc = vector of context word
        3. train such that (vc . vf = 1)
        4. for(0 <= i <= negative samples):
               vneg = vector of word *not* in context
               train such that (vf . vneg = 0)
    }
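To make that concrete, here is a minimal sketch of the update those descriptions imply, with a single table of vectors shared by focus, context, and negative words (this is the textbook algorithm, not the real implementation; the names vecs, DIM, lr, and sigmoid are mine for illustration):

    /* Textbook skip-gram with negative sampling: ONE vector per word. */
    #include <math.h>

    #define DIM 100                      /* embedding dimension (illustrative) */

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* One SGD step on a (focus, context) pair plus n_neg negative samples. */
    void textbook_step(double vecs[][DIM], int focus, int context,
                       const int *negs, int n_neg, double lr)
    {
        for (int i = -1; i < n_neg; i++) {
            int other  = (i < 0) ? context : negs[i];
            double tgt = (i < 0) ? 1.0 : 0.0;  /* push vf.vc -> 1, vf.vneg -> 0 */
            double dot = 0.0;
            for (int c = 0; c < DIM; c++) dot += vecs[focus][c] * vecs[other][c];
            double g = lr * (tgt - sigmoid(dot));
            for (int c = 0; c < DIM; c++) {
                double vf = vecs[focus][c], vo = vecs[other][c];
                vecs[focus][c] += g * vo;  /* both words live in the same table */
                vecs[other][c] += g * vf;
            }
        }
    }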

Indeed, if you google [word2vec skipgram], this is what we see:


But all these implementations are wrong.

The original C implementation of word2vec works quite differently, and is fundamentally not this algorithm. People who professionally deploy systems built on word2vec word embeddings do one of the following:

  1. Call the original C implementation directly.
  2. Use the gensim implementation, which is transliterated from the C source closely enough that even the variable names match.

Indeed, gensim is the only implementation I know of that is true to the C implementation.

C implementation


The C implementation actually maintains two vectors for each word: one for the word when it is the focus word, and a second for the word when it appears as a context word. (Sound familiar? Indeed, the GloVe developers borrowed this idea from word2vec without ever mentioning it!)
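In terms of data layout, this means two separate flat tables, each with one row per vocabulary word. A simplified sketch (the names syn0 and syn1neg follow the published C source; the real code uses a real typedef and posix_memalign, which I have reduced to plain float and malloc here):

    /* Two flat tables, one row of layer1_size floats per vocabulary word:
     *   syn0    - the vector a word uses when it is the focus word
     *   syn1neg - the vector the same word uses when it is drawn as a
     *             (negative-sampling) context word                      */
    #include <stdlib.h>

    float *syn0, *syn1neg;
    long long vocab_size, layer1_size;

    void alloc_tables(void)
    {
        syn0    = malloc((size_t)(vocab_size * layer1_size) * sizeof(float));
        syn1neg = malloc((size_t)(vocab_size * layer1_size) * sizeof(float));
    }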

The way the C code handles this is remarkably well thought out.
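Paraphrasing the initialization logic from the published C source (the function name init_net is mine, and I use plain float for brevity), the focus vectors in syn0 are given small random values, while the negative-sampling context vectors in syn1neg start at exactly zero:

    /* Initialization, paraphrased from word2vec.c: random focus vectors,
     * zeroed context vectors.                                            */
    void init_net(float *syn0, float *syn1neg,
                  long long vocab_size, long long layer1_size)
    {
        unsigned long long next_random = 1;
        for (long long a = 0; a < vocab_size; a++)
            for (long long b = 0; b < layer1_size; b++) {
                /* cheap linear congruential generator, as in the C source */
                next_random = next_random * 25214903917ULL + 11;
                syn0[a * layer1_size + b] =
                    (((next_random & 0xFFFF) / (float)65536) - 0.5f) / layer1_size;
            }
        for (long long a = 0; a < vocab_size * layer1_size; a++)
            syn1neg[a] = 0.0f;           /* context vectors start at zero */
    }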


Why random and zero initialization?


Once again, since this is not explained at all in the original papers, or anywhere else on the Internet, I can only guess.

The hypothesis is that, since negative samples are drawn from the whole text and are not really weighted by frequency, you can end up picking any word, and more often than not a word whose vector has hardly been trained at all. If that vector held a random value, it would shift the genuinely important focus word in a random direction.

The trick is to start all negative-sample (context) vectors at zero, so that a word's representation is affected only by the vectors of words that occur reasonably often.
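A sketch of the per-pair update with the two tables makes the effect visible (paraphrased from my reading of the training loop in the C source; sigmoidf() stands in for its exponent lookup table, and the neu1e buffer name follows that source): while a negative sample's syn1neg row is still all zeros, the term it adds to the focus word's accumulated gradient is exactly zero, so an untrained word cannot push the focus vector around.

    /* One (focus, target) update with the two tables; label is 1 for the true
     * context word and 0 for a negative sample.                              */
    #include <math.h>

    static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    void sgns_step(float *syn0, float *syn1neg, long long layer1_size,
                   long long focus, long long target, int label, float alpha,
                   float *neu1e /* gradient accumulated for the focus word */)
    {
        long long l1 = focus * layer1_size, l2 = target * layer1_size;
        float f = 0.0f, g;
        for (long long c = 0; c < layer1_size; c++)
            f += syn0[l1 + c] * syn1neg[l2 + c];
        g = (label - sigmoidf(f)) * alpha;
        /* a still-zero syn1neg row adds nothing to the focus word's gradient... */
        for (long long c = 0; c < layer1_size; c++)
            neu1e[c] += g * syn1neg[l2 + c];
        /* ...but the context vector itself starts learning here */
        for (long long c = 0; c < layer1_size; c++)
            syn1neg[l2 + c] += g * syn0[l1 + c];
    }
    /* After the positive pair and all negative samples are processed, the
     * caller applies neu1e to the focus vector:
     *     for (c = 0; c < layer1_size; c++) syn0[l1 + c] += neu1e[c];       */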

In fact, this is quite clever, and I had never thought about how important initialization strategies can be.

Why am I writing this?


I spent two months of my life trying to reproduce word2vec as described in the original paper and in countless articles on the Internet, but I failed. I could not reach the same results as word2vec, no matter how hard I tried.

I could not imagine that the authors of the paper had literally published an algorithm that does not work, while their implementation does something completely different.

In the end, I decided to study the source code. For three days I was sure I was misreading it, because literally everyone on the Internet described a different algorithm.

I have no idea why the original paper and the articles on the Internet say nothing about how word2vec actually works, so I decided to publish it myself.

It also explains GloVe's seemingly radical choice of separate vectors for the negative context: they simply did what word2vec does, but told people about it :).

Is this scientific dishonesty? I don't know; it's a hard question. But honestly, I am incredibly angry. I will probably never again be able to take an explanation of a machine-learning algorithm at face value: next time, I will go straight to the source code.

Source: https://habr.com/ru/post/454926/

