
Everything you know about word2vec is not true

The classic explanation of word2vec as a skip-gram architecture with negative sampling, found in the original paper and in countless blog posts, looks like this:

    while(1) {
        1. vf = vector of focus word
        2. vc = vector of context word
        3. train such that (vc . vf = 1)
        4. for(0 <= i <= negative samples):
               vneg = vector of word *not* in context
               train such that (vf . vneg = 0)
    }
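To make that concrete, here is a minimal sketch of the update those descriptions imply, with a single table of vectors shared by focus, context, and negative words (this is the textbook algorithm, not the real implementation; the names vecs, DIM, lr, and sigmoid are mine for illustration):

    /* Textbook skip-gram with negative sampling: ONE vector per word. */
    #include <math.h>

    #define DIM 100                      /* embedding dimension (illustrative) */

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* One SGD step on a (focus, context) pair plus n_neg negative samples. */
    void textbook_step(double vecs[][DIM], int focus, int context,
                       const int *negs, int n_neg, double lr)
    {
        for (int i = -1; i < n_neg; i++) {
            int other  = (i < 0) ? context : negs[i];
            double tgt = (i < 0) ? 1.0 : 0.0;  /* push vf.vc -> 1, vf.vneg -> 0 */
            double dot = 0.0;
            for (int c = 0; c < DIM; c++) dot += vecs[focus][c] * vecs[other][c];
            double g = lr * (tgt - sigmoid(dot));
            for (int c = 0; c < DIM; c++) {
                double vf = vecs[focus][c], vo = vecs[other][c];
                vecs[focus][c] += g * vo;  /* both words live in the same table */
                vecs[other][c] += g * vf;
            }
        }
    }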

Indeed, if you google [word2vec skipgram], this is what we see:


But all these implementations are wrong.

The original C implementation of word2vec works quite differently, and is fundamentally not this algorithm. People who professionally deploy systems built on word2vec word embeddings do one of the following:

  1. Call the original C implementation directly.
  2. Use the gensim implementation, which is transliterated from the C source closely enough that even the variable names match.

Indeed, gensim is the only implementation I know of that is true to the C implementation.

C implementation


The C implementation actually maintains two vectors for each word: one for the word when it is the focus word, and a second for the word when it appears as a context word. (Sound familiar? Indeed, the GloVe developers borrowed this idea from word2vec without ever mentioning it!)
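In terms of data layout, this means two separate flat tables, each with one row per vocabulary word. A simplified sketch (the names syn0 and syn1neg follow the published C source; the real code uses a real typedef and posix_memalign, which I have reduced to plain float and malloc here):

    /* Two flat tables, one row of layer1_size floats per vocabulary word:
     *   syn0    - the vector a word uses when it is the focus word
     *   syn1neg - the vector the same word uses when it is drawn as a
     *             (negative-sampling) context word                      */
    #include <stdlib.h>

    float *syn0, *syn1neg;
    long long vocab_size, layer1_size;

    void alloc_tables(void)
    {
        syn0    = malloc((size_t)(vocab_size * layer1_size) * sizeof(float));
        syn1neg = malloc((size_t)(vocab_size * layer1_size) * sizeof(float));
    }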

The way the C code handles this is remarkably well thought out.
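Paraphrasing the initialization logic from the published C source (the function name init_net is mine, and I use plain float for brevity), the focus vectors in syn0 are given small random values, while the negative-sampling context vectors in syn1neg start at exactly zero:

    /* Initialization, paraphrased from word2vec.c: random focus vectors,
     * zeroed context vectors.                                            */
    void init_net(float *syn0, float *syn1neg,
                  long long vocab_size, long long layer1_size)
    {
        unsigned long long next_random = 1;
        for (long long a = 0; a < vocab_size; a++)
            for (long long b = 0; b < layer1_size; b++) {
                /* cheap linear congruential generator, as in the C source */
                next_random = next_random * 25214903917ULL + 11;
                syn0[a * layer1_size + b] =
                    (((next_random & 0xFFFF) / (float)65536) - 0.5f) / layer1_size;
            }
        for (long long a = 0; a < vocab_size * layer1_size; a++)
            syn1neg[a] = 0.0f;           /* context vectors start at zero */
    }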


Why random and zero initialization?


Once again, since this is not explained at all in the original papers, or anywhere else on the Internet, I can only guess.

The hypothesis is that, since negative samples are drawn from the whole text and are not really weighted by frequency, you can end up picking any word, and more often than not a word whose vector has hardly been trained at all. If that vector held a random value, it would shift the genuinely important focus word in a random direction.

The trick is to start all negative-sample (context) vectors at zero, so that a word's representation is affected only by the vectors of words that occur reasonably often.
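A sketch of the per-pair update with the two tables makes the effect visible (paraphrased from my reading of the training loop in the C source; sigmoidf() stands in for its exponent lookup table, and the neu1e buffer name follows that source): while a negative sample's syn1neg row is still all zeros, the term it adds to the focus word's accumulated gradient is exactly zero, so an untrained word cannot push the focus vector around.

    /* One (focus, target) update with the two tables; label is 1 for the true
     * context word and 0 for a negative sample.                              */
    #include <math.h>

    static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    void sgns_step(float *syn0, float *syn1neg, long long layer1_size,
                   long long focus, long long target, int label, float alpha,
                   float *neu1e /* gradient accumulated for the focus word */)
    {
        long long l1 = focus * layer1_size, l2 = target * layer1_size;
        float f = 0.0f, g;
        for (long long c = 0; c < layer1_size; c++)
            f += syn0[l1 + c] * syn1neg[l2 + c];
        g = (label - sigmoidf(f)) * alpha;
        /* a still-zero syn1neg row adds nothing to the focus word's gradient... */
        for (long long c = 0; c < layer1_size; c++)
            neu1e[c] += g * syn1neg[l2 + c];
        /* ...but the context vector itself starts learning here */
        for (long long c = 0; c < layer1_size; c++)
            syn1neg[l2 + c] += g * syn0[l1 + c];
    }
    /* After the positive pair and all negative samples are processed, the
     * caller applies neu1e to the focus vector:
     *     for (c = 0; c < layer1_size; c++) syn0[l1 + c] += neu1e[c];       */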

In fact, this is quite clever, and I had never thought about how important initialization strategies can be.

Why am I writing this?


I spent two months of my life trying to reproduce word2vec as described in the original paper and in countless articles on the Internet, but I failed. I could not reach the same results as word2vec, no matter how hard I tried.

I could not imagine that the authors of the paper had literally published an algorithm that does not work, while their implementation does something completely different.

In the end, I decided to study the source code. For three days I was sure I was misreading it, because literally everyone on the Internet described a different algorithm.

I have no idea why the original paper and the articles on the Internet say nothing about how word2vec actually works, so I decided to publish it myself.

It also explains GloVe's seemingly radical choice of separate vectors for the negative context: they simply did what word2vec does, but told people about it :).

Is this scientific dishonesty? I don't know; it's a hard question. But honestly, I am incredibly angry. I will probably never again be able to take an explanation of a machine-learning algorithm at face value: next time, I will go straight to the source code.

Source: https://habr.com/ru/post/454926/

