By a twist of fate, a Word2Vec model trained on search queries ended up in my hands. Below the cut are usage examples with explanations.
What is Word2Vec?
Word2Vec is a technology from Google for the statistical processing of large volumes of text. W2V collects statistics on the co-occurrence of words in phrases, then uses neural-network methods to reduce the dimensionality and produce compact vector representations of the words that reflect, as far as possible, the relationships between those words in the processed texts. I recommend reading the original source rather than relying on my muddled retelling of the technology.
Which Word2Vec was investigated?
For training, queries to a domestic Internet search engine were used, so the material is mostly Russian. The vector length is 256 elements, and models are available for both the skip-gram algorithm and bag of words (CBOW). The vocabulary totals over 2.6 million words: all the words of the Russian language, many words from other languages, typos, names, and codes; in short, everything that people might search for.
Search-engine queries differ greatly in character from ordinary texts, and this lends the results a certain new quality.
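The sessions below use the distance and word-analogy tools bundled with word2vec, but the same queries are easy to replay in Python. A minimal sketch, assuming vectors.bin is in the standard word2vec C binary format (gensim is my addition, not part of the original setup):

from gensim.models import KeyedVectors

# Load the 256-dimensional vectors produced by the word2vec C tools.
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# Cosine similarity between two words. The model's vocabulary is
# Russian, so real lookups would use Russian words.
print(kv.similarity('кофе', 'чай'))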
Typos
./distance vectors.bin
Enter word or sentence (EXIT to break): switch
Word: switch  Position in vocabulary: 124515
Word Cosine distance
- adventure 0.748698
Adventures 0.726111
adventures 0.692828
adventure 0.670168
connection 0.666706
adventure 0.663286
connections 0.660438
adventure 0.659609
This gives us not only typo correction for a query but a complete list of all the typos people actually make. All the typos are gathered into a single cluster, which is convenient. The difference from a spell-checker is fundamental: a typo is matched not by computing the Levenshtein distance (the minimum number of edits needed to turn the erroneous form into the correct one) but by the statistics of real errors made by real users.
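Any of the ./distance sessions in this post can be replayed the same way; a sketch using the kv model loaded above (the query word anticipates the next session):

# Nearest neighbours by cosine similarity: the same computation
# the bundled ./distance tool performs.
for word, score in kv.most_similar('avito', topn=10):
    print(f'{word}\t{score:.6f}')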
Transliteration and wrong keyboard layout
./distance vectors.bin
Enter word or sentence (EXIT to break): avito
Word: avito Position in vocabulary: 1999
Word Cosine distance
- awito 0.693721
Avito 0.675299
fvito 0.661414
Avita 0.659454
irr 0.642429
Ovito 0.606189
avito 0.598056
./distance vectors.bin
Enter word or sentence (EXIT to break):
- psp 0.723194
pshchpdp 0.721070
pshpd 0.712373
software 0.704579
pschadu 0.695897
SSD 0.694641
Item 0.692646
The value of 0.681183
plate 0.660203
PGPD 0.653649
google 0.649897
by pop 0.647420
Item 0.643923
see 0.641619
Nftvuch 0.640587
photochip 0.631423
pschppu 0.620105
gogle 0.616396
shpp 0.612234
google 0.608240
Although transliteration and wrong-layout correction are solved by simpler and faster methods, it is still nice that Word2Vec manages not to fail here either.
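For reference, the "simpler and faster method" for the wrong-layout case is an ordinary character map between the QWERTY and ЙЦУКЕН layouts; a minimal sketch (lowercase letters only):

# Re-type a string entered in the Latin layout as if the Russian
# layout had been active (punctuation keys included where they
# map to Russian letters).
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
FIX = str.maketrans(EN, RU)

def fix_layout(text: str) -> str:
    return text.translate(FIX)

print(fix_layout('google'))  # -> пщщпду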
Site names and similar sites
./distance vectors.bin
Enter word or sentence (EXIT to break): google
- googol 0.850174
google 0.809912
Goggle 0.786360
google 0.760508
googl 0.734248
goog 0.731465
google 0.726011
Google 0.725497
mcgl 0.724901
gugul 0.722874
google 0.719596
google 0.719277
google 0.715329
Google 0.713950
Yandex 0.695366
google 0.690433
googl 0.669867
./distance vectors.bin
Enter word or sentence (EXIT to break): mail
- rambler 0.777771
meil 0.765292
inbox 0.745602
maill 0.741604
yandex 0.696301
maii 0.675455
myrambler 0.674704
zmail 0.657099
mefr 0.655842
jandex 0.655119
gmail 0.652458
vkmail 0.639919
Word clustering is the core function of Word2Vec, and as you can see, it works well.
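Such clusters can also be pulled out explicitly, for example with k-means over the word vectors. A sketch (the vocabulary slice, the cluster count, and the probe word 'google' are all arbitrary assumptions):

import numpy as np
from sklearn.cluster import KMeans

# word2vec stores the vocabulary in frequency order, so the head of
# the list covers the most popular words and their typo forms.
words = kv.index_to_key[:50000]
X = np.stack([kv[w] for w in words])
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit length: k-means ~ cosine

labels = KMeans(n_clusters=500, n_init=10).fit_predict(X)

# Show a few members of the cluster that contains 'google'.
target = labels[words.index('google')]
print([w for w, l in zip(words, labels) if l == target][:20])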
Semantically close words
./distance vectors.bin
Enter word or sentence (EXIT to break): coffee
- coffe 0.734483
tea 0.690234
tea 0.688656
cappuccino 0.666638
code 0.636362
cocoa 0.619801
espresso 0.599390
coffee shop 0.595211
chicory 0.594247
kofe 0.593993
Kopuchino 0.587324
chocolate 0.585655
cappuccino 0.580286
cardamom 0.566781
latte 0.563224
./distance vectors2.bin
Enter word or sentence (EXIT to break): coffee
- beans 0.757635
instant 0.709936
tea 0.709579
coffe 0.704036
mellanrost 0.694822
freeze-dried 0.694553
ground 0.690066
coffee 0.680409
tea 0.679867
decaffeinated 0.678563
cappuccino 0.677856
monoarabica 0.676757
freshly brewed 0.676544
decaf 0.674104
Gevalia 0.673163
soluble 0.659948
etiopia 0.657329
electric car 0.652837
The first listing is Word2Vec in skip-gram mode, that is, selecting words by their contexts; the second is Word2Vec in bag-of-words mode, selecting words together with their contexts. The first gives words interchangeable with "coffee"; the second gives words that characterize coffee. The second listing is especially useful once we start thinking about how to evaluate the importance of words in a query: which word is the main one, and which ones merely refine the query.
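For reference, the two modes correspond to the sg flag when training with gensim; a toy sketch on a stand-in corpus (the actual query log from this post is not public):

from gensim.models import Word2Vec

# A stand-in corpus of tokenized queries, just to show the two modes.
queries = [['buy', 'coffee', 'beans'],
           ['instant', 'coffee', 'price'],
           ['coffee', 'shop', 'moscow']]

skipgram = Word2Vec(queries, vector_size=256, sg=1, min_count=1)  # word -> context
cbow = Word2Vec(queries, vector_size=256, sg=0, min_count=1)      # context -> word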
Query clustering
./distance vectors2.bin
Enter word or sentence (EXIT to break): mobile phone
- cellular 0.811114
phone 0.776416
smartphone 0.730191
telfon 0.719766
mobile 0.717972
mobile phone 0.706131
phone 0.698894
Phone 0.695520
phone 0.693121
mobile 0.692854
teleon 0.688251
phones 0.685480
telefrn 0.674768
cellular 0.673612
A query of several words can be reduced to the single most characteristic word. Moreover, that word does not have to appear in the original query at all. Verbose queries can also be compared with each other directly, without an intermediate reduction to one word. Here we see query rephrasing and expansion in action.
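Both operations come down to the mean of the word vectors; a sketch with the kv model from above (English placeholder words, since the real vocabulary is Russian):

# The single word closest to a multi-word query:
# most_similar averages the normalised vectors of the positive words.
print(kv.most_similar(positive=['mobile', 'phone'], topn=1))

# Comparing two verbose queries directly: cosine between mean vectors.
print(kv.n_similarity(['buy', 'mobile', 'phone'],
                      ['cheap', 'smartphone', 'prices']))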
Semantic relations between words
The most interesting part of Google's Word2Vec write-up is how they turned a king into a queen with simple arithmetic on vectors. We did not manage to reproduce that trick on search queries (few people search for kings and queens these days), but some semantic relationships do stand out clearly.
Suppose we need to find a word that relates to Germany the way Paris relates to France.
./word-analogy vectors2.bin
Enter three words (EXIT to break): france paris germany
- Munich 0.716158
Berlin 0.671514
DĂĽsseldorf 0.665014
hamburg 0.661027
cologne 0.646897
amsterdam 0.641764
frankfurt 0.638686
Prague 0.612585
aschaffenburg 0.609068
Dresden 0.607926
Nuremberg 0.604550
crowd 0.604543
Gmunden 0.590301
./word-analogy vectors2.bin
Enter three words (EXIT to break): us dollar ukraine
- UAH 0.622719
Dolar 0.607078
hryvnia 0.597969
Ruble 0.596636
Dollar 0.588882
hryvnia 0.584129
Ruble 0.578501
ruble 0.574094
dollar 0.565995
tenge 0.561814
Dolar 0.561768
currencies 0.556239
$ 0.548859
UAH 0.544302
Impressive...
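The ./word-analogy tool computes the familiar vector arithmetic v(paris) - v(france) + v(germany) and searches for the nearest word; in gensim terms the sketch looks like this (placeholder words again):

# france : paris = germany : x
for word, score in kv.most_similar(positive=['paris', 'germany'],
                                   negative=['france'], topn=5):
    print(f'{word}\t{score:.6f}')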
Assessing the importance of words in a query
The evaluation principle is simple. Determine which cluster the query as a whole belongs to, then pick out the words that lie farthest from the center of that cluster. Those words are the main ones; the rest are qualifiers.
./importance vectors.bin
Enter word or sentence (EXIT to break): buy pizza in Moscow
Importance buy = 0.159387
Importance pizza = 1
Importance in = 0.403579
Importance Moscow = 0.455351
Enter word or sentence (EXIT to break): download twilight
Importance download = 0.311702
Importance twilight = 1
Enter word or sentence (EXIT to break): Vladimir Putin
Importance Vladimir = 0.28982
Importance Putin = 1
Enter word or sentence (EXIT to break): Nikita Putin
Importance Nikita = 0.793377
Importance Putin = 0.529835
For Putin, the word "Vladimir" is almost unimportant: almost every Putin found on the Internet is a Vladimir. With Nikita Putin it is the other way around: Nikita matters more, because the Nikitas have to be picked out from among all the Putins on the Internet.
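The ./importance tool is not part of the stock word2vec distribution, and its exact formula is not given here, so the following is only a guess at the principle described above: score each word by its distance from the query centroid. A hypothetical helper whose numbers will not match the sessions exactly:

import numpy as np

def importance(query_words, kv):
    # Unit-normalised vectors of the query words.
    vs = np.stack([kv[w] for w in query_words])
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)
    center = vs.mean(axis=0)
    center /= np.linalg.norm(center)
    # A word far from the query centroid carries the distinguishing
    # information; words near the center are predictable qualifiers.
    return {w: 1.0 - float(v @ center) for w, v in zip(query_words, vs)}

print(importance(['buy', 'pizza', 'in', 'moscow'], kv))  # English stand-ins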
Findings
There are few conclusions as such. The technology works, and works well. I hope this post will serve as a Russian-language illustration of the capabilities hidden inside Word2Vec.