
Word2Vec in examples

By the will of fate, a Word2Vec model trained on search queries came into my hands. Below the cut are examples of its use, with explanations.



What is Word2Vec?



Word2Vec is a technology from Google for the statistical processing of large amounts of text. W2V collects statistics on the co-occurrence of words in phrases, then uses neural-network methods to reduce the dimensionality and produce compact vector representations of words that reflect, as far as possible, the relationships between those words in the processed texts. I advise reading the original sources rather than relying on my muddled retelling of the technology.
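To make this concrete, here is a minimal, hypothetical sketch using the gensim library (not the model from this article): train on a toy corpus of query-like phrases and inspect the resulting vectors.

from gensim.models import Word2Vec

# Toy corpus of query-like phrases; purely illustrative.
sentences = [
    ["buy", "cheap", "phone"],
    ["buy", "cheap", "smartphone"],
    ["download", "free", "music"],
    ["download", "free", "movies"],
]

# sg=1 selects the skip-gram algorithm; sg=0 selects bag of words (CBOW).
model = Word2Vec(sentences, vector_size=256, window=2, min_count=1, sg=1)

print(model.wv["phone"][:5])                       # first elements of a 256-dim vector
print(model.wv.similarity("phone", "smartphone"))  # cosine similarity of two words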



Which Word2Vec was investigated?


For training, queries to a domestic (Russian) Internet search engine were taken, so they are for the most part Russian-language. The vector length is 256 elements; models are available for both the skip-gram and the bag-of-words algorithms. The total vocabulary is over 2.6 million words: all the words of the Russian language, many words from other languages, typos, names, and codes - in short, everything people could possibly search for.
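The transcripts below use the distance and word-analogy demo tools from the original word2vec distribution. The same binary format can also be read from Python; a hedged sketch (vectors.bin itself is not public, so substitute any word2vec-format file):

from gensim.models import KeyedVectors

# Load a binary word2vec model of the kind described in this post.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

print(kv.vector_size)  # 256 for the model described here
print(len(kv))         # vocabulary size; about 2.6 million words here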



Search-engine queries differ greatly in their characteristics from ordinary texts, and this gives the results a certain new quality.



Typos



./distance vectors.bin
Enter word or sentence (EXIT to break): switch

Word: switch  Position in vocabulary: 124515

Word          Cosine distance
adventure     0.748698
Adventures    0.726111
adventures    0.692828
adventure     0.670168
connection    0.666706
adventure     0.663286
connections   0.660438
adventure     0.659609




The query and its repeated-looking neighbors above are distinct Russian misspellings that the translation collapses into the same English words. We get not only a correction of the typo in the query, but a complete list of the typos people actually make, all gathered into a single cluster, which is convenient. The difference from spell-checker systems is fundamental: a typo is matched not by computing Levenshtein distance (the minimum number of edits needed to turn the erroneous form into the correct one), but by the statistics of real errors made by real users.
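For contrast, here is a minimal Levenshtein distance implementation of the kind a classic spell checker ranks candidates with (illustrative code, not from the article):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("google", "gogle"))  # 1

# A spell checker ranks candidates by this edit count; the Word2Vec cluster
# above instead ranks them by how real users actually mistype the word.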



Transliteration and the wrong keyboard layout



./distance vectors.bin
Enter word or sentence (EXIT to break): avito

Word: avito  Position in vocabulary: 1999

Word     Cosine distance
awito    0.693721
Avito    0.675299
fvito    0.661414
Avita    0.659454
irr      0.642429
Ovito    0.606189
avito    0.598056




./distance vectors.bin
Enter word or sentence (EXIT to break):

Word            Cosine distance
psp             0.723194
pshchpdp        0.721070
pshpd           0.712373
software        0.704579
pschadu         0.695897
SSD             0.694641
Item            0.692646
The value of    0.681183
plate           0.660203
PGPD            0.653649
google          0.649897
by pop          0.647420
Item            0.643923
see             0.641619
Nftvuch         0.640587
photochip       0.631423
pschppu         0.620105
gogle           0.616396
shpp            0.612234
google          0.608240




Although the tasks of transliteration and correcting a wrong keyboard layout are solved by simpler and faster methods, it is still nice that Word2Vec manages not to fail here.



Site names and similar sites



./distance vectors.bin
Enter word or sentence (EXIT to break): google

Word      Cosine distance
googol    0.850174
google    0.809912
Goggle    0.786360
google    0.760508
googl     0.734248
goog      0.731465
google    0.726011
Google    0.725497
mcgl      0.724901
gugul     0.722874
google    0.719596
google    0.719277
google    0.715329
Google    0.713950
Yandex    0.695366
google    0.690433
googl     0.669867




./distance vectors.bin
Enter word or sentence (EXIT to break): mail

Word        Cosine distance
rambler     0.777771
meil        0.765292
inbox       0.745602
maill       0.741604
yandex      0.696301
maii        0.675455
myrambler   0.674704
zmail       0.657099
mefr        0.655842
jandex      0.655119
gmail       0.652458
vkmail      0.639919




Word clustering is the main function of Word2Vec, and as you can see, it copes well. (The repeated google entries above are, again, distinct Cyrillic and Latin spellings flattened by the translation.)
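A hedged sketch of the same idea in Python: cluster a handful of words by their vectors with k-means (the word list is illustrative):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Illustrative word list: two search engines and two drinks.
words = ["google", "yandex", "coffee", "tea"]
X = np.stack([kv[w] for w in words])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
for word, label in zip(words, labels):
    print(label, word)  # search engines and drinks should land in separate clusters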



Semantically close words



./distance vectors.bin
Enter word or sentence (EXIT to break): coffee

Word          Cosine distance
coffe         0.734483
tea           0.690234
tea           0.688656
cappuccino    0.666638
code          0.636362
cocoa         0.619801
espresso      0.599390
coffee shop   0.595211
chicory       0.594247
kofe          0.593993
Kopuchino     0.587324
chocolate     0.585655
cappuccino    0.580286
cardamom      0.566781
latte         0.563224




./distance vectors2.bin
Enter word or sentence (EXIT to break): coffee

Word             Cosine distance
grains           0.757635
soluble          0.709936
tea              0.709579
coffe            0.704036
mellanrost       0.694822
sublimated       0.694553
ground           0.690066
coffee           0.680409
tea              0.679867
decaffeinated    0.678563
cappuccino       0.677856
monoarabica      0.676757
freshly brewed   0.676544
decaf            0.674104
Gevalia          0.673163
soluble          0.659948
etiopia          0.657329
electric car     0.652837




The first output is Word2Vec in skip-gram mode, that is, selecting words by their contexts; the second is Word2Vec in bag-of-words (CBOW) mode, selecting words together with their contexts. The first gives words interchangeable with coffee; the second gives words that characterize coffee. The second output becomes especially useful when we start thinking about how to evaluate the importance of words in a query: which word is the main one, and which ones merely refine the query.
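A hedged sketch of querying both models side by side (the file names follow the transcripts above; neither file is public):

from gensim.models import KeyedVectors

sg = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)     # skip-gram
cbow = KeyedVectors.load_word2vec_format("vectors2.bin", binary=True)  # bag of words

print(sg.most_similar("coffee", topn=5))    # interchangeable words: tea, cappuccino, ...
print(cbow.most_similar("coffee", topn=5))  # characterizing words: grains, soluble, ...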



Query clustering



./distance vectors2.bin
Enter word or sentence (EXIT to break): mobile phone

Word           Cosine distance
cellular       0.811114
phone          0.776416
smartphone     0.730191
telfon         0.719766
mobile         0.717972
mobile phone   0.706131
phone          0.698894
Phone          0.695520
phone          0.693121
mobile         0.692854
teleon         0.688251
phones         0.685480
telefrn        0.674768
cellular       0.673612




A query made up of several words can be reduced to a single most characteristic word. Moreover, that word does not even have to occur in the original query. Verbose queries can also be compared with each other directly, without an intermediate reduction to one word. Here we see query rephrasing and expansion in action.
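Both tricks are easy to sketch in Python: average the word vectors of a query, then either look up the nearest single word or compare two query vectors directly (hypothetical code, assuming the model above):

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors2.bin", binary=True)

def query_vector(query: str) -> np.ndarray:
    """Mean of the vectors of the query's in-vocabulary words."""
    return np.mean([kv[w] for w in query.split() if w in kv], axis=0)

# Reduce a query to its most characteristic words:
print(kv.similar_by_vector(query_vector("mobile phone"), topn=5))

# Compare two verbose queries directly, with no intermediate word:
a, b = query_vector("buy cheap phone"), query_vector("inexpensive smartphone")
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))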



Semantic relations between words



The most interesting part of Google's Word2Vec description is how they turned a king into a queen with simple arithmetic operations on vectors. We did not manage to pull off this trick on search queries - few people have been searching for kings and queens lately - but some semantic relations really do stand out.



We want to find a word that relates to Germany the way Paris relates to France.



./word-analogy vectors2.bin
Enter three words (EXIT to break): france paris germany

Word            Cosine distance
Munich          0.716158
Berlin          0.671514
Düsseldorf      0.665014
hamburg         0.661027
cologne         0.646897
amsterdam       0.641764
frankfurt       0.638686
Prague          0.612585
aschaffenburg   0.609068
Dresden         0.607926
Nuremberg       0.604550
crowd           0.604543
Gmunden         0.590301




./word-analogy vectors2.bin
Enter three words (EXIT to break): us dollar ukraine

Word         Cosine distance
UAH          0.622719
Dolar        0.607078
hryvnia      0.597969
Ruble        0.596636
Dollar       0.588882
hryvnia      0.584129
Ruble        0.578501
ruble        0.574094
dollar       0.565995
tenge        0.561814
Dolar        0.561768
currencies   0.556239
$            0.548859
UAH          0.544302




Impressive ...
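The same analogy can be reproduced with gensim's built-in vector arithmetic; a hedged sketch (paris - france + germany should land near German cities):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors2.bin", binary=True)

# king - man + woman = queen, restated as: paris - france + germany = ?
print(kv.most_similar(positive=["paris", "germany"], negative=["france"], topn=5))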



Assessing the importance of words in the query



The evaluation principle is simple. Determine which cluster the query as a whole belongs to, then pick the words that lie as far as possible from the center of that cluster. Those words are the main ones, and the rest are qualifiers.
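The ./importance tool below is the author's own, not part of the word2vec distribution. Here is a rough, hypothetical reconstruction of the stated principle: approximate the cluster center by the mean query vector and treat words farther from that center as more important.

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors2.bin", binary=True)

def importance(query: str) -> dict:
    """Score each word by its cosine distance from the query's mean vector."""
    words = [w for w in query.split() if w in kv]
    center = np.mean([kv[w] for w in words], axis=0)
    center /= np.linalg.norm(center)
    return {
        w: 1.0 - float(np.dot(kv[w] / np.linalg.norm(kv[w]), center))
        for w in words
    }

print(importance("buy pizza in moscow"))  # the topical words should score highest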



./importance vectors.bin

Enter word or sentence (EXIT to break): buy pizza in Moscow
Importance buy = 0.159387
Importance pizza = 1
Importance in = 0.403579
Importance Moscow = 0.455351

Enter word or sentence (EXIT to break): download twilight
Importance download = 0.311702
Importance twilight = 1

Enter word or sentence (EXIT to break): Vladimir Putin
Importance Vladimir = 0.28982
Importance Putin = 1

Enter word or sentence (EXIT to break): Nikita Putin
Importance Nikita = 0.793377
Importance Putin = 0.529835




For Putin, the word Vladimir is almost unimportant: nearly every Putin found on the Internet is a Vladimir. With Nikita Putin it is the other way around - Nikita is the more important word, because the Nikitas have to be picked out of all the Putins on the Internet.



Conclusions



As such, there are few conclusions here: the technology works, and works well. I hope this post will serve as an illustration, in Russian, of the features hidden in Word2Vec.

Source: https://habr.com/ru/post/249215/


