
Pies in distributional semantics

For several months I had been looking curiously at distributional semantics: I got acquainted with the theory, learned about word2vec, found a suitable Python library (gensim), and even got hold of a model of lexical vectors built from the Russian National Corpus. However, for a proper creative dive into the material I lacked data close to my heart that would be interesting to run through distributional semantics. At the same time I was enthusiastically reading poem-pies (a kind of cross between cheeky chastushkas and profound haiku); some I even memorized and recited to acquaintances when the occasion arose. And finally enthusiasm and curiosity found each other and gave birth, somewhere in the associative depths of consciousness, to an inspiring idea: why not combine the pleasant with the useful and put together a small “poetic” search engine out of the materials at hand.
from false conclusions
we can derive the truth
about how to multiply
two negative numbers

The “poetry” of the search was supposed to come from the innate ability of distributional vectors to express the degree of semantic similarity between lexemes as a real number (the smaller the angle between two word vectors, the more likely the words are close in meaning: the cosine measure, a classic of the genre). For example, "princess" and "shepherd" are much less close than "shepherd" and "sheep": 0.139 versus 0.603, which is probably logical, since the vectors of the national corpus reflect harsh reality rather than the fairy-tale world of H. C. Andersen. The method of calculating the degree of correlation (diffusion) between a query and a pie suggested itself almost on its own (cheap and cheerful): a normalized sum of the similarities of each word from list X with each word from list Y (stop words are thrown out, all the rest are reduced to normal form, more on that below).

Semantic Diffusion Calculation Code
def semantic_similarity(bag1, bag2: list, w2v_model, unknown_coef=0.0) -> float:
    sim_sum = 0.0
    for i in range(len(bag1)):
        for j in range(len(bag2)):
            try:
                sim_sum += w2v_model.similarity(bag1[i], bag2[j])
            except Exception:
                sim_sum += unknown_coef
    return sim_sum / (len(bag1) * len(bag2))

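A quick usage sketch (the Russian lemmas and the w2v_model variable here are illustrative assumptions; the model works with POS-tagged lemmas of the kind produced by canonize_words further below):

    # w2v_model is assumed to be a loaded gensim word2vec model over POS-tagged Russian lemmas.
    print(w2v_model.similarity('принцесса_S', 'пастух_S'))  # ~0.139, per the figures above
    print(w2v_model.similarity('пастух_S', 'овца_S'))       # ~0.603
    # The same pairwise idea averaged over two bags of words:
    print(semantic_similarity(['музыка_S'], ['гитара_S', 'петь_V'], w2v_model))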

The results of the poetic search both pleased and amused me. For example, the following list of pies was returned for the query "music":

[The Russian pie texts were stripped in this translation; the five pies returned had semantic diffusion scores of 0.2543, 0.1988, 0.1910, 0.1529 and 0.1469.]

It is noteworthy that the word "music" does not occur in a single pie in the database. Nevertheless, the associations of all the returned pies are distinctly musical, and their degree of semantic diffusion with the query is rather high.
Now, in order, about the work that was done (source code on GitHub).

Libraries and Resources

The toolkit: Python 3, gensim (a word2vec implementation for Python), pymorphy2 (morphological analysis and normalization of Russian words), and a word2vec model of lexical vectors built from the Russian National Corpus.

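A minimal loading sketch (the helper name load_w2v_model and the constant WORD2VEC_MODEL_FILE mirror the names used in the code below, but the file name and the exact gensim call here are my assumptions, not necessarily what the project does):

    import gensim

    # Hypothetical file name; RusVectores-style models for the Russian National Corpus
    # are distributed as word2vec binaries with POS-tagged lemmas such as 'музыка_S'.
    WORD2VEC_MODEL_FILE = 'ruscorpora.model.bin'

    def load_w2v_model(file_name: str):
        # In modern gensim the loader lives on KeyedVectors; older gensim (0.x) exposed
        # the same load_word2vec_format method on gensim.models.Word2Vec.
        return gensim.models.KeyedVectors.load_word2vec_format(file_name, binary=True)
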
Data model formation


oleg presented in text form
everything that oksana says
broke it into chapters and paragraphs
and into separate words

First of all, a list of strings is cut out of a text file (poems.txt) that contains the poem-pies (about eight hundred of them), one pie per string. Then a bag of words is squeezed out of each pie string: every word is reduced to its normal grammatical form and noise words are kicked out. After that, the semantic "density" of the pie (its intra-diffusion) is calculated for each bag, and an associative list is formed for it (it helps to understand which semantic layer prevails from the point of view of the distributional vector model). All this pleasure is put into a dictionary under the corresponding keys and written to a file in JSON format.
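
The helpers read_poems and make_bags used in the data-model code further down are not listed in the article; a minimal sketch of what they might look like (the assumption that pies are separated by blank lines in poems.txt is mine, and canonize_words is the function shown right below):

    # Hypothetical helpers; the actual implementations live in the project's source on GitHub.

    def read_poems(file_name: str) -> list:
        # Assumption: pies in poems.txt are separated by blank lines.
        with open(file_name, encoding='utf-8') as f:
            text = f.read()
        return [' '.join(block.split()) for block in text.split('\n\n') if block.strip()]

    def make_bags(poems: list) -> tuple:
        # A bag of normalized, noise-free lemmas per pie, plus the overall vocabulary.
        bags = [canonize_words(poem.split()) for poem in poems]
        vocabulary = sorted({word for bag in bags for word in bag})
        return bags, vocabulary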

Bag-of-words squeezing code
def canonize_words(words: list) -> list:
    # The Russian stop-word list was stripped in this translation of the article;
    # in the original it contained sixteen common noise words.
    stop_words = ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')
    grammars = {'NOUN': '_S', 'VERB': '_V', 'INFN': '_V', 'GRND': '_V', 'PRTF': '_V',
                'PRTS': '_V', 'ADJF': '_A', 'ADJS': '_A', 'ADVB': '_ADV', 'PRED': '_PRAEDIC'}
    morph = pymorphy2.MorphAnalyzer()
    normalized = []
    for i in words:
        forms = morph.parse(i)
        try:
            # take the most probable parse
            form = max(forms, key=lambda x: (x.score, x.methods_stack[0][2]))
        except Exception:
            form = forms[0]
        # print(form)  # debug output
        if not (form.tag.POS in ['PREP', 'CONJ', 'PRCL', 'NPRO', 'NUMR']
                or 'Name' in form.tag
                or 'UNKN' in form.tag
                or form.normal_form in stop_words):  # 'ADJF'
            normalized.append(form.normal_form + grammars.get(form.tag.POS, ''))
    return normalized


Data Model Formation Code
def make_data_model(file_name: str) -> dict:
    poems = read_poems(file_name)
    bags, voc = make_bags(poems)
    w2v_model = sem.load_w2v_model(sem.WORD2VEC_MODEL_FILE)
    sd = [sem.semantic_density(bag, w2v_model, unknown_coef=-0.001) for bag in bags]
    sa = [sem.semantic_association(bag, w2v_model) for bag in bags]
    rates = [0.0 for _ in range(len(poems))]
    return {'poems':        poems,
            'bags':         bags,
            'vocabulary':   voc,
            'density':      sd,
            'associations': sa,
            'rates':        rates}

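The semantic_density and semantic_association functions are not listed in the article either; a plausible sketch, based on how density (intra-diffusion of a bag) and the associative list are described above (the details, such as excluding a word's similarity with itself, are my assumptions):

    def semantic_density(bag: list, w2v_model, unknown_coef=0.0) -> float:
        # Intra-diffusion of a pie: average similarity over distinct word pairs in its bag.
        if len(bag) < 2:
            return 0.0
        sim_sum, pairs = 0.0, 0
        for i in range(len(bag)):
            for j in range(i + 1, len(bag)):
                pairs += 1
                try:
                    sim_sum += w2v_model.similarity(bag[i], bag[j])
                except Exception:
                    sim_sum += unknown_coef
        return sim_sum / pairs

    def semantic_association(bag: list, w2v_model, topn=10) -> list:
        # Words of the model closest to the bag as a whole (unknown words are skipped).
        known = [w for w in bag if w in w2v_model]
        if not known:
            return []
        return [word for word, _ in w2v_model.most_similar(positive=known, topn=topn)]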

General analysis of the model


The most "dense" pie:
kill offense anger and lust
pride envy and longing
and what's left of you
out of mercy finish

Density, squeeze, associative list
Density: 0.1631. Squeeze (bag of words): ten lemmas with the part-of-speech tags V, S, S, S, S, S, S, V, S, V. Associative list: ten lemmas, all nouns (tagged _S). The Russian lemmas themselves were stripped in this translation; only the tag suffixes remain.


As expected, the top of the "dense" list is made up of pies containing many words that are close in meaning: synonyms, antonyms (explicit and not so explicit), and words that are easily generalized into a single category (in this case, emotions).

The most "loose" pie:
on this vessel we will be saved
don't be my ancestor
pass the vessel do not be fooled
patient

Density, squeeze, associative list
Density: -0.0238. Squeeze (bag of words): six lemmas tagged S, V, S, V, V, A. Associative list: ten lemmas tagged S, V, S, S, V, S, S, S, S, S. As above, the Russian lemmas were stripped in this translation.


Here the negative density, as it seems to me, is largely due to the fact that the hospital sense of the word "vessel" (the Russian "судно" means both "ship" and "bedpan") did not make it into the associative list. This is generally one of the weak points of distributional-semantic models: as a rule, the most popular sense of a lexeme suppresses all the others.

Searching the model


How the search works was, in essence, described above: the pies are simply sorted by the level of their diffusion (generalized semantic similarity) with the words of the query.

Code
def similar_poems_idx(poem: str, poem_model, w2v_model, topn=5) -> list:
    poem_bag = dm.canonize_words(poem.split())
    similars = [(i, sem.semantic_similarity(poem_bag, bag, w2v_model))
                for i, bag in enumerate(poem_model['bags'])]
    similars.sort(key=lambda x: x[1], reverse=True)
    return similars[:topn]

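The similar_poems function used in the sessions below is not listed in the article; presumably it wraps similar_poems_idx and substitutes the pie texts for the indices. A sketch (the exact return format is my guess):

    def similar_poems(poem: str, poem_model, w2v_model, topn=5) -> list:
        # Replace each index with the corresponding pie text, keeping the similarity score.
        return [(poem_model['poems'][i], sim)
                for i, sim in similar_poems_idx(poem, poem_model, w2v_model, topn)]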

A few examples:

Consciousness
[The Russian query string and pie texts were stripped in this translation; the top five pies had semantic diffusion scores of 0.1368, 0.1337, 0.1273, 0.1242 and 0.1191.]


Free will
[The Russian query string and pie texts were stripped in this translation; the top five pies had semantic diffusion scores of 0.1219, 0.1067, 0.1016, 0.1014 and 0.0989.]


Winter
[The Russian query string and pie texts were stripped in this translation; the top five pies had semantic diffusion scores of 0.1876, 0.1855, 0.1648, 0.1467 and 0.1325.]


Obviously, a search through the prism of other distributional-semantic models (different corpora, learning algorithms, vector dimensionalities) will give different results. On the whole, the technology works, and works quite satisfactorily. Fuzzy semantic search is implemented easily and naturally (at least on a relatively small amount of data). In the future, if my hands get around to it and, more importantly, my head catches up, I plan to implement rating estimation for the pies based on a training sample. To begin with, a simple weighted sum (with semantic diffusion acting as the weighting factor); then, perhaps, something useful from machine learning.
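
One way to read the planned "simple weighted sum" (my interpretation, not the author's code) is to estimate the rating of an unrated pie from the already rated ones, weighting each known rating by its semantic diffusion with the pie in question:

    def estimate_rate(poem_idx: int, poem_model, w2v_model) -> float:
        # Hypothetical sketch: a similarity-weighted average of the rated pies.
        bag = poem_model['bags'][poem_idx]
        weighted_sum, weight_total = 0.0, 0.0
        for i, rate in enumerate(poem_model['rates']):
            if i == poem_idx or rate == 0.0:  # skip the pie itself and unrated pies
                continue
            w = sem.semantic_similarity(bag, poem_model['bags'][i], w2v_model)
            weighted_sum += w * rate
            weight_total += abs(w)
        return weighted_sum / weight_total if weight_total else 0.0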

Source: https://habr.com/ru/post/275913/

