Recently, methods of distributional semantics have become a popular way to assess semantic similarity. These approaches perform well in a number of practical tasks, but they also have strict limitations. For example, the language contexts of emotionally polar words are very similar, so from word2vec's point of view antonyms often look like similar words. Also, word2vec is fundamentally symmetric: it is built on the co-occurrence of words in text, and the popular similarity measure between vectors, cosine distance, is likewise independent of the order of its operands.
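To make the symmetry point concrete, here is a minimal sketch (with made-up toy vectors, not a real word2vec model) showing that cosine similarity returns the same value regardless of operand order:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; symmetric in its arguments."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for word embeddings (invented numbers):
# antonyms often end up with close vectors because their contexts are similar.
good = [0.9, 0.1, 0.3]
bad = [0.8, 0.2, 0.4]

# cos(a, b) == cos(b, a): the measure cannot encode direction or word order.
assert cosine_similarity(good, bad) == cosine_similarity(bad, good)
```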
We want to share with the community the dataset of associations we have collected for words and expressions of the Russian language. This dataset is free of the drawbacks of distributional methods: associations preserve emotional polarity well and are asymmetric by nature. Read on for the details.
Why does distributional semantics "not see" part of the picture of the world?
Written language is highly compressed information. To unpack it and grasp its essence, we bring in additional resources: common sense, our knowledge of the world, cultural context. If some of this information is not available to you (for example, you have just joined a new social circle or are diving into a new subject area), you are forced to fill the gaps in your knowledge by asking questions or studying additional sources.
A computer is (so far) deprived of this opportunity to learn. It is therefore important for an NLP developer to understand that part of the useful information about the world is not, and cannot be, in the text itself. It has to be collected and connected separately.
What are associations?
Everyone played this game as a child: one person says a word, their neighbor offers an association, then someone comes up with an association to that association, and so on. Often it is interesting not only to hear another person's association, but also to understand their train of thought, how they arrived at this or that word. This gives a little insight into how we think.
You can also look at it another way. Living people hold the most relevant and uncompressed information about the world and language; our remarkable ability to resolve linguistic ambiguity is tied to it. Any language model is a cross-section of this information, with inevitable losses. Distributional models give one cross-section; associations let us look from a different angle. Perhaps the path to a somewhat fuller picture of language lies in using both.
TL;DR, or just give me the dataset link
The dataset we want to share with the community is precisely such a base of associations. Below we describe the features of the data, but if you can't wait, feel free to scroll down and head to GitHub to download the database.
Asymmetry of the association matrix
Another annoying feature of distributional models is their symmetry. That is, CHAIR and FURNITURE will come out similar, but how do the words relate to each other? Clustering over the vectors helps a little, but this information is simply not present in the original model.

Associations are asymmetric. For example, the word LIME has a strong association FRUIT. But the reverse does not hold: if FRUIT is associated with LIME at all, it is far from the first association. This has to do both with the generalizing role of the word FRUIT in the language and with the actual cultural context of people living in Russia.

Accordingly, this mirror asymmetry and its quantitative expression are interesting attributes of associations, and they favorably distinguish associations from purely statistical tools such as distributional models.
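The LIME/FRUIT example can be sketched as a lookup over an association table. The data structure and all numbers below are invented for illustration and are not the actual schema or frequencies of the dataset:

```python
# Hypothetical fragment of an association table:
# associations[stimulus][response] = share of respondents giving that response.
associations = {
    "LIME": {"FRUIT": 0.42, "GREEN": 0.18, "SOUR": 0.11},
    "FRUIT": {"APPLE": 0.35, "ORANGE": 0.20, "LIME": 0.02},
}

def strength(stimulus, response):
    """Relative frequency of a response to a stimulus (0 if absent)."""
    return associations.get(stimulus, {}).get(response, 0.0)

# The matrix is asymmetric: LIME -> FRUIT is strong, FRUIT -> LIME is weak.
assert strength("LIME", "FRUIT") > strength("FRUIT", "LIME")
```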
What can be done with the dataset
We see the ultimate goal of all language research as teaching a computer to understand language at a human level. This does not necessarily imply that the machine can think (whatever we put into that concept); rather, it should skillfully emulate how a person works with language.
We would like to hope that additional sources of information, of which there are not many for Russian, will help scientists and researchers advance along this path. Below we suggest several research directions that seemed quite interesting to us:
- Implement an assoc2vec algorithm, taking the ideas of GloVe as a basis and replacing contextual co-occurrence with associative co-occurrence.
- Cluster associations within each individual stimulus word, or within the dataset as a whole, for example to identify clusters corresponding to individual word senses.
- Investigate the possibility of automatically building a thesaurus of the Russian language. (Note: unlike contexts, the association matrix is asymmetric.)
- Use slices of associations by gender as material for sociological research.
- Build an interesting visualization of the associations and the links within the dataset itself. (For example, a map of the various paths between associations.)
- Investigate the nature of the symmetry/asymmetry of the relative frequencies of mirror association pairs.
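As an illustration of the last idea, here is one possible way to quantify the asymmetry of a mirror pair. The metric and the frequencies are our own hypothetical choices, not part of the dataset:

```python
def asymmetry(freq_ab, freq_ba):
    """Normalized asymmetry in [0, 1]:
    0 = perfectly mirrored pair (equal frequencies in both directions),
    1 = the association exists in one direction only."""
    total = freq_ab + freq_ba
    if total == 0:
        return 0.0
    return abs(freq_ab - freq_ba) / total

# Hypothetical relative frequencies for two mirror pairs.
print(asymmetry(0.42, 0.02))  # LIME -> FRUIT vs FRUIT -> LIME: strongly one-sided
print(asymmetry(0.10, 0.10))  # a perfectly mirrored pair
```

Sorting all mirror pairs by such a score would surface both the most "directional" pairs (generalizing words like FRUIT) and the most symmetric ones.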
These are just a few ideas; in reality there can be many more. Come up with your own experiments and be sure to share the results on Habr, or even in scientific journals.
Download link and license
Dataset: associations to words and expressions of the Russian language.

The dataset is licensed under CC BY-NC-SA 4.0.