
New Year dataset: open semantics of the Russian language

New Year is a time of miracles and gifts. The main miracle that nature has given us, of course, is natural language and human speech. And we, in turn, want to make a New Year's gift to all researchers of this phenomenon and share data on the open semantics of the Russian language.

In this article we will allow ourselves a small digression on the topic of meaning, explain how we came to the need for open semantic markup, and describe the current results and future directions of this large undertaking. And, of course, we will give a link to the dataset, which you can download and use for your own experiments and research.

TL;DR


In the Habr article "Teach the bot! - markup of emotions and semantics of the Russian language" we described the start of a large effort to create open semantic markup for the Russian language. Now we have the first results and want to share them with the community.
First of all, we focused on marking up objects of the material world, as well as the emotional and evaluative coloring of words and expressions of the Russian language. These are the two areas of semantics that are the most valuable from a practical point of view and the most tractable from the point of view of markup.

Link to GitHub: open semantics of the Russian language (dataset).

About semantics and meanings


Semantics, the science of meaning, is generally recognized as one of the most difficult branches of linguistics. This is not surprising, given that even the concept of meaning itself is not so easy to define. (Try to explain in simple terms what meaning is.)

The text that we analyze with computational methods is devoid of these very meanings. That is, the text sets out a certain outline, but the actual meaning materializes only at the moment a person reads the text, when our brain forms within itself a thought-image or thought-concept of what has been written.

The significance of a word is a fiction, an empty sound, if it does not rest on something that is not merely a relation. We can reason about the significance of the word "mouton" (French for both "ram" and "mutton"), on the one hand, and of the words "ram" and "mutton", on the other, only insofar as we know what is actually being discussed, that is, to what segment of extra-linguistic reality these words refer.

Morkovkin V.V. Ideographic Dictionaries. Moscow: MGU Publishing House, 1970.

This presents an insurmountable difficulty for machines, since they do not possess any interpreter of human language and can effectively solve only those tasks that do not require interpretation and can be solved at the level of the text itself and the statistics computed on top of it.

NB: Strictly speaking, when you apply machine learning on top of an annotated corpus of texts (for example, annotated by sentiment), that markup is your semantics for this particular task. The problem is the low resolution of such markup and the impossibility of generalizing the acquired "knowledge" and applying it to a fundamentally new task.

Let us illustrate. To a computer, our language looks like this:

Ezo Kmetdzafpayez tpya rashchila Lemagmeshiruyu Dpozhlodz, zn. Oli Gift Do noey p reme le ovpataez ilzemkmezazomor hepofehednoso yagyna and rozhez ebbenziflo meschaz pisch se Ghats, Nozomi le zmevuyuz ilzemkmezatsii and meschaery la umofle zendza and fyhidpellych kofemch forest dzazidzin.

Ego klebshgafryag brya nasimi melachsinuy shirohmoshg, g. shan omi E f Toei Nele IU odrabaeg imgelklegagolon zherofezheshtopo yachyta and noheg evvetgifmo lesag ris n chabazhi, Togola IU gleduyug imgelklegatsii and forest ma ulofme getshga and fyzhishremmyz kofelz mepo shgagishgit.

(Two scrambled variants of an earlier paragraph: the vowels are preserved, while the consonants are shuffled within groups of similar letters.)

Over the fragment above you can still compute various statistics: co-occurrences, n-grams, systems of endings, and so on. It is possible to construct an algorithm that, relying on this extracted statistical information, will imitate a question-answering system, i.e. find the sentences in the text most similar to the question or even compose an answer from several fragments. With a large amount of data and a well-built model, such a system can imitate a person rather well.
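
To make the point concrete, here is a minimal Python sketch (our own illustration, not part of the dataset): character n-gram counts can be computed over the scrambled fragment above just as easily as over meaningful text, because the statistics require no interpretation.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased string."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

# the scrambled fragment means nothing, but its statistics are perfectly computable
garbled = "Ezo Kmetdzafpayez tpya rashchila Lemagmeshiruyu Dpozhlodz"
print(char_ngrams(garbled).most_common(5))
```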

But real work with meaning, when it is necessary to operate with extra-linguistic knowledge of the world, for example to answer questions that require inference, is hardly feasible within a purely statistical paradigm.

The essence of our work is to create a simplified model of the surrounding world and to mark up the language from the point of view of this model, that is, to try to bind elements of the language to extra-linguistic reality.

NB: To be fair, it is worth noting that people have been interested in grouping vocabulary by semantic similarity since ancient Rome. If you would like to look into the history of the question, we recommend V.V. Morkovkin's book "Ideographic Dictionaries", whose second chapter gives a detailed historical overview.

What we do: philosophy


The worlds in which people live are incredibly complex and diverse. Especially the world that exists in our heads - emotions, feelings and experiences, abstract concepts, mentality, ethics and morality.

Entire teams of eminent scientists have been working on the semantics of the intangible spheres for many years. We deliberately stay out of these areas. For now. More precisely, we do touch them, but not as thoroughly and not in the first place.

Our focus is mainly on the material world and on the small piece of the intangible world that concerns evaluations and emotions. This is primarily because most NLU applications lie in these areas, which makes them the most interesting from a practical point of view. Second, you need to start with something simpler and less ambiguous, and in this light the choice of the material sphere is entirely justified.

The sphere of emotions is, of course, part of the non-material world, but it is hard to find a more important aspect of the human psyche. Moreover, it is directly related to a useful practical task: sentiment analysis of text. In addition, written language largely lacks explicit information about emotions. For example, the contexts of words with polar emotional charge are often highly symmetrical, and purely statistical methods fail to distinguish words with positive and negative emotional coloring.
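
As a toy illustration of that symmetry (our own sketch, not taken from the article's experiments), the co-occurrence vectors of two words with opposite polarity, built from near-identical contexts, end up almost parallel:

```python
import numpy as np

# tiny corpus where a positive and a negative word share almost the same contexts
sentences = [
    "the film was great and I recommend it",
    "the film was awful and I regret it",
    "the service here is great today",
    "the service here is awful today",
]
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

def context_vector(target: str) -> np.ndarray:
    """Count how often each vocabulary word co-occurs with the target in a sentence."""
    vec = np.zeros(len(vocab))
    for s in sentences:
        words = s.split()
        if target in words:
            for w in words:
                if w != target:
                    vec[index[w]] += 1
    return vec

a, b = context_vector("great"), context_vector("awful")
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # close to 1.0 despite opposite sentiment
```

Explicit evaluative markup is exactly the signal that such statistics cannot recover on their own.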

What we do: specifics


We divide all words into two large classes: physical objects/phenomena and everything else. The latter part is set aside for now; it will interest us later.

We divide physical entities into four large classes: living, places, objects and substances.

Weather and food occupy a somewhat separate place in human consciousness: strictly speaking, they do not fit neatly into any of the previous classes. Accordingly, it makes sense to single them out as separate classes.

The second large part of the first stage of our work is marking up the emotional-evaluative component of language signs. Here all entities (material and non-material) are divided into three classes: positive, negative and neutral. For the polar classes, the strength of the evaluative charge is also rated. Evaluations, however, are a topic for a separate large conversation; they are too elusive and subtle, but even here human ingenuity can find a way out.
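
As a hedged sketch of how such markup could be represented in code (the field names and example entries below are our own assumptions for illustration, not the actual format used in the repository):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SemanticEntry:
    word: str                      # the language unit being marked up
    is_material: bool              # physical object/phenomenon vs. everything else
    material_class: Optional[str]  # "living", "place", "object", "substance", "weather", "food"
    sentiment: str                 # "positive", "negative" or "neutral"
    sentiment_strength: int = 0    # evaluative charge, meaningful only for the polar classes

# hypothetical example entries, not taken from the dataset itself
entries = [
    SemanticEntry("котёнок", True, "living", "positive", 2),   # kitten
    SemanticEntry("грязь", True, "substance", "negative", 1),  # dirt
    SemanticEntry("стол", True, "object", "neutral"),          # table
]

print([e.word for e in entries if e.material_class == "living"])
```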

Two key principles (assumptions)


The two key principles that we follow when marking up are a naive picture of the world and the disregard of context.

The world that surrounds us may change depending on our knowledge of it. More precisely, the world most likely stays the same, but our perception of it, and accordingly our system for classifying objects and phenomena, is a flexible thing. For example, we are surprised to learn that a watermelon is a berry by biological classification. What kind of berry could it be, one would think, when it is so huge? And a tomato in a number of scientific systems is a fruit, which matches neither our everyday view of it nor the way it is laid out in a grocery store display. Nevertheless, it is important for us to capture precisely this everyday, or naive, picture of the world.

The second important markup principle is the disregard of context. Units of language are considered apart from the flow of speech and their natural environment, in some averaged, most frequent and obvious sense. Sometimes this backfires. For example, the word "minus" can be completely neutral if interpreted as an arithmetic operation, but as a synonym for "drawback" it acquires a negative connotation. In general, though, if you build your system correctly and do not ignore the laws of statistics, such roughness should be smoothed out at the level of machine learning methods.

Giving up context is, to say the least, a controversial decision. But at the first stage it was important to do so, for three reasons. First, taking context into account significantly increases the complexity and volume of the markup with entirely unclear benefits. Second, there is the perennial question of how to record context in machine-readable form and how to bind a token to a specific sense when using the data. Third, each sense of a word has its own frequency of use, which also varies across topics; this parameter is recorded in explanatory dictionaries only as a terse "rare" label for infrequently used senses and is completely inaccessible to the machine.

Our decision, made as engineers rather than as scientists, was justified by the results of the first experiments: machine learning methods do indeed, in most cases, compensate for the averaged estimates across different contexts.
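
A minimal sketch of that engineering assumption (the word scores below are invented for illustration and are not taken from the dataset): even if an ambiguous word like "минус" carries a single averaged, context-free score, sentence-level features built from many words tend to absorb the error.

```python
# invented context-free word scores on a scale from -2 to 2
context_free_scores = {
    "минус": -1,      # averaged over "arithmetic minus" (neutral) and "drawback" (negative)
    "большой": 0,     # big
    "удобный": 2,     # convenient
    "медленный": -2,  # slow
    "экран": 0,       # screen
}

def sentence_score(sentence: str) -> float:
    """Average the context-free word scores over a sentence."""
    words = sentence.lower().split()
    return sum(context_free_scores.get(w, 0) for w in words) / max(len(words), 1)

print(sentence_score("большой удобный экран"))  # positive overall
print(sentence_score("минус медленный экран"))  # negative overall; the ambiguity of "минус" is absorbed
```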

Additional markup


At the first stage of work we tried to cover what seem to us the most important areas from a practical point of view: the material world and the emotional-evaluative component of language signs. In parallel with the main direction of the markup, we also launched several experimental sections that should allow more meaningful planning of further work.


We will not go into details yet; a more detailed description is in the repository.

Future plans


In the very near future we plan to launch work in the following areas:


But our world is not limited to the material sphere, and there are more distant plans as well:


Experimental ideas, or what can be done with the dataset


Traditionally, we not only share the data but also suggest ready-made ideas for experiments and research directions that seemed to us worthy of attention.


Interesting results can be obtained by combining the semantics dataset with the associations dataset (located in the same repository). We are already doing this to refine the sentiment markup; that dataset sits alongside it in the repository at the link.
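
For example, here is a hedged sketch of such a combination (the file and column names below are assumptions for illustration; the actual layout should be checked in the repository README):

```python
import csv

def load_csv(path: str, key: str) -> dict:
    """Load a CSV file into a dict keyed by one of its columns."""
    with open(path, encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}

# hypothetical file and column names
semantics = load_csv("semantics.csv", key="word")
associations = load_csv("associations.csv", key="stimulus")

# e.g. propagate the evaluative label of a stimulus word onto its associates
for word, sem in semantics.items():
    assoc = associations.get(word)
    if assoc:
        print(word, sem.get("sentiment"), "->", assoc.get("response"))
```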

A use case from each of you, a benefit for the dataset


Recall and describe in the comments any case where you needed explicit semantic markup in your work but it was not at hand. This will give us valuable food for thought for the further development of the dataset.

Download link and license


Dataset: open semantics of the Russian language

The dataset is licensed under CC BY-NC-SA 4.0.

Source: https://habr.com/ru/post/344582/

