Teach the bot! - markup of emotions and semantics of the Russian language

Visions of a bright robotic future are raining down on us from all sides. Or of a not so bright one, in the spirit of The Matrix and The Terminator. Indeed, machines already cope confidently with translation, recognize faces and everyday objects no worse and much faster than people do, and are learning to understand and synthesize speech. Cool? That is putting it mildly!

But the matter is seriously complicated by the fact that computers have still not learned to find their way around our world. Everything they do so well, they do by analogy, without getting to the essence or burdening themselves with the meaning of what is happening. Maybe that is for the best: we will live a little longer without being enslaved by a soulless tribe of machines.

But curiosity pushes us toward risky steps, namely attempts to acquaint the computer with our world, including the inner one: feelings, emotions, and experiences.
Read on to find out how we plan to level up the machine mind, teach it emotions, feelings, and value judgments, and where you can download the labeled data.

I don't want to read, show me the result!

You can start training the bot right away by following the link: Teach the bot!

If you enjoy answering, create your own Map and your results will be saved.

Limitations of Distributional Semantics

[Image: meme about distributional semantics, word2vec, a robot, and coffee]

What exactly is the problem with machine comprehension of text? After all, a machine can study our entire written cultural heritage and learn everything from it. The output of word2vec explains the problem better than words can.

For the lexeme "man":

woman 0.650
married 0.594
middle aged 0.542
anti-man 0.538
...
pregnant 0.519
not bearing 0.516
girl 0.498
...

Or for the word "hot":

warm 0.510
...
cold 0.498
cool down 0.486
roast 0.467
...

And for the highly positive emotion "delight":

admiration 0.715
...
outrage 0.609
fury 0.597
horror 0.586
despair 0.584
...
trembling 0.531
confusion 0.523
confusion 0.522
...
rage 0.472
...

Or for the broad concept of "technology":

...
technology 0.569
art 0.451
skill 0.410
...
aircraft construction 0.393
industry 0.392
medicine 0.379
craft 0.375
...
industry 0.370
...
knowledge 0.360
science 0.358
...

These examples show clearly how much information context provides. Quite a lot, but clearly not enough to separate antonyms, to tell part-whole relations from general-specific ones, or to distinguish vertical links from horizontal ones.
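The effect is easy to reproduce on toy data. What follows is a minimal Python sketch, not the pipeline behind the lists above: the three-dimensional vectors are invented for illustration (real word2vec vectors have hundreds of dimensions), and the helper names cosine and nearest are hypothetical. It only shows why words that share contexts, antonyms included, end up with high similarity scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical "embeddings", invented for this example only.
# The dimensions loosely stand for contexts: weather, drinks, machinery.
toy_vectors = {
    "hot": [0.9, 0.8, 0.1],
    "warm": [0.8, 0.7, 0.1],
    "cold": [0.9, 0.6, 0.2],
    "engine": [0.1, 0.0, 0.9],
}


def nearest(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = nearest("hot", toy_vectors)
for other, score in ranking:
    print(f"{other} {score:.3f}")
```

In this toy setup "cold" lands almost as close to "hot" as "warm" does, because both occur in the same kinds of context. Nothing in the score says whether two neighbors are synonyms or opposites.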

It is therefore no surprise that many researchers use thesauri alongside the methods of distributional semantics (read: word2vec). For English such a resource is WordNet; for Russian, RuThes (RuTez) and Wiktionary.

The obvious is not so obvious

Every researcher who makes a bold attempt to explain meaning to a machine will sooner or later run into the fact that the things that seem most trivial to us are completely non-obvious to a computer. Moreover, not a word is written about them even in children's books. In many respects we come to know the world through our senses: sight, hearing, smell, touch, taste, and others.

We then pass one another an extremely compressed description of a situation, which unfolds into a detailed picture inside each individual head. Moreover, every person unpacks it differently, depending on personal experience, cultural background, character, and worldview.

Emotions, feelings, experiences

Words and phrases carry far more meaning than explanatory dictionaries record. Above all, this concerns such elusive and hard-to-quantify properties as evaluation and the emotional coloring that goes with it. For example, the phrase “grievous torment” carries a strong negative emotion, and the phrase “stormy joy” a strong positive one. “No gift” (said of someone or something difficult) is negative, but only mildly. And “virtuoso”, for example, carries a fairly strong positive evaluation.

The difficulty in recording such subtle characteristics of words is that they are extremely subjective and hard to formalize. Is the word “strategy”, say, positive or neutral? All one can agree on is that it is not negative.

Nevertheless, emotional and evaluative attributes are an integral part of language units and play an important role in human communication. So if we want to make the machine more humane and more pleasant to talk to, it has to absorb these subtleties too.

What to do?

Building such a dictionary by hand would be extremely laborious, because we want to label not only words but also phrases. In addition, every rating would be tightly bound to the subjective opinion of the researcher.

Good news! We live in 2017, and we have access to such wonderful technologies as the Internet and crowdsourcing. The latter lets us deal with both the labor cost and the subjectivity of the ratings at once. Of course, this gives rise to the “hospital average” effect, where averaging over very different respondents blurs individual differences, but as a first approximation we allow ourselves to turn a blind eye to distortions of this kind.
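As a rough illustration of what aggregating crowd answers might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the rating scale from -2 to +2, the individual votes, and the function name aggregate are invented and do not come from the project's data or code. Reporting the spread next to the mean is one simple way to see where the “hospital average” hides real disagreement.

```python
from statistics import mean, pstdev

# Hypothetical crowd answers on an assumed scale from -2 (strongly negative)
# to +2 (strongly positive). The votes are invented, not project data.
ratings = {
    "virtuoso": [2, 2, 1, 2, 1],
    "stormy joy": [2, 2, 2, 1, 2],
    "strategy": [1, 0, 0, 1, 0],
}


def aggregate(votes):
    """Collapse individual votes into a mean score plus a disagreement measure."""
    return {
        "mean": mean(votes),      # the "hospital average"
        "spread": pstdev(votes),  # population standard deviation of the votes
        "n": len(votes),          # number of respondents
    }


summary = {phrase: aggregate(votes) for phrase, votes in ratings.items()}

for phrase, stats in summary.items():
    print(f"{phrase}: mean={stats['mean']:.2f} "
          f"spread={stats['spread']:.2f} n={stats['n']}")
```

A phrase with a high spread is a candidate for collecting more answers or for a closer look, rather than for trusting the average.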

Teach the bot! - markup of emotions and semantics of the Russian language

The idea is implemented on the Word Map language platform. Work will proceed in several directions:

Evaluative markup. The task is to label Russian words and expressions as positive / neutral / negative and to rate how strongly that evaluation is expressed.
Emotional markup. The task is to label emotionally colored words and phrases by polarity and by the strength of the emotional background.
Thesaurus markup. The task is to label vertical and horizontal links between words and to assign semantic tags to words and expressions.
Experimental markup of relations. The task is to label relations following the “Meaning ⇔ Text” theory proposed by I. A. Melchuk, for example MAGN (coffee) = strong coffee, MAGN (feeling) = strong feeling.
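To make these directions more concrete, here is a minimal Python sketch of how such labels could be stored. It is only an illustration under stated assumptions: the class MarkupEntry, its fields, the 0..2 strength scale, the tag names, and the function apply_lf are all hypothetical and do not describe the platform's actual data format. Only the two MAGN pairs and the positive evaluation of the two sample phrases come from the article.

```python
from dataclasses import dataclass, field

POLARITIES = ("positive", "neutral", "negative")


@dataclass
class MarkupEntry:
    """One labeled word or phrase (hypothetical schema)."""
    phrase: str
    polarity: str                             # one of POLARITIES
    strength: int                             # assumed scale: 0 weak .. 2 strong
    tags: list = field(default_factory=list)  # semantic tags, free-form here

    def __post_init__(self):
        if self.polarity not in POLARITIES:
            raise ValueError(f"unknown polarity: {self.polarity!r}")


# Sample entries; strength values and tags are illustrative guesses.
entries = [
    MarkupEntry("virtuoso", "positive", 2, tags=["person"]),
    MarkupEntry("stormy joy", "positive", 2, tags=["emotion"]),
]

# Lexical functions in the spirit of "Meaning <=> Text":
# (function name, keyword) -> the expression that realizes it.
lexical_functions = {
    ("MAGN", "coffee"): "strong coffee",
    ("MAGN", "feeling"): "strong feeling",
}


def apply_lf(name, keyword):
    """Look up a lexical function value; returns None when it is not labeled yet."""
    return lexical_functions.get((name, keyword))


print(apply_lf("MAGN", "coffee"))
```

Keying the table by the pair (function, keyword) keeps the relation itself explicit, so new lexical functions can be added without changing the structure.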

To get the most out of human effort and to keep the tasks interesting for respondents, we apply methods from distributional semantics and machine learning. As the basis for the system of semantic categories, we took the classification used in the Russian National Corpus (НКРЯ).

How to participate?

An important goal of our initiative is to fill the gap in linguistic resources for the Russian language that are open to researchers, linguists, and practicing engineers. We expect that the markup data will lead to interesting research, scientific papers, articles on Habr, engineering products, and open technologies.

You can help the project in the following ways:

Take part in training the bot. It is easy and fun, and it also lets you sharpen your own sense of language and notice interesting features of Russian.
Like, share, Alisher! Share links to the project on social networks, and write about it on your blog or website.
Constructive criticism. It helps us grow and keeps us from sinking into the swamp of our own illusions. Discussion is very important for correcting course in time and building a truly useful resource. Our only request: if you criticize, propose an alternative.
Semantics and cognitive linguistics. We are trying to improve our understanding of modern approaches to semantics and to building resources of this kind. We will be glad to receive advice and recommendations on what to read, what to study, and whom to consult.
Spreading the word. We would welcome your advice on where else to talk about the project: your favorite tech blog, an online technology magazine, a VKontakte / Slack / Telegram group, or something else.

Open data

Aggregated markup results will be open for download under the CC BY-NC 4.0 license.

We expect to receive and publish the first results by mid to late July; everything will depend on how active the respondents are. To avoid missing anything, star our GitHub repository and subscribe to it:

Open data of the Word Map

Where is the money, Zin?

It would be great to try combining crowdsourcing and crowdfunding in a single project, so that is what we did by launching a fundraising campaign on Planeta.ru:

Teach your computer to understand our world and emotions

Important. We are already working on the project and will see it through to results with our own effort and the resources we have. The collected data, as promised, will be open and available to everyone. The only questions are the timing and the volume of the markup. At present we expect to reach the basic result (the 10,000 most frequent words) in three months; the full markup will take about two years.

Additional resources will help speed up the results significantly. We need the help of developers to build and improve the markup system, add new semantic categories, and conduct research. Funds are also needed to promote the project and run competitions.

You can donate any amount to the campaign, knowing that you have contributed to the shared success and that every ruble invested will go to a cool and worthwhile cause.

Do not forget that you can also help the initiative without money. Like the project and talk about it on social networks: it is a very simple, completely free, and very effective way to promote it.

And remember ...

The choice is always yours.

Corporate sponsorship

Do you represent an established business that is interested in the development of open linguistic data in Russia? Become a corporate sponsor of the project! You get a permanent logo link on the project page, extra exposure to thousands of people, and boundless respect from the community.

We will spend every ruble invested with incredible efficiency: for the cost of a few months' salary of a single programmer at a large company, we will complete the entire project, and its results will be used by thousands of researchers, scientists, and engineers.

Commercial use

For questions about commercial use or business-specific markup, write to kartaslov@mail.ru or to the author of the article.

Thanks

I would like to express my deep gratitude to the organizers and participants of Dialogue 2017, the 23rd international conference on computational linguistics and intellectual technologies.

It was in the hallway conversations at the event that the need for this kind of markup became clear, and a group of like-minded people came together to discuss the experimental markup of relations according to the “Meaning ⇔ Text” theory. I hope that next year the collected data will make it possible to launch an interesting new competition within Dialogue Evaluation.

Links

Source: https://habr.com/ru/post/331582/
