
As you know, ABBYY develops software that deals with language processing in one way or another: ABBYY Lingvo translates words from one language to another, ABBYY FineReader converts printed text into electronic form, and ABBYY Compreno (we wrote about it
here) will translate whole texts between languages. Programs of this kind are called “knowledge-intensive” because they build on the results of serious scientific research, in our case research in artificial intelligence, pattern recognition, and computational linguistics. This post is about computational linguistics.
We care a great deal about the development of this field, so we do not limit ourselves to research inside the company: for a number of years running we have organized a scientific conference on the subject,
“Dialogue”. Because computational linguistics is a fairly specialized field, linguists know Dialogue well, while almost everyone else has barely heard of it. Below the cut we tell you more about it.
Dialogue is the largest conference on computational linguistics in Russia. It is called Dialogue because specialists from different areas of theoretical linguistics meet here with developers of language technologies, such as knowledge extraction from text, speech recognition and synthesis, and machine translation, and exchange experience. The conference has been held for about 35 years (with short breaks), and for the last ten years ABBYY has been its main organizer. It so happens that the main ideas of Dialogue coincide with the position on natural language processing that our company has always held: the future of computational linguistics lies in combining modern engineering and mathematical methods with full-fledged linguistics.
Many computational linguistics conferences abroad now lean heavily towards purely statistical methods, while Dialogue promotes the idea that statistical learning becomes even more effective when it is layered on top of full-fledged models of natural language. Here engineers cannot manage without linguists. Another distinctive feature of Dialogue is its special attention to the Russian language. At conferences held in other countries Russian is, for obvious reasons, hardly covered at all, whereas at Dialogue modern methods of computational linguistics are applied to it first and foremost.
Why else is Dialogue needed? Russian computational linguistics still lags noticeably behind Western work, both in quality and in quantity. We have far fewer specialists and companies in this field than, for example, Germany does. On average we are less well equipped in theory and in methodology, and somewhat cut off from the international mainstream. Dialogue is meant not only to help close this gap, but also to draw attention to those areas in which Russian computational linguistics is fully competitive. The conference discusses the most pressing and interesting problems. To that end we invite world-renowned researchers, who talk about their projects and share their latest experience.
This year the focus was corpus linguistics. Corpora are large collections of texts used for linguistic analysis. It is fair to say that almost all results in modern theoretical and computational linguistics are obtained with the help of corpora. Machine translation systems and other automatic analysis systems are trained on them, modern dictionaries are built on examples drawn from corpora, and linguistic theories are tested against corpus data.
How do people work with corpora? Here is an example. Our company is one of the initiators of a project devoted to regional differences in the Russian language,
“Languages of Russian Cities”. Project participants collect information on how the same objects and concepts are named in different cities of Russia and neighboring countries. Most readers have heard of bordyur and porebrik (two regional words for a curb), but what do the words marka, trempel or multifora mean? Thousands of words used only in certain regions of Russia have been found, and exactly how they are used has been verified with the help of modern Russian corpora. Naturally, the corpora used were ones that carry geographic information (for example, corpora built from local media texts or from blogs whose authors state where they live).
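To make the idea of a geographically tagged corpus more concrete, here is a minimal sketch in Python. It is not the project's actual tooling: the tiny corpus, the city labels, and the function name `regional_counts` are all invented for illustration, and the words are English-transliterated stand-ins. It simply counts how often each target word occurs in texts from each city.

```python
from collections import Counter, defaultdict

# Toy geo-tagged corpus of (city, text) pairs.
# Everything here is invented purely for illustration.
CORPUS = [
    ("Moscow", "the car hit the bordyur near the house"),
    ("Saint Petersburg", "the car hit the porebrik near the house"),
    ("Saint Petersburg", "kids sat on the porebrik"),
    ("Moscow", "a new bordyur was laid on our street"),
    ("Moscow", "nobody here says porebrik"),
]


def regional_counts(corpus, words):
    """Count occurrences of each target word, broken down by city."""
    targets = set(words)
    counts = defaultdict(Counter)
    for city, text in corpus:
        # Naive whitespace tokenization; a real corpus would need
        # proper tokenization and lemmatization.
        for token in text.lower().split():
            if token in targets:
                counts[token][city] += 1
    return counts


counts = regional_counts(CORPUS, ["bordyur", "porebrik"])
for word, by_city in counts.items():
    print(word, dict(by_city))
```

On real data the raw counts would have to be normalized by the amount of text available for each city, otherwise large cities would dominate every comparison.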
So almost any linguistic study today relies on corpus data. But not every study states clearly what properties the corpus and the methods of working with it must have for the results to be trustworthy. Roughly speaking, different tasks call for corpora that are built (or selected) with the specifics of the task in mind. For example, if you are working on a recognition system for modern colloquial speech, the
Russian National Corpus will not suit you, because it is based on works of fiction. If you are building automatic translation of news feeds, you need a corpus of well-chosen media texts. A separate question discussed at Dialogue is whether the whole Internet can be used as a corpus. Texts of almost any type can be found there, but tools are needed to select suitable texts automatically.
As we have said, Russian computational linguistics has a lot to learn. This is why researchers who can talk about the latest international achievements are invited to Dialogue. For example, last year the speakers included such luminaries as Yorick Wilks and Joakim Nivre. This year Eduard Hovy and Diana McCarthy were guests of the conference.
Another important topic at Dialogue is comparing the quality of automatic text analysis systems. In Europe it has long been standard practice to agree on methodologies for assessing the quality of such systems, and only work that meets the agreed criteria of this so-called “evaluation” can be submitted to a conference. We have yet to establish a culture of verifying results: in Russia it has long been customary to rely on the developers' own qualitative assessments, and those are far from always objective.
One of the important tasks for Dialogue in this respect is to develop procedures for running competitions between automatic text analysis systems, along with criteria for assessing how well those systems perform. For example, at Dialogue 2010 there was a competition of systems for automatic morphological analysis of Russian (systems that perform grammatical analysis of words). Twelve systems developed by leading research institutions and commercial companies were compared on several tracks, including resolving ambiguity in parts of speech and other grammatical features of words depending on context. For example, such systems should be able to determine whether the Russian word “стекло” in the analyzed text is a noun (“glass”) or a verb (“flowed down”).
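As a rough illustration of what comparing such systems against agreed criteria can look like, the Python sketch below scores two hypothetical taggers against a hand-annotated gold standard using plain token-level accuracy. The tag set, the system names, and the data are made up for the example; this is not the methodology actually used at Dialogue 2010.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag matches the gold-standard tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical gold-standard part-of-speech tags for five tokens.
GOLD = ["NOUN", "VERB", "NOUN", "ADJ", "VERB"]

# Hypothetical outputs of two competing systems on the same tokens.
SYSTEMS = {
    "system_a": ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"],
    "system_b": ["NOUN", "NOUN", "VERB", "ADJ", "NOUN"],
}

scores = {name: tagging_accuracy(GOLD, tags) for name, tags in SYSTEMS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

The point of a shared gold standard is that every system is scored on the same tokens with the same rule, so the ranking does not depend on each developer's own judgment.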
This year the discussion turned to how the results of syntactic analysis should be compared. Different automatic analysis systems handle the hard problems of syntax in different ways. Some produce a complete grammar-based parse of the sentence (remember school: one line under the subject, two under the predicate, and so on?), others perform a partial analysis of sentence fragments, and still others use statistical models based on identifying the most frequent sequences of words.
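The last of these approaches starts from something very simple: counting which word sequences occur most often. The Python sketch below counts adjacent word pairs (bigrams) in a few invented sentences. It shows only this counting step, under the assumption of naive whitespace tokenization, and is not a syntactic analyzer in itself.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization
        # zip pairs each token with the token that follows it
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Invented example sentences.
SENTENCES = [
    "the conference discusses machine translation",
    "machine translation systems are trained on corpora",
    "the conference invites researchers",
]

counts = bigram_counts(SENTENCES)
for pair, n in counts.most_common(3):
    print(" ".join(pair), n)
```

Real statistical systems estimate probabilities from counts like these over very large corpora, which is one reason corpus quality matters so much.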
Reaching agreement turned out to be quite difficult, but the evaluation of syntactic analyzers will take place this fall. By the way, it was decided that, alongside the experts, university students in computational linguistics, linguistics and programming will help analyze its results. If you want to take part in this project, send us a private message.
What else is there to say about Dialogue? Besides ABBYY, the conference is organized by Lomonosov Moscow State University, the Institute of Linguistics, Russian State University for the Humanities, the Institute of Informatics Problems of the Russian Academy of Sciences, the Institute of Information Transmission Problems of the Russian Academy of Sciences, Yandex, and the Association of Artificial Intelligence. The Russian Foundation for Basic Research supports the conference.
The widely recognized high level of the papers at Dialogue is maintained by a large group of strict expert reviewers (about 60 Russian and foreign specialists), who help select the most interesting work for the conference and weed out weak and derivative submissions.
We are sure that such a strong team will help Russian computational linguistics reach a new level. All papers from Dialogue 2011 are posted
on the conference website.
Sveta Luzgina,
with the support of the Dialogue organizing committee