Friday format: “language development” - research that combines IT and linguistics
In today's article we will talk about several technology projects directly related to natural language processing: working with dictionaries and databases built on text corpora and studying what users write on social networks, using examples from research abroad and developments at ITMO University.
A number of areas of work with natural language involve semantic technologies. Here the work is done primarily with ontologies, which define the relations between objects in a subject area and make interaction with the machine more "human".
The "Semantic Web" as a direction for the development of the Internet and machine interaction is a well-known idea and has been developing for a long time. Nevertheless, there are still quite a few new directions for applying semantic data. Projects using semantic technologies are also working at ITMO University.
For example, the resident company Technopark ITMO University VISmart is developing a project Ontodia , which allows the use of semantic technologies for application needs, including for developers. The user can load semantic data into Ontodia, and at the output receives their visualization in the form of a graph.
As examples of the use of such visualizations, developers cite the search and comparison of information from unstructured medical data in the Northwestern Medical Research Center. V.A. Almazov
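To give a rough sense of what turning semantic data into a graph involves, here is a minimal sketch (not Ontodia's actual code): RDF triples are loaded with rdflib and converted into a labeled directed graph with networkx, which can then be drawn. The file name sample.ttl is a placeholder.

```python
# A minimal sketch of turning semantic (RDF) data into a graph for visualization.
# This illustrates the general idea only, not Ontodia's implementation.
from rdflib import Graph
import networkx as nx

rdf = Graph()
rdf.parse("sample.ttl", format="turtle")  # placeholder file with RDF triples

g = nx.DiGraph()
for subj, pred, obj in rdf:
    # Each triple becomes an edge: subject -> object, labeled with the predicate.
    g.add_edge(str(subj), str(obj), label=str(pred))

print(f"{g.number_of_nodes()} nodes, {g.number_of_edges()} edges")
# The resulting graph can be rendered with networkx/matplotlib or a JS front end.
```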
Another example of an implemented project based on semantic technologies is an extension for the Open edX system that makes it possible to personalize the learning process in online courses. ITMO University staff from the international laboratory "Intellectual methods of information processing and semantic technologies", together with a colleague from Yandex, created an ontology describing all the components of a MOOC: content, usage scenarios, participants in the process, and so on. As a result, the developers are able to identify interdisciplinary links between courses published on the edX platform.
From the standpoint of NLP algorithms, we use the following mechanism: we take the textual content of the course (for video lectures these are the subtitles) and use algorithms to extract key words from it, the so-called "domain concepts".
We then map these concepts onto the prepared ontology. This gives us semantic units of content for each course, which we can then use to link different courses across different topics and subject areas.
- Dmitry Volchek, graduate student at the Department of Informatics and Applied Mathematics, ITMO University
Thanks to this, students and MOOC creators can track how and in what capacity a particular concept is used in different courses and what it means within different subject areas, and ultimately get a well-rounded picture of the concept they are interested in.
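To illustrate the keyword extraction step described in the quote above, here is a simplified sketch based on TF-IDF; the actual project may use different algorithms, and the subtitle texts and course names are placeholders.

```python
# A simplified sketch of extracting "domain concepts" (key terms) from course subtitles
# using TF-IDF. The real system may rely on other algorithms plus an ontology mapping step.
from sklearn.feature_extraction.text import TfidfVectorizer

subtitles = {
    "graphs_course": "a graph consists of vertices and edges; shortest path algorithms ...",
    "ml_course": "gradient descent minimizes a loss function over the model parameters ...",
}

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
matrix = vectorizer.fit_transform(list(subtitles.values()))
terms = vectorizer.get_feature_names_out()

for i, course in enumerate(subtitles):
    scores = matrix[i].toarray().ravel()
    top = scores.argsort()[::-1][:5]                # five highest-weighted terms
    concepts = [terms[j] for j in top if scores[j] > 0]
    print(course, "->", concepts)
    # In the real pipeline these concepts would be mapped onto the MOOC ontology.
```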
Text Processing Algorithms and Big Data
Another area of work with natural language is the use of algorithms to count and evaluate certain characteristics of large arrays of text data. Although this task may seem like a trivial example of working with big data, it has its nuances.
As Dmitry Muromtsev, head of the international laboratory "Intellectual methods of information processing and semantic technologies" at ITMO University, explains, work on such projects often follows a similar scenario: developers analyze a large body of texts and evaluate its linguistic characteristics: morphology, syntax, nuances associated with the use of certain words and phrases, and so on.
The idea and the algorithms behind such services are roughly the same: they use a set of text-processing approaches that have already become standard. What is unique is that these algorithms have to be tuned very precisely to each specific language. This is, in particular, the kind of work we do in our laboratory.
After all, when we speak in everyday life, we use rules that we learn practically from birth: at school, in daily communication, and so on. The same has to be done with the machine: essentially, it has to be taught these rules from scratch and to a very high standard.
- Dmitry Muromtsev
Such work sometimes leads to unexpected results. For example, not so long ago a similar method allowed scholars to carry out a more detailed analysis of Shakespeare's legacy. It turned out that 17 of his 44 plays were written "in collaboration" (a 1986 study had revealed only 8 "collaborations"). In itself, the practice of borrowing and reworking works by different authors was nothing out of the ordinary for English poets of the 16th century.
Moreover, in some cases it was difficult until recently to determine the exact authorship of a work or its parts, since writers not only exchanged ideas but also tried to imitate each other's style.
What made it possible to determine the authorship of a work or its parts more accurately was the analysis of so-called function words: words that have no nominative function of their own and instead express the relations between "content" words. Analysts were able to identify patterns in their use that clearly point to a particular author and make up his "unique linguistic portrait". For example, one of Shakespeare's distinctive traits was the construction "and with" (as in "With mirth in funeral and with dirge in marriage").
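As a rough illustration of how such a function-word "linguistic portrait" might be computed, here is a minimal sketch: relative frequencies of a small, fixed set of function words are compared via cosine similarity. This only conveys the general idea; the Shakespeare study itself used far more elaborate methods, and the sample passages here are arbitrary.

```python
# A simplified sketch of a function-word "linguistic portrait": relative frequencies
# of a fixed set of function words, compared with cosine similarity.
import math
import re
from collections import Counter

FUNCTION_WORDS = ["and", "with", "of", "the", "to", "in", "that", "but", "for", "not"]

def profile(text: str) -> list[float]:
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

known = profile("With mirth in funeral and with dirge in marriage ...")   # known author sample
disputed = profile("In equal scale weighing delight and dole ...")        # disputed passage
print(f"similarity of function-word profiles: {cosine(known, disputed):.3f}")
```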
According to the researchers, precisely identifying which poets were involved in creating the well-known plays makes it possible, to some extent, to dispel the myth of Shakespeare's exclusivity. For example, it turned out that Shakespeare wrote the "heavyweight" first part of the "Henry VI" trilogy himself (it had previously been attributed to possible co-authors), while Thomas Middleton had a hand in the play "All's Well That Ends Well".
Another unusual example of a linguistics project built on big data is the "De-Jargonizer". This project by Israeli scientists makes it possible to evaluate a number of characteristics of a scientific text (based on an analysis of a corpus of 500 thousand scientific articles) and determine how well it will be understood by a broad audience. The service counts the number of specialized and rare words and, based on this data, determines how accessible the text is (we wrote about this project in more detail here).
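The basic counting step can be sketched roughly as follows; this is a toy illustration, not the service's actual code, and the frequency table and thresholds are made up for the example.

```python
# A toy sketch of estimating text accessibility by word frequency, in the spirit of
# counting specialized and rare words. Frequencies and cutoffs below are hypothetical.
import re

# Hypothetical per-million frequencies from a general-language reference corpus.
CORPUS_FREQ = {"the": 60000, "analysis": 120, "protein": 35, "phosphorylation": 0.4}

def accessibility(text: str, common_cutoff: float = 1000, rare_cutoff: float = 50):
    words = re.findall(r"[a-z']+", text.lower())
    buckets = {"common": 0, "mid": 0, "rare_or_jargon": 0}
    for w in words:
        freq = CORPUS_FREQ.get(w, 0.0)          # unseen words are treated as jargon
        if freq >= common_cutoff:
            buckets["common"] += 1
        elif freq >= rare_cutoff:
            buckets["mid"] += 1
        else:
            buckets["rare_or_jargon"] += 1
    total = len(words) or 1
    return {k: round(100 * v / total, 1) for k, v in buckets.items()}

print(accessibility("The analysis of protein phosphorylation pathways ..."))
```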
Text analysis
A number of studies (including those conducted at ITMO University) combine several natural language analysis technologies at once. An example is projects in the field of opinion mining (sentiment analysis of text). Sentiment analysis involves building a domain ontology, applying statistical tools for natural language analysis, using machine learning algorithms and, in some cases, bringing in experts for a more accurate assessment of the texts.
A similar project was implemented at ITMO University as part of the task of analyzing public opinion on the Internet. To analyze opinions, the staff of the Advanced Computational Technologies Laboratory of the NII NKT research institute use data from social networks (VKontakte, Twitter, Instagram, LiveJournal), which forms the basis for further processing. Each publication is then labeled with a given set of characteristics (the number of likes, reposts, comments, and shares), and the data is combined into a graph of links that can be used to track how information spreads.
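The "graph of links" step could look roughly like the sketch below; the field names and data are hypothetical and do not reproduce the laboratory's actual pipeline. Each post becomes a node carrying its engagement counts, repost relations become edges, and the cascade started by a publication can be traced along those edges.

```python
# A rough sketch of combining labeled posts into a repost graph to trace how
# information spreads. Field names and data are hypothetical.
import networkx as nx

posts = [
    {"id": "p1", "likes": 120, "reposts": 14, "comments": 9, "shares": 14, "source": None},
    {"id": "p2", "likes": 15,  "reposts": 2,  "comments": 1, "shares": 2,  "source": "p1"},
    {"id": "p3", "likes": 4,   "reposts": 0,  "comments": 0, "shares": 0,  "source": "p2"},
]

g = nx.DiGraph()
for post in posts:
    g.add_node(post["id"], likes=post["likes"], reposts=post["reposts"],
               comments=post["comments"], shares=post["shares"])
    if post["source"]:
        g.add_edge(post["source"], post["id"])   # edge: original post -> repost

# Size of the cascade started by the original publication p1.
cascade = nx.descendants(g, "p1") | {"p1"}
print("cascade size:", len(cascade))
```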
The project is used to study social processes on the Internet and continues to evolve. For example, the NII NKT has already carried out several studies based on the analysis of social network data and natural language processing.
One of them is monitoring the network activity of informal communities, which allows a deeper study of how information spreads and how problem-oriented communities with informational influence emerge. Another project is the construction of an "emotional map" of a given area: based on geotagged publications and an assessment of their content, analysts can get an idea of how people feel in one place or another.
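A very rough sketch of the "emotional map" idea is shown below; the coordinates and sentiment scores are invented, and the actual project is certainly more involved. Geotagged posts with a sentiment score are binned into coarse grid cells, and the average sentiment per cell is reported.

```python
# A toy sketch of an "emotional map": averaging sentiment scores of geotagged posts
# over a coarse latitude/longitude grid. Data and scores are hypothetical.
from collections import defaultdict

posts = [
    {"lat": 59.9571, "lon": 30.3083, "sentiment": +0.8},   # e.g. a cheerful post
    {"lat": 59.9565, "lon": 30.3100, "sentiment": -0.2},
    {"lat": 59.9343, "lon": 30.3351, "sentiment": +0.1},
]

cells = defaultdict(list)
for p in posts:
    # Round coordinates to two decimal places (roughly 1 km cells) to group nearby posts.
    cell = (round(p["lat"], 2), round(p["lon"], 2))
    cells[cell].append(p["sentiment"])

for cell, scores in cells.items():
    print(cell, "average sentiment:", round(sum(scores) / len(scores), 2))
```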
There are more and more projects related to natural language processing every year, and they are becoming more ambitious. Scientists from the UK, for example, say that "the computing power of computers is increasingly being turned to solving linguistic problems, because this is one of the most difficult and resource-intensive tasks facing modern developers."