The subtleties of translation: how ABBYY LS and IBS volunteers are translating the Coursera Data Science specialization into Russian

Vladimir Podolsky (vpodolskiy), an analyst in the department for work with education at IBS, became the editor of the Russian translation of the Data Science specialization on Coursera (as part of a joint project between IBS and ABBYY LS). We are publishing his detailed post about the difficulties of translating professional texts on data, the practice of working with a crowdsourcing platform, and the experience of long-term online study. As a reminder, Vladimir himself completed the Data Science specialization on Coursera, and we published his detailed review of all 9 courses from Johns Hopkins University (part 1 and part 2).

Hello again, Habr!
Coursera and other MOOCs are very entertaining and addictive. Thanks to them you can learn a great deal. All you need is network access and a willingness not to be lazy. Throughout the history of MOOCs, the same rule has applied as when writing a Ph.D. thesis: “If you are not ready to do a little every day, it’s better not to start at all.” Follow it, and you can cope with data science, an introduction to artificial intelligence, and even quantum physics ...
Today I would like to talk about one difficulty that people almost all over the world run into when taking open courses. This global difficulty is, of course, language. And the problem, as a rule, is not even that a person's level of knowledge is too low to understand what the lecturer is talking about ... The fact is that it is very, very difficult to understand certain English-language terms that have no clear equivalent in Russian. And that is before we even mention the speed of speech: as a rule, foreign lecturers make no allowances for those whose native language is not English.

When dealing with foreign MOOCs, be prepared for most lecturers to deliver the material at the staggering speed of a jet plane. You can, of course, drag the slider back, but believe me, you will tire of it by the third time, and you will thank fate if the slides are written in concise and accessible language, even a foreign one. This is especially true for those in our country who, for whatever reason, could not or did not learn a foreign language.

If you know a foreign language and are shrugging in bewilderment as you read this ... you shouldn't. In Russia there really are many people who never learned a foreign language, for whom the language of Pushkin and Tolstoy seemed enough for life. Among them there are quite a few professionals and people striving to develop in their chosen field. If the country's educational market does not provide the necessary materials (fortunately, the situation is now improving), a person should still have the opportunity to access knowledge produced abroad. It is with this goal that a social initiative like the translation of foreign online courses into Russian is flourishing.

When people hear about a public initiative, they may imagine something amateurish, knocked together on one's knee on the principle of “I'm not paid for this, so they should be grateful for what they get!”. Perhaps it used to be that way. But I am sure that public initiatives in Russia have moved past this stage. This is confirmed by the crowdsourcing initiative of IBS and ABBYY LS to organize the translation of the Data Science Specialization, which I completed not so long ago (posts about this: part 1, part 2).

The role of the companies in the translation process is, of course, significant, but it should not be exaggerated: ABBYY LS provided the crowdsourcing platform for translating the video lecture subtitles, while IBS supported this good initiative with the work of its experts, who had successfully completed the specialization and apply that knowledge in their work. That is how I ended up in the expert group, whose members carefully review the crowdsourcing community's translations and stitch them together, eliminating terminological flaws.

In today's article I will talk about how translation review is carried out and how the SmartCAT platform created by ABBYY LS helps me in this process. So let's go!

Using the correct terminology

Perhaps the biggest problem in reviewing translations has been and remains the use of correct terms. In principle, the problem is not so serious if well-established Russian terminology already exists for the field being translated. If it does not, you have to choose a Russian version guided by two criteria:
A) it should not duplicate a term that has a different definition;
B) it should be adequate enough to give the reader an intuitive understanding.

Finding correct terms is likewise the main problem when reviewing other project participants' translations. Of course, there are also problems with constructing correct phrases and sentences, but these are generally trivial and have more to do with the art of writing literate, understandable text, about which much has already been written. So I will go into a little more detail on how to choose terms when translating and reviewing.

The first and most important piece of advice is to find the relevant literature in Russian and at least skim it. It need not be weighty scientific tomes: articles, notes and interviews on highly specialized topics by Russian-speaking experts in the field will do. Of course, with journalistic writing there is always the risk of running into hunchbacked, one-eyed “datasaientists” instead of glamorous, fashionable “data researchers”. However, such Anglicisms and slang are easy to spot because they stand out from the rest of the Russian text.

If the exact term cannot be found in the existing literature, you can propose your own Russian translation and then check its adequacy by searching professional forums and sites. After viewing several relevant pages, you will most likely find the translation most common in the professional community. It is not worth spending much time on such searches, though: if the candidate translations turn up only rarely, they can hardly serve as an authoritative source.

The third option is to look for similar terms in related fields of knowledge. For data science, for example, you can safely rely on textbooks on statistics, probability theory, the basics of artificial intelligence ... The main thing in all these searches is not to dig too deep. For some terms there are many equally good (or equally bad) translation variants. In that case I usually choose one of them (as a rule, the most accurate and euphonious) and then stick to it.

If none of these options works, you will have to rely on your own knowledge and background in the field. After all, as graduate students like to do, why not introduce a new term :)?

"Smart cat" - a faithful assistant translator and expert

However strong the community's desire to translate Coursera courses, it would hardly have been possible without the quality tools provided by ABBYY LS. The tool is called SmartCAT. Smart means smart. CAT means cat. I'm serious - see the picture.

Just kidding: CAT is an abbreviation of Computer-Assisted Translation. CAT systems work by breaking the text to be translated into small parts of one or two sentences each. Each such part is called a segment. The CAT system processes each segment in two ways:
The main disadvantage and the main advantage of CAT systems is human participation in translation. Even if the initial translation variant is selected automatically, it must in any case be confirmed by a person (a translator or a subject matter expert). The disadvantage of this approach is obvious: you have to involve people in translating and peer-reviewing translations. Human participation naturally slows the translation process down, which is certainly a negative factor for those who want to keep up with the times and quickly receive information in their native language. On the other hand, involving people has an obvious benefit: automatic translation systems are still inferior to humans in their ability to build sentences that are accurate and correct from a semantic point of view. Besides, a person is able to catch the mood of a text or speech, which makes it possible to shape the translation so that the audience understands the idioms and even the lecturer's jokes.
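To make the idea of segments more concrete, here is a minimal sketch of how a CAT-style tool might split a text into sentence-sized segments and look up a similar, previously translated segment. It is only an illustration, not how SmartCAT itself works: the function names, the similarity threshold and the tiny translation memory are all hypothetical, and the sentence splitting is deliberately naive.

```python
import re
from difflib import SequenceMatcher


def split_into_segments(text):
    """Split text into segments after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def suggest_from_memory(segment, memory, threshold=0.75):
    """Return (translation, similarity) for the closest stored source segment,
    or None when no stored segment is similar enough."""
    best_translation = None
    best_score = 0.0
    for source, translation in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_translation, best_score = translation, score
    if best_score >= threshold:
        return best_translation, best_score
    return None


# Hypothetical translation memory: source segment -> confirmed translation.
memory = {
    "Let's look at the data.": "Давайте посмотрим на данные.",
}

segments = split_into_segments("Let's look at the data. Then we make a plot.")
for segment in segments:
    # The suggestion is only a starting point: a person still confirms it.
    print(segment, "->", suggest_from_memory(segment, memory))
```

In this sketch the first segment gets a suggestion from the memory, while the second has no close match and would have to be translated by a person or by machine translation.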

ABBYY LS SmartCAT is a kind of CAT tool: a cloud environment that automates the translation process as far as possible. But without fanaticism: as already mentioned, a human is assigned the key role in translation. Although this environment is, of course, sold to companies and freelancers, I use it exclusively as part of the “Translate Coursera” crowdsourcing project.

The crowdsourcing nature of the “Translate Coursera” project means that anyone can take part in the translation. You simply register on the site, choose a course that interests you or is close to your field, and start translating it with the help of SmartCAT. SmartCAT offers extensive support: machine translation suggestions, translations of similar segments, built-in dictionaries and term glossaries, all kinds of word searches, and the ability to listen to the audio of the original. The only thing it doesn't do is serve coffee in the morning, but I think they will fix that in a new release ;-)

The expert's job

Well, that is probably all I know about the translator's side of the project. Now I will tell you what the expert sees and does in SmartCAT. So godmode ON!

When I log in to the system, the “Workspace” label (highlighted in yellow) appears directly below my name. To proceed to reviewing the translations assigned to me, I click on it and then select the Crowd Review option.

After that, I get to a page listing all the courses whose translation I take part in as an expert (see the screenshot below). Next to each course name, the translation progress and the review progress are shown as two bars in different shades of blue, with the review bar drawn over the translation bar. Clicking on a course name opens the list of the course's lecture videos whose subtitle translations I am reviewing. At the top of the expanded list there is a “Download” button, which downloads the original and translated subtitle files. To go directly to reviewing the translation of a video lecture, click on its name.

After clicking on the title of a lecture video, I land on a new page that presents all the key tools SmartCAT offers for reviewing subtitle translations (see the figure below). Let's look at this page in a little more detail ...

The abundance of elements on the review page is impressive; in my subjective experience, widescreen monitors are the most convenient for review work. As practice has shown, every control on the page is used in the process of editing a translation. Thanks are evidently due to the developers, who removed everything superfluous and left only the essentials.

As you can see, most of the page is taken up by a window with the segments in English and their translations into Russian. Clicking the triangle button to the left of the English version of a segment jumps to the corresponding part of the video, so you can hear what the lecturer is saying and see what he is doing. This often helps to understand what the lecturer really means, because the speech recognizer that generates the subtitles, alas, sometimes makes mistakes. The video itself is displayed on a tab in the lower right corner of the page.

Although in the screenshot all the Russian translation fields are already filled in, initially they are empty. To fill them in, you choose one of the translation options proposed by the community or take the machine translation; as a last resort, you can translate the text yourself. As a rule, there is a suitable option among the community translations. To display all available community translations for a particular segment, select it with a mouse click. The proposed options will appear in the window at the bottom of the page. The machine translation and the translation retrieved from translation memory are displayed in the CAT window on the right.

If one of the options proposed by the community turns out to be the most adequate, click the corresponding red arrow button to the right of that option. The chosen option is copied into the translation window, after which you can place the cursor on it in the segments window and edit it as plain text. When you have finished editing, click the checkmark icon in the toolbar at the top or press Ctrl + Enter. SmartCAT then considers the translation of the segment complete and confirmed, and updates the green status bar at the very top of the page. If necessary, a segment finalized in this way can be reopened for editing.

There is one requirement for each segment's translation that cannot be violated (otherwise the translation cannot be confirmed). Each translation must contain exactly the same number of line breaks (shown as an arrow on a blue background, like the one on the Enter key) as the original English fragment. Apparently this has to do with subtitle timing ... It is, frankly, not very convenient. Quite often the English text is longer than its translation because of all sorts of pauses and slips of the tongue, which I try to keep out of the subtitles. In that case you have to find a way to fit the numerous line breaks into a short translation without losing clarity.
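The line-break rule is easy to express as a check. The sketch below is only my illustration of the constraint described above, not SmartCAT's actual validation code; it assumes that a line break inside a segment is represented by the newline character, and the function names and sample sentences are hypothetical.

```python
def count_line_breaks(text, line_break="\n"):
    """Count line-break markers in a segment (assumed here to be newlines)."""
    return text.count(line_break)


def can_confirm(source, translation, line_break="\n"):
    """A translation is acceptable only if it has exactly as many
    line breaks as the original segment."""
    return count_line_breaks(source, line_break) == count_line_breaks(
        translation, line_break
    )


source = "So, um, what we want to do here is,\nwell, look at the data\nbefore we model it."
too_short = "Сначала нужно посмотреть на данные."
fitted = "Сначала нужно\nпосмотреть на данные,\nа уже потом строить модель."

print(can_confirm(source, too_short))  # False: 2 breaks in the source, 0 here
print(can_confirm(source, fitted))     # True: both contain 2 breaks
```

The second translation shows the kind of tweaking involved: the shorter Russian sentence has to be spread over the same three lines as the wordier original.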

In addition, SmartCAT lets an expert add a term's translation to the course dictionary so that translators can later use a single translation. Unfortunately, I have not yet had the chance to review translations of other, more advanced courses, so I cannot say for sure whether the term translations I added to the dictionary have been useful to anyone ...
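The point of a shared course dictionary is consistency, and a consistency check can be sketched in a few lines. This is a hypothetical illustration rather than a SmartCAT feature: the glossary entry, the sample segments and the function name are made up, and because Russian words are inflected, the sketch naively matches on a word stem instead of the full approved term.

```python
def find_glossary_violations(segments, glossary):
    """Return (segment index, source term, expected stem) for every segment
    whose source contains a glossary term but whose translation lacks
    the approved rendering."""
    violations = []
    for index, (source, translation) in enumerate(segments):
        for term, approved_stem in glossary.items():
            term_used = term.lower() in source.lower()
            rendering_used = approved_stem.lower() in translation.lower()
            if term_used and not rendering_used:
                violations.append((index, term, approved_stem))
    return violations


# Hypothetical glossary: English term -> stem of the agreed Russian rendering.
glossary = {"exploratory data analysis": "разведочн"}

# Hypothetical (source, translation) pairs.
segments = [
    ("Exploratory data analysis comes first.",
     "Сначала выполняется разведочный анализ данных."),
    ("We use exploratory data analysis here.",
     "Здесь мы используем исследовательский анализ данных."),
]

print(find_glossary_violations(segments, glossary))
```

Here the second segment is flagged because it renders the term differently from the glossary; a real check would need proper morphological analysis rather than substring matching.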

Another very useful feature of the SmartCAT platform is its dictionaries. They are good because they offer several translation options and even explanations. I think that if I were studying to be a translator, this environment would be very useful for learning new words.

Of course, I have not covered all of SmartCAT's functionality in this article, only what I personally used when reviewing the translation of the Exploratory Data Analysis course.

Time spent

As for the time spent on reviewing translations ... it really varies. The time spent depends mainly on three factors:
As a rule, reviewing a fragment 6-7 minutes long takes from 45 minutes to an hour and a half, while longer ones (10 minutes or more) can take 2 hours. Such a long duration is due to many factors:
As a result, about 4 hours a week go into review, which covers 3-5 video clips averaging 7 minutes each. The course whose translation I am reviewing contains 39 video clips of various lengths (up to a gigantic 40 minutes!). Given how busy I am overall, I hope to be done with it by the New Year.
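A rough back-of-the-envelope estimate follows from these figures. The sketch below assumes that every clip is of average length, so it ignores the longer lectures and should be read as an optimistic bound; the names are mine.

```python
import math

TOTAL_CLIPS = 39         # video clips in the course
CLIPS_PER_WEEK = (3, 5)  # reviewed per week: slow week, fast week


def weeks_needed(total_clips, clips_per_week):
    """Whole weeks needed to review all clips at a constant weekly pace."""
    return math.ceil(total_clips / clips_per_week)


slowest = weeks_needed(TOTAL_CLIPS, CLIPS_PER_WEEK[0])
fastest = weeks_needed(TOTAL_CLIPS, CLIPS_PER_WEEK[1])
print(f"{fastest} to {slowest} weeks")  # 8 to 13 weeks
```

In other words, at this pace a single course takes an expert roughly two to three months.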

From all this one can conclude that the expert is the “bottleneck” of the translation. And so it is. However, you cannot do without an expert: as I have seen many times, the terminology problem is very acute for a crowdsourced translation initiative.

The limits of formalization

Why have I not told you how to tell which translation of a term is correct and which is not? It is simple: this area cannot be formalized. As a rule, I go by the feeling that something in the translated text bothers me. The text simply “doesn't sound right.” This sense is based on studying the relevant Russian-language literature, 6 years of study at the Bauman Moscow State Technical University as an engineer of computer systems and networks, and 2 years of analytical work at IBS. Of course, when a highly specialized text is translated by people who are not experts in the subject area, there is always a risk of getting special terms wrong. Very often this risk materializes. But the need for proofreading is offset by the fact that most of the text, as a rule, contains no complicated terms, so the translation of such fragments turns out more than tolerable.

Instead of a conclusion - what is all this for?

The question above is important: it is hard to live without motivation :) It is clear what the review gives to those who take Coursera courses with Russian subtitles: a high-quality translation and clear terminology. The question is what the review gives the expert himself, beyond a sense of moral satisfaction and a public duty fulfilled.

Frankly, the review experience let me immerse myself in the Data Science specialization once again: some things resurfaced in my memory, others became better structured in my head. It became easier to put the basic processes of data analysis on paper and explain them. Since I had completed the whole specialization in English, I simply had not thought about Russian versions of the concepts and descriptions of data analysis processes. Unfortunately, immersion in English-language courses takes its toll: sometimes in conversation I simply cannot immediately find a worthy Russian equivalent of a term. It is precisely acquiring knowledge in English, together with an unwillingness or inability to express it in Russian, that gives rise to Anglicisms and other borrowings in Russian, and with them it is hard to convey information both to Russian-speaking experts in the field and to a wider audience of non-specialists.

You don't have to look far for an example: not so long ago I put the Russian terminology I had learned while reviewing translations into practice by giving an introductory lecture on data analysis at MGIMO as part of the “Management of Innovations” course. The task was complicated by the fact that MGIMO is not an engineering university, so the material had to be adapted and structured so that even someone with no background in data analytics or mathematics would come away with a holistic view of what data analysis is, how it is generally performed and what it is for. The experience of reviewing translations helped a lot here: the conceptual framework and the main ideas for the introductory overview lecture took shape easily in Russian. I hope I managed to show the new generation of public administrators how data analysis can be used to build smart and constructive government policy ...

