
Multilingual Badoo: "translation difficulties"



Good localization, that is, adapting an application for users from different countries, helps it win the hearts of its audience. Bad localization, on the contrary, becomes a real pain. For example, one navigation app on Google Play tells users "Do not update, you have not purchased a commercial map" and warns ominously that "On some devices you will be asked to select the installation folder."

The goal of localization is not to simply make the application available in other languages, but to make every user feel that it was designed taking into account the specific features of his or her native language.
In this article, we briefly describe the aspects of localization that deserve attention first, and we share the experience we gained while translating Badoo into 46 languages. This is a very broad topic, and in later articles we will describe in detail how we implemented particular tools. At the end of the article, you can vote for the aspect you would like to read about first.

Introduction


Supporting multiple locales is a complex, multi-step task that begins with adapting your application code. Virtually any text shown to the user (unless it is a purely technical component) may require changes for some languages.

There are many ready-made solutions for separating translatable text from untranslatable text and for organizing a translation system without fatal flaws. We do not use any of them: we decided to build and develop our own system, make all the classic mistakes ourselves and reinvent the wheel. In return, our system turned out to be truly flexible and fits our needs in every respect. Let's start with the terminology and the general principles of operation.

The key element of the translation system is a piece of text that is compact enough to be convenient to work with, but large enough to remain logically complete. We call such fragments lexemes. Consider the Badoo messenger as an example. It is a good one, because similar interfaces are common in both mobile and web applications.


Pay attention to a few key points that are clearly visible in this screenshot. There are several kinds of lexemes:

Frequently used lexemes, such as “Search”, “Unread”, “Girl”, etc. In Badoo they are kept separate from the others and can be reused in different subsystems of our large and diverse architecture, which gives uniform translations across mobile and web applications. Key advantages of this approach:

Lexemes containing variables (“View profile and {{number}} photos”) are simple: you just need to remember to substitute the data.

Lexemes that depend on number and declension (“{{number}} girls will be seen here”) are much more complicated; we discuss this topic in a separate section.
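Substituting variables into a lexeme can be illustrated with a minimal sketch. The {{name}} placeholder syntax comes from the examples above; the `render` function and its behavior on missing values are assumptions made for illustration, not Badoo's actual code.

```python
import re

# Matches placeholders such as {{number}} (syntax taken from the article's examples).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(lexeme: str, variables: dict) -> str:
    """Replace every {{name}} placeholder; fail loudly if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Forgetting to pass data is exactly the mistake we want to catch early.
            raise KeyError(f"no value for placeholder {name!r}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, lexeme)


print(render("View profile and {{number}} photos", {"number": 12}))
```

Raising an error on a missing variable, rather than leaving the raw placeholder in the text, is one possible design choice: it keeps half-rendered strings from reaching users.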

The process of preparing and displaying translations can be a serious headache in terms of system performance, especially if you have to do this more than 20 thousand times per second (the peak load in Badoo can be higher).

Now let's take a closer look at what deserves attention.

Dialects and multistage fallback


Some languages have dialects. For example, English has British and American variants, and Spanish has Colombian, Argentinean and Mexican ones. Even if the translations coincide by 99%, a particular phrase may need to sound completely different in each of them. Ignoring this small nuance can lead to serious embarrassment. For example, rapariga means "girl" in Portugal, but in Brazil it is a derogatory word for a prostitute. The Brazilian dialect uses garota instead, which is not suitable in Portugal because there it means “little girl”.

In Badoo, we have arranged the languages as a tree. The root element is “universal English”. Other languages (including British and American English) branch out from it, and some of them, in turn, have dialects.

Translators work from top to bottom: universal English comes first, then the second-level languages, and only then their dialects. That is, the Spanish translation is made from universal English, and the Mexican Spanish translation is made from Spanish.

When translations are displayed to the user, the search goes from bottom to top. For example, for Mexican Spanish we first look for a Mexican translation. If there is none, we look for a Spanish one, and if that is missing too, we fall back to universal English.
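The bottom-up search can be sketched as a walk from a dialect towards the root of the language tree. The language codes, the `PARENT` and `TRANSLATIONS` tables and the `lookup` function below are hypothetical and only illustrate the idea.

```python
from typing import Optional

# Hypothetical language tree: each language points to its parent, the root has none.
PARENT: dict[str, Optional[str]] = {
    "en": None,      # "universal English", the root
    "en-GB": "en",
    "en-US": "en",
    "es": "en",
    "es-MX": "es",
    "pt": "en",
    "pt-BR": "pt",
}

# Hypothetical storage: language -> lexeme id -> translated text.
TRANSLATIONS: dict[str, dict[str, str]] = {
    "en": {"girl": "girl", "search": "Search"},
    "es": {"search": "Buscar"},
    "pt": {"girl": "rapariga"},
    "pt-BR": {"girl": "garota"},
}


def lookup(lexeme_id: str, language: str) -> str:
    """Return the most specific translation available for the language."""
    current: Optional[str] = language
    while current is not None:
        text = TRANSLATIONS.get(current, {}).get(lexeme_id)
        if text is not None:
            return text
        current = PARENT[current]  # climb one level towards the root
    raise KeyError(f"lexeme {lexeme_id!r} has no translation at all")


print(lookup("girl", "pt-BR"))    # dialect-specific translation
print(lookup("search", "es-MX"))  # falls back to Spanish
print(lookup("search", "pt-BR"))  # falls back to universal English
```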

Writing systems and punctuation


For most languages, it is enough to translate the text; the layout of the application and the surrounding interface elements need no changes. However, some languages are special:

For right-to-left languages, it is not enough to translate the text: the interface must also be mirrored, because not only the direction of the text changes, but also the direction in which information is perceived.


Punctuation has simpler cases. For example, Asian languages (Japanese, Korean) use their own Unicode characters for the period, the exclamation mark and the question mark (they look almost like ours, but they are different characters):
？！
。

There are more difficult cases too. For example, in Spanish a sentence that ends with a question or exclamation mark also begins with the same mark inverted (¿ or ¡).


And in no case should punctuation be left out of lexemes: it has to be translated together with the text!

Formats and units


There are subtle but very important differences in the formatting of dates and numbers, which can give them completely different meanings in different countries.
For example, 03/07/2013 can mean July 3 or March 7, depending on local conventions. This is a common cause of confusion between the United States and the United Kingdom, which share a language but use different date formats. Do not assume that two countries that speak the same language will read everything the same way.

Similar things happen with numbers. The number 1,000 can be read as “one” or as “one thousand”, depending on which character separates the fractional part. For example, in Korea a dot is the decimal separator, while in Germany it separates thousands.
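A minimal sketch of locale-aware date and number formatting follows. The `FORMATS` table is hand-written for illustration and covers only three locales; a real system would more likely rely on a maintained locale database than on a table like this.

```python
from datetime import date

# Hypothetical, hand-written conventions for three locales (illustration only).
FORMATS = {
    "en-US": {"date": "%m/%d/%Y", "group": ",", "decimal": "."},
    "en-GB": {"date": "%d/%m/%Y", "group": ",", "decimal": "."},
    "de-DE": {"date": "%d.%m.%Y", "group": ".", "decimal": ","},
}


def format_date(value: date, locale: str) -> str:
    """Format a date using the day/month order of the locale."""
    return value.strftime(FORMATS[locale]["date"])


def format_number(value: float, locale: str, digits: int = 2) -> str:
    """Format a number with the locale's grouping and decimal separators."""
    conventions = FORMATS[locale]
    text = f"{value:,.{digits}f}"  # always "1,234.50" style at this point
    # Swap both separators in a single pass so they cannot overwrite each other.
    table = str.maketrans({",": conventions["group"], ".": conventions["decimal"]})
    return text.translate(table)


third_of_july = date(2013, 7, 3)
print(format_date(third_of_july, "en-US"))  # month first
print(format_date(third_of_july, "en-GB"))  # day first
print(format_number(1234.5, "de-DE"))
```

The same date produces 07/03/2013 for the United States and 03/07/2013 for the United Kingdom, which is exactly the ambiguity described above.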

Special attention should be paid to the measurement system. The simplest solution is to display the user's height in feet and centimeters at the same time, but this looks unnatural. Instead, you can add a switch that lets the user choose the preferred units, and set its default value based on the selected language. This applies to length (height), weight, temperature and so on.
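The "default by language, overridable by the user" idea can be sketched as follows. The `DEFAULT_UNITS` table and the function names are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical defaults: languages not listed here fall back to metric.
DEFAULT_UNITS = {"en-US": "imperial"}


def cm_to_feet_inches(cm: float) -> tuple[int, int]:
    """Convert centimeters to whole feet and inches (1 inch = 2.54 cm)."""
    total_inches = round(cm / 2.54)
    return divmod(total_inches, 12)


def format_height(cm: int, language: str, preference: Optional[str] = None) -> str:
    """Use the explicit user preference if set, else the language default."""
    system = preference or DEFAULT_UNITS.get(language, "metric")
    if system == "imperial":
        feet, inches = cm_to_feet_inches(cm)
        return f"{feet}'{inches}\""
    return f"{cm} cm"


print(format_height(180, "en-US"))              # default for the language
print(format_height(180, "de-DE"))              # metric default
print(format_height(180, "de-DE", "imperial"))  # user switched units
```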

Stylistics


Different components of Badoo may use different styles of text: more formal in some places, more youthful and conversational in others. For example, in the terms of service and other official documents it is better to translate the English "you" with the formal Russian pronoun "вы", while entertainment interfaces often use the informal "ты".

In addition, it is very important to keep terminology consistent and to translate well-established words and phrases the same way everywhere. For example, the casual dating service on Badoo is called Encounters in English. This word can be translated in different ways, but we always stick to one translation, "Dating". This is extremely important; otherwise the user may fail to understand a promotional text that calls for some action, or an error message. To solve this problem, we use two mechanisms. The first is a separate group of short lexemes that are either used very often or may depend on gender and number. We will talk more about this group in the following sections.

The second mechanism we call TranslationMemory. It performs two functions at once:

The logic of TranslationMemory is quite simple, but its implementation is an interesting topic in its own right, and we will certainly describe it in more detail in the future. In short, when a lexeme is translated, we split both the original text and the translation into smaller “threads” (parts of phrases and whole sentences) at punctuation marks, tags, line breaks and some other delimiters. Why threads? Because they can intersect, intertwine and contain any number of other threads. The collection of all threads in a lexeme is called the structure of the lexeme.

If the thread structures of the original and the translation can be matched unambiguously, we save the pairs of threads. Later, when new lexemes appear in the translation system, we try to find a translation for each of their threads. By combining the options found, we select the most complete translation. The translator may take as the basis of a new translation one of several most complete candidates assembled piece by piece from different threads.
For example, having once translated two separate lexemes, "Hello world" and "My name is John", the translator has almost nothing left to do for "Hello world! My name is John." TranslationMemory will offer a ready translation. The translator only has to make sure that the punctuation marks suit the language.
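A heavily simplified sketch of this idea is shown below. It splits text only at sentence-ending punctuation, whereas the article also mentions tags, line breaks and nested threads; the class and method names are hypothetical and this is not Badoo's implementation.

```python
import re
from typing import Optional

# Split after sentence-ending punctuation, keeping the mark with its segment.
SPLIT = re.compile(r"(?<=[.!?])\s+")
MARKS = ".!?"


def segments(text: str) -> list[str]:
    return [part for part in SPLIT.split(text.strip()) if part]


def key(segment: str) -> str:
    """Normalize a segment so that case and trailing punctuation do not matter."""
    return segment.rstrip(MARKS).strip().lower()


class TranslationMemory:
    def __init__(self) -> None:
        self.pairs: dict[str, str] = {}

    def learn(self, source: str, translation: str) -> None:
        src, dst = segments(source), segments(translation)
        if len(src) != len(dst):
            return  # structures cannot be matched unambiguously, store nothing
        for original, translated in zip(src, dst):
            self.pairs[key(original)] = translated.rstrip(MARKS).strip()

    def suggest(self, source: str) -> Optional[str]:
        """Assemble a draft from known segments, or None if any one is unknown."""
        parts = []
        for segment in segments(source):
            known = self.pairs.get(key(segment))
            if known is None:
                return None
            tail = segment[len(segment.rstrip(MARKS)):]  # punctuation of the source
            parts.append(known + tail)
        return " ".join(parts)


memory = TranslationMemory()
memory.learn("Hello world", "Hola mundo")
memory.learn("My name is John", "Me llamo John")
print(memory.suggest("Hello world! My name is John."))
```

The draft reuses the punctuation of the source text, so for Spanish the translator would still have to add the inverted opening mark, which matches the remark above about checking punctuation.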

Dependence on gender


Different languages mark gender differently: some use articles and prepositions, some use endings, and some use all of these at once. For example, in Slavic languages almost every part of speech may depend on gender. In addition, complex phrases may depend not only on the gender of the object, but also on the gender of the subject. The rules in some languages can be so complicated that sometimes you have to duplicate the English text for several combinations of objects and subjects of different genders and modify the application accordingly.

Such situations are almost impossible to predict without being a polyglot. Moreover, we believe that developers should not have to think about this at all. Therefore, our translators have a special tool in the translation interface that lets them “order” the splitting of a lexeme by gender: a development ticket is automatically created with a description of the problem.
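Once a lexeme has been split by gender, choosing the right variant might look like the sketch below. The data layout, the "any" key and the sample texts are assumptions for illustration; the Russian past tense is used because its ending depends on the gender of the subject.

```python
# Hypothetical storage: language -> gender of the subject -> text.
# "any" marks a variant that does not depend on gender.
VISITED_PROFILE = {
    "en": {"any": "{{name}} visited your profile"},
    "ru": {
        "male": "{{name}} посетил ваш профиль",
        "female": "{{name}} посетила ваш профиль",
    },
}


def pick_variant(lexeme: dict, language: str, gender: str) -> str:
    """Prefer the gender-specific text, fall back to the neutral one."""
    variants = lexeme[language]
    if gender in variants:
        return variants[gender]
    if "any" in variants:
        return variants["any"]
    raise KeyError(f"no variant for gender {gender!r} in {language!r}")


template = pick_variant(VISITED_PROFILE, "ru", "female")
print(template.replace("{{name}}", "Анна"))
```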

Dependence on number and declension


In most languages, there are only two forms that depend on number: singular and plural. Russian is an excellent example of more complex rules: 1 пользователь, 2 пользователя, 5 пользователей (1 user, 2 users, 5 users, with a different noun form each time). Moreover, 21 (31, 41, 101) takes the same form as 1, but 11 takes the same form as 5. The rules themselves are not very complicated, but let's dig deeper.

Applications usually count whatever matters to them. Social networks count users, photos, posts and likes. Financial applications count transactions, currency and customers. GPS navigators count minutes and kilometers (or miles). The quantities being counted are those whose names and units of measure appear everywhere in the application. These are the frequently used lexemes that have been mentioned repeatedly in this article. Dependence on number is one of the reasons why we created a separate tool for working with such lexemes.
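The Russian rule can be written down in a few lines. This is a sketch of the standard rule for whole numbers; the category names "one", "few" and "many" and the function names are chosen for illustration.

```python
def russian_plural_category(n: int) -> str:
    """Pick the noun form category that Russian uses after the number n."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return "one"   # 1, 21, 31, 101 ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"   # 2-4, 22-24 ... but not 12-14
    return "many"      # 0, 5-20, 25-30, 111 ...


# Forms of the Russian word for "user" in each category.
USER_FORMS = {"one": "пользователь", "few": "пользователя", "many": "пользователей"}


def count_users(n: int) -> str:
    return f"{n} {USER_FORMS[russian_plural_category(n)]}"


for number in (1, 2, 5, 11, 21, 101):
    print(count_users(number))
```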

The second reason is declension, which Russian schoolchildren remember through the mnemonic “Ivan gave birth to a girl and ordered the diaper to be fetched”: the initial letters of its Russian words match the names of the six cases. Interesting fact: Hungarian, with its 17 cases, holds the record among the languages into which we translate the site and the applications. For rarely encountered words and phrases, a regular text translation without any programmatic binding is enough. For frequently occurring words and phrases, it helps to have a tool that returns a grammatically correct form. For example, a phrase telling you that 2 girls liked you warms the soul not only with the pleasant prospect of a new acquaintance, but also with clear and correct Russian.

Our toolkit supports two important operations. Developers can get a ready-made word or phrase in a grammatically correct form (more precisely, a universal container). Translators can use these correct forms in the translations of ordinary lexemes. For example, the lexeme mentioned above will look like "You liked {{users_num}} {{users_word # Dative}}" in the Russian translation system. This gives us a certain freedom: the translator can rephrase the lexeme and change the case at her own discretion.

This is a fairly good solution, but it requires interaction between translators and developers. We are now working on a system that will allow translators alone, without the participation of developers, to change an entire lexeme or a part of it based on the variables it contains.
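How a placeholder such as {{users_word # Dative}} might be resolved is sketched below. The placeholder syntax is taken from the example above; the `USERS_WORD` container, the Russian sample text and the `render` function are assumptions and do not reproduce Badoo's toolkit.

```python
import re


def plural_category(n: int) -> str:
    """Russian plural category for a whole number."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# Hypothetical container: case -> plural category -> form of the word "user".
USERS_WORD = {
    "Nominative": {"one": "пользователь", "few": "пользователя", "many": "пользователей"},
    "Dative": {"one": "пользователю", "few": "пользователям", "many": "пользователям"},
}

# {{name}} or {{name # Case}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*(?:#\s*(\w+)\s*)?\}\}")


def render(lexeme: str, count: int) -> str:
    def substitute(match: re.Match) -> str:
        name, case = match.group(1), match.group(2)
        if name == "users_num":
            return str(count)
        if name == "users_word":
            # The translator chooses the case; the count chooses the number form.
            return USERS_WORD[case or "Nominative"][plural_category(count)]
        raise KeyError(f"unknown placeholder {name!r}")

    return PLACEHOLDER.sub(substitute, lexeme)


print(render("Вы понравились {{users_num}} {{users_word # Dative}}", 2))
print(render("{{users_num}} {{users_word}}", 5))
```

Because the case is written inside the translated text, a translator can rephrase the sentence and switch to another case without touching the application code.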

Context and length of lexemes


Often the same phrase (not to mention individual words) is translated differently depending on the context. The short word “Search” can be either a noun or a verb. When you try to reuse identical lexemes and translations, it is important to keep an eye on the context. To help translators understand the context of a phrase correctly, we usually attach a screenshot showing how the lexeme is used. We even built an automatic system that collects screenshots while a task is being tested, but that is a subject for a separate article.

When working on mobile projects, you need to pay special attention to the length of strings. Screen space is scarce, and you have to make sure that a piece of text fits into the area assigned to it. A term that is a single word in English often turns into a whole sentence in other languages. The length of a string can be limited either in characters or in pixels (if the font size and typeface are known in advance and rarely change).

The length limit is, as a rule, only a recommendation. If it is exceeded, the translator sees a warning but can still save the translation.
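A soft limit of this kind can be sketched as a check that warns but never blocks. The `SaveResult` type and `save_translation` function are hypothetical; only a character limit is shown, since a pixel limit would additionally need font metrics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SaveResult:
    saved: bool
    warning: Optional[str] = None


def save_translation(text: str, max_chars: Optional[int] = None) -> SaveResult:
    """Save the translation; exceeding the limit only produces a warning."""
    warning = None
    if max_chars is not None and len(text) > max_chars:
        warning = (
            f"translation is {len(text)} characters long, "
            f"recommended limit is {max_chars}"
        )
    # Actual storage is omitted in this sketch; the save always succeeds.
    return SaveResult(saved=True, warning=warning)


print(save_translation("Suchen", max_chars=8))
print(save_translation("Einstellungen", max_chars=8))
```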

Versioning and fault tolerance


When your team has more than a hundred developers, working with translations requires some caution: the same template with translatable text (or the same mobile application dictionary) can be changed in different tasks at once. The translation system must be able to distinguish between different versions of files and understand which translation should be shown to the user.

For a large team, it is also important to make the translation system as convenient and fault-tolerant as possible. Convenience lets new team members get started quickly. Fault tolerance reduces the influence of the human factor: the system must cope with human errors on its own and either correct them where possible or complain loudly and give the culprit an electric shock.

Let users translate


You can spend a long and painful time hiring in-house or freelance translators, inventing a quality control system for translations, and suffering in every possible way whenever you want to add support for a new language. But if your application is an entertainment product and its audience is large enough, it is perfectly acceptable to involve users in translation. Facebook and WhatsApp are translated this way, and recently Badoo has been too.
We attach great importance to the quality of translations, and launching such a scheme was scary. However, this approach has a number of strengths:

We reward the most active participants, but for the most part users work for the idea of making Badoo available in their own language. Users currently work on seven languages, three of which (Finnish, Malay and Vietnamese) are already available to the entire Badoo community. Translations for the remaining four (Basque, Bengali, Icelandic and Swahili) are not yet good enough to be enabled for all users, but that is only a matter of time.

Conclusion


The goal of localization is to make users feel comfortable in your application, regardless of their language and place of residence. This often requires non-obvious and difficult decisions, but after seven years of experience we can safely say that it is worth it.
The translation system in Badoo has been taking shape all these years and continues to evolve. In future articles we will try to describe our technical and organizational solutions in more detail. What the next article will be about is up to you!

Gleb Deikalo, PHP developer

Source: https://habr.com/ru/post/223767/

