A couple of words about application internationalization

I have been reading Habr regularly for a long time and have noticed that there are rather few clear articles about software localization aimed at developers. From my experience of managing localization projects, I can say that localization is not only translating strings and adapting an application to the context of a given country, but also a constant confrontation (ideally, cooperation on equal terms) with developers.
In this article I will try to show by example how to write so-called localization-friendly code, that is, how to organize resources so that localizing the application becomes substantially easier and costs less time and money.
A caveat up front: the discussion will focus primarily on internationalization, that is, on taking all linguistic features into account at the design stage. If your project's resources were not designed with localization in mind and you decide to localize later, retrofitting them can cost much more than the revenue localization brings.

Use Unicode

In most cases, the question of the UTF-8 (or UTF-16) encoding comes up when planning localization into Asian languages, where the number of characters can reach several thousand. Even if localization into Korean or Chinese is not planned at the moment, it is worth taking care of a universal encoding in advance. If your product's localization strategy changes, switching to another encoding midstream will be much harder. Tip: use Unicode by default for all resources, even if the project currently exists only in Russian, English or any other single language.
By the way, the JSON and YAML specifications (these formats are often used to store localized resources) prescribe the use of Unicode.
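As a minimal sketch of the "Unicode by default" tip, the Python snippet below writes and reads a JSON resource file with an explicit UTF-8 encoding instead of relying on the platform default. The file name `strings.json` and the resource keys are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical resource strings in two different scripts.
strings = {"save_as_fi": "Tallenna nimellä", "save_as_zh": "另存为"}


def save_resources(path, data):
    # ensure_ascii=False keeps the characters readable instead of \uXXXX escapes;
    # the explicit encoding avoids depending on the platform default.
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")


def load_resources(path):
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    resource_file = Path(tmp) / "strings.json"  # hypothetical file name
    save_resources(resource_file, strings)
    restored = load_resources(resource_file)
```

The round trip returns exactly the strings that were written, whatever the operating system's default encoding is.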

Take care of fonts

This seemingly trivial detail often turns out to be a critical obstacle to localization. Make sure that the fonts you use contain the characters of your target languages (first of all, again, Asian languages, as well as Hebrew, Arabic and European diacritics).
Remember that ä, à or ą ≠ a, just as in Russian “е” is not always the same as “ё”.
In my practice there was a case when the developers drew their own font, which contained letters only for English. When it came to localization into German and Polish, they had to draw the accented letters as well.
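A check like this can be automated. The sketch below assumes the set of characters a font supports is already known (in a real project it would be read from the font's character map); the `ENGLISH_ONLY_FONT` set and the sample strings are hypothetical.

```python
import string

# Hypothetical font that covers only basic Latin letters, digits and punctuation.
ENGLISH_ONLY_FONT = set(string.ascii_letters + string.digits + string.punctuation)

# Sample UI strings; only the Finnish one comes from this article.
translations = {
    "en": "Save as",
    "fi": "Tallenna nimellä",
    "de": "Größe",
    "pl": "Wyjście",
}


def missing_glyphs(text, supported):
    """Return the characters of `text` that the font cannot render."""
    return {ch for ch in text if not ch.isspace() and ch not in supported}


# Map each locale to the characters the font lacks for it.
report = {loc: missing_glyphs(text, ENGLISH_ONLY_FONT) for loc, text in translations.items()}
```

An empty set for a locale means the font covers every character used in that locale's strings.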

Leave space for maneuver

Besides fonts, translated application text hides another pitfall for layout.
Compare how one menu item can be translated into different languages.

ru: Сохранить как
en: Save as
fi: Tallenna nimellä
zh: 另存为

Chinese needs only 3 characters, while Finnish needs 16! Besides the number of characters, the properties of the font also matter.

Compare the rendered length of the Finnish and Chinese strings (both set in Arial Unicode MS, 12 pt): the Finnish text (114 pixels) is 2.5 times longer than the Chinese one (45 pixels).
Therefore, it is very important to leave spare room in interface elements so that the displayed text is not cut off. Where there is not enough space, you can use automatic text fitting. However, this will most likely result in text of different sizes appearing in different interface elements.
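To get a rough early warning, the lengths of the translations can be compared with the source string. The sketch below uses character counts only, which is an approximation: as noted above, the real width depends on the font. The function name `expansion_report` is an assumption made for this example.

```python
translations = {
    "en": "Save as",
    "fi": "Tallenna nimellä",
    "zh": "另存为",
}


def expansion_report(strings, source_locale="en"):
    """Return each translation's length relative to the source string."""
    base = len(strings[source_locale])
    return {loc: round(len(text) / base, 2) for loc, text in strings.items()}


report = expansion_report(translations)
```

A ratio well above 1 (Finnish here) marks an element that needs extra room; a ratio below 1 (Chinese) marks one that may look empty.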

Pseudolocalization

Pseudolocalization helps reveal such problem areas before translation begins. It is a method of testing an application's readiness for localization. Its essence is that, instead of a translation, the resources are filled with text in a pseudo-language generated by a special algorithm (which depends on the software used). The most primitive example: the English text is replaced by its transliteration or transcription in Cyrillic letters:
Save as -> Саве ас
Save as -> Сейв эз
This method lets you check the following points:

whether diacritical marks are displayed correctly (for example, German, Polish);
whether languages with other scripts are displayed correctly (for example, Chinese, Russian);
whether interface elements are displayed correctly for languages with right-to-left text direction (for example, Arabic);
whether non-standard characters are displayed correctly (for example, in user names);
whether all localizable resources have been extracted into separate files (text placed directly in the code causes many problems, see the section on hardcoding below).

Machine translation into the target language is often used for pseudolocalization. On the one hand, this is a simple solution when no special tools for generating pseudo-translations are available. On the other hand, I have seen more than once how developers confused localized resources with pseudolocalized ones and even inadvertently replaced a proper translation with a machine one in the repository. In addition, machine translation does not always let you check the display of every character of a language (for example, the letter œ is rare in texts, but its display must be tested too).
For example, the pseudo-translation plugin interface in memoQ looks like this:

[Screenshot: pseudo-translation plugin settings in memoQ]

And here is the result with these settings:

[Screenshot: pseudo-translated text]
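If no dedicated tool is at hand, a simple pseudo-translation can be generated with a few lines of code. The following is only a sketch, not the algorithm used by memoQ or any other product: the character mapping, the 30% padding and the `{name}` placeholder syntax are assumptions made for this example.

```python
import re

# Hypothetical mapping from ASCII letters to accented look-alikes.
ACCENTED = str.maketrans(
    "aceinouyACEINOUY",
    "åçéîñöüÿÅÇÉÎÑÖÜŸ",
)

# Placeholders such as {app} must survive pseudo-translation untouched.
PLACEHOLDER = re.compile(r"(\{[^{}]*\})")


def pseudolocalize(text, expansion=0.3):
    """Accent the letters, pad the string and wrap it in visible brackets."""
    parts = PLACEHOLDER.split(text)
    # re.split with a capturing group puts the placeholders at odd indexes.
    converted = "".join(
        part if index % 2 else part.translate(ACCENTED)
        for index, part in enumerate(parts)
    )
    padding = "~" * max(1, round(len(text) * expansion))
    return f"[{converted}{padding}]"


print(pseudolocalize("Save as"))
print(pseudolocalize("Open {app}"))
```

The brackets make truncated strings easy to spot, the padding imitates longer translations, and any text that shows up without accents in the running application was probably hardcoded.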

External resources

To have a complete overview of the material to be localized, you need to separate all resources from the code. Multimedia content that contains text (most often images, but also video and audio, for example in games) should also be stored separately and sorted by locale. First, this greatly simplifies the work of content authors: they will not need to dig into the code to correct a system message. Second, it lets the localization manager estimate the time and budget for each language accurately. Third, it gives you far more flexibility in working with multilingual content.
The most popular formats for exchanging localization data are XLIFF and .po files. In any case, modern computer-assisted translation systems can convert almost any file into a format that translators can work with.
Google and Apple also strongly advise developers to externalize all localizable resources: see the recommendations for Android developers and Apple's recommendations on internationalization.
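What "resources separated from the code" can look like is sketched below. The catalogs are kept in memory to keep the example short; in a real project each locale would live in its own file (XLIFF, .po, JSON and so on). The keys, the `translate` function and the fallback order are assumptions made for this example.

```python
# Hypothetical per-locale catalogs; whole strings, punctuation included.
CATALOGS = {
    "en": {"confirm": "Are you sure?", "save_as": "Save as"},
    "fr": {"confirm": "Êtes-vous sûr ?"},
    "es": {"confirm": "¿Está seguro?"},
}


def fallback_chain(locale, default="en"):
    """'fr-CA' -> ['fr-CA', 'fr', 'en']"""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-")[0])
    if default not in chain:
        chain.append(default)
    return chain


def translate(key, locale):
    """Look the key up in the most specific catalog that contains it."""
    for candidate in fallback_chain(locale):
        catalog = CATALOGS.get(candidate, {})
        if key in catalog:
            return catalog[key]
    return key  # make a missing string visible instead of crashing


print(translate("confirm", "fr-CA"))
print(translate("save_as", "es"))
```

The code only ever refers to keys, so a translator can change the wording of any message without touching the program.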

Hardcoding in internationalization

Continuing the previous section, one important point is worth mentioning. Localization involves not only the translation of words, but also the adaptation of numbers, units of measurement, date and time formats, and punctuation marks to local standards.

Punctuation marks

Many developers like to hardcode punctuation marks, assuming that periods and question marks are surely the same in all languages. But compare:

ru: Вы уверены?
en: Are you sure?
fr: Êtes-vous sûr ?
es: ¿Está seguro?
ar: هل أنت متأكد؟

In French, the question mark is separated from the text by a space (by the way, Habr stubbornly removed the space before the question mark, so I had to conjure with tags). In Spanish, a question is framed by an inverted question mark at the beginning and a regular one at the end, and in Arabic the mark stands on the left and is mirrored. If the question mark is added by the code, not every user will find such a message comfortable to read (unless the code handles each locale separately, but why go to such lengths?).

Besides punctuation marks, you need to be careful with spaces when you let the code insert them. After all, there are languages that do not put spaces between words, for example Japanese.
It is said that localizing Japanese or Chinese applications into European languages can become a living hell if the developers did not take into account that other languages separate words with spaces.
So punctuation is part of the text, and it belongs in external resources.

Numbers

Numbers, like words, also need translation. Many developers forget about this and output numeric variables in the format they are used to. Let's compare:

ru: 18 765,22
en: 18,765.22
de: 18.765,22
he: 18,765.22
el: 18.765,22
fa: 18٬765٫22

Notice which characters are used as the thousands and decimal separators. In English and Hebrew, the period and the comma play the opposite roles compared with German and Greek. In Russian, a (non-breaking) space is used as the thousands separator for numbers greater than 9999. In Farsi, the fractional part is separated by a special character, the momayyez (U+066B), and thousands by a similar-looking separator (U+066C), although usage is not strictly standardized and a comma or a space may also serve as the thousands separator.
You may, of course, dismiss these as trifles and assume that “whoever needs it will figure it out anyway.” However, such trifles can sometimes lead to serious misunderstandings, especially when it comes to prices or important engineering calculations.
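To show the idea of locale-dependent separators, here is a small hand-rolled formatter. It is a sketch only: the `SEPARATORS` table is filled in by hand for four of the languages mentioned above, and it ignores details such as the Russian rule for numbers below 10,000. Production code would normally take this data from a locale database rather than maintain it manually.

```python
# Hypothetical separator table: (thousands separator, decimal separator).
SEPARATORS = {
    "en": (",", "."),
    "de": (".", ","),
    "el": (".", ","),
    "ru": ("\u00a0", ","),  # non-breaking space
}


def format_number(value, locale, decimals=2):
    thousands, decimal = SEPARATORS[locale]
    # Format with the English separators first, e.g. "18,765.22" ...
    neutral = f"{value:,.{decimals}f}"
    # ... then swap both separators in a single pass so they cannot clash.
    return neutral.translate(str.maketrans({",": thousands, ".": decimal}))


for loc in SEPARATORS:
    print(loc, format_number(18765.22, loc))
```

The number itself stays a plain numeric value in the program; only its presentation is chosen at output time.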
Speaking of prices, let's compare:

ru: 2,25 €
en: € 2.25
de-at: € 2,25
de-de: 2,25 €
lv: € 2,25
lt: 2,25 €

Different languages place the currency sign differently, so hardcoding these characters is not a good idea either. And, as you can see, the conventions differ not only between languages but also between variants of one language (Austria and Germany). Even neighboring Latvia and Lithuania follow different conventions.
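One way to avoid hardcoding the currency sign is to keep the whole price pattern in the locale data. In the sketch below the patterns simply repeat some of the examples from the comparison above; the table names and the `format_price` function are assumptions made for this example.

```python
from decimal import Decimal

# Price patterns and decimal separators per locale (hypothetical tables).
PRICE_PATTERNS = {
    "en": "{symbol} {amount}",
    "de-at": "{symbol} {amount}",
    "de-de": "{amount} {symbol}",
    "lt": "{amount} {symbol}",
}
DECIMAL_SEPARATOR = {"en": ".", "de-at": ",", "de-de": ",", "lt": ","}


def format_price(amount, locale, symbol="€"):
    # Decimal avoids binary floating-point surprises with money.
    number = f"{amount:.2f}".replace(".", DECIMAL_SEPARATOR[locale])
    return PRICE_PATTERNS[locale].format(amount=number, symbol=symbol)


print(format_price(Decimal("2.25"), "de-at"))
print(format_price(Decimal("2.25"), "de-de"))
```

Both the position of the sign and the decimal separator come from data, so adding a locale does not require touching the code.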

Units

Sometimes not only the appearance of a number but the number itself has to be adapted to national standards. We are talking about units of measurement. If your project uses them, always find out which system is used in a particular country, so that the user gets clear information about speed, length, mass, temperature and so on.
The message “You are moving at a speed of 62 miles per hour” means nothing to a driver from Pskov. Likewise, the message “You are moving at a speed of 100 kilometers per hour” can baffle a driver from Chicago.
In this case it is not enough to output a numeric variable: you have to dig deeper and change the calculation formula depending on the locale. The ideal solution, though, is to let the user choose the measurement system in the application settings and make this setting independent of the locale. In any case, local units of measurement should always be taken into account.
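A minimal sketch of this approach: the speed is stored in one unit internally and converted according to the measurement system the user has chosen in the settings. The setting values "metric" and "imperial" and the function name are assumptions made for this example.

```python
KM_PER_MILE = 1.609344

# Whole messages per measurement system, so translators see complete sentences.
SPEED_MESSAGES = {
    "metric": "You are moving at a speed of {value} kilometers per hour",
    "imperial": "You are moving at a speed of {value} miles per hour",
}


def speed_message(speed_kmh, system):
    """Build the message for a speed stored internally in km/h."""
    value = speed_kmh if system == "metric" else speed_kmh / KM_PER_MILE
    return SPEED_MESSAGES[system].format(value=round(value))


print(speed_message(100, "metric"))
print(speed_message(100, "imperial"))
```

The same 100 km/h is shown as 100 kilometers per hour to one user and as 62 miles per hour to another.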

Not all languages have the same grammar

Splitting strings into fragments

Some developers, when organizing text strings, ignore the grammar of other languages and split one sentence into several values. As a result, messages are assembled from several pieces according to the rules of Russian syntax (or of the developer's native language). A translation into English can sometimes still be made to work (though even that is rarely possible), but in German, with its strict word-order rules, assembling the fragments into a single sentence produces complete nonsense. And in Arabic, where text is written in the opposite direction, this way of organizing content is simply unacceptable.

Here is a fairly common example. A Russian-speaking user sees the message: “There are 5 days left until the end of the test period. Please enter a valid key.” In the resources, this message looks like this (the values are given in English translation; in Russian the verb has separate singular and plural forms, and the word for “day” has three forms depending on the number):

 'trialexpires_1': "Until the end of the test period"
 'trialexpires_2sg': "is left"
 'trialexpires_2pl': "are left"
 'trialexpires_4sg': "day."
 'trialexpires_4pl2': "days."
 'trialexpires_4pl3': "days."
 'enterkey': "Please enter a valid key."

In principle, with some ingenuity these “scraps” of text can be translated into English so that the result is quite correct. With Arabic, where the direction of the text is different, this trick will not work. In German, separable verb prefixes keep running off to the end of the sentence. By the way, compare once again the length of this phrase in different languages: the German version is 30% longer than the English one. Pay attention to the verbs: in German they can consist of two parts (läuft ... ab, geben ... ein), one of which can be quite far from the other.

en: Your trial period expires in 5 days. Please enter a valid key.
de: Ihre Testversion läuft in 5 Tagen ab. Bitte geben Sie einen gültigen Produktschlüssel ein.

Another disadvantage is that with such a representation the translator cannot always grasp the logic of the sentence and provide a correct translation. Imagine how easy it is to get lost among such fragments when there are about five thousand of them.
All this tells us that, whenever possible, the whole sentence should be put into the resources as a single string, so that it not only has the most universal format but is also understandable to the person who will translate it.
The solution for the situation described would be the following:

 'trialexpires': "Until the end of the test period [count: is left | are left] {%n} [count: day | days | days]."
 'enterkey': "Please enter a valid key."

The count operator (or whatever you choose to call it) substitutes the appropriate text depending on the numeric variable %n. With this representation, an Arabic translator, who writes from right to left, will have no problems: they will simply move the variable to where it belongs.
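How such an operator might work can be sketched as follows. The `[count: ...]` and `{%n}` notation is the illustrative syntax from the example above, not the syntax of a particular library; the parser and the Russian template are written for this sketch only. The plural rule itself is the standard one for Russian: one form for 1, 21, 31..., another for 2-4, 22-24..., and a third for everything else.

```python
import re

COUNT = re.compile(r"\[count:([^\]]*)\]")


def russian_plural_index(n):
    """0 for 1, 21, 31...; 1 for 2-4, 22-24...; 2 for everything else."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


def render(template, n):
    def choose(match):
        forms = [form.strip() for form in match.group(1).split("|")]
        # With only two forms (singular | plural), any non-singular
        # count falls back to the last form.
        return forms[min(russian_plural_index(n), len(forms) - 1)]

    return COUNT.sub(choose, template).replace("{%n}", str(n))


# Hypothetical Russian resource string: one whole sentence, forms listed inline.
TRIAL_RU = "До конца пробного периода [count: остался | осталось] {%n} [count: день | дня | дней]."

for days in (1, 2, 5, 21):
    print(render(TRIAL_RU, days))
```

The translator sees the whole sentence and decides where the number goes and which forms are needed; the code only picks a form by number.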

Layout with forced line breaks

A fairly common problem is the desire of developers to get the text layout they need in the interface by means of forced line breaks. Here is an example right away.
The user sees the text like this:

This text is so big,
and the window is so small,
that I have to break
it into lines.

In resources, it might look like this:

 'menubox_string1': "This text is so big,"
 'menubox_string2': "and the window is so small,"
 'menubox_string3': "that I have to break"
 'menubox_string4': "it into lines."

The translator will spend several times longer translating such a mess, working out how to adapt it to their own language. If the translated text is longer (German or French), four lines may not be enough. If it is shorter (Japanese or Chinese), a couple of lines will remain empty. Not to mention that if translation memory technology is used (where the translation of each string is stored and reused for identical or similar strings), such splitting will do nothing to make the localization more efficient.
There are two ways out: either use automatic text fitting to the window dimensions or, if you don't want to trust automatic line wrapping, use \n.
Then the text in the resources will look like this:

 'menubox': "This text is so big,\nand the window is so small,\nthat I have to break it into lines."

In this case, line breaking becomes more flexible. For example, the translator can be told the maximum number of characters per line and asked to place the breaks where they are most logical.
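The first option, automatic fitting, can be illustrated with Python's standard `textwrap` module. This is only a sketch: it counts characters rather than pixels, and the limit of 28 characters per line is an arbitrary value chosen for the example.

```python
import textwrap

# The whole message is stored as a single resource string.
text = "This text is so big, and the window is so small, that I have to break it into lines."


def fit_to_window(message, max_chars):
    """Break the message into lines no longer than max_chars characters."""
    return textwrap.wrap(message, width=max_chars)


lines = fit_to_window(text, 28)
print("\n".join(lines))
```

The translator works with one complete sentence, and the number of lines adapts to the length of each translation.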

Excessive optimization

This mistake is made by overly diligent content managers, especially those who optimize English texts. In over-optimized resources, everything possible (all keywords, and sometimes whole expressions) is replaced with constants which, after localization, get substituted without regard to case, articles and other features of the target language's grammar. Of course, this gives better control over the consistency of terminology and can also reduce translation costs significantly. But any optimization should be reasonable. Let's look at an example:

The user sees the following text:
You can launch the application from the terminal. Press F2 to access the terminal.

In the resources it is assembled from the following pieces:

 'cmd': "the terminal"
 'app': "the application"
 'act_42': "Press F2"
 'run_from_terminal': "You can launch {app} from {cmd}. {act_42} to access {cmd}."

Suppose the interface uses certain words and phrases many times, and the content manager has replaced them with constants. He uses these constants in the texts because it is convenient. If one day it is decided that the word “terminal” is unacceptable and “command line” should be used instead, or the terminal in the system is replaced with, say, a menu, there will be no need to rework a huge body of text. It is enough to change the value of one constant. An additional advantage is a lower total word count. The cost of translation is most often calculated by the number of words (much less often by the number of lines), which means the total cost of localization can be reduced. But no such luck. Remember that I already wrote that not all languages follow the same grammar rules? It matters a great deal here too.

Let's see how resources in this form will be translated into Russian.

 'cmd': "терминал"
 'app': "приложение"
 'act_42': "Нажмите F2"
 'run_from_terminal': "Вы можете запустить {app} из {cmd}. {act_42}, чтобы открыть {cmd}."

The user will see the following:

“Вы можете запустить приложение из терминал. Нажмите F2, чтобы открыть терминал.”

If you replace the word “приложение” (application) with the word “программа” (program), the result becomes even worse, but the problem becomes more obvious:

“Вы можете запустить программа из терминал. Нажмите F2, чтобы открыть терминал.”

Obviously, this approach ignores grammatical case: the constants are inserted in the nominative, while Russian requires the genitive after “из” (“из терминала”) and the accusative after “запустить” (“программу”).
You don't have to look far for such examples. Just look at the dreadfully localized Foursquare:

[Screenshot: Foursquare interface in Russian]

Or take a look at the filter names. Not all of them work as a continuation of the phrase “Show places ...”. These are probably constants that are also used elsewhere. Or it is simply mindless translation and a lack of localization testing.

Facebook is constantly improving its localization with the help of volunteer users (not long ago they posted a vacancy for a localization manager, so I hope everything will soon get even better), but this line, for example, does not read like natural Russian: it is built according to the rules of the source language.

[Screenshot: a line from the Russian Facebook interface]

In the Russian version it would be better to write “Place of study: %Universities%”.
A similar example from another section:

[Screenshot: a similar line from another Facebook section]

Conclusion: text constants are certainly useful, but they must take the grammar of other languages into account. The ideal approach is to limit constants to numbers, units of measurement (taking each language's grammar into account; Russian, for example, has two plural forms: 1 уровень, 2 уровня, 5 уровней, that is, 1 level, 2 levels, 5 levels), proper names (names of software products) and keyboard shortcuts.
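The case problem can be reproduced in a few lines. The sketch below contrasts naive substitution with one possible workaround, in which each term carries its case forms and the translator picks the form inside the string. The `TERMS_RU` table and the template syntax (Python's `str.format` field lookup) are assumptions made for this example; writing the sentence out in full remains the simplest option.

```python
# Hypothetical term table: each constant stores its Russian case forms.
TERMS_RU = {
    "cmd": {"nom": "терминал", "gen": "терминала", "acc": "терминал"},
    "app": {"nom": "программа", "gen": "программы", "acc": "программу"},
}

# Naive template: the constant is inserted as is (nominative case).
NAIVE = "Вы можете запустить {app} из {cmd}."
# Case-aware template: the translator chooses the form each slot needs.
CASE_AWARE = "Вы можете запустить {app[acc]} из {cmd[gen]}."

naive = NAIVE.format(app=TERMS_RU["app"]["nom"], cmd=TERMS_RU["cmd"]["nom"])
correct = CASE_AWARE.format(**TERMS_RU)

print(naive)    # ungrammatical
print(correct)  # grammatical
```

The terminology is still controlled in one place, but the grammar of the sentence stays in the translator's hands.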

Conclusion

Traditionally, software localization has been kept separate from development; moreover, many product managers see localization as a simple replacement of the original text with a foreign one. As a result, the product as a whole suffers, because:

non-optimized resources increase the labor costs of localization;
bugs discovered during localization delay the product's release and again increase the labor costs of fixing them;
the localization budget keeps growing;
clumsy localization affects the number of purchases and downloads of an application in a particular region and gives competitors an extra chance. Personally, I believe that a poorly localized product is much worse than a non-localized one.

Even if an application is written for the local market, localization may still become necessary. It is quite possible that in a couple of years there will be great demand in Moscow for Yandex.Maps in Tajik.
Try to develop applications with internationalization in mind and start working with your localization manager or translation agency at the design stage. This will save you time, resources and money, and will ensure the highest quality of the local versions of your products.

Source: https://habr.com/ru/post/165705/
