
Arabic localization: words, words, words



This is the second part of the story about localizing ABBYY FineReader Sprint into Arabic. Unlike the first part, which said little about the language itself and a lot about windows, this one is about the writing system and its peculiarities.

Arabic Arabic numerals


More precisely, Indo-Arabic ones. While Europeans, and the half of the world that joined them, use Arabic numerals, the Arabs themselves (in most countries) prefer other, only remotely similar digits: ٠١٢٣٤٥٦٧٨٩. Wikipedia, for example, has a little more detail. What Wikipedia does not describe is a surprising fact: in modern Arabic a number is still written from left to right, regardless of which digits are used. I did not find any detailed rules about which digits to use where in a user interface. The patterns we followed were derived empirically, by looking at Windows 8 with the Arabic interface language. There turned out to be only a few of them:
• If a number is part of English text, it stays “English”. For example, F1 or A4.
• If digits are typed into a text entry control, they are “Arabic”. Yes, in the IP address control too.
• Everywhere else, “Arabic” digits are used almost always.

There is good news, by the way. Judging by our experience, nothing special needs to be done to comply with these rules. Here is what MSDN says about it: "... we most often leave the numbers in ANSI, leaving the operating system to print the correct numbers depending on the system settings."
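To make the relationship between the two digit sets concrete, here is a minimal Python sketch. The helper names `to_arabic_indic` and `to_ascii_digits` are hypothetical and are not part of the product or of any Windows API; the sketch only shows that the digits map one to one and that Unicode assigns them different bidi classes.

```python
import unicodedata

# The ten digits quoted above, in order 0..9.
ARABIC_INDIC = "٠١٢٣٤٥٦٧٨٩"
_TO_ARABIC_INDIC = str.maketrans("0123456789", ARABIC_INDIC)


def to_arabic_indic(text):
    """Replace ASCII digits with their Arabic-Indic counterparts."""
    return text.translate(_TO_ARABIC_INDIC)


def to_ascii_digits(text):
    """Replace Arabic-Indic digits with ASCII digits, leave the rest alone."""
    return "".join(
        str(unicodedata.digit(ch)) if ch in ARABIC_INDIC else ch
        for ch in text
    )


if __name__ == "__main__":
    print(to_arabic_indic("210"))
    print(to_ascii_digits("٢٩٧"))
    # Bidi classes: "EN" (European number) versus "AN" (Arabic number).
    print(unicodedata.bidirectional("4"), unicodedata.bidirectional("٤"))
```

In a real application this substitution is normally left to the operating system, as the MSDN quote says; the sketch is useful mainly for tests or for text that is produced outside the standard controls.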

About keyboard accelerators


These are the underlined letters you can see in dialogs and menus (if they are not visible, press Alt). In Microsoft terminology they are called Access Keys. Just in case, here is how they work from the user's point of view, using Notepad as an example:



If you open the menu as in the picture and press 'X', the application closes; if you press 'P', the print dialog opens, and so on. By the way, to open this menu you press “Alt, F” (the keys do not have to be pressed simultaneously). Support for this feature is built into WinAPI: all a developer has to do is mark, in the corresponding menu or dialog resource, which letter is the accelerator for a given item. This is done with the '&' character placed in front of the desired letter. That is, the Notepad menu resource simply contains “E&xit”.
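The '&' convention described above can be sketched as a small parser. This is an illustration in Python, not the WinAPI implementation; `parse_label` is a hypothetical helper. It assumes the usual resource convention that a doubled "&&" stands for a literal ampersand, and, as a further assumption, it keeps the first marked letter if a label contains several markers.

```python
def parse_label(label):
    """Split a resource label into (display text, access key or None)."""
    text = []
    key = None
    i = 0
    while i < len(label):
        ch = label[i]
        if ch == "&" and i + 1 < len(label):
            following = label[i + 1]
            # "&&" is an escaped ampersand, anything else marks the key.
            if following != "&" and key is None:
                key = following
            text.append(following)
            i += 2
        else:
            text.append(ch)
            i += 1
    return "".join(text), key


if __name__ == "__main__":
    print(parse_label("E&xit"))          # ('Exit', 'x')
    print(parse_label("Save && E&xit"))  # ('Save & Exit', 'x')
```

The same parser works unchanged for an Arabic label, because the marker always precedes the key in logical (storage) order, whatever the visual order on screen.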

And what if the application language is Russian? Exactly the same. Difficulties begin when the keyboard layout for the chosen interface language cannot enter a character with a single keystroke, for example when the application is localized into Japanese (or Chinese). But even then the problem has a solution:



The solution, as the screenshot shows, is not the prettiest, but it works: a Latin letter in brackets is appended to the Japanese name of the menu item, and that letter becomes the accelerator.

So, back to Arabic. Fortunately, almost all Arabic characters are entered with a single keystroke. The exceptions, that is, the characters that should not be made accelerators in an Arabic localization, are:

• ligatures (written Arabic is very fond of ligatures but, fortunately, only lam-alif is mandatory; the rest almost never occur in text typed on a keyboard),
• characters that are entered with Shift held down,
• characters that combine in writing into a shape where it is hard to tell which character is underlined (for example, alif with hamza: أ),
• Latin characters, wherever this can be avoided (so if part of a menu item name is an English word, do not put the accelerator on it).

More details can be found in the Arabic Style Guide, under Software Consideration, Keys.

By the way, for some reason the technicians decided that, as far as accelerators are concerned, Arabic is similar to Japanese: إنهاء (X). This mistake could have gone unnoticed for a long time, had it not been for the more obvious problem with brackets described in the next section.
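The rules in the list above can be approximated with a heuristic check. This is only a sketch: `is_suitable_access_key` is a hypothetical helper, the set of Shift-entered characters depends on the keyboard layout and must be supplied by the caller, and using the Unicode decomposition as a proxy for "ligature or combined shape" is an assumption of this example, not a rule from the style guide.

```python
import unicodedata


def is_suitable_access_key(ch, shifted_chars=frozenset()):
    """Heuristically decide whether ch may serve as an Arabic access key."""
    # Rule: characters typed with Shift are excluded (layout-specific data).
    if ch in shifted_chars:
        return False
    # Rules: ligatures such as lam-alif and composed letters such as
    # alif with hamza carry a Unicode decomposition; plain letters do not.
    if unicodedata.decomposition(ch):
        return False
    # Rule: only Arabic letters (bidi class "AL"); this rejects Latin letters.
    return unicodedata.bidirectional(ch) == "AL"


if __name__ == "__main__":
    for candidate in ("ن", "أ", "\ufefb", "X"):
        print(repr(candidate), is_suitable_access_key(candidate))
```

A check like this could run over translated resources to flag labels such as إنهاء (X) for manual review; the final choice of letter still belongs to a person who reads Arabic.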

Language mixing


Sometimes (and in software localization, quite often) English words have to be embedded in Arabic text: company names (ABBYY does have a Chinese rendering of its name, but the Arabic product is still called ABBYY FineReader Sprint), technology names (for example, TCP/IP) and much more. The main Arabic text is then read, as it should be, from right to left, while the English words inside it are read from left to right. For example, “ABBYY Home Page” is written as “زيارة موقع ABBYY على الويب”. By the way, try copying this phrase into a text editor and walking through it with the cursor to appreciate the logic of the reading order in mixed text.

So far there is no problem: mixed text in itself causes no trouble. But as soon as punctuation marks or digits appear next to English words inside Arabic text... For example, our application (since it supports scanning) has a string describing a popular paper size: “A4 (210 x 297 mm)”. The name of the size is not translated, but the unit of measurement is. As a result, in Arabic this string should look like this (the screenshot was taken on an English system, so ignore the "non-Arabic" digits):



To our surprise, when the resources came back from translation, the text in the application looked quite different:



And, to add to the fun, in Windows 8 (the previous screenshot was taken in Windows 7) the same strings looked like this:



Worse, in Windows 7 even text without a single Arabic character was drawn incorrectly (this is the scanner selection dialog):



So who is to blame, and what is to be done? Mixed text is laid out by the bidi algorithm. The algorithm has already been described on Habr, although with an emphasis on its use in HTML. I will briefly retell Wikipedia and then move on to how it works in Windows.

The algorithm divides characters into strong (those whose direction is known for certain: letters), weak (those whose reading direction can change: punctuation marks) and neutral (those to which the notion of direction does not apply: whitespace). Characters are stored in memory in reading order: first letter, second letter, and so on. The algorithm then outputs the letters in visual order: LTR words from left to right, RTL words from right to left. Several consecutive words of the same direction are laid out in the order of that direction. The relative position of runs with different directions is determined by the overall direction of the text. The following picture shows this clearly: the order of words is indicated by numbers, the reading order (that is, the order of letters within words) by arrows, and the overall direction of the text is shown at the top.



What happens to weak characters? Their direction is determined by the strong characters around them: if the nearest strong characters on both sides (weak and neutral characters are simply skipped) are RTL letters, the weak character becomes RTL, and vice versa. If a weak character sits on the boundary between runs of different directions, or at the beginning or end of a line, it takes the overall direction of the text. Keep in mind that asymmetric punctuation marks are mirrored in Arabic; for brackets this means that an opening bracket simply “turns into” a closing one, and vice versa.
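The rules from the last two paragraphs can be tried out on a toy model. The Python sketch below is a deliberate simplification and not the real Unicode Bidirectional Algorithm: following the convention of the Unicode bidi examples, lowercase Latin letters stand for LTR letters and UPPERCASE ones for RTL letters; every other character (including digits, which get no special treatment here) is resolved from its strong neighbours exactly as described above. All function names are hypothetical.

```python
LRM, RLM = "\u200e", "\u200f"   # invisible strong LTR / RTL marks
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def strong_direction(ch):
    """Return "L" or "R" for strong characters, None for everything else."""
    if ch == LRM:
        return "L"
    if ch == RLM:
        return "R"
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return None


def resolve_directions(text, base):
    """Give every character a direction using its nearest strong neighbours."""
    strong = [strong_direction(ch) for ch in text]
    resolved = []
    for i, direction in enumerate(strong):
        if direction is None:
            before = next((s for s in reversed(strong[:i]) if s), base)
            after = next((s for s in strong[i + 1:] if s), base)
            # Same direction on both sides wins, otherwise the base direction.
            direction = before if before == after else base
        resolved.append(direction)
    return resolved


def reverse_runs(cells, direction):
    """Reverse every maximal run of cells that has the given direction."""
    out, i = [], 0
    while i < len(cells):
        j = i
        while j < len(cells) and cells[j][1] == cells[i][1]:
            j += 1
        run = cells[i:j]
        out.extend(reversed(run) if cells[i][1] == direction else run)
        i = j
    return out


def visual_order(text, base="L"):
    """Return the characters as they would appear on screen, left to right."""
    directions = resolve_directions(text, base)
    # Characters resolved as RTL are mirrored: "(" is drawn as ")".
    cells = [(MIRROR.get(ch, ch) if d == "R" else ch, d)
             for ch, d in zip(text, directions)]
    if base == "L":
        cells = reverse_runs(cells, "R")
    else:
        cells = reverse_runs(cells[::-1], "L")
    return "".join(ch for ch, _ in cells if ch not in (LRM, RLM))


if __name__ == "__main__":
    print(visual_order("hello WORLD again", base="L"))  # hello DLROW again
    print(visual_order("hello WORLD again", base="R"))  # again DLROW hello
    print(visual_order("abc (def)", base="R"))          # (abc (def
    print(visual_order("abc (def)" + LRM, base="R"))    # abc (def)
```

The third call shows a purely LTR string in an RTL context: the trailing bracket has a strong LTR letter on one side and the end of the line on the other, so it takes the base direction, moves to the left edge and is mirrored. The fourth call shows that an invisible strong LTR character after the bracket keeps it in place.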

And once more about numbers. The description of the algorithm on the wiki says that digits are weak characters. That is not quite so. As already mentioned, regardless of whether Arabic or Indo-Arabic digits are used, numbers are written from left to right, whereas the direction of weak characters is supposed to depend on their surroundings. Moreover, digits are not taken into account when the direction of other weak characters is computed, so in effect they form a separate category: directional weak characters. Experiments also showed that if a “boundary” number follows LTR text (in memory storage order), it is output as part of the LTR text even in an RTL context.

It was established experimentally that Windows XP/Vista/7/7.1 use one version of the algorithm and Windows 8/8.1 another. Starting with Windows 8 the algorithm has apparently been improved: if a line contains a pair of weak characters (for example, opening and closing brackets), one of which is on a boundary and the other between strong characters of the same direction, then both characters of the pair are drawn in the direction determined by the strong characters.

What should you do if the position of a character, as determined by these rules, does not match what you expect? The Unicode standard offers several remedies. In my opinion the simplest to understand, and quite sufficient for any situation, is to use the characters “Left-to-right mark” (LRM, U+200E) and “Right-to-left mark” (RLM, U+200F). These are invisible strong characters with the corresponding direction. They simply need to be inserted where the direction of a boundary weak character is resolved incorrectly. MS Visual Studio, like Notepad, displays these characters if you enable the corresponding setting in the context menu. The same menu has commands for inserting LRM and RLM:



An alternative way to enter these characters (besides the obvious one of copying the character from somewhere else :-) is input via an Alt code. To enable it, add the string value HKEY_CURRENT_USER\Control Panel\Input Method\EnableHexNumpad with the value 1 to the registry and restart the computer. After that, LRM and RLM can be entered with the sequences “Alt+Num +, 200e” and “Alt+Num +, 200f” respectively. In detail: press Alt and + on the numeric keypad together, release, then type the hexadecimal code of the character, using the numeric keypad for the digits.
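When strings are generated or post-processed by a script rather than typed by hand, the marks can be written as escapes. The Python sketch below confirms, using the standard `unicodedata` module, that both marks are strong characters, and adds a hypothetical `show_marks` helper that makes them visible for debugging, similar in spirit to the editor setting mentioned above.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def show_marks(text):
    """Make the invisible direction marks visible for debugging output."""
    return text.replace(LRM, "<LRM>").replace(RLM, "<RLM>")


if __name__ == "__main__":
    for mark in (LRM, RLM):
        # Bidi class "L" or "R": the marks are strong, although invisible.
        print(unicodedata.name(mark), unicodedata.bidirectional(mark))
    print(show_marks("abc (def)" + LRM))  # abc (def)<LRM>
```

Keeping the marks as named constants also makes them easy to find in code review, which is hard to do with an invisible character pasted into a literal.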

With this in mind, it becomes clear what was wrong with the strings shown above and how to fix them:



In this example the outermost bracket was wrongly resolved as RTL and therefore “moved” to the left edge and flipped. Adding an LRM after the bracket is enough for everything to fall into place.



In the three problems with paper sizes (the Tabloid 11x17 inches size is shown below), two factors combine. First, a lowercase Latin 'x' is used instead of the multiplication sign '×' (U+00D7). This never caused problems before, but it did here. At least for the Win8 and Tabloid cases, replacing the 'x' with a multiplication sign turned out to be enough. The strange way the 'x' “drifts away” from the surrounding digits is also explained by the fact that the width (210) was resolved as RTL (because of the bracket in the Win8 case and because of the paper size name in the Tabloid case), while the height in both cases was treated as a single LTR run together with the 'x'.

Second, to fix the problem on Windows 7 (and earlier versions), an RLM also has to be added before the opening bracket.
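The first factor is easy to check against the Unicode character database: the Latin letter 'x' is a strong LTR character, while the multiplication sign is not a letter and has no direction of its own. The Python sketch below shows the two bidi classes and a hypothetical helper, `use_multiplication_sign`, that replaces an 'x' standing between two digits. The regular expression is an assumption of this example (it would also touch strings such as hexadecimal literals), so it is meant for reviewing resource strings, not for blind replacement.

```python
import re
import unicodedata

MULTIPLICATION_SIGN = "\u00d7"  # the character '×'

# An 'x' or 'X' with a digit on each side, optional spaces in between.
_DIMENSION_X = re.compile(r"(?<=\d)(\s*)[xX](\s*)(?=\d)")


def use_multiplication_sign(text):
    """Replace a Latin x between digits with U+00D7, keeping the spacing."""
    return _DIMENSION_X.sub(rf"\1{MULTIPLICATION_SIGN}\2", text)


if __name__ == "__main__":
    print(unicodedata.bidirectional("x"))                  # L
    print(unicodedata.bidirectional(MULTIPLICATION_SIGN))  # ON
    print(use_multiplication_sign("A4 (210 x 297 mm)"))
    print(use_multiplication_sign("Tabloid 11x17 inches"))
```

Words that merely contain an 'x', such as "Fax", are left untouched because the pattern requires a digit on both sides.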

The result was this:



Rich edit


The last problem we ran into was that the resource editor uses Rich Edit control version 2.1 by default, and this version of the control supports Arabic poorly. As a result, the Arabic EULA (the only place where the product uses Rich Edit) looked rather crooked. Manually replacing “RichEdit20W” with “RichEdit50W” fixed the situation, but only on Windows 8. On earlier versions, some of the minor glitches went away, but not all of them. Moreover, WordPad on those systems showed an equally sad picture when the EULA file was opened in it (the file had been created in some older version of Word). The only solution we came up with was to edit the file in WordPad on Windows XP until it looked decent.

Source: https://habr.com/ru/post/256159/

