
From right to left: what dir="rtl" is and how to tame Arabic


Hi, Habr. We recently translated 2GIS Online into Arabic and want to share our experience of adapting the interface for RTL (right-to-left). The same applies to Hebrew and Persian.


I will split this experience into two articles: a theoretical one and a practical one. Today is mostly theory. I will explain why we needed to flip the entire interface, what making an Arabic version means for an interface developer, and how to deal with Arabic mixed with English. Particular attention goes to the algorithm that determines how mixed-direction text is displayed: the Unicode bidirectional algorithm.


Why do this at all?


It may seem that adapting the interface for right-to-left is worth about as much as adapting it to any other popular language, but that is not quite so.


The difference between the English and Russian versions is small: most often it is just translated text. Apart from rare details, the user experience is the same. The difference between the Arabic and English versions is huge.


Only 0.6% of the world's Internet resources contain Arabic content. Yet more than 5% of Internet users speak Arabic, and this share is growing rapidly. Their familiar reading direction is right to left. How does the modern web feel to them? Exactly the way an RTL interface feels to a Russian speaker. Pick your own metaphor: maybe it is like getting behind the wheel of a right-hand-drive car when you always drive a left-hand-drive one. Or like opening 2GIS one day and seeing that the cards and the search are on the right:



If we want our service to be as convenient as possible for all users, it is necessary to adapt it for RTL.


What is the problem?


At first glance, the task seemed immense to me: you need to redo the entire interface to meet requirements that no one can properly explain.


After looking at a few Arabic sites, I understood that making an Arabic version means:


  1. Translate the data into Arabic. This part is clearer, but no easier: the amount of data is huge;
  2. Translate the interface into Arabic. This is not so easy for us either, because until now we had only translated from Russian, and we have no Russian-to-Arabic translators. We will first have to translate the strings and comments into English, and then from English into Arabic;
  3. Adapt the entire interface for right-to-left. It seems to be just "flip everything the other way". We need to figure out how this is done, and there are surely some ready-made solutions for it.

With the translations, everything seems clear. With flipping the interface, nothing is clear. Let us look at this in more detail.


First of all, I added the dir="rtl" attribute to the html tag:


<html dir="rtl"> 

Everything has changed, but not quite as I expected. I realized that I did not understand what was happening. By what principle are the elements lined up one after the other?


Base direction


Consider the same simple piece of markup in LTR and RTL. It is not very meaningful, but it is illustrative:


 <table>
   <tr>
     <td> Hello world </td>
     <td> <button>Hello</button> <button>world</button> </td>
   </tr>
 </table>


As the screenshot shows, the dir attribute (like the CSS direction property) sets the flow direction, the default text alignment, and the order of table cells, flex items and grid items.



The elements changed their order, but the characters within the words are still arranged as usual, because the order of characters in a string is determined by a different algorithm.


The order of characters within a string


In memory, the characters of a string are stored sequentially, but how that sequence is finally displayed on screen is decided by the Unicode bidirectional algorithm.


In short:


  1. For each character in the string, a direction is calculated;
  2. The string is split into blocks of the same direction;
  3. The blocks are arranged in the order given by the base direction.

The direction of each character depends on its type and on the direction of adjacent characters.
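To make the three steps more tangible, here is a toy model in Python. It is a deliberately simplified sketch and not the real algorithm: it treats every character that is not strongly typed as neutral (so digits are handled incorrectly), and it ignores explicit embeddings, isolates and mirroring. The function names are made up for this illustration.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch):
    """Return 'ltr' or 'rtl' for strongly typed characters, None otherwise."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "ltr"
    if cls in ("R", "AL"):
        return "rtl"
    return None  # toy simplification: everything else counts as neutral


def resolve_directions(text, base):
    """Step 1: give every character a direction."""
    dirs = [strong_direction(ch) for ch in text]
    resolved = list(dirs)
    i, n = 0, len(dirs)
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < n and dirs[j] is None:
            j += 1
        # A neutral stretch takes the direction of its neighbours if they agree,
        # otherwise the base direction. String edges count as the base direction.
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < n else base
        fill = before if before == after else base
        for k in range(i, j):
            resolved[k] = fill
        i = j
    return resolved


def visual_order(text, base="ltr"):
    """Steps 2 and 3: split into blocks and lay them out, left to right on screen."""
    resolved = resolve_directions(text, base)
    blocks, pos = [], 0
    for direction, group in groupby(resolved):
        length = len(list(group))
        chunk = text[pos:pos + length]
        pos += length
        # Characters inside an RTL block are displayed in reverse.
        blocks.append(chunk if direction == "ltr" else chunk[::-1])
    if base == "rtl":
        blocks.reverse()  # blocks themselves go right to left
    return "".join(blocks)


print(visual_order("Hello, world!", "ltr"))  # Hello, world!
print(visual_order("Hello, world!", "rtl"))  # !Hello, world
```

Even this crude model reproduces the typical surprise: with an RTL base direction the trailing exclamation mark jumps to the left edge.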


Three types of characters


1) Strong (strongly typed) characters - for example, letters. Their direction is predetermined: for most characters it is LTR, for Arabic and Hebrew it is RTL.


The words in the picture consist entirely of strongly typed characters:



2) Neutral characters - for example, punctuation or spaces. Their direction is not set explicitly; they take the direction of the adjacent strongly typed characters.


A comma between the left-to-right “o” and “w” in the “Hello, world” string takes their direction under both the LTR and the RTL base direction:



But what if a neutral character falls between two strongly typed characters of different directions? Such a character takes the base direction.


Here, “++” sits in one case between the same-direction “C” and “a”, and in the other between the opposite-direction “C” and the Arabic “و”, which leads to different results:



The same happens with neutral characters at the end of a line:



3) Weak (weakly typed) characters - for example, digits. They have their own direction but do not affect the surrounding characters.


An unbroken run of digits is laid out left to right, but two numbers in a row separated by a neutral character will follow each other right to left if the base direction is RTL:



An even more obvious case is a number whose digit groups are separated by a space:



At the same time, numbers may be separated by a period, a comma or a colon: these separators are also weakly typed (see the specification for details):
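You can check which of the three groups a character belongs to with Python's standard unicodedata module, which returns the bidirectional class assigned by Unicode. The grouping of classes into "strong", "weak" and "neutral" below follows the categories of the Unicode bidirectional algorithm; the helper name bidi_type is made up for this sketch.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_type(ch):
    """Map the Unicode bidirectional class of a character to one of three groups."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "other"  # explicit formatting characters or unassigned code points


for ch in ["a", "\u0648", "7", ",", ":", " ", "!"]:
    print(repr(ch), unicodedata.bidirectional(ch), bidi_type(ch))
```

Note that the comma and the colon are reported as weak separators (class CS). They act as part of a number only between digits; anywhere else, for example in “Hello, world”, the algorithm treats them like neutral characters.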



Directional blocks


Consecutive characters of the same direction are combined into blocks (directional runs). These blocks are laid out one after another in the order determined by the base direction:



Weakly typed digits, although they have a direction of their own, do not start a new block, which can lead to the following result - they continue the preceding directional block:



Mirrored characters


Some characters have different forms in different contexts - for example, an opening bracket in RTL looks like a closing bracket in LTR (which is logical, because the bracketed content follows it, that is, sits to its left).
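Whether a character is subject to mirroring is a property recorded in the Unicode database, so it can be looked up rather than guessed. A small sketch with Python's standard unicodedata module (the helper name is_mirrored is made up here):

```python
import unicodedata


def is_mirrored(ch):
    """True if Unicode marks the character as mirrored in right-to-left text."""
    return bool(unicodedata.mirrored(ch))


for ch in "()[]<>a!":
    print(repr(ch), is_mirrored(ch))
```

Brackets and angle signs are reported as mirrored, while letters and the exclamation mark are not.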


In most cases this causes no problems, but if the two brackets of a pair end up with different directions, they will visually face the same way. For example, if a bracket hangs at the end of the line:



Taking control of the order


As we saw above, text laid out by these rules often does not look the way we would like.


In that case, we can use tools that embed the desired direction into the existing context or override the direction of specific characters.


Isolation (isolate)


We have already met the way to set the base direction: the dir attribute does it. It is a global attribute, so it applies to any element.


dir creates a new embedding level and isolates the content from the outer context. The content inside is laid out according to the attribute value, while from the outside the container itself is treated as neutral.


Explicitly setting the dir attribute avoids almost all problems with mixed-direction text:


 أنا أحب <span dir="ltr">C++</span> و Java 


If the direction of the content is not known in advance, you can set the dir attribute to auto. The direction of the content will then be determined by “some heuristics”: it is simply taken from the first strongly typed character.


 <p dir="auto">{comment}</p> 

The <bdi> element and the CSS rule unicode-bidi: isolate work similarly:


 <span>Landmark: <bdi>{name}</bdi> — {distance}</span> 
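If such a template is filled in on the server, the isolation can be applied where the values are substituted. Below is a minimal sketch in Python; the function landmark_html and its parameters are hypothetical and only mirror the placeholders of the template above, with a plain hyphen as the separator.

```python
from html import escape


def landmark_html(name, distance):
    """Hypothetical helper: wrap a name of unknown direction in <bdi>."""
    # escape() keeps user-supplied text from being interpreted as markup;
    # <bdi> keeps its direction from affecting the surrounding text.
    return "<span>Landmark: <bdi>{}</bdi> - {}</span>".format(
        escape(name), escape(distance)
    )


print(landmark_html("\u062f\u0628\u064a \u0645\u0627\u0631\u064a\u0646\u0627 \u0645\u0648\u0644", "600 m"))
```

The point of the design is that the calling code does not need to know which language the name is in: the browser picks the direction inside <bdi> on its own.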


Embed


You can open a new embedding level without isolation: the unicode-bidi: embed rule, combined with the desired value of the direction property, sets both the direction inside the element and its direction as seen from outside. In practice this is almost never needed.


Override


<bdo dir="rtl"> or unicode-bidi: bidi-override; direction: rtl. This overrides the direction of every character inside the element. Use it extremely rarely (for example, when you need to swap two specific characters), and do not forget to isolate child elements.


 <bdo dir="rtl">Hello, world!</bdo> 


From the outside, such an element is treated as strongly typed. To make it behave like isolate on the outside but like bidi-override on the inside, use unicode-bidi: isolate-override.


Control characters (marks)


Inserting control characters is an unpleasant approach, but it is useful when we have no access to the markup but do have access to the content. For example, these can simply be invisible strongly typed characters, &lrm; and &rlm; (&#8206; / &#8207; or \u200e / \u200f). They help give a neutral character the desired direction.


For example, here, for the exclamation mark at the end of the string to take the LTR direction, it has to sit between two LTR characters:


 <span dir="rtl">Hello, world!&lrm;</span> 

Any of the logic described above can also be implemented with control characters: LRI / RLI for isolation, LRO / RLO for override, and so on. See the detailed guide to control characters.
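When only the content can be changed, the control characters can be added in code. Here is a small sketch in Python; the code points are the standard Unicode ones, while the helper names isolate and keep_trailing_ltr are made up for this example.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, an invisible strongly typed LTR character
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, an invisible strongly typed RTL character
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE, direction taken from the content
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes any of the three above

# The marks are invisible, but the algorithm sees them as ordinary strong characters.
assert unicodedata.bidirectional(LRM) == "L"
assert unicodedata.bidirectional(RLM) == "R"


def isolate(text, direction="auto"):
    """Wrap text in isolate control characters for the given direction."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI


def keep_trailing_ltr(text):
    """Append LRM so that trailing punctuation sits between two LTR characters."""
    return text + LRM


print(repr(isolate("C++", "ltr")))
print(repr(keep_trailing_ltr("Hello, world!")))
```

The isolate characters play the same role as dir or <bdi> in markup, so an opener left without its closing PDI is as much a bug as an unclosed tag.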


Browser Support


Unfortunately, IE does not support <bdi>, dir="auto" or the corresponding CSS rules. In addition, the specification of these rules is still at the Editor's Draft stage.


If you need an analog of dir="auto" that works in any browser, you can parse the content with a regular expression and set the dir attribute yourself. But it is better, of course, not to do so.
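The first-strong-character heuristic is easy to approximate. The sketch below uses Python's unicodedata lookup instead of a regular expression, and it is only an approximation of what browsers do for dir="auto" (for instance, it knows nothing about nested elements). The function name detect_dir and the default parameter are assumptions of this example.

```python
import unicodedata


def detect_dir(text, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strongly typed character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return default  # no strong characters at all, e.g. only digits or punctuation


print(detect_dir("Hello"))
print(detect_dir("123 \u0645\u0631\u062d\u0628\u0627"))
```

Digits and punctuation at the start of the string are skipped, so a comment that begins with a number still gets the direction of its first letter.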


HTML or CSS?


If possible, you should definitely control text direction through HTML - the dir attribute and the <bdi> element - and not through CSS rules. Text direction is not styling, it is part of the content. The page may be shown through some instant view or read in an RSS reader.


Before the conclusion: a little pain


We have covered the theory. But knowing the theory does not spare you from suffering.


The main problem I ran into in the very first minutes of developing for an RTL language is how foreign it is. We write code from left to right. My system, browser and editor work left to right, and all our internal products are left to right. So as soon as Arabic gets into this space, everything becomes bad and painful:


Text manipulations


If the characters on the screen are not in the order in which they are actually stored in the string, what happens when you try to edit bidirectional text? Or at least select and copy part of it?


Nothing good. Try it yourself:


Landmarks: دبي مارينا مول - 600 m, داماك العقارية - 1.2 km
azbycxdwevfugthsirjqkplom n


Code manipulation


The same goes for editing code in an editor and for code review: it is a pain.


You cannot even be sure of the order of elements in an array:



Or worse, the code does not look valid at all:



You can take it to the point of absurdity:



Take control again


In any situation there is only one source of truth: the logical order of characters in the string. Characters may look correctly positioned in the editor or in the mail client, yet be displayed incorrectly on the resulting page because the logical order is broken.


To see how the characters are actually ordered in a string and why they are displayed the way they are, use the tool on the Unicode website: http://unicode.org/cldr/utility/bidi.jsp
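When an online tool is not at hand, the logical order can also be inspected locally. Here is a short sketch in Python that lists the characters of a string in the order they are stored, with their code points and bidirectional classes; the function name dump_logical is made up for this example.

```python
import unicodedata


def dump_logical(text):
    """List characters in logical (stored) order with code point, bidi class and name."""
    rows = []
    for index, ch in enumerate(text):
        rows.append((
            index,
            "U+{:04X}".format(ord(ch)),
            unicodedata.bidirectional(ch),
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows


for row in dump_logical("C++ \u0648"):
    print(*row)
```

Such a dump does not depend on how the editor or terminal chooses to render the string, and invisible control characters show up in it as ordinary rows.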



Summary


We got acquainted with the rules that determine the order of elements and the order of characters in a string, and we learned how to influence them.


What you need to remember:


  1. dir="rtl" sets the flow direction, aligns text as with text-align: right, and changes the order of table cells, flex items and grid items. The Unicode bidirectional algorithm is responsible for the order of characters in a string;
  2. Each character has a direction type: strongly typed (letters), weakly typed (digits) or neutral (punctuation and spaces);
  3. Problems most often arise at the boundary between characters of different types.

In practice, this whole theory translates into one simple rule:


For mixed-direction text, explicitly isolate embedding levels using the dir attribute, and if the direction of the content is unknown, use <bdi> and dir="auto".

You also need to be prepared for the text in the editor not looking exactly the way it will in the browser. And it will surely hurt.


What's next?


In the next article I will talk about our practical experience: how to quickly build a prototype of the RTL version, how to choose solutions for production, what you can prepare for in advance, and what cannot be foreseen.



Source: https://habr.com/ru/post/358148/

