Emoji.prototype.length - a story about emotional characters in Unicode
Habr is rather hostile to Emoji (they are simply not displayed here), treating them as something like the “Padonkaff” slang: not for serious people. After all, both appeared at about the same time. And while the “Olbansky” language quickly sank into oblivion, Emoji evolved from simple semicolons and parentheses into full-fledged Unicode characters. The author of this article proposes to look at what these small entities are “under the hood” (hereinafter, the translator's notes are in italics).
Emoji are the basis of text communication today. Without these little symbols, many chat conversations would end in awkward silence or misunderstanding. I still remember the good old days when SMS was a cool thing.
A suggestion to chat without emoticons is likely to be met with "Are you kidding?". Everyone quickly realized that humor and sarcasm (and, by the way, it would not hurt us to be less sarcastic) are not easy to convey using written characters alone. At some point the first Emoji appeared, and they quickly became a fundamental part of any text conversation.
Even though I use Emoji every day, I never wondered how they work. Obviously, they are somehow related to Unicode, but I had no idea what was going on under the hood. And I, frankly, did not care.
Everything changed when I came across a tweet by Wes Bos in which he showed several JavaScript operations on a string containing the family Emoji.
The use of the spread operator on such a string did not surprise me much, but the fact that one visible character was split into three Emoji and two empty strings puzzled me somewhat. And the fact that the string's length property returned 8 surprised me even more, since the array returned by the spread operator held 5 values, not 8.
Without thinking twice, I opened the console and made sure that everything happens exactly as Wes described. So what is going on here? I decided to dig deeper into Unicode, JavaScript, and the family Emoji to figure it all out.
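The behaviour described above can be reproduced with a minimal sketch. The original tweet is not shown here, so the string below is an assumption: the family Emoji written with escape sequences (man, ZWJ, woman, ZWJ, boy) so that it survives on pages that do not display Emoji.

```javascript
// Family Emoji written as escapes: man, ZWJ, woman, ZWJ, boy (assumed input).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// Spreading a string iterates it by code points, not by visible characters.
const parts = [...family];

console.log(parts.length);  // 5
console.log(family.length); // 8
```

The two "empty" entries at indexes 1 and 3 are the invisible U+200D characters.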
Unicode to the rescue
To understand why JavaScript processes Emoji in this way, we need to look deeper into Unicode itself.
Unicode is an international standard for character encoding in the IT industry. It maps each letter, character or symbol to a numerical value. Thanks to Unicode, we can share documents that contain, for example, special German characters such as ß, ä, ö with people whose systems do not normally use them. Thanks to Unicode, text encoding works across different platforms and environments.
The Unicode code space contains 1,114,112 code points, and they are usually written as U+ followed by a number in hexadecimal notation. The range of Unicode code points begins at U+0000 and ends at U+10FFFF.
The entire code space (more than one million code points) is divided into 17 so-called “planes”, and each plane holds 65,536 code points. The most important is plane zero, the “Basic Multilingual Plane” (BMP). Its range is from U+0000 to U+FFFF.
The BMP contains the characters of almost all modern languages, plus a large number of other symbols. The remaining 16 planes are called supplementary and are used for various purposes, such as - you might have guessed - the definition of most Emoji.
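To make the plane arithmetic concrete, here is a small sketch. The helper name planeOf is hypothetical and introduced only for illustration; it relies on nothing more than the fact that each plane spans 0x10000 code points.

```javascript
// Hypothetical helper: each plane spans 0x10000 (65,536) code points,
// so the plane number is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

console.log(17 * 0x10000);      // 1114112 code points in total
console.log(planeOf(0x00DF));   // 0  (ß lives in the BMP)
console.log(planeOf(0x1F468));  // 1  (the man Emoji lives in plane 1)
console.log(planeOf(0x10FFFF)); // 16 (the last code point, in the last plane)
```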
How are Emoji defined?
As we know, each Emoji is defined by at least one character from the Unicode set. If you look at all the Emoji presented in the Full Emoji List, you will notice that there are a lot of them. And by "a lot" I mean really a lot. You may ask yourself how many different Emoji are defined in Unicode today. The answer to this question, as often happens in IT, is “It depends ...”, and we have to deal with that before we get the answer.
As I wrote above, each Emoji is defined by at least one character. This means that some Emoji are defined by a combination of several other Emoji and characters. These combinations are called sequences. Thanks to sequences, you can take a neutral Emoji (usually displayed with yellow skin) and make it more personal.
Sequence modifier for different skin colors
I still remember the moment when I noticed that I could change the thumbs-up icon in a chat so that it matched my skin color. It gave me a sense of belonging, and I felt that this thumb was closer to me than the one in all my previous messages.
In Unicode, there are five modifiers for changing a neutral Emoji to represent the variety of human skin colors. The modifiers range from U+1F3FB to U+1F3FF and are based on the Fitzpatrick scale.
With the help of these modifiers, we can turn a neutral Emoji into the same, but with a different skin color. Let's look at an example:
When we take the girl Emoji, whose code point is U+1F467, and append a skin color modifier (U+1F3FD) to it, we automatically get a girl with that skin color on systems that support this sequence.
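The modifier sequence just described can be sketched as follows. The variable names are illustrative; only the two code points come from the text.

```javascript
// Girl (U+1F467) followed directly by a skin color modifier (U+1F3FD).
const girl = String.fromCodePoint(0x1F467);
const skinTone = String.fromCodePoint(0x1F3FD);
const girlWithSkinTone = girl + skinTone;

// One visible Emoji on supporting systems, but two code points underneath.
console.log([...girlWithSkinTone].length); // 2
console.log(girlWithSkinTone.length);      // 4
```

On systems without support for the sequence, the two code points are typically shown side by side as a neutral girl and a colored swatch.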
ZWJ sequences for even more variety
Skin color is not the only thing that distinguishes people from each other. When we recall the example of a family, it becomes clear that not every family consists of a man, a woman, and a boy.
In Unicode, there is a single character for an ordinary family (U+1F46A), but not every family looks like that. We can build any family using a so-called Zero Width Joiner (ZWJ) sequence.
This is how it works: there is a special character called the zero width joiner (U+200D). This character works like glue, indicating that the characters on either side of it should be displayed as one when possible.
Logically, what could we put together to show a family? The answer is simple - two adults and a child. Using ZWJ sequences, we can easily represent different families.
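The gluing described above can be sketched in a few lines. The helper joinWithZwj is hypothetical; whether a given combination is rendered as a single glyph depends on the platform's fonts.

```javascript
const ZWJ = '\u200D'; // zero width joiner

const man = '\u{1F468}';
const woman = '\u{1F469}';
const girl = '\u{1F467}';
const boy = '\u{1F466}';

// Hypothetical helper: glue any number of Emoji together with ZWJ.
function joinWithZwj(...emoji) {
  return emoji.join(ZWJ);
}

const family = joinWithZwj(man, woman, boy);       // man, woman, boy
const fatherTwoGirls = joinWithZwj(man, girl, girl); // one father, two girls

console.log([...family].length);         // 5 code points: 3 Emoji + 2 ZWJ
console.log([...fatherTwoGirls].length); // 5
```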
If you look at the list of all possible sequences, you can see that there are even more options, for example, one father with two girls. Unfortunately, at the time of this writing, support for these sequences is not very good, but ZWJ sequences degrade gracefully: where a sequence is not supported, the individual Emoji are displayed one after another. This preserves the meaning.
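As a rough illustration of that fallback (not how a text renderer is actually implemented), splitting a ZWJ sequence on U+200D yields the individual Emoji a reader would see on a system without support for the sequence:

```javascript
// One father with two girls: man, ZWJ, girl, ZWJ, girl.
const family = '\u{1F468}\u200D\u{1F467}\u200D\u{1F467}';

// Splitting on the joiner leaves the standalone Emoji.
const fallback = family.split('\u200D');

console.log(fallback.length); // 3
console.log(fallback.map((e) => e.codePointAt(0).toString(16)));
// [ '1f468', '1f467', '1f467' ]
```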
Another cool thing is that the joining principle applies not only to the family Emoji. For example, let's take the famous David Bowie Emoji (its real name is “singer”). This is also a ZWJ sequence, consisting of a man (U+1F468), a ZWJ and a microphone (U+1F3A4).
And, as you might have guessed, if we replace the man (U+1F468) with a woman (U+1F469), we get a female singer (or the female version of David Bowie). You can also add a skin color modifier, and then we get, for example, a dark-skinned singer. Great!
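The singer variations can be sketched like this. The choice of U+1F3FF as the skin color modifier is an assumption for the example; any of the five modifiers is placed the same way, directly after the person and before the ZWJ.

```javascript
const ZWJ = '\u200D';
const microphone = '\u{1F3A4}';

const manSinger = '\u{1F468}' + ZWJ + microphone;
const womanSinger = '\u{1F469}' + ZWJ + microphone;

// Assumed modifier U+1F3FF, placed right after the person.
const womanSingerWithSkinTone = '\u{1F469}' + '\u{1F3FF}' + ZWJ + microphone;

console.log(manSinger.length);               // 5
console.log(womanSinger.length);             // 5
console.log(womanSingerWithSkinTone.length); // 7
```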
Unfortunately, at the time of writing, support for these new characters also leaves much to be desired.
Different counts of Emoji
So, the answer to the question of how many Emoji exist today depends on what you count as an Emoji. Is it the number of characters that are used to display Emoji? Or do we count all the variants of Emoji that can be displayed?
If we count all the variants of Emoji that can be displayed (including sequences and variations), we get 2,198. If you are interested in the counting process, there is a whole section about it on unicode.org.
On top of the question of how to count, new Emoji and Unicode characters are added to the specification all the time, which makes tracking their exact number even more difficult.
Returning to strings in JavaScript and 16-bit encoding
In UTF-16, the string format used in JavaScript, a single 16-bit code unit (2 bytes) is used to represent most characters. This means that just over 65,000 different values can fit into one JavaScript character, which is exactly the size of the Basic Multilingual Plane (BMP). So let's try a few characters defined in the BMP.
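A quick check along these lines might look as follows. The sample characters are my own choice, picked only because all of them lie in the BMP.

```javascript
// A few characters from the BMP (all below U+10000); sample set is arbitrary.
const samples = ['a', '\u00DF', '\u00E4', '\u2665']; // a, ß, ä, black heart suit

// Every BMP character fits into a single 16-bit code unit.
const lengths = samples.map((s) => s.length);

console.log(lengths); // [ 1, 1, 1, 1 ]
```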
When we read the length property of these strings, we get one, which fully meets our expectations. But what happens if I want to use a character in JavaScript that lies outside the BMP range?
Surrogate pairs to the rescue
Two code units from the BMP can be combined to represent another character that lies outside of it. This combination is called a surrogate pair.
The code points in the range from U+D800 to U+DBFF are reserved for the so-called high, or “leading”, surrogates, and the code points in the range from U+DC00 to U+DFFF for the low, or “trailing”, surrogates.
These two must always be used as a pair, starting with the high surrogate and ending with the low one. A special formula is then applied to decode the character that lies outside the BMP.
Let's look at an example:
The ordinary man Emoji is represented by the code point U+1F468. This character cannot be represented by a single 16-bit JavaScript character. Therefore, to represent this one character outside the BMP (U+1F468), a surrogate pair consisting of two code units from the BMP (U+D83D and U+DC68) is used.
There are two methods for inspecting characters in JavaScript. The first is charCodeAt, which returns the code of an individual surrogate when surrogates are used to form a single character. The second is codePointAt, which returns the code point of the whole surrogate pair if the index “hits” the leading surrogate, or just the code of the trailing surrogate if the index “hits” that one.
Does this seem terribly confusing? I think so too, and I highly recommend carefully reading the MDN articles on these two methods (charCodeAt, codePointAt) (you can also read about this at learn.javascript.ru).
Let's take a closer look at the man Emoji and do the calculation. Using charCodeAt, we can get the codes of the surrogates used in the surrogate pair.
The first value is 55357, which corresponds to D83D in hexadecimal notation. This is the leading surrogate. The second value, 56424, corresponds to DC68 and is the trailing surrogate. This is a classic surrogate pair, which after applying the formula gives 128104, the code point of the man Emoji (U+1F468).
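The split of U+1F468 into U+D83D and U+DC68 follows the standard UTF-16 encoding rule, sketched below. The helper name toSurrogatePair is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: split a code point above U+FFFF into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;      // a 20-bit value
  const lead = 0xD800 + (offset >> 10);    // top 10 bits go into the leading surrogate
  const trail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go into the trailing surrogate
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1F468);

console.log(lead.toString(16), trail.toString(16));            // d83d dc68
console.log(String.fromCharCode(lead, trail) === '\u{1F468}'); // true
```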
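The calculation above can be reproduced with a short sketch. The decoding line is the standard UTF-16 formula written out by hand; codePointAt does the same work for us.

```javascript
const man = '\u{1F468}';

// charCodeAt returns the individual surrogates.
const lead = man.charCodeAt(0);  // 55357 (0xD83D)
const trail = man.charCodeAt(1); // 56424 (0xDC68)

// Decode the pair by hand: 10 bits from each surrogate plus the 0x10000 offset.
const decoded = (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;

console.log(decoded);            // 128104 (0x1F468)
console.log(man.codePointAt(0)); // 128104, index hits the leading surrogate
console.log(man.codePointAt(1)); // 56424, index hits the trailing surrogate
```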
Having dealt with Unicode code points and surrogates, we can return to the strange behavior of the length property. It turns out that it returns the number of UTF-16 code units, not the number of characters we see, as we assumed at the beginning. This can lead to hard-to-catch bugs when working with Unicode in JavaScript strings - so be careful when you are dealing with characters outside the BMP.
Conclusion
Let's go back to Wes's example that started it all.
The family Emoji we see here consists of a man, a woman, and a boy. The spread operator splits the string into its individual code points. The empty strings are not really empty - they are ZWJ characters. The length property, in this case, counts 2 for each Emoji and 1 for each ZWJ. As a result, we get 8.
I really enjoyed my immersion in Unicode. If you are also interested in this topic, I would recommend the @fakeunicode Twitter account. It posts many interesting things about what Unicode is capable of. By the way, did you know that there are even podcasts and conferences about Emoji? I will keep following all this, because I am very interested in learning more about these little symbols that we use everywhere. Perhaps this topic will interest you too.
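The breakdown can be checked with a short sketch. The family string is again written with escapes, which is an assumption about the exact input in the original example.

```javascript
// Man, ZWJ, woman, ZWJ, boy (assumed to match the example in the tweet).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// Number of UTF-16 code units contributed by each code point.
const unitsPerCodePoint = [...family].map((part) => part.length);
const total = unitsPerCodePoint.reduce((sum, n) => sum + n, 0);

console.log(unitsPerCodePoint); // [ 2, 1, 2, 1, 2 ]
console.log(total);             // 8, the same as family.length
```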