We finally did it! For a long time the shameful legacy of CP1251 nagged at our developers and begged the question: how can this still be? The Unicode era arrived long ago, yet we were still using a single-byte encoding and propping up compatibility with external systems with workarounds in various places. But there was a quite rational reason for it: converting a project as large as My World has grown into to Unicode is very labor-intensive. We estimated the job at about half a year and were not prepared to spend that many resources on a feature that brings no tangible benefit to a Russian-speaking audience.
But history makes its own adjustments, often quite unexpected ones. It is no secret that My World is very popular in Kazakhstan, where it is among the most popular social networks. And we had long wanted our Kazakh users to be able to use the letters of the Kazakh alphabet from the extended Cyrillic set, which, unfortunately, had no place in CP1251. The extra incentive that finally let us justify such a lengthy development effort was the continued growth of the project's popularity outside our country. We realized it was time to take a step towards foreign users.
Naturally, the first thing required to internationalize the project was to start receiving, transmitting, processing, and storing data in UTF-8. For a large project this is neither easy nor quick, and along the way we had to solve quite a few interesting problems, which we will try to describe.
Database recoding
The first choice we faced was fairly standard: which end to start from, page rendering or data storage. We decided to start with the storages, since this is the longest and most labor-intensive part, requiring coordinated work by developers and administrators.
The situation was complicated by the fact that our social network, for performance reasons, uses a wide variety of specialized storages. Not all of them, of course, contained text fields subject to internationalization, but quite a few did. So the first thing we had to do was take a complete inventory of all our storages to find out which of them contained text strings. We must admit we learned a lot of new things in the process.
MySQL
We began converting storage to UTF-8 with MySQL, because changing the encoding of this database is, by and large, supported natively. In practice, though, it was not that simple.
First, the database had to be converted without any downtime.
Second, it turned out that running alter table `my_table` convert to character set utf8; on every table would be irrational and, in places, impossible. Irrational, because an index over a UTF-8 field always takes 3 * length_in_characters bytes, even if the field contains only ASCII characters, and we had a lot of such fields, including indexed ones, especially fields holding hex strings. Impossible, because the maximum index key length in MySQL is 767 bytes, so some indices (especially multi-column ones) no longer fit: for example, a composite index over two varchar(150) columns needs 2 * 150 * 3 = 900 bytes in utf8, well over the limit. On top of that, it turned out that some text fields mistakenly held binary data and vice versa, so every field had to be checked carefully.
Once we had collected information about the tables in our databases, it became clear that most of them were not used at all; in the end we deleted roughly half of all the tables. To find the unused tables we used the following technique: over 24 hours we captured all queries to our databases with tcpdump, intersected the list of tables seen in that dump with the current database schema, and checked the remaining candidates against the code (cleaning up the code along the way). We used tcpdump because, unlike turning on MySQL's general query log, it does not require restarting the database and does not slow down query processing. Of course, deleting tables outright felt risky, so at first we only renamed them with a special suffix, waited a few weeks, and only then deleted them (the caution paid off: a couple of rarely used tables were caught by mistake and had to be brought back).
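To give a feel for it, here is a rough sketch of that table-usage analysis in Perl, assuming the tcpdump capture has already been reduced to plain SQL text with one query per line (the file name, credentials, and regex are illustrative, not our exact tooling):

```perl
#!/usr/bin/perl
# Rough sketch: find tables present in the schema but never seen in captured queries.
use strict;
use warnings;
use DBI;

my %seen;    # tables referenced by live traffic
open my $queries, '<', 'queries-24h.sql' or die $!;
while (my $sql = <$queries>) {
    # crude extraction of table names after FROM / JOIN / INTO / UPDATE
    while ($sql =~ /\b(?:from|join|into|update)\s+`?(\w+)`?/gi) {
        $seen{lc $1} = 1;
    }
}

my $dbh = DBI->connect('DBI:mysql:database=my_database', 'user', 'password',
                       { RaiseError => 1 });
for my $table (@{ $dbh->selectcol_arrayref('SHOW TABLES') }) {
    print "candidate for removal: $table\n" unless $seen{lc $table};
}
```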
Next we got down to actually writing the DDL for the conversion. A few standard patterns were used:
- if the table had no text fields, then (just in case we add some later) we simply ran: alter table `my_table` default character set utf8;
- if the table contained only varchar text fields that needed internationalization: alter table `my_table` convert to character set utf8;
- fields containing only ASCII characters were converted to ASCII: alter table `my_table` modify `my_column` varchar(n) character set ascii ...;
- fields that needed internationalization were converted the standard way: alter table `my_table` modify `my_column` varchar(n) character set utf8 ...;
- but for some fields with a unique index, because utf8_general_ci (unlike cp1251_general_ci) treats the letters "е" and "ё" as equal, we had to resort to a workaround: alter table `my_table` modify `my_column` varchar(n) character set utf8 collate utf8_bin ...;
- for indexed fields that no longer fit into the index after conversion, another workaround was needed: alter table `my_table` drop index `my_index`, modify `my_column` varchar(n) character set utf8 ..., add index `my_index` (`my_column`(m)); (where m < n, and the index is usually over several columns);
- text fields that actually held binary data were converted to binary and varbinary;
- binary fields that held text strings in CP1251 were converted in two steps: alter table `my_table` modify `my_column` varchar(n) character set cp1251; alter table `my_table` modify `my_column` varchar(n) character set utf8; The first query tells MySQL that the data is in cp1251, and the second converts it to utf8.
- text blobs had to be handled separately: when you convert to character set utf8, MySQL widens the column to the smallest type able to hold a maximum-length text consisting entirely of three-byte characters, so text automatically becomes mediumtext. In a number of cases that was not what we wanted, so we specified the type explicitly: alter table `my_table` modify `my_column` text character set utf8;
- and, of course, the final chord, for the future: alter database `my_database` default character set utf8;
The problem of converting a database to UTF-8 without downtime was solved in the usual way: through a replica, though not without its quirks. First, for strings to be converted automatically as the replica catches up with the master, replication has to run in statement mode; in row mode nothing gets converted. Second, to switch to statement replication we also had to change the transaction isolation level from the default repeatable read to read committed.
The conversion itself went as follows:
- Switch the master to statement replication mode.
- Raise a temporary copy of the database for conversion, run the conversion on it.
- Once the conversion finishes, we switch the copy into replica mode off the main database; the data catches up, and incoming strings are converted on the fly as well.
- For each replica of this database:
  - move the load from the replica onto the temporary UTF-8 replica;
  - re-pour the replica from scratch from the temporary database and enable replication from it;
  - return the load back to the replica.
- We switch the temporary database into master mode and redirect requests from the old master to the temporary one using NAT.
- The old master is re-poured from the temporary database and catches up through replication.
- We switch the master back, remove the NAT, and return replication to mixed mode.
- We shut down the temporary database.
As a result, over three months of hard work we managed to convert all 98 masters (plus a pile of replicas) covering fifteen different database schemas (one especially large 750 GB database took almost two weeks of machine time). The admins wept and went without sleep at night (sometimes the developers did too), and still the process went slower than we wanted. At first we did things properly and converted according to the scheme above, using machines with SSDs to speed things up. But by the end of the third month, realizing that at that pace it would take two more months, we gave in, shifted the entire load from the replicas to the masters, and started converting directly on the old replicas. Fortunately, the masters survived that period without incident, and within a week (mostly because the replicas were running on rather weak old machines) the conversion was complete.
Besides converting the databases themselves, we also had to add UTF-8 support to the code and make the transition smooth and invisible. With MySQL this part turned out to be simple: it keeps the encoding in which data is stored separate from the encoding in which data is returned to the client. Historically our servers were configured with character_set_* = cp1251. To avoid breaking old clients we left character_set_client, character_set_connection, and character_set_results at cp1251 and switched everything else to utf8. As a result, old clients working in cp1251 keep receiving data in cp1251 regardless of whether a given database has been converted, while new clients working in UTF-8 simply run set names utf8 right after establishing a connection and enjoy all the benefits of the new encoding.
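In Perl this boils down to something like the following sketch; the actual connection logic lives in our own wrappers, and is_utf8_server() here is only a stand-in for our per-server configuration flag:

```perl
use DBI;

# Stand-in for the per-server "this machine runs in UTF-8" configuration flag.
sub is_utf8_server { return $ENV{MY_WORLD_UTF8} ? 1 : 0 }

my $dbh = DBI->connect(
    'DBI:mysql:database=my_database;host=db1.example.com',
    'user', 'password', { RaiseError => 1 },
);

# Old clients silently keep the historical character_set_* = cp1251 defaults;
# new clients switch the connection to UTF-8 right after connecting.
$dbh->do('SET NAMES utf8') if is_utf8_server();
```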
Tarantool
There is probably no need to explain what Tarantool is: this brainchild of My World has gained enough fame and has grown into a solid open source project.
Over the years of using it we had accumulated a huge amount of data, and when it turned out that we had 400 Tarantool instances, we frankly got scared that the conversion would drag on forever. But, luckily for us, only 60 of them turned out to contain text fields (mostly user profiles).
We must admit that transcoding Tarantool was a genuinely interesting problem, and the solution turned out rather elegant, though, of course, not entirely out of the box. I should note right away that once Tarantool started developing as an open source project, it became clear that the community's needs and ours did not quite coincide: the community needs an understandable product, a key-value store that works out of the box, while we need a product with a modular architecture (a framework for writing storages), additional highly specialized features, and performance optimizations. So in some places we kept using Tarantool, and in others we moved to its fork, octopus, developed by the author of the very first Tarantool. And this greatly simplified the conversion. The thing is, Octopus lets you write replication filters in Lua, that is, send the replica not the original commands from the master's snapshots and xlogs, but commands modified by a Lua function. This feature was added long ago to allow partial replicas containing not all of the master's data but only certain tuple fields. And it occurred to us that in exactly the same way we could recode texts on the fly during replication.
Still, Octopus needed a little finishing for this task: although the feeder (the master-side process that feeds xlogs to replicas) had long been implemented as a separate Octopus module, mod_feeder, it could not yet be started separately from the storage itself (in this case the key-value store implemented by the mod_box module), and we needed changes to the replication mechanism not to require restarting the master. And, of course, we had to write the Lua replication filters themselves, which for each namespace converted the required fields from CP1251 to UTF-8.
Besides actually converting the data in Tarantool and Octopus, we had to make the code work transparently with shards that had already been converted as well as with those that had not, and to make the switch from CP1251 to UTF-8 atomic. So we decided to put a special transcoding proxy in front of the storages which, depending on a flag in the client's request, converts data between the storage's encoding and the client's. Here Octopus came to the rescue once again, or rather its mod_colander module, which lets you write fast proxy servers, including in Lua (since Octopus uses LuaJIT and FFI, the result is genuinely fast).
Overall, the Tarantool/Octopus conversion to UTF-8 went like this:
- We set up utf8proxy on the master and the replicas, raising it on the port the Tarantool used to listen on and moving the Tarantool itself to another port. From this point on, clients can send requests both in CP1251 and in UTF-8.
- On the master's server we launch the transcoding utf8feeder, configured to read the snapshots and xlogs from the same directories the master writes them to.
- On another server, off to the side, we raise a temporary replica of the master and configure it to replicate from the transcoding feeder. The temporary replica thus receives its data already in UTF-8.
- We point the replicas' utf8proxy at the temporary replica, re-pour the old replicas from the temporary one, then return the load back to them.
- We firewall the port on the master's utf8proxy (so there are no conflicting updates), repoint utf8proxy at the temporary replica, promote the temporary replica to temporary master, shut down the old master, and reopen the utf8proxy port.
- We pour the new master from the temporary one and switch the replicas to replicate from it.
- We put the new master behind utf8proxy and turn off the temporary master. At this point all instances hold data in UTF-8, and clients can start writing texts that do not fit into CP1251.
- Once all clients have switched to UTF-8, we remove utf8proxy.
The whole process of recoding Tarantool/Octopus took about a month. It did not, unfortunately, go entirely without mishaps: since we were converting several shards in parallel, we managed to swap two shards when switching masters. By the time the problem was noticed, a fair amount of data had already been modified, and we had to analyze the xlogs of both shards and restore justice.
Memcached
At first glance it seems (at least it seemed so to us) that converting the caches would be the easiest part: just write the UTF-8 values under keys with different names, or into separate instances. In practice this does not work, for two reasons: first, it would take twice as many caches; second, at the moment of switching encodings the caches would be cold. The second problem can be mitigated by switching a few servers at a time, but the first, given how many caches we have, is much harder to get around.
So we chose to mark each key with a flag indicating which encoding it holds. Conveniently, the Perl memcached client Cache::Memcached::Fast already does this: when storing a string it writes Perl's internal SVf_UTF8 string flag (set when the string contains multibyte characters) into one of the item flags (F_UTF8 = 0x4). Thus, if the flag is set, the string is definitely UTF-8; if not, things are a bit more complicated: it is either a CP1251 text string or a binary one. Text strings we simply convert when needed, but binary ones are trickier: to avoid corrupting them with an unnecessary conversion, we had to split the set/get methods (and friends) into text and binary variants, find every place where binary strings were saved to or read from memcached, and switch them to the appropriate methods without automatic transcoding. We made the same changes in the code and added support for the F_UTF8 flag there.
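Roughly, the resulting split looks like this in Perl (a sketch with hypothetical set_text/get_text and set_raw/get_raw helpers; the utf8 option of Cache::Memcached::Fast is what carries the F_UTF8 flag described above):

```perl
use Cache::Memcached::Fast;
use Encode qw(decode);

# Client for text values: character strings are stored with the F_UTF8 flag.
my $memd = Cache::Memcached::Fast->new({
    servers => ['10.0.0.1:11211'],
    utf8    => 1,
});

# Client for binary values: no encoding handling, bytes pass through untouched.
my $memd_raw = Cache::Memcached::Fast->new({ servers => ['10.0.0.1:11211'] });

sub set_text { my ($key, $chars, $exp) = @_; $memd->set($key, $chars, $exp) }
sub get_text {
    my ($key) = @_;
    my $value = $memd->get($key);
    # Records written before the switch carry no F_UTF8 flag and come back
    # as bytes; for text keys those bytes are CP1251, so recode them here.
    $value = decode('cp1251', $value)
        if defined $value && !utf8::is_utf8($value);
    return $value;
}

sub set_raw { my ($key, $bytes, $exp) = @_; $memd_raw->set($key, $bytes, $exp) }
sub get_raw { my ($key) = @_; $memd_raw->get($key) }
```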
Other in-house storages
In addition to the standard storages mentioned above, we use a large number of in-house storages for the "What's new" feed, comments, message queues, dialogs, search, and so on. We will not dwell on each of them; we will only mention the main cases and how we handled them.
- The storage is hard to convert without downtime, or we plan to migrate its data to a new storage soon, or the data is short-lived. In such cases we did not convert the data at all, but marked new records with an encoding attribute in one of two ways: either a flag saying which encoding the whole record is in, or a BOM marker at the beginning of each string field if it is in UTF-8 (a sketch of this follows the list).
- The storage keeps not the strings themselves but their hash sums; we use this for search. Here we simply walked the entire storage with a script that duplicated the hash sums, this time computed from the original strings converted to UTF-8. During the transition we had to make two storage queries per search query: one for CP1251 and one for UTF-8.
- A proxy already sits in front of the storage and all requests go through it. In this case we implemented the conversion in the proxy, much as we did for Tarantool, the only difference being that for Tarantool this was temporary functionality, whereas here it will stay for as long as the stored data remains relevant.
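The BOM-marker variant from the first case can be sketched like this (simplified; in reality this sits inside the storage's record packing code):

```perl
use Encode qw(decode encode);

# UTF-8 encoded byte order mark, used as a "this field is UTF-8" marker.
my $UTF8_BOM = "\xEF\xBB\xBF";

# New records: always written in UTF-8, prefixed with the marker.
sub pack_field {
    my ($chars) = @_;
    return $UTF8_BOM . encode('UTF-8', $chars);
}

# Old records have no marker and are therefore CP1251.
sub unpack_field {
    my ($bytes) = @_;
    return $bytes =~ s/^\Q$UTF8_BOM\E//
        ? decode('UTF-8', $bytes)
        : decode('cp1251', $bytes);
}
```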
UTF-8 support in code
While the administrators were converting the databases, the developers were adapting the code to work with UTF-8. Our code base roughly divides into three parts: Perl, C, and templates.
When designing the switch to UTF-8, one of our key requirements was the ability to switch server by server. We needed this primarily to be able to test the project in UTF-8 against the production databases, first with our testers and then on a small percentage of our users.
Perl and UTF-8
To adapt the Perl code to UTF-8 we had to solve several basic tasks:
- convert the Cyrillic strings scattered around the code;
- take the server's encoding into account when connecting to all storages and services;
- allow for the fact that HTTP request parameters may arrive in an encoding other than the one the server operates in;
- serve content in the server's encoding and use the matching templates;
- cleanly separate byte strings from character strings, decoding UTF-8 (bytes to characters) on input and encoding it back on output.
We decided to convert the Perl code from CP1251 to UTF-8 in a somewhat nontrivial way: we started by converting modules on the fly at compile time using source filters (see perlfilter and Filter::Util::Call; Perl lets you modify source code between reading it from disk and compiling it). We did this to avoid numerous merge conflicts between repository branches, which would have arisen had we converted the repo in a single branch and kept it aside throughout development and testing. During the entire testing period and for the first week after launch, the sources stayed in CP1251 and were converted directly on the production servers when the daemons started, if the server was configured for UTF-8. A week after the launch we converted the repository and pushed the result straight to master. As a result, merge conflicts arose only in the branches that were in development at that moment.
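To illustrate the idea, a minimal source filter doing such an on-the-fly recode might look like this (a sketch only: the real filter was applied centrally and only when the server was configured for UTF-8, and the module name here is made up):

```perl
package MyWorld::CP1251ToUTF8;
# Minimal source filter: recode module source from CP1251 to UTF-8
# as it is read from disk, before Perl compiles it.
use strict;
use warnings;
use Filter::Util::Call;
use Encode qw(from_to);

sub import {
    filter_add(sub {
        my $status = filter_read();               # reads next source line into $_
        from_to($_, 'cp1251', 'utf-8') if $status > 0;
        return $status;
    });
}

1;
```

A module then only needs the filter activated near the top of the file for the rest of its source to be recoded before compilation.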
The most tedious part was adding, in all the necessary places, automatic conversion of strings for the storages that we did not fully convert to UTF-8. But even where no string conversion was needed, we still had to account for the fact that in Perl there is a (significant) difference between byte strings and character strings. Naturally, we wanted all text strings to automatically become character strings after being read from a database, which meant auditing all I/O for whether it carried binary or text data; we had to go through every pack/unpack call and, after unpacking, mark the necessary strings as character strings (or, conversely, turn a string into bytes before packing, so that its length is counted in bytes rather than characters).
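A typical fix around pack/unpack looked roughly like this (the record format here is invented for the example):

```perl
use Encode ();

# Packing: lengths must be counted in bytes, so encode characters first.
sub pack_record {
    my ($id, $title) = @_;               # $title is a character string
    my $bytes = Encode::encode('UTF-8', $title);
    return pack('N n a*', $id, length($bytes), $bytes);
}

# Unpacking: what comes out of unpack is bytes; turn text back into characters.
sub unpack_record {
    my ($packed) = @_;
    my ($id, $len, $rest) = unpack('N n a*', $packed);
    my $title = Encode::decode('UTF-8', substr($rest, 0, $len));
    return ($id, $title);
}
```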
HTTP request parameters can arrive either in CP1251 or in UTF-8 (depending on the encoding of the referring page). At first we planned to handle this by passing an extra parameter in the request, but after looking at how CP1251 and UTF-8 are laid out, we concluded that Cyrillic in CP1251 can practically always be distinguished from Cyrillic in UTF-8 simply by checking whether the string is valid UTF-8 (it is almost impossible to compose valid UTF-8 out of nothing but Russian letters in CP1251).
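The check itself is only a few lines (a sketch of the approach, with the parameter plumbing around it omitted):

```perl
use Encode ();

# Decode an HTTP parameter whose encoding depends on the referring page:
# if the bytes form valid UTF-8, treat them as UTF-8, otherwise as CP1251
# (a string of CP1251 Russian letters is almost never valid UTF-8).
sub decode_param {
    my ($bytes) = @_;
    my $chars = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    return defined $chars ? $chars : Encode::decode('cp1251', $bytes);
}
```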
On the whole, the way UTF-8 support is organized in Perl is convenient enough, but it often feels like magic, so it is worth keeping the following in mind:
- forget that strings have an SVf_UTF8 flag (it is useful only for debugging); instead, think of strings as byte strings and character strings, and also forget that the internal representation of a Perl string with the SVf_UTF8 flag happens to be UTF-8;
- forget about the functions Encode::_utf8_on(), Encode::_utf8_off(), utf8::upgrade(), utf8::downgrade(), utf8::is_utf8(), utf8::valid();
- use utf8::encode() to convert a character (Unicode) string to UTF-8 bytes;
- bear in mind that for Perl, "UTF-8" and "utf8" are slightly different encodings: in the first, only code points <= 0x10FFFF are valid (as defined by the Unicode standard), while the second accepts any IV (int32 or int64 depending on the architecture) encoded with the UTF-8 algorithm;
- accordingly, utf8::decode() should only be used on data from trusted sources (our own databases), where invalid UTF-8 cannot occur; when decoding external input, always use Encode::decode('UTF-8', $_) to protect against code points that are invalid in terms of Unicode;
- do not forget that the return value of utf8::decode() is sometimes worth checking to see whether the byte string was valid utf8; for checking for valid UTF-8 you can also use the third parameter of Encode::decode();
- note that the upper half of the latin1 table contains the same characters as the Unicode code points with the same numbers, but they are encoded differently in UTF-8. This affects the result of an erroneous double utf8::decode() call: strings containing only ASCII code points, or at least one character with a code point > 0xFF, will survive intact, but if a string consists only of characters from the upper half of latin1 plus ASCII, the latin1 characters will be mangled (see the demonstration after this list);
- use the latest version of Perl. On perl 5.8.8 we ran into a remarkable bug: a combination of use locale and certain regular expressions, given the right input, sends the regex engine into an infinite loop. We had to limit the scope of use locale to the strictly necessary set of operations: sort, cmp, lt, le, gt, ge, lc, uc, lcfirst, ucfirst.
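To make the double-decode point more concrete, here is a tiny demonstration ("café" stands in for any string whose characters all fit into latin1):

```perl
use strict;
use warnings;

my $s = "caf\xc3\xa9";   # UTF-8 bytes for "café"

utf8::decode($s);         # correct: $s is now the 4-character string "café"

# An accidental second call returns false (the character ordinals are not a
# valid UTF-8 sequence), but as a side effect the string quietly loses its
# character-string status, so code that later expects UTF-8 sees a broken "é".
# Strings that are pure ASCII, or that contain at least one character above
# 0xFF, survive the same mistake unharmed.
utf8::decode($s);
```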
C and UTF-8
In our C code, fortunately, there were far fewer strings than in Perl, so we took the classical path: we moved all the Cyrillic strings into a separate file. This confined potential merge conflicts to a single file and also simplified the subsequent localization. While converting the repo to UTF-8 we discovered something amusing: the Russian-language comments in the code turned out to be in all four Cyrillic encodings at once: cp1251, cp866, koi8-r, and iso8859-5. During conversion we had to auto-detect the encoding of each particular string.
Besides converting the repo, the C code also needed support for basic string operations: length in characters, case conversion, truncating a string to a given length, and so on. For working with Unicode, C has the excellent libicu library, but it has one inconvenience: it uses UTF-16 as its internal representation. We wanted to avoid the overhead of transcoding between UTF-8 and UTF-16, so for the most frequently used simple functions we implemented analogues that work directly with UTF-8, without transcoding.
Templates, JavaScript, and UTF-8
With the templates, fortunately, everything was quite simple. In production they are packaged into rpm packages, so the logical solution was to transcode them during the build. We added another package with UTF-8 templates installed into a directory alongside the original one, and the code (both Perl and C) simply picked the template from the appropriate directory.
With JavaScript, things did not work out of the box. Most browsers take the Content-Type into account when loading JavaScript, but there are still old ones that ignore it and go by the encoding of the page instead. So we resorted to a workaround: when building the JavaScript package we replaced all non-ASCII characters with escape sequences of their code point numbers. This makes the js files larger, but any browser loads them correctly.
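The replacement itself is essentially a one-liner in the build script; roughly like this (a sketch; our packaging step differs in details, and code points above U+FFFF would need surrogate pairs, which did not occur in our sources):

```perl
use Encode qw(decode encode);

# Replace every non-ASCII character in a JavaScript source with a \uXXXX
# escape, so the file survives being served with the wrong charset.
sub escape_js {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    $chars =~ s/([^\x00-\x7f])/sprintf('\u%04x', ord($1))/ge;
    return encode('UTF-8', $chars);   # the result is pure ASCII anyway
}
```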
The result
In the end, after six months, everything came together. The admins had just finished recoding a couple of hundred databases, the developers had finished the code, and testing was complete. We gradually flipped the switches in the World's control panels: first all of our colleagues' accounts were moved to UTF-8, then one percent of users; then we began switching the backend servers, ten at a time, and finally the frontends. Visually nothing changed, neither the project pages nor the load graphs, which could only make us happy. The only outward sign that the past half year had not been spent in vain was the change in the Content-Type header from charset=windows-1251 to charset=UTF-8.
Three months have passed since then. Our Russian-speaking users have already come to appreciate the ability to put emoji and other frills into their texts, while our Kazakh users have started corresponding in their own language and, quite recently, got the option of using the web interface and the mobile apps in their native language. The internationalization and localization work that followed the switch to Unicode brought plenty of interesting problems of its own, and we will try to devote a separate article to them.