A couple of words about UTF-8

Perl knew nothing about encodings for a long time. A string was just a sequence of bytes, everyone stored whatever they wanted in it, and only occasionally did anyone have to think about what encoding the data was in. Times have changed: UTF-8 appeared, and Perl programmers had to support it. As usual, they did so the Perl way. I hope this article saves some nerves for those who are still in the dark about how UTF-8 is implemented in Perl.

Actually, there have been two implementations of UTF-8 in Perl. The first appeared in Perl 5.6 but was rather raw and awkward. Starting with Perl 5.8, the Unicode machinery was radically reworked, and modules on CPAN filled up with amusing checks of the interpreter version. Everything below refers to this second implementation.

Pros and cons


If you have never had to think about encodings, have quietly developed monolingual applications and plan to continue in the same vein, you almost certainly do not need Unicode. Data in single-byte encodings is more compact in any case, is processed faster, and is easy and pleasant to work with.

You will probably need UTF-8 if you do not know in advance what form the next portion of data will arrive in, or if you are developing an international project. After all, even if your site is in English, a German with umlauts in his name, or even a resident of China, can easily register on it. The easiest way not to worry about what happens in the database afterwards (or about how you will display a Chinese name in your favorite latin-1) is to work in an encoding that supports many languages.
One more case where you cannot avoid getting acquainted with Perl's UTF-8 support is integration with third-party components that work in this format. For example, XML::LibXML returns the results of parsing XML files in this format.

The Perl way


The maintainers probably reasoned something like this: we used to store byte strings in variables, and now we need to learn to store characters there. A character in UTF-8 has variable length and may take more than one byte. If regular expressions and string functions (such as length and substr) start behaving differently, nobody will thank us. So we need two kinds of strings: one for working the old way, with bytes, and one for working the new way, with characters. How to do that? Let's introduce a hidden flag for scalars. If the flag is set, the string is treated as consisting of logical characters (let's call this the Perl internal format); if not, as bytes.

If you take two identical Unicode variables and simply clear the flag on one of them, Perl will process them differently (for example, they will most likely have different lengths). The data itself does not change, however, which you can see by printing both variables to a file or to the screen.
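
The difference in length can be illustrated with a minimal sketch. The variable names are made up for the example, and decode from the Encode module is used here as one way of obtaining a string with the flag raised:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of the two Cyrillic letters U+043D U+0430
my $bytes = "\xD0\xBD\xD0\xB0";

# The same data as a character string (flag raised)
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # counted in bytes
my $char_len = length $chars;    # counted in characters

# Encoding the character string back yields exactly the original bytes
my $same_data = encode('UTF-8', $chars) eq $bytes;

print "bytes: $byte_len, characters: $char_len\n";   # bytes: 4, characters: 2
```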

It is worth mentioning that UTF-8 characters are often called wide characters in Perl terminology. If you come across warnings containing these words, they are about Unicode strings.

There are several ways to work with Unicode data in Perl. The main ones are:
  1. specifying Unicode characters in a string explicitly, with a construct of the form \x{0100};
  2. converting a string manually with the Encode module or with functions from the utf8 package;
  3. enabling the use utf8 pragma, which raises the flag for all constants that appear in the code;
  4. reading from a file handle with the :encoding or :utf8 I/O layer, so that all data is automatically converted into the internal format.
I hope item 1 is clear and raises no questions. Just in case, I will mention that the curly braces are required. We will look at the remaining options more closely.
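
For item 1, a small sketch (the variable names are illustrative) shows that a \x{...} escape above 0xFF produces a single character and a string with the flag raised:

```perl
use strict;
use warnings;

# "\x{0100}" is code point 256, LATIN CAPITAL LETTER A WITH MACRON
my $wide = "abc\x{0100}";

my $length   = length $wide;                   # 4 characters
my $has_flag = utf8::is_utf8($wide) ? 1 : 0;   # 1: stored in the internal format
my $last_ord = ord substr($wide, -1);          # 256

print "$length $has_flag $last_ord\n";         # 4 1 256
```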

Encode module

The module has shipped with Perl since 5.8, so it makes sense to use it not only for Unicode but for any other encoding conversions as well. Working with the module is not too complicated. The only problem is learning not to confuse the encode function with decode :-). Their interfaces are the same, and the logic of the names is not as obvious as one would like. Since the format of strings with the Unicode flag is considered the internal format, data in any external encoding (including UTF-8 without the flag) has to be decoded into it; conversely, to convert data to some external encoding, you encode it from the internal format. It looks like this:

use Encode;
$bytes = encode('cp1251', $string); # internal format -> cp1251
$string = decode('cp1251', $bytes); # cp1251 -> internal format


Since not all characters can be converted from one encoding to another without loss, there is also a third parameter that determines what to do when problems occur. You can read about it in the documentation for the Encode module, which devotes a whole section to it.
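
As a hedged illustration of that third parameter, the sketch below tries to encode a string containing a CJK character into cp1251, first with the default behaviour and then with Encode::FB_CROAK, which makes encode die instead of substituting. Encode::LEAVE_SRC is added so that the source string is not modified. The sample string and variable names are assumptions made for the example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Cyrillic U+041F followed by CJK U+4E2D; the second has no cp1251 equivalent
my $string = "\x{41f}\x{4e2d}";

# Default behaviour: the unmappable character is silently replaced
my $lossy = encode('cp1251', $string);

# Strict behaviour: encode dies on the first unmappable character
my $ok = eval {
    encode('cp1251', $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
my $strict_failed = $ok ? 0 : 1;

print "strict encoding failed\n" if $strict_failed;
```

Whether silent substitution or a fatal error is preferable depends on where the data is going; the strict form is usually easier to debug.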

If you are sure that your variable contains bytes in UTF-8, you can simply raise the flag on the variable, without recoding or checking it, using _utf8_on. The is_utf8 function tells you whether a string has the flag (and, if desired, checks the validity of the data in it). The flag is cleared, as you might guess, with _utf8_off. The only catch is that these functions are marked as INTERNAL, and you should not count on them staying unchanged.

Starting with Perl 5.8.1, some functions of the Encode module are also available in the utf8:: namespace: is_utf8, encode and decode. The last two differ from their namesakes in the Encode module in that they modify the variable passed to them instead of returning the result, and they take no encoding argument (the work is assumed to be with UTF-8 data without the flag raised). All these functions are built into the interpreter, and you do not need to write use utf8 to access them; moreover, doing so can have additional effects (more on them a bit later).
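
A short sketch of the in-place behaviour of these functions (the variable names are illustrative):

```perl
use strict;
use warnings;

# UTF-8 bytes of two Cyrillic letters; note that "use utf8" is not needed
my $s = "\xD0\xBD\xD0\xB0";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;   # 0: plain bytes

utf8::decode($s);                              # modifies $s in place
my $flag_decoded = utf8::is_utf8($s) ? 1 : 0;  # 1: character string
my $len_decoded  = length $s;                  # 2 characters

utf8::encode($s);                              # back to bytes, again in place
my $flag_encoded = utf8::is_utf8($s) ? 1 : 0;  # 0
my $len_encoded  = length $s;                  # 4 bytes

print "$flag_before $flag_decoded $len_decoded $flag_encoded $len_encoded\n";   # 0 1 2 0 4
```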

use utf8;

The use utf8 pragma tells the interpreter that all constants and regular expressions written within its scope and containing non-ASCII characters should be treated as Unicode and automatically converted to the internal format. To cancel the pragma, use the no utf8 construct, as usual.

There is also a pragma with the opposite meaning, use bytes, within whose scope even data with the UTF-8 flag is treated as consisting of bytes.
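
The effect of both pragmas can be seen in a small sketch. It assumes the script file itself is saved in UTF-8; the variable names are made up for the example:

```perl
use strict;
use warnings;

my $with_pragma;
{
    use utf8;                 # lexically scoped: affects only this block
    $with_pragma = "на";      # literal is read as 2 characters
}
my $without_pragma = "на";    # same literal, read as 4 bytes

my $len_chars = length $with_pragma;      # 2
my $len_bytes = length $without_pragma;   # 4

my $len_forced;
{
    use bytes;                            # treat even flagged strings as bytes
    $len_forced = length $with_pragma;    # 4
}

print "$len_chars $len_bytes $len_forced\n";   # 2 4 4
```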

PerlIO

The topic of Perl I/O layers really deserves a separate article. The idea is that, some time ago, the good old open function acquired a three-argument syntax:

open $fh, $mode, $filename

In addition to standard values such as '>' and '<', $mode can also specify the file encoding. The data read is then automatically converted into Perl's internal format:

open $fh, "<:encoding(cp1251)", $filename

If we are talking about a file that contains data in UTF-8, the code can be slightly simplified:

open $fh, "<:utf8", $filename

Of course, these modifiers can also be used when writing files; the effect is the reverse: data is converted from the internal format into the given encoding.
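
A round trip through a temporary file shows both directions. This is only a sketch: the temporary file from File::Temp and the variable names are assumptions made for the example, and :encoding(UTF-8) is used as the layer:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Writing: characters are encoded to UTF-8 bytes by the layer
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} "\x{43d}\x{430}";
close $out;

# Reading without decoding: we get the raw bytes
open my $raw, '<:raw', $filename or die "Cannot read $filename: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Reading through the layer: we get characters in the internal format
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $chars = do { local $/; <$in> };
close $in;

my $raw_len  = length $bytes;   # 4
my $char_len = length $chars;   # 2
print "$raw_len $char_len\n";   # 4 2
```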

By the way, Perl can make I/O streams Unicode once and for all with the -C command-line switch. Details, as always, are in perldoc.

Pitfalls


Of course there are some. In general, one sometimes gets the feeling that at every turn of its development Perl scatters a lot of rakes around, which programmers then diligently step on (sometimes twice, if the first rakes were experimental).

First, some functions work with bytes by definition, not characters, and choke on strings in the internal representation. These include the frequently used functions of the Digest::MD5 module. So the following example dies with the error "Wide character in subroutine entry at test.pl line 3.":

use Digest::MD5 'md5_hex';
print md5_hex("\x{400}");
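
One common way around this, sketched here under the assumption that the digest should be computed over the UTF-8 representation of the string, is to encode the string to bytes before hashing:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

my $string = "\x{400}";   # a wide character

# Passing the character string directly dies with "Wide character ..."
my $direct_failed = eval { md5_hex($string); 1 } ? 0 : 1;

# Encoding to bytes first gives md5_hex what it expects
my $digest = md5_hex(encode('UTF-8', $string));

print "$digest\n";
```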


Second, data does not always arrive in the form the program expects. It would be naive to expect, for example, that an HTML form handler will always receive valid UTF-8. The consequences of trusting your sources too much can be quite varied, from data corruption to fatal errors when trying to convert the data to another encoding (for example, when composing an email).
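
One possible defence is to decode incoming data strictly and handle the failure yourself. The helper below, try_decode_utf8, is a hypothetical name invented for this sketch; it relies on Encode::FB_CROAK to make decode die on malformed input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# input is not valid UTF-8
sub try_decode_utf8 {
    my ($bytes) = @_;
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

my $good = try_decode_utf8("\xD0\xBD\xD0\xB0");   # valid UTF-8
my $bad  = try_decode_utf8("\xFF\xFE");           # 0xFF never occurs in UTF-8

print defined $good ? "valid\n"   : "invalid\n";  # valid
print defined $bad  ? "valid\n"   : "invalid\n";  # invalid
```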

Finally, the most frequent and interesting problem occurs when you concatenate two strings of which only one is stored in Perl's internal format. Suppose we have a file like this (saved in UTF-8):

use Encode;
$a = decode('utf8', " "); # character string in the internal format
$b = "на Хабре"; # byte string, 15 bytes
$c = $a.$b;


In the last line, Perl tries to bring the strings to a common format. Since it sees $b as a sequence of bytes, it treats every byte of that string as a separate character and encodes each one in UTF-8. The result will be roughly this mess (with the flag raised, by the way):

$c = " Ð½Ð° Ð¥Ð°Ð±Ñ€Ðµ"

The glitch is easy to spot with the naked eye by its characteristic Unicode mojibake; you will not confuse it with anything else.
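
The usual remedy is to decode every byte string before it meets a character string. The following is a minimal sketch with made-up variable names, using escaped bytes so that it does not depend on the encoding of the script file:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $chars = decode('UTF-8', "\xD0\xBD\xD0\xB0");   # 2 characters, flag raised
my $bytes = "\xD0\xBD\xD0\xB0";                    # the same text as 4 raw bytes

# Mixed concatenation: each byte of $bytes becomes a separate character
my $broken = $chars . $bytes;
my $broken_len = length $broken;                   # 2 + 4 = 6

# Decoding first keeps both halves in the internal format
my $fixed = $chars . decode('UTF-8', $bytes);
my $fixed_len = length $fixed;                     # 2 + 2 = 4

print "$broken_len $fixed_len\n";                  # 6 4
```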

Conclusion


Many subtleties remain uncovered in this article. A number of utilities from the Encode and utf8 modules were left out. There was no room to mention the variant of the internal format that is strict about characters that are invalid from the standpoint of the UTF-8 standard. Questions related to regular expressions were omitted entirely. If you want to understand the topic fully, consult the manuals on Perl's Unicode support and on the Encode module.
If you have any questions, I will try to answer them.

UPD: Habr user codesign has sent links to their own write-ups on the same topic, which I recommend.

Source: https://habr.com/ru/post/53578/

