For a long time Perl knew nothing about encodings. A string was just a sequence of bytes, everyone stored whatever they wanted in it, and only occasionally had to think about what encoding the data was in. Times changed, Unicode appeared, and Perl programmers had to support it. As usual, in the Perl way. I hope this article will spare some nerves for those who are still in the dark about how UTF-8 is implemented in Perl.
Actually, Perl has had two implementations of UTF-8. The first appeared in Perl 5.6, but it was rather raw and awkward to use. Starting with Perl 5.8 the Unicode mechanism was radically revised, and modules on CPAN filled up with amusing checks of the interpreter version. Everything below refers specifically to this second implementation.
Pros and cons
If you have never had to think about encodings, have quietly been developing monolingual applications and intend to carry on the same way, you almost certainly do not need Unicode. Data in single-byte encodings is more compact in any case, it is processed faster, and it is easy and pleasant to work with.
You will probably need UTF-8 if you do not know in advance in what form the next portion of data will arrive, or if you are developing an international project. After all, even if your site is in English, a German with umlauts in his name, or even a resident of China, can easily register on it. The easiest way not to worry about what then happens in the database (or about how you will display a Chinese name in your favorite latin-1) is to work in an encoding that supports many languages.
There is one more case where you cannot do without knowing how Perl handles UTF-8: integration with third-party components that work in this format. For example, XML::LibXML returns the results of parsing XML files in this format.
The perl way
The maintainers probably reasoned something like this: we used to store byte strings in variables, and now we need to learn to store characters there. A character in UTF-8 has variable length and may take more than one byte. If regular expressions and string functions (such as length and substr) start behaving differently, nobody will thank us. So we need two kinds of strings: one for the old scheme, with bytes, and one for the new scheme, with characters. How? Let's introduce a hidden flag on scalars. If the flag is set, the string is treated as a sequence of logical characters (let's call this the Perl internal format); if not, as a sequence of bytes.
If you take two identical Unicode variables and simply clear the flag on one of them, Perl will process them differently (for example, they will most likely report different lengths). The data itself does not change, though: you can see this, for example, by writing both variables to a file or to the screen.
It is worth mentioning that, in Perl terminology, characters with code points above 255 (those that do not fit into one byte) are often called wide characters. If you come across warnings containing these words, they concern Unicode strings.
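To see where such a warning comes from, here is a minimal sketch. It assumes a handle opened without any encoding layer (a temporary file created with File::Temp is used purely for illustration) and collects warnings through a `$SIG{__WARN__}` handler instead of letting them go to STDERR.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings in an array instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A temporary file handle with no encoding layer: it expects bytes.
my ($fh, $filename) = tempfile(UNLINK => 1);

# U+0400 does not fit into one byte, so Perl warns when printing it here.
print {$fh} "\x{400}";
close $fh;

print "caught: $_" for @warnings;
```

The message caught here should start with "Wide character in print".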
There are several ways to work with Unicode data in Perl. The main ones are:
- specifying Unicode characters in a string explicitly, with a construct of the form \x{0100};
- converting a string manually with the Encode module or with functions from the utf8 package;
- enabling the use utf8 pragma, which raises the flag for all literals encountered in the code;
- reading from an I/O handle opened with the :encoding or :utf8 IO layer, so that all data is automatically converted to the internal format.
The first option, I hope, is clear and raises no questions. Just in case, note that the curly braces are required. The remaining options deserve a closer look.
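A short sketch of why the braces matter: without them, `\x` consumes at most two hexadecimal digits, so the rest of the number becomes ordinary text. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $with_braces    = "\x{0100}";  # one character, U+0100
my $without_braces = "\x100";     # two characters: "\x10" followed by "0"

printf "with braces: %d character(s), U+%04X\n",
    length($with_braces), ord($with_braces);
printf "without braces: %d character(s)\n", length($without_braces);
```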
Encode module
The module has shipped with Perl since 5.8, so it makes sense to use it not only for Unicode but for any other encoding conversions as well. Working with it is not too complicated. The only difficulty is learning not to confuse the encode function with the decode function :-). Their interfaces are the same, and the logic behind the names is not as obvious as one would like. Since strings with the Unicode flag are considered to be in the internal format, data in any external encoding (including UTF-8 without the flag) has to be decoded into it; conversely, to convert data to some external encoding you encode it from the internal format. It looks like this:
use Encode;
$bytes = encode('cp1251', $string);   # internal format -> cp1251 bytes
$string = decode('cp1251', $bytes);   # cp1251 bytes -> internal format
Since not all characters can be converted from one encoding to another without loss, there is also a third parameter that determines what to do when problems occur. You can read about it in the documentation for the Encode module, where a whole section is devoted to it.
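As a rough illustration of the round trip and of the third parameter, the sketch below encodes one Cyrillic character into two external encodings, decodes it back, and then asks `encode` to die on a character that cp1251 cannot represent. The sample characters and variable names are assumptions chosen for the example; `Encode::FB_CROAK` and `Encode::LEAVE_SRC` are the check constants described in the Encode documentation.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\x{416}";                   # one character: CYRILLIC CAPITAL LETTER ZHE
my $cp1251 = encode('cp1251', $string);   # internal format -> cp1251 bytes
my $utf8   = encode('UTF-8',  $string);   # internal format -> UTF-8 bytes
my $back   = decode('cp1251', $cp1251);   # cp1251 bytes -> internal format

printf "cp1251: %d byte(s), UTF-8: %d byte(s)\n", length($cp1251), length($utf8);
print "round trip ok\n" if $back eq $string;

# Third argument: die on a character the target encoding cannot represent,
# and leave the source variable untouched.
my $cjk = "\x{4E2D}";
my $ok  = eval {
    encode('cp1251', $cjk, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "cp1251 cannot represent U+4E2D\n" unless $ok;
```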
If you are sure that your variable contains bytes in UTF-8, you can simply raise the variable's flag, without any recoding or validation, using _utf8_on. The is_utf8 function tells you whether a string has the flag (and, optionally, checks the validity of the data in it). The flag is cleared, as you might guess, with _utf8_off. The only catch: these functions are marked as INTERNAL, so you should not count on them staying unchanged.
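A minimal sketch of what toggling the flag does, assuming the variable really holds valid UTF-8 bytes (here the two bytes that encode U+0416). The functions are called by their full names so that nothing has to be imported.

```perl
use strict;
use warnings;
use Encode ();

my $s = "\xD0\x96";    # two bytes: U+0416 encoded in UTF-8

my $flag_before = Encode::is_utf8($s) ? 1 : 0;
my $len_before  = length($s);            # counted in bytes

Encode::_utf8_on($s);                    # raise the flag, data untouched
my $flag_on = Encode::is_utf8($s) ? 1 : 0;
my $len_on  = length($s);                # now counted in characters
my $code    = ord($s);

Encode::_utf8_off($s);                   # clear the flag again
my $len_off = length($s);

printf "flag %d, length %d\n", $flag_before, $len_before;
printf "flag %d, length %d, U+%04X\n", $flag_on, $len_on, $code;
printf "after _utf8_off: length %d\n", $len_off;
```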
Starting with Perl 5.8.1, some of the functions of the Encode module became available in the utf8:: namespace: is_utf8, encode, and decode. The last two differ from their namesakes in Encode in that they modify the variable passed to them instead of returning the result, and they take no encoding argument (the data is assumed to be UTF-8 without the flag raised). All these functions are built into the interpreter, so you do not need to write use utf8 to access them; moreover, doing so can have additional effects (more on those a bit later).
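The in-place behaviour can be sketched as follows; note that no `use utf8` or `use Encode` line is needed. The sample bytes are an assumption made for the example.

```perl
use strict;
use warnings;

my $s = "\xD0\x96";               # UTF-8 bytes for U+0416, flag not set

my $ok = utf8::decode($s);        # modifies $s in place, returns true on success
my $chars   = length($s);
my $flagged = utf8::is_utf8($s) ? 1 : 0;

utf8::encode($s);                 # back to UTF-8 bytes, again in place
my $bytes = length($s);

print "decoded: $chars character(s), flag $flagged; encoded: $bytes byte(s)\n";
```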
use utf8;
The use utf8 pragma tells the interpreter that all string literals and regular expressions that are written within its scope and contain non-ASCII characters should be treated as Unicode and automatically converted to the internal format (the source file itself must be saved in UTF-8). To cancel the pragma's effect, use no utf8, as usual.
There is also the use bytes pragma with the opposite meaning: within its scope, even data with the UTF-8 flag is treated as a sequence of bytes.
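The effect of the byte-oriented pragma can be sketched with an ASCII-only source file: the same string reports a different length inside and outside the scope of `use bytes`. The variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "\x{400}";          # one character, stored internally as two bytes

my $chars = length($s);     # character semantics

my $bytes;
{
    use bytes;              # lexically scoped: only affects this block
    $bytes = length($s);    # byte semantics
}

print "$chars character(s), $bytes byte(s)\n";
```

The documentation of the bytes pragma itself discourages using it for anything other than debugging, so treat this as a way to inspect a string rather than as a conversion tool.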
PerlIO
The topic of Perl IO layers really deserves a separate article. The idea is that, for some time now, the good old open function has had a three-argument syntax:
open $fh, $mode, $filename
Besides the standard values such as '>' and '<', $mode can also specify the file encoding. The data read is then automatically converted to Perl's internal format:
open $fh, "<:encoding(cp1251)", $filename
If we are talking about a file that contains data in UTF-8, the code can be slightly simplified:
open $fh, "<:utf8", $filename
Of course, these modifiers can also be used when writing files; the effect is the reverse: data is encoded from the internal format on output.
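Both directions can be sketched in one round trip. The example assumes a scratch file created with File::Temp; the layer name `:encoding(UTF-8)` is used for both writing and reading.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration, removed automatically at exit.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Writing: characters are encoded to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} "\x{400}";
close $out or die "Cannot close $filename: $!";

my $size = -s $filename;    # size on disk, in bytes

# Reading: bytes are decoded to the internal format on the way in.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $text = <$in>;
close $in;

printf "%d byte(s) on disk, %d character(s) in memory\n", $size, length($text);
```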
By the way, Perl can make its I/O streams Unicode once and for all with the -C command-line switch. Details, as always, are in perldoc (see perlrun).
Pitfalls
Of course, there are some.
In general, one sometimes gets the feeling that at every turn of its development Perl scatters a lot of rakes around for programmers to step on (sometimes twice, if the first rakes were experimental).
First, some functions by definition work with bytes rather than characters, and strings in the internal representation stick in their throat. These include the frequently used functions of the Digest::MD5 module. For instance, the following example dies with "Wide character in subroutine entry at test.pl line 3.":
use Digest::MD5 'md5_hex';
print md5_hex("\x{400}");
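One possible way around this, sketched below, is to convert the string to bytes explicitly before hashing. Choosing UTF-8 as the byte encoding is an assumption; whichever encoding you pick must be used consistently, because the digest depends on it.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

my $string = "\x{400}";

# Hashing the character string directly dies, as described above.
my $died = !eval { md5_hex($string); 1 };
print "md5_hex died: $@" if $died;

# Hashing the UTF-8 encoded bytes works.
my $digest = md5_hex(encode('UTF-8', $string));
print "$digest\n";
```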
Second, data does not always arrive in the form the program expects. It would be naive to expect, for example, that an HTML form handler will always receive valid UTF-8. The results of excessive trust in one's sources can be quite varied, from data corruption to fatal errors when trying to convert the data to another encoding (for example, when composing an email).
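A defensive approach might look like the sketch below: decode incoming bytes strictly and treat a failure as bad input. The helper name `decode_input` is hypothetical, invented for this example; the check flags come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not valid UTF-8. LEAVE_SRC keeps the caller's data intact.
sub decode_input {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = decode_input("\xD0\x96");   # valid UTF-8 for U+0416
my $bad  = decode_input("\xC3\x28");   # lead byte without a continuation byte

print defined $good ? "good input accepted\n" : "good input rejected\n";
print defined $bad  ? "bad input accepted\n"  : "bad input rejected\n";
```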
Finally, the most frequent and interesting problem occurs when you concatenate two strings, only one of which is stored in Perl's internal format. Suppose we have a file like this (saved in UTF-8):
use Encode;
$a = decode('utf8', " ");   # decoded: internal format, flag set
$b = "на Хабре";            # a string of 15 bytes, no flag
$c = $a.$b;
In the last line, Perl tries to bring the strings to a common format. Since it sees $b as a sequence of bytes, each byte of that string is treated as a separate character and encoded in UTF-8. The result is a mess like this (with the flag raised, by the way):
$c = " на Хабре"
The glitch is easy to spot with the naked eye by the Unicode-specific mojibake; you won't confuse it with anything else.
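The same effect can be reproduced with an ASCII-only source file, which also shows the usual remedy: decode every byte string before joining it with decoded text. The sample bytes and variable names are assumptions made for the sketch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw  = "\xD0\x96";               # UTF-8 bytes for U+0416, no flag
my $text = decode('UTF-8', $raw);    # the same letter in the internal format

# Mixing: each byte of $raw is upgraded to a separate character.
my $mixed       = $text . $raw;
my $mixed_chars = length($mixed);
my $mixed_bytes = length(encode('UTF-8', $mixed));

# Remedy: decode the byte string first, then concatenate.
my $fixed       = $text . decode('UTF-8', $raw);
my $fixed_chars = length($fixed);
my $fixed_bytes = length(encode('UTF-8', $fixed));

print "mixed: $mixed_chars character(s), $mixed_bytes byte(s) when written out\n";
print "fixed: $fixed_chars character(s), $fixed_bytes byte(s) when written out\n";
```

The mixed string grows from the expected four bytes to six on output, which is exactly the double-encoded garbage seen on screen.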
Conclusion
Many subtleties remain uncovered in this article. A number of utilities from the Encode and utf8 modules were left behind the scenes. There was no room to mention the variant of the internal format that is strict about characters that are invalid from the UTF-8 point of view. Questions related to regular expressions are omitted entirely. If you want to understand this topic thoroughly, consult the Perl documentation on the subject.
If you have any questions, I will try to answer them.
UPD: Habr user codesign has sent links to his own write-ups on the same topic, which I recommend.