
Working with Perl encodings

There is already a good article on Habr about using UTF-8 in Perl - habrahabr.ru/post/53578. Still, I would like to talk about encodings from a slightly different angle.

Many questions arise from the variety of encodings and the terminology that surrounds them, and most of us have run into encoding problems at some point. In this article I will try to present the subject in an understandable form, starting with the question of automatically detecting the encoding of a text.

Determining the encoding of the source file. Determining the encoding of an input document is a task that comes up quite often in practice. Take a browser as an example. Along with an HTML file, it may also receive an HTTP response header that specifies the document's encoding, but that header may be wrong, so it cannot be the only thing relied upon; as a result, browsers support automatic encoding detection.
In Perl, you can use Encode::Guess for this, but the more "advanced", industrial option is Encode::Detect::Detector. As its documentation says, it provides an interface to Mozilla's universal charset detector.
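As a quick illustration, here is a minimal sketch using Encode::Guess (which ships with the core Encode distribution). The sample bytes and the candidate list are my own invention for the example; note that guess_encoding returns an error string, not a decoder object, when several candidates match the data:

```perl
use strict;
use warnings;
use Encode::Guess;

# "слово" encoded as UTF-8 octets (example data chosen for illustration)
my $octets = "\xD1\x81\xD0\xBB\xD0\xBE\xD0\xB2\xD0\xBE";

# Ask Encode::Guess to choose among the candidate encodings
my $decoder = Encode::Guess::guess_encoding($octets, qw/cp1251 koi8-r/);
if (ref $decoder) {
    print "Guessed: ", $decoder->name, "\n";
} else {
    # Several candidates matched - the guess is ambiguous,
    # and $decoder holds an error message instead of an object
    print "Ambiguous: $decoder\n";
}
```

Since every byte here is a valid character in cp1251 and koi8-r as well as valid UTF-8, the result will typically be ambiguous - which nicely demonstrates the probabilistic nature of detection discussed below.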

If you study the source code, pay attention to the nsUniversalDetector.cpp file and the method

nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)

All the encoding-detection work starts from this method. First it checks whether the data begins with a BOM, and if so, the encoding is determined by simply comparing the first bytes of the data against the known BOM signatures.
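The same BOM check is easy to express in Perl. This is my own sketch, not Mozilla's code, and it covers only the common BOM signatures:

```perl
use strict;
use warnings;

# Known BOM signatures mapped to encoding names. Longer signatures
# come first, because the UTF-32 BOMs begin with the UTF-16 ones.
my @boms = (
    ["\x00\x00\xFE\xFF" => 'UTF-32BE'],
    ["\xFF\xFE\x00\x00" => 'UTF-32LE'],
    ["\xEF\xBB\xBF"     => 'UTF-8'],
    ["\xFE\xFF"         => 'UTF-16BE'],
    ["\xFF\xFE"         => 'UTF-16LE'],
);

sub bom_encoding {
    my ($buf) = @_;
    for my $pair (@boms) {
        my ($sig, $name) = @$pair;
        return $name if substr($buf, 0, length $sig) eq $sig;
    }
    return undef;   # no BOM - fall through to the statistical probers
}

print bom_encoding("\xEF\xBB\xBFhello") // 'none', "\n";   # UTF-8
```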
Then each byte of the data is analyzed to see whether it falls outside US-ASCII (codes 128 to 255). If such bytes are present, objects of the prober classes nsMBCSGroupProber and nsSBCSGroupProber are created, each of which is responsible for analyzing its own group of encodings (MB - multibyte, SB - single-byte).

If the data is US-ASCII, there are two options: either it is plain ASCII (pure ascii), or the file contains escape sequences and belongs to encodings such as ISO-2022-KR and the like (see en.wikipedia.org/wiki/ISO/IEC_2022 for details). In that case, a detector implemented by the nsEscCharSetProber class is used.

nsMBCSGroupProber supports multibyte encodings such as UTF8, SJIS, EUCJP, GB18030, EUCKR, Big5, and EUCTW.

nsSBCSGroupProber handles single-byte ones such as windows-1251, KOI8-R, IBM866, and others.

The definition of single-byte encoding is based on the analysis of the frequency of occurrence of 2-character sequences in the text.
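The idea behind that frequency analysis can be sketched in a few lines of Perl. This is a toy version of my own; real detectors score the observed bigrams against language models built for each candidate encoding:

```perl
use strict;
use warnings;

# Count 2-character sequences (bigrams) in a sample text
my $text = "the quick brown fox jumps over the lazy dog";
my %freq;
$freq{ substr($text, $_, 2) }++ for 0 .. length($text) - 2;

# The most frequent bigrams are what a detector would compare
# against per-encoding frequency tables
my @top = (sort { $freq{$b} <=> $freq{$a} } keys %freq)[0 .. 2];
print join(", ", map { "'$_' x $freq{$_}" } @top), "\n";
```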

It should be said that all these methods are probabilistic in nature: if there are not enough words, no algorithm can determine the encoding automatically. Different programming environments therefore handle encodings each in their own way, and there is no setup in which everything is determined by itself.

Unicode and Perl. Historical perspective. According to the Unicode glossary (www.unicode.org/glossary), there are 7 possible encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE. The term Unicode itself is defined as "... a standard for the digital representation of the characters used in writing by all of the world's languages ...". In addition, there is also UTF-7, which is not part of the standard but is supported by Perl via Encode::Unicode::UTF7 (see also RFC 2152).

UTF-7 is almost never used. Here is what the Encode::Unicode::UTF7 documentation says: "... however, if you want to use UTF-7 for documents in mail and web pages, do not use it unless you are sure that the recipients and readers (of those documents) can handle this encoding ...".

Perl developers, following the progress of Unicode adoption in applications, have also implemented Unicode support in Perl. In addition, the Encode module supports many other encodings, both single-byte and multibyte; the list can be found in the Encode::Config package. For working with mail, the following MIME encodings are supported: MIME-Header, MIME-B, MIME-Q, MIME-Header-ISO_2022_JP.

It should be said that UTF-8 is very widely used as the encoding of web documents. UTF-16 is used by Java and Windows; UTF-8 and UTF-32 are used by Linux and other Unix-like systems.

Unicode support first appeared in Perl 5.6.0; however, Perl 5.8.0 was recommended for any serious work with Unicode. Perl 5.14.0 is the first version in which Unicode support is (almost) seamlessly integrated, without most of the old pitfalls (the exception being some differences in quotemeta). Version 5.14 also fixes a number of bugs and deviations from the Unicode standard.

Visual Studio 2012 and encodings (for comparison with Perl). When we write a C# application in Visual Studio, we do not think about the encoding in which all of this is stored and processed. When you create a document, Visual Studio creates it in UTF-8 and also adds a UTF-8 BOM at the start - the byte sequence 0xEF, 0xBB, 0xBF. When we convert a source file (already open in Visual Studio), for example from UTF-8 to CP1251, we get the error message
"Unicode substitution character while loading ... with Unicode (UTF-8) encoding. Preserve the original file contents."

If you open an existing file in cp1251, ToUpper(), for example, will work correctly; but if you convert the file to KOI8-R, open it in Visual Studio and run it, there can be no question of correct behavior: the environment does not know the file is in KOI8-R, and how could it find out?

"The Unicode Bug" in Perl. Just as in Visual Studio, something similar happens with a Perl program, except that in Perl the developer can explicitly specify the encoding of the application's source code. That is why, when beginners start programming in Perl, open their favorite editor on a Russian-language Windows XP, and write something like the following in ANSI (i.e. cp1251)

 use strict;
 use warnings;

 my $a = "слово";
 my $b = "СЛОВО";
 my $c = "word";

 print "Words are equal" if uc($a) eq uc($b);

and the output shows that the strings in the variables are not equal, it is at first difficult to understand what is going on. Similar things happen with regular expressions and string functions (although uc($c) will work correctly).

This is the so-called "Unicode Bug" in Perl (see the documentation for details): in different single-byte encodings, characters with codes 128 to 255 have different meanings. For example, the Cyrillic letter П has the code 0xCF in cp1251, 0x8F in CP866, and 0xF0 in KOI8-R. How, then, can string functions such as uc(), ucfirst(), lc(), lcfirst(), or \L, \U in regular expressions work correctly?
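You can see these differing byte values for yourself with Encode (a small sketch; the letter and the encodings are the ones from the paragraph above):

```perl
use strict;
use warnings;
use Encode;

# U+041F CYRILLIC CAPITAL LETTER PE gets a different byte
# in each of these single-byte encodings
for my $enc (qw/cp1251 cp866 koi8-r/) {
    my $octet = encode($enc, "\x{41F}");
    printf "%-7s 0x%02X\n", $enc, ord($octet);
}
# cp1251  0xCF
# cp866   0x8F
# koi8-r  0xF0
```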

It is enough to "prompt" the interpreter that the encoding of the source file is cp1251, and everything will work correctly. More precisely, in the code below the variables $a and $b will hold strings in Perl's internal format.

 use strict;
 use warnings;
 use encoding 'cp1251';

 my $a = "слово";
 my $b = "СЛОВО";

 print "equal" if uc($a) eq uc($b);


Perl's internal string format. In reasonably recent versions of Perl, strings can be stored in a so-called internal format (Perl's internal form). Note that they can also be stored as a plain sequence of bytes. In the example above, where the source file encoding was not explicitly specified (with use encoding 'cp1251';), the variables $a, $b, $c store just bytes (the Perl documentation uses the term octet sequence for this).

The internal format differs from a plain byte sequence in that the string data is stored in UTF-8 and the variable's UTF8 flag is set. Here is an example. Change the program's source code to the following:

 use strict;
 use warnings;
 use encoding 'cp1251';
 use Devel::Peek;

 my $a = "слово";
 my $b = "СЛОВО";

 print Dump($a);


This is what we get as a result.

SV = PV(0x199ee4) at 0x19bfb4
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x19316c "\321\201\320\273\320\276\320\262\320\276"\0 [UTF8 "\x{441}\x{43b}\x{43e}\x{432}\x{43e}"]
  CUR = 10
  LEN = 12

Note that FLAGS = (PADMY,POK,pPOK,UTF8). If we remove use encoding 'cp1251'; then we get

SV = PV(0x2d9ee4) at 0x2dbfc4
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2d316c "\321\201\320\273\320\276\320\262\320\276"\0
  CUR = 10
  LEN = 12

When we specify that the file's source code is in cp1251 (or any other encoding), Perl knows it must convert the string literals in the source from that encoding to the internal format (here, from cp1251 to internal UTF-8), and it does so.

A similar encoding-determination problem arises when working with data received "from outside", for example from files or from the web. Let us consider each case.

Suppose we have a file in the cp866 encoding that contains the word "Когда" ("When", with a capital letter). We need to open it and scan all lines for the word "когда". Here is how to do it correctly (the source code itself must be in utf8).

 use strict;
 use warnings;
 use encoding 'utf8';

 open (my $tmp, "<:encoding(cp866)", $ARGV[0]) or die "Error open file - $!";
 while (<$tmp>) {
     if (/когда/i) {
         print "OK\n";
     }
 }
 close ($tmp);


Note that if we do not use "<:encoding(cp866)" but instead specify use encoding 'cp866', the regular expressions will still run, but only against a set of bytes, and /когда/i will not work. The "<:encoding(cp866)" construct tells Perl that the data in the text file is in CP866, so it correctly transcodes from CP866 to the internal format (CP866 -> UTF-8 + the UTF8 flag is set).

In the following example, we fetch a page using LWP::UserAgent. Here is the correct way to do it.

 use strict;
 use warnings;
 use LWP::UserAgent;
 use HTML::Entities;
 use Data::Dumper;
 use Encode;
 use Devel::Peek;

 my $ua = LWP::UserAgent->new();
 my $res = $ua->get("http://wp.local");
 my $content;
 if (!$res->is_error) {
     $content = $res->content;
 } else {
     exit(1);
 }

 # If the site were in cp1251, we would decode from cp1251 instead:
 # $content = decode('cp1251', $content);

 # decode from utf8 (octets) into Perl's internal format
 $content = decode('utf8', $content);

 # now $content holds a string in the internal format, so modules that
 # work with characters (for example HTML::Entities) behave correctly
 decode_entities($content);


Notice the call $content = decode('utf8', $content).

LWP::UserAgent works with bytes; it does not know, and it is not its concern, whether the page is in single-byte cp1251 or in UTF-8 - we must state this explicitly. Unfortunately, much of the literature contains examples that are in English and written for older versions of Perl, so those examples say nothing about transcoding.
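As an aside, HTTP::Response (the class LWP uses for responses) can perform this decoding for you: its decoded_content method consults the charset declared in the Content-Type header. Here is a sketch with a hand-built response (the header and bytes are invented for the example); keep in mind the warning below that such headers may simply be wrong:

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand: cp1251 octets plus a header declaring cp1251
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1251' ],
    "\xF1\xEB\xEE\xE2\xEE",          # "слово" in cp1251
);

# decoded_content transcodes the body using the declared charset
my $text = $res->decoded_content;
printf "U+%04X\n", ord($text);        # first character: U+0441, "с"
```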

Search engine robots (and other such code), for example, must not only correctly determine the encoding of pages - without blindly trusting the server response headers or the HTML meta tag, which may be wrong - but also determine the page's language. So do not think that everything described above concerns only Perl programmers.

With this example of receiving external data from a web site, we have arrived at the Encode module. Here is its main API, which matters in the daily work of any Perl programmer:

 $string = decode(ENCODING, OCTETS[, CHECK]) - converts a sequence of octets (bytes) in the encoding ENCODING into a string in Perl's internal format;
 $octets = encode(ENCODING, STRING[, CHECK]) - converts a string in Perl's internal format into a sequence of octets in the encoding ENCODING;
 [$length =] from_to($octets, FROM_ENC, TO_ENC[, CHECK]) - converts a sequence of octets from one encoding to another, in place.
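A short sketch of all three calls together (the example string is my own; it is the letter П from the earlier discussion):

```perl
use strict;
use warnings;
use Encode;

my $string = "\x{41F}";                         # internal-format string, letter П
my $octets = encode('cp1251', $string);         # internal format -> cp1251 bytes
printf "cp1251: 0x%02X\n", ord($octets);        # 0xCF

my $back = decode('cp1251', $octets);           # cp1251 bytes -> internal format
print "round trip ok\n" if $back eq $string;

Encode::from_to($octets, 'cp1251', 'koi8-r');   # re-encode the bytes in place
printf "koi8-r: 0x%02X\n", ord($octets);        # 0xF0
```

Note that from_to modifies $octets itself and works purely with bytes, which is exactly the pitfall demonstrated a little further below.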


In the example where we opened a text file in CP866, we could omit "<:encoding(cp866)". Then every read operation would give us a set of bytes in CP866. We can convert them to the internal format with

 $str = decode('cp866',$str) 


and continue working with the $str variable.

Some may assume that it is enough to use utf8 for the program's source code and, in addition, recode from cp866 to utf8, and everything will work as it should. It will not; consider an example (the text file contains the word "Когда" with a capital letter).

 use strict;
 use warnings;
 use encoding 'utf8';
 use Encode;

 #open (my $tmp, "<:encoding(cp866)", $ARGV[0]) or die "Error open file - $!";
 open (my $tmp, "<", $ARGV[0]) or die "Error open file - $!";
 while (<$tmp>) {
     my $str = $_;
     Encode::from_to($str, 'cp866', 'utf8');
     if ($str =~ /когда/i) {
         print "OK\n";
     }
 }
 close ($tmp);


After Encode::from_to($str, 'cp866', 'utf8') runs, $str contains data in utf8, but as a sequence of bytes (octets), so /когда/i does not work. For everything to behave as desired, you need to add the call

 $str = decode('utf8',$str) 


Of course, the simpler option is one line instead of two:

 $str = decode('cp866',$str) 


Perl's internal string format in more detail. We have already said that regular expressions, some modules, and string functions work correctly with strings stored not as a set of bytes but in Perl's internal representation, and that UTF-8 is used as the internal storage format. This encoding was not chosen by chance: character codes 0-127 in it coincide with ASCII (US-ASCII), which covers exactly the English alphabet. That is why calling uc on a string whose codes are all in the 0-127 range works correctly regardless of the single-byte encoding in which the source code is stored; for UTF-8 sources it also works correctly.
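A short demonstration (the cp1251 byte string is my own example data): uc on pure ASCII needs no preparation at all, while Cyrillic data must first be decoded into the internal format:

```perl
use strict;
use warnings;
use Encode;

# Pure ASCII: codes 0-127 mean the same thing in all these encodings
print uc("perl"), "\n";                          # PERL

# "слово" as cp1251 octets; decode first, then uc works as expected
my $bytes = "\xF1\xEB\xEE\xE2\xEE";
my $upper = encode('cp1251', uc(decode('cp1251', $bytes)));
printf "%vX\n", $upper;                          # D1.CB.CE.C2.CE = "СЛОВО" in cp1251
```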

However, this is not all you need to know.

UTF-8 vs utf8 vs UTF8. The UTF-8 encoding has become "stricter" over time (for example, certain characters were forbidden), so Perl's original utf8 implementation is out of date with respect to the standard. As of Perl 5.8.7, "UTF-8" means the modern, "strict" dialect, whereas "utf8" means the more liberal old dialect. Here is a small example.

 use strict;
 use warnings;
 use Encode;

 # U+FDD0 is a noncharacter, forbidden by the strict UTF-8 dialect
 my $str = "\x{FDD0}";
 eval { encode("UTF-8", $str, 1) };   # error
 print "UTF-8: $@" if $@;
 $str = encode("utf8", $str, 1);      # OK


Thus, the hyphen between “UTF” and “8” is important; without it, Encode becomes more liberal and possibly overly permissive. If you run

 use strict;
 use warnings;
 use Encode;

 my $str = sprintf("%s | %s | %s | %s | %s\n",
     find_encoding("UTF-8")->name,
     find_encoding("utf-8")->name,
     find_encoding("utf_8")->name,
     find_encoding("UTF8")->name,
     find_encoding("utf8")->name);
 print $str;

We get the following result - utf-8-strict | utf-8-strict | utf-8-strict | utf8 | utf8.

Working with the console. Consider the Windows console. As everyone knows, Windows has the notions of Unicode, ANSI, and OEM encodings. The OS API provides two kinds of functions, working with ANSI and with Unicode (UTF-16). The ANSI encoding depends on the OS localization; for the Russian version it is CP1251. OEM is the encoding used for console I/O; for Russian-language Windows it is CP866 - the encoding used in Russian-language MS-DOS, which later migrated to Windows for backward compatibility with old software. That is why the following program in utf-8

 use strict;
 use warnings;
 use Encode;
 use encoding 'utf8';

 my $str = 'привет';
 print $str;


will not print the coveted line: we output UTF-8 when CP866 is needed. Here the Encode::Locale module helps. If you look at its source code, you will see that on Windows it determines the ANSI and console encodings and creates the aliases console_in, console_out, locale, and locale_fs. All that remains is to change our program slightly.

 use strict;
 use warnings;
 use Encode::Locale;
 use Encode;
 use encoding 'utf8';

 my $str = 'привет';
 if (-t) {
     binmode(STDIN,  ":encoding(console_in)");
     binmode(STDOUT, ":encoding(console_out)");
     binmode(STDERR, ":encoding(console_out)");
 }
 print $str;


P.S. This article is aimed at those who are starting to work with Perl, and it may be a bit rough around the edges. I am ready to hear out and implement wishes for expanding it.

Source: https://habr.com/ru/post/163439/

