
A common misconception is that character strings, unlike byte strings, always have the UTF-8 flag set.
Many people also assume that if the data is ASCII-7bit, then the UTF-8 flag simply does not matter.
In fact, the flag can be set or cleared on ASCII data, on character data, and on completely arbitrary binary data.
Marc Lehmann, well known in the Perl community, comments on this in the JSON::XS module documentation:
"You can have Unicode strings with that flag set, with that flag clear, and you can have binary data with that flag set and that flag clear."
Consider the case where ASCII-7bit data has the UTF-8 flag set.

use utf8; use strict; use warnings;
my $u = "Hello, München";            # any string containing non-ASCII characters (the literal here is illustrative)
my ($ascii) = split / /, $u;         # "Hello," - pure ASCII-7bit
print "UTF-8 flag set!\n" if utf8::is_utf8($ascii);

This code prints "UTF-8 flag set!". That is, the ASCII-7bit string acquired the flag after split() broke a Unicode string (which carried the UTF-8 flag) into parts. In other words, the programmer does not control whether his ASCII data will have the UTF-8 flag or not; it depends on where the data came from, how it was obtained, and what other data it happened to be next to.
The same effect is obtained by decoding ASCII-7bit bytes into ASCII-7bit characters with Encode::decode():

use strict; use warnings;
use Encode;
my $ascii = 'x';                              # ASCII-7bit bytes
my $chars = decode('UTF-8', $ascii);          # bytes -> characters (the encoding name is chosen for illustration)
my $bytes = encode('UTF-8', $chars);          # characters -> bytes again
print "data unchanged\n"  if $bytes eq $ascii;
print "UTF-8 flag set!\n" if utf8::is_utf8($chars);

That is, re-encoding back and forth does not change the data (which is expected), but it does set the UTF-8 flag.
(However, this behavior of decode() contradicts its own documentation, which in turn contradicts the idea that there should be no documented guarantees about the UTF-8 flag on ASCII data.)
The appearance of the UTF-8 flag here can be explained by efficiency considerations: after a split it would be too expensive to scan every resulting string to check whether it consists only of ASCII characters and the flag can therefore be cleared.
This behavior of the UTF-8 flag is similar to a virus — it infects all the data with which it comes in contact.
Now consider the case where non-ASCII Unicode characters do not have the UTF-8 flag set.

use strict; use warnings;
use Digest::SHA qw/sha1_hex/;
use utf8;
my $s  = "µ";
my $s1 = $s;
my $s2 = $s;
my $digest = sha1_hex($s2);          # the call silently downgrades $s2 in place
print "utf-8 bit ON (s1)\n" if utf8::is_utf8($s1);
print "utf-8 bit ON (s2)\n" if utf8::is_utf8($s2);
print "s1 and s2 are equal\n" if $s1 eq $s2;

prints:

utf-8 bit ON (s1)
s1 and s2 are equal

That is, a call into a third-party module has cleared the UTF-8 flag on $s2. At the same time, the string with the flag and the string without it turn out to be completely identical.
This can only happen with characters > 127 and <= 255 (i.e. Latin-1).
What actually happened to $s2 is the utf8::downgrade operation.
This function is described in the documentation as changing the internal representation of the string:

Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The logical character sequence itself is unchanged.
To be fair, the Digest::SHA module documents this behavior, although it is not obliged to:

Be aware that the digest routines silently convert UTF-8 input into its equivalent byte sequence in the native encoding (cf. utf8::downgrade). This side effect influences only the way Perl stores the data internally, but otherwise leaves the actual value of the data intact.
In general, any third-party function may downgrade a string without mentioning it in its documentation (or, for example, do it only occasionally).
Now consider the case where completely arbitrary binary data has the UTF-8 flag set.

The example (use utf8; use strict; use warnings; ...) took three bytes of binary data, concatenated them with an ASCII-7bit string once without the UTF-8 flag (giving $bin_a) and once with it (giving $bin_u), printed the length of each result in characters and in bytes, compared the two results, and then wrote each of them to a file (file_a.tmp and file_u.tmp, whose MD5 sums are shown last). It prints:
original bin length: 3 3
bin_a length: 4 4
bin_u length: 4 7
bin_a and bin_u are equal!
33818f4b23aa74cddb8eb625845a459a file_a.tmp
33818f4b23aa74cddb8eb625845a459a file_u.tmp
The result: after concatenation with an ASCII string, the binary data grew in internal size in bytes (though not in characters) from 4 to 7, but only when the supposedly meaningless UTF-8 flag was set on that ASCII string.
Yet when the two results are compared they are identical, and when both strings are written to a file, even without specifying an encoding, the resulting files are identical too.
So binary data can grow in size and acquire the UTF-8 flag, and this is not a bug: all Perl built-in functions process such data exactly as if the flag were not there (and if there are exceptions, the bug is in them).
Any other Perl code must also process such data without errors (as long as it does not try to analyze the internal structure of the string, or at least analyzes it correctly).
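To see the effect in isolation, here is a minimal sketch with arbitrary, illustrative byte values (the lengths behave exactly as above, though the checksums will of course differ from those in the original output):

use strict; use warnings;
use bytes ();                                # only for bytes::length(), to inspect the internal size

my $bin   = "\xE3\xE4\xE5";                  # 3 arbitrary high bytes, UTF-8 flag off
my $asc_a = "x";                             # ASCII-7bit, flag off
my $asc_u = "x";
utf8::upgrade($asc_u);                       # the same byte, but stored with the UTF-8 flag on

my $bin_a = $bin . $asc_a;
my $bin_u = $bin . $asc_u;                   # concatenating with a flagged string upgrades the result

printf "original bin length: %d %d\n", length($bin),   bytes::length($bin);    # 3 3
printf "bin_a length: %d %d\n",        length($bin_a), bytes::length($bin_a);  # 4 4
printf "bin_u length: %d %d\n",        length($bin_u), bytes::length($bin_u);  # 4 7
print  "bin_a and bin_u are equal!\n"  if $bin_a eq $bin_u;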
In fact, what happened to the binary data is the utf8::upgrade operation: the data was interpreted as Latin-1, converted to UTF-8, and the UTF-8 flag was set. This operation is the opposite of utf8::downgrade, described above.
utf8::downgrade can only be performed on strings whose characters all fit into Latin-1, while utf8::upgrade can be performed on any bytes (since every byte corresponds to some Latin-1 character).
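A short illustration of this asymmetry (the byte and character values are arbitrary):

my $bytes = "\x41\xFF\x00";       # any binary data at all
utf8::upgrade($bytes);            # always succeeds: every byte maps to a Latin-1 character
utf8::downgrade($bytes);          # and back again - fine, all characters are <= 255
my $wide = "\x{263A}";            # a character above 255
utf8::downgrade($wide, 1)         # the second argument means "do not die, just return false"
    or print "cannot downgrade wide characters\n";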
This can matter if you hold a large amount of binary data in memory. It is not much fun when a 400-megabyte blob suddenly becomes a 700-megabyte one just because you appended a single ASCII-7bit byte that carried the UTF-8 flag. A good way out here is unit tests or runtime assertions that check the UTF-8 flag.
In general, it is impossible to distinguish between bytes and characters.
Consider a task: write a function that takes XML as input; if the XML is bytes, it should read the encoding from the xml declaration and decode the bytes into characters, and if it is already characters, it should do nothing.
Such a function cannot be written. For example, given the character string "Hello, München", the function cannot tell whether these are characters, or bytes encoded in CP1251, or in KOI8-R (in case the string has been downgraded, which, as we saw, the programmer does not really control).
For characters > 255 the UTF-8 flag is always set (utf8::downgrade is impossible for them). For characters with codes <= 127 the UTF-8 flag does not matter, in the sense that they can be viewed both as binary data and as characters. And Latin-1 characters cannot be distinguished from bytes at all.
Distinguishing bytes from characters in Perl is like distinguishing a file name from an e-mail address or from a person's name: sometimes it is possible, but not in general. The programmer himself has to remember what is stored in which variable.
The documentation (perldoc.perl.org/perlunifaq.html) says as much:

How can I determine if a string is a text string or a binary string?

You can't. Some people use the UTF-8 flag for this, but that only makes modules like Data::Dumper look bad, and because of the 8-bit internal encoding (ISO-8859-1 by default) the flag is useless for this purpose. This is something you need to keep track of yourself; sorry. You could consider adopting a kind of "Hungarian notation" to help with this.
If you really need to do this, you can create your own class that holds either a string of bytes or a string of characters, plus a flag saying which of the two it is (the same trick works for e-mail address vs file name vs person's name).
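A possible shape for such a wrapper (the names here are illustrative, not an existing module):

package MyData;
use strict; use warnings;

sub new_bytes { my ($class, $octets) = @_; return bless { data => $octets, kind => 'bytes' }, $class }
sub new_chars { my ($class, $chars)  = @_; return bless { data => $chars,  kind => 'chars' }, $class }
sub kind { return $_[0]{kind} }   # 'bytes' or 'chars' - tracked explicitly, not guessed from the UTF-8 flag
sub data { return $_[0]{data} }

1;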
"Wide character" warnings are not issued for Latin-1 characters
The following example issues the warning "Wide character in print" only if we print $s2:

use strict; use warnings; use utf8;
my $s1 = "ß";
my $s2 = "\x{20AC}";               # any character above 0xFF (illustrative value)
my $s  = $ARGV[0] ? $s1 : $s2;
print $s;

If we print $s1, Perl converts the Unicode character ß (U+00DF, UTF-8 \xC3\x9F) to the single byte \xDF and writes that to the screen, with no warning.
The same behavior applies to all functions that expect bytes rather than characters (print, syswrite without an encoding layer, the SHA, MD5 and CRC32 checksums, MIME::Base64).
Viral downgrade
At the beginning of the article we saw the "viral" behavior of the UTF-8 flag on ASCII characters (a viral utf8::upgrade). Now consider the "viral" clearing of the UTF-8 flag on Latin-1 characters (a viral utf8::downgrade).
Imagine that we are writing a function that is defined only on bytes, not on characters; good examples are hash functions, encryption, compression, MIME::Base64, and so on.
1. Since binary data cannot be distinguished from characters, the input must be treated as bytes.
2. The bytes may arrive in upgraded form (that is, with the UTF-8 flag set). The result must be the same as for the downgraded form.
Therefore the function needs to perform utf8::downgrade and raise an error if that fails.
For algorithms such as hash functions, performance matters. Making a second copy of the data in memory is not efficient, so in most cases the function modifies the parameter passed to it in place.
As many probably know, in Perl all parameters are passed by reference, but they are usually then used by value.

sub mycode { $_[0] = "X" }                   # modifies the caller's variable: $_[0] is an alias to the actual argument
sub mycode { my ($arg1) = @_; $arg1 = "X" }  # works on a copy; the caller's variable stays untouched

Thus, by writing code that behaves exactly as the Perl specification requires, we end up with code that implicitly performs utf8::downgrade on its actual parameters, regardless of the caller's wishes, and thereby possibly triggers a bug in some other place that handled strings incorrectly but had worked fine until then.
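Putting these points together, a byte-oriented function written strictly by these rules might look like this (a sketch; the name is made up):

sub my_byte_digest {
    # treat the argument as bytes; refuse genuine wide characters (> 255)
    utf8::downgrade($_[0], 1)
        or die "Wide character in my_byte_digest";
    # from here on $_[0] is a plain octet string; note that the caller's variable
    # has just been downgraded in place, because $_[0] is an alias to it
    return length $_[0];
}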
For file names, this does not work.
Functions that take file names as arguments (open, the -X file tests), as well as functions that return file names (readdir), do not obey these rules (this is noted in the documentation).
They simply take the file name as it is represented in memory.
Their behavior can be described roughly as follows:

sub open {
    my (..., $filename) = @_;       # pseudocode
    utf8::_utf8_off($filename);     # use the internal bytes as they are, with no re-encoding
    # ... pass those bytes to the operating system ...
}
There are several reasons for this:
1. On many POSIX systems (Linux/*BSD), on many file systems, a file name may be an arbitrary sequence of bytes, not necessarily a sequence of characters in any encoding.
2. There is no portable way to determine the file system encoding.
3. There may be several file systems with different encoding on the machine.
4. You cannot rely on the assumption that the encoding of file names matches the locale encoding.
5. Compatibility with old code must be preserved.
As a result, the programmer has to determine the encoding himself and communicate it to the interpreter, but no API for doing so exists yet.
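Until such an API appears, the conversion has to be done by hand, for example like this (assuming you happen to know that the file system stores names in UTF-8):

use strict; use warnings;
use utf8;                                        # so the literal below is a character string
use Encode qw/encode/;
my $name_chars = "München.txt";
my $name_bytes = encode('UTF-8', $name_chars);   # explicit characters -> octets conversion
open my $fh, '>', $name_bytes or die $!;         # the OS now receives well-defined bytes
close $fh;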
Let us modify our earlier example, where we "accidentally" ran into a downgrade of a character string.
use strict; use warnings;
use Digest::SHA qw/sha1_hex/;
use utf8;
my $s  = "µ";
my $s1 = $s;
my $s2 = $s;
my $digest = sha1_hex($s2);                     # downgrades $s2 in place, as before
print "s1 and s2 are equal\n" if $s1 eq $s2;
open my $out, '>', $s1 or die "s1 failed: $!";  # creates a file named with the two UTF-8 bytes of "µ"
close $out;
open my $in, '<', $s2 or die "s2 failed: $!";   # looks for a file named with the single Latin-1 byte

The result:

s1 and s2 are equal
s2 failed: no such file or directory

That is, the strings $s1 and $s2 are equal, yet they point to different files; remove the sha1_hex() call and they point to the same file.
You can step on the same rake with any module that works with files (for example, File::Find).
Where else it does not work
The Encode module has a decode_utf8 function, documented as:

Equivalent to $string = decode("utf8", $octets [, CHECK])

But in reality, if $octets has the UTF-8 flag set, the function simply returns it unchanged (although it ought to attempt utf8::downgrade, work with the result as with binary data, and raise a "Wide character" error if the downgrade fails).
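If your version of Encode behaves as described, the difference can be observed with a sketch like this:

use strict; use warnings;
use Encode qw/decode_utf8/;

my $octets  = "\xC2\xB5";            # the UTF-8 encoding of µ, UTF-8 flag off
my $flagged = "\xC2\xB5";
utf8::upgrade($flagged);             # the same two characters, but with the UTF-8 flag on

my $a = decode_utf8($octets);        # "µ" - the two octets are decoded
my $b = decode_utf8($flagged);       # returned as-is, not decoded
print length($a), " ", length($b), "\n";   # 1 2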
This bug was reported (RT#61671, RT#87267) as soon as it appeared, back in 2010.
But the maintainer rejects all such bug reports. The point of the reports is not even that the function must behave correctly (in accordance with Perl's model of strings), and not even that the documentation must describe the actual behavior, but merely that the behavior, at the very least, must not contradict the existing documentation. The maintainer's position is that the functions are documented as equivalent, and that equivalent does not mean identical (although, to my mind, equivalence can be read both as similarity and as identity). Perhaps in mathematics equivalence does not even hint at identity... If someone can solve this riddle, I will be very grateful.
The unicode bug
In downgraded form Latin-1 characters cannot be distinguished from bytes, so in this form some regular-expression metacharacters, as well as the functions uc, lc and quotemeta, do not work properly.
The workaround is utf8::upgrade or, in newer versions of Perl, the unicode_strings feature, which makes this behavior consistent.
See the Perl documentation (perlunicode, "The Unicode Bug") for details.
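A small sketch of the effect on Perls where unicode_strings is not enabled (the character is arbitrary):

my $down = "\xE4";          # the character ä in downgraded, single-byte form
my $up   = $down;
utf8::upgrade($up);         # the very same character, upgraded
# on affected Perls uc() uppercases only the upgraded copy:
print uc($down) eq uc($up) ? "consistent\n" : "the Unicode bug\n";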
What to do with all this?
1. Do not use (unless you know exactly what you are doing) the following: utf8::is_utf8, Encode::_utf8_on, Encode::_utf8_off, and everything from the bytes module (the documentation for all of them recommends against their use, except for debugging).
2. Use utf8::upgrade and utf8::downgrade whenever the Perl specification requires it.
3. To convert between characters and bytes, use Encode::encode and Encode::decode.
4. If you use someone else's code that violates these rules, check it for bugs and apply workarounds.
5. When working with file names, either use a wrapper around all the relevant functions, or verify with tests that the internal representation of the file names does not change while your code runs.
There are a few cases where violating these rules seemed justified to me.
Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) && (bytes::length($_[0]) == length($_[0]));
(clears the UTF-8 flag on ASCII-7bit text, thereby gaining roughly a 30% speedup of regexps, in all Perls except 5.19)
defined($_[0]) && utf8::is_utf8($_[0]) && (bytes::length($_[0]) != length($_[0]))
(Returns TRUE if the string has the UTF-8 flag set and is not plain ASCII-7bit. Can be used in unit tests to make sure that your 400 megabytes of binary data are not quietly turning into 700.)
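For example, wrapped into a runtime assertion (the sub name is made up):

use bytes ();   # for bytes::length()

sub assert_not_upgraded {
    die "binary blob has been upgraded!"
        if defined($_[0]) && utf8::is_utf8($_[0])
        && bytes::length($_[0]) != length($_[0]);
}

# assert_not_upgraded($blob);   # call this after any operation you do not fully trust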
There is also the option of doing nothing. Honestly, it may take quite a while before you run into any of these bugs (but by then it will be too late). This option is not recommended for library authors.