
Unicode for Dummies

I am not myself a fan of headlines like "Pokemon stew for dummies / pots / pans", but this seems to be exactly the case: let's talk about basic things which, handled carelessly, lead to a pile of bumps and a lot of time wasted around "why doesn't it work?". If you are still afraid of Unicode, or don't quite understand it, welcome under the cut.


What for?


That is the main question of a beginner who meets an impressive number of encodings and the seemingly confusing machinery for working with them (for example, in Python 2.x). The short answer: because that's how history went :)

An encoding, for those who don't know, is the way numbers, letters and all other characters are represented in a computer's memory (that is, as zeros and ones, i.e. as numbers). For example, a space is represented as 0b100000 (in binary), 32 (in decimal) or 0x20 (in hexadecimal).
So, once upon a time there was very little memory, and 7 bits were enough for every computer to represent all the necessary characters (digits, lowercase/uppercase Latin letters, a handful of symbols and the so-called control characters; all 128 available values, 0 through 127, were assigned to something). The encoding at that time was one: ASCII. As time went on, everyone was happy, and whoever was not happy (read: whoever lacked the "©" sign or the letters of their native alphabet) used the remaining 128 values at their own discretion, that is, created new encodings. That is how ISO 8859-1 and our (Cyrillic) cp1251 and KOI8 appeared. Together with them appeared the problem of interpreting bytes of the form 0b1xxxxxxx (that is, values from 128 to 255): for example, 0b11011111 in the cp1251 encoding is our native "Я", while in ISO 8859-1 it is the German Eszett (suggested by Moonrise) "ß". Predictably, network communication and plain file exchange between different computers turned into devil-knows-what, even though headers like 'Content-Encoding' in the HTTP protocol, in email and in HTML pages saved the situation somewhat.
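To make the ambiguity concrete, here is a minimal sketch (in byte-literal syntax, which behaves the same in Python 2.6+ and Python 3): the single byte 0xDF decodes into two completely different characters depending on which encoding we assume.

```python
# One byte, two readings: 0xDF is "Я" in cp1251 but "ß" in ISO 8859-1.
raw = b'\xdf'

as_cp1251 = raw.decode('cp1251')       # CYRILLIC CAPITAL LETTER YA
as_latin1 = raw.decode('iso-8859-1')   # LATIN SMALL LETTER SHARP S

assert as_cp1251 == u'\u042f'   # 'Я'
assert as_latin1 == u'\u00df'   # 'ß'
```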

At that point, bright minds gathered and proposed a new standard: Unicode. It is exactly a standard, not an encoding: Unicode itself does not define how characters are stored on a hard disk or transmitted over a network. It only defines the mapping between a character and a certain number, while the format by which those numbers are turned into bytes is defined by the Unicode encodings (for example, UTF-8 or UTF-16). Currently there are a little over 100 thousand characters in the Unicode standard, while UTF-16 can support more than a million (and UTF-8 even more).
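A quick sketch of that standard-versus-encoding distinction (again in syntax valid in both Python 2.6+ and Python 3): one and the same character, the Unicode point U+0431, becomes different byte sequences depending on which encoding we pick.

```python
# One character (one Unicode point), several byte representations.
s = u'\u0431'   # CYRILLIC SMALL LETTER BE, 'б'

assert len(s) == 1                            # one character...
assert s.encode('utf-8') == b'\xd0\xb1'       # ...two bytes in UTF-8
assert s.encode('utf-16-be') == b'\x04\x31'   # ...two bytes in UTF-16 (big-endian)
assert s.encode('cp1251') == b'\xe1'          # ...one byte in the legacy cp1251
```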

For a fuller and livelier treatment of the topic I recommend Joel Spolsky's magnificent The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Get to the point!


Naturally, Python supports Unicode. But unfortunately, only in Python 3 did all strings become Unicode, so beginners keep crashing into errors like:

    >>> with open('1.txt') as fh:
    ...     s = fh.read()
    >>> print s
    кощей
    >>> parser_result = u'баба-яга'   # say, some parser gave us back a unicode string
    >>> parser_result + s
    Traceback (most recent call last):
      File "<pyshell#43>", line 1, in <module>
        parser_result + s
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)

or like this:

    >>> str(parser_result)
    Traceback (most recent call last):
      File "<pyshell#52>", line 1, in <module>
        str(parser_result)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

Let's sort this out, one step at a time.

Why would anyone use Unicode?

Why does my favorite HTML parser return Unicode? Let it return an ordinary string, and I'll deal with it myself! Right? Not really. Although each of the characters existing in Unicode can (probably) be represented in some single-byte encoding (ISO 8859-1, cp1251 and the like are called single-byte because they encode any character in exactly one byte), what do we do when a string has to contain characters from different encodings? Assign a separate encoding to each character? No, of course; we have to use Unicode.
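A minimal sketch of that argument: a string that mixes a Cyrillic and a German letter fits no single one-byte code page, while any Unicode encoding handles it fine.

```python
# 'я' (Cyrillic) plus 'ß' (German): no single-byte code page has both.
mixed = u'\u044f\u00df'

def encodes(s, enc):
    """Return True if every character of s exists in encoding enc."""
    try:
        s.encode(enc)
        return True
    except UnicodeEncodeError:
        return False

assert not encodes(mixed, 'cp1251')       # knows 'я' but not 'ß'
assert not encodes(mixed, 'iso-8859-1')   # knows 'ß' but not 'я'
assert encodes(mixed, 'utf-8')            # a Unicode encoding handles both
```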

Why do we need a new type of "unicode"?

So we have reached the most interesting part. What is a string in Python 2.x? It is simply bytes. Just binary data that can be anything. In fact, when we write something like:
    >>> x = 'abcd'
    >>> x
    'abcd'
the interpreter does not create a variable that contains the first four letters of the Latin alphabet, but only the sequence
 ('a', 'b', 'c', 'd') 
of four bytes, where the Latin letters are used exclusively to denote those particular byte values. That is, 'a' here is just a synonym for writing '\x61', and nothing more. For example:

    >>> import struct
    >>> '\x61'
    'a'
    >>> struct.unpack('>4b', x)     # 'x' is just four signed/unsigned chars
    (97, 98, 99, 100)
    >>> struct.unpack('>2h', x)     # or two shorts
    (24930, 25444)
    >>> struct.unpack('>l', x)      # or one long
    (1633837924,)
    >>> struct.unpack('>f', x)      # or a float
    (2.6100787562286154e+20,)
    >>> struct.unpack('>d', x * 2)  # or half of a double
    (1.2926117739473244e+161,)

And that's it!

And the answer to the question "why do we need unicode" becomes obvious: we need a type that represents characters, not bytes.

OK, I understand what a string is. Then what is Unicode in Python?

The unicode type is first of all an abstraction implementing the idea of Unicode (a set of characters and the numbers associated with them). An object of type unicode is no longer a sequence of bytes but a sequence of actual characters, with no notion of how those characters are efficiently stored in the computer's memory. If you like, it is a higher level of abstraction than byte strings (which is what Python 3 calls the regular strings of Python 2.6).

How to use Unicode?

A unicode string in Python 2.6 can be created in (at least) three natural ways:

    >>> u'abc'                    # a u'' literal
    u'abc'
    >>> unicode('abc', 'ascii')   # the unicode() built-in
    u'abc'
    >>> 'abc'.decode('ascii')     # the decode() method of a byte string
    u'abc'

'ascii' in the last two examples is specified as the encoding that will be used to turn bytes into characters. The stages of that transformation look like this:

    '\x61' -> decode with ascii  -> character "a" -> u'\u0061' (the unicode point of that character)
    '\xe0' -> decode with cp1251 -> character "а" -> u'\u0430'
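The same two chains, written as runnable code (bytes literals, so the snippet behaves identically in Python 2.6+ and Python 3):

```python
# decode = bytes -> characters; the encoding names the conversion rule.
assert b'\x61'.decode('ascii') == u'\u0061'    # Latin 'a'
assert b'\xe0'.decode('cp1251') == u'\u0430'   # Cyrillic 'а'
```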


How do you get a normal string back from a unicode one? Encode it:

    >>> u'abc'.encode('ascii')
    'abc'


The encoding algorithm is, naturally, the reverse of the one above.

Remember and do not confuse: unicode == characters, string == bytes; bytes -> something meaningful (characters) is de-coding (decode), and characters -> bytes is en-coding (encode).
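The mnemonic as a round trip, in a small sketch (the bytes here are "кощей" in cp1251, as in the examples above):

```python
raw = b'\xea\xee\xf9\xe5\xe9'   # 'кощей' encoded as cp1251 bytes

text = raw.decode('cp1251')     # bytes -> characters: DE-coding
assert text == u'\u043a\u043e\u0449\u0435\u0439'

back = text.encode('cp1251')    # characters -> bytes: EN-coding
assert back == raw              # a lossless round trip
```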

It won't encode :(

Let's go back to the examples from the beginning of the article. How does concatenation of a byte string and a unicode string work? The byte string must be turned into a unicode string, and since the interpreter does not know its encoding, it uses the default one, ascii. If that encoding fails to decode the string, we get an ugly error. In that case we must convert the byte string to unicode ourselves, using the correct encoding:

    >>> print type(parser_result), parser_result
    <type 'unicode'> баба-яга
    >>> s = 'кощей'
    >>> parser_result + s
    Traceback (most recent call last):
      File "<pyshell#67>", line 1, in <module>
        parser_result + s
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
    >>> parser_result + s.decode('cp1251')
    u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0\u043a\u043e\u0449\u0435\u0439'
    >>> print parser_result + s.decode('cp1251')
    баба-ягакощей
    >>> print '&'.join((parser_result, s.decode('cp1251')))
    баба-яга&кощей   # better this way :)


"UnicodeDecodeError" is usually evidence that you need to decode a string into Unicode using the correct encoding.
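The fix in one sketch (byte literals, so the snippet runs unchanged on Python 2.6+ and Python 3; the strings are the same баба-яга/кощей pair, written as escapes):

```python
u = u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'   # u'баба-яга'
b = b'\xea\xee\xf9\xe5\xe9'                          # 'кощей' as cp1251 bytes

# Decode the byte string explicitly, then concatenate unicode with unicode.
joined = u + b.decode('cp1251')
assert joined == u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430\u043a\u043e\u0449\u0435\u0439'
```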

Now, about using str() with unicode strings. Don't use str() with unicode strings :) str() gives no way to specify an encoding, so the default encoding is always used, and any character above 128 leads to an error. Use the encode method:

    >>> print type(s), s
    <type 'unicode'> кощей
    >>> str(s)
    Traceback (most recent call last):
      File "<pyshell#90>", line 1, in <module>
        str(s)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
    >>> s = s.encode('cp1251')
    >>> print type(s), s
    <type 'str'> кощей


"UnicodeEncodeError" is a sign that we need to specify the correct encoding when converting a unicode string into a regular one (or pass the second argument 'ignore' / 'replace' / 'xmlcharrefreplace' to the encode method).
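A sketch of those error-handler values in action:

```python
s = u'abc \u0431'   # ascii text plus one Cyrillic letter ('б', code point 1073)

assert s.encode('ascii', 'ignore') == b'abc '                     # drop what won't fit
assert s.encode('ascii', 'replace') == b'abc ?'                   # substitute '?'
assert s.encode('ascii', 'xmlcharrefreplace') == b'abc &#1073;'   # XML character reference
```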

I want more!

Well, let's torment Baba Yaga from the example above once more:

    >>> parser_result = u'баба-яга'                             #1
    >>> parser_result                                           #2
    u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
    >>> print parser_result                                     #3
    áàáà-ÿãà
    >>> print parser_result.encode('latin1')                    #4
    баба-яга
    >>> print parser_result.encode('latin1').decode('cp1251')   #5
    баба-яга
    >>> print unicode('баба-яга', 'cp1251')                     #6
    баба-яга

The example is not quite simple, but there is everything (or almost everything). What's going on here:
  1. What do we have on input? The bytes that IDLE passes to the interpreter. What do we need on output? Unicode, that is, characters. It remains to turn bytes into characters, but for that we need an encoding, right? Which encoding will be used? Read on.
  2. Here is an important point:
     >>> 'баба-яга'
     '\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
     >>> u'\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0' == u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
     True
    As you can see, Python does not bother with the choice of encoding - bytes simply turn into Unicode points:
     >>> ord('а')
     224
     >>> ord(u'а')
     224
  3. But the problem is that character 224 in cp1251 (the encoding used by the interpreter's console) is not at all the same as character 224 in Unicode. It is precisely because of this that we get mojibake when trying to print our unicode string.
  4. How do we help Baba Yaga? It turns out that the first 256 Unicode points coincide with the ISO 8859-1 / latin1 encoding, so if we use it to encode the unicode string, we get back the very bytes we typed in (the curious can look in Objects/unicodeobject.c for the definition of the function "unicode_encode_ucs1"):
     >>> parser_result.encode('latin1')
     '\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
  5. How do we get Baba Yaga into Unicode properly? We have to say which encoding to use:
     >>> parser_result.encode('latin1').decode('cp1251')
     u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'
  6. The approach from point #5 is certainly not great; it is much more convenient to use the built-in unicode().
In fact, things are not so bad with "u" literals, since the problem only occurs in the console. If non-ascii characters are used in a source file, Python will insist on a header like "# -*- coding: <encoding> -*-" (PEP 0263), and unicode literals will use the correct encoding.

There is also a way to use "u" literals to represent, for example, Cyrillic, without specifying an encoding or unreadable unicode points (that is, u'\u1234'). The method is not entirely convenient, but interesting: use named unicode escapes:

    >>> s = u'\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER SHCHA}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER SHORT I}'
    >>> print s
    кощей
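The stdlib module unicodedata maps both ways between characters and these names, which is handy for checking such escapes:

```python
import unicodedata

ka = u'\N{CYRILLIC SMALL LETTER KA}'
assert ka == u'\u043a'                                    # 'к'
assert unicodedata.name(ka) == 'CYRILLIC SMALL LETTER KA'
# lookup() goes the other way: name -> character
assert unicodedata.lookup('CYRILLIC SMALL LETTER KA') == ka
```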


Well, that seems to be everything. The main advice: do not confuse encode with decode, and understand the difference between bytes and characters.

Python 3

There is no code here, because there is no experience. Witnesses claim that everything there is much simpler and more fun. Whoever takes it upon themselves to demonstrate, under the cut, the differences between here (Python 2.x) and there (Python 3.x): respect to them.

Useful


Since we are talking about encodings, I will recommend a resource that occasionally helps defeat mojibake: http://2cyr.com/decode/?lang=ru .

Once again, the link to Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Unicode HOWTO: the official document on the where, how and why of Unicode in Python 2.x.

Thanks for your attention. I would be grateful for comments in private messages.

P.S. A link was shared to a translation of Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Source: https://habr.com/ru/post/135913/
