
Unicode crash in Python 3

From the translator: Armin Ronacher is a fairly well-known developer in the Python community (Flask, Jinja2, Werkzeug).
For quite a long time he has been waging a peculiar crusade against Python 3, but it is not easy to accuse him of hysteria or backwardness: his objections are dictated by serious development experience, and he argues his point of view in considerable detail. A note on terminology:
I have translated "coercion" as forced encoding conversion, and "byte string" literally, since the term "raw string" means something else.
A "historical" note: in 2012 Armin proposed PEP 414, which contained a number of measures to eliminate the Unicode problems; the PEP was accepted quite quickly, but the situation has hardly changed since. The text below was written on January 5, 2014.

It is becoming more and more difficult to have a reasonable discussion about the differences between Python 2 and 3, since one language is already dead while the other is actively developing. Unicode support in the two branches is a very complex topic, so instead of examining it directly, I will look at the underlying model for handling text and byte strings.


In this post I will show, using decisions made by the developers of the language and the standard library as examples, that Python 2 is better suited for working with text and byte strings.


Since I have had to maintain a large amount of code that works directly with conversion between byte strings and Unicode, the regression that happened in Python 3 causes me a lot of sadness. I am particularly annoyed by material from the core Python development team urging me to believe that Python 3 is better than 2.7.

Text representation model



The main difference between Python 2 and Python 3 is in the basic types that exist for working with strings and byte strings. In Python 3 we have one string type, str , which stores data as Unicode, and two byte types: bytes and bytearray .


In Python 2, on the other hand, we have two string types: str , which is sufficient for any goal or task and holds ASCII strings plus some undefined data beyond the 7-bit range, and alongside it the unicode type, equivalent to the str type of Python 3. To work with bytes, Python 2 additionally has one type, bytearray, backported from Python 3. Looking at the situation this way, you can see that something was removed from Python 3: support for non-Unicode string data. The compensation for that sacrifice was a hashable byte type ( bytes ). The bytearray type is mutable, and therefore it cannot be hashed. I very rarely use binary data as dictionary keys, so the ability or inability to hash binary data does not seem very important to me, especially in Python 2, where bytes can be put into a variable of type str without any problems.
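A minimal sketch of the Python 3 side of this model (run under a Python 3 interpreter), showing the one text type, the two byte types, and the hashability point:

```python
# Python 3's model: one text type and two byte types.
text = "naïve"            # str: always Unicode
data = b"\x00\x7f"        # bytes: immutable and hashable
buf = bytearray(data)     # bytearray: mutable, therefore unhashable

# bytes can serve as a dictionary key; bytearray cannot.
table = {data: "ok"}
try:
    table[buf] = "boom"
except TypeError:
    print("bytearray is unhashable")
```

The same dictionary-key experiment in Python 2 works with a plain str, which is exactly the byte-string role the article describes.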

Lost type



Python 3 dropped support for byte strings, which in the 2.x branch were of type str . On paper there is nothing wrong with this decision. From an academic point of view, strings that are always Unicode are wonderful. And this is true if your whole world is one interpreter. Unfortunately, in the real world things are different: you regularly have to work with different encodings, and here the Python 3 approach to strings is cracking at the seams.

I'll be honest with you: the way Python 2 handles Unicode is error-prone, and I fully approve of improving Unicode handling. My position is that the way it is done in Python 3 is a step backwards that introduces even more errors, and therefore I absolutely hate working with Python 3.

Errors when working with Unicode



Before I get into the details, we need to understand the differences in Unicode support between Python 2 and 3, as well as why the developers decided to change the mechanism.

Initially, Python 2, like many other languages created before it, had no support for handling data in multiple encodings.
A string was a string; it contained bytes. This required developers to handle the various
encodings correctly by hand, which was quite acceptable for many situations. For many years the Django web framework
did not work with Unicode at all, using only byte strings.

Meanwhile, Python 2 kept improving its internal Unicode support over the years. Better Unicode support
made it possible to use Unicode as the uniform representation for data arriving in different encodings.

The approach to handling strings in a specific encoding in Python 2 is quite simple:
you take a (byte) string, which may have come from anywhere, and convert it
from the encoding appropriate to its source (metadata, headers, and so on)
into a Unicode string. Having become a Unicode string, it supports all the same operations
as a byte string, but can now store a larger range of characters.
When you need to pass the string elsewhere for processing, you
convert it back into the encoding expected by the receiver,
and you again have a byte string.
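A sketch of this round trip, written here in Python 3 syntax (the Latin-1 source encoding is just an assumed example; any encoding named by the source's metadata would do):

```python
# A byte string arrives from some source whose metadata says Latin-1.
raw = b"caf\xe9"

# Decode at the boundary into Unicode...
text = raw.decode("latin-1")      # 'café'

# ...operate on it like any other string...
shouted = text.upper()            # 'CAFÉ'

# ...and encode it back for a receiver that expects, say, UTF-8.
out = shouted.encode("utf-8")     # b'CAF\xc3\x89'
```

In Python 2 the same calls exist on str and unicode; the difference the article is about is what happens when you skip this explicit boundary.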

What does this approach entail? For it to work at the core language level,
Python 2 must provide a way to get from the world without Unicode into the beautiful world with Unicode.
This is made possible by forced conversion between byte and non-byte strings. When does it happen,
and how does this mechanism work?

The main point is that when a byte string takes part in an operation together with a Unicode string,
the byte string is converted to a Unicode string through an implicit decoding process that uses the default encoding. That default is ASCII. Python provided a way to change the default encoding through a single module, but nowadays the functions for changing it are removed by site.py at startup, so it stays fixed at ASCII. If you run the interpreter with the -S flag, the sys.setdefaultencoding function remains available and you can experiment to find out what happens if you set the default encoding to UTF-8. The default encoding comes into play in a few situations:

1. Implicit encoding conversion during concatenation:

>>> "Hello " + u"World"
u'Hello World'


Here, the left-hand string is converted to a Unicode string using the default encoding. If it contains non-ASCII characters, under normal conditions the conversion stops with a UnicodeDecodeError exception, because the default encoding is ASCII.

2. Implicit encoding conversion when comparing strings:
>>> "Foo" == u"Foo"
True


This sounds more dangerous than it actually is. The left side is converted to Unicode, and then the comparison is made. If the left side cannot be converted, the interpreter issues a warning and the strings are considered unequal (False is returned as the result of the comparison). This is quite sensible behavior, even if it does not seem so when you first encounter it.


3. Explicit encoding conversion as part of the codec mechanism.

This is one of the most sinister things and the most common source of all Unicode failures and misunderstandings in Python 2. To overcome the problems in this area, Python 3 took the crazy step of removing the .decode() method from Unicode strings and the .encode() method from byte strings. This caused me the greatest confusion and annoyance. From my point of view it is a very stupid decision, but I have been told many times that I understand nothing and that there will be no going back.

The explicit conversion of the encoding when working with codecs looks like this:
>>> "foo".encode('utf-8')
'foo'


This string is obviously a byte string. We ask for it to be converted to UTF-8. That in itself is meaningless, since the UTF-8 codec converts a string from Unicode to a UTF-8-encoded byte string. So how does this work? The UTF-8 codec sees that the string is not a Unicode string, and therefore a forced conversion to Unicode is performed first. As long as "foo" is pure ASCII data and the default encoding is ASCII, this coercion succeeds, and only then is the resulting Unicode string u"foo" encoded to UTF-8.
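For contrast, a small sketch (run under Python 3) of how Python 3 sidesteps this hidden double conversion: str.encode goes straight from Unicode to bytes, and byte strings simply have no .encode at all:

```python
# No implicit coercion step: str.encode maps Unicode directly to bytes.
assert "foo".encode("utf-8") == b"foo"

# The Python 2 pitfall cannot even be expressed: bytes lack .encode.
try:
    b"foo".encode("utf-8")
except AttributeError:
    print("bytes have no .encode in Python 3")
```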

Codec mechanism



Now you know that Python 2 has two approaches to representing strings: bytes and Unicode, with conversion between these representations carried out through the codec mechanism. This mechanism does not impose a Unicode->byte scheme or anything like it: a codec can convert byte->byte or Unicode->Unicode. In fact, the codec system could implement a conversion between any Python types: you could have a JSON codec that converts a string into a complex Python object, if you decided such a conversion was useful for you.

This state of affairs can cause problems with understanding the mechanism, starting from its foundations. An example is the codec called 'undefined', which can be set as the default encoding; in that case, any forced conversion of strings is disabled:

>>> import sys
>>> sys.setdefaultencoding('undefined')
>>> "foo" + u"bar"
Traceback (most recent call last):
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


And how does Python 3 deal with codecs? Python 3 removed all codecs that do not perform Unicode <-> byte transformations and, in addition, the .encode() method of byte strings and the .decode() method of Unicode strings are gone. This was a very bad decision, since there were
very many useful codecs. For example, it was very common to use the hex codec in Python 2:
>>> "\x00\x01".encode('hex')
'0001'

You could argue that in this particular case the problem is solved by a module like binascii, but the issue is deeper: the codec modules are also available separately. For example, libraries that read from sockets used codecs to incrementally decode data from zlib-compressed streams:
>>> import codecs
>>> decoder = codecs.getincrementaldecoder('zlib')('strict')
>>> decoder.decode('x\x9c\xf3H\xcd\xc9\xc9Wp')
'Hello '
>>> decoder.decode('\xcdK\xceO\xc9\xccK/\x06\x00+\xad\x05\xaf')
'Encodings'
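The same incremental pattern is expressible in Python 3 through the codecs module, only on bytes. A sketch using freshly compressed data rather than a captured stream (the split point is arbitrary, as it would be when reading from a socket):

```python
import codecs
import zlib

payload = zlib.compress(b"Hello Encodings")

# Feed the compressed stream to the decoder in two arbitrary chunks,
# the way a socket reader would receive it.
decoder = codecs.getincrementaldecoder("zlib_codec")("strict")
result = decoder.decode(payload[:7])
result += decoder.decode(payload[7:], final=True)
assert result == b"Hello Encodings"
```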

In the end the problem was recognized, and these codecs were restored in Python 3.3. However, the user is now plunged back into confusion, since before the call the codecs provide no meta-information about the types they are able to process. For this reason, Python can now throw exceptions like the following:
>>> "Hello World".encode('zlib_codec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface


(Note that the codec is now called zlib_codec instead of zlib, since Python 3.3 did not keep the old codec aliases.)

And what would happen if we gave byte strings their .encode() method back? That is easy to check, even without hacking the Python interpreter: let's write a function with the same behavior:

import codecs

def encode(s, name, *args, **kwargs):
    codec = codecs.lookup(name)
    rv, length = codec.encode(s, *args, **kwargs)
    if not isinstance(rv, (str, bytes, bytearray)):
        raise TypeError('Not a string or byte codec')
    return rv


Now we can use this function as a replacement for the .encode() method of byte strings:

>>> b'Hello World'.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>> encode(b'Hello World', 'latin1')
Traceback (most recent call last):
  File "<stdin>", line 4, in encode
TypeError: Can't convert 'bytes' object to str implicitly


Aha! Python 3 is already able to deal with this situation: we get a sensible error message. I believe that even "Can't convert 'bytes' object to str implicitly" is much better and clearer than "'bytes' object has no attribute 'encode'".

Why not bring these encoding conversion methods (encode and decode) back? I really do not know, and I no longer think about it. I have been told repeatedly that I understand nothing and do not understand beginners, or that the "text model" has changed and my requirements for it are meaningless.

Byte strings lost



Now, on top of the codec system regression, string formatting operations have also changed: they are defined only for Unicode strings. In Python 2 the interpreter had implementations for both byte and Unicode strings. This approach was completely transparent to programmers: if an object needed a representation as a byte or a Unicode string, two methods were defined, __str__ and __unicode__. Yes, the forced encoding conversion was involved, which confused newcomers, but at least we had a choice.
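Python 3 does keep a two-method protocol of this kind, only the pair is now __str__/__bytes__, and nothing ever calls __bytes__ implicitly. A sketch with a hypothetical Token class (the class and its encoding choice are illustrative assumptions, not anything from the standard library):

```python
class Token:
    """A value with both a text and a wire (byte) representation."""

    def __init__(self, value: str) -> None:
        self.value = value

    def __str__(self) -> str:
        # Text representation, used by str(obj).
        return self.value

    def __bytes__(self) -> bytes:
        # Byte representation, used only by an explicit bytes(obj) call.
        return self.value.encode("utf-8")

t = Token("héllo")
assert str(t) == "héllo"
assert bytes(t) == b"h\xc3\xa9llo"
```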

Why is this useful? Because, for example, if you work with low-level protocols, you regularly need to format numbers into a specific place within a byte string.
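For that binary-protocol case, the struct module is what both branches offer; a minimal sketch of packing a hypothetical length-prefixed frame (the frame layout here is an invented example):

```python
import struct

# Hypothetical wire frame: 2-byte big-endian length prefix + payload.
payload = b"ping"
frame = struct.pack(">H", len(payload)) + payload
assert frame == b"\x00\x04ping"

# Unpacking on the receiving side:
(length,) = struct.unpack(">H", frame[:2])
assert frame[2:2 + length] == b"ping"
```

struct covers fixed binary fields, but it is no substitute for general %-style formatting into byte strings, which is the capability the next paragraph is about.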

The version control system used by the Python developers themselves does not run on Python 3, because for years the Python development team has not wanted to bring back formatting for byte strings.

All of the above shows that Python 3's model of string handling does not work in the real world. For example, some APIs in Python 3 were updated to work only with Unicode, which makes them completely unsuitable for real work situations. Using the standard library you could no longer parse a URL given as bytes, because of the implicit assumption that every URL is Unicode (and you will not be able to process non-Unicode e-mail messages this way either, unless you completely ignore the existence of binary attachments).

Previously this was fairly easy to fix, but since byte strings are now lost to developers, the URL processing library ended up with two implementations: one for Unicode, and a second for byte objects. Two implementations of the same function can lead to very different results of data processing:
>>> from urllib.parse import urlparse
>>> urlparse('http://www.google.com/')
ParseResult(scheme='http', netloc='www.google.com', path='/', params='', query='', fragment='')
>>> urlparse(b'http://www.google.com/')
ParseResultBytes(scheme=b'http', netloc=b'www.google.com', path=b'/', params=b'', query=b'', fragment=b'')

Looks similar enough? Not at all, because the results have completely different data types.
One of them is a tuple of strings; the other is more like a tuple of integer arrays. I have written about this before, and this state of affairs causes me suffering. Writing Python code now gives me serious discomfort or becomes extremely inefficient, because you have to pass through a large number of encoding conversions. That makes it very difficult to write code that implements everything it needs to. The idea that everything should be Unicode is very good in theory, but completely inapplicable in practice.

Python 3 is riddled with a bunch of crutches for handling situations where Unicode cannot be used, and for people like me, who work with such situations a lot, all this causes terrible irritation.

Our crutches do not work



Unicode support in the 2.x branch is imperfect and far from ideal: missing APIs, problems coming at you from all sides. But we, as programmers, made all of it work. Many of the methods we used for this can no longer be applied in Python 3, while some APIs would have to be modified to work well with it.

My favorite example is the handling of file streams, which can be either byte or text streams, but with no reliable way to determine which type of stream you have in front of you. The trick I helped popularize is to read zero bytes from the stream to determine its type. Now this trick no longer works reliably. For example, passing a urllib request object to the Flask function that processes JSON works in Python 2 but fails in Python 3:

>>> from urllib.request import urlopen
>>> r = urlopen('https://pypi.python.org/pypi/Flask/json')
>>> from flask import json
>>> json.load(r)
Traceback (most recent call last):
  File "decoder.py", line 368, in raw_decode
StopIteration


While this exception was being handled, another one was raised:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: No JSON object could be decoded
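For reference, the zero-byte probe itself can be sketched like this with in-memory streams (whether every file-like object answers the probe honestly is exactly the point of contention):

```python
import io

def is_binary_stream(f):
    """Probe a stream's type by reading zero bytes/characters."""
    return isinstance(f.read(0), bytes)

assert is_binary_stream(io.BytesIO(b"{}")) is True
assert is_binary_stream(io.StringIO("{}")) is False
```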


So what?



Besides the problems described above, Python 3's Unicode support has a bunch of others. I have started unfollowing Python core developers on Twitter because I was tired of reading about how wonderful Python 3 is, since it contradicts my experience. Yes, Python 3 has lots of goodies, but what was done to the handling of byte strings and Unicode is not among them.

(Worst of all, many of the really cool features of Python 3 usually work just as well in Python 2: yield from, nonlocal, SNI SSL support, and so on.)

In light of the fact that only 3% of Python developers actively use Python 3, while Python developers on Twitter loudly proclaim that the migration is going according to plan, I feel disappointed. I have described my experience with Python 3 in detail, along with the problems I would like to see eliminated.

I do not want to make predictions now, but I do want the Python 3 development team to listen to the community a little more. For 97% of us, Python 2 is the cozy little world we have worked in for years, so it is quite painful when someone comes along and declares: Python 3 is beautiful, and that is not up for discussion. Given the many regressions, that is simply not the case. With people starting to discuss Python 2.8 and Stackless Python 2.8, I do not know what failure looks like, if not this.

Source: https://habr.com/ru/post/208192/

