This post is devoted to the perennial problem of all pythonists: encodings. Recently I received a letter in which a friend complained that his program produced strings like:
u'\xd0\x9a\xd1\x83\xd1\x80\xd1\x83\xd0\xbc\xd0\xbe\xd1\x87'
Notice anything wrong? I did. The strings claim to be unicode, yet inside them sit encoded utf-8 bytes. Something is off here. Digging further and requesting the script that produces this, it became clear that the data is taken from the web: fetched in the quite usual way via urllib and then fed to lxml.html for parsing. Since urllib operates on byte strings only and could not have turned them into unicode this way, the culprit has to be lxml.
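A minimal sketch of that pipeline, roughly as it must have looked (Python 2, like the rest of the post; the URL and the xpath expression are made up for illustration, since the letter did not include the script itself):

import urllib
import lxml.html

# urllib hands back raw bytes; they go straight into the html parser
raw = urllib.urlopen('http://example.com/').read()
html = lxml.html.document_fromstring(raw)
# text nodes can come back as pseudo-unicode like the string above
titles = html.xpath('//title/text()')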
In general, lxml is a very cool library: fast, feature-rich, able to mimic the ElementTree interface, and it plays nicely with BeautifulSoup. It has long been popular among pythonists whenever xml has to be handled conveniently.
But this is a slightly different case: the html parser is used here, and that is exactly where these unpleasant string metamorphoses happen.
I decided to figure out what was going on and how to work around this behavior.
To start, I went to yandex.ru and looked at what html it serves. The content encoding is utf-8. The first thing that caught my eye was the absence of an encoding declaration; it is not mandatory, but it is still used quite often. So I put together a similar html:
data = """<html>
<head>
</head>
<body> </body>
</html>"""
html = lxml.html.document_fromstring(data)
Parsing it with lxml.html, I got, alas, the expected result:
>>> s
u'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82 \xd0\xbc\xd0\xb8\xd1\x80'
>>> print s
ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ
s is the string «Привет мир» ("Hello world"), pulled out via xpath. As you can see, it has not been decoded. Strictly speaking, the problem can be patched on the spot: there is a special codec, raw-unicode-escape, which turns such a string back into a byte string without any real conversion:
>>> print s.encode('raw-unicode-escape')
Привет мир
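So the full on-the-spot repair is a round trip: encode the pseudo-unicode back into the bytes it actually holds, then decode those bytes with the real encoding. A sketch, assuming the data really is utf-8:

raw = s.encode('raw-unicode-escape')  # pseudo-unicode -> the original utf-8 bytes
fixed = raw.decode('utf-8')           # bytes -> proper unicode
print fixed                           # Привет мир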
But this fix is a bad one. Better to force lxml.html not to mangle non-ASCII characters in the first place.
What happens if you specify the encoding in the html meta header, which I have never loved?
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>Привет мир</body>
</html>
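Running the same parsing code over this markup (data now holds the html above, and the xpath is the same illustrative one as before):

html = lxml.html.document_fromstring(data)
s = html.xpath('//body/text()')[0]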
Everything immediately falls into place:
>>> print s
Привет мир
Of course, it would be more logical to take the encoding information from the HTTP headers, but lxml.html has no idea which protocol the mysterious data arrived by, so it cannot rely on them.
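Still, nothing stops us from doing this ourselves one level up, where the headers are available. A sketch with Python 2's urllib (the utf-8 fallback is my assumption, not something the post prescribes):

import urllib
import lxml.html

response = urllib.urlopen('http://yandex.ru/')
# Content-Type looks like 'text/html; charset=utf-8'
charset = response.headers.getparam('charset') or 'utf-8'
html = lxml.html.document_fromstring(response.read().decode(charset))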
Another solution is to feed lxml.html a unicode string rather than a byte string (provided, of course, that you know the encoding yourself):
>>> html = lxml.html.document_fromstring(data.decode('utf-8'))
...
>>> print s
Привет мир
In my opinion, it would be better if lxml.html did not try to "survive at any cost" and spoil the content, but instead reported explicitly that the encoding is not specified, the way it happens, by the way, when parsing xml, as the sketch below shows. But in any case, there are workarounds.
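For comparison, here is the xml behavior I mean, sketched with lxml.etree: bytes that are not valid utf-8 and carry no declaration make the parser fail loudly instead of guessing (the cp1251 sample is made up for illustration):

import lxml.etree

data = '<root>\xcf\xf0\xe8\xe2\xe5\xf2</root>'  # 'Привет' in cp1251, no declaration
# XML defaults to utf-8 when nothing is declared, so the invalid
# bytes raise XMLSyntaxError rather than being silently mangled
lxml.etree.fromstring(data)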
Be careful.