This post is devoted to the perennial problem of all pythonists: encodings. Recently I received a letter in which a friend complained that his program produced strings like:
u'\xd0\x9a\xd1\x83\xd1\x80\xd1\x83\xd0\xbc\xd0\xbe\xd1\x87'
Notice anything wrong? I did. The strings claim to be unicode, yet inside them sit encoded utf-8 bytes. Something is off here. Digging further and requesting the script that produces this, it became clear that the data is taken from the web: fetched in the quite usual way via urllib and then fed to lxml.html for parsing. Since urllib operates on byte strings only and could not have turned them into unicode this way, the culprit has to be lxml.
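A minimal sketch of that pipeline, roughly as it must have looked (Python 2, like the rest of the post; the URL and the xpath expression are made up for illustration, since the letter did not include the script itself):

import urllib
import lxml.html

# urllib hands back raw bytes; they go straight into the html parser
raw = urllib.urlopen('http://example.com/').read()
html = lxml.html.document_fromstring(raw)
# text nodes can come back as pseudo-unicode like the string above
titles = html.xpath('//title/text()')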
In general, lxml is a very cool library: fast, feature-rich, able to mimic the ElementTree interface, and it plays nicely with BeautifulSoup. It has long been popular among pythonists whenever xml has to be handled conveniently.
But this is a slightly different case: the html parser is used here, and that is exactly where these unpleasant string metamorphoses happen.
I decided to figure out what was going on and how to work around this behavior.
To start, I went to yandex.ru and looked at what html it serves. The content encoding is utf-8. The first thing that caught my eye was the absence of an encoding declaration; it is not mandatory, but it is still used quite often. So I put together a similar html:
data = """<html>
<head>
</head>
<body> </body>
</html>"""
html = lxml.html.document_fromstring(data)
Parsing it with lxml.html, I got, alas, the expected result:
>>> s
u'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82 \xd0\xbc\xd0\xb8\xd1\x80'
>>> print s
ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ
s is the string «Привет мир» ("Hello world"), pulled out via xpath. As you can see, it has not been decoded. Strictly speaking, the problem can be patched on the spot: there is a special codec, raw-unicode-escape, which turns such a string back into a byte string without any real conversion:
>>> print s.encode('raw-unicode-escape')
Привет мир
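So the full on-the-spot repair is a round trip: encode the pseudo-unicode back into the bytes it actually holds, then decode those bytes with the real encoding. A sketch, assuming the data really is utf-8:

raw = s.encode('raw-unicode-escape')  # pseudo-unicode -> the original utf-8 bytes
fixed = raw.decode('utf-8')           # bytes -> proper unicode
print fixed                           # Привет мир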
But this fix is a bad one. Better to force lxml.html not to mangle non-ASCII characters in the first place.
What happens if you specify the encoding in the html meta header, which I have never loved?
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>Привет мир</body>
</html>
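Running the same parsing code over this markup (data now holds the html above, and the xpath is the same illustrative one as before):

html = lxml.html.document_fromstring(data)
s = html.xpath('//body/text()')[0]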
Everything immediately falls into place:
>>> print s
Привет мир
Of course, it would be more logical to take the encoding information from the HTTP headers, but lxml.html has no idea which protocol the mysterious data arrived by, so it cannot rely on them.
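Still, nothing stops us from doing this ourselves one level up, where the headers are available. A sketch with Python 2's urllib (the utf-8 fallback is my assumption, not something the post prescribes):

import urllib
import lxml.html

response = urllib.urlopen('http://yandex.ru/')
# Content-Type looks like 'text/html; charset=utf-8'
charset = response.headers.getparam('charset') or 'utf-8'
html = lxml.html.document_fromstring(response.read().decode(charset))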
Another solution is to feed lxml.html a unicode string rather than a byte string (provided, of course, that you know the encoding yourself):
>>> html = lxml.html.document_fromstring(data.decode('utf-8'))
...
>>> print s
Привет мир
In my opinion, it would be better if lxml.html did not try to "survive at any cost" and spoil the content, but instead reported explicitly that the encoding is not specified, the way it happens, by the way, when parsing xml, as the sketch below shows. But in any case, there are workarounds.
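For comparison, here is the xml behavior I mean, sketched with lxml.etree: bytes that are not valid utf-8 and carry no declaration make the parser fail loudly instead of guessing (the cp1251 sample is made up for illustration):

import lxml.etree

data = '<root>\xcf\xf0\xe8\xe2\xe5\xf2</root>'  # 'Привет' in cp1251, no declaration
# XML defaults to utf-8 when nothing is declared, so the invalid
# bytes raise XMLSyntaxError rather than being silently mangled
lxml.etree.fromstring(data)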
Be careful.