Simple HTML parsing library

I recently released Leaf, a small library for parsing HTML in Python.
It has covered all my parsing needs for quite a while, though I still have ideas for further development.
The library is essentially a wrapper around lxml that makes working with it much more pleasant.

Features

Convenient jQuery-style access to elements through CSS selectors
Easy access to element attributes
The ability to convert HTML to other markup languages (bbcode, markdown, etc.)
Several functions for working with text
And of course, all the features of lxml itself

Description

To parse HTML, pass it as a string to leaf.parse:

import leaf
document = leaf.parse(sample)
links = document('div#menu a')  # select elements with a CSS selector
link = document.get('div#menu a')  # first matching element, or None if there is none
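To make the difference between the two calls concrete, here is a minimal standard-library sketch of the same "all matches" versus "first match or None" behaviour. It uses xml.etree.ElementTree rather than Leaf or lxml, an ElementTree path stands in for the CSS selector, and the helper names select_all and select_first as well as the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
sample = """
<html><body>
  <div id="menu">
    <ul>
      <li><a href="/home">Home</a></li>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(sample)


def select_all(node, path):
    """Return every element matching the path as a list (possibly empty)."""
    return node.findall(path)


def select_first(node, path, default=None):
    """Return the first matching element, or `default` when nothing matches."""
    found = node.find(path)
    return default if found is None else found


# Rough ElementTree counterpart of the CSS selector 'div#menu a'.
menu_links = ".//div[@id='menu']//a"

links = select_all(root, menu_links)
link = select_first(root, menu_links)
missing = select_first(root, ".//div[@id='footer']//a")
```

Returning None instead of raising keeps the calling code short when an element may legitimately be absent from the page.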

In addition, accessing an element's attributes is now more convenient:

print(link.onclick)
print(link.id)
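For readers curious how this kind of attribute access can work, below is a rough sketch built on the standard library's xml.etree.ElementTree. It is not Leaf's actual implementation: the AttrElement class is hypothetical, and returning None for a missing attribute is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET


class AttrElement(object):
    """Hypothetical wrapper exposing an element's attributes as Python attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so `_element`
        # itself is found the usual way. Private names are never treated
        # as HTML attributes, which avoids accidental recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        # Assumption of this sketch: a missing attribute yields None.
        return self._element.attrib.get(name)


element = ET.fromstring('<a id="home" onclick="go()" href="/">Home</a>')
link = AttrElement(element)

link_id = link.id            # value of the id attribute
link_onclick = link.onclick  # value of the onclick attribute
link_title = link.title      # attribute is absent, so None
```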

All standard lxml methods remain available, and the elements they return keep all of the library's conveniences:

link = document.xpath('body/div/ul/li[@class="active_link"]')[0]
link.get('a').text

Perhaps the most interesting feature is the conversion of HTML to bbcode and other markup languages. Converters for popular markup languages will be added in the future, but for now it is easy to write a formatter function for the format you need.

# Example of a converter function from HTML to some
# markup language that supports only
# links wrapped in [url][/url]
def omgcode_formatter(element, children):
    # Replace <br> with a newline character
    if element.tag == 'br':
        return '\n'
    # Wrap links in [url][/url]
    if element.tag == 'a':
        return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
    # For all other elements, return the result
    # of processing their children
    if children:
        return children

This function is called recursively with two parameters: element (the element for the current HTML tag) and children (the result of running this function on all of that element's children).
To convert an element (this works on a single block as well as on the whole tree):

document.parse(omgcode_formatter)

where document is an instance of the leaf.Parser class.
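The recursive calling convention is easier to see in isolation. The sketch below walks a tree from the standard library's xml.etree.ElementTree and calls a formatter bottom-up. The convert helper is hypothetical, and the way it assembles children (element text, converted child output, and the text following each child) is an assumption of this sketch, not a description of how Leaf does it.

```python
import xml.etree.ElementTree as ET


def convert(element, formatter):
    """Apply `formatter` bottom-up: children first, then the element itself."""
    parts = [element.text or ""]
    for child in element:
        # A formatter may return None for elements it produces nothing for.
        parts.append(convert(child, formatter) or "")
        parts.append(child.tail or "")  # text that follows the child
    return formatter(element, "".join(parts))


def omgcode_formatter(element, children):
    # Same rules as the formatter in the article, using plain ElementTree
    # attribute access (element.get) instead of Leaf's element.href.
    if element.tag == "br":
        return "\n"
    if element.tag == "a":
        return "[url={link}]{text}[/url]".format(
            link=element.get("href"), text=children)
    if children:
        return children


html = '<div>See <a href="http://example.com">the site</a>.<br/>Bye</div>'
result = convert(ET.fromstring(html), omgcode_formatter)
```

Because each call receives the already converted output of its children, the formatter never has to walk the tree itself. It only decides how to wrap what it is given.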
Finally, there are a few functions for working with text:

to_unicode - Converts a string to unicode
strip_accents - Removes accents, umlauts and similar marks from a string
strip_symbols - Removes various special Unicode characters and the like from a string
strip_spaces - Removes extra spaces
strip_linebreaks - Removes unnecessary line breaks
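To give a feel for what helpers like these do, here is a minimal sketch of three of them using only the standard library. The names are borrowed from the list above, but the bodies are my assumptions about reasonable behaviour and may differ from Leaf's own code.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (accented e becomes e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_spaces(text):
    """Collapse runs of spaces and tabs into one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def strip_linebreaks(text):
    """Collapse runs of line breaks, with surrounding spaces, into one newline."""
    return re.sub(r"[ \t]*(?:\r?\n[ \t]*)+", "\n", text).strip()


cleaned = strip_spaces(strip_accents(u"Cr\u00e8me   Br\u00fbl\u00e9e "))
```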

More detailed examples are in the tests.

Conclusion

The library is available at:

Source: https://habr.com/ru/post/115135/
