
Simple HTML parsing library

I recently released Leaf, a small library for parsing HTML in Python.
It has covered all my parsing needs for quite a while, though I still have ideas for further development.
The library is essentially a wrapper around lxml that makes working with it much more pleasant.

Functions



Description


To parse HTML, pass a string containing it to leaf.parse:

    import leaf

    document = leaf.parse(sample)
    links = document('div#menu a')     # all elements matching the CSS selector
    link = document.get('div#menu a')  # first matching element, or None
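
To make the difference between the two calls concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Leaf itself. It assumes well-formed markup, the sample string is made up for illustration, and the XPath expression is a hand-written approximation of the CSS selector div#menu a.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (ElementTree is not a lenient HTML parser).
sample = (
    '<html><body>'
    '<div id="menu"><ul>'
    '<li><a href="/">Home</a></li>'
    '<li><a href="/about">About</a></li>'
    '</ul></div>'
    '<div id="footer"><a href="/legal">Legal</a></div>'
    '</body></html>'
)

root = ET.fromstring(sample)

# Rough XPath counterpart of the CSS selector 'div#menu a'.
MENU_LINKS = ".//div[@id='menu']//a"

# Comparable to calling the document with a selector: every match, as a list.
links = root.findall(MENU_LINKS)

# Comparable to .get(): the first match, or None when nothing matches.
first_link = root.find(MENU_LINKS)
missing = root.find(".//div[@id='sidebar']//a")

print([a.get('href') for a in links])
print(first_link.text)
print(missing)
```

The point of the sketch is only the "all matches" versus "first match or None" distinction. Real-world HTML is rarely well-formed, which is one reason to build on lxml instead.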

Access to an element's attributes is also more convenient:

    print(link.onclick)
    print(link.id)
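
One way such attribute access could be provided is through Python's __getattr__ hook. The sketch below is an assumption about the general technique, not Leaf's actual code: the AttrElement class is hypothetical and wraps a standard-library ElementTree element instead of an lxml one.

```python
import xml.etree.ElementTree as ET


class AttrElement(object):
    """Hypothetical wrapper that exposes HTML attributes as Python attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name == '_element':
            # Guard against infinite recursion if the wrapper is not initialised.
            raise AttributeError(name)
        element = self._element
        # Prefer the wrapped element's own members (tag, text, get, ...).
        if hasattr(element, name):
            return getattr(element, name)
        # Otherwise treat the name as an HTML attribute; None if it is absent.
        return element.get(name)


link = AttrElement(ET.fromstring('<a id="home" onclick="go()">Home</a>'))
print(link.id)
print(link.onclick)
print(link.text)
print(link.title)
```

A consequence of this design is that an HTML attribute whose name collides with a member of the wrapped element (for example text) is shadowed and has to be read through the element's get method.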

All standard lxml methods remain available, and the elements they return keep all the conveniences of the library:

    link = document.xpath('body/div/ul/li[@class="active_link"]')[0]
    link.get('a').text

Perhaps the most interesting feature is converting HTML to BBCode and other markup languages. Converters for popular markup languages will be added in the future, but for now it is easy to write a function for the format you need.

    # Example converter from HTML to some markup language
    # that supports only links enclosed in [url][/url]
    def omgcode_formatter(element, children):
        # Replace <br> with a newline character
        if element.tag == 'br':
            return '\n'
        # Wrap links in [url][/url]
        if element.tag == 'a':
            return u"[url={link}]{text}[/url]".format(link=element.href, text=children)
        # For all other elements, return the result
        # of processing their children.
        if children:
            return children

This function is called recursively with two parameters: element (the element for the HTML tag) and children (the result of running this function on all children of that element).
To convert an element (you can convert either a single branch or the whole tree):

    document.parse(omgcode_formatter)

where document is an object of the leaf.Parser class.
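
The recursive calling convention described above can be sketched with the standard library alone. This is an illustration under assumptions, not Leaf's implementation: the convert helper is hypothetical, the input must be well-formed for ElementTree, and the link target is read with element.get('href') because plain ElementTree elements have no href attribute shortcut.

```python
import xml.etree.ElementTree as ET


def omgcode_formatter(element, children):
    # Replace <br> with a newline character
    if element.tag == 'br':
        return '\n'
    # Wrap links in [url][/url]
    if element.tag == 'a':
        return '[url={link}]{text}[/url]'.format(
            link=element.get('href'), text=children)
    # For all other elements, return the processed children unchanged.
    if children:
        return children


def convert(element, formatter):
    """Hypothetical driver: format the children first, then the element."""
    parts = [element.text or '']
    for child in element:
        # A formatter may return None for empty elements; treat that as ''.
        parts.append(convert(child, formatter) or '')
        # Text that follows the child's closing tag belongs to the parent.
        parts.append(child.tail or '')
    return formatter(element, ''.join(parts))


sample = '<div>Hello<br/><a href="http://example.com">site</a>!</div>'
result = convert(ET.fromstring(sample), omgcode_formatter)
print(result)
```

Because each element receives the already converted text of its children, a formatter never has to walk the tree itself. It only decides how to wrap or replace what it is given.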
There are also a few functions for working with text:

to_unicode - converts a string to unicode
strip_accents - removes accents, umlauts and similar marks from a string
strip_symbols - removes various special Unicode characters from a string
strip_spaces - removes extra spaces
strip_linebreaks - removes unnecessary line breaks
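
To give an idea of what helpers like these usually involve, here is a minimal standard-library sketch of two of them. The function names carry a _sketch suffix because the code is an assumption about typical behaviour, not Leaf's source; the exact rules Leaf applies may differ.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    # Decompose characters (e.g. 'é' becomes 'e' plus a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_spaces_sketch(text):
    # Collapse runs of spaces and tabs into one space and trim both ends.
    return re.sub(r'[ \t]+', ' ', text).strip()


print(strip_accents_sketch(u'Caf\u00e9 M\u00fcller'))
print(strip_spaces_sketch('  too   many \t spaces  '))
```

Note that NFKD normalization also rewrites compatibility characters such as ligatures, which may or may not be desirable depending on the text being cleaned.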

More detailed examples are in the tests.

Conclusion


The library is available at:


Source: https://habr.com/ru/post/115135/

