This article describes the development of a Python script that parses HTML, compiles a list of a site's materials, downloads the articles and pre-cleans their text of "extraneous" elements. The libraries used are urllib (fetching HTML pages), lxml (parsing HTML, deleting elements and saving the "cleaned" article), re (working with regular expressions) and configobj (reading configuration files). Basic knowledge of Python and general programming and debugging skills are enough to write such a script. The article explains the use of these libraries on the example of compiling a list of the publications of S. M. Golubitsky and gives a link to a working script.
Preface, or a lyrical digression
I can hardly be mistaken in saying that many Habr readers are familiar with the irrepressible creative output of Sergei Golubitsky. Over almost 15 years of computer-related journalism he has turned out 433 articles for the late, lamented paper Computerra and more than 300 "Dovecote" columns on the Computerra-Online portal. And that is without counting the analytical pieces on the heroes of foreign business dealings in the Business Journal, the articles lifting the veil on the secrets of creativity in Home Computer, the publications in the Russian Journal, D` and so on and so forth. Those who want a complete overview of his work will find it through the links above.
Last year the author's project "The Old Dovecote and His Friends" was launched; it was conceived as (and became), among other things, a constantly growing archive of the author's own publications and a platform for culturological discussions. As someone who cares about the themes of online life, social mythology and self-development that the author explores so brilliantly, and who is always after quality leisure reading, I, too, became a regular at the Dovecote gatherings. As far as I can, I try not only to keep the project in sight, but also to take some part in its development.
Having gotten involved in proofreading the articles transferred to the archive from the Computerra-Online portal, the first thing I decided to do was take an inventory of all the Dovecotes.
Formulation of the problem
So, the task that we will use as an example of parsing sites in Python was as follows:
- Compile a list of all Dovecotes published on Computerra-Online. The list should include the article title, the publication date, information about the content of the article (text only, images, video), a synopsis and a link to the source.
- Add the materials published in the paper Computerra and find the duplicates.
- Add the materials from the archive of the Internettrading.net site.
- Download the list of articles already published on the "Old Dovecote" portal.
- Download the articles to a local disk for further processing, automatically clearing the text of unnecessary elements where possible.
Toolkit selection
As for the programming language, my choice immediately and definitely fell on Python. Not only because I had studied it at my leisure a few years ago (and then used the ActivePython shell as an advanced calculator for a while), but also because of the abundance of libraries and source-code examples and the ease of writing and debugging scripts. Not least, I was interested in the prospect of applying these skills to other immediate tasks: integration with the Google Docs API, automation of word processing, and so on.
Since I was solving a very specific task practically from scratch, the toolkit was chosen so as to spend the minimum of time on reading documentation and comparing libraries and get straight to the implementation. On the other hand, the solution had to be universal enough to be easily adapted to other sites and similar tasks. Some of the tools and libraries may be imperfect, but in the end they allowed me to carry out the plan.
So, the choice of tools began with picking a suitable version of Python. Initially I tried Python 3.2, but in the course of my experiments I settled on Python 2.7, because some examples simply would not run on the 3.x branch.
To simplify the installation of additional libraries and packages, I used setuptools, a tool for downloading, building and installing packages.
The following additional libraries were installed:
- urllib - fetching the HTML pages of sites;
- lxml - a library for parsing XML and HTML;
- configobj - a library for reading configuration files.
The auxiliary tools were:
- Notepad++ - a text editor with syntax highlighting;
- FireBug - a Firefox plugin that lets you inspect the source code of HTML pages;
- FirePath - a Firefox plugin for analyzing and testing XPath expressions;
- the built-in Python GUI for debugging code.
Invaluable assistance came from articles and discussions on Habr, as well as from manuals, examples and documentation. And, of course, from the book Python Programming Language.
Solution Overview
The task includes four similar procedures for downloading materials from four different sites. Each site has one or several pages with a list of articles and links to the materials. In order not to spend a lot of time formalizing and unifying the procedure, a basic script was written, and on its basis a separate script was developed for each site, taking into account the peculiarities of the structure of its list of materials and the composition of its HTML pages. Thus, parsing the materials on Internettrading.net, where the HTML had apparently been written by hand, required many additional checks and parsing workarounds, while the pages generated by CMSs, Drupal ("The Old Dovecote and His Friends") and Bitrix ("Computerra Online", the paper Computerra archives), held a minimum of surprises.
Below I will go into the details of the most recently written script, the one that parses the Old Dovecote portal.
The list of articles is displayed in the "Prograph" section. It shows the title, a link to the article and a synopsis. The list is split across several pages. One could move to the next page by changing a parameter in the address line in a loop (?page=n), but it seemed smarter to me to extract the link to the next page from the HTML itself.
The article page contains the publication date in the "DD Month YYYY" format, the article text and an indication of the source in the signature.
To work with the different data types, two classes were created:
- MaterialList(object) - the list of articles; it contains the _ParseList method for parsing a single page of the list and the _GetNextPage method for obtaining the URL of the next page, and it stores the list of materials and their identifiers.
- Material(object) - the article itself; it contains the _InitID method for generating an identifier based on the date, the _ParsePage method for parsing the page, the _GetSection method for determining the source of the publication, and article attributes such as the publication date, type of material, etc.
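To make this structure more concrete, here is a minimal sketch of how the two classes might be laid out. The method names are the ones listed above; everything else (attribute names, signatures, the bodies) is illustrative only and may differ from the actual script:

class Material(object):
    """A single article."""
    def __init__(self):
        self.id = None       # identifier derived from the publication date
        self.date = None     # publication date
        self.title = ''      # article title
        self.url = ''        # link to the source
        self.section = ''    # where the article was first published

    def _InitID(self, mlist, datekey):
        # build a unique identifier from a YYYYMMDD date key
        pass

    def _ParsePage(self, page):
        # extract the date, the text and the source from the article page
        pass

    def _GetSection(self, item, path):
        # determine the original place of publication from the signature
        pass


class MaterialList(object):
    """The list of articles collected from the site."""
    def __init__(self):
        self.materials = []  # Material instances
        self.lastid = ''     # identifier of the last parsed material

    def _ParseList(self, page):
        # parse one page of the list and append Material objects
        pass

    def _GetNextPage(self, page):
        # return the URL of the next list page, or None on the last one
        pass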
In addition, functions for working with elements of the document tree are defined:
- get_text(item, path) - get the text of the element located at path within the item subtree;
- get_value(item) - get the text of the item node itself;
- get_value_path(item, path) - get the text of the node located at path within the item subtree;
- get_attr_path(item, path, attr) - get the attribute attr of the element located at path within the item subtree.
There is also the get_month_by_name(month) function, which returns the number of a month by its name and is used when parsing dates.
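These helpers are thin wrappers around lxml calls. A possible implementation is sketched below, assuming the elements are lxml.html.HtmlElement objects and that the site's dates use Russian month names; the bodies in the actual script may differ:

# -*- coding: utf-8 -*-

def get_value(item):
    # text of the node itself (empty string if there is none)
    return item.text.strip() if item.text is not None else ''

def get_text(item, path):
    # full text content of the element found at path
    node = item.find(path)
    return node.text_content().strip() if node is not None else ''

def get_value_path(item, path):
    # text of the node found at path
    node = item.find(path)
    return get_value(node) if node is not None else ''

def get_attr_path(item, path, attr):
    # value of the attribute attr of the element found at path
    node = item.find(path)
    return node.get(attr, '') if node is not None else ''

def get_month_by_name(month):
    # month number by its Russian (genitive) name, e.g. u'августа' -> 8
    months = [u'января', u'февраля', u'марта', u'апреля', u'мая', u'июня',
              u'июля', u'августа', u'сентября', u'октября', u'ноября', u'декабря']
    return months.index(month.lower()) + 1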
The main code (the main() procedure) loads the configuration from a file, walks through the pages of the list of materials, loads their contents into memory and then saves to files both the list itself (in CSV format) and the text of the articles (in HTML; the file name is formed from the material identifier).
The configuration file stores the URL of the start page of the list of materials, all the XPath paths for the article pages and the list of articles, the file names and the path to the directory where articles are saved.
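Roughly, main() ties these pieces together as follows. This is a simplified sketch: the CSV writing and error handling are omitted, and the configuration keys follow the examples shown later in the article:

import urllib
from lxml.html import fromstring
from configobj import ConfigObj

def main():
    cfg = ConfigObj('sgolub-list.ini')       # read the configuration file
    mlist = MaterialList()
    url = cfg['sgolub']['url']               # URL of the first page of the list
    while url:
        html = urllib.urlopen(url).read()    # download the current list page
        page = fromstring(html)              # build the document tree
        page.make_links_absolute(url)        # fix relative links
        mlist._ParseList(page)               # collect the materials on this page
        url = mlist._GetNextPage(page)       # None/empty when there is no next page
    # finally, write the list to a CSV file and each article to an HTML file
    # whose name is derived from the material identifier (omitted here)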
Implementation details
In this part I will go over the points in the code that in one way or another caused difficulties or forced me to pore over the manuals.
To simplify the debugging of paths inside documents and to make the code easier to read, all XPath expressions were moved to a separate configuration file. The configobj library proved quite suitable for working with it. The configuration file has the following structure:
# Comment
[Section_1]
# Comment
variable_1 = value_1
# Comment
variable_2 = value_2
[[Subsection_1]]
variable_3 = value_3
[[Subsection_2]]
[Section_2]
Subsections can be nested to arbitrary depth, and comments on sections and individual variables are allowed. An example of working with the configuration file:
from configobj import ConfigObj
# read the configuration file
cfg = ConfigObj('sgolub-list.ini')
# the start URL of the sgolub section
url = cfg['sgolub']['url']
Loading an HTML page is implemented with the urllib library. With the help of lxml we turn the document into a tree and fix the relative links:
import urllib
from lxml.html import fromstring
# download the HTML page
html = urllib.urlopen(url).read()
# build an lxml.html.HtmlElement tree
page = fromstring(html)
# convert relative links into absolute ones
page.make_links_absolute(url)
When parsing the list of publications, we need to loop over all the elements of the list. The lxml.html.HtmlElement.findall(path) method is suitable for this. For example:
for item in page.findall(path):
    url = get_attr_path(item, cfg['sgolub']['list']['xpath_link'], 'href')
Now is the time to say a word about the FirePath plugin and using it to build XPath expressions. Indeed, as has already been noted on Habr, FirePath produces paths that differ from the paths lxml expects. Only slightly, but there is a difference. These differences were revealed fairly soon, and from then on the paths suggested by FirePath were used with corrections, for example, replacing the tbody tag with * (the most common problem). Conveniently, the corrected paths can still be checked in FirePath, which significantly speeds up the process.
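For instance, a path copied from FirePath for a table cell might need the following kind of adjustment before lxml will match it (an illustrative pair, not taken from the actual configuration):

# path as suggested by FirePath
xpath_from_firepath = './/*[@id="main-region"]/div/table/tbody/tr/td'
# the same path corrected for lxml: tbody is replaced with *
xpath_for_lxml = './/*[@id="main-region"]/div/table/*/tr/td'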
While page.findall(path) returns a list of elements, there is a find(path) method for getting a single element. For example:
content = page.find(cfg['sgolub']['doc']['xpath_content'])
The find and findall methods work only with simple paths that contain no logical expressions in their conditions, for example:
xpath_blocks = './/*[@id="main-region"]/div/div/div/table/*/tr/td'
xpath_nextpage = './/*[@id="main-region"]/div/div/div/ul/li[@class="pager-next"]/a[@href]'
To use more complex conditions, such as
xpath_purifytext = './/*[@id="fin" or @class="info"]'
you will need the xpath(path) method, which returns a list of elements. Here is a sample of the code that removes the selected elements from the tree (I still do not understand how this magic works, but the elements really are removed):
from lxml.html import tostring
for item in page.xpath(cfg['computerra']['doc']['xpath_purifytext']):
    item.drop_tree()
text = tostring(page, encoding='cp1251')
This fragment also uses the lxml.html.tostring method, which serializes the tree (already without the extra elements!) into a string in the specified encoding.
In conclusion, I will give two examples of working with the re regular expression library. The first one parses a date in the "DD Month YYYY" format:
import re
import datetime
# content is an lxml.html.HtmlElement with the article page
# extract the date string
datestr = get_text(content, cfg['sgolub']['doc']['xpath_date'])
if len(datestr) > 0:
    datesplit = re.split(r'\s+', datestr, 0, re.U)
    self.id = self._InitID(list, datesplit[2].zfill(4) + str(get_month_by_name(datesplit[1])).zfill(2) + datesplit[0].zfill(2))
    self.date = datetime.date(int(datesplit[2]), get_month_by_name(datesplit[1]), int(datesplit[0]))
else:
    self.id = self._InitID(list, list.lastid[0:8])
    self.date = datetime.date(1970, 1, 1)
Here the re.split(pattern, string, maxsplit, flags) function is used; it splits the string into a list of elements separated by the given pattern (in this case, by whitespace). The re.U flag makes it possible to work with strings containing Russian characters in Unicode. The zfill(n) string method pads the string with zeros on the left up to the specified number of characters.
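A quick illustration of these calls on a made-up date string (the Russian month name and the month number 8 are just examples of what the parser would see and what get_month_by_name would return):

# -*- coding: utf-8 -*-
import re
datestr = u'21 августа 2013'                # hypothetical "DD Month YYYY" string
parts = re.split(r'\s+', datestr, 0, re.U)  # -> [u'21', u'августа', u'2013']
print parts[2].zfill(4) + str(8).zfill(2) + parts[0].zfill(2)  # prints: 20130821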
The second example shows how to use regular expressions to search for substrings.
def _GetSection(item, path):
    # compile a pattern that captures the text between « and »
    reinfo = re.compile(r'.*«(?P<gsource>.*)».*', re.LOCALE)
    for info in item.xpath(path):
        src = get_value_path(info, '.').strip('\n').strip().encode('cp1251')
        # the attribution line starts with a fixed prefix
        if src.startswith(' '):
            parser = reinfo.search(src)
            if parser is not None:
                if parser.group('gsource') == '-':
                    return '-'
                else:
                    return parser.group('gsource')
            break
    return ''
The code above shows the _GetSection(item, path) function, which is passed the subtree indicating the source of the publication, for example, "First published in the Business Journal". Note the ?P<gsource> fragment of the regular expression: placed inside parentheses, it defines a named group in the match, which can then be accessed via parser.group('gsource').
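A standalone illustration of named groups (the quoted string here is made up for the example):

# -*- coding: utf-8 -*-
import re
reinfo = re.compile(ur'«(?P<gsource>.*)»', re.U)
m = reinfo.search(u'First published in «Business Journal»')
if m is not None:
    print m.group('gsource')   # prints: Business Journal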
The re.LOCALE option is similar to re.U.
The source code of the parser is posted on Google Docs. To spare the Old Dovecote site a flood of parsers, I am posting only the code, without the configuration file with the links and paths.
Conclusion
The result of applying this technology is an archive of articles from the four sites on my hard disk and lists of all the Dovecote publications. The lists were uploaded manually into a Google Docs spreadsheet, and the articles from the archive are also being transferred into Google documents manually for editing.
Plans for further work:
- writing a service that automatically tracks new publications;
- integration with the Google Docs API to add new publications to the list automatically;
- converting the archived articles from HTML to XML with automatic correction of some errors, and uploading them to Google Docs.
P.S. Many thanks to everyone for the comments, support and constructive criticism. I hope that, after careful study, most of the comments will prove useful to me in the future.