Good afternoon, dear readers.
In today's article, I'll show the basics of parsing HTML pages with the lxml library for Python.
In short, lxml is a fast and flexible library for handling XML and HTML markup in Python. In addition, it can decompose a document's elements into a tree. In this article I will try to show how easy it is to use in practice.
Select a target for parsing
Since I actively practice sports, in particular BJJ (Brazilian jiu-jitsu), I wanted to look at submission statistics across all the MMA tournaments held.
Searching the web led me to a site with all the official statistics on major international mixed martial arts tournaments. The only catch was that the information we need is presented in a form inconvenient for analysis: the results of each tournament live on a separate page, and the tournament dates and names are listed on yet another separate page.
To combine all the tournament information into one table suitable for analysis, I decided to write the parser described below.
The parser's algorithm
First, let's work out the parser's algorithm. It will be as follows:
- Take as a basis the table with all tournaments and their dates, which is located at this address
- Fill a dataset with the data from this page, using the following columns:
  - tournament
  - description link
  - date
- For each record of this set (that is, for each tournament), follow the [description link] field to the fight information
- Write down the information on all fights of the tournament
- Add the tournament date from the set built in step (2) to the dataset with fight information

The algorithm is ready, and we can move on to its implementation.
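Before implementing it step by step, here is a compact skeleton of how the plan maps onto code (the function names here are illustrative, not part of the final program):

import lxml.html as html
from pandas import DataFrame

def load_event_list(url):
    # steps 1-2: parse the tournament list page into a dataset
    # with the columns tournament, description link, date
    raise NotImplementedError

def load_event_fights(event_url):
    # steps 3-4: follow one tournament's description link and
    # collect the records of all its fights
    raise NotImplementedError

# step 5: join the tournament date onto every fight record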
Getting started with lxml
For this we need the lxml and pandas modules. Let's import them into our program:
import lxml.html as html
from pandas import DataFrame
For convenience in further parsing, we'll put the main domain into a separate variable:
main_domain_stat = 'http://hosteddb.fightmetric.com'
Now let's get an object to parse. This can be done using the parse() function:
page = html.parse('%s/events/index/date/desc/1/all' % (main_domain_stat))
Now open this page and study its HTML structure. We are most interested in the block with the classes events_table data_table row_is_link, because it contains the table with the data we need. You can get this block like this:
e = page.getroot().\
    find_class('events_table data_table row_is_link').\
    pop()
Let's look at what this code does. First, using the getroot() function, we get the root element of our document (this is necessary for subsequent work with it). Then, using the find_class() function, we find all the elements with the specified classes; the function returns a list of such elements. Visual analysis of the page's HTML shows that only one element matches this criterion, so we extract it from the list using the pop() function.
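The same lookup can be written without chaining, which makes the intermediate list explicit; a small sketch (the assertion encodes the assumption, checked visually above, that exactly one block matches):

matches = page.getroot().find_class('events_table data_table row_is_link')
assert len(matches) == 1  # visual inspection showed exactly one matching block
e = matches[0]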
Now we need to get the table from the div obtained earlier. To do this, we use the getchildren() method, which returns a list of the current element's child objects. Since we have only one such object, we extract it from the list:
t = e.getchildren().pop()
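A side note: getchildren() is deprecated in newer versions of lxml; an element can be treated as a list of its children directly, so plain indexing does the same job (assuming, as here, that the table is the div's last child):

t = e.getchildren().pop()  # as in the article: pop() takes the last child
t = e[-1]                  # equivalent without the deprecated call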
Now the variable t contains the table with the information we need. Next, I will build two auxiliary DataFrames which, when combined, will give us the tournaments with their dates and links to the results.
In the first set I will include all tournament names and links to their pages on the site. This is easy to do with the iterlinks() iterator, which yields a tuple of the form (element, attribute, link, pos) for every link within the given element. From this tuple we need the link's address and its text. The link's text can be obtained through the .text property of the corresponding element. The code will be as follows:
events_tabl = DataFrame([{'EVENT':i[0].text, 'LINK':i[2]} for i in t.iterlinks()][5:])
The attentive reader will notice that we drop the first 5 entries: they contain information we don't need, such as field headers, so I got rid of them.
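If you want to see for yourself what those first entries contain, a quick inspection loop does the trick (a sketch; the exact output depends on the live page):

for n, (element, attribute, link, pos) in enumerate(t.iterlinks()):
    print('%d %s %s %s' % (n, attribute, link, element.text))
    if n >= 6:  # show the skipped service entries plus a couple of real ones
        break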
So, we got the links. Now let's get the second subset of data, with the tournament dates. This can be done like this:
event_date = DataFrame([{'EVENT': evt.getchildren()[0].text_content(),
                         'DATE': evt.getchildren()[1].text_content()} for evt in t][2:])
In the code shown above, we go through all the rows (tr tags) in table t. For each row we get the list of its child columns (td elements) and take the information recorded in the first and second columns using the text_content() method, which returns the text of all child elements of the given column as a single string.
To understand how the text_content() method works, here's a small example. Suppose we have the following document structure: <tr><td><span>text</span> <span>text</span></td></tr>. Then text_content() on the td element returns the string 'text text', while the .text property returns only the text that appears before the first child element, which in this case is nothing (None).
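Here is the same example as runnable code (a minimal sketch built with lxml.html.fromstring):

import lxml.html as html

doc = html.fromstring('<table><tr><td><span>text</span> <span>text</span></td></tr></table>')
td = doc.find('.//td')
print(td.text_content())  # 'text text': the text of all descendants
print(td.text)            # None: no text before the first child element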
Now that we have the two subsets of data, let's combine them into the final set:
sum_event_link = events_tabl.set_index('EVENT').join(event_date.set_index('EVENT')).reset_index()
Here, we first set an index on each dataset, then join them, and reset the index of the final set. More information about these operations can be found in one of my past articles.
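To make the pattern clearer, here is the same set_index/join/reset_index chain on toy data (the rows below are made up purely for illustration):

left = DataFrame([{'EVENT': 'UFC 1', 'LINK': '/events/1'}])
right = DataFrame([{'EVENT': 'UFC 1', 'DATE': 'Nov 12, 1993'}])
combined = left.set_index('EVENT').join(right.set_index('EVENT')).reset_index()
# combined now holds EVENT, LINK and DATE in a single row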
It remains to save the resulting dataframe to a text file, for safekeeping:
sum_event_link.to_csv(r'..\DataSets\ufc\list_ufc_events.csv', ';', index=False)
Processing the results of a single UFC event
We have saved the list of tournaments in a convenient format. It's time to deal with the results pages of the events themselves. As an example, take the latest tournament and look at the HTML code of its page.
You may notice that the information we need is contained in the element with the classes data_table row_is_link. On the whole, the parsing process is similar to the one shown above, with one exception: the results table is not entirely well-formed. The trouble is that each fighter gets a separate row in it, which is inconvenient for analysis. To get around this, I decided to walk the rows in pairs: the loop acts on every other row, and the index of its paired row is computed from the current one. This way I process two rows at once and merge them into a single record. The code will be as follows:
all_fights = []
for i in sum_event_link.itertuples():
    # i[1] is the EVENT column, i[2] is the LINK column of sum_event_link
    page_event = html.parse('%s/%s' % (main_domain_stat, i[2]))
    main_code = page_event.getroot()
    figth_event_tbl = main_code.find_class('data_table row_is_link').pop()[1:]
    for figther_num in xrange(len(figth_event_tbl)):  # xrange: Python 2 (use range in Python 3)
        if not figther_num % 2:  # even row holds the winner; the next row holds the loser
            all_fights.append(
                {'FIGHTER_WIN': figth_event_tbl[figther_num][2].text_content().strip(),
                 'FIGHTER_LOSE': figth_event_tbl[figther_num + 1][1].text_content().strip(),
                 'METHOD': figth_event_tbl[figther_num][8].text_content().strip(),
                 'METHOD_DESC': figth_event_tbl[figther_num + 1][7].text_content().strip(),
                 'ROUND': figth_event_tbl[figther_num][9].text_content().strip(),
                 'TIME': figth_event_tbl[figther_num][10].text_content().strip(),
                 'EVENT_NAME': i[1]})
history_stat = DataFrame(all_fights)
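As a design note, the same pairing can be expressed without index arithmetic by zipping the even and odd rows directly; a sketch of the inner loop only:

# even rows describe the winner, odd rows the loser
for win_row, lose_row in zip(figth_event_tbl[0::2], figth_event_tbl[1::2]):
    fighter_win = win_row[2].text_content().strip()
    fighter_lose = lose_row[1].text_content().strip()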
You may notice that for each fight we additionally record the tournament name. It is needed later to attach the tournament date to each fight.
Now save the results to a file:
history_stat.to_csv(r'..\DataSets\ufc\list_all_fights.csv', ';', index=False)
Let's look at the result:
history_stat.head()
|   | EVENT_NAME | FIGHTER_LOSE | FIGHTER_WIN | METHOD | METHOD_DESC | ROUND | TIME |
|---|---|---|---|---|---|---|---|
| 0 | UFC Fight Night 38: Shogun vs. Henderson | Robbie Lawler | Johny Hendricks | U. DEC | NaN | 5 | 5:00 |
| 1 | UFC Fight Night 38: Shogun vs. Henderson | Carlos Condit | Tyron Woodley | KO / TKO | Knee injury | 2 | 2:00 |
| 2 | UFC Fight Night 38: Shogun vs. Henderson | Diego Sanchez | Myles Jury | U. DEC | NaN | 3 | 5:00 |
| 3 | UFC Fight Night 38: Shogun vs. Henderson | Jake Shields | Hector Lombard | U. DEC | NaN | 3 | 5:00 |
| 4 | UFC Fight Night 38: Shogun vs. Henderson | Nikita Krylov | Ovince Saint Preux | SUB | Other - Choke | 1 | 1:29 |
It remains only to attach the date to the fights and write out the final file:
all_statistics = history_stat.set_index('EVENT_NAME').join(sum_event_link.set_index('EVENT').DATE)
all_statistics.to_csv(r'..\DataSets\ufc\statistics_ufc.csv', ';', index_label='EVENT')
Conclusion
In this article I tried to show the basics of working with the lxml library, which is used for parsing XML and HTML markup. The code given in the article does not claim to be optimal, but it correctly performs the task set for it.
As you can see from the program above, working with the library is quite simple, which helps you write the necessary code quickly. Besides the functions and methods shown, the library offers many other equally useful ones.