Good afternoon, dear readers.
In today's article, I'll show the basics of parsing HTML pages with the lxml library for Python.
In short, lxml is a fast and flexible library for handling XML and HTML markup in Python. In addition, it can decompose a document's elements into a tree. In this article I will try to show how easy it is to use in practice.
Select a target for parsing
Since I actively practice sports, in particular BJJ (Brazilian jiu-jitsu), I wanted to look at submission statistics across all the MMA tournaments held.
Searching the web led me to a site with all the official statistics on major international mixed martial arts tournaments. The only catch was that the information we need is presented in a form inconvenient for analysis: the results of each tournament live on a separate page, and the tournament dates and names are listed on yet another separate page.
To combine all the tournament information into one table suitable for analysis, I decided to write the parser described below.
The parser's algorithm
First, let's work out the parser's algorithm. It will be as follows:
- Take as a basis the table with all tournaments and their dates, which is located at this address
- Fill a dataset with the data from this page, using the following columns:
  - tournament
  - description link
  - date
- For each record of this set (that is, for each tournament), follow the [description link] field to the fight information
- Write down the information on all fights of the tournament
- Add the tournament date from the set built in step (2) to the dataset with fight information

The algorithm is ready, and we can move on to its implementation.
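Before implementing it step by step, here is a compact skeleton of how the plan maps onto code (the function names here are illustrative, not part of the final program):

import lxml.html as html
from pandas import DataFrame

def load_event_list(url):
    # steps 1-2: parse the tournament list page into a dataset
    # with the columns tournament, description link, date
    raise NotImplementedError

def load_event_fights(event_url):
    # steps 3-4: follow one tournament's description link and
    # collect the records of all its fights
    raise NotImplementedError

# step 5: join the tournament date onto every fight record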
Getting started with lxml
For this we need the lxml and pandas modules. Let's import them into our program:
import lxml.html as html
from pandas import DataFrame
For convenience in further parsing, we'll put the main domain into a separate variable:
main_domain_stat = 'http://hosteddb.fightmetric.com'
Now let's get an object to parse. This can be done using the parse() function:
page = html.parse('%s/events/index/date/desc/1/all' % (main_domain_stat))
Now open this page and study its HTML structure. We are most interested in the block with the classes events_table data_table row_is_link, because it contains the table with the data we need. You can get this block like this:
e = page.getroot().\
    find_class('events_table data_table row_is_link').\
    pop()
Let's look at what this code does. First, using the getroot() function, we get the root element of our document (this is necessary for subsequent work with it). Then, using the find_class() function, we find all the elements with the specified classes; the function returns a list of such elements. Visual analysis of the page's HTML shows that only one element matches this criterion, so we extract it from the list using the pop() function.
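The same lookup can be written without chaining, which makes the intermediate list explicit; a small sketch (the assertion encodes the assumption, checked visually above, that exactly one block matches):

matches = page.getroot().find_class('events_table data_table row_is_link')
assert len(matches) == 1  # visual inspection showed exactly one matching block
e = matches[0]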
Now we need to get the table from the div obtained earlier. To do this, we use the getchildren() method, which returns a list of the current element's child objects. Since we have only one such object, we extract it from the list:
t = e.getchildren().pop()
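A side note: getchildren() is deprecated in newer versions of lxml; an element can be treated as a list of its children directly, so plain indexing does the same job (assuming, as here, that the table is the div's last child):

t = e.getchildren().pop()  # as in the article: pop() takes the last child
t = e[-1]                  # equivalent without the deprecated call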
Now the variable t contains the table with the information we need. Next, I will build two auxiliary DataFrames which, when combined, will give us the tournaments with their dates and links to the results.
In the first set I will include all tournament names and links to their pages on the site. This is easy to do with the iterlinks() iterator, which yields a tuple of the form (element, attribute, link, pos) for every link within the given element. From this tuple we need the link's address and its text. The link's text can be obtained through the .text property of the corresponding element. The code will be as follows:
events_tabl = DataFrame([{'EVENT':i[0].text, 'LINK':i[2]} for i in t.iterlinks()][5:])
The attentive reader will notice that we drop the first 5 entries: they contain information we don't need, such as field headers, so I got rid of them.
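If you want to see for yourself what those first entries contain, a quick inspection loop does the trick (a sketch; the exact output depends on the live page):

for n, (element, attribute, link, pos) in enumerate(t.iterlinks()):
    print('%d %s %s %s' % (n, attribute, link, element.text))
    if n >= 6:  # show the skipped service entries plus a couple of real ones
        break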
So, we got the links. Now let's get the second subset of data, with the tournament dates. This can be done like this:
event_date = DataFrame([{'EVENT': evt.getchildren()[0].text_content(),
                         'DATE': evt.getchildren()[1].text_content()} for evt in t][2:])
In the code shown above, we go through all the rows (tr tags) in table t. For each row we get the list of its child columns (td elements) and take the information recorded in the first and second columns using the text_content() method, which returns the text of all child elements of the given column as a single string.
To understand how the text_content() method works, here's a small example. Suppose we have the following document structure: <tr><td><span>text</span> <span>text</span></td></tr>. Then text_content() on the td element returns the string 'text text', while the .text property returns only the text that appears before the first child element, which in this case is nothing (None).
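Here is the same example as runnable code (a minimal sketch built with lxml.html.fromstring):

import lxml.html as html

doc = html.fromstring('<table><tr><td><span>text</span> <span>text</span></td></tr></table>')
td = doc.find('.//td')
print(td.text_content())  # 'text text': the text of all descendants
print(td.text)            # None: no text before the first child element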
Now that we have the two subsets of data, let's combine them into the final set:
sum_event_link = events_tabl.set_index('EVENT').join(event_date.set_index('EVENT')).reset_index()
Here, we first set an index on each dataset, then join them, and reset the index of the final set. More information about these operations can be found in one of my past articles.
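To make the pattern clearer, here is the same set_index/join/reset_index chain on toy data (the rows below are made up purely for illustration):

left = DataFrame([{'EVENT': 'UFC 1', 'LINK': '/events/1'}])
right = DataFrame([{'EVENT': 'UFC 1', 'DATE': 'Nov 12, 1993'}])
combined = left.set_index('EVENT').join(right.set_index('EVENT')).reset_index()
# combined now holds EVENT, LINK and DATE in a single row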
It remains to save the resulting dataframe to a text file, for safekeeping:
sum_event_link.to_csv(r'..\DataSets\ufc\list_ufc_events.csv', ';', index=False)
Processing the results of a single UFC event
We have saved the list of tournaments in a convenient format. It's time to deal with the results pages of the events themselves. As an example, take the latest tournament and look at the HTML code of its page.
You may notice that the information we need is contained in the element with the classes data_table row_is_link. On the whole, the parsing process is similar to the one shown above, with one exception: the results table is not entirely well-formed. The trouble is that each fighter gets a separate row in it, which is inconvenient for analysis. To get around this, I decided to walk the rows in pairs: the loop acts on every other row, and the index of its paired row is computed from the current one. This way I process two rows at once and merge them into a single record. The code will be as follows:
all_fights = []
for i in sum_event_link.itertuples():
    # i[1] is the EVENT column, i[2] is the LINK column of sum_event_link
    page_event = html.parse('%s/%s' % (main_domain_stat, i[2]))
    main_code = page_event.getroot()
    figth_event_tbl = main_code.find_class('data_table row_is_link').pop()[1:]
    for figther_num in xrange(len(figth_event_tbl)):  # xrange: Python 2 (use range in Python 3)
        if not figther_num % 2:  # even row holds the winner; the next row holds the loser
            all_fights.append(
                {'FIGHTER_WIN': figth_event_tbl[figther_num][2].text_content().strip(),
                 'FIGHTER_LOSE': figth_event_tbl[figther_num + 1][1].text_content().strip(),
                 'METHOD': figth_event_tbl[figther_num][8].text_content().strip(),
                 'METHOD_DESC': figth_event_tbl[figther_num + 1][7].text_content().strip(),
                 'ROUND': figth_event_tbl[figther_num][9].text_content().strip(),
                 'TIME': figth_event_tbl[figther_num][10].text_content().strip(),
                 'EVENT_NAME': i[1]})
history_stat = DataFrame(all_fights)
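As a design note, the same pairing can be expressed without index arithmetic by zipping the even and odd rows directly; a sketch of the inner loop only:

# even rows describe the winner, odd rows the loser
for win_row, lose_row in zip(figth_event_tbl[0::2], figth_event_tbl[1::2]):
    fighter_win = win_row[2].text_content().strip()
    fighter_lose = lose_row[1].text_content().strip()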
You may notice that for each fight we additionally record the tournament name. It is needed later to attach the tournament date to each fight.
Now save the results to a file:
history_stat.to_csv(r'..\DataSets\ufc\list_all_fights.csv', ';', index=False)
Let's look at the result:
history_stat.head()
|   | EVENT_NAME | FIGHTER_LOSE | FIGHTER_WIN | METHOD | METHOD_DESC | ROUND | TIME |
|---|---|---|---|---|---|---|---|
| 0 | UFC Fight Night 38: Shogun vs. Henderson | Robbie Lawler | Johny Hendricks | U. DEC | NaN | 5 | 5:00 |
| 1 | UFC Fight Night 38: Shogun vs. Henderson | Carlos Condit | Tyron Woodley | KO / TKO | Knee injury | 2 | 2:00 |
| 2 | UFC Fight Night 38: Shogun vs. Henderson | Diego Sanchez | Myles Jury | U. DEC | NaN | 3 | 5:00 |
| 3 | UFC Fight Night 38: Shogun vs. Henderson | Jake Shields | Hector Lombard | U. DEC | NaN | 3 | 5:00 |
| 4 | UFC Fight Night 38: Shogun vs. Henderson | Nikita Krylov | Ovince Saint Preux | SUB | Other - Choke | 1 | 1:29 |
It remains only to attach the date to the fights and write out the final file:
all_statistics = history_stat.set_index('EVENT_NAME').join(sum_event_link.set_index('EVENT').DATE)
all_statistics.to_csv(r'..\DataSets\ufc\statistics_ufc.csv', ';', index_label='EVENT')
Conclusion
In this article I tried to show the basics of working with the lxml library, which is used for parsing XML and HTML markup. The code given in the article does not claim to be optimal, but it correctly performs the task set for it.
As you can see from the program above, working with the library is quite simple, which helps you write the necessary code quickly. Besides the functions and methods shown, the library offers many other equally useful ones.