Mini web crawler. Downloading a book from the Internet

Since I read most of my books on a handheld device, I need to get them from somewhere. As a rule, I find the books that interest me in online libraries in a text format (txt, html, fb2). Sometimes, however, you want to read a book that is posted on a site where downloading is not provided at all, and the book is even split across several html pages (like this one, for example). In this case you can save each html page manually, but this method has two important drawbacks. First, if the book is split into 15-20 pages, saving them manually takes a long time and is annoying. Second, together with the text of the book we get a lot of garbage: text unrelated to the book, tables, scripts, links to other sites, and other junk.

To make life easier, we will write a program that downloads the book for us. It follows from the above that we need a program that: a) downloads, in the right order, all the pages the book is scattered across; b) takes only the text from each page and nothing extra; and c) saves all the text of the book in one html file.

As an example, we take the book by Vladimir Plungian “Why languages are so different. Popular linguistics”. I found only one online library where it can be downloaded in text format, and registration is required there, so we will download it from the site linked above using the program discussed below. To write the program, we use the Python programming language. I used Python version 2.6. This version or a newer 2.x release can be downloaded for free from the official website; the code below uses Python 2 syntax and will not run unchanged under Python 3.

Download pages
To download pages, we will need their addresses. As we can see, the addresses of all pages except the first are the same and differ only in number:

http://profismart.ru/web/bookreader-115980-page_number.php

Thanks to this, we don't have to write the addresses by hand; we can generate them with the following code:

    for i in range(2, 29):
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i

This will give us the addresses of all pages from the second to the twenty-eighth. The address of the first page is different from the rest (http://profismart.ru/web/bookreader-115980.php), so we will insert it into the program manually.
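The same idea can be wrapped in a small helper that returns the complete list, first page included. This is only a minimal sketch, written so that it runs under Python 3; the names `BASE` and `page_urls` and the `last_page` parameter are assumptions made for illustration and do not appear in the program developed here.

```python
BASE = "http://profismart.ru/web/bookreader-115980"


def page_urls(last_page, base=BASE):
    """Return the addresses of pages 1..last_page in reading order."""
    # The first page has no number in its address.
    urls = [base + ".php"]
    # range() excludes its upper bound, hence last_page + 1.
    for i in range(2, last_page + 1):
        urls.append("%s-%i.php" % (base, i))
    return urls


urls = page_urls(28)
print(urls[0])
print(urls[1])
print(urls[-1])
```

Keeping the first address in the same list means the later steps could treat all pages uniformly, at the cost of a special case for the header on page one.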

Then we will download each of these pages, and write the html-code into the html variable:

    import urllib

    for i in range(2, 29):
        # Generate the page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        # Download the page
        html = ""
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
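For readers on Python 3: `urllib.urlopen` exists only in Python 2, and the equivalent function lives in `urllib.request`. A minimal sketch of the download step under that assumption follows. The helper name `download`, the `opener` parameter (there only so the function can be tried without a network connection) and the 30-second timeout are all assumptions for illustration.

```python
from urllib.request import urlopen


def download(url, opener=urlopen, timeout=30):
    """Return the raw bytes of the page at url."""
    # The response object is a context manager, so the connection is
    # closed even if read() raises an exception.
    with opener(url, timeout=timeout) as sock:
        return sock.read()
```

In Python 3, `read()` returns `bytes`, so the result has to be decoded before it is passed to a regular expression compiled from a text pattern. The encoding the site uses is not stated in this article, so it would have to be checked on the page itself.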


Extract the desired text from the page.

We extract the HTML code containing the text of the book from the page code using regular expressions. In our case, the text is inside the page between the <td style="padding:7px" class="ps24 ps37"> and </td> tags. Each paragraph of text is between the <div> and </div> tags, and the <div> tags themselves without additional parameters are used only in the text of the book. Thus, to extract the text, we just need to find the first <div> and take all the text from it to the first </td> . The regular expression will look like this:

 (<div>.+?</div>)</td> 

A dot means any character (including a line break, thanks to the re.DOTALL flag used below), and a plus sign says that there must be at least one such character. The question mark after the combination .+ says that we want to take the minimum number of characters that satisfy our request. Thus, we are looking for all the text from the first <div> tag found in the html code to the first (thanks to the question mark) combination </div></td>. The parentheses mark the part of the found text that we take. In this case, we take all the found text without the </td>, which we don't need. To use the regular expression in our program, add the following line to its beginning:

 text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE) 
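To see how the non-greedy match behaves, the expression can be tried on a small fragment. The sketch below runs under Python 3; the `sample` string is made up for illustration and only mimics the page layout described above.

```python
import re

text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE)

# A made-up fragment: a <div> with attributes (not book text), the cell
# with two paragraphs of book text, and a trailing cell to be ignored.
sample = (
    '<div class="menu">site menu</div>'
    '<td style="padding:7px" class="ps24 ps37">'
    "<div>First paragraph.</div>\n"
    "<div>Second paragraph.</div></td>"
    "<td><div>footer</div></td>"
)

match = text_regex.search(sample)
print(match.group(1))
```

The match starts at the first bare `<div>`, skips over the first `</div>` because it is not followed by `</td>`, and stops at the first `</div></td>`, so the menu and the footer are both left out.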

We will refer to the expression by the name text_regex . The text of the book we will save in the variable book . With the addition of a regular expression, the code that handles pages 2-28 will look like this:

    for i in range(2, 29):
        # Generate the page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        # Download the page
        html = ""
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
        # Extract the required text and append it to the book
        book = book + text_regex.search(html).group(1)
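One caveat: `text_regex.search(html)` returns `None` when a page does not contain the expected markup, and calling `.group(1)` on `None` raises `AttributeError`. A more defensive variant might look like the sketch below, written for Python 3. The helper name `extract_text` and the choice to raise `ValueError` are assumptions for illustration.

```python
import re

text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE)


def extract_text(html, page_number):
    """Return the book text found in html, or raise ValueError if absent."""
    match = text_regex.search(html)
    if match is None:
        # The page layout differs from what the expression expects.
        raise ValueError("No book text found on page %i" % page_number)
    return match.group(1)


print(extract_text("<td><div>Some text.</div></td>", 2))
```

With this, a changed page layout produces an error that names the page instead of a bare `AttributeError`.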

We will process the first page separately. From it we take the text of the book and the header for the new html file that our program will create. We extract the header using the following regular expression:

 (<html.+?<body>) 

That is, we take all the code from the <html> tag up to and including the <body> tag. The code we get will contain some garbage, but since it does not affect how the page is displayed in a browser or on the screen of a PDA, we will leave it as it is. If you really want to, you can delete it later by hand in an editor. As with the previous expression, add the following line to the beginning of the program:

 head_regex = re.compile(u"(<html.+?<body>)", re.IGNORECASE | re.DOTALL | re.UNICODE) 

The code processing the first page will look like this:

    # Download the first page
    url = "http://profismart.ru/web/bookreader-115980.php"
    html = ""
    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()
    # Extract the page header
    head = head_regex.search(html).group(1)
    book = book + head
    # Extract the text of the book
    book = book + text_regex.search(html).group(1)


Save the text of the book to a file

Everything is simple here. Write the contents of the book variable to a file, followed by the closing </body> and </html> tags. This is done by the following code:

    # Open the output file
    file_out = open('book.html', 'w')
    # Write the text of the book
    file_out.write(book)
    # Write the closing tags
    file_out.write("\n</body></html>")
    # Close the file
    file_out.close()
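The same step can also be written with a `with` block, which closes the file even if a write fails. The sketch below assumes Python 3 and a `book` that has already been decoded to `str`; the function name `save_book` and the UTF-8 default are assumptions for illustration. If the header copied from the first page declares a different charset, that encoding should be passed instead.

```python
def save_book(book, path="book.html", encoding="utf-8"):
    """Write the collected HTML plus the closing tags to path."""
    # The with block closes the file even if a write raises an exception.
    with open(path, "w", encoding=encoding) as file_out:
        file_out.write(book)
        file_out.write("\n</body></html>")
```

Calling `save_book(book)` would then replace the four lines above.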


Putting the whole program together

In the final version of the program, I added two things: the replacement of the <div>...</div> tags with <p align=justify>...</p>, and the output of information about the page currently being processed. The first is because I like it when the text is justified to the width of the page. The second is so that the program shows at least some signs of life while it works.
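The tag replacement is plain string substitution, which can be tried on its own. In this small sketch the sample string is made up for illustration:

```python
book = "<div>First paragraph.</div><div>Second paragraph.</div>"

# Turn every opening tag into a justified paragraph on its own line.
book = book.replace("<div>", "\n<p align=justify>")
# Turn every closing tag into the matching paragraph close.
book = book.replace("</div>", "</p>")

print(book)
```

Note that `str.replace` is case-sensitive, so, unlike the regular expressions above (compiled with `re.IGNORECASE`), it would leave an upper-case `<DIV>` untouched.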

The full text of the program can be saved in a file with the extension .py and run. Here it is:

    import urllib
    import re

    text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE)
    head_regex = re.compile(u"(<html.+?<body>)", re.IGNORECASE | re.DOTALL | re.UNICODE)

    book = ""

    # Download first page
    url = "http://profismart.ru/web/bookreader-115980.php"
    print "Page 1"
    print url
    html = ""
    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()
    print "Page downloaded."

    # Extract page head and book text
    book = book + head_regex.search(html).group(1)
    book = book + text_regex.search(html).group(1)

    for i in range(2, 29):
        # Generate page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        print "Page %i" % i
        print url
        # Download the page
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
        print "Page downloaded."
        # Extract required text
        book = book + text_regex.search(html).group(1)

    # Replace <div> with <p align=justify>
    book = book.replace("<div>", "\n<p align=justify>")
    book = book.replace("</div>", "</p>")

    # Write to file
    file_out = open('book.html', 'w')
    file_out.write(book)
    file_out.write("\n</body></html>")
    file_out.close()

Source: https://habr.com/ru/post/139673/

