    for i in range(2, 29):
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i

Each page is downloaded into the html variable:

    for i in range(2, 29):
        # Generate page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        # Download the page
        html = ""
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
On every page the book text sits between the <td style="padding:7px" class="ps24 ps37"> and </td> tags. Each paragraph of text is wrapped in <div> and </div> tags, and bare <div> tags without any attributes are used only for the text of the book. So, to extract the text, we just need to find the first <div> and take everything from it up to the first </td>. The regular expression looks like this:

    (<div>.+?</div>)</td>

The question mark after .+ makes the match non-greedy: we take the minimum number of characters that satisfies the pattern. We are therefore looking for all the text from the first <div> tag found in the HTML code to the first (thanks to the question mark) </div></td> combination. The parentheses mark the part of the match that we keep. Here that is everything found except the trailing </td>, which we do not need. To use the regular expression in our program, add the following line at its beginning:

    text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE)
The compiled expression is stored in the variable text_regex. The text of the book is accumulated in the variable book. With the regular expression added, the code that handles pages 2-28 looks like this:

    for i in range(2, 29):
        # Generate page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        # Download the page
        html = ""
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
        # Extract required text
        book = book + text_regex.search(html).group(1)
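To see the non-greedy match in action, here is a minimal sketch run against a made-up page fragment. The sample markup and the extract_text helper are assumptions for illustration and are not part of the original program. The syntax is Python 3, whereas the article's code targets Python 2. The helper also returns an empty string when nothing matches, because search returns None in that case and calling .group(1) on it would raise AttributeError.

```python
import re

# Same pattern as in the article. re.UNICODE is omitted because it is
# already the default for str patterns in Python 3.
text_regex = re.compile(r"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL)

# Hypothetical page fragment, invented for this example.
sample_html = (
    '<table><tr><td style="padding:7px" class="ps24 ps37">'
    '<div class="nav">menu</div>'
    "<div>First paragraph.</div>\n"
    "<div>Second paragraph.</div></td>"
    "<td><div>Footer</div></td></tr></table>"
)


def extract_text(html):
    """Return the book text of one page, or "" if the pattern is not found."""
    match = text_regex.search(html)
    return match.group(1) if match else ""


print(extract_text(sample_html))
```

The <div class="nav"> block is skipped because the pattern requires a bare <div>, and the match stops at the first </div></td>, so the footer cell is left out.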
For the first page we also need the page head, which is matched by the expression

    (<html.+?<body>)

It captures everything from the <html> tag to the <body> tag inclusive. The captured code will contain some garbage, but since it does not affect how the page is displayed in a browser or on a PDA screen, we leave it as it is. If you really want to, you can delete it manually later in an editor. As with the previous expression, add this line at the beginning of the program:

    head_regex = re.compile(u"(<html.+?<body>)", re.IGNORECASE | re.DOTALL | re.UNICODE)

The first page is then processed like this:

    # Download first page
    url = "http://profismart.ru/web/bookreader-115980.php"
    html = ""
    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()
    # Extract page head
    head = head_regex.search(html).group(1)
    book = book + head
    # Extract book text
    book = book + text_regex.search(html).group(1)
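The head pattern can be checked the same way. The sketch below is Python 3 and uses an invented sample page. One thing it makes visible: the literal <body> in the pattern only matches a body tag without attributes. The loose_head_regex variant is a hypothetical alternative for pages that write <body class=...>; the article itself does not use it.

```python
import re

# Pattern from the article, applied to a str in Python 3.
head_regex = re.compile(r"(<html.+?<body>)", re.IGNORECASE | re.DOTALL)
# Hypothetical looser variant that also accepts attributes on <body>.
loose_head_regex = re.compile(r"(<html.+?<body[^>]*>)", re.IGNORECASE | re.DOTALL)

# Made-up page, for illustration only.
sample_page = (
    "<!DOCTYPE html>\n<HTML>\n<head><title>Book</title></head>\n"
    "<body>\n<div>Text</div>\n</body></html>"
)

head = head_regex.search(sample_page).group(1)
print(head)

# A body tag with attributes defeats the strict pattern but not the loose one.
page_with_attrs = '<html><head></head><body class="x"><p>hi</p>'
strict_match = head_regex.search(page_with_attrs)
loose_head = loose_head_regex.search(page_with_attrs).group(1)
```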
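All that remains is to append the closing tags for <html> and <body> to the contents of the variable book and write it to a file. This is done by the following code:

    # Open the output file
    file_out = open('book.html', 'w')
    # Write the collected book
    file_out.write(book)
    # Append the closing tags
    file_out.write("\n</body></html>")
    # Close the file
    file_out.close()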
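In current Python the same step is usually written with a with block, which closes the file even if one of the writes fails. A minimal sketch, assuming Python 3; the save_book helper and its encoding parameter are hypothetical and not part of the original program.

```python
import os
import tempfile


def save_book(book, path="book.html", encoding="utf-8"):
    """Write the collected HTML plus the closing tags to `path`."""
    with open(path, "w", encoding=encoding) as file_out:
        file_out.write(book)
        file_out.write("\n</body></html>")


# Usage: write into a temporary directory and read the result back.
out_path = os.path.join(tempfile.mkdtemp(), "book.html")
save_book("<html><body>\n<p>Text</p>", out_path)
with open(out_path, encoding="utf-8") as check:
    saved = check.read()
print(saved)
```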
The final version of the program adds two things: replacing the <div>...</div> tags with <p align=justify>...</p>, and printing information about the page that is currently being processed. The first is there because I like text justified to the width of the page. The second is so that the program shows at least some signs of life while it works.

    import urllib
    import re

    text_regex = re.compile(u"(<div>.+?</div>)</td>", re.IGNORECASE | re.DOTALL | re.UNICODE)
    head_regex = re.compile(u"(<html.+?<body>)", re.IGNORECASE | re.DOTALL | re.UNICODE)
    book = ""

    # Download first page
    url = "http://profismart.ru/web/bookreader-115980.php"
    print "Page 1"
    print url
    html = ""
    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()
    print "Page downloaded."

    # Extract page head and book text
    book = book + head_regex.search(html).group(1)
    book = book + text_regex.search(html).group(1)

    for i in range(2, 29):
        # Generate page url
        url = "http://profismart.ru/web/bookreader-115980-%i.php" % i
        print "Page %i" % i
        print url
        # Download the page
        sock = urllib.urlopen(url)
        html = sock.read()
        sock.close()
        print "Page downloaded."
        # Extract required text
        book = book + text_regex.search(html).group(1)

    # Replace <div> with <p align=justify>
    book = book.replace("<div>", "\n<p align=justify>")
    book = book.replace("</div>", "</p>")

    # Write to file
    file_out = open('book.html', 'w')
    file_out.write(book)
    file_out.write("\n</body></html>")
    file_out.close()
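For readers on Python 3, where urllib.urlopen and the print statement no longer exist, the same pipeline might look like the sketch below. It is not the author's code. The function names page_urls, fetch and build_book are invented here. Pages are kept as raw bytes because the article does not state the site's encoding, and a page where the text pattern is not found is skipped instead of crashing. The downloader is passed in as a parameter so the logic can be tried offline with a stand-in.

```python
import re
import urllib.request

BASE = "http://profismart.ru/web/bookreader-115980"
FLAGS = re.IGNORECASE | re.DOTALL
# Bytes patterns, because urlopen().read() returns bytes in Python 3.
text_regex = re.compile(rb"(<div>.+?</div>)</td>", FLAGS)
head_regex = re.compile(rb"(<html.+?<body>)", FLAGS)


def page_urls(last_page=28):
    """Page 1 has no numeric suffix; pages 2..last_page do."""
    return [BASE + ".php"] + [
        "%s-%i.php" % (BASE, i) for i in range(2, last_page + 1)
    ]


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as sock:
        return sock.read()


def build_book(urls, fetch_page=fetch):
    """Join the head of the first page with the text of every page."""
    book = b""
    for number, url in enumerate(urls, start=1):
        print("Page %i: %s" % (number, url))
        html = fetch_page(url)
        if number == 1:
            head = head_regex.search(html)
            if head:
                book += head.group(1)
        text = text_regex.search(html)
        if text is None:
            print("No book text found, skipping.")
            continue
        book += text.group(1)
    # Replace <div> with <p align=justify>, as in the original program.
    book = book.replace(b"<div>", b"\n<p align=justify>")
    book = book.replace(b"</div>", b"</p>")
    return book + b"\n</body></html>"


# To run it against the real site (needs network access):
# with open("book.html", "wb") as file_out:
#     file_out.write(build_book(page_urls()))
```

Writing the result in binary mode ("wb") keeps the bytes exactly as the server sent them, so no encoding has to be guessed.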
Source: https://habr.com/ru/post/139673/