This topic will interest anyone who wants to crawl Internet sites at high speed (homemade search engines, word-frequency analyses, HTML analysis services, etc.). Threading does not deliver the needed speed here, and urllib even less so... The solution is to use asynchronous requests via libcurl.
Speed?
At 500 MHz (a very, very weak VPS) - about 100 URLs per second (100 connections, 2 processes).
On an Amazon EC2 "High-CPU Medium Instance" ($0.2/hour) - about 1200 URLs per second (300 connections, 5 simultaneous processes). A single process handles up to 660 URLs per second.
For fetching large numbers of sites and processing them further, I want to share one of my handy functions - multi_get. It is essentially a convenient wrapper around CurlMulti (libcurl), adapted from the pycurl CurlMulti example.
>>> urls = ['http://google.com/', 'http://statcounter.com/']
>>> res = {}
>>> multi_get(res, urls, num_conn=30, timeout=5, percentile=95)
>>> res['http://google.com/']
'<html><title>Google....
# res maps each URL to its downloaded HTML
This code downloads two sites over 30 connections. More precisely, over two connections, of course - I simply did not have room here to list 10,000 URLs.
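Under the hood, multi_get drives pycurl's CurlMulti interface. For orientation, here is a minimal sketch of that pattern (a stripped-down variant of pycurl's retriever-multi example, not the actual multi_get code: no timeouts, percentile, cookies or user agents, just the event loop; simple_multi_get is an illustrative name):

import pycurl
import cStringIO

def simple_multi_get(urls, num_conn=10):
    results = {}
    queue = list(urls)
    free = [pycurl.Curl() for _ in range(num_conn)]   # reusable easy handles
    m = pycurl.CurlMulti()
    num_active = 0
    while queue or num_active:
        # start new transfers while there are free handles and pending URLs
        while queue and free:
            c = free.pop()
            c.url = queue.pop(0)
            c.buf = cStringIO.StringIO()
            c.setopt(pycurl.URL, c.url)
            c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
            m.add_handle(c)
            num_active += 1
        # let libcurl do as much work as it can without blocking
        while m.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
            pass
        # harvest finished transfers and recycle their handles
        while True:
            num_q, ok_list, err_list = m.info_read()
            for c in ok_list:
                results[c.url] = c.buf.getvalue()
            for c, errno, errmsg in err_list:
                results[c.url] = '---'            # same error marker the article uses
            for c in ok_list + [e[0] for e in err_list]:
                m.remove_handle(c)
                free.append(c)
                num_active -= 1
            if num_q == 0:
                break
        m.select(1.0)   # sleep until the sockets have something to report
    return results

This only shows why the approach is fast: one process, one thread, and libcurl multiplexes hundreds of sockets while Python merely collects the finished buffers.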
Goodies and useful features:
0. With num_conn=1 the function turns into a serial (non-parallel) downloader, but keeps all the advantages listed below (cookies, user agents, hard timeouts).
1. If res already contains 'http://google.com/' with some value before the call, that address will not be downloaded (it is skipped). The point is that if res is not just an ordinary dict but something persistent (for example, backed by a file or by some SQL store), then each call downloads only the sites that have not been downloaded before (see the combined example after this list).
2. multi_get(res, urls, debug=1) - print information about download progress (console output slows things down, so it is better to turn this off in production).
3. multi_get(res, urls, percentile=95) - on a large list, 90-99% of the sites download almost instantly, but one or two will be very slow. As a result, 9990 sites fly by in a minute, say, and then you wait another minute for the remaining 10 - which kills efficiency. Hence this parameter: download the fastest 95% (or 99, 50, 75) of URLs and return without waiting for the slow ones.
4. multi_get(res, urls, timeout=5) - a per-URL timeout of 5 seconds (unlike Python's built-in socket timeout, this one always works and never hangs for no reason).
5. ... ua='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' ... - which User-Agent to present yourself as (here pretending to be Googlebot so that sites trust the crawler).
6. ... ref='http://google.com/bot.html' ... - which Referer to send.
... ref_dict={'http://google.com/': 'http://mysite.com/?url-google.com'} - a dict specifying, per URL, which Referer to send for it.
7. ... cf='cookiefile.txt' ... - use cookies and store them in this file.
8. ... follow=0 ... - do not follow redirects (the default). If you are indexing, say, all .com domains, many of them redirect to the same place, so it is best to simply ignore redirects.
9. res does not have to be a dict: you can, for example, define a class MyDict with def __setitem__(self, url, html): and process the HTML asynchronously, right during the download, without waiting for the multi_get call to finish; just also define def keys(self): return [] - returning an empty list, or a list of URLs that should not be downloaded (a sketch follows right after this list).
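To tie items 0-8 together, a typical call might look like the sketch below (keyword names as listed above; check multi_get.py itself in case any differ):

res = {}
res['http://google.com/'] = '<html>already have this one</html>'   # pre-seeded, so it is skipped (item 1)

multi_get(res, urls,
          num_conn=100,    # parallel connections (item 0: 1 = serial mode)
          timeout=5,       # hard per-URL timeout in seconds (item 4)
          percentile=95,   # return once the fastest 95% are done (item 3)
          debug=0,         # progress output off for production (item 2)
          ua='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',  # item 5
          ref='http://google.com/bot.html',   # item 6
          cf='cookiefile.txt',                # item 7: cookie file
          follow=0)                           # item 8: ignore redirects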
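And a sketch of item 9 - processing pages as they arrive instead of waiting for multi_get to return (process_html here is a placeholder for your own handler):

class MyDict(dict):
    def __setitem__(self, url, html):
        process_html(url, html)             # handle the page immediately, mid-download
        dict.__setitem__(self, url, html)   # optionally still keep the result
    def keys(self):
        return []                           # or a list of URLs that must NOT be downloaded

res = MyDict()
multi_get(res, urls, num_conn=100)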
Code
Unfortunately, Habr kills whitespace (indentation), without which the python code will not work, so the code is here:
rarestblog.com/py/multi_get.py.txt (or
rarest.s3.amazonaws.com/multi_get.py.txt )
The code also includes an example that makes 10 YQL queries to collect 1000 random links, downloads 80% of them, and measures the speed.
IMPORTANT NOTES
You will need to install pycurl (pycurl.sourceforge.net/download):
> easy_install pycurl
If easy_install is not installed, then first:
> python -m urllib http://peak.telecommunity.com/dist/ez_setup.py | python - -U setuptools
and then run the easy_install line above.
The test script additionally needs:
> easy_install cjson
(you can optionally replace cjson.decode with simplejson.loads - if you understand why)
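If you prefer the simplejson route, a hedged sketch of the swap (assuming the test script only uses cjson for decoding):

try:
    import cjson
    json_decode = cjson.decode
except ImportError:
    import simplejson
    json_decode = simplejson.loads   # same job, pure-Python fallback

# ...then call json_decode(...) wherever the script calls cjson.decode(...)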
Installing c-ares under Linux / FreeBSD
Under Windows everything is fine (the .exe installer already includes compiled c-ares support), but on a server (Linux/FreeBSD) you will need to build in c-ares support (asynchronous DNS queries), otherwise pycurl/multi_get becomes about ten times slower - without c-ares you cannot use more than 20-30 connections.
# wget http://curl.haxx.se/download/curl-7.19.4.tar.gz
# tar zxvf curl-7.19.4.tar.gz
# cd curl-7.19.4
Linux: # ./configure --enable-ares --with-ssl --enable-ipv6 --with-libidn
FreeBSD: # ./configure --enable-ares=/usr/local --with-ssl --enable-ipv6 --with-libidn
"--with-ssl --enable-ipv6 --with-libidn" - .
# make
# make install
[Linux only] Replace the old system libcurl with symlinks to the newly built one:
# rm -rf /usr/lib/libcu*
# ln -s /usr/local/lib/libcurl.so.4 /usr/lib/libcurl.so.4
# ln -s /usr/local/lib/libcurl.so.4 /usr/lib/libcurl.so
# ldconfig
[end of the Linux-only part]
# cd ..
# rm -rf curl-7*
# python -c "import pycurl;print pycurl.version"
The printed version string should mention c-ares.
Anti-DoS
Since the script is rather dumb but powerful, you can accidentally start DoS-ing somebody's website. To avoid that, a small function reduce_by_domain is included: it shrinks the list so that only one URL per domain remains - a precaution so you do not take someone's site down.
short_list_of_urls = reduce_by_domain(urls)
How do you download all the URLs without killing the sites? Call reduce_by_domain and multi_get several times in a row. Remember that as long as res is not cleared, the same URLs will not be downloaded twice (see item 1 in "Goodies and useful features"), so all that remains is to remove the already-downloaded URLs from the list and run short_list_of_urls = reduce_by_domain(urls); multi_get(res, short_list_of_urls) again (see the sketch below).
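A sketch of that loop (a naive illustration, with a safety cap so it cannot spin forever if some URLs never yield a result):

res = {}
remaining = list(urls)
for _ in range(100):                 # safety cap on the number of passes
    if not remaining:
        break
    batch = reduce_by_domain(remaining)        # at most one URL per domain per pass
    multi_get(res, batch, num_conn=100, timeout=5)
    # anything that now has a value in res (including '---' errors) is done
    remaining = [u for u in remaining if u not in res]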
More nuances:
Erroneous URLs come back with the value "---".
Files larger than 100,000 bytes will not be downloaded.
.pdf files will not be downloaded.
Both are precautionary measures, so that you do not index what you do not need (images, .pdf); everything is easy to change in the function code.
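So when walking the results, post-processing might look like this sketch (index_page is a placeholder for your own handler):

for url, html in res.items():
    if html == '---':
        continue              # download error: skip it, or schedule a retry
    index_page(url, html)     # your own processing / indexing here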

Yoi Haji
view from Habra