
Four methods for downloading images from a website using Python

Recently I had to write a simple parser in Python that downloads images from a website and saves them to disk (in principle, the same parser can download files of other formats, not only images). I found four methods on the Internet and decided to collect them all in this article.

These methods are:

1st method

The first method uses the urllib module (or urllib2). Suppose img holds the URL of some image. The method is as follows:
    import urllib

    resource = urllib.urlopen(img)
    out = open("...\img.jpg", 'wb')
    out.write(resource.read())
    out.close()


Note that the file must be opened in binary write mode, 'wb', and not just 'w'.
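
For readers on Python 3, where urlopen lives in urllib.request rather than in urllib itself, the first method might look roughly like the sketch below. The function name download_image and its parameters are illustrative and do not come from the original listing; the with blocks are only there so that the connection and the file are closed even if an error occurs.

```python
import shutil
import urllib.request


def download_image(img_url, out_path):
    """Save the resource at img_url to out_path (hypothetical helper)."""
    # urlopen returns a file-like response object.
    with urllib.request.urlopen(img_url) as resource:
        # 'wb' opens the file in binary mode, so the bytes are written
        # unchanged; a text-mode file ('w') would raise TypeError on bytes.
        with open(out_path, "wb") as out:
            shutil.copyfileobj(resource, out)
```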

2nd method

The second method uses the same urllib module. As shown later, it is slightly slower than the first one (though whether such a small difference in speed really counts against it is debatable), but it deserves attention for its brevity:

    import urllib

    urllib.urlretrieve(img, "...\img.jpg")


It is also worth noting that, for reasons unknown to me (perhaps someone can explain), the urlretrieve function is missing from the urllib2 library.
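
As a point of reference, Python 3 reorganized urllib and urllib2, and there urlretrieve is available as urllib.request.urlretrieve (the documentation lists it among the legacy interfaces). A minimal sketch of the second method under that assumption, with save_image as a hypothetical wrapper name:

```python
import urllib.request


def save_image(img_url, out_path):
    """Download img_url straight to out_path and return the saved path."""
    # urlretrieve returns a (filename, headers) tuple.
    filename, _headers = urllib.request.urlretrieve(img_url, out_path)
    return filename
```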

3rd method

The third method uses requests. Its image download speed is of the same order as that of the first two methods:

    import requests

    p = requests.get(img)
    out = open("...\img.jpg", "wb")
    out.write(p.content)
    out.close()

In addition, requests is generally recommended over the urllib and httplib families for working with the web in Python, because it is more concise and easier to use.

4th method

The fourth method differs from the previous ones in speed by an order of magnitude. It is based on the httplib2 module:

    import httplib2

    h = httplib2.Http('.cache')
    response, content = h.request(img)
    out = open('...\img.jpg', 'wb')
    out.write(content)
    out.close()


Here caching is enabled explicitly by passing the cache directory '.cache'. Without caching (h = httplib2.Http()), the method runs 6-9 times slower than the previous three.
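
To illustrate why a cache changes the timing so much, here is a toy on-disk cache written with the Python 3 standard library. It is not how httplib2 works internally: httplib2 takes HTTP caching headers into account, while this sketch never revalidates anything. The name cached_fetch and the key scheme are assumptions made for the example.

```python
import hashlib
import os
import urllib.request


def cached_fetch(url, cache_dir=".cache"):
    """Return the body at url, reusing a copy saved by an earlier call."""
    os.makedirs(cache_dir, exist_ok=True)
    # Derive a filesystem-safe cache key from the URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, key)
    if os.path.exists(cache_path):
        # Cache hit: read from disk, no request is made.
        with open(cache_path, "rb") as cached:
            return cached.read()
    # Cache miss: fetch the resource and store it for next time.
    with urllib.request.urlopen(url) as resource:
        content = resource.read()
    with open(cache_path, "wb") as cached:
        cached.write(content)
    return content
```

Only the first call for a given URL makes a request; later calls are served from disk. If a cache is already populated when the timing starts, the measurement may reflect disk reads rather than downloads.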

Speed was tested by downloading images with the *.jpg extension from the news site lenta.ru. The images matching this criterion were selected, and the execution time was measured, as follows:

    import re, time, urllib2

    url = "http://lenta.ru/"
    content = urllib2.urlopen(url).read()
    imgUrls = re.findall('img .*?src="(.*?)"', content)

    start = time.time()
    for img in imgUrls:
        if img.endswith(".jpg"):
            """implementation of the image download method by url"""
    print time.time()-start
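
The same harness can be sketched for Python 3 as a pair of small functions. The names find_jpg_urls and time_downloads are hypothetical, the regular expression is the one from the listing above, and download stands for whichever of the four methods is being measured.

```python
import re
import time


def find_jpg_urls(html):
    """Return the src values of img tags that end in .jpg."""
    img_urls = re.findall(r'img .*?src="(.*?)"', html)
    return [u for u in img_urls if u.endswith(".jpg")]


def time_downloads(img_urls, download):
    """Call download(url) for every URL and return the elapsed seconds."""
    start = time.perf_counter()
    for img in img_urls:
        download(img)
    return time.perf_counter() - start
```

Passing the download method in as a function keeps the selection and timing code identical for all four methods, so only the download step varies between runs.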


The constantly changing set of images on the site did not affect the fairness of the measurements, since the methods were run one right after another. The results are as follows:

Method speed comparison table
Method 1, s    Method 2, s    Method 3, s    Method 4, s (without caching, s)
0.823          0.908          0.874          0.089 (7.625)

Each value is the average of seven measurements.
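
As a quick check on the "order of magnitude" and "times slower" claims, the averages in the table can be turned into ratios. The variable names below are illustrative; the numbers are copied from the table.

```python
# Averaged timings in seconds, copied from the comparison table.
timings = {"Method 1": 0.823, "Method 2": 0.908, "Method 3": 0.874}
cached = 0.089     # Method 4 with caching
uncached = 7.625   # Method 4 without caching

# How many times faster the cached Method 4 is than each other method.
speedup = {name: round(t / cached, 1) for name, t in timings.items()}
# How many times slower the uncached Method 4 is than each other method.
slowdown = {name: round(uncached / t, 1) for name, t in timings.items()}

print(speedup)   # {'Method 1': 9.2, 'Method 2': 10.2, 'Method 3': 9.8}
print(slowdown)  # {'Method 1': 9.3, 'Method 2': 8.4, 'Method 3': 8.7}
```

On these averages the cached variant is roughly 9 to 10 times faster than the first three methods, and the uncached variant roughly 8 to 9 times slower.
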
Readers who have worked with the Grab library (or others) are invited to post in the comments a similar method for downloading images with it.

Source: https://habr.com/ru/post/210238/

