
Making a dump of the photos from a vk.com dialogue

Hello!

Yesterday I needed to download all the photos from a dialogue with one person on vk.com. There were more than 1000 of them. Doing it all by hand would obviously have been tedious and, frankly, a little insulting: I didn't take up programming only to end up doing that kind of grunt work manually. So I decided to write a script.

Python was chosen as the language. It is convenient for console scripts, it is reasonably fast, and it has the urllib module, which lets you download an image from a URL in a single call. But the main reason is that I started learning it recently and wanted to practice some more.
The script itself turned out to be small, but the process of writing it seems worth describing. I will try to comment the code generously so that readers who do not know Python can follow along too; advice and pointers from more experienced people are very welcome. So let's get started.

VKontakte does not provide an API specifically for downloading the attachments of a conversation, so the longest part was working out how vk.com itself loads the images of a dialogue. All the pictures are, of course, stored on its servers, and anyone who has a link to a picture can access it. So, to download all the photos from a dialogue, we need to collect all the links to the pictures. After some poking around, it turned out that when you click "Actions -> Show materials from the conversation", a POST request is sent to vk.com/wkview.php. The request contains the following parameters:
    act: show
    al: 1
    loc: im
    w: history<dialog_id>_photo

Here dialog_id is the value of the "sel" parameter that appears in the address bar when we open the dialog.
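
For example, if the address bar shows something like http://vk.com/im?sel=12345678 (the URL and id are made up here), the dialog_id could be pulled out with a small sketch like this:

    # a small sketch: pull the "sel" parameter out of the address-bar URL
    # the URL and id are made up, substitute your own
    import urlparse  # Python 2, like the rest of the script

    url = "http://vk.com/im?sel=12345678"
    dialog_id = urlparse.parse_qs(urlparse.urlparse(url).query)["sel"][0]
    print dialog_id  # prints: 12345678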
Having executed such a request, we get a response that looks something like this:

16515<!>wkview.js,wkview.css,page.js,page.css,page_help.css<!>0<!>6590<!>0<!><!bool><!><div id="wk_history_wrap"> <div class="wk_history_title tb_title" id="wk_history_title">    _</div> <div class="wk_history_tabs tb_tabs_wrap"> <div class="tb_tabs clear_fix" id="wk_history_tabs"><div class="progress tb_prg fl_r" id="wk_history_tabs_prg"></div><div class="fl_l summary_tab_sel"> <a class="summary_tab2" onclick="showWiki({w: 'history<dialog_id>_photo'})" > <div class="summary_tab3"> <nobr></nobr> </div> </a> </div><div class="fl_l summary_tab"> <a class="summary_tab2" onclick="showWiki({w: 'history<dialog_id>_video'})" > <div class="summary_tab3"> <nobr></nobr> </div> </a> </div><div class="fl_l summary_tab"> <a class="summary_tab2" onclick="showWiki({w: 'history<dialog_id>_audio'})" > <div class="summary_tab3"> <nobr></nobr> </div> </a> </div><div class="fl_l summary_tab"> <a class="summary_tab2" onclick="showWiki({w: 'history<dialog_id>_doc'})" > <div class="summary_tab3"> <nobr></nobr> </div> </a> </div></div> <div class="tb_tabs_sh" id="wk_history_tabs_sh"></div> </div> <div class="wall_module wide_wall_module" id="wk_history_wall"> <div class="post_media" id="wk_history_rows"><div class="page_post_sized_thumbs clear_fix" style="width: 597px; height: 1722px;"><a onclick="return showPhoto('...', 'mail...', {"temp":{"base":"/","x_":["",500,331]},queue: 1}, event);" style="width: 193px; height: 127px;" class="page_post_thumb_wrap fl_l"><img src="" width="193" height="128" style="margin-top: 0px;" class="page_post_thumb_sized_photo" /></a> ... (     )</div></div> </div> <div id="wk_history_empty" style=""> .</div> <div id="wk_history_more" class=""> <div id="wk_history_more_link" onclick="return WkView.historyShowMore();" style=""> </div> <div id="wk_history_more_progress" class="progress"></div> </div> </div><!><!json>{"count":"23318","offset":3330,"type":"history","commonClass":"wk_history_content wk_history_photo_content","wkRaw":"history<dialog_id>_photo","canEdit":false,"lang":[]}<!>WkView.historyInit();<!><!pageview_candidate> 

Here I have replaced the actual links with placeholders, since, as I already said, vk's pictures are publicly accessible and anyone who knows a link can fetch the picture behind it.

Of all this we are interested only in the links inside <img src="">, and in the json at the end. I was not entirely honest when I said the POST request takes 4 parameters. More precisely, it does accept them, but with only those four we get back just the first few photos. Since vk.com loads content as the page is scrolled, there is an offset parameter that controls which slice of the total set of photos should be loaded. As a result, the full set of query parameters looks like this:
    act: show
    al: 1
    loc: im
    w: history<dialog_id>_photo
    offset: <offset>
    part: 1

Of all the parameters, only offset will change. We pull it out of that json at the end of the response: with every request, the offset inside the json grows, telling us which "offset" to use next time. So we keep making requests for as long as offset is less than count.
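
Before diving into the full script, here is a minimal sketch of that pagination logic; fetch_page is a placeholder callable standing in for the POST request described above, not a real vk.com function:

    # a minimal sketch of the pagination loop described above;
    # fetch_page is a placeholder for the POST request to wkview.php
    def collect_links(fetch_page):
        bound = {"count": 1, "offset": 0}  # count is corrected by the first response
        links = []
        while bound["offset"] < bound["count"]:
            page_links, bound = fetch_page(bound["offset"])  # links from this chunk + fresh json
            links += page_links
        return links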

By the way, what about actually executing the requests? How do we get access to our own page? It turns out that access is granted to whoever presents a cookie called remixsid. So we just substitute this cookie into the function that performs the request and everything works. Is that safe? Not really; passing your cookie around like this is not great, but I did not find another option. If someone knows one, please write.
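
For reference, passing a cookie with the requests library looks like this; the cookie value and the <dialog_id> placeholder below are not real values:

    # a sketch of passing the remixsid cookie; "abc123" and <dialog_id> are placeholders
    import requests

    response = requests.post("http://vk.com/wkview.php",
                             cookies={"remixsid": "abc123"},
                             params={"act": "show", "al": 1, "loc": "im",
                                     "w": "history<dialog_id>_photo"})
    print response.status_code  # 200 if the request was accepted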

The general algorithm seems clear: make a request, pull out the links, write them to a file, then check whether the new offset is still less than count; if it is, run the next request with the new offset value, and if not, exit the loop. After that, go through all the links in the file and download the pictures they point to. Time to start writing code.

    # coding=utf-8
    import requests  # for executing HTTP requests
    import re        # regular expressions for parsing the response
    import sys       # command-line arguments
    import os        # creating and switching directories
    import urllib    # downloading pictures by URL
    import json      # parsing the json in the response

    # argv[1] = remixsid_cookie
    # argv[2] = dialog_id
    # argv[3] = person_name

The arguments (remixsid, dialog_id and the folder name) are passed in from the terminal, i.e. the script is invoked as python main.py <remixsid_cookie> <dialog_id> <name_of_folder>:

    remixsid_cookie = sys.argv[1]  # the access cookie

    RequestData = {
        "act": "show",
        "al": 1,
        "loc": "im",
        "w": "history" + sys.argv[2] + "_photo",
        "offset": 0,
        "part": 1
    }

    request_href = "http://vk.com/wkview.php"

    # current offset and count; count will be corrected by the first response
    bound = {"count": 10000, "offset": 0}

Create a separate folder for photos:

    try:
        os.mkdir("drop_" + sys.argv[3])  # create a folder for the photos
    except OSError:
        print "Could not create folder 'drop_" + sys.argv[3] + "'"

    if os.path.exists("drop_" + sys.argv[3]):
        os.chdir("drop_" + sys.argv[3])  # work inside that folder
    else:
        print "Failed to create the folder\n"
        exit()

Great, now let's start making the requests:

    test = open("links", "w")

    while bound['offset'] < bound['count']:
        RequestData['offset'] = bound['offset']
        # requests.post executes a POST request; the parameters go in params,
        # the cookie in cookies; .text returns the response body as a string
        content = requests.post(request_href,
                                cookies={"remixsid": remixsid_cookie},
                                params=RequestData).text

Now we start parsing the response. Everything is extracted with regular expressions. First, pull out the json and set the next offset:

        # find the json fragment containing count and offset
        json_data_offset = re.compile('\{"count":.+?,"offset":.+?\}').search(content)
        # .search returns a match object; span() gives the start and end
        # indices of the matched substring
        bound = json.loads(content[json_data_offset.span()[0]:json_data_offset.span()[1]])
        bound['count'] = int(bound['count'])    # count arrives as a string
        bound['offset'] = int(bound['offset'])  # and offset too, just in case
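
As a quick check of why the int() conversions are needed, here is the json fragment from the sample response above fed to json.loads; count comes back as a string while offset is already a number:

    # the json fragment taken from the sample response shown earlier
    import json

    sample = ('{"count":"23318","offset":3330,"type":"history",'
              '"commonClass":"wk_history_content wk_history_photo_content",'
              '"wkRaw":"history<dialog_id>_photo","canEdit":false,"lang":[]}')
    bound = json.loads(sample)
    print type(bound['count']), type(bound['offset'])  # prints: <type 'unicode'> <type 'int'>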

Now we need to extract all the links from the src attributes. We act in the same way, but use the findall method, which returns a list of all substrings that matched the regular expression:

        links = re.compile('src="http://.+?"').findall(content)

Now write everything to the file:

        for st in links:
            test.write(st[5:len(st)-1] + '\n')  # trim the wrapper, i.e. src="..."

    test.close()
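
As an illustration of these two steps together, here is the regular expression and the slice applied to a made-up image tag:

    # illustration only: the image URL here is made up
    import re

    fragment = '<img src="http://cs1234.vk.me/u1/photo.jpg" width="193" height="128" />'
    match = re.compile('src="http://.+?"').findall(fragment)[0]
    print match                  # prints: src="http://cs1234.vk.me/u1/photo.jpg"
    print match[5:len(match)-1]  # prints: http://cs1234.vk.me/u1/photo.jpg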

That is all for collecting the links. It only remains to go through the file and download everything the links point to. This is done with the urllib module, like so:

    urllib.urlretrieve(url, file_name)
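
A single concrete call, with a hypothetical URL, would look like this:

    # a single download; the URL is hypothetical
    import urllib

    urllib.urlretrieve("http://example.com/photo.jpg", "photo.jpg")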

And for our case:

    test = open("links", "r")
    file_num = 0

    for href in test:
        # the file name is just a running number
        urllib.urlretrieve(href, str(file_num))  # download the picture
        file_num += 1
        print "Downloaded " + str(file_num) + " photos\n"

    test.close()

Done! But since the script will be used from the command line, let's also write a little documentation (--help) and an error message for the case when fewer command-line arguments than needed are supplied. Add this to the beginning:

    # check len(sys.argv) before indexing so that running the script
    # with no arguments at all does not crash
    if len(sys.argv) > 1 and sys.argv[1] == '--help':
        print """
        Usage: python main.py <remixsid_cookie> <dialog_id> <name_of_folder>
        <dialog_id> is the "sel" parameter in the address line which you see when you open a dialog
        """
        exit()
    else:
        if len(sys.argv) < 4:
            print """
            Invalid number of arguments.
            Use parameter --help to know more
            """
            exit()

That's about it. Of course, there is much more that could be added: checking whether a request actually succeeded, validating the input data, automatically fetching the <dialog_id>s (for example, the first 10 dialogues), but I only wanted to describe the main points. As a result, those 1000-odd photos I needed were downloaded; it took about 2 minutes. As far as I can tell, vk.com does not impose any limits on such requests, though I suspect traffic this small simply goes unnoticed.
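
For instance, a minimal sketch of the first of those checks, reusing the request_href, remixsid_cookie and RequestData variables from the script above, might look like this (not part of the original script):

    # a sketch of a success check, not part of the original script
    response = requests.post(request_href,
                             cookies={"remixsid": remixsid_cookie},
                             params=RequestData)
    if response.status_code != 200:
        print "Request failed with status " + str(response.status_code)
        exit()
    content = response.text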

The entire working code is available on Github.

Thanks to all.

Source: https://habr.com/ru/post/244647/

