I have long noticed that the data on social networks is stored poorly. For example, the repost you made will be empty if the author of the original entry deletes it. Recent problems with audio recordings in vk are the last straw, and I decided to save locally all the data that might be of interest
in case of a nuclear war . After searching for ready-made solutions, I did not find anything that would suit me, so a script in Python was written in a few days.
Goals
Save all you can: audio recordings, documents, wall. From the wall you need to drag off all attachments to posts, and comments with all attachments will not be superfluous either. It is necessary at least to preserve all the posts with music and comments, where friends sent good tracks
or cats . I’ll say right away that for my purposes there was no readable backup of additional information (likes, record creation time, etc.).
For the cause!
The process of creating such an application has already
been described many times in Habré, so I will not repeat all the details, I will describe the work steps in brief, and I will also say a few words about the problems. That article was not overloaded with source codes, in the end there will be a link to github.
')
Considerations in the development
- First of all, you need to get yourself an application id . It is important that the type be standalone , otherwise some vk api methods will not be available.
- You also need the id of the user, whose data will be saved. You can find your own on the settings page
- For the application to work, you need user permission, or rather, access token. There is no direct non-interactive way to get a token, you can parse the authorization page, but it's easier to ask the user to click a button in the browser and copy the url. The auth () function is responsible for this:
url = "https://oauth.vk.com/oauth/authorize?" + \ "redirect_uri=https://oauth.vk.com/blank.html&response_type=token&" + \ "client_id=%s&scope=%s&display=wap" % (args.app_id, ",".join(args.access_rights)) print("Please open this url:\n\n\t{}\n".format(url)) raw_url = raw_input("Grant access to your acc and copy resulting URL here: ") res = re.search('access_token=([0-9A-Fa-f]+)', raw_url, re.I)
- Vk api requests have a limit: no more than five per second. If you contact the server too often, it will respond with an error. This is quite convenient: by the error code you can understand that the script is working too fast, wait for a while and repeat the request.
if result[u'error'][u'error_code'] == 6:
- Periodically, the vk server requires you to solve the captcha, suspecting that the client is a bot. In general, he correctly suspects. To save the process is not interrupted, you have to ask the user to click on the link to the picture, solve the captcha and drive the answer. This is rendered to the function with the straightforward name captcha ():
print("They want you to solve CAPTCHA. Please open this URL, and type here a captcha solution:") print("\n\t{}\n".format(data[u'error'][u'captcha_img'])) solution = raw_input("Solution = ").strip() return data[u'error'][u'captcha_sid'], solution
- Links, additional information like the number of likes and server responses in JSON will be written to files, just in case.
- Text of the song is attached to some audio recordings, which also makes sense to save.
- File names may be incorrect for the file system, so you have to get rid of some characters. I did not find a ready-made “right” solution, so I had to invent a mini-bike:
result = unicode(re.sub('[^+=\-()$!#%&,.\w\s]', '_', name, flags=re.UNICODE).strip())
- Another problem with file names: may coincide, for example in the case of documents. To do this, add (n) to the file name, where n is the first number that gives a unique file name.
Continue
The reference code for the api is taken from the
article of habrauzer
dzhioev , and the handling of the situations described above is added. To be what to save (in the case of processing the wall), you must first find out the number of posts:
Further we request each post separately and we sort it
for x in xrange(args.wall_start, args.wall_end): (post, json_stuff) = call_api("wall.get", [("owner_id", args.id), ("count", 1), ("offset", x)], args) process_post(("wall post", x), post, post_parser, json_stuff)
The result of the request is a set of data in JSON, which are interpreted into standard structures for python using json.loads () from the standard library. As a result, we have a hash array, in which some fields (key-value) carry the payload, and the rest do not interest us. In order not to write with our hands what field to use for which method, let us use the power of reflection: we will look for a method whose name matches the key of interest.
for k in raw_data.keys(): try: f = getattr(self, k) keys.append(k) funcs.append(f) except AttributeError: logging.warning("Not implemented: {}".format(k)) logging.info("Saving: {} for {}".format(', '.join(keys), raw_data['id'])) for (f, k) in zip(funcs, keys): f(k, raw_data)
Parsim
Now you need to deal with the answer fields. The interesting ones are attachments, text, comments. Attachments is a list of applications for the post (audio, pictures, documents, notes), you must be able to download each type. We determine which method to process each attachment in the same way: by the type of attachment, we are looking for a method with a suitable name. Here is an example of a rocker for audio:
def dl_audio(self, data): aid = data["aid"] owner = data["owner_id"] request = "{}_{}".format(owner, aid) (audio_data, json_stuff) = call_api("audio.getById", [("audios", request), ], self.args) try: data = audio_data[0] name = u"{artist} - {title}.mp3".format(**data) self.save_url(data["url"], name) except IndexError:
Unfortunately, the recordings taken at the request of the copyright holders are no longer available; an empty response is returned for them.
But other?
Methods for processing images, text, notes, upload documents and the rest -
in github . I can only say that everything is similar to the examples given. The script also has command line arguments, it makes no sense to describe them in the article. Examples and other details are
in the readme .
Todo
I did not save the photo albums, because I have nothing important stored there, and the
kilonet code from
his article works well. The videos and notes are still not saved, it seemed to me not very necessary.
Lastly
The code is far from ideal and does not differ in the absence of crutches, but performs the task. I hope someone will be useful my craft, to save their records / documents / music, or for training.
UPD 12/18/2016
The
hiwent user says that since 12/16/2016, the vk closed the possibility to use the API for working with audio recordings. In this regard, the script functionality provided for saving audio recordings does not work. In this regard, you can try to “pretend” to the native application vk, for example, the android version, or kate mobile. For them, the ability to work with audio recordings will not disappear, although there may be different methods.