
Downloading your message history with all VKontakte friends using Python

For a linguistic study, I needed a corpus of direct speech produced by a single person. I decided that, to begin with, it would be most convenient to use my own correspondence on VK. This article is about how to download all the messages you have ever sent to your friends using a Python program and the VKontakte API. To work with the API, we will use the vk library.

To work with the site, you need to create an application and authorize with a token. The process is not complicated and is described here and here.

So, we have the token. We import the necessary libraries (time and re will be needed later), connect to our application and get started.
    import vk
    import time
    import re

    session = vk.Session(access_token='your_token')
    vkapi = vk.API(session)

Since we want the correspondence with all friends, let's start by getting the list of friends. Processing the complete list can take quite a while, so for testing you can enter the ids of a few friends manually.

    friends = vkapi('friends.get')  # fetch the full list of friends
    # friends = [1111111, 2222222, 33333333]  # or set a few ids manually for testing

With the list of friends in hand, we could start downloading the dialogs right away, but I want to process only those dialogs that contain more than 200 messages, since short conversations with people I barely know are not very interesting to me. So we write a function that returns the "header" of a dialog.

    def get_dialogs(user_id):
        dialogs = vkapi('messages.getDialogs', user_id=user_id)
        return dialogs

This function returns the "header" of the dialog with the user whose id equals the given user_id. Its output looks like this:

[96, {'title': ' ... ', 'body': '', 'mid': 333333, 'read_state': 1, 'uid': 111111, 'date': 1490182267, 'fwd_messages': [{'date': 1490173134, 'body': ', , .', 'uid': 222222}], 'out': 0}]

The resulting list contains the number of messages (96) and the data of the last message in the dialog. Now we have everything we need to download the dialogs we want.
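As a small illustration of how the header can be used, the sketch below reads the message count from the first element of the response and compares it with the 200-message threshold. The helper name `is_long_dialog` and the `min_messages` parameter are my own, introduced only for this example; the sample data is a shortened copy of the response format shown above.

```python
def is_long_dialog(dialog_header, min_messages=200):
    """Return True if the header reports more than min_messages messages."""
    # The first element of the response is the number of messages in the dialog
    dialog_len = dialog_header[0]
    return dialog_len > min_messages


# Shortened copy of the sample response shown above
sample_header = [96, {'mid': 333333, 'uid': 111111, 'date': 1490182267, 'out': 0}]
sample_is_long = is_long_dialog(sample_header)
```

A dialog with 96 messages falls below the threshold, so it would be skipped.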

The main drawback is that VKontakte allows a maximum of about three requests per second, so after each request you need to wait for a while. This is what we need the time library for. The shortest delay I could set without being refused after several requests was 0.3 seconds.
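The article simply calls time.sleep after every request. As a variation on the same idea, here is a minimal sketch of a wrapper that keeps consecutive calls at least a given interval apart. The name `make_throttled` and its `clock` and `sleep` parameters are assumptions made for this example and are not part of the vk library.

```python
import time


def make_throttled(func, min_interval=0.3, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls are at least min_interval seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        result = func(*args, **kwargs)
        last_call = clock()
        return result

    return wrapper
```

Any callable that performs a request could be wrapped this way, which would make the explicit pauses after each call unnecessary. The clock and sleep arguments exist only so the wrapper can be tried out without real waiting.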

Another difficulty is that a single request can download at most 200 messages. We will have to work around this as well. Let's write a function.

    def get_history(friends, sleep_time=0.3):
        all_history = []
        i = 0
        for friend in friends:
            friend_dialog = get_dialogs(friend)
            time.sleep(sleep_time)
            dialog_len = friend_dialog[0]
            friend_history = []
            if dialog_len > 200:
                resid = dialog_len
                offset = 0
                while resid > 0:
                    friend_history += vkapi('messages.getHistory',
                                            user_id=friend,
                                            count=200,
                                            offset=offset)
                    time.sleep(sleep_time)
                    resid -= 200
                    offset += 200
                    if resid > 0:
                        print('--processing', friend, ':', resid,
                              'of', dialog_len, 'messages left')
            all_history += friend_history
            i += 1
            print('processed', i, 'friends of', len(friends))
        return all_history

Let's look at what is happening here.

We go through the list of friends and fetch the dialog header for each of them, then look at the length of the dialog. If it has 200 messages or fewer, we simply move on to the next friend. If it is longer, we download the first 200 messages (the count argument), add them to this friend's message history and calculate how many messages are left to download (resid). While the remainder is greater than 0, each iteration increases the offset argument by 200; offset sets how many messages to skip, counting from the end of the dialog.
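The offset logic can be tried out without touching the network. The sketch below repeats the same loop over a stand-in "server" that serves slices of a local list. `fetch_all`, `fake_fetch_page` and `fake_dialog` are hypothetical names used only for this illustration; a real response may also contain elements other than messages, which this stand-in ignores.

```python
def fetch_all(fetch_page, total, page_size=200):
    """Collect `total` items by requesting pages of at most page_size items."""
    items = []
    offset = 0
    resid = total  # how many items are still to be downloaded
    while resid > 0:
        items += fetch_page(offset=offset, count=page_size)
        resid -= page_size
        offset += page_size
    return items


# Stand-in for the server: a "dialog" of 450 numbered messages
fake_dialog = list(range(450))
requested_offsets = []


def fake_fetch_page(offset, count):
    """Return one page of the fake dialog and remember the requested offset."""
    requested_offsets.append(offset)
    return fake_dialog[offset:offset + count]


history = fetch_all(fake_fetch_page, total=len(fake_dialog))
```

For 450 messages the loop makes three requests, with offsets 0, 200 and 400, and the last page contains the remaining 50 messages.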

Because of the pause after each request, the program runs for quite a long time, so I added a short progress report at each step to show what is being processed and how much is left.

NB: The messages.get method has an out argument that asks the server to return outgoing messages only. I decided not to use it and to select the messages I need after downloading, for two reasons: a) the file will have to be cleaned anyway, since the server returns each message as a dictionary containing a lot of technical information, and b) the messages of my interlocutors may also be of interest for my research.

Each downloaded message is a dictionary and looks like this:
{'read_state': 1, 'date': 1354794668, 'body': ' !<br> .', 'uid': 111111, 'mid': 222222, 'from_id': 111111, 'out': 1}

All that remains is to clean up the result and save it to a file. This part of the work no longer involves the VK API, so I will not dwell on it in detail. There is not much to tell anyway: we select the needed element (body) for the desired user and, with the help of re, remove the line breaks, which are marked with the <br> tag. Then we save everything to a file.
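The cleaning step can be sketched on a couple of made-up messages. This is a simplified illustration rather than the exact function from the program: it selects messages by the from_id field only, and `extract_own_texts` is a name chosen for this example. The sample texts are invented; the field names follow the message format shown above.

```python
import re


def extract_own_texts(messages, self_id):
    """Return the cleaned text of every message written by self_id."""
    texts = []
    for message in messages:
        # Skip anything that is not a message dictionary, e.g. a counter
        if not isinstance(message, dict):
            continue
        if message.get('from_id') == self_id:
            # Replace the <br> line-break markers with spaces
            texts.append(re.sub(r'<br>', ' ', message['body']))
    return texts


# Made-up sample: a counter followed by two messages
sample_messages = [
    2,
    {'body': 'Hello!<br>How are you?', 'uid': 111111, 'from_id': 111111, 'out': 1},
    {'body': 'Fine, thanks', 'uid': 111111, 'from_id': 222222, 'out': 0},
]
own_texts = extract_own_texts(sample_messages, self_id=111111)
```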

The complete program code looks like this:

    import vk
    import time
    import re

    session = vk.Session(access_token='your_token')
    vkapi = vk.API(session)

    SELF_ID = 111111
    SLEEP_TIME = 0.3

    friends = vkapi('friends.get')


    def get_dialogs(user_id):
        dialogs = vkapi('messages.getDialogs', user_id=user_id)
        return dialogs


    def get_history(friends, sleep_time=0.3):
        all_history = []
        i = 0
        for friend in friends:
            friend_dialog = get_dialogs(friend)
            time.sleep(sleep_time)
            dialog_len = friend_dialog[0]
            friend_history = []
            if dialog_len > 200:
                resid = dialog_len
                offset = 0
                while resid > 0:
                    friend_history += vkapi('messages.getHistory',
                                            user_id=friend,
                                            count=200,
                                            offset=offset)
                    time.sleep(sleep_time)
                    resid -= 200
                    offset += 200
                    if resid > 0:
                        print('--processing', friend, ':', resid,
                              'of', dialog_len, 'messages left')
            all_history += friend_history
            i += 1
            print('processed', i, 'friends of', len(friends))
        return all_history


    def get_messages_for_user(data, user_id):
        self_messages = []
        for dialog in data:
            if type(dialog) == dict:
                if dialog['uid'] == user_id and dialog['from_id'] == user_id:
                    m_text = re.sub("<br>", " ", dialog['body'])
                    self_messages.append(m_text)
        print('Extracted', len(self_messages), 'messages in total')
        return self_messages


    def save_to_file(data, file_name='output.txt'):
        with open(file_name, 'w', encoding='utf-8') as f:
            print(data, file=f)


    if __name__ == '__main__':
        all_history = get_history(friends, SLEEP_TIME)
        save_to_file(all_history, 'raw.txt')
        self_messages = get_messages_for_user(all_history, SELF_ID)
        save_to_file(self_messages, 'sm_corpus.txt')

When I ran the program, I had 879 friends on VK. It took about 25 minutes to process them. The file with the raw result was 74 MB; after keeping only the text of my own messages, it was 15 MB. The resulting corpus contains about 150,000 messages, and their text takes up 3,707 pages in a Word document.

I hope this article will be useful to someone. All the methods available through the VK API are described in detail in the VK developers section.

Source: https://habr.com/ru/post/325368/

