
Instagram Analytics and GAE



Some time ago, an article about finding similar accounts on Twitter was published on Habr. Unfortunately, the author did not respond to the comments, so I had to reinvent the wheel. To avoid doing exactly the same thing, I decided to look for similar accounts on Instagram instead, using Google App Engine, so that anyone could use the service. That is how instalytics.ru * came about.


The most difficult part, of course, was making the service available to everyone (that is, staying within the free quotas of Google App Engine and respecting the limits of the Instagram API).
Everything works as follows:
  1. A user's request to analyze an account is first checked against Instagram - if the specified user is found, the analysis request is added to the database. Each request also has its own priority (for now, all requests have the same priority).
  2. Once every 15 minutes, a cron job selects one request from the queue in the database and creates a new task to fetch all of that user's followers. If the task fails, it is retried:

    ```yaml
    - name: followers-get
      rate: 1/m  # 1 task per minute
      bucket_size: 1
      max_concurrent_requests: 1
      retry_parameters:
        task_retry_limit: 2
        min_backoff_seconds: 30
    ```

    If not all followers were fetched in a single request, the task creates a new one:

    ```python
    if users and users.get('pagination') and users.get('pagination').get('next_cursor'):
        cursor = users.get('pagination').get('next_cursor')
        url = '/task/followers-get?user_id=' + user_id
        url += '&cursor=' + cursor
        taskqueue.add(queue_name='followers-get', url=url, method='GET')
    ```
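
    The 15-minute schedule itself is declared in GAE's cron.yaml. A minimal sketch - the handler URL `/cron/process-queue` is illustrative, the real one is not shown in the article:

    ```yaml
    cron:
    - description: pick the next analysis request from the queue
      url: /cron/process-queue
      schedule: every 15 minutes
    ```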

  3. Once all followers have been fetched, the analysis itself begins. A huge number of tasks is created to get, for each follower, the list of users that follower is subscribed to (each task can spawn new tasks, just as with the followers above). To stay within Instagram's limit of 5,000 requests per hour, the task queue is configured as follows:

    ```yaml
    - name: subscriptions-get
      rate: 5000/h
    ```

    In addition, just in case, we sleep for 0.72 seconds (= 60 * 60 / 5000) after each completed request.
    Unfortunately, the free tier of Google App Engine allows only 50,000 datastore writes per day. Since each task can create a new task, the initial approach - writing each task's result to the database - had to be replaced: the result of the previous task is passed as a parameter to the next task, and only the last task writes the result to the database:

    ```python
    if users and users.get('pagination') and users.get('pagination').get('next_cursor'):
        cursor = users.get('pagination').get('next_cursor')
        params = {
            'user_id': user_id,
            'f_user_id': f_user_id,
            'cursor': cursor,
        }
        if more_subscriptions:
            params['subscriptions'] = ','.join(more_subscriptions)
        taskqueue.add(queue_name='subscriptions-get',
                      url='/task/subscriptions-get',
                      params=params, method='POST')
    ```

    Some users (such as @instagram, for example) have millions of followers. In order not to waste precious resources fetching all of them, the task stops after receiving 100,000 followers.
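
    The per-request delay that keeps us under the hourly limit can be computed directly from the rate; a minimal sketch (the function name is illustrative):

    ```python
    import time

    REQUESTS_PER_HOUR = 5000  # Instagram's hourly API limit

    def throttle():
        # Sleep after each API call so that, even if the task queue
        # dispatches slightly faster than configured, the total stays
        # under the hourly limit.
        delay = 60 * 60 / float(REQUESTS_PER_HOUR)  # 0.72 seconds
        time.sleep(delay)
    ```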
  4. Because of the limit on database writes, it is not possible to properly track whether all tasks for a given user have completed. The proper solution would be to store the list of running tasks in the database and, at the end of each task (or after its last retry attempt), remove it from the list. But the huge number of tasks multiplied by all users makes that impossible. Therefore, the task list is stored in memcache:
    ```python
    memcache.set('subscriptions' + str(user_id),
                 ','.join(str(x) for x in followers),
                 1209600)  # 2 weeks, in seconds
    ```

    Data in memcache can be evicted at any time. To avoid a "stuck" request (all of its tasks completed, but memcache was evicted, so we never find out), a task runs every few hours and checks for requests that entered the follower-fetching stage more than 2 weeks ago (for now this is assumed to be enough time for all tasks to complete). Any such requests are forcibly moved to the next stage.
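
    The bookkeeping on that memcache value boils down to string manipulation around `memcache.get`/`memcache.set`; the helper below is an illustrative sketch, not the service's actual code:

    ```python
    def remove_finished(task_list_csv, follower_id):
        # Drop the finished follower id from the comma-separated task list
        # (the value kept in memcache) and report whether the stage is done.
        remaining = [x for x in task_list_csv.split(',') if x != str(follower_id)]
        return ','.join(remaining), not remaining

    print(remove_finished('101,102,103', 102))  # ('101,103', False)
    print(remove_finished('103', 103))          # ('', True) - all tasks done
    ```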
  5. At the next stage, all the previously collected data is read from the database. As it turned out, there can be quite a lot of it, and the memory GAE allocates may not be enough. Therefore, the data is read in chunks; an intermediate result is computed for each chunk and folded into the running total. Along the way, I had to disable ndb's automatic caching:

    ```python
    ctx = ndb.get_context()
    ctx.set_cache_policy(lambda key: False)
    ctx.set_memcache_policy(lambda key: False)
    ```

    After much computation at this stage, the 300 most popular accounts among those the analyzed account's followers are subscribed to are selected.
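
    The chunked aggregation can be sketched with a `Counter`; a minimal illustration of the idea, not the service's actual code:

    ```python
    from collections import Counter

    def top_popular(chunks, limit=300):
        # Fold each chunk of subscription lists into a running Counter,
        # so the full data set never has to sit in memory at once.
        totals = Counter()
        for chunk in chunks:                 # chunk: list of subscription lists
            partial = Counter()
            for subscriptions in chunk:
                partial.update(subscriptions)
            totals.update(partial)
        return [user for user, _ in totals.most_common(limit)]
    ```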
  6. For each of the 300 accounts, tasks are launched to fetch their details (name, picture, number of followers, etc.). As in the process described above, either the completion of all tasks is awaited, or the next stage is forced after some time.
  7. At the last stage, the most similar accounts are computed and selected (taking into account how many of your followers follow them and how many followers they have in total). The result comes out something like this; a link to it is sent by e-mail.
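
    The article does not give the exact formula, but a score of the following shape would match the description - the share of the analyzed account's followers who follow the candidate, discounted by the candidate's total audience size. This is purely an illustrative guess:

    ```python
    def similarity(shared, base_total, candidate_total):
        # shared:          followers of the analyzed account who also
        #                  follow the candidate
        # base_total:      total followers of the analyzed account
        # candidate_total: total followers of the candidate, so that giant
        #                  accounts don't win purely by being popular
        if not base_total or not candidate_total:
            return 0.0
        return (float(shared) / base_total) * (float(shared) / candidate_total)
    ```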

The approaches and optimizations above have so far kept the service within GAE's free quotas, although producing a result takes a long time. I need your help - add your accounts to the queue, and let's see how long their analysis takes.

In the future I plan to add recognition of real people / companies on Instagram to the service, but that will require machine learning - so it will be a separate task.

* The Russian version of the site does not work yet - I cannot figure out why Django translation does not work on GAE.

Source: https://habr.com/ru/post/276237/

