
The best posts from social networks

Hello. In my free time I work on social-media projects. My friends and I run a fair number of public pages on different social networks, which lets us carry out all kinds of experiments. Finding relevant content and news to publish is a constant problem. This led to the idea of writing a service that collects posts from the most popular pages and serves them according to a specified filter. For the initial test I chose VKontakte and Twitter.

Technology


First of all, I had to choose a data store (by the way, the number of saved records is already over 2 million, and that figure grows every day). The requirements were: very frequent insertion of large amounts of data and fast queries over it.

I had already heard about NoSQL databases and wanted to try one. I won't describe the comparison I ran (MySQL vs SQLite vs MongoDB) in this article.
For caching I chose memcached; later I will explain why and in which cases.
The data collector is a Python daemon that updates all groups from the database concurrently.
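The daemon itself never appears in the article in full; purely for illustration, here is a rough skeleton of what such a collector might look like (the structure and the 60-second pause are my assumptions, not the author's code):

    # A rough sketch of the collector daemon: one worker thread per group,
    # refreshed in an endless loop. Details are assumptions, not the
    # article's actual code.
    import threading
    import time

    def update_group(group_id):
        pass  # fetch new posts via the VK API, filter them, save to MongoDB
              # (a fuller sketch of this step appears later in the article)

    def daemon_loop(db):
        while True:
            threads = [threading.Thread(target=update_group, args=(g['id'],))
                       for g in db.groups.find()]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            time.sleep(60)  # pause between update rounds (assumed interval)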

MongoDB and the daemon


First of all, I wrote a prototype of the collector that gathers posts from groups, and ran into several problems:

A single post with all its metadata takes about 5-6 KB, and an average group has about 20,000-30,000 posts, which works out to roughly 175 MB of data per group, and there are a lot of these groups. So filtering out uninteresting and advertising posts became a necessity.
I didn't have to invent much: there are only two "tables", groups and posts. The first stores the groups that need to be parsed and updated, and the second holds all posts of all groups. In hindsight this looks like an unnecessary and even poor decision; it would have been better to create a table per group, which would make selecting and sorting records easier, although even with 2 million records query speed has not suffered. On the other hand, this approach does simplify queries across all groups.
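For illustration, here is a minimal sketch of what the two collections might look like with pymongo; the database name and any fields not mentioned in the article (gid, likes, created, attachments, text) are my assumptions:

    # A minimal sketch of the two collections, assuming pymongo.
    from datetime import datetime
    from pymongo import MongoClient, DESCENDING

    db = MongoClient()['publics']          # database name is an assumption

    # "groups": the communities that have to be parsed and kept up to date
    db.groups.insert_one({
        'id': 42,                          # VK group id
        'name': 'example_public',          # illustrative field
        'last_update': datetime.utcnow(),
    })

    # "posts": every post of every group lives in this one collection
    db.posts.insert_one({
        'gid': 42,                         # which group the post belongs to
        'likes': 120,
        'reposts': 15,
        'comments': 3,
        'text': 'post text',
        'attachments': [],
        'created': datetime.utcnow(),
    })

    # an index on (gid, created) keeps per-group, per-period queries fast
    db.posts.create_index([('gid', DESCENDING), ('created', DESCENDING)])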

API


When you need server-side processing of data from VKontakte, you create a standalone application that can issue a token for any action. For such cases I keep a note with the following URL:

oauth.vk.com/authorize?client_id=APP_ID&redirect_uri=https://oauth.vk.com/blank.html&response_type=token&scope=groups,offline,photos,friends,wall


Substitute the identifier of your standalone application for APP_ID. The generated token gives access to the listed scopes at any time.
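After you allow access, the browser ends up on blank.html with the token in the URL fragment. A tiny helper like this (my own convenience snippet, not part of the VK API) can pull it out:

    # Paste the blank.html URL from the browser's address bar and extract
    # the access_token from its fragment. The example token is fake.
    try:
        from urllib.parse import urlparse, parse_qs   # Python 3
    except ImportError:
        from urlparse import urlparse, parse_qs       # Python 2

    def token_from_redirect(url):
        fragment = urlparse(url).fragment   # "access_token=...&expires_in=0&user_id=..."
        return parse_qs(fragment).get('access_token', [None])[0]

    url = 'https://oauth.vk.com/blank.html#access_token=ABCDEF&expires_in=0&user_id=1'
    print(token_from_redirect(url))         # -> ABCDEF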

The parser algorithm is this:
take the group id, fetch all of its posts in a loop, filter out the "bad" posts at each iteration, and save the rest to the database.
The main problem is speed. The VKontakte API allows 3 requests per second, and one request returns at most 100 posts, i.e. 300 posts per second.
For the initial parse this is not too bad: a group can be downloaded in about a minute. The updates are the problem: the more groups there are, the longer an update cycle takes, and the less fresh the output becomes.

The solution was the execute method, which lets you batch several API calls into a single request and run them at once. This way, in one request I do 5 iterations and get 500 posts, i.e. 1500 per second, which brings a full group download to roughly 13 seconds.

Here is the code that is passed to execute:
    var groupId = -|replace_group_id|;
    var startOffset = |replace_start_offset|;
    var it = 0;
    var offset = 0;
    var walls = [];
    while (it < 5) {
        var count = 100;
        offset = startOffset + it * count;
        walls = walls + [API.wall.get({"owner_id": groupId, "count": count, "offset": offset})];
        it = it + 1;
    }
    return { "offset": offset, "walls": walls };


The code is read into memory and the placeholders replace_group_id and replace_start_offset are substituted. As a result I get an array of posts, the format of which is described on the official VK API page vk.com/dev/wall.get
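For context, this is roughly how the daemon might substitute the placeholders and call execute over HTTP; the template file name, the token and the API version are assumptions on my part:

    # A sketch, not the article's exact code: read the VKScript template,
    # substitute the placeholders and run it via the execute method.
    import requests

    ACCESS_TOKEN = '...'   # the offline token obtained above

    def fetch_batch(group_id, start_offset):
        with open('wall_batch.vkscript') as f:          # hypothetical file name
            code = f.read()
        code = code.replace('|replace_group_id|', str(group_id)) \
                   .replace('|replace_start_offset|', str(start_offset))
        resp = requests.get('https://api.vk.com/method/execute', params={
            'code': code,
            'access_token': ACCESS_TOKEN,
            'v': '5.21',                                # API version is an assumption
        })
        return resp.json()['response']                  # {'offset': ..., 'walls': [...]}

    batch = fetch_batch(12345, 0)   # 5 x 100 = 500 posts in a single request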

The next stage is filtering. I looked through posts from different groups and came up with possible screening rules. First of all, I decided to drop every post that links to an external page: it is almost always advertising.

    urls1 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    urls2 = re.findall(ur"[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)", text)
    if urls1 or urls2:
        # the post contains a link, drop it
        return False


Then I decided to eliminate reposts entirely: they are advertising 99% of the time, since few people simply repost someone else's page. The check for a repost is very simple:

    if item['post_type'] == 'copy':
        return False


Here item is one element of the walls array returned by the execute method.

I also noticed that many very old posts are empty: they have no attachments and no text. For the filter it is enough to check that item['attachments'] and item['text'] are both empty.
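In code it is literally one condition (a sketch in the same style as the checks above):

    # drop "empty" posts: no attachments and no text
    if not item.get('attachments') and not item.get('text'):
        return False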

And the last filter, which I arrived at only with time:
    yearAgo = datetime.datetime.now() - datetime.timedelta(days=200)
    createTime = datetime.datetime.fromtimestamp(int(item['date']))
    if createTime <= yearAgo and not attachments and len(text) < 75:
        # an old post with little text and no surviving attachments, drop it
        return False


As in the previous case, many old posts had text (the description of the picture in the attachment), but the pictures themselves are no longer preserved.
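Taken together, the checks above might be combined into a single predicate like this (a sketch; the URL check here is deliberately simplified compared to the regexes shown earlier):

    # Sketch: the per-post filters combined into one predicate.
    import re
    import datetime

    URL_RE = re.compile(r'https?://\S+')   # simplified link detection

    def is_good_post(item):
        text = item.get('text', '')
        attachments = item.get('attachments')
        # 1. external links: almost always advertising
        if URL_RE.search(text):
            return False
        # 2. reposts: advertising in 99% of cases
        if item.get('post_type') == 'copy':
            return False
        # 3. completely empty posts
        if not attachments and not text:
            return False
        # 4. old posts whose attachments have not survived
        yearAgo = datetime.datetime.now() - datetime.timedelta(days=200)
        created = datetime.datetime.fromtimestamp(int(item['date']))
        if created <= yearAgo and not attachments and len(text) < 75:
            return False
        return True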

The next step was to clean out failed posts that simply "didn't take off":
    db.posts.aggregate(
        { $match : { gid : GROUP_ID } },
        { $group : { _id : "$gid", average : { $avg : "$likes" } } }
    )


This query runs against the posts collection, which has a likes field (the number of likes on a post), and returns the average number of likes for the group.
Now you can simply delete all posts older than 3 days that have fewer likes than this average:
    db.posts.remove(
        { 'gid' : groupId, 'created' : { '$lt' : removeTime }, 'likes' : { '$lt' : avg } }
    )


    removeTime = datetime.datetime.now() - datetime.timedelta(days=3)
    avg = ...  # the average number of likes for the group, obtained from the aggregation above
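Put together in the daemon, the cleanup step might look roughly like this (a pymongo sketch; delete_many is the modern equivalent of the remove call above):

    # Sketch: compute the group's average likes and delete stale posts
    # that never reached it.
    import datetime

    def cleanup_group(db, group_id):
        pipeline = [
            {'$match': {'gid': group_id}},
            {'$group': {'_id': '$gid', 'average': {'$avg': '$likes'}}},
        ]
        stats = list(db.posts.aggregate(pipeline))
        if not stats:
            return                         # nothing saved for this group yet
        avg = stats[0]['average']
        removeTime = datetime.datetime.now() - datetime.timedelta(days=3)
        db.posts.delete_many({
            'gid': group_id,
            'created': {'$lt': removeTime},
            'likes': {'$lt': avg},
        })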


The filtered posts are then added to the database, and that is where the parsing ends. The only difference between the initial parse and a group update is that the update makes exactly one call per group, i.e. I fetch only the latest 500 posts (5 x 100 via execute). In general this is quite enough, given that VKontakte limits groups to 200 posts per day.
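The difference in code is minimal; roughly like this (a sketch built on the fetch_batch helper assumed earlier, and on the v5 response format where each wall.get result carries an items list):

    # Sketch: a full parse keeps paging until the group is exhausted,
    # an update grabs only the 500 newest posts in a single execute call.
    def parse_group(group_id):
        offset = 0
        while True:
            batch = fetch_batch(group_id, offset)
            posts = [p for wall in batch['walls'] for p in wall.get('items', [])]
            if not posts:
                break
            save_filtered(posts)           # hypothetical helper: filter + insert
            offset += 500

    def update_group(group_id):
        batch = fetch_batch(group_id, 0)   # just the newest 500 posts
        posts = [p for wall in batch['walls'] for p in wall.get('items', [])]
        save_filtered(posts)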

Front-end

I won't go into great detail here: JavaScript + jQuery + Isotope + inview + Mustache.


Filter publications by group

A simple PHP script was written to output data for a group.
Below is a helper function that turns a time-filter name into an object that can be used directly in the query.
    function filterToTime($timeFilter) {
        $mongotime = null;
        if ($timeFilter == 'year')
            $mongotime = new MongoDate(strtotime("-1 year", time()));
        else if ($timeFilter == 'month')
            $mongotime = new MongoDate(strtotime("-1 month", time()));
        else if ($timeFilter == 'week')
            $mongotime = new MongoDate(strtotime("-1 week", time()));
        else if ($timeFilter == 'day')
            $mongotime = new MongoDate(strtotime("midnight"));
        else if ($timeFilter == 'hour')
            $mongotime = new MongoDate(strtotime("-1 hour"));
        return $mongotime;
    }


And the following code fetches the 15 best posts of a group for the past week:
    $groupId = 42;                      // group id
    $mongotime = filterToTime('week');
    $offset = 1;                        // page number
    $numPosts = 15;                     // posts per page
    $findCondition = array('gid' => $groupId, 'created' => array('$gt' => $mongotime));
    $mongoHandle->posts->find($findCondition)
        ->sort(array('likes' => -1))    // "best" = most liked
        ->limit($numPosts)
        ->skip($offset * $numPosts);


Index page logic

Group statistics are interesting to watch, but it is much more interesting to build an overall rating of absolutely all groups and their posts. If you think about it, the task is very difficult:
We can build a rating from only 3 factors: the number of likes, reposts and subscribers. More subscribers means more likes and reposts, but that does not guarantee the quality of the content.

Million-subscriber groups often publish whatever garbage has been circulating the Internet for years, and among a million subscribers there will always be people to like and repost it.
It is easy to build a rating from raw numbers, but the result could hardly be called a rating of posts by quality and uniqueness.
I had ideas for deriving a quality factor for each group: build a timeline, look at user activity in each time interval, and so on.
Unfortunately, I did not come up with an adequate solution. If you have any ideas, I would be glad to hear them.

The first thing I realized was that the contents of the index page have to be computed once and cached for all users, because it is a very slow operation. This is where memcached comes to the rescue. For the simplest logic, the following algorithm was chosen:
  1. Loop through all groups.
  2. Take all posts of the i-th group and pick the 2 best ones for the given period of time.


As a result there will be no more than 2 posts from any one group. Of course, this is not the most rigorous result, but in practice it gives good statistics and relevant content.

This is what the code of the thread looks like that regenerates the index page once every 15 minutes:

    # timeDelta  - time boundary of the period (hour, day, week, year, alltime)
    # filterType - likes, reposts, comments
    # deep       - page number: 0, 1, ...
    def _get(self, timeDelta, filterTime, filterType='likes', deep=0):
        groupList = groups.find({}, {'_id' : 0})
        allPosts = []
        allGroups = []
        for group in groupList:
            allGroups.append(group)
            postList = db['posts'].find({'gid' : group['id'], 'created' : {'$gt' : timeDelta}}) \
                .sort(filterType, -1).skip(deep * 2).limit(2)
            for post in postList:
                allPosts.append(post)
        result = { 'posts' : allPosts[:50], 'groups' : allGroups }
        # convert datetime/mongotime values to timestamps so the result can be dumped to json
        dthandler = lambda obj: (time.mktime(obj.timetuple())
                                 if isinstance(obj, datetime.datetime) or isinstance(obj, datetime.date)
                                 else None)
        jsonResult = json.dumps(result, default=dthandler)
        key = 'index_' + filterTime + '_' + filterType + '_' + str(deep)
        print 'Setting key: ',
        print key
        self.memcacheHandle.set(key, jsonResult)


The filters that affect the output are:
Time: hour, day, week, month, year, all time
Type: likes, reposts, comments

Boundary objects were generated for each of these periods:
    hourAgo = datetime.datetime.now() - datetime.timedelta(hours=3)
    midnight = datetime.datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    weekAgo = datetime.datetime.now() - datetime.timedelta(weeks=1)
    monthAgo = datetime.datetime.now() + dateutil.relativedelta.relativedelta(months=-1)
    yearAgo = datetime.datetime.now() + dateutil.relativedelta.relativedelta(years=-1)
    alltimeAgo = datetime.datetime.now() + dateutil.relativedelta.relativedelta(years=-10)


Each of them is passed in turn to the _get function, along with the various type filters (likes, reposts, comments). On top of that, 5 pages are generated for each filter combination. As a result, the following keys end up in memcached:

Setting key: index_hour_likes_0
Setting key: index_hour_reposts_0
Setting key: index_hour_comments_0
Setting key: index_hour_common_0
Setting key: index_hour_likes_1
Setting key: index_hour_reposts_1
Setting key: index_hour_comments_1
Setting key: index_hour_common_1
Setting key: index_hour_likes_2
Setting key: index_hour_reposts_2
Setting key: index_hour_comments_2
Setting key: index_hour_common_2
Setting key: index_hour_likes_3
Setting key: index_hour_reposts_3
Setting key: index_hour_comments_3
Setting key: index_hour_common_3
Setting key: index_hour_likes_4
Setting key: index_hour_reposts_4
Setting key: index_hour_comments_4
Setting key: index_hour_common_4
Setting key: index_day_likes_0
Setting key: index_day_reposts_0
Setting key: index_day_comments_0
Setting key: index_day_common_0
Setting key: index_day_likes_1
Setting key: index_day_reposts_1
Setting key: index_day_comments_1
Setting key: index_day_common_1
Setting key: index_day_likes_2
Setting key: index_day_reposts_2
Setting key: index_day_comments_2
Setting key: index_day_common_2
Setting key: index_day_likes_3
Setting key: index_day_reposts_3
...


On the client side, the required key is simply constructed and the JSON string is pulled out of memcached.
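The front end in this project is PHP, but the lookup is trivial in any language; here is a sketch of the same idea with python-memcached (key format taken from the listing above):

    # Sketch: rebuild the cache key from the user's filter choice and
    # fetch the pre-rendered JSON prepared by the daemon.
    import json
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def load_index(filter_time='day', filter_type='likes', page=0):
        key = 'index_%s_%s_%d' % (filter_time, filter_type, page)
        cached = mc.get(key)
        return json.loads(cached) if cached else None

    data = load_index('week', 'reposts', 0)   # reads the key index_week_reposts_0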

Twitter

The next interesting task was to collect the popular tweets across the CIS. It is also not an easy one: I wanted relevant rather than "trash" information. I was very surprised by Twitter's limitations: you cannot simply take and download all the tweets of selected users. The API heavily limits the number of requests, so you cannot do it the VKontakte way: build a list of popular accounts and constantly parse their tweets.

A day later a solution came: create a Twitter account and follow all the important people whose topics interest us. The trick is that in roughly 80% of cases one of these people retweets some popular tweet. That is, we do not need to keep a list of every account in the database; it is enough to gather 500-600 active people who are constantly in the trend and retweet genuinely interesting and popular tweets.
The Twitter API has a method that returns the user's home timeline, which contains the tweets of the accounts we follow along with their retweets. All we need now is to read our timeline as far back as possible once every 10 minutes and save the tweets; the filters and everything else work just as in the VKontakte case.

So another thread was added to the daemon, which runs code like this once every 10 minutes:
    def __init__(self):
        self.twitter = Twython(APP_KEY, APP_SECRET, TOKEN, TOKEN_SECRET)

    def logic(self):
        lastTweetId = 0
        for i in xrange(15):
            # check how many API requests we have left
            self.getLimits()
            tweetList = []
            if i == 0:
                tweetList = self.twitter.get_home_timeline(count=200)
            else:
                tweetList = self.twitter.get_home_timeline(count=200, max_id=lastTweetId)
            if len(tweetList) <= 1:
                print '1 tweet, breaking'   # the API has nothing older to give us
                break
            # ...
            lastTweetId = tweetList[len(tweetList)-1]['id']


Then comes the usual, boring code: we have tweetList, we loop over it and process each tweet. The list of fields is in the official documentation. The only thing I want to emphasize:

    for tweet in tweetList:
        localData = None
        if 'retweeted_status' in tweet:
            localData = tweet['retweeted_status']
        else:
            localData = tweet


In the case of a retweet we need to save not the tweet of the person we follow, but the original. If the current record is a retweet, its 'retweeted_status' key contains exactly the same kind of tweet object, only the original one.

In closing

There are still problems with the site's design and layout (I have never been a web developer), but I hope someone finds the information described here useful. I have been working with social networks and their APIs for a long time and know a lot of tricks. If you have any questions, I will be happy to help.

Well, a few pictures:

Index page:



A page of one of the groups that I constantly monitor:



Twitter per day:



Thanks for your attention.

Source: https://habr.com/ru/post/238765/

