
The Twi Journal

For a long time I did not dare to write on Habr, not least because of the project's technical instability. Now that the work is running smoothly (I sincerely hope so) and we have received some recognition in the form of a grant from Yuri Milner and Pavel Durov, I am ready to throw the project into the Habr meat grinder.

image

My name is Nikita Likhachev, and I want to tell you about The Twi Journal, a newspaper built on automatic analysis of Russian-language Twitter.

Project idea


Build a robot that can analyze the feeds of the Russian-speaking segments of Twitter, Instagram, and Foursquare, then present that content conveniently on a single site and distribute it to other social networks. Some people are curious about what is happening on Twitter but are reluctant to leave VKontakte or Facebook. Others simply lack the time to follow everything and want to get a sense of the agenda in ten minutes.

Sample objectivity


The project by no means claims absolute objectivity, because we have taken the liberty of excluding from the indexed database accounts that post jokes, scare stories, and other content that carries no useful information. We also ignore mass followers (those who follow everyone) and people who inflate their ratings with the help of bots. The initial database was assembled by hand, by putting top bloggers on a white list:

For example, if the blogger Navalny is on the white list and passes the “not a trap” check, the people he reads are automatically added to our database.

The database is still being expanded, both by hand and automatically: the robot finds new users in retweets made by accounts already in the database. So far nobody has reproached us for an incomplete picture of the news, because an important topic is bound to be picked up by at least one user in our database.
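A minimal Python sketch of how such a database could grow from a white list is shown here. It is not the project's actual code (which is written in PHP). The `following` and `retweets` dictionaries and the `passes_check` callable are hypothetical stand-ins for Twitter data and for the quality filter, and applying the filter to discovered accounts is an assumption.

```python
from collections import deque


def expand_index(whitelist, following, retweets, passes_check):
    """Return the set of accounts to index.

    whitelist    -- hand-picked seed accounts
    following    -- dict: account -> accounts it reads (hypothetical data)
    retweets     -- dict: account -> accounts it has retweeted (hypothetical data)
    passes_check -- callable standing in for the quality filter (assumption)
    """
    indexed = set()
    for seed in whitelist:
        if not passes_check(seed):
            continue
        indexed.add(seed)
        # People a trusted blogger reads are added automatically.
        indexed.update(a for a in following.get(seed, ()) if passes_check(a))

    # New users are then discovered through retweets made by indexed accounts.
    queue = deque(indexed)
    while queue:
        account = queue.popleft()
        for author in retweets.get(account, ()):
            if author not in indexed and passes_check(author):
                indexed.add(author)
                queue.append(author)
    return indexed
```

Called with one seed blogger, the function returns the seed, the accounts the seed reads, and every account reachable from them through retweets, minus anything the filter rejects.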

Data processing


Information


The robot, which we call Adam, collects all indexed tweets and divides them into several types: regular tweets, tweets linking to third-party media resources, tweets linking to well-known photo hosting sites, and tweets linking to video hosting sites.
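The four tweet types can be illustrated with a short Python sketch that classifies a tweet by the hosts of its links. The photo hosts are the ones named later in this article. The video hosts, and the rule that any other link counts as a media link, are assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Photo hosts named in the article.
PHOTO_HOSTS = {"twitpic.com", "yfrog.com", "pic.twitter.com",
               "flickr.com", "lockerz.com", "instagr.am"}
# Video hosts are an assumption; the article does not list them.
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com"}

URL_RE = re.compile(r"https?://\S+")


def host_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_tweet(text):
    """Return 'video', 'photo', 'media' or 'regular' for a tweet's text."""
    hosts = [host_of(u) for u in URL_RE.findall(text)]
    if any(h in VIDEO_HOSTS for h in hosts):
        return "video"
    if any(h in PHOTO_HOSTS for h in hosts):
        return "photo"
    if hosts:
        # Assumption: any other link is treated as a link to a media resource.
        return "media"
    return "regular"
```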

As a result, the main page displays popular tweets and parsed links to media articles with their number of mentions, while photos and videos appear in separate sections:

image

We are constantly trying to come up with algorithms that help readers get the maximum amount of fresh information in a short time. For videos, for example, we set a limit on the upload date so that the most recent ones are shown first. The robot also tracks tweets that comment on a video and displays them as comments:

image

User Rating


Using our database, we try to build at least an approximately objective rating of Russian microbloggers, dividing them into users, corporate accounts, and media. The rating combines several indicators in one formula: the average number of mentions of the user, the retweets of his posts, and the number of his followers relative to the number of lists to which he has been added.
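The article does not give the formula itself, so the Python sketch below only illustrates the general idea of folding several indicators into one score. The weights, the logarithmic damping, and the direction of the followers-to-lists ratio (read here as lists per follower) are all assumptions, not the project's actual method.

```python
import math
from dataclasses import dataclass


@dataclass
class AccountStats:
    avg_mentions: float   # average number of mentions of the user
    avg_retweets: float   # average number of retweets of the user's posts
    followers: int
    listed: int           # number of lists the user has been added to


def rating(stats, w_mentions=1.0, w_retweets=1.0, w_lists=1.0):
    """Combine the indicators into one score (hypothetical weights)."""
    # Assumption: the ratio is read as lists per follower, so a follower
    # count inflated by bots does not raise the score on its own.
    lists_per_follower = stats.listed / max(stats.followers, 1)
    # log1p damps large values so that no single indicator dominates.
    return (w_mentions * math.log1p(stats.avg_mentions)
            + w_retweets * math.log1p(stats.avg_retweets)
            + w_lists * math.log1p(lists_per_follower))
```

Sorting accounts by `rating(...)` in descending order would then produce the ranking, computed separately for users, corporate accounts, and media.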

All Twitter ratings fall into two groups: those that require authorization to participate and those that do not. The former are considered more objective, since they have access to information about a user's mentions and retweets. But they have a significant drawback: most popular bloggers never log in to them, either out of mistrust or because they see no use in it. The second type lacks this drawback but is rarely objective, since it is almost always based only on the number of followers, the number of tweets and, perhaps, the age of the account. We tried to combine the best of both types.

image

Foursquare venue rating


It is built in real time: we show the places that are popular in the city right now. It is calculated as follows: every 25 minutes a robot is launched that creates a grid of points within predefined city boundaries (in Moscow, only the center and a couple of kilometers around it are checked). For each point, the Foursquare API is used to check for popular venues within a radius of two kilometers.
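The grid of points can be sketched in Python as follows. The bounding box in the usage note and the two-kilometer spacing are illustrative assumptions, and the Foursquare request itself is left out because the article does not say which endpoint is used.

```python
import math

EARTH_RADIUS_KM = 6371.0


def grid_points(south, west, north, east, step_km=2.0):
    """Return (lat, lon) points covering a bounding box roughly every step_km."""
    # Angular size of step_km along a meridian, in degrees of latitude.
    lat_step = math.degrees(step_km / EARTH_RADIUS_KM)
    points = []
    lat = south
    while lat <= north:
        # Degrees of longitude shrink with latitude, so widen the step.
        lon_step = lat_step / math.cos(math.radians(lat))
        lon = west
        while lon <= east:
            points.append((round(lat, 6), round(lon, 6)))
            lon += lon_step
        lat += lat_step
    return points
```

For example, `grid_points(55.70, 37.50, 55.80, 37.70)` yields a few dozen points over a rectangle in central Moscow. With a spacing of two kilometers and a search radius of two kilometers per point, neighboring search circles overlap, so the same venue can be returned for several points and results would need to be deduplicated.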

image

A little about technology


At the moment we run on a single server. The whole project (including the daemons) is written in PHP. We use MySQL and, for speed-critical parts, MongoDB. InnoDB's insert performance is more than enough for us, and we cache most database queries with memcached. In general, memcached is an ideal choice for us, since we work with a large amount of data that can be cached without any loss of effectiveness. This has reduced the generation time of the main page to 40 ms (I am afraid to predict how the site will behave under a likely Habr effect).
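The caching approach described above is commonly implemented as a "cache-aside" lookup: try the cache first and run the query only on a miss. The Python sketch below uses a small in-memory class as a stand-in for memcached. The class, the function name, and the TTL value in the usage note are assumptions for illustration, not the project's code.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for memcached with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on access.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def cached_query(cache, key, ttl, run_query):
    """Return the cached result for key, running the query only on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

A main-page handler could then call something like `cached_query(cache, "front_page", 60, load_popular_tweets)`, where `load_popular_tweets` is a hypothetical function that queries the database.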

Recently we started using Gearman to parallelize tasks such as processing tweets and calculating ratings, as well as for background jobs such as saving pictures to Amazon S3.

The robot Adam checks the feed for updates every 15 to 180 minutes, depending on the time of day. Since materials gain popularity gradually rather than immediately, it is important for us to keep tracking them for some time after publication. This is also when we parse each tweet into its components: text, links, images, and video. All shortened links are expanded, and the content behind them is cleaned up much like the Reader function in Safari does (in the manner of Readability).
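Splitting a tweet into text and links and expanding shortened links could look roughly like this in Python. The `redirects` dictionary is a hypothetical stand-in for following HTTP redirects over the network, and the hop limit is an assumption that guards against redirect loops.

```python
import re
from dataclasses import dataclass, field

URL_RE = re.compile(r"https?://\S+")


@dataclass
class ParsedTweet:
    text: str
    links: list = field(default_factory=list)


def expand(url, redirects, max_hops=5):
    """Follow short-link redirects until a final URL or the hop limit."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url


def parse_tweet(raw, redirects):
    """Separate a tweet's plain text from its (expanded) links."""
    links = [expand(u, redirects) for u in URL_RE.findall(raw)]
    # Remove the URLs and collapse the whitespace they leave behind.
    text = " ".join(URL_RE.sub("", raw).split())
    return ParsedTweet(text=text, links=links)
```

The expanded links can then be sorted into articles, images, and video by their host before the article text is extracted.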

For images, we support the photo hosting sites twitpic, yfrog, pic.twitter.com, flickr, lockerz, and instagr.am. For each of them we wrote a simple API handler that finds the image preview, the author, and the caption. For some photo hosting sites we had to use undocumented features. Fortunately, programmers often think alike, especially when naming methods and their parameters.

image

Development plans


We are currently running various experiments. For example, we plan to launch The Twi Football, where we want to try live coverage of matches based on analysis of Russian-language Twitter. The project will serve as a testing ground for technologies that we will later use in the main project: the server receives fans' tweets directly from Twitter through the Streaming API, so new tweets with team hashtags will appear faster than on Twitter's own search page.

In our free time we have fun with our branding:

image

But seriously, we want to try scaling the project to other countries. We will start, of course, with the United States (we have bought the domain twijournal.com). If things go well there, we will move on to other countries. There is little time left, because the money that Durov and Milner gave us is running out quite quickly, even though we are not spending lavishly.

In our wildest dreams, we build similar media on top of other social networks and then merge everything into one large content aggregator. But for now it is just a dream.


P.S. What if a developer who wants to work with us, or a journalist from another country, is reading this post? Just in case, here is our email: editors@tjournal.ru

Source: https://habr.com/ru/post/142562/

