
Big data: size matters?



Every web developer faces the task of selecting content individually for each user. As data grows in volume and diversity, the accuracy of that selection becomes an increasingly important problem, with a significant impact on how attractive the project looks in the eyes of its users. If this falls within your sphere of interest, this post may nudge you toward some new ideas.

Every era of the IT industry has had its own buzzwords - words everyone had heard, that everyone knew the future belonged to, but that only a few really understood and knew how to use correctly. At various times these were “waterfall”, “XML”, “Scrum” and “web services”. Today one of the main contenders for the title of buzzword No. 1 is “big data”. With the help of big data, British scientists diagnose pregnancy from a supermarket receipt with accuracy approaching an hCG test. Large vendors build big-data analytics platforms whose price runs into millions of dollars, and there is little doubt that by 2020 every pixel of any self-respecting Internet project will be built with big data in mind.

At the same time, hardly an article about big-data analysis algorithms goes by without a comment along the lines of “Well, show me an example that works at industrial scale!” So let’s not beat around the bush and start with an example: www.ok.ru/music . Most of the content in the music section of Odnoklassniki is selected individually for each user on the basis of “big data”. Is it worth it? Here are a few simple numbers:

But numbers are not the main thing. Far more valuable is the live, unbiased opinion of real users. A year ago, as part of the “Outside the Window” project, people who had never used Odnoklassniki before spent two weeks on the site, reporting on their impressions in detail. One of the reviews of the music section went like this: “It somehow guesses what I like. I don’t understand how, but it’s nice.”
In fact, there is no magic here, of course - it’s all about the data. The very data that our users generate by listening to and downloading music and browsing the music catalog. Information about every user action flows into a classic MS SQL relational database, where the primary processing, filtering and aggregation of the data take place (yes, good old SQL can handle big data too). The data prepared in SQL is uploaded for further analysis to a small Hadoop cluster, which produces a compact but informative digest that is then used in real time (part of it is imported into Cassandra, part is loaded straight into memory). For greater responsiveness, recent user actions are also written to a Tarantool database and counted online.
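To make the first step more concrete, here is a minimal sketch of what such primary aggregation could look like. The event structure and function names are assumptions made for the example; in the real pipeline this step is done in MS SQL before export to Hadoop.

```python
# Hypothetical sketch: collapse a raw listening-event stream into
# compact (user, track) -> play-count pairs, the kind of aggregate
# that would be handed off for further analysis.
from collections import Counter
from dataclasses import dataclass


@dataclass
class ListenEvent:          # assumed event shape, not the real schema
    user_id: int
    track_id: int
    timestamp: int          # unix seconds


def aggregate_listens(events):
    """Aggregate raw events into (user_id, track_id) -> play count."""
    counts = Counter()
    for e in events:
        counts[(e.user_id, e.track_id)] += 1
    return counts


# Example: three raw events become two aggregated rows.
events = [
    ListenEvent(1, 100, 1_700_000_000),
    ListenEvent(1, 100, 1_700_000_100),
    ListenEvent(2, 200, 1_700_000_200),
]
print(aggregate_listens(events))  # Counter({(1, 100): 2, (2, 200): 1})
```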



The digest used for content selection includes various kinds of correlations between objects of different types. For music tracks, it is information about how often they are listened to within a small time window (temporal similarity). For artists, it is how often they are liked by the same user (collaborative similarity) and how similar the playlists of their nearest neighbors are (second-order collaborative similarity). For users, it is which tracks and artists they listen to and how often (user ratings). For convenience of processing, all the correlations are stored in a single structure - the graph of tastes.
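As an illustration, here is a toy sketch of how two of these correlations could be computed. The window size, pairing scheme and function names are assumptions for the example, not the production formulas; in the real system such counts become weighted edges of the taste graph.

```python
# Hypothetical sketch of temporal and collaborative similarity counts.
from collections import defaultdict
from itertools import combinations


def temporal_similarity(events, window=30 * 60):
    """Count how often two tracks are played by the same user
    within a small time window (here 30 minutes, an assumed value)."""
    by_user = defaultdict(list)
    for user_id, track_id, ts in events:
        by_user[user_id].append((ts, track_id))
    pair_counts = defaultdict(int)
    for listens in by_user.values():
        listens.sort()
        for (t1, a), (t2, b) in combinations(listens, 2):
            if a != b and t2 - t1 <= window:
                pair_counts[frozenset((a, b))] += 1
    return pair_counts


def collaborative_similarity(likes):
    """Count how often two artists are liked by the same user.
    `likes` maps user_id -> iterable of liked artist ids."""
    pair_counts = defaultdict(int)
    for liked in likes.values():
        for a, b in combinations(sorted(set(liked)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


events = [(1, "trackA", 0), (1, "trackB", 600), (2, "trackA", 0)]
print(dict(temporal_similarity(events)))        # {frozenset({'trackA', 'trackB'}): 1}
print(dict(collaborative_similarity({1: ["x", "y"], 2: ["x", "y"]})))  # {('x', 'y'): 2}
```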



Thanks to its relatively compact size, the taste graph makes it possible to solve a wide range of personalized content-selection tasks in real time. Having a list of the most popular tracks in the whole system, you can:

Having a collection of tracks compiled by the user, you can pick out similar interesting tracks for it (again PPR, this time run over the collection, with the result personalized for the user). Technical details on how, what for and why can be found here.
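Since PPR (personalized PageRank) does the heavy lifting in this step, here is a hedged sketch of it over a small taste graph. The graph layout, damping factor and iteration count are illustrative assumptions, not the production configuration.

```python
# Hypothetical sketch of personalized PageRank over a taste graph:
# random walks restart at the seed nodes, so tracks strongly connected
# to the user's own tracks end up with the highest rank.
def personalized_pagerank(graph, seeds, alpha=0.15, iterations=30):
    """graph: {node: {neighbor: weight}}; every node must appear as a key.
    seeds: nodes the walk restarts from (e.g. the user's tracks)."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in graph}
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            if total == 0:
                continue
            for nb, w in neighbors.items():
                nxt[nb] += (1 - alpha) * rank[node] * w / total
        rank = nxt
    return sorted(rank.items(), key=lambda kv: -kv[1])


# A tiny symmetric taste graph: edge weights are co-listening counts.
graph = {
    "trackA": {"trackB": 3, "trackC": 1},
    "trackB": {"trackA": 3},
    "trackC": {"trackA": 1},
}
print(personalized_pagerank(graph, seeds={"trackA"}))
```

Tracks already in the user's collection serve as the seed set, and the top-ranked tracks the user has not yet seen become the candidates for recommendation.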

It will not escape the attentive reader that none of the solutions used here can be called new / breakthrough / unique (underline as appropriate), either in terms of algorithms or in terms of technology. Why, then, do genuinely high-quality big data solutions appear on the Russian market so rarely?

Many spears have been broken (and are still being broken) in arguments about how large data has to be before it counts as truly “big”. But is it really a matter of size? Hundreds of gigabytes / terabytes / petabytes (underline as appropriate) of data have no value in themselves - their purpose is to help understand the past and predict the future. Clearly, the data alone is not enough for that: you also need analysis algorithms, technologies, and people capable of applying them.

Many companies have enough data for it to make good business sense when used properly. The processing algorithms are widely known and actively developed; the processing technologies are also available at various price points (from open-source software that runs on commodity hardware to multi-million-dollar integrated systems). Apparently, what is missing is the last, most important component - experienced people able to put all the pieces together.

It is easy enough to find a programmer who knows every nuance of garbage collection in Java, has worked with a dozen different kinds of DBMS, and is intimately familiar with Spring / Trove / Hibernate and fifty more libraries and packages. However, most of them are technology-oriented and not geared toward working with the literature, mastering new methods of statistical processing, and setting up experiments. Finding a mathematician capable of all this is harder, but still possible - yet in that case it will be extremely difficult to get any further than a shapeless cloud of Matlab code. The probability of finding a person who can take the best of both worlds is so small that many doubt such people exist at all.

One would expect many university graduates to aim for such a valuable ecological niche, but even yesterday’s students split into the same “techies” and “mathematicians”. In data-mining tasks the former lean toward naive “what is there to think about” approaches; the latter drift off into mathematical nirvana and do not always come back. Their ability to learn, however, is not as blunted as that of mature specialists, although developing them requires serious additional investment.

Despite its complexity and capital intensity, an effective data-mining system can make a project much more attractive and user-friendly, helping to grow its audience.

Source: https://habr.com/ru/post/216401/

