
Big Data at Odnoklassniki: what Russia's second most visited social network does with its data

One thing cannot be taken away from Odnoklassniki: it is the second most visited social network in Russia (and the fourth most visited of all Runet sites). In Armenia, for example, it ranks first. Millions of people visit the site every day and leave terabytes of data that can be analyzed. What data does the social network collect from users? What stack can process tens of terabytes of data per day? Is more data always better?



We interviewed Dmitry Bugaychenko, who told us about Big Data in Odnoklassniki.
Dmitry Bugaychenko graduated from St. Petersburg State University in 2004 and defended his Ph.D. thesis on formal logic methods there in 2007. He worked in outsourcing for almost nine years without losing touch with the university and the academic community. Big data analysis at Odnoklassniki gave Dmitry a unique chance to combine theoretical training and a scientific foundation with the development of real, in-demand products.



A few words about yourself and how long you have been working with machine learning and Big Data.

Dmitry Bugaychenko: I think I belong to the category of programmers who did not want to simply program for money, in their own or someone else's business, but were drawn to areas where there are more questions than answers and where solving them really takes all of your potential and knowledge. That is most likely why I ended up in data analysis. My first serious projects in what is commonly called “big data” began in late 2011.

Can you share the network's daily audience and the approximate amount of data users generate?

Dmitry Bugaychenko: Yes. Our daily audience is 40 million people, and the daily volume of data (not counting uploaded photos and videos) is measured in tens of terabytes.

What user data do you collect, and how is it used?

Dmitry Bugaychenko: Good question. I would like to note right away that at Odnoklassniki we pay a lot of attention to user privacy and to the ethics of storing and using data. We also respect and comply with the laws of the Russian Federation that regulate the collection and use of personal data. Practically all the data we work with is available in one form or another to ordinary users of the social network through the web interface and mobile applications (posts, “Klass” marks, comments, etc.), and we use it first of all to improve the user experience: we try to make it easier for users to find new, interesting content, to understand their needs, to reduce negative experiences, and so on.

As for photos and videos, what useful information can be extracted from such data?

Dmitry Bugaychenko: Analyzing multimedia objects (photo, video, sound) really does come with a number of difficulties. This data comes into play either when implementing specific features (for example, deduplication) or at the stage when 80% of the result has already been squeezed out of the “classical” activity data and the remaining 20% has to be extracted by bringing in other data.

What tools are involved in data processing? Tell us about your Big Data kitchen.

Dmitry Bugaychenko: One could talk about it at length and with enthusiasm, which we have done many times at meetups, conferences and DataFest. In short, we mostly use open source technologies and analysis patterns. Data collection goes through Apache Kafka queues, storage is Hadoop + Parquet, and analysis depends on the context: classic MapReduce, Spark, Hive, Pig, Samza, Spark Streaming, Python with scikit-learn and co., TensorFlow and Caffe for neural networks. We introduce new technologies and patterns quite actively, but only when their adoption is objectively justified and not just “because it is fashionable.”

Have you ever had to switch from one tool to another in the course of your work? If so, why? Has it ever happened that a tool is slightly better, but the higher server load and maintenance costs cancel out all its advantages?

Dmitry Bugaychenko: Searching for new opportunities for growth, whether in technology or in algorithms, is part of our daily work. But we always try to evaluate alternatives rationally: hype around a technology is a reason to take a look at it, but whether it is worth investing in its adoption is decided after weighing the costs against the potential effect. In addition, when introducing new approaches we try to follow the “one new thing” rule (either a new task on known technologies and algorithms, or a new technology or algorithm in a known task), although this is not dogma.

As for trade-offs between performance and other properties (ease of use, quality of model predictions, etc.), those decisions are also made rationally: if the potential effect is sufficient, we modify the open source solutions until they deliver acceptable performance.

If you could go back five years with your current experience, what would you do differently?

Dmitry Bugaychenko: On the whole, I think the foundation of our data infrastructure was built correctly, but of course there are a few things. First of all, we would have used Hadoop for analytics right away instead of SQL-based solutions. We would also have started introducing interactive analytics tools (Hive, Hue) earlier. That would have given us a significantly faster start, because working with SQL and the complicated procedure for deploying changes to the algorithms seriously slowed the work down.

On the other hand, the relevant technologies were much less mature at the time, so the chance of the opposite effect is not zero either. More than once, when introducing a young technology, we have had to patch it heavily, which significantly complicates moving to new versions later. So even if we had a time machine, I think we would not risk it.

Do you think the current data processing system is optimal? What could be improved?

Dmitry Bugaychenko: The only system that cannot be improved is one that nobody needs, and that is not the case with our data processing system. There are, of course, many small and not-so-small technical improvements: bug fixes, upgrades. There is a lot that could be improved in our processes (first of all, developing internal and external educational programs). But if we talk about something “big and bright,” it is above all the concept of a semantic data lake. In that model the data is not just a big dump of logs and aggregates but a single, efficient repository with a fixed, well-defined metamodel, which lets you formulate and run operations on the data in terms of the subject area (user, post, “Klass,” etc.) without reference to the technical details of data storage.
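To make the idea of a semantic data lake more concrete, here is a minimal editorial sketch in Python. It is not Odnoklassniki's implementation: the names `EntityType`, `SemanticLake`, `register`, `select` and `read_posts`, as well as the field names, are hypothetical and chosen only for illustration. The sketch assumes that each domain entity is declared once in a metamodel and bound to a reader that hides where and how the data is physically stored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class EntityType:
    """A domain-level entity in the (hypothetical) metamodel, e.g. user or post."""
    name: str
    fields: Tuple[str, ...]


class SemanticLake:
    """Lets callers query by entity name; storage details live in the readers."""

    def __init__(self) -> None:
        self._types: Dict[str, EntityType] = {}
        self._readers: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, entity: EntityType,
                 reader: Callable[[], Iterable[Row]]) -> None:
        # The reader is the only place that knows about files, tables or formats.
        self._types[entity.name] = entity
        self._readers[entity.name] = reader

    def select(self, entity_name: str, **conditions: object) -> List[Row]:
        # Validate the query against the metamodel before touching any data.
        entity = self._types[entity_name]
        unknown = set(conditions) - set(entity.fields)
        if unknown:
            raise KeyError(f"{entity_name} has no fields {sorted(unknown)}")
        return [
            row for row in self._readers[entity_name]()
            if all(row.get(key) == value for key, value in conditions.items())
        ]


def read_posts() -> List[Row]:
    # Stand-in for a real storage reader (assumption: in practice this could
    # read Parquet files from Hadoop; here it returns made-up in-memory rows).
    return [
        {"post_id": 1, "author_id": 10, "klass_count": 3},
        {"post_id": 2, "author_id": 11, "klass_count": 0},
        {"post_id": 3, "author_id": 10, "klass_count": 7},
    ]


lake = SemanticLake()
lake.register(EntityType("post", ("post_id", "author_id", "klass_count")),
              read_posts)

# The query is phrased in domain terms only: "posts by author 10".
posts_by_author = lake.select("post", author_id=10)
```

The point of the design is that swapping `read_posts` for a different storage backend would not change the `select("post", author_id=10)` call, which is the separation between subject-area operations and storage details described in the answer above.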

Data analysis is incredibly interesting in itself, but can you put into numbers whether it is worth doing at all? What is the payoff for the social network from analyzing user data? Can we say that, in the case of Odnoklassniki, Big Data is monetized (not necessarily in money)?

Dmitry Bugaychenko: That is a very difficult question with no single answer. It all depends on how and where the analysis is applied. If, for example, we replace content collections that editors assemble based on their own tastes or their opinions about user preferences, then introducing collections based on data analysis can multiply user activity (in our practice we have seen a tenfold increase). If, on the other hand, the same collections are already built by competent people on the basis of data but without machine learning, then introducing algorithms for the same tasks often gives much more modest results, measured in percent.

What is the minimum amount of data a social network needs to adapt to a user? And is there a limit beyond which more personal data brings practically no further improvement?

Dmitry Bugaychenko: Most of our systems begin to adapt to the user from the first click. At the same time, the amount of data needed to reach “saturation” depends strongly on the subject area. And yes, more is not always better. With too much data the system can over-adapt and create an “opinion bubble” for the user, in which the recommendations are relevant but uninteresting. Fighting such “bubbles” is a separate, non-trivial topic.

Is there any symbiosis with other Mail.Ru projects in the field of Big Data? Are your solutions replicated across the whole holding, or, conversely, do you adopt your colleagues' successful solutions?

Dmitry Bugaychenko: Both ideas and data are actively exchanged in both directions. In fact, there are more than two sides, since many people practice machine learning. For example, VKontakte is building its analytics infrastructure largely guided by our experience, while Poisk@Mail.Ru has interesting, efficient implementations of tree learning that we are looking at, and so on.

I think you remember the sensational story in Das Magazin about how U.S. President Trump was allegedly led to victory by social network ads tailored to specific users based on an analysis of their activity. As a professional, how plausible do you find this story? And if it is plausible, is such targeting, not necessarily political, possible in Odnoklassniki?

Dmitry Bugaychenko: In my opinion, the story is quite plausible. Moreover, by the standards of the field it is a fairly routine one: personalized targeting has been used successfully in business for years, and here it was simply applied to a political advertising campaign. Personalized targeting in Odnoklassniki is, of course, possible and is already used in business.



Friends, on October 21 at the SmartData conference, Dmitry Bugaychenko will give a new talk, "From click to forecast and back: Data Science pipelines in OK". Come!

Source: https://habr.com/ru/post/336866/

