
Big Data at Beeline: real experience



Hi, Habr! My name is Alexander Krot. I am responsible for developing machine learning and data mining algorithms at Beeline, as well as for training and selecting data processing specialists, in the team led by Sergey Marin, who previously introduced you to the work of our Big Data division. I have already written about certain aspects of Big Data and Machine Learning, but today I will tell you how it all works in practice: how we solve problems related to big data analysis at Beeline, how we select specialists, and what tools and methods we use.

Who we are


To begin with, we have a laboratory staffed by specialists - the so-called Data Scientists - people who are very hard to find on the market. Many people come to us for an interview, and only a few of them end up working with us.
The process begins with a telephone interview with questions on certain areas of mathematics. After that, the candidate receives a test assignment: a specific machine learning problem, similar to the problems on kaggle.com. Having built a good algorithm and obtained a high value of the quality metric on the test sample, the candidate is admitted to the next stage: an in-person interview that checks knowledge of machine learning and data analysis methods and also includes non-trivial practical questions and logic problems. The ideal candidate is a graduate of the Moscow Institute of Physics and Technology, Moscow State University, or ShAD, a participant in ACM and Kaggle competitions, with mandatory practical experience in building machine learning and data analysis algorithms (telecom, banking, retail). Experience with Large Scale Machine Learning methods, that is, building machine learning algorithms when the training data is too large to fit into the machine's RAM, is an advantage.

How we solve problems


At each stage of solving a task, we use the newest and most popular tools. Once the customer's requirements are known and the approximate path to a solution is clear, the process of developing a new product looks as follows.

Data collection


As a rule, when the task has been set and it is roughly clear which data are required to solve it, the process of collection and aggregation begins. This is the so-called ETL (Extract, Transform, Load) process. At first glance this step may seem trivial, but in practice that is far from the case. We need to unload a large amount of unrelated data, clean it, and merge it.
In practice, the data may contain gaps and incorrect values. The situation is complicated by the fact that we work with data from many sources (billing, geodata, Internet events, service quality data, CRM, data on top-ups and charges, connected tariffs and services, and much more). To combine data from different sources (and, for example, perform the well-known join operation), you need to put the data into a single repository, and before starting the join you need to think carefully about how to write the query so that it is as efficient as possible in terms of computational complexity. This is where the skill of implementing algorithms efficiently is needed.
At this stage we use tools such as Hive and Pig (for simple queries) and Apache Spark (for more complex ones). One feature is important to note here: machine learning tasks (for example, classification or regression) often require a so-called training sample - a set of objects whose target (predicted) variable is already known. In practice, in many tasks it is very difficult to find a good and large training sample (more on this below), and you first have to obtain a set of training objects by solving auxiliary tasks.
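To give a feel for this stage, here is a minimal PySpark sketch of such a join; the paths, table layout, and column names are invented for illustration, and this is not our production code:

from pyspark.sql import SparkSession

# Hypothetical example: joining billing records with CRM attributes.
spark = SparkSession.builder.appName("etl-join-sketch").getOrCreate()

billing = spark.read.parquet("/data/billing")   # assumed location
crm = spark.read.parquet("/data/crm")           # assumed location

# Select only the columns needed downstream before joining,
# so that the shuffle moves as little data as possible.
joined = (
    billing.select("subscriber_id", "charge_amount", "event_date")
    .join(crm.select("subscriber_id", "tariff_plan"),
          on="subscriber_id", how="left")
)

joined.write.mode("overwrite").parquet("/data/joined_features")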

Algorithm construction


After all the data have been collected, the process of building an algorithm begins, and it includes many steps. To begin with, a small part of all the data is taken - a part that fits into the RAM of an ordinary personal computer. Here, all the skills that specialists apply when solving problems on Kaggle come into play. Namely: through experiments, the Data Scientist decides how to fill in the missing values in the data, performs feature selection and feature engineering, tests many algorithms, tunes all the necessary parameters, and also solves many smaller auxiliary tasks. After this stage is complete, there is usually a ready prototype of the future algorithm. The data analysis tools used here are R or Python with various libraries (for example, scikit-learn or pandas). It is important to note that all steps of building the algorithm are written up in a detailed report using IPython Notebook (or RMarkdown, respectively).
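As a rough illustration of such a prototype, here is a minimal pandas and scikit-learn sketch with an invented file and column names (a real prototype involves far more experimentation):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical sample: a small slice of subscriber data that fits into RAM,
# with a binary 'target' column.
df = pd.read_csv("sample.csv")

# Fill missing numeric values with the column median (one of many possible strategies).
features = df.drop(columns=["target"]).fillna(df.median(numeric_only=True))
target = df["target"]

# Try one candidate algorithm and estimate its quality with cross-validation.
model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
scores = cross_val_score(model, features, target, cv=5, scoring="roc_auc")
print("ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))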

Scaling the algorithm: training on big data


Next, once the algorithm prototype is ready, training on big data usually follows: the algorithm chosen at the previous stage is trained on a much larger data volume. For this we use tools for working with big data: Apache Spark and Vowpal Wabbit. The former makes it possible to implement iterative machine learning algorithms and graph algorithms efficiently thanks to its in-memory computation model. The latter implements online training of so-called linear models on big data without loading all the data into RAM at all. It is worth noting the growing popularity of both tools, as well as the fact that they, like all the previous ones, are freely distributed (and therefore, in practice, they usually require substantial refinement for industrial applications).
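For example, a distributed linear model can be trained in Spark roughly as sketched below; the table and column names are invented, and a real pipeline (whether on Spark or Vowpal Wabbit) is of course more involved:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("big-data-training-sketch").getOrCreate()

# Hypothetical labeled feature table produced by the ETL stage.
data = spark.read.parquet("/data/joined_features_labeled")

# Assemble the assumed feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["charge_amount", "calls_per_day", "data_usage_mb"],
    outputCol="features",
)
train = assembler.transform(data).select("features", "label")

# Train a linear model distributed across the cluster.
lr = LogisticRegression(maxIter=50, regParam=0.01)
model = lr.fit(train)
print("Intercept:", model.intercept)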

It is worth noting that this is not the only possible process for solving a problem. Often algorithms are trained on big data right away, or the task itself involves only competent extraction and aggregation of data (that is, it is a pure ETL process). We also often solve problems on graphs, examples and solution methods for which I described earlier. But for the most part the process is as described above. The average time from the birth of an idea to a production solution ranges from a few days to a month (taking into account all kinds of testing, optimization, and improvement of the quality of the algorithms).

What tasks we solve


I will describe only the tasks I work on directly; at the same time, there are other equally interesting tasks, such as geoanalytics, which use simpler algorithms but require high-quality visualization, for example.
So, we solve the following machine learning and data mining problems:

Natural Language Processing Tasks


Natural language and text processing is currently one of the most complex areas of data mining, along with image and signal processing. Here we use both classical algorithms with text features like Bag of Words or TF-IDF, and more advanced Deep Learning methods (for example, we actively use word2vec representations to search for synonyms of words), which, given a large training sample, are much more effective in text classification tasks (and are used to fight spam). As for tools, we use various libraries like NLTK (Natural Language Toolkit) and the algorithms already implemented in Apache Spark.
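As a small illustration of the classical approach, here is a TF-IDF plus linear classifier sketch in scikit-learn; the toy corpus and labels are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for real texts (the data here is invented).
texts = ["win a free prize now",
         "meeting rescheduled to monday",
         "free bonus click here",
         "please send the quarterly report"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Bag of words weighted by TF-IDF, followed by a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["claim your free prize"]))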

Tasks on graphs (Social Network Analysis)


We also work on network analysis tasks, that is, on graphs. I talked about this in detail earlier, so now I will only briefly remind you that a graph is a set of objects between which connections are known. Or, in simple terms, it is a set of points (vertices) connected by segments (edges). Typical graphs are social networks, where a notion of friendship is defined between objects, or the web graph, in which sites link to one another. In graphs you can detect communities, compute various characteristics of them, and even predict new acquaintances between people. Since graph tasks are very resource-intensive, here we use Apache Spark, which already implements some well-known algorithms, such as finding strongly connected components (Strongly Connected Components) or PageRank. However, this is often not enough, and we also use other tools built around the more familiar Pregel computation model.
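To make these notions concrete, here is a toy sketch using networkx; at real scale such algorithms run on Spark, but the small example shows what PageRank and strongly connected components compute (the graph below is invented):

import networkx as nx

# Tiny illustrative call graph: an edge means "subscriber A called subscriber B".
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A")]
g = nx.DiGraph(edges)

# PageRank ranks vertices by their "importance" in the graph.
ranks = nx.pagerank(g, alpha=0.85)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))

# Strongly connected components: groups in which every vertex can reach every other one.
print(list(nx.strongly_connected_components(g)))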

Predictive Modeling Tasks


Classification tasks in the classical formulation look simple. There is a set of objects, say subscribers, and a set of features that describe them. Within the entire set of objects there is a small group for which the value of the target variable is known, for example, the probability that the subscriber is likely to churn or to perform some other target action. This is the so-called training sample. The task is to predict the value of the target variable for all the other objects. This is a typical regression task (when the predicted value is a number) or classification task (when the predicted value is a class label). Among the examples of tasks we solve, in addition to churn prediction, are predicting the subscriber's gender, age, and propensity to consume specific services, such as the Shared Data Bundle. All this is used, for example, for targeted offers of the operator's own services. Here we actively use linear models, decision trees, neural networks, and also compositions of algorithms (for example, boosting). The tools are Python with its libraries or, in the case of training on big data, Apache Spark or Vowpal Wabbit.
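The shape of such a task in code might look as follows; this is only a sketch with an invented table and feature names, training on the labeled subset and scoring everyone else:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical subscriber table: features for everyone,
# a churn label only for a small labeled subset.
subscribers = pd.read_csv("subscribers.csv")
labeled = subscribers[subscribers["churn"].notna()]
unlabeled = subscribers[subscribers["churn"].isna()]

feature_cols = ["avg_monthly_charge", "calls_per_day", "tenure_months"]  # assumed features

model = RandomForestClassifier(n_estimators=300)
model.fit(labeled[feature_cols], labeled["churn"])

# Predict the churn probability for subscribers whose outcome is unknown.
churn_prob = model.predict_proba(unlabeled[feature_cols])[:, 1]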

Cluster Analysis Tasks


Clustering tasks, unlike prediction tasks, have no training sample. A typical formulation of such a problem is to look for patterns in unlabeled data. The only input to such a task is a set of objects and the attributes known for them. The task is to answer the question: can the objects be divided into clusters, within each of which the objects are very similar to one another? Using this, we regularly segment the subscriber base in order to find groups of users that are, in some sense, similar to each other. We identify people who make up a certain social community. After that, we study the interests and characteristics of a particular group using the prediction tasks described above. Here we use both classical algorithms like KMeans and more advanced hierarchical clustering and graph clustering algorithms.
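A minimal KMeans sketch along these lines, with an invented per-subscriber usage profile (minutes, SMS, mobile data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical usage profiles: minutes of calls, number of SMS, megabytes of data.
profiles = np.array([
    [300, 20, 500],
    [50, 200, 4000],
    [320, 15, 450],
    [40, 180, 5000],
])

# Scale the features first so that no single unit dominates the distance metric.
scaled = StandardScaler().fit_transform(profiles)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(segments)  # cluster label for each subscriber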

This is not a complete list of the tasks we work on within data mining at Beeline. The entire list is difficult to describe in one article; I will tell you more about individual tasks later.

Source: https://habr.com/ru/post/254469/

