
How the Data Science process works in practice

Hi, Habr!

After my last post, "Your Personal Course on Big Data," I received several hundred letters with questions. Reading them, I was surprised to find that people bury themselves in theory while spending very little time on practical problems, which require a completely different set of skills. Today I will describe the difficulties that come up in practice and what you actually have to deal with when solving real problems.


Those who have already taken a lot of machine learning courses and started solving problems on Kaggle know very well what a typical problem statement looks like: "A training set of objects is given, for each of which the features and the value of the target variable are given. A test set is also given, and for each of its objects the value of the target variable must be predicted." There are more complex statements; here I gave only the most typical one, for regression and classification problems. This is how solving a problem on kaggle.com begins: you calmly read the task description, look at the quality metric, download the data, launch IPython Notebook, load your favorite libraries and start working long and hard with the data: you tune the hyperparameters of your models, select features, train lots of classifiers and blend them, find hard objects and look at them more closely, and so on. In this way you fight to optimize the quality metric and earn a decent position on the leaderboard.
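For concreteness, here is a minimal sketch of that routine in Python. The train.csv / test.csv file names, the id and target columns and the parameter grid are all hypothetical stand-ins for whatever a particular competition actually gives you.

```python
# A minimal sketch of the familiar Kaggle routine: load data, tune
# hyperparameters by cross-validation, predict on the test set.
# File names, column names and the grid are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X = train.drop(columns=["target", "id"])
y = train["target"]

# Select hyperparameters on the training sample only
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

# Predict on the test sample and write a submission file
preds = search.predict_proba(test.drop(columns=["id"]))[:, 1]
pd.DataFrame({"id": test["id"], "target": preds}).to_csv("submission.csv", index=False)
```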
Unfortunately, in practice everything looks completely different. In fact, 90% of the work has been done before you even start. Skills in building good models are certainly important, but they are not the main thing. Pay attention to how often things are simply "given" in the problem statement above; that is no accident, because in real life nothing is given to you: there is no quality metric, there is no training sample, and often there is no task as such, only a vague description of the goal that is being pursued. This is where the real work of a data scientist begins, most of which is routine, and you have to think far ahead, long before there is a clear problem statement and the "object-feature" matrix is built.

In short, the most common problems you run into are the following:

- a vague problem statement;
- a large number of degrees of freedom;
- a large number of different data sources;
- data quality and gaps;
- "raw" software for working with data;
- the disadvantages of working with big data;
- availability of a training sample.

Let's analyze each of these problems in order.

A vague problem statement


A task never arises by itself: everything we do is done for a specific purpose. And business goals are usually formulated quite informally: "reduce customer churn," "increase revenue from affiliate programs," "retain users," "reduce the load on a particular business unit by optimizing business processes." It is in this form (sometimes a little more precise) that the task lands on the data scientist's desk. That is why this person must be familiar with the business in order to understand what is really needed. If, for example, you are promoting services or developing a targeted offer, you need to understand how the interaction with the client will be organized, how the advertising campaign will be run, how its success criteria will be defined, and why machine learning is needed here at all. All of this has to be known precisely. But as soon as you start dealing with the details, you run into the next problem.

A large number of degrees of freedom


What is - "the client went into the outflow" - in the language of business is clear - the business begins to lose money. But the vaults that contain customer information do not understand this - they store dozens of statuses in which outflows can mean blocking, termination of the contract, non-use of services for a certain period of time, untimely payment of bills - all this can be called customer outflow in one or a different degree. What do you think - if the formulation of the target parameter and the problem statement allows so many degrees of freedom - how easy is it to generate attributes for the task? Right! - it is even harder to do. Now imagine for a moment that we have decided on the task itself and the signs. And how will contact with the client be made? - and here it turns out that the process of retaining a client is a complex procedure that takes time and which has certain features. If we say, you predict that the client is inclined to go into the outflow with a certain probability on some date, then there is also the likelihood that you simply do not have time to interact with it. And all this must be considered long before the development of the algorithm. That is why it is necessary to learn how to quickly dive into a particular subject area in order to know the whole process and all the details. Suppose that we coped with this, we also decided on what signs we need to solve the problem. But then a new problem arises.

A large number of different data sources


Data, especially in a large company, is rarely stored in a single source. And the tools for working with that data are far from uniform: it can be sequence files in HDFS, NoSQL databases, or relational sources. Yet in the end we usually need an "object-feature" matrix, which means doing a large number of everyone's favorite joins, and in the case of big data that forces you to think about how to write queries optimally. This requires skills with different tools, ranging from SQL, Hive and Pig, up to the point where it is often easier to write code in Java or Scala than to use SQL-like languages. However, even if you are familiar with all these tools and know how to write complex joins correctly, other problems arise.
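Here is a rough sketch of what gluing two such sources together can look like with Spark SQL. The table, path and column names are hypothetical, and the exact API will depend on your Spark version.

```python
# A sketch of pulling features from two different sources and joining
# them into one "object-feature" table with Spark SQL. Table, path and
# column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="feature-join")
sqlContext = HiveContext(sc)

# Source 1: an existing Hive table with client profiles
profiles = sqlContext.sql("SELECT client_id, region, tariff FROM dwh.client_profile")

# Source 2: raw usage logs sitting in HDFS as CSV
usage = (sc.textFile("hdfs:///data/usage/2015/*.csv")
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], float(f[1]))))
usage_df = sqlContext.createDataFrame(usage, ["client_id", "traffic_mb"])

# Register both and join with plain SQL, keeping clients without usage
profiles.registerTempTable("profiles")
usage_df.registerTempTable("usage")
features = sqlContext.sql("""
    SELECT p.client_id, p.region, p.tariff,
           COALESCE(SUM(u.traffic_mb), 0) AS traffic_mb
    FROM profiles p LEFT JOIN usage u ON p.client_id = u.client_id
    GROUP BY p.client_id, p.region, p.tariff
""")
```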

Data quality and gaps


Very often (and the larger the company, the more often) you will find a large number of gaps in the data. For example, if the task requires certain specific features, it may turn out that most of them are available only for a small group of clients. Furthermore, even where customer data is available, part of it contains so-called outliers, not to mention that the data sources differ: the same data can be stored in different formats, and you have to bring everything to a common form with yet another join. But suppose you have coped with that: you wrote a good script in Hive, Pig or Spark and launched it. Think that's all? Not quite.
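A minimal sketch of this cleanup step in pandas, assuming purely illustrative column names and thresholds:

```python
# A small sketch of the routine cleanup step: fill gaps and clip
# obvious outliers before building the "object-feature" matrix.
# Column names and thresholds here are illustrative assumptions.
import pandas as pd

def clean_features(df: pd.DataFrame, numeric_cols) -> pd.DataFrame:
    df = df.copy()
    for col in numeric_cols:
        # Gaps: fall back to the median, which is robust to outliers
        df[col] = df[col].fillna(df[col].median())
        # Outliers: clip to the 1st..99th percentile range
        low, high = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(low, high)
    return df
```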

"Raw" software for working with data


The main reason for the popularity of Big Data tools at one time was their cheapness. In particular, most of the popular tools for working with big data are open-source projects, be it Hive, Pig, or everyone's favorite Apache Spark. But anyone who has worked with open-source tools at least once knows very well that you cannot rely on them blindly and that things do not always work on the first try. For example, you wrote a simple script that sequentially reads files from HDFS and went away for a few hours. When you come back, you may easily discover that the script "fell over" simply because the folder you were reading happened to contain some *.tmp file that Pig, say, could not parse. Anyone who has worked with Apache Spark can also tell you about the problems of reading many small files, or about how hard it often is to save trained models: sometimes you have to write your own serializers. There are many such examples. But let's say you have run and tested the code for a long while, and now, it seems, you finally have your long-awaited "object-feature" matrix.
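One small defensive habit that follows from stories like this is to read only the files you actually expect rather than an entire directory. A sketch with PySpark, with hypothetical paths:

```python
# A defensive sketch: instead of reading a whole directory (and tripping
# over stray *.tmp files), read only the files that match the pattern we
# actually expect. Paths are hypothetical; textFile accepts glob patterns.
from pyspark import SparkContext

sc = SparkContext(appName="safe-read")

# Only completed part-files, not whatever else landed in the folder
lines = sc.textFile("hdfs:///data/events/2015-03-*/part-*")

# Coalesce early if the input consists of many small files,
# so that downstream stages do not drown in tiny partitions
lines = lines.coalesce(200)
```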

Disadvantages of working with big data


So, imagine that you have a large amount of data. Suppose even that you have a small training sample that fits into RAM and that you have already built a good model. Along the way you most likely generated many new features (we did Feature Engineering, which we covered here and here). And what do you do with the huge sample of objects on which you need to call your long-awaited predict function? After all, it needs exactly the same transformations you applied to the training sample: adding new columns, filling in the gaps, normalizing the data. When building the model you probably did this with great packages like Pandas and its DataFrames. Well, now they are gone: everything has to be done by hand. The only consolation here is that Apache Spark 1.3 introduced DataFrame support, which should allow working with big data as conveniently as it is done in R or Python. Probably.
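One way to soften this is to put every transformation into a single function and apply it to both the small training DataFrame and the huge scoring DataFrame. A sketch assuming a reasonably recent PySpark and hypothetical column names (train_df and score_df are assumed to come from earlier steps):

```python
# A sketch of the idea the paragraph ends with: one function holds every
# transformation and is applied to both the training DataFrame and the
# huge scoring DataFrame, so nothing is "done by hand" twice.
from pyspark.sql import functions as F

def add_features(df):
    return (df
            .fillna({"traffic_mb": 0.0, "region": "unknown"})          # fill the gaps
            .withColumn("log_traffic", F.log1p(F.col("traffic_mb")))   # same derived column everywhere
            .withColumn("is_big_city", (F.col("region") == "moscow").cast("int")))

train_df = add_features(train_df)   # small sample used to fit the model
score_df = add_features(score_df)   # millions of rows to run predict on
```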

Availability of a training sample


This is perhaps not the most unpleasant problem, but it is an important one that arises in practice: where do you get a training sample? You have to think hard here, because it is important to strike a balance between the size of the training sample and its quality. For example, in some tasks there is only a small set of labeled objects, and you can try to enlarge the training sample by clustering and propagating the labels to nearby objects. You can also obtain a training set by solving various auxiliary problems. There are no universal recipes here; you have to improvise for each specific task.
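As an illustration of the clustering idea, here is a sketch that propagates the dominant label of each cluster to its unlabeled members. X (all objects) and y (labels, with NaN where the label is unknown) are assumed inputs, and the thresholds are arbitrary.

```python
# A sketch of one way to enlarge a small labeled sample: cluster all
# objects and propagate the dominant label of each cluster to its
# unlabeled members. An illustration of the idea, not a recipe.
import numpy as np
from sklearn.cluster import KMeans

def propagate_labels(X, y, n_clusters=50):
    clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    y_new = y.copy()
    for c in range(n_clusters):
        mask = clusters == c
        known = y[mask & ~np.isnan(y)]
        # Only trust clusters where the labeled points agree strongly
        if len(known) > 10 and known.mean() > 0.9:
            y_new[mask & np.isnan(y)] = 1
        elif len(known) > 10 and known.mean() < 0.1:
            y_new[mask & np.isnan(y)] = 0
    return y_new
```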

Summary


So, I hope that after reading this post you still want to get actively involved in data analysis. All of these problems do get solved with time, one way or another (I can say this from my own experience and from talking to colleagues at other companies), and new tools keep appearing, as in the case of Apache Spark. However, it is important to know about all these peculiarities and difficulties in advance.

Good luck to everyone, and great beginnings!

Source: https://habr.com/ru/post/254349/
