
Big Data: current reality

Hi, Habr!

Quite some time has passed since the series of articles on data analysis and machine learning was published, and people have started asking for new material. Over the past year I have had the chance to work with several companies that were planning to introduce advanced analytics tools: helping them select specialists, train their employees, and solve design problems. For me this was an unusual and at the same time challenging experience, so I would like to address this post to managers of companies that are planning to adopt Big Data and Data Mining.

This post is about the most typical questions I have heard in practice and examples of what I have had to deal with. I will say right away that this by no means concerns all companies, only those that do not yet have a culture of working with data.

It is important to understand the possibilities and limitations of using Big Data analytics.


Very often the conversation begins like this: "We want to predict X from our data Y with accuracy no lower than Z", where X and Y hardly correlate at all, and Z is a number I would bet anything cannot be reached.
For example, an IT manager from a fairly large retail chain recently approached me. The task sounded like this: "There are performance indicators (a few hundred of them) for regional stores (a few dozen) over 2 years (i.e. 2 tables). We need to predict the same indicators for the third year (that is, one more such table)." Anyone in the field will recognize this as a typical time-series forecasting task, and will also recognize that with such a problem statement and such a dataset it cannot be solved.

Likewise, people familiar with Natural Language Processing methods will tell you that, for example, text classification problems are in most cases solved well today (simply put, with a small amount of error), whereas a task like "generating conclusions from a set of related texts" (I was asked to develop such a "Decision Support System" for one fund) is practically unsolved at the moment, in the sense that the existing solutions are completely impractical.
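For contrast, here is a minimal sketch (with invented toy texts and labels, purely for illustration) of the kind of text classification that works well today: TF-IDF features plus a linear classifier, a few lines with scikit-learn. Nothing comparably simple exists for "generating conclusions from a set of related texts".

```python
# A minimal text-classification sketch: TF-IDF features plus logistic regression.
# The tiny dataset below is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "please reset my password",        # support request
    "i cannot log into my account",    # support request
    "great service, thank you",        # feedback
    "loving the new interface",        # feedback
]
labels = ["support", "support", "feedback", "feedback"]

# Pipeline: raw text -> TF-IDF vectors -> linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["how do i change my password"]))  # expected: ['support']
```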

Here I would like to note that Big Data is not a magic wand that takes in a "sheet" of data (another term often heard in many companies), does something inside, and produces whatever your heart desires. In forecasting tasks, the target variable (target) is, as a rule, predicted from a set of features, and that set has to be exhaustive in terms of its impact on the target variable. By the way, this leads to the second important observation:

In niche tasks, it is easier to train existing specialists than to hire new ones from the market.


In predictive analytics, one of the most important stages (in fact, the one that sets the first ceiling on the quality of the resulting models) is the preparation and selection of features (feature engineering, feature selection, etc.). Simply put, this means finding the set of parameters that actually influence the given target variable. That set is determined by the problem domain, so knowledge of the domain is critical for many tasks.

For example, suppose we are solving a customer churn prediction problem (for any service business). A Data Scientist who starts working on it will most likely include something like the "number of support calls" among the churn features. But any expert who understands the subject area will tell you that churn is driven not by the number of calls as such, but by "the presence of severity-X calls on subject Y in the last Z days". Adding such a feature can improve model quality several times over, and it is easy to see that only a person deeply familiar with the subject area can come up with it. Not to mention that the very concept of "churn" can only be defined correctly by someone who knows it first-hand.
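To make this concrete, below is a hypothetical sketch of how such a feature might be computed with pandas. The file name and the user_id, call_date, severity and subject columns are assumptions made for illustration, not the actual schema of any of the companies mentioned.

```python
# A hypothetical sketch of the domain-driven feature described above:
# "number of severity-X support calls on subject Y in the last Z days".
# The file name and column names (user_id, call_date, severity, subject)
# are assumptions for illustration only.
import pandas as pd

SEVERITY = 2          # "level X"
SUBJECT = "billing"   # "subject Y"
WINDOW_DAYS = 30      # "last Z days"

calls = pd.read_csv("support_calls.csv", parse_dates=["call_date"])
cutoff = calls["call_date"].max() - pd.Timedelta(days=WINDOW_DAYS)

recent = calls[
    (calls["call_date"] >= cutoff)
    & (calls["severity"] == SEVERITY)
    & (calls["subject"] == SUBJECT)
]

# One engineered feature per user, ready to be joined onto the training matrix.
feature = (
    recent.groupby("user_id")
    .size()
    .rename("recent_billing_calls")
)
```

The point is not the five lines of pandas but the choice of X, Y and Z, which only someone who knows the business can make.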

Yes, there are machine learning methods that can find such patterns on their own, but they are still far from practical applicability. So for highly specialized tasks, knowledge of the subject area remains essential for now.

This mainly concerns tasks such as fraud detection, churn prediction, or time-series forecasting, and I think many people doing data analysis in niche domains will confirm it.

A big CAPEX is often not needed


Three months ago, a not very large Internet service decided to build, again, a churn model for its users. This is probably one of the first tasks any service business takes on once it has heard the phrase Big Data, which is understandable: almost every service works on retention, because the cost per acquisition is almost always high. In other words, it is easier to earn from users who already pay out their LTV than to attract new ones. Back to the task: there are several million users, and all information about them currently sits in relational storage. The head of the unit responsible for the project was asked to estimate the budget. The amount was not that large for the company, but absolutely fantastic for this project. It was proposed (I quote almost literally) "to purchase several machines", "to set up a Hadoop cluster on them", "to organize data transfer to the cluster", "to install commercial big data software to work with the data on the cluster" and (it was not specified how) to build a predictive system on top of all that.

Budget approval in many companies is, admittedly, more of a political process: it is better to ask for more money than to end up short, and "nobody gets fired for buying commercial software from well-known brands". Still, one has to understand that tasks can often be solved far more simply and with far fewer resources, sometimes with what is already in place, if you know how. Although people who follow the principles above can certainly be understood.

In this case it turned out that a churn model satisfying the business could be built on the data already sitting in the relational databases, with practically no heavy data engineering (those who know will understand me). The tooling came down to Python and its libraries; the code now runs once a month, works for several hours, and its output is "uploaded to Excel" (this matters, because Excel is the tool involved in most of the company's business processes) and handed over to the team that plans customer retention campaigns. The solution is far from the best possible today, but it fully satisfies the customer; a rough sketch of such a monthly job is given below. As for why it satisfies them, that deserves a separate point.
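Here is a rough, hypothetical sketch of what such a monthly job might look like: pull the data from a relational database, fit a model on a labeled historical window, score the current user base, and hand the result to the retention team as an Excel file. The connection string, table and column names are placeholders, not the real system.

```python
# A rough, hypothetical sketch of the monthly job: read from a relational store,
# fit a churn model on labeled history, score current users, export to Excel.
# The connection string, table and column names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.read_sql("SELECT * FROM churn_history", "postgresql://user:pass@host/db")
current = pd.read_sql("SELECT * FROM current_users", "postgresql://user:pass@host/db")

feature_cols = [c for c in history.columns if c not in ("user_id", "churned")]

# Train on the labeled historical window...
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(history[feature_cols], history["churned"])

# ...score the current user base and hand the list to the retention team.
current["churn_probability"] = model.predict_proba(current[feature_cols])[:, 1]
current[["user_id", "churn_probability"]].to_excel("churn_scores.xlsx", index=False)
```

The whole thing fits in one script and runs on a single machine; no cluster is required, which was exactly the point.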

Companies are still afraid of "artificial intelligence"


In the example above, almost the entire churn-forecasting business process was covered end to end. Almost. The data culture in this company turned out to be good enough that churn could easily be predicted almost in real time: when a high churn probability was detected, the user could automatically be passed through the existing "contact policy" and sent a pre-prepared targeted offer. However, once this started being discussed with senior management, it became clear that the company was not ready for such a tool, the argument being "what if it mispredicts something or breaks, and we lose a lot of money". Meanwhile, a careful calculation of the risks in monetary terms showed that it would be economically better to be wrong a few times than to keep the full load on the department responsible for churn prevention.
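For the record, the tool the management turned down does not have to be complicated. A hedged sketch of the idea, with every name and the threshold purely hypothetical, could look like this:

```python
# A hypothetical sketch of the automated flow: on each user event, score the
# user and, if churn risk is high, route them through the existing contact
# policy with a pre-prepared offer. All names and values are illustrative.

CHURN_THRESHOLD = 0.8  # should come from the cost/risk calculation, not from thin air

def on_user_event(user_id, model, load_features, contact_policy, send_offer):
    """Score one user and trigger a retention offer if churn risk is high."""
    x = load_features(user_id)                 # current feature vector for the user
    p_churn = model.predict_proba([x])[0, 1]   # probability of the positive (churn) class
    if p_churn >= CHURN_THRESHOLD and contact_policy.allows(user_id):
        send_offer(user_id, offer_id="retention_offer")
```

The threshold is where the risk calculation enters: it should be set so that the expected cost of wrong offers stays below the cost of handling everything manually.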

There is no common conceptual language.


Since most modern analytics tasks involve machine learning methods, it is important to understand the quality metrics of the resulting algorithms, because they directly affect the business case, i.e. the very value the company wants to extract. Here the project lead will have to work (again, not in every company) to explain to financiers, project managers or executives, in accessible language, things like type I and type II errors, recall, precision, and so on, and also to keep in mind that people from the subject area use their own concepts when solving similar problems. So it is equally important to know what "lift", "coverage" and other terms typical of the domain mean (see the second point above).
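To show how small the gap really is, here is a toy illustration (with invented labels and scores) of the metrics that usually need translating: the confusion matrix with its type I and type II errors, precision, recall, and the "lift" the marketing side tends to ask about.

```python
# A toy illustration of the metrics that have to be explained to the business:
# confusion matrix (type I / type II errors), precision, recall, and lift.
# y_true / y_pred / y_score are invented; in practice they come from a trained model.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])            # actual churn labels
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])            # model decisions
y_score = np.array([.9, .2, .4, .8, .1, .7, .95, .3])   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("type I errors (false positives):", fp)
print("type II errors (false negatives):", fn)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))

# Lift in the top 25% of users ranked by predicted probability,
# relative to the base churn rate.
k = len(y_score) // 4
top = np.argsort(y_score)[::-1][:k]
print("lift@25%:", y_true[top].mean() / y_true.mean())
```

Lift here answers the marketer's question directly: how much more churn do we catch in the top quarter of the ranked list than by picking users at random.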

Of course, this does not apply to all companies. In many of them, especially Internet companies, the culture of working with data is well developed: everyone understands each other, and solutions get implemented much more easily.

These are the most common things a person developing predictive analytics tools (which is often also labeled with the buzzword Big Data) has to deal with. In this article I wanted to show that, beyond the well-known statement that "90% of a Data Scientist's work is preparing and cleaning data", in practice it often turns out to be much harder to go all the way from an idea to a final production solution, and the hardest parts begin and end far from data cleansing.

That is why it is often more important to find a common language with people while having a plain Decision Tree at your back than to have gradient boosting or a RandomForest that cannot be explained on your fingers.
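As an illustration of what "explaining on the fingers" means, here is a small sketch (on synthetic data, with hypothetical feature names) showing how a shallow decision tree's rules can be printed out verbatim, something an ensemble of hundreds of boosted trees does not offer.

```python
# A sketch of why a decision tree is easier to explain: its learned rules
# can be printed as plain if/else statements. Synthetic data, hypothetical
# churn-style feature names.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["days_since_last_visit", "recent_billing_calls",
                 "monthly_spend", "tenure_months"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))
# The printout reads like business rules, e.g.
# |--- days_since_last_visit <= 0.53
# |    |--- monthly_spend <= -0.12 ...
# (thresholds here are illustrative). A gradient boosting ensemble of hundreds
# of such trees gives no comparable one-page summary.
```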

Good luck to everyone!

P.S. Unfortunately, I do not have enough time for everything, so I will probably not manage to write about everything I would like to share. I am leaving it to a vote.

Source: https://habr.com/ru/post/258475/

