
Problems of Modern Data Science


Hi, Habr! Lately we have been noticing more and more often that the expectations of employers and aspiring data scientists diverge greatly. A company investing in new development expects, first of all, a return on its investment, not yet another model. A specialist fresh out of all sorts of courses expects clean, well-defined data as input and wants to hand over a model with quality metrics attached as output, and then "let the managers figure out" how it will be embedded into the process and how exactly the resulting model will be used. As a result, there is a gulf of misunderstanding between business and data scientists.

In fact, it turns out that nobody really needs the models themselves; in practice you have to deal with a very large amount of routine work.
I would like to use generalized examples (any resemblance to real life is coincidental) to show what difficulties you actually have to overcome to bring your employer money. Perhaps after reading this, people will go into data analytics more consciously, acquiring the skills actually needed for the job instead of studying yet another article about an algorithm.

Let's start with the most common example. Imagine that you have just graduated from university, know machine learning well, and know what xgboost, decision trees, and other algorithms incomprehensible to an ordinary person are. You come to work at a B2C company (specifically one with a capital "C", where the average bill is collected from the client regularly and over a long period), whose main goal is essentially to maximize LTV — say, that same telecom operator. Having heard conference talks about the properties of such businesses ("it's cheaper to retain an old client than to attract a new one", "churn must be managed", "you have to balance loyalty against raising ARPU"), you are not surprised when you are asked to improve or build a churn model. After all, this really matters: in such companies loyalty (usually measured via NPS or LT) comes first. Everyone understands that it's important (each in their own way).

What happens next? Naturally, you sketch a binary classification problem in your head, mentally unholster your xgboost, and wait to receive that cherished table with a clearly defined target column and a quality metric by which the algorithm's success will be judged (although most likely you will choose the metric yourself, from the usual list: ROC AUC, precision, recall, and so on). But that doesn't happen, because it's not even clear what churn is. You have just graduated from a university, you have never worked at an operator, and for you churn is "when customers stop using the company's services". Yes, machine learning algorithms are universal and can solve almost any problem, but correctly formulating the problem (and that is most of the work) can only be done by someone who understands well (or not so well) how companies actually manage churn. The author of this note, for example, knows at least a couple of dozen definitions of churn (plus their variations), and nobody knows which definition is the most correct.
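To make the ambiguity concrete, here is a minimal sketch (all column names, such as days_since_last_activity, are invented for illustration) of how three equally plausible definitions of churn produce three different target variables from the same subscriber snapshot:

```python
import pandas as pd

# Hypothetical subscriber snapshot; in reality these columns come from
# the operator's data warehouse, and their exact meaning is the hard part.
df = pd.DataFrame({
    "subscriber_id": [1, 2, 3],
    "days_since_last_activity": [5, 45, 120],
    "balance": [100.0, -3.0, 0.0],
    "outgoing_minutes_30d": [250, 0, 0],
})

# Definition 1: no activity of any kind for 90+ days.
df["churn_90d_inactive"] = (df["days_since_last_activity"] >= 90).astype(int)

# Definition 2: negative balance and no outgoing traffic in the last month.
df["churn_blocked"] = (
    (df["balance"] < 0) & (df["outgoing_minutes_30d"] == 0)
).astype(int)

# Definition 3: no outgoing traffic for a month, regardless of balance.
df["churn_silent"] = (df["outgoing_minutes_30d"] == 0).astype(int)

# Same subscribers, three different targets - three different "churn models".
print(df[["subscriber_id", "churn_90d_inactive", "churn_blocked", "churn_silent"]])
```

Train a model on each of these targets and you get three different churn models; which one the business actually wants is exactly the question no course prepares you for.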

Well, let's say we have settled on a definition of churn. What is a mobile operator's client? A normal person assumes a client is a client, nothing to add — what a stupid question! Just take the client_id and pull the features for it. But the thing is, the company has been running some big project called MDM for N years now, and it still hasn't decided what to consider a client. There are many things you could take, from a phone number to a service contract number or a personal account (which can have several numbers attached to it). But let's say you got lucky here too: the company found a modest employee who told you what can be taken as a subscriber, and you can safely rush off to pull the long-awaited features.

And now you wonder what data a telecom operator actually has that affects churn. You go read the articles of great scientists and find no specifics in them. Then you go ask your senior colleagues, who routinely pull data from a data mart in "some Oracle" — specifically, they take these particular columns, whose names are known, but what they mean and how they are computed is "somewhere in the documentation, probably" and "left over from when the vendor rolled all this out for us". Without gaining a comprehensive understanding of the features (otherwise why would they have hired you?), you start getting creative. And here you discover that your coolest feature ideas turn out to be unrealistically complex, and the difficulties begin even with the simplest things. For example, it is completely obvious to you that such a well-known indicator as ARPU (roughly, the average bill) affects churn, and again you go to your senior colleagues to find out where to get it. It turns out there are payments (what the client paid) and there are charges (what billing accrued). In theory these two values should be very similar, but only in theory. Clearly payments are the more meaningful signal, but they occur rarely, whereas charges are generated almost on the fly after every transaction. Most likely it is the charges you should take as the feature, and compute your ARPU estimate from them.
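As a sketch of that last choice (the charges table and its columns are hypothetical), here is how a monthly ARPU estimate per subscriber could be assembled from the charge stream rather than from sparse payments:

```python
import pandas as pd

# Hypothetical charge stream: one row per billing event.
charges = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2],
    "charge_date": pd.to_datetime(
        ["2024-01-03", "2024-01-15", "2024-02-02", "2024-01-20", "2024-02-25"]),
    "amount": [120.0, 80.0, 150.0, 300.0, 280.0],
})

# Sum charges per subscriber per calendar month...
monthly = (charges
           .groupby(["subscriber_id", charges["charge_date"].dt.to_period("M")])
           ["amount"].sum())

# ...and average over the observed months: an ARPU proxy built from charges.
arpu_estimate = monthly.groupby("subscriber_id").mean().rename("arpu_estimate")
print(arpu_estimate)
```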

Sooner or later you will, of course, figure the features out (most likely colleagues will explain them), realizing along the way that to design them properly you would need to work 5-7 years in that same operator's CRM to understand their real meaning and how to compute them.

So, the features are figured out. Pulled (most likely not with your own hands). Now you can (or can you?) breathe a sigh of relief, because here it is at last — that same cherished table. Then, as usual, you build charts (often you don't), look at dependencies, train the model, get huge ROC AUC, recall, precision and other numbers and report to management: "the quality came out at 100,500%", "the machine learning technology works", now we'll hand the Jupyter notebook over to our developers, let them rewrite it "for production", and that's it.
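For the record, this step really is the simple part. A minimal sketch on synthetic data (the real difficulty was everything before this point), assuming xgboost and scikit-learn are installed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score, recall_score
from xgboost import XGBClassifier

# Synthetic stand-in for the "cherished table": features plus a churn target.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# The part the courses teach: fit, predict, report the metrics.
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
pred = (proba > 0.5).astype(int)
print(f"ROC AUC:   {roc_auc_score(y_test, proba):.3f}")
print(f"Precision: {precision_score(y_test, pred):.3f}")
print(f"Recall:    {recall_score(y_test, pred):.3f}")
```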

But it's not that simple, because we haven't done what we were asked to do. We were asked to improve the efficiency of churn management, not to produce a Jupyter notebook. To which the reader will object: well, fine — we predict the people most likely to churn, take the most disloyal, offer them something, and that's it, we've retained them. The scheme is simple, but, as they say, there are nuances. And it is precisely this further reasoning and action that companies need, not a trained model, which (as it later turns out) most likely doesn't solve the problem anyway.

In fact, the work is only beginning now. For example, the logic in the scheme described above — "take the most disloyal and retain them" — looks great to a normal salaried person (one who is not losing their own money). But anyone who has ever run a business with a pool of potential customers to choose from can tell you one simple rule: not all customers are equally valuable, and some customers it is easier not to work with at all. And here comes the realization that we don't need to retain everyone, only the most valuable ones (this is where people usually say a lot of words about CLTV). Which means the target segment we built the model for is quite limited, and perhaps it wasn't worth building a model at all — simply estimating how many people fall into that segment might have been enough. Put simply: first let's figure out how many people we don't need to retain. Then it may turn out that some customers are worth a personal account manager, for others even an IVR as support would be too generous, and the remaining segment is so small that building a model is pointless — it's easier to just call them all once a month.
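A back-of-the-envelope sketch of that logic (every number here — churn scores, CLTV values, costs, success rate — is invented for illustration): a retention action only makes sense where the expected saved value exceeds its cost:

```python
import pandas as pd

# Invented illustration: model scores plus a customer-value estimate.
customers = pd.DataFrame({
    "subscriber_id": [1, 2, 3, 4],
    "churn_probability": [0.80, 0.75, 0.10, 0.60],
    "cltv": [5000.0, 40.0, 8000.0, 900.0],  # expected future value if retained
})

RETENTION_COST = 150.0    # cost of the retention action (call, discount, ...)
RETENTION_SUCCESS = 0.30  # assumed share of contacted churners who stay

# Expected gain = P(churn) * P(retained | contacted) * CLTV - cost of acting.
customers["expected_gain"] = (customers["churn_probability"]
                              * RETENTION_SUCCESS
                              * customers["cltv"]
                              - RETENTION_COST)

# The most disloyal customer (id=2) is not worth retaining at all;
# the segment actually worth a campaign may turn out to be tiny.
print(customers.sort_values("expected_gain", ascending=False))
```

Note that the rawest churn score answers the wrong question: a highly disloyal but low-value customer can have a negative expected gain, while a low-risk, high-value customer may still be worth a personal manager.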

Well, OK, now the approach to retaining customers is clear. Honestly, I would love to write a pile of text answering the many questions of this kind — questions that are this common yet not covered in any textbook. But written out, it would amount to nearly a whole book, so I'll just leave the questions here; anyone who wants the answers can write to me in PM (a reply is not guaranteed, owing to the author's workload):


Answers to such questions are priceless; for everything else, there's xgboost =)

We teach this craft at our School.

Unfortunately, our experience shows that even participation and strong results in Kaggle competitions do not help in solving industrial problems (fans of competitive programming came to a similar conclusion: participating in ACM-style contests has little to do with industrial software development). Moreover, this experience is gained only by trial and error and will never be described in books — even in our lectures we do not share all the subtleties we have learned in practice.

We remind you of the start dates of our courses:


We also have a new course. We received many requests about distance learning, and in response we made an introductory online course. It is an introduction to machine learning and data analysis: on the one hand it lets you get acquainted with these disciplines, and on the other it prepares students for our main courses.

Sign up for the preparatory course here.

P.S. The full article is available here.

Source: https://habr.com/ru/post/328556/

