10 reasons why your data project will fail

Introduction

Data science continues to generate excitement, but the actual results often disappoint business stakeholders. How can we reduce risk and make sure that results meet expectations? Working as a technical specialist at the intersection of R&D and commercial operations has given me a view of the problems that stand in the way. Here is my personal take on the most common ways data science projects fail.

A full version with slides and explanatory text is available here. The slides are also available separately as a PDF file.

There is also some discussion on Hacker News.

First, a few words about me:
I have led data science teams at two startups in London.
The products we built are used by Time Inc, Staples, John Lewis, Top Shop, Conde Nast, New York Times, Buzzfeed, etc.
This post is based on discussions I have had with many leading data scientists over the past few years. Many companies seem to follow the same pattern: they hire a data science team, only to see the whole team quit or be disbanded about 12 months later. Why is the failure rate so high?

Let's look at the reasons.

1. Your data is not ready

If the data is in the database, you can use it, right?
If it has never been used before, assume it is garbage.
Check the data.

A very wise data science consultant told me that he always asks whether the data has been used in a project before. If not, he adds 6-12 months for data cleaning.

Check the data before you begin. Check it for completeness and contamination. For example, you may find that the database holds transactions in both dollars and yen without recording the currency. This really does happen.
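A first pass at this kind of check can be very small. The sketch below is only an illustration: the rows, the field names (`id`, `amount`, `currency`) and the helper functions are all hypothetical, and a real check would read from your own database. It reports how complete each field is and lists the rows that have no currency recorded.

```python
from collections import Counter

# Hypothetical sample rows; in practice these would come from your database.
transactions = [
    {"id": 1, "amount": 120.0, "currency": "USD"},
    {"id": 2, "amount": 13500.0, "currency": None},  # currency missing
    {"id": 3, "amount": None, "currency": "JPY"},    # amount missing
    {"id": 4, "amount": 80.5, "currency": "USD"},
]

REQUIRED_FIELDS = ("id", "amount", "currency")


def completeness_report(rows, fields=REQUIRED_FIELDS):
    """Return, per field, the fraction of rows with a non-missing value."""
    total = len(rows)
    if total == 0:
        return {}
    filled = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) is not None:
                filled[field] += 1
    return {field: filled[field] / total for field in fields}


def rows_missing(rows, field):
    """Return the ids of rows where `field` is absent or None."""
    return [row["id"] for row in rows if row.get(field) is None]


report = completeness_report(transactions)
print(report)
print("rows without a currency:", rows_missing(transactions, "currency"))
```

A row with an amount but no currency is exactly the dollars-versus-yen trap: the number looks valid, so only an explicit check will surface it.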

2. You often hear: "Data is the new oil"

But it is not. Data is not a commodity; it has to be turned into a product before it has any value. Many people told me about projects that were launched without any idea of who the users would be or how the “valuable data” would be used. The answer usually came too late: "no one" and "no way".

3. Your data professionals are thinking about leaving.

Could you send me a work assignment?
What are you developing at the moment?
Actually, I just got access to R and Python! Literally 5 minutes ago.

Do not make life hard for your specialists by denying them access to the data and tools they need to do their job. The senior researcher in the exchange above waited six weeks for permission to install Python and R. He was happy!

Alas, the happiness was short-lived:

You must be joking ...
Here it is.

This program is blocked by group policy. For more information, contact your system administrator.

Now let me introduce this guy:

He was the product manager for an online auction site you may have heard of. His story was about an A/B test of a prototype algorithm for the site's main product search engine. The test was successful, and the new algorithm went into production.

Unfortunately, after a lot of time and money had been spent, it turned out that there was a bug in the A/B test code: the prototype was never used. They had accidentally tested the old algorithm against itself. The results were meaningless.

This was the problem:

You will not know that your results are rubbish.
Sampling error, measurement bias, Simpson's paradox, statistical significance, etc.
R&D is hard.
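The A/B test story suggests one cheap guard: before trusting any result, confirm that the two arms actually behave differently on a sample of inputs. The sketch below is an assumption about how such a check might look, not the auction site's code. `old_search`, `new_search`, the catalog and the queries are all hypothetical stand-ins.

```python
def old_search(query, catalog):
    """Hypothetical control algorithm: substring match, in catalog order."""
    return [item for item in catalog if query in item]


def new_search(query, catalog):
    """Hypothetical prototype: same matches, shortest titles ranked first."""
    return sorted(old_search(query, catalog), key=len)


def fraction_differing(control, treatment, queries, catalog):
    """Share of queries for which the two arms return different results."""
    differing = sum(
        1 for q in queries if control(q, catalog) != treatment(q, catalog)
    )
    return differing / len(queries)


catalog = ["red bicycle helmet", "bicycle", "bicycle pump", "red scarf"]
queries = ["bicycle", "red", "pump"]

# Correct wiring: the treatment arm really is the prototype.
wired_ok = fraction_differing(old_search, new_search, queries, catalog)

# The bug from the story: the treatment arm silently runs the old algorithm.
wired_wrong = fraction_differing(old_search, old_search, queries, catalog)

print("arms differ on", wired_ok, "of queries when wired correctly")
print("arms differ on", wired_wrong, "of queries when mis-wired")
```

If the arms differ on none of the sampled inputs, the experiment is comparing an algorithm with itself and no amount of traffic will make the result meaningful.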

4. You do not have a data science leader

You need people who live and breathe sampling error, measurement bias and the like, or you will never know that your results are meaningless. Such people are called "scientists."
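Simpson's paradox is a good example of why this expertise matters. The sketch below uses the commonly quoted kidney-stone treatment figures, which are not from this article, to show a treatment that wins in every subgroup and still loses in the pooled totals. The variable and function names are my own.

```python
# (successes, patients) per treatment and stone size.
# Commonly quoted textbook figures, used here purely as an illustration.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def subgroup_rate(treatment, group):
    """Success rate of one treatment within one subgroup."""
    successes, patients = data[treatment][group]
    return successes / patients


def overall_rate(treatment):
    """Success rate of one treatment with the subgroups pooled together."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return successes / patients


for group in ("small", "large"):
    print(group, round(subgroup_rate("A", group), 3),
          round(subgroup_rate("B", group), 3))
print("overall", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```

Treatment A does better for small stones and for large stones, yet B looks better overall, because A was given far more of the hard (large-stone) cases. Someone who reads only the pooled number draws the wrong conclusion.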

By the way.

This person is neither a “scientist” nor a data scientist:

“An analyst leader who forms a strategy for managing information flows, for business intelligence (BI) tools and for analytical solutions aimed at organizational transformation. He has experience in leading teams in developing enterprise-class solutions and maximizing business value.”

And this data scientist can be considered a “scientist”:

"Specialization: probabilistic programming, data analysis, Bayesian modeling, hidden Markov models, Markov chain Monte Carlo (MCMC), recurrent neural networks (LSTM), multi-task learning, domain adaptation."

Also, the opposite is very often true:

5. You should not have hired scientists*

*See point 3.

For ETL (extract, transform, load), hire data engineers.
Hire business intelligence (BI) specialists to create reports.
The end.

6. Your boss reads machine learning blog posts.

The buzz around machine learning means that there is a lot of easily accessible content. This can lead to what might be called the “precocious expert” phenomenon: now everyone has great ideas about machine learning. One symptom is the use of phrases such as “elimination of divergence” or “ensemble method” in the wrong context. Believe me, it does not end well.

A cost-oriented healthcare project used hospital data on patients arriving at emergency rooms with symptoms of pneumonia. The aim was to build a system that could identify people with a low probability of death, so that they could simply be sent home with antibiotics. This would allow care to be focused on the most serious cases, those most likely to develop complications.

The neural network that was developed had very high accuracy but, oddly enough, always sent asthmatics home. This was puzzling, since asthmatics actually have a rather high risk of complications from pneumonia.

It turned out that asthmatics with symptoms of pneumonia were always sent to intensive care. As a result, the training data contained not a single asthmatic death. The model concluded that asthmatics have a very low risk of death, when in reality the opposite is true. The model had high accuracy, but had it been deployed, it would inevitably have killed people.

7. Your models are too complicated.

Use an interpretable model first.
Test against a baseline for comparison.

The moral of this story: use a simple model that you can understand. Only then move on to something more complicated, and only if there is a need for it.
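Testing against a baseline can be as simple as asking what accuracy you would get by always predicting the most common outcome. The sketch below uses made-up labels and made-up model predictions, so treat every number in it as hypothetical. It only shows the shape of the comparison.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_label(labels):
    """The most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical, heavily imbalanced outcome labels.
y_train = ["survived"] * 90 + ["died"] * 10
y_test = ["survived"] * 18 + ["died"] * 2

# Hypothetical output of some complicated model on the test set:
# it gets every survivor right and one of the two deaths wrong.
model_pred = ["survived"] * 18 + ["survived", "died"]

# Baseline: always predict the majority class seen in training.
baseline_pred = [majority_label(y_train)] * len(y_test)

model_acc = accuracy(y_test, model_pred)
baseline_acc = accuracy(y_test, baseline_pred)
print("model accuracy:   ", model_acc)
print("baseline accuracy:", baseline_acc)
```

Here a model with 95% accuracy is only five points better than a rule that never looks at the patient, which is a very different story from "95% accurate".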

8. Your results are not reproducible.

Git;
Code review;
Automated testing;
Data pipeline orchestration.

Reproducible results are the foundation of any science. Do all of the above. And do not say later that I did not warn you.
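As one small illustration of automated testing in the service of reproducibility, the sketch below fixes the random seed of a data split and checks that two runs give the same result. The function `train_test_split` here is a hypothetical helper written for this example, not a library call.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows with a local, seeded RNG and split into (train, test)."""
    rng = random.Random(seed)  # local RNG: no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_split_is_reproducible():
    rows = list(range(100))
    # Same seed, same input: the split must be identical on every run.
    assert train_test_split(rows, seed=42) == train_test_split(rows, seed=42)
    train, test = train_test_split(rows, seed=42)
    # No row is lost or duplicated by the split.
    assert sorted(train + test) == rows
    assert len(test) == 25


test_split_is_reproducible()
print("split is reproducible")
```

A check like this is meant to run automatically on every commit, so that a change which quietly breaks reproducibility is caught straight away.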

9. An R&D lab is alien to your company's corporate culture.

People prefer intuition.
R&D is a high-risk activity.
Lab meetings, talks, publishing papers, etc.

A lab doing applied science demands a serious commitment from the company. Accurate data can be very threatening to people who prefer to trust their intuition. R&D carries a high risk of failure and requires an unusually high level of perseverance, as a necessary but still not sufficient condition for success. Ask yourself honestly: does your company really accept such a culture?

10. Designing data products without real data is like taking up taxidermy without ever observing live animals.

When designing any data product (even a mockup), product managers and UX designers must never work with fake data. As soon as real data is put into the mockup, the design may turn out to be pure fantasy.

Real data may contain strange outliers or, on the contrary, be completely monotonous. It may be extremely dynamic. It may be fully predictable or hard to predict. Use real data from the start, or your project will end in suffering and self-loathing.

Source: https://habr.com/ru/post/317836/
