
Experience in developing the requirements for a professional data scientist

Today almost any business feels the need to analyze its data, and data science is no longer perceived as something new. Nevertheless, it is far from obvious to everyone what a hired specialist should be like.

This article was written not by an HR specialist but by a data scientist, so the style of presentation is quite specific. But there is an advantage in this: it is a view from the inside, which makes it possible to understand which qualities a data scientist needs so that the company can rely on such a person.

Prologue


The time came when our data science startup grew out of its cradle: the number of analysis tasks was increasing at an unforeseen pace, and that pace abruptly stopped being compensated for by automation. It became obvious that we needed new brains on the team...

At first it seemed to me that we needed a perfectly ordinary person: just a regular data-something-or-other... a programmer, an analyst, a statistician. So how hard could it be to draw up a list of requirements?

“In engineering, if you don’t know what you’re doing, you shouldn’t do it.”
Richard Hamming

I approached the task in my usual way. I took two sheets of paper: one titled "Technical Skills," the other "Professional Qualities." Then I felt an urge to go to some job site, dig up a stack of resumes, write out lists of qualities, and pick the ones I liked. But something stopped me. "That's not my way," I told myself. "I'm not good at that. What I understand is tasks."

So I tried to go from the tasks instead. Our tasks are simple: you are given an unparsed CRM of dubious content and asked to predict sales a couple of months ahead. Quite simple; anyone can do it... provided they can figure out the client's business. Ideally, a working group takes this on, abstracting from all other tasks and devoting itself to this particular analysis. At the input, the client's wishes; at the output, a solution that can be verified without going into details and without duplicating the work already done.

From here I put together the first more or less formal requirement: a person should be able to take on a separate task and not bother anyone much until a first rough solution is obtained. That solution can then be improved with help from other specialists. But at the first stage, involving someone else is the same as putting an overseer over a person, and an overseer can at any moment push the novice aside and start doing everything himself, making the hire entirely pointless.

Based on this first requirement, I quickly filled the first sheet: know Python, be able to extract information from various sources, store information, use AWS, know probability theory and statistics, be able to work with random processes. A little later I added economics in a basic version. The result was the list of skills needed to fulfill the first requirement.

But the list of professional qualities did not come together. Even after googling, I could not find any requirements for a data scientist's professional qualities that seemed appropriate.

Either generic formulations like "responsibility" came up, or by "qualities" people meant skills that belonged to the other list.

My own thoughts mixed into a mess that was hard to systematize: the global was tangled up with the specific, applicable only to particular tasks. Lumping overly general qualities together with qualities a candidate might never get to use seemed very wrong to me.

This is where the idea of the Task was born: a good and, as it seemed to me, elegant way to avoid philosophizing over lists of requirements, while at the same time assembling the necessary list by looking at the errors in candidates' solutions.

Task Formulation


An entrepreneur decided to open a shop at some badminton courts, so that visitors would not have to go to the supermarket for shuttlecocks and rackets.

Throughout the year, the entrepreneur kept all the receipts from purchases in order to understand later what decisions should be made to increase profit. The information from the receipts is contained in the attached file train_dataset.csv .

Shuttlecocks and rackets he packed and sold exclusively in sets of three types:

  1. Racket and two shuttlecocks
  2. Racket and five shuttlecocks
  3. Ten shuttlecocks

From time to time, the entrepreneur had to change prices with an eye on supermarket prices and tax rates.

The shop and the courts operated without weekends or holidays. The flow of buyers was somewhat limited: only four people are allowed on a court, a court is booked in advance for a two-hour session, and there are three courts in the stadium in total. Nevertheless, not a day passed without a sale, because from time to time either completely unprepared people showed up at the court, or someone broke a racket or lost their shuttlecocks.

A year later, the entrepreneur decided to hold a sale lasting from the first of January to the thirty-first of January inclusive. He redistributed the sets of goods and assigned them the following prices:

  1. Only one racket - 11 dollars 80 cents
  2. Five shuttlecocks - 5 dollars 90 cents
  3. One racket and one shuttlecock - 12 dollars 98 cents

It is required to determine the entrepreneur's income for January.
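As a first step, a candidate would likely load the receipts and aggregate daily revenue. A minimal sketch follows; note that the column names here are my assumption, since the article never shows the schema of train_dataset.csv, and the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical schema: "date", "set_type", "price", "quantity" are assumptions
# for illustration; the real file would be loaded with
#   df = pd.read_csv("train_dataset.csv", parse_dates=["date"])
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-05", "2018-01-06"]),
    "set_type": [1, 2, 3],          # 1: racket+2, 2: racket+5, 3: ten shuttlecocks
    "price": [12.50, 14.90, 9.90],  # made-up prices
    "quantity": [1, 2, 1],
})

# Daily revenue is the quantity-weighted sum of prices per day.
daily_revenue = (
    df.assign(revenue=df["price"] * df["quantity"])
      .groupby(df["date"].dt.date)["revenue"]
      .sum()
)
print(daily_revenue)
```

From a series like this, the candidate can start plotting, hunting for dependencies, and eventually projecting January's income.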

Sensitivity to probabilities


“I believe that the best predictions are based on an understanding of the fundamental forces involved in the process.”
Richard Hamming

The task was composed in imitation of real-life tasks, but in an artificial way, and this was not hidden from the candidates. Consequently, specific formulas were used to create the dataset. Formulas flavored with random variables, yes, but formulas nonetheless. In any case, it was assumed that a data scientist has the ability to detect these formulas and use them for prediction.

Of course, one cannot rule out the possibility that the dataset does not give a complete enough picture to recover the formulas with the desired accuracy. But in that case, as in real life, we think about what additional information is needed and where to get it.

In general, the desire to find the "law of the universe" is a good professional quality, and so is the ability to understand what to look for and where. Mr. Hamming knew what he was talking about. And thanks to him, the first line appeared in my list of requirements:

The ability to detect cause-and-effect relationships, describe them, formulate the conditions under which relationships can be transformed into a useful formula for business.

It was not by chance that I used the phrase "useful for business" here. In my personal practice it has often turned out that it was not the answer to the task itself that brought value to the business, but a side effect obtained by uncovering some internal dependency. In some cases this brought the startup additional money, new contracts, more know-how, and by-products.

Therefore, when analyzing the solutions sent to me, I watched carefully how each candidate used the knowledge that the dataset was artificial: whether at some point he would ask for additional information, or would instead prove that the dataset was sufficient to complete the task.

Self-confidence


“If an event attracts our attention, the associative memory begins to look for its cause, or rather, any reason that is already stored in the memory is activated.”
Daniel Kahneman

Not that associative memory is bad. It is the source and fuel of our imagination. Imagination lets us generate hypotheses, intuitively put forward assumptions, and quickly find the pairs of variables between which a relationship might exist.

But it also sets us up for confirmation bias.

We are so accustomed to our own experience and knowledge that we begin to extend them to new situations. In the living world this is often useful: for example, the belief that all snakes are venomous saves more lives than doubt about whether this particular snake is venomous. But in a safe office, with enough time, it is better to treat any judgment as a hypothesis.

The dataset for the task was deliberately constructed so that the time interval covered only one year of observations. It is good that candidates, while examining the graphs, put forward the hypothesis that seasonal fluctuations were present. It is bad that hardly anyone stated the need to check it. And it is very bad that some, without any check, insisted on seasonality.

So I entered the following in the list of qualities:

Critical thinking, including in relation to their own experience.

I really wanted to add "and knowledge" here, but then it seemed to me that this postscript would open up a whole new topic.

Neuroticism


“Having developed this or that theory, we turn once again to observations to test it.”
Gregory Mankiw

The data science literature deals with ways to automate hypothesis testing, yet I have rarely come across methodical instructions for their use. Because of this, believe it or not, I once managed to confuse two seemingly very different activities: testing statistical hypotheses and validating a model.

What muddles things even further, the difference between a statistical hypothesis and a hypothesis in general is often overlooked. To avoid such confusion in this article, let me use the term assumption for the general concept of a hypothesis.

In the previous section, one such assumption was put forward about the dataset, namely the presence of seasonality. Intuitively, the seasonal component can be defined as one that repeats periodically. And here you should immediately ask yourself: how many times must the component repeat before it can be considered seasonal? Moreover, can we assert, on the basis of periodic repetition alone, the presence of a seasonal component in a dataset whose time interval is only one year?

As already mentioned, the length of the interval was chosen deliberately. I wanted candidates to have both the need and the opportunity to propose their own ways of checking for seasonality in the task at hand. This quality, too, I added to the list of required professional qualities:

The ability to test assumptions in standard ways and invent new ways to check.

Probably "invent new ways" sounds too grand. I rarely have to come up with anything genuinely new; the method of simple reasoning that follows the question "What if?" is usually quite sufficient.

In the excellent article "This is right, but wrong," Alexander Chernookiy gave examples of quick, almost intuitive solutions to several probabilistic problems. That mechanism, it seems to me, is well suited to testing assumptions.

First, let us think about what kind of seasonality we want to find. Seasonality can be an external factor unknown to us, one that manifests itself as some paranormal repeatability in the data. Such seasonality can be described without going beyond the dataset, by writing out the seasonal component separately and showing the degree of its stability. Alternatively, seasonality can be hidden inside the known data. For example, suppose seasonality affects the number of buyers, and the number of buyers affects sales; if we knew in advance and completely when each buyer would come, we would hardly need seasonality as a separate phenomenon. Therefore we will look for precisely the paranormal kind of seasonality, the kind that the known data cannot account for.

Now let us assume that this seasonality does not affect sales. Then all fluctuations in sales are either random, or a certain relationship can be found between them and changes in other variables. How fully will that relationship describe what is happening? Will any room be left for paranormal seasonality?

That is, to check for seasonality, we can find all the dependencies on known variables and then, subtracting those dependencies from the fluctuations, look at the remainder. Moreover, if the spread of the remainder is small enough, there may be no point at all in looking for anything paranormal.

Thus we obtain a simple way to check for seasonality in the absence of a sufficiently long data interval.
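The check just described can be sketched in a few lines. This is a toy illustration on synthetic data, not the competition dataset; a single regression on price stands in for "all dependencies on known variables":

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily data: sales depend linearly on price plus small noise.
# (Illustrative numbers only, not the task's data.)
days = np.arange(365)
price = 10 + 0.5 * np.sin(days / 30)              # slowly drifting price
sales = 50 - 2.0 * price + rng.normal(0, 0.3, size=365)

# Step 1: fit the dependency on the known variable (simple OLS on price).
slope, intercept = np.polyfit(price, sales, 1)
explained = intercept + slope * price

# Step 2: subtract it from the fluctuations and examine the remainder.
residual = sales - explained
print("sales std:   ", sales.std())
print("residual std:", residual.std())
```

If the residual spread is tiny compared with the original fluctuations, there is little room left for an unexplained "paranormal" seasonal component; a large, patterned remainder is what would justify seasonality as a separate phenomenon.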

Attention


“Our mind is not prepared to understand rare events.”
Robert Banner

Turning to the search for a relationship between two quantities, we first try to get a feel for how they change together. And there is probably no simpler or more well-worn tool for this than linear regression. It can help form an opinion about a relationship even when the quantitative connection between the quantities is unknown. And it has a number of other advantages.

And disadvantages.

In reality, the relationship between two quantities is not always simple enough to be captured by numerical characteristics. No matter how beautiful a linear approximation of the relationship between two quantities looks, there is always the possibility that we are dealing with something more complex. The English statistician Francis Anscombe illustrated this phenomenon with four examples, which were later named Anscombe's quartet.

Putting something like Anscombe's quartet into the task turned out to be a good idea, and a very simple one to implement. Despite the popularity of the phenomenon, a great many candidates fell for it.

The implementation in the problem looked as follows. Suppose there are three groups of customers, each of which pursues a certain interest when buying. Two groups behave similarly, and their behavior is expressed as a linear relationship between demand and price. But the third group behaves differently: once the price moves above a certain threshold, buyers from this group abruptly stop buying more than the necessary minimum.

This phenomenon, quite common in the real world, allowed me to simulate one of Anscombe's examples and hide it among two other distributions.

In fact, "hide" does not quite fit the situation. I simply placed this distribution next to others that were more familiar and understandable. The difference was obvious on the graphs, or so it seemed to me, but not everyone noticed it. Especially interesting was one candidate's attempt to "improve" the approximation by switching to a higher-degree polynomial.
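To make the trap concrete, here is a reconstruction of my own (not the actual task data): a customer group whose demand collapses above a price threshold still admits a numerically respectable linear fit, and only the structured residuals, obvious on a plot, give it away.

```python
import numpy as np

rng = np.random.default_rng(1)

price = np.linspace(8, 16, 50)

# Two "ordinary" groups the trap was hidden among: linear demand in price.
demand_a = 40 - 2.0 * price + rng.normal(0, 1, 50)
demand_b = 35 - 1.5 * price + rng.normal(0, 1, 50)

# The trap group: above a price threshold, demand drops to a bare minimum.
threshold = 12
demand_c = np.where(price <= threshold,
                    30 - 1.0 * price + rng.normal(0, 1, 50),
                    2 + rng.normal(0, 0.5, 50))

# A straight line fitted to the trap group still "works" numerically...
slope, intercept = np.polyfit(price, demand_c, 1)
fitted = intercept + slope * price
r2 = 1 - ((demand_c - fitted) ** 2).sum() / ((demand_c - demand_c.mean()) ** 2).sum()
print(f"R^2 of linear fit on the threshold group: {r2:.2f}")

# ...but the residuals swing systematically around the threshold,
# which a scatter plot reveals at a glance while R^2 does not.
```

A higher-degree polynomial would push R^2 up further while still missing the actual mechanism, which is exactly the mistake described above.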

So I formulated another requirement for professional qualities:

To be able to isolate meaningful observations, to build hypotheses about their significance.

Impulsiveness


“The measuring instrument has been used extensively for five years and has gone through three tests.”
Timothy Leary

I have previously described a situation where the unexplained remainder becomes so small that its influence is indistinguishable against the background of the business benefit provided by the rest of the model.

However, we need to understand what can hide behind the expression "so small."

Usually we observe and measure the world with the help of instruments, simple ones like a ruler or complex ones like an electron microscope. A computer with a statistical programming environment installed on it is also a complex instrument.

In a sense, any observation or conclusion we make can be viewed as the result of a measurement. We look at the conditions of the problem and measure the income over a time interval that has not yet happened. Here I have replaced the word "predict," mysterious and magical to many, with the word "measure." Within my everyday work I can well afford to do so, because a forecast with a sufficiently high level of accuracy turns into routine calculation.

But no measurement can be perfectly accurate. Every instrument has a measurement error caused by its imperfections, and when reporting a measurement it is necessary to state its accuracy; for this, a confidence interval is given together with the obtained result.

Stating a confidence interval is not even a recommendation but a necessity, one that is often forgotten. Moreover, though this may sound pedantic, I believe that calculating the confidence interval is an act of self-respect, and I count the following quality among those necessary for a data scientist:

Accuracy in complying with the formal requirements of algorithms and methods, especially when it comes to calculating confidence intervals and checking necessary and sufficient conditions.
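As an illustration of that requirement (entirely synthetic numbers, not the task's data), a mean estimate reported together with its confidence interval might look like this:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sample of 60 daily incomes.
daily_income = rng.normal(loc=120.0, scale=15.0, size=60)

n = daily_income.size
mean = daily_income.mean()
# Standard error of the mean; ddof=1 gives the sample standard deviation.
sem = daily_income.std(ddof=1) / np.sqrt(n)

# 95% confidence interval via the normal approximation (n is large enough
# here; for small samples the t-distribution would be the proper choice).
z = 1.96
low, high = mean - z * sem, mean + z * sem
print(f"mean daily income: {mean:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

Reporting "[116, 124] at 95% confidence" instead of a bare point estimate is precisely the formal discipline the requirement asks for.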

Plasticity


“This statement is not quite true, but it is true enough for practical application in most cases.”
Francis Anscombe

Until now I have avoided discussing the most prominent feature of this task: the interval to be predicted is marked by a sharp change in the goods being sold. Now it is time to explain why this change is in the task.

Above, I have already stated my view on testing various assumptions. There should always be a check. If something cannot be verified, or no method of verification is known, then the possible options should be laid out; they may become grounds for further research. But at the same time, one must try to describe the situation as fully as possible on the basis of the known information.

Indeed, what do we know about sales? There are people who, for known and enumerated reasons, make purchases. The entire process can be modeled almost completely, since we have found all the dependencies and established that the unexplained remainder is normally distributed and has very little variance.
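That claim about the remainder can itself be checked. A minimal sketch, assuming nothing beyond NumPy and the standard library: compare empirical quantiles of the residuals against the quantiles of a fitted normal, a poor man's Q-Q check (the residuals here are synthetic stand-ins).

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)

# Synthetic residuals standing in for the model's unexplained remainder.
residuals = rng.normal(0.0, 0.2, size=365)

mu, sigma = residuals.mean(), residuals.std(ddof=1)
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

# Empirical quantiles vs. quantiles of a normal fitted to the data.
empirical = np.quantile(residuals, probs)
theoretical = np.array([NormalDist(mu, sigma).inv_cdf(p) for p in probs])

max_gap = np.abs(empirical - theoretical).max()
print(f"sigma = {sigma:.3f}, max quantile gap = {max_gap:.3f}")

# A gap that is small relative to sigma is consistent with normality;
# a large one suggests the remainder still hides structure worth studying.
```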

Then the questions begin: does the purchased volume of goods cover people's needs? What do they do when a need remains unmet? For example, what do they do if they consider the price of a product too high? Where does the linear dependence of demand come from?

In fact, these are questions for the business, and they should certainly be put to the business owner as an expert in his field. After all, the original dataset is far from always complete, and the business, even with a staff of professional analysts, does not know everything. Indeed, a business calls in data science precisely because it does not know everything. But what if...

What if there is a verifiable and consistent model that describes the situation using only the data we already have? That, too, is worth checking.

Epilogue


Let me draw up a final list of the professional qualities of a data scientist that I have written out.

  1. The ability to detect cause-and-effect relationships, describe them, formulate the conditions under which relationships can be transformed into a useful formula for business.
  2. Critical thinking, including in relation to their own experience.
  3. The ability to test assumptions in standard ways and invent new ways to check.
  4. To be able to isolate meaningful observations, to build hypotheses about their significance.
  5. Accuracy in complying with the formal requirements of algorithms and methods, especially when it comes to calculating confidence intervals and checking necessary and sufficient conditions.

In this assembled form the list seems fairly obvious to me, perhaps because it repeats, to some extent, the list of cognitive biases. Which, by the way, makes me think about how self-evident observations appear a posteriori. And yet I remember the time spent meditating over the second blank sheet of paper, and I understand that the list would never have been compiled without the work done.

Another interesting thought: the importance of some fact to one person is not necessarily obvious to another. This is clearly visible in the solutions to the problem that I received from dozens of candidates...

Author: Valery Kondakov, Co-founder and CTO of Uninum company
Coauthor: Pavel Zhirnovsky, Co-founder and CEO of Uninum

PS


Vacancy statistics as of 25/06/19
Posted: 27/05/19
Total job views: 2727
Total responses: 94

Source: https://habr.com/ru/post/457630/

