
The difference between statistics and data science

Hello, dear readers. Once again we would like to consult you about the relevance of a new O'Reilly title. This time it is about statistics for Data Science.

The original is 250 pages long; its release date is February 25.


The book covers concise cases with a small number of graphs and examples in R.
To make thinking it over and voting more interesting, under the cut you will find an article whose author tried to pin down and describe the difference between statistics and Data Science.

It is hard to say which is in greater demand these days: the "data scientist" specialty or articles about data science. This always happens when a term turns into a buzzword. Everyone rushes to produce content, which is how we got the most popular search terms of our day: “responsive”, “the Cloud”, “Omni-channel”.

Of course, the demand for data scientists is huge. Last year, Glassdoor named this profession the top job of 2016, citing 1,700 open positions with an average annual salary of $116k.

But after I had studied a post on Data Science, and then an answer on Quora to a question from a business school (which, by the way, contained some deep thoughts), all in an attempt to understand this fashionable trend, I only had more questions. Each source defined what Data Science is, and what it is not, a little differently. After a couple of hours, I was no longer sure that the Data Science phenomenon existed at all.

So I was afraid that my own article on Data Science would simply add to the pile. And why read filler from yet another marketer, praising in every way a topic he doesn’t understand too well himself? What is data science? How is it different from statistics? Why is it in such demand?

As soon became clear, the answer has to do not only with the ability to program, but also with a deep understanding of the product being built.

Skeptical statistician


It seems that Nate Silver doesn’t see a difference between data science and statistics. He is a famous number cruncher, a key figure at the media site FiveThirtyEight, and the person who correctly predicted the outcome of the 2008 presidential election in 49 of the 50 US states. In 2012, he got all 50 out of 50. And he views the term “data science” rather skeptically.

“I think data scientist is a popular synonym for statistician,” Silver said in 2013 in a talk at the Joint Statistical Meetings.

“Statistics is a scientific discipline. The term ‘data science’ is a bit redundant, so it’s better to use the term ‘statistics’.”

To statisticians, the whole data science trend seems a bit arrogant. Whatever the exact definition of “data science” is, the field overlaps with work that statisticians have been doing for more than a decade.

And although there are a million counter-arguments, it is hard to refute this view without first agreeing on what “data science” actually is. Too many definitions of data science are assembled from older buzzwords, for example, "mining data for business intelligence." One ambiguous word after another. Turtles all the way down.

Even if data science is something special, I still could not understand why all these companies needed legions of such specialists. Why is the job considered so great? Maybe companies are simply imitating Google, Facebook and Netflix, coveting their profits and market value?

Frustrated, I dashed off a short message to a friend who is a CTO. He replied at lightning speed: “I don’t even want to hear about them.”

For several months he had been interviewing candidates for a data scientist position open at his company. It turned out that the self-proclaimed data scientists had only the vaguest idea of what they were supposed to do. Each candidate had a slightly different set of skills, and an even more peculiar understanding of the job.

“99% of candidates are not data scientists,” he said. “They do not know how to do what we need.”
Apparently, even those who champion this profession do not fully understand where statistics ends and data science begins.

The man who knows the answers


In search of answers, I wrote to Drew Harry, director of data science at Twitch. Last autumn, we discussed an article about how Twitch was consolidated. If anyone could point me in the right direction, it was Drew.

“Yes, I know a colleague with interesting thoughts on this,” he wrote.
A few days later I met with Brad Schumitsch; we decided to sit in a cafe near the Twitch head office in San Francisco.

“Well, tell me what you think about data science and statistics,” Brad asks. Then he sits quietly, sips hot chocolate and listens attentively, while I, two cups of coffee in, jump from the R language to managing data pipelines and on to algorithms.

Brad is a Fulbright Fellow. A dozen years ago, he wrote an important paper detailing how a mathematical method called “convex optimization” improved the quality of H.264 video encoding. He has a PhD in machine learning from Stanford, and he spent a year at Google X, the experimental research and development center where Google has developed ambitious projects such as the self-driving car and Google Glass.

Brad has the answers I need, but, like a good data scientist, he starts by asking questions to establish a starting point.

After I finish laying out my thoughts, Brad politely replies: “Those are all very useful observations, but overall the topic is not an easy one. It is a great topic precisely because there is something to discuss.”

After a pause, he begins: “First, I have great respect for statisticians.”

He speaks deliberately and does not hesitate to pause to collect his thoughts.
“Statistics is the most important component of data science. At Twitch, the data science team has three competencies: statistics, programming, and product understanding. We would never hire someone with a weak grasp of statistics. You can be a great programmer, but if you don’t know what Bayesian inference is, well, we also have an engineering department, and I can help you there.”

“Some people think that data science is just applied statistics, but we are definitely not just statistics. I don’t need only people who do theoretical research in statistics. Nobody has to write papers like Fisher,” he continues, referring to Ronald Fisher, the founder of modern statistics and experimental design. “It is much more important to be able to apply the findings.”
Naturally, at a company like Twitch, such “application” requires in-depth knowledge of computer science.

Not just statistics


In the statistical community, it is increasingly common to hear that the boundaries of statistics need to be expanded: for example, by paying more attention to collecting, presenting and managing data, and by getting more involved in prediction rather than simply inferring relationships. There are many areas in which statistics could grow. Instead of just completing a study and then returning to theoretical work, statisticians should build lines of communication.

For example, a couple of decades ago, quants (statisticians engaged in quantitative analysis) pored over numbers in their offices and passed the results to interested parties, such as traders, so that they could act on them. Today, data scientists write algorithms that can ingest the data, run the calculations and execute trades, all in a split second.

Obviously, the roots of all this lie in statistics. I understand why many, including the highly respected Nate Silver, may conflate it with data science. But the professional scope of data scientists is far from limited to statistics.
Computer science enriches many disciplines, giving them new dimensions. Marketing + programming = growth hacking. Perhaps statistics + programming = data science. How I wish I could go back to those Udemy classes I skipped.

The era of dynamic products


Twenty years ago, the sites I visited on a IIsi in the computer lab were mostly static documents. But such pages only get you so far, so more complex sites soon appeared that responded to user input. Google, for example, took a search query from the user and returned a list of relevant web pages.

But, of course, Google did not store a static document for every conceivable user input. No, Google’s crawlers scoured pages and extracted as much data from them as possible. So as soon as you entered the query “bicycle spare parts”, Google programmatically searched all the data it had and generated a page with links to pages that seemed to match the query.
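
The idea of extracting data from pages in advance and then assembling a result page per query can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the page URLs and texts are made up, the function names are hypothetical, and it says nothing about how Google's search actually works.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted text (illustrative data only).
PAGES = {
    "example.com/bike-shop": "spare parts for bicycles and bike repair",
    "example.com/cars": "spare parts for cars",
    "example.com/recipes": "quick dinner recipes",
}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    # .get() avoids creating empty entries for unknown words.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# The index is built once, ahead of time; each query is answered from it.
INDEX = build_index(PAGES)
print(search(INDEX, "spare parts bicycles"))  # ['example.com/bike-shop']
```

The point of the sketch is the split between the two phases: the expensive work of reading pages happens before any user shows up, and the page of results is generated on demand from whatever data is already there.
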
Of course, today we expect websites and data-driven applications to be dynamic and to take into account not only your input, but also the mass of information they have managed to learn about you. My Netflix homepage shows movies recommended to me based on my preferences. On Spotify, the “Discover Weekly” playlist is compiled for me.

When you open Facebook, it starts assembling your news feed, and countless factors go into optimizing it. Will Oremus, senior technology writer at Slate, describes this process in his excellent study of the algorithm behind the Facebook feed:

Whenever you open Facebook, one of the most influential, controversial and misunderstood algorithms in the world is set in motion. It scans and collects everything posted over the past week by all your friends, all the people you follow, all the groups you belong to and every Facebook page you have liked. For the average Facebook user, that adds up to more than 1,500 posts. If you have several hundred friends, there may be as many as 10,000 posts. Then, according to a carefully guarded and constantly changing formula, the Facebook news feed algorithm ranks these posts in exactly the order in which it believes you would want to read them. Most users usually view only the first few hundred.

Someone had to write an algorithm that implements all these capabilities. Facebook could collect all this “history” and hand it to a very talented statistician. The statistician would draw on his limitless knowledge and experience and write an excellent model in the R language, inferring the relationships between all these variables. That, of course, would make it possible to select effectively the advertising best suited to particular situations.

But how do you weave all this into a product? How much use is a retrospective analysis on its own? Facebook needs an algorithm that analyzes all of this while the page is loading, makes a prediction and serves the optimal news feed. This is what the data scientist does.
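
The difference between a retrospective model and a model woven into the product can be sketched as a scoring function that runs at request time. Everything here is an assumption made for illustration: the features, the weights and the function names are hypothetical, and the real Facebook formula is not public.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    author_affinity: float  # hypothetical: closeness of viewer and author, 0..1
    engagement: float       # hypothetical: normalized likes and comments, 0..1
    age_hours: float        # hours since the post was published


# Hypothetical weights, standing in for a model fitted offline on historical data.
WEIGHTS = {"author_affinity": 2.0, "engagement": 1.0, "age_hours": -0.05}


def score(post):
    """Predicted interest in a post: a weighted sum of its features."""
    return (
        WEIGHTS["author_affinity"] * post.author_affinity
        + WEIGHTS["engagement"] * post.engagement
        + WEIGHTS["age_hours"] * post.age_hours
    )


def build_feed(candidates, limit=2):
    """Rank candidate posts by score, highest first, and keep the top `limit`."""
    return sorted(candidates, key=score, reverse=True)[:limit]


# Candidates collected when the page is requested (illustrative data only).
CANDIDATES = [
    Post("a", author_affinity=0.9, engagement=0.2, age_hours=1.0),
    Post("b", author_affinity=0.1, engagement=0.9, age_hours=2.0),
    Post("c", author_affinity=0.5, engagement=0.5, age_hours=30.0),
]
FEED = build_feed(CANDIDATES)
print([post.post_id for post in FEED])  # ['a', 'b']
```

In this sketch the statistical work lives in the weights, while the engineering work lives in `build_feed`, which has to run quickly for every page load. The article's argument is that a data scientist is expected to handle both sides.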

That is why technology companies need such experts. And it is why, even though they work with statistics, they are far from being “the very same specialists, just seen from a different angle”.

But success in data science also requires a deep understanding of the product with which you work.

The question within the question


“Twitch is full of great specialists, and not all of them know statistics. So, to get results, you need to establish contact between the data scientist and the product manager,” says Brad.

While we are discussing the role of data science in product development, Brad continually mentions "efficiency."

“It is much more efficient to work when everyone shares the same understanding of the product, decides which metrics matter most, and understands, from the programmer’s point of view, how to implement tracking and, from the statistician’s point of view, how to do the analysis.”

If you do not understand how people will use the product and what the company’s goals are, you can distort the entire data analysis. The task of a data scientist is to keep all this information in mind at once and, when someone comes to the department with a vaguely defined problem, to know which data to turn to in order to answer the question.

Versatile craftsmen


Looking back, I understand why it is so difficult to define this field: its specialists work at the intersection of statistics and programming, as well as statistics and product. It is also clear how difficult it is to arrive at such a definition when you are building a data science team yourself.

At Google and Netflix, such work has been going on for years, but today’s eight-person startups also want to get into the game. Virtually every application has its own model of content delivery, optimized for each specific user. The better the algorithm, say, in a dating application like Hinge, the better the recommended partner will suit you, and the more likely the client is to find a match. To my mind, it is obvious why companies need people with this specialization, and even more obvious why it is so hard to find a specialist for the role. And the demand for data scientists is only growing.

Today’s data scientist is an unusual blend of economist, physicist and mathematician. This is a rare person who, thanks to circumstances and the right education, is also a strong engineer and computer scientist. But such people are hard to find. Experience shows that not everyone who claims to be a data scientist is actually able to explain what that is.

Perhaps, if we all come to agree on what data scientists should do, there will be fewer posts like this one. All the same, there is a feeling that the rush of demand for real specialists in this area is here to stay.

Source: https://habr.com/ru/post/320772/

