The data scientist profession: how not to make a mistake

Do people play with numbers, or do numbers play with people? Classical secondary education contains a funny paradox: students are taught to memorize rules and the cases where they apply, but the more rules and exceptions a student knows, the more opportunities he has to make a mistake. In a dictation woven from classical Russian literature, the abundance of commas setting off clarifying phrases suggests that a comma, once placed, is never a mistake. It would follow that a competent piece of writing is an essay with a large number of commas. A problem with causation, right? If you are a good writer, you may well use many such commas, but the number of commas is not what makes you a good writer ...

The interpretation of commas in classical Russian literature is an example of poor data analysis, built on a lack of curiosity and a lack of understanding of mathematical statistics. Curiosity and statistical understanding, plus a passionate desire to grow in information technology, are the key to understanding the data scientist specialty.
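
The comma example can be played out in a small simulation. The sketch below is only an illustration, not data from any real dictation: it assumes a hidden "skill" variable that drives both the number of commas and the quality of the text, and every coefficient in it is made up. The names simulate, pearson and extra_commas are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate(n=2000, seed=0, extra_commas=0):
    """Generate (commas, quality) pairs for n imaginary essays.

    Assumption for illustration: an unobserved 'skill' causes both values.
    extra_commas models a student who simply adds commas to every essay.
    """
    rng = random.Random(seed)
    commas, quality = [], []
    for _ in range(n):
        skill = rng.gauss(0, 1)
        commas.append(20 + 5 * skill + rng.gauss(0, 2) + extra_commas)
        quality.append(50 + 10 * skill + rng.gauss(0, 5))  # commas never enter here
    return commas, quality


commas, quality = simulate()
r = pearson(commas, quality)

# Intervention: same essays (same seed), ten more commas in each of them.
commas_more, quality_more = simulate(extra_commas=10)
comma_shift = statistics.fmean(commas_more) - statistics.fmean(commas)
quality_shift = statistics.fmean(quality_more) - statistics.fmean(quality)

print(f"correlation between commas and quality: {r:.2f}")
print(f"average commas added: {comma_shift:.1f}")
print(f"average change in quality: {quality_shift:.1f}")
```

Under these assumptions, commas and quality are strongly correlated, yet forcing more commas into every essay leaves the quality unchanged, because both were driven by skill all along.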

This post is based on a talk by an Airbnb data science officer.

We will not dwell on why data scientist is ranked among the most attractive and promising professions in the world. It is enough to mention that the number of vacancies in this field is growing exponentially and that, according to McKinsey Global Institute estimates, by 2018 the United States alone will need an additional 190 thousand data specialists trained in statistics and machine learning. McKinsey adds that millions of managers will also need training in basic data skills.

This is a huge market that is only just emerging, but big data problems and the ways to solve them did not appear yesterday. The archive accumulated over the years at Airbnb alone amounts to several petabytes. Dozens of terabytes of information are processed daily in a data warehouse built on Apache Hadoop and Hive. We have already talked about Airbnb's personalized search engine, which was built on Storm, a distributed real-time processing system. At Airbnb, analysis of user data is needed for virtually every decision about the company's development. Data scientists are vital to us.

Today, only a third of the demand for data science specialists can be met. The undersupplied market cannot provide companies with qualified people in data mining or predictive analytics, which pushes up both demand and salaries. Public and private universities cannot keep up with the need to train specialists in working with data.

Data Scientist: personality traits

A number of technical universities offer a “Master of Science in Data Science and Management” program. The specialty requires deep knowledge of mathematical statistics, machine learning and programming. However, no training compares with the experience you get on the job, when you face real problems. Only work will show you that the path you have chosen is not the easiest one in life.

Data science is as difficult as doing science in general. As in conventional scientific disciplines, most of the methods you try will not work. You can't just walk into the lab, snap your fingers and get a result. You will come up with a lot of interesting (just great!) things: how to make the system better, how to adjust and optimize the sample, and the like. About two thirds of your ideas will not work. Most of the time you will fail, and you should be ready for this.

To be a good data scientist, it is not enough to be a good programmer. You need to understand statistics better than software engineering. A competent data scientist is a competent statistician. The specialists around you understand everything else better, and that is normal: you should be able to listen to them and get from them the data you need for your work.

A data scientist is a person who loves math. Employers looking for a data specialist should look first at graduates of mathematical specialties. You have not studied math and are afraid this puts an end to your career? There is an alternative path: studying computer science. Success in academic science is another route. Mindset is what matters, do you understand? You can be an expert in neuroscience and decide to study data, and math will welcome you with open arms.

Immersion in mathematics should not keep you from studying computer systems. Otherwise, it is easier to become a teacher. It is in fact a big problem that mathematicians often do not understand the scale of the data being used or the very structure of computer data and, as a result, cannot anticipate the systemic problems that will appear later. There is always a gap between the probabilistic mathematical model, which you suppose corresponds to the structure of your problem, and the actual data you are trying to analyze. Doing statistics means going back and forth between the model and the data. It is very important to understand this at a deep level, and not to treat mathematics (and computer systems) as a magic box into which you can throw numbers, turn the handle and get a result.
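
One way to see the gap between a model and the data is to ask the model for a prediction and then check it against what was actually observed. The sketch below is an illustration under stated assumptions, not an analysis of real Airbnb data: the "observed" values are synthetic and heavy-tailed (log-normal), while the analyst is assumed to have fitted a normal distribution to them. The variable names are hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev

# Synthetic stand-in for real measurements (assumption: heavy-tailed data).
rng = random.Random(42)
data = [rng.lognormvariate(0, 1) for _ in range(10_000)]

# The "magic box": fit a normal model using only the mean and the spread.
model = NormalDist(fmean(data), stdev(data))

# Ask the model how often values above mean + 3 sigma should occur...
threshold = model.mean + 3 * model.stdev
predicted = 1 - model.cdf(threshold)

# ...and compare with how often they actually occur in the data.
observed = sum(x > threshold for x in data) / len(data)

print(f"model says   P(x > threshold) = {predicted:.5f}")
print(f"data says    P(x > threshold) = {observed:.5f}")
print(f"underestimate factor: {observed / predicted:.1f}x")
```

With these assumptions the normal model badly underestimates how often extreme values occur. The point is not this particular pair of distributions but the habit of going back to the data to test what the model claims.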

Data Scientist: how to become one

People act according to the patterns in their heads. When considering a problem, you fall back on ready-made behaviors. A data scientist works with random variables and probabilistic models, because the task is to find the most unexpected patterns. If you want to hire such a specialist and admit to yourself that you do not know much about statistics, give the person you are interviewing a problem completely stripped of context. Out of context. You will see how the candidate approaches a problem without knowing in advance how to solve it. This is the essence of the work: to think not about previously obtained statistics, not about computer models of the solution, but about the problem itself. The way a candidate handles it demonstrates the ability to apply probabilistic models to complex data.

So, you are ready to do all these things: you understand statistics, you understand data structures and algorithms, or you are a scientist who understands what underlies modeling. Now you can get a job. But there are still many things you do not know and that are hard to learn, because they are not in textbooks. For example, most data analysts do not understand how teams work in software development. It is very frightening and unnerving to enter an unfamiliar environment that you do not understand. There is nothing humiliating in admitting this and starting over as a student of more experienced developers.

Watching a software project develop from scratch is an invaluable experience. Another way to gain experience with real-world problems is to take part in Kaggle. The platform is used to solve complex problems in various fields (marketing, finance, banking, medicine, insurance, research). Kaggle turns companies' business challenges into structured data sets that are convenient to work with.

Data Scientist: do not be who you are not

Do not try to be who you are not. A data scientist is often mistaken for a data analyst. The analyst may say: “If my data analysis tools cannot answer the question, the question remains unanswered.” In that mode, we send a query to the database and, if it does not return within half an hour, we cancel it and move on to the next one.

A data scientist thinks like this: “If my data analysis tools cannot answer the question, then I need better tools and data.” This example best explains what it means to be a data scientist. The scientist does not say: I cannot answer the question, so I will go and do something else. The scientist keeps thinking about the question and looks for ways to answer it.

