Hi, Habr! We have relaunched MLClass, the first community of Data Science enthusiasts in Russia. To mark this, and as promised in the article "Your personal course on Big Data", I am posting answers to the questions most frequently asked by people interested in Data Science and Big Data. The answers come from the best data analysis practitioners: Kaggle winners, employees of many companies implementing Big Data solutions, and everyone who knows Data Science firsthand. Every day more and more people in Russia and the CIS countries take an interest in data analysis, and more and more competitions and hackathons are being held. However, there are still many myths around this topic, which I am going to dispel in this post!
So, I took about 100 of the most common questions, selected the most discussed among them, and commented on each in as much detail as possible so that no questions remain!
Of course, where to start depends strongly on the background of the person who is going to solve problems. In general, though, someone who has some knowledge of statistics, common sense, and a mathematical mindset should begin right away with hands-on exercises. It is therefore recommended to quickly complete the Andrew Ng course on Coursera.org and to work through classic problems on kaggle.com, such as Titanic: Machine Learning from Disaster and Bag of Words Meets Bags of Popcorn. Both are analyzed in sufficient detail on the site itself, and the first solutions are written almost in Excel, so the entry threshold is minimal.
First of all, you need to understand what the real work of a Data Scientist looks like. You can study machine learning indefinitely, but what is the point if most of the work is often routine? So, in order not to waste time and to make sure this is what you want, first find out how things actually work in the real world (and not on Kaggle) and what you should be prepared for. A series of articles was written on this at one time:
Many who begin to learn data analysis find that they lack mathematical thinking. Indeed, understanding the algorithms and working competently with data requires a certain rigor of reasoning. Members of the MLClass community therefore agreed that you should start with simple courses that are most relevant to Data Science:
Of course, these books cover far from all branches of mathematics, but they are enough to start solving problems. Keep in mind also that most Habr readers already have at least a basic knowledge of mathematics, so these books will surely suffice.
This is one of the questions I am asked most often. It was answered best by Stanislav Semenov, who is currently among the top 5 data scientists in the world according to Kaggle:
“Oddly enough, it is experience that matters most here... The more data analysis problems you solve, the more methods and techniques you try, and the more you study the mathematics behind them, the easier each new task becomes. I would personally recommend carefully studying the solutions to previous problems and competitions (for example, here and here). Similar problems have almost certainly been solved before, and you can learn a lot from those who have already implemented something successfully.”
This once again confirms that Data Science is primarily a practical discipline, in some ways even similar to sports: you have to practice regularly and keep improving your skills.
Obviously, this question has no exact and correct answer, because everything ultimately depends on the nature of the work (and on the employer). Still, the levels can be roughly defined as follows:
1. Beginner. As a rule, you need to be able to work well with data: process it, clean it, select features, and bring it, roughly speaking, to an "object-attribute" matrix. Admittedly, this is mostly grunt work, but everyone does it. You are also expected to solve simple analytical problems: build pivot tables and test hypotheses.
2. Middle. Here it is important to know machine learning, and experience in Kaggle competitions is useful. You need to know mathematics and algorithms very well, and you need practical experience, because production tasks are much harder. It is also essential to be an expert in the domain in which you solve problems, especially in a niche business such as telecom (do you know, for example, what “luxury”, “sell_aydi”, and “market_key” are?).
3. Senior. At this level you must understand how to work with big data: how it is stored and how it is processed. You should be familiar with the Hadoop ecosystem, with the MapReduce computing model, and with user-level frameworks such as Apache Spark and Apache Storm.
4. Advanced. Here you have to understand more of the technical details, have a clear plan for solving the problem, and be able to estimate deadlines. As a rule, you already lead a group of developers. There is less machine learning, but the money the company earns depends directly on the result of your work, so the responsibility is great and a lot is demanded of a person in this position.
Again, I note that this division is quite approximate.
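The MapReduce computing model named at the Senior level can be illustrated in a few lines. The following is only a minimal single-process sketch of the idea (map, shuffle, reduce) using word counting as the task; it is not how Hadoop or Spark actually execute jobs, and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, as is the toy input.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Toy input invented for illustration only.
    docs = ["big data is big", "data science is practice"]
    print(word_count(docs))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is why understanding storage and processing matters at this level.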
The answer again depends on the person's individual qualities and on the company they work for. If we take average values, then following the levels described above, the picture is roughly as follows:
1. Beginner: from 80 to 150 thousand rubles
2. Middle: from 100 to 200 thousand rubles
3. Senior: from 150 to 250 thousand rubles
4. Advanced: from 200 to XXX thousand rubles
And, of course, as was rightly noted in the comments to this question: “In the regions, everything is much more modest.”
This question mostly interests those who have not yet worked in the field. Most of the work is routine that recurs every day: consistent, careful work with data, testing of various hypotheses, and data visualization. The machine learning tasks themselves are solved at the very end. Nevertheless, most participants in the discussion agreed that, on the whole, the data analysis process in one way or another follows the CRISP-DM methodology (CRoss Industry Standard Process for Data Mining). Its six phases (business understanding, data understanding, data preparation, modeling, evaluation, and deployment) can be briefly explained with a picture that speaks for itself:
Of course, in practice there are deviations from this process. But almost everyone who does data analysis is, one way or another, working at one of the stages shown in the figure.
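The data preparation part of this routine often comes down to reshaping raw records into an "object-attribute" matrix or a pivot table. The following is a minimal pure-Python sketch of that step, assuming records arrive as dictionaries; the helper `pivot_sum`, the field names `user`, `channel`, and `amount`, and all values are made up for illustration.

```python
from collections import defaultdict


def pivot_sum(records, index, column, value):
    """Pivot records into a dense matrix: one row per `index` value,
    one column per `column` value, each cell holding the sum of `value`."""
    cells = defaultdict(float)
    for record in records:
        cells[(record[index], record[column])] += record[value]
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    # Row/column combinations with no records are filled with 0.0.
    matrix = [[cells.get((r, c), 0.0) for c in cols] for r in rows]
    return rows, cols, matrix


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    records = [
        {"user": "u1", "channel": "sms", "amount": 2.0},
        {"user": "u1", "channel": "voice", "amount": 5.0},
        {"user": "u2", "channel": "sms", "amount": 1.0},
        {"user": "u1", "channel": "sms", "amount": 3.0},
    ]
    rows, cols, matrix = pivot_sum(records, "user", "channel", "amount")
    print(rows, cols, matrix)
```

Each row of the resulting matrix describes one object (here, a user) and each column one attribute, which is the shape most machine learning algorithms expect as input. In practice a library such as pandas would usually do this job.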
So, these were the most common questions about data analysis that people have been asking lately. I am very pleased that interest in Big Data and Data Science in general is growing every day, and that more and more people are getting education and skills in this area. I will keep doing everything I can to support this!
In conclusion, I would like to wish success to all who are still at the beginning of this journey!