
Who works in machine learning, and what is popular in Data Science today? Results of the Kaggle user survey

Hi, Habr! In August 2017, Kaggle, the machine learning competition platform, surveyed more than 16,000 respondents to assess the state of data analysis and machine learning. The results were made public, so we decided to analyze how Data Science in Russia differs from the rest of the world, what a typical Kaggle user looks like in Russia and worldwide, and, finally, which algorithms and frameworks are most popular.



A total of 16,716 respondents from 171 countries took part in the survey, 578 of them from Russia.

Who uses Kaggle?


What does a typical machine learning practitioner look like? First, let's look at the age distribution of Kaggle users worldwide and in Russia:


As the histograms show, most Kaggle users are between 20 and 40 years old, although there is also a considerable number of older users, which is admirable considering how young and dynamic this field is. In Russia, the people interested in machine learning are slightly younger than in the rest of the world: the median age of Russian respondents is 28, versus 30 worldwide.

Now let's take a look at the gender structure of the users of the platform:

As expected, men still make up the majority of those working in data analysis, both worldwide and in Russia.

How much do Kaggle users earn?


Data analysis is attracting more and more people today. This is not surprising: besides its innovative nature and the huge number of interesting problems, the field boasts some of the highest salaries not only among IT professions but across the entire labor market. Let's check whether this is true:



First, the mode of the distribution is the interval from 0 to 15 thousand dollars a year, meaning that salaries in this range occur most often: 869 people out of 4,351 (those who reported their earnings), or about 20%. The median is 54 thousand dollars a year, higher than in most countries of the world. For comparison, the official average salary in the USA in 2016 was 44 thousand dollars a year. Finally, some people earn far more than the bulk of respondents: the maximum salary in the sample is 699 thousand dollars a year!
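The mode, the share of the modal interval, and the median can be computed in a few lines of Python. This is only a minimal sketch, not the analysis behind the article: the `salary_summary` helper and the toy sample are hypothetical, and only the counts 869 and 4,351 are taken from the text.

```python
from collections import Counter
from statistics import median


def salary_summary(salaries, bin_width=15_000):
    """Return (modal_bin, share_in_modal_bin, median) for yearly salaries in USD."""
    # Assign each salary to a fixed-width interval, e.g. 0-15k, 15-30k, ...
    bins = Counter(int(s // bin_width) for s in salaries)
    modal_index, modal_count = bins.most_common(1)[0]
    modal_bin = (modal_index * bin_width, (modal_index + 1) * bin_width)
    return modal_bin, modal_count / len(salaries), median(salaries)


# Made-up toy sample for illustration only, NOT the survey data.
toy = [5_000, 9_000, 12_000, 40_000, 54_000, 70_000, 110_000]
modal_bin, share, med = salary_summary(toy)

# Share of the modal interval quoted in the article: 869 of 4,351 respondents.
quoted_share = 869 / 4351
print(modal_bin, round(share, 2), med, round(quoted_share, 3))
```

The median is used alongside the mode because a few very high salaries, such as the 699 thousand dollar maximum, would pull a mean upward.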

Apparently, data specialists around the world can count on quite decent pay. How does Russia compare with other countries?



The boxplot is a great way to compare distributions. It clearly shows that Russia still cannot compete with Germany or the USA in machine learning salaries: the median in Russia is $17,500, or 1.05 million rubles, per year, while in Germany and the USA it is 72 and 107 thousand dollars, respectively. The country whose data analysis salaries are comparable to Russia's is India. It does not help that of the 976 people earning more than 100 thousand dollars a year, only 3 are from Russia, while India has 11.
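To put the quoted medians in perspective, the short sketch below derives the exchange rate implied by the two Russian figures and the ratio of each country's median to the Russian one. It uses only numbers stated in the text; the variable names are illustrative.

```python
# Median yearly salaries quoted in the article, in USD.
median_usd = {"Russia": 17_500, "Germany": 72_000, "USA": 107_000}
russia_rub = 1_050_000  # the same Russian median expressed in rubles

# Exchange rate implied by the two Russian figures (rubles per dollar).
implied_rate = russia_rub / median_usd["Russia"]

# How many times each country's median exceeds the Russian median.
ratios = {country: usd / median_usd["Russia"] for country, usd in median_usd.items()}

print(implied_rate)
for country, ratio in ratios.items():
    print(country, round(ratio, 1))
```

Under these figures, the German median is roughly four times the Russian one and the US median roughly six times.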

It is also interesting to see which positions Kaggle users hold in their companies:



As expected, the two most common occupations on Kaggle, both worldwide and in Russia, are Data Scientist and Software Engineer. However, while across Kaggle as a whole there are 40% more data scientists than developers, in Russia the situation is reversed, which is somewhat unusual: software engineers taking part in machine learning competitions are at least as numerous as data scientists.

It is worth noting that salary by position correlates with this "popularity rating": representatives of the most popular positions on Kaggle surpass their colleagues not only in number but also in pay. Among all survey participants, data scientists earn much more than developers, but in Russia it is the other way around: despite their ever-growing popularity, Russian data scientists still cannot boast the same pay as their foreign counterparts.

Next, the background and education level of Kagglers:



It is not surprising that the top is occupied by people with a background in Computer Science, mathematics and statistics, as well as those educated in engineering and physics. As for the education level of the platform's users, most of them have a master's degree, followed by bachelor's degree holders, and a considerable share hold a doctorate.

By the way, how does the level of education affect wages in data analysis?

Thus, a master's degree does not give a significant advantage over a bachelor's degree, while PhDs can expect a higher salary on average. This is an excellent incentive for those who want to pursue science but are unsure whether to go for a doctorate or to start building a career right after completing a master's degree.

What algorithms and tools are most popular in data analysis now?


Let's start with the algorithms:



The classics are alive and still in fashion: linear and logistic regression are the methods Kaggle competitors use most often in their work. They are followed by decision trees, random forests and neural networks. The main difference between Russia and the rest of the world here is that gradient boosting is much more popular in Russia, whereas abroad people prefer SVMs or Bayesian classifiers instead. Share your explanation of why this is so in the comments.
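A ranking like this comes from a multiple-choice question, where each respondent may select several algorithms. The sketch below shows one way to tally such answers, assuming each answer is stored as a single comma-separated string. That storage format, the `rank_choices` helper, and the option labels in the toy data are assumptions for illustration, not details confirmed by the article.

```python
from collections import Counter


def rank_choices(responses):
    """Count how often each option appears across multi-select answers."""
    counts = Counter()
    for answer in responses:
        if not answer:  # skip respondents who left the question blank
            continue
        # One respondent may name several options, separated by commas.
        counts.update(
            option.strip() for option in answer.split(",") if option.strip()
        )
    return counts.most_common()


# Hypothetical toy answers, NOT the survey data.
toy_answers = [
    "Regression/Logistic Regression,Decision Trees",
    "Regression/Logistic Regression,Neural Networks",
    "Decision Trees,Regression/Logistic Regression",
    None,
]
ranking = rank_choices(toy_answers)
for option, count in ranking:
    print(option, count)
```

Because one respondent can pick several options, the counts add up to more than the number of respondents, so shares are best reported per option rather than as parts of a whole.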

We now turn to the toolkit:



Today it is impossible to imagine a good data scientist without a command of Python and SQL, and the survey results confirm this. Besides these two, Jupyter Notebooks, the TensorFlow deep learning library, and R make up the top 5 by popularity. By the way, R is not as popular in Russia as it is abroad.

Finally, let's see which areas of data analysis are most popular in industry:



As expected, the most popular area both worldwide and in Russia is supervised learning. Time series analysis, unsupervised learning, natural language processing and outlier detection are also widely used in industry, whereas reinforcement learning, much discussed recently, has not yet found wide practical application.

The data can be downloaded here, and several other reports can be found here.

Of course, everyone interested in data analysis will sooner or later want to take part in Kaggle machine learning competitions. That is why we at Newprolab give participants of the Big Data Specialist program the opportunity to try their hand at such competitions, and both projects in the program allow this. In the first, participants compete to best predict the gender and age category of Internet users from nothing but their browsing logs. In the second, they try to maximize the quality of a recommender system as measured by the NDCG metric. Enrollment for the eighth cohort is already underway, and early birds get a 15% discount. All program information is here.
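For readers unfamiliar with NDCG (normalized discounted cumulative gain), here is a minimal sketch of one common formulation, with linear gain and a log2 position discount. The exact variant used in the program's project is not stated in the article, so the function names and formula choice below are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: items lower in the list count for less."""
    # Position 0 is divided by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ranked = list(relevances)[:k]  # k=None keeps the whole list
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance of recommended items in the order they were shown (toy values).
perfect = ndcg([3, 2, 1])   # most relevant items first
reversed_ = ndcg([1, 2, 3])  # most relevant items last
print(perfect, round(reversed_, 2))
```

A score of 1 means the recommendations are ordered exactly as an ideal ranking would order them, and lower scores mean relevant items were placed too far down the list.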

Source: https://habr.com/ru/post/346824/

