
OpenDataScience and Mail.Ru Group release open machine learning course materials and announce a new launch

Recently, OpenDataScience and Mail.Ru Group ran an open machine learning course. A lot was said about the course in the previous announcement. In this article we share the course materials and announce a new launch.



UPD: the course is now in English under the brand mlcourse.ai, with articles on Medium and materials on Kaggle (Dataset) and on GitHub.


For those who can't wait: the new launch of the course is on February 1. Registration is not needed, but so that we remember you and can invite you personally, fill out the form. The course consists of a series of articles on Habré (Primary data analysis with Pandas is the first one), supplementary lectures on the YouTube channel, reproducible materials (Jupyter notebooks in the course's GitHub repositories), homework assignments, Kaggle Inclass competitions, tutorials, and individual data analysis projects. The main news will appear in the VKontakte group, and life during the course will happen in the OpenDataScience Slack (join) in the #mlcourse_ai channel.


Article outline



How is the course different from others?


1. Not for beginners


Often you will be told that nothing is required of you and that in a couple of months you will become an expert in data analysis. I still remember Andrew Ng's phrase from his basic Machine Learning course: "You don't have to know what a derivative is, and now you'll understand how optimization algorithms work in machine learning." Or "you are almost a data analysis expert," etc. With all the immense respect for the professor, this is hard marketing and clickbait. You will not understand optimization without knowing derivatives and the foundations of calculus and linear algebra! Most likely you will not become even a middle Data Scientist after completing a couple of courses (including ours). It will not be easy, and more than half of you will drop out within about 3-4 weeks. If you want to do machine learning but are not ready to immerse yourself in math and programming, to see the beauty of machine learning in formulas, and to get results by writing tens and hundreds of lines of code, this course is not for you. But we hope you are still here.


Given the above, we state the entry threshold: knowledge of higher mathematics at a basic (but solid) level and a command of the basics of Python. How to prepare, if you do not have this yet, is described in detail in the VKontakte group and here under the spoiler just below. In principle, you can complete the course without the mathematics, but then see the picture below. How much mathematics a Data Scientist needs to know is, of course, a holy war, but here we side with Andrej Karpathy: Yes you should understand backprop. In general, doing Data Science without mathematics is a bit like sorting with bubble sort: the problem can be solved, but it can be done better, faster, and smarter. And without mathematics it is, of course, impossible to reach the state of the art, which is very interesting to follow.


Math and Python

Maths


  1. For a quick route, you can go through the notes from the Yandex and MIPT specialization on Coursera (shared with permission).
  2. If you approach the matter thoroughly, a single link to MIT OpenCourseWare is enough. In Russian, a great source is the wiki page of courses at the HSE Faculty of Computer Science. Personally, I would take the second-year MIPT program and work through the main problem book: a minimum of theory and a lot of practice.
  3. And of course, nothing can replace good books (here you can also look at the ShAD program):

  • Mathematical analysis - Kudryavtsev;
  • Linear algebra - Kostrikin;
  • Optimization - Boyd (English);
  • Probability Theory and Statistics - Kibzun.

Python


  1. The quick version: browser tutorials à la Codecademy, DataCamp, and Dataquest; I can also point to my repository here.
  2. The thorough version: for example, the Coursera course or MIT's "Introduction to Computer Science and Programming Using Python".
  3. Advanced level: the course from the St. Petersburg Computer Science Center.

2. Theory and practice


There are a lot of machine learning courses out there, and there are really cool ones (such as the "Machine Learning and Data Analysis" specialization), but many fall into one of two extremes: either too much theory (PhD guy) or, conversely, practice without understanding the basics (data monkey).



We are looking for the optimal balance: there is a lot of theory in the articles on Habré (the fourth article, on linear models, is indicative), and we try to present it as clearly as possible, even more accessibly in the lectures. But there is also a sea of practice: homework, 4 Kaggle competitions, projects... and that's not all.


3. Live communication


What most courses lack is live communication. Beginners sometimes need just one short piece of advice to get off the ground and save hours, or even dozens of hours. Coursera forums usually die off at some point. The uniqueness of our course is active communication and an atmosphere of mutual support. In the OpenDataScience Slack, you will get help with any question while taking the course; the chat lives and thrives, its own humor emerges, somebody trolls somebody... But the main thing is that the authors of the homework and the articles are in the same chat and always ready to help.


4. Kaggle in action



From the VKontakte public "Memes about machine learning for grown men".


Kaggle competitions are a great way to quickly put data analysis into practice. People usually start participating after completing a basic machine learning course (as a rule, Andrew Ng's course; the author is certainly charismatic and presents well, but the course is by now very outdated). During the course you will be invited to participate in as many as 4 competitions; 2 of them are part of the homework, where you just need to reach a certain model score, and the other 2 are full-fledged competitions where you have to create (engineer features, choose models) and outrun your comrades.


5. Free


This is also an important factor. On the current wave of machine learning's spread, you will find quite a few courses offering to train you for a very tidy sum. Here everything is free and, without false modesty, at a very decent level.


Course materials


Here we briefly describe the 10 topics of the course: what they are devoted to, why a basic machine learning course cannot do without them, and what new things we bring.


Topic 1. Primary data analysis with Pandas. Article on Habré



You may want to jump straight into machine learning and see the math in action. But 70-80% of the time spent on a real project is wrangling the data, and this is where Pandas shines; I use it almost every day in my work. This article describes the basic Pandas methods for primary data analysis. Then we analyze a dataset on telecom operator customer churn and try to "predict" churn without any training at all, relying on common sense alone. This approach should by no means be underestimated.
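To give a flavor, here is a minimal sketch of such a first look at data (the file name and column names mimic the course's churn dataset and are assumptions here, not the actual course notebook):

```python
import pandas as pd

# "telecom_churn.csv" and the column names below are assumed for illustration
df = pd.read_csv("telecom_churn.csv")

print(df.shape)       # rows and columns
print(df.head())      # first 5 rows
df.info()             # column types and missing values
print(df.describe())  # summary statistics for numeric columns

# Share of churned clients
print(df["Churn"].value_counts(normalize=True))

# A common-sense "prediction": churn rate vs. number of service calls
print(df.groupby("Customer service calls")["Churn"].mean())
```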


Topic 2. Visual data analysis with Python. Article on Habré



The role of visual data analysis is hard to overestimate: this is how new features are created and how patterns and insights are found in the data. K. V. Vorontsov gives an example of how, thanks to visualization, it was guessed that during boosting the classes continue to "move apart" as trees are added, and only then was this fact proved theoretically. In the lecture we consider the main types of plots that are usually built for feature analysis. We will also discuss how to peek into multidimensional space with the t-SNE algorithm, which sometimes helps to draw Christmas decorations like these.
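As a quick illustration (not the course notebook itself), here is how one might project the classic 64-dimensional handwritten digits dataset onto a plane with scikit-learn's t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Project the 64-dimensional digit images onto a plane
digits = load_digits()
embedding = TSNE(n_components=2, random_state=17).fit_transform(digits.data)

plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target,
            cmap="tab10", s=5)
plt.colorbar()
plt.show()
```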


Topic 3. Classification, decision trees, and the nearest neighbors method.
Article on Habré



Here we start talking about machine learning and about two simple approaches to the classification problem. Again, in a real project you should start with the simplest approaches, and it is decision trees and the nearest neighbors method (as well as linear models, the next topic) that you should try first, right after heuristics. We will touch on the important question of model quality assessment and cross-validation, and discuss in detail the pros and cons of trees and of the nearest neighbors method. The article is long, but decision trees deserve special attention: it is on their basis that random forest and boosting are built, the algorithms you will probably use most of all in practice.
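A minimal sketch of comparing these two baselines with cross-validation in scikit-learn (on a toy built-in dataset, not the course data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn = KNeighborsClassifier(n_neighbors=10)

# 5-fold cross-validation gives an honest quality estimate for both models
print("Tree:", cross_val_score(tree, X, y, cv=5).mean())
print("kNN: ", cross_val_score(knn, X, y, cv=5).mean())
```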


Topic 4. Linear classification and regression models.
Article on Habré



This article will be the size of a small brochure, and for good reason: linear models are the most widely used approach to prediction in practice. This article is like our course in miniature: a lot of theory and a lot of practice. We will discuss the theoretical background of the least squares method and of logistic regression, as well as the practical advantages of linear models. Note that there will be no excessive theorizing: the approach to linear models in machine learning differs from the statistical and econometric ones. In practice we will apply logistic regression to the very real task of identifying a user by the sequence of sites they visited. After the fourth homework many people will drop out, but if you make it through, you will already have a very good idea of what algorithms are used in production systems.
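To illustrate the idea (with made-up toy sessions, not the actual competition data), a bag-of-sites plus logistic regression pipeline might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sessions: each string is a sequence of visited sites; the label
# marks whether the session belongs to the user we want to identify
sessions = [
    "vk.com mail.ru vk.com habr.com",
    "github.com stackoverflow.com github.com",
    "vk.com vk.com mail.ru",
    "stackoverflow.com github.com python.org",
]
labels = [1, 0, 1, 0]

# token_pattern keeps whole site names as tokens ("bag of sites")
model = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                      LogisticRegression())
model.fit(sessions, labels)
print(model.predict_proba(["mail.ru vk.com vk.com"])[:, 1])
```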


Topic 5. Ensembles: bagging, random forest. Article on Habré



Here again both the theory and the practice are interesting. We will discuss why "the wisdom of the crowd" works for machine learning models and why many models work better than one, even the best one. In practice we will try out random forest (an ensemble of many decision trees), which is worth reaching for when you don't know which algorithm to choose. We will discuss in detail the numerous advantages of random forest and its areas of application. And, as always, not without flaws: there are still situations where linear models will work better and faster.
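A minimal scikit-learn sketch of this baseline (again on a toy built-in dataset, not the course data):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# A strong out-of-the-box baseline: many trees, predictions averaged
forest = RandomForestClassifier(n_estimators=100, random_state=17)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# Feature importances come almost for free and help with feature selection
forest.fit(X, y)
print(forest.feature_importances_.round(2))
```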


Topic 6. Feature engineering and feature selection. Applications in text, image, and geodata tasks. An article on Habré, a lecture about regression and regularization.



Here the plan of the articles and lectures diverges a little (the only time it does): the fourth topic, linear models, is just too big. The article describes the main approaches to extracting, transforming, and constructing features for machine learning models. Feature engineering is, in general, the most creative part of a Data Scientist's work. And of course it is important to know how to work with various kinds of data (texts, images, geodata), not just with a ready-made Pandas data frame.


The lecture will again discuss linear models, as well as the basic technique for adjusting the complexity of ML models: regularization. The book "Deep Learning" even quotes one well-known colleague (I'm too lazy to dig up the link) who claims that "all of machine learning is essentially regularization". This is an exaggeration, of course, but in practice, for models to work well they must be tuned, and that means using regularization correctly.
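A small sketch of regularization in action (synthetic data, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=17)

# Increasing alpha strengthens L1 regularization and shrinks the weights;
# Lasso zeroes some of them out, effectively selecting features
for alpha in [0.01, 1, 100]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(alpha, "non-zero weights:", np.sum(lasso.coef_ != 0))
```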


Topic 7. Unsupervised learning: PCA, clustering. Article on Habré



Here we move on to the extensive topic of unsupervised learning: when there is data but no target variable that we would like to predict. Unlabeled data like this is a dime a dozen, and we must be able to extract value from it. We will discuss only 2 types of tasks: clustering and dimensionality reduction. In your homework you will analyze data from mobile phone accelerometers and gyroscopes and try to cluster the phone carriers, identifying types of activity.
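A minimal sketch of both task types on a toy dataset (the course homework uses real accelerometer data instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Dimensionality reduction: 64 features -> 2 principal components
pca = PCA(n_components=2, random_state=17)
X_2d = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_.sum().round(2))

# Clustering: group the objects without looking at the labels at all
kmeans = KMeans(n_clusters=10, n_init=10, random_state=17)
clusters = kmeans.fit_predict(X_2d)
print(clusters[:20])
```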


Topic 8. Learning on gigabytes of data with Vowpal Wabbit. Article on Habré



The theory here is an analysis of stochastic gradient descent: it is this optimization method that has made it possible to successfully train both neural networks and linear models on large training samples. We will also discuss what to do when there are too many features (the trick of hashing feature values) and move on to Vowpal Wabbit, a utility that lets you train a model on gigabytes of data in minutes, sometimes even with acceptable quality. We will consider many applications to different tasks, including the classification of short texts and the categorization of StackOverflow questions. For now, the translation of this particular article (in the form of a Kaggle Kernel) serves as an example of how we will present the material in English on Medium.
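Vowpal Wabbit itself is a command-line utility, but the same two ideas, the hashing trick plus SGD, can be sketched in scikit-learn (toy texts with assumed tags, not the actual StackOverflow data):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy StackOverflow-like questions with assumed tags
texts = ["how to merge two dicts in python",
         "segmentation fault with pointer arithmetic",
         "pandas groupby aggregate multiple columns",
         "undefined behavior with signed integer overflow"]
labels = ["python", "c", "python", "c"]

# The hashing trick: map arbitrarily many raw features into a fixed-size
# vector, so no vocabulary has to be stored
vectorizer = HashingVectorizer(n_features=2 ** 20)
X = vectorizer.transform(texts)

# Logistic loss + SGD: one cheap update per object, so training scales
# to samples that do not fit in memory (VW does the same, only faster)
clf = SGDClassifier(loss="log_loss", random_state=17)  # "log" in older sklearn
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["list comprehension vs map"])))
```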


Topic 9. Time series analysis using Python. Article on Habré



Here we discuss various methods of working with time series: which data preparation stages the models need, and how to obtain short-term and long-term forecasts. We walk through various types of models, from simple moving averages to gradient boosting. We also look at ways to search for anomalies in time series and discuss the advantages and disadvantages of these methods.
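A minimal sketch of the simplest of these models, a moving-average forecast, on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: trend + weekly seasonality + noise
rng = np.random.RandomState(17)
idx = pd.date_range("2018-02-05", periods=120, freq="D")
series = pd.Series(0.5 * np.arange(120)
                   + 10 * np.sin(2 * np.pi * np.arange(120) / 7)
                   + rng.normal(0, 2, 120), index=idx)

# Simple moving average: the mean of the last k observations
# serves as the forecast for the next point
window = 7
print("next-day forecast:", series.rolling(window).mean().iloc[-1].round(2))
```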


Topic 10. Gradient boosting. Article on Habré



And where would we be without gradient boosting... It is Matrixnet (the Yandex search engine), Catboost (the new generation of boosting at Yandex), and the Mail.Ru search engine. Boosting solves all three main supervised learning tasks: classification, regression, and ranking. One is tempted to call it the best algorithm, and that is close to the truth, although there is no such thing as a best algorithm. But if your data is not too big (it fits in RAM), you have not too many features (up to several thousand), and the features are heterogeneous (categorical, quantitative, binary, etc.), then, as Kaggle competition experience shows, gradient boosting will almost certainly perform best in your task. It is no accident that so many cool implementations have appeared: Xgboost, LightGBM, Catboost, H2O...


Again, we will not limit ourselves to a "how to run Xgboost" manual; we will examine the theory of boosting in detail and then look at it in practice, and in the lecture we will get to Catboost. The task here will be to beat the baseline in the competition, which gives a good sense of the methods used in many practical problems.
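For illustration, a minimal boosting sketch with scikit-learn's implementation (the course works with Xgboost and Catboost; this is just the same idea on a toy dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17)

# Each new tree is fit to the errors (gradients) of the current ensemble;
# a small learning_rate plus more trees usually generalizes better
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, random_state=17)
gbm.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]))
```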


More about the new launch


The course starts on February 5, 2018. During the course there will be:



How to join the course?


Formal registration is not necessary: just do the homework and take part in the competitions, and we will include you in the rating. However, do fill out this survey; the e-mail you leave will be your ID during the course, and it lets us remind you about the launch closer to the date.


Sites for discussion



Good luck! Finally, I want to say that everything will work out; the main thing is not to give up! You just skimmed over that "not to give up" and most likely did not even notice it. But think about it: this is the main thing.




Source: https://habr.com/ru/post/344044/

