Machine learning specialization on Coursera from MIPT and Yandex

At the beginning of the year, Coursera launched a machine learning course from Yandex and HSE, which we have already written about. By launch, 14,000 people had signed up for it. An hour after it opened, users created a Slack channel and began discussing the program. The course now has 21,000 learners.

On February 9, enrollment opened on the platform for a machine learning specialization, this time developed by our specialists jointly with MIPT. It is designed to help learners ease into the topic.

The specialization "Machine Learning and Data Analysis" consists of five courses and work on a project of your own. Training lasts several months. You can sign up until February 19. If you miss that date, you can sign up for the second cohort starting March 14.

The authors of the courses are Yandex employees, Yandex Data Factory specialists who teach at MIPT. Konstantin Vorontsov is also among them. We asked some of our colleagues to explain who the specialization might be useful for and why it is needed. Below you will also find the program of all the courses.

Victor Kantor is a senior lecturer in the Department of Algorithms and Programming Technologies at FIVT MIPT and head of the user data analysis group at Yandex Data Factory. He gives lectures and seminars at the Moscow Institute of Physics and Technology in the departments of Algorithms and Programming Technologies, Data Analysis, and Banking Information Technologies, and has also taught in the departments of Computational Linguistics and of Image Recognition and Text Processing.

In our specialization, we addressed the problems that we most often see in the training of data analysis specialists.
It gives you the necessary knowledge of Python and data analysis libraries right away, so that theory does not drift away from practice later on.
We refresh the mathematics you will need up front, to avoid hand-waving like: "Oh, these are matrices. It does not matter that you do not remember what you can do with them, the computer will multiply them for you anyway." We want you to understand the methods we describe.
We cover the methods that are often used in practice, not the ones we simply felt like talking about at length.
We will teach you how to draw correct conclusions from data using statistics and how to avoid common mistakes.
We will work through many applied problems, and through these examples you will learn how to apply everything you have learned.

Evgeny Ryabenko is a leading analyst at Yandex Data Factory, holds a PhD in physics and mathematics, and is an associate professor at MIPT. He lectures on applied statistics at the VMK faculty of MSU and the FUPM faculty of MIPT. He is also a lecturer at the Yandex School of Data Analysis.

The HSE course and our specialization differ not only in the pace of teaching, but also in the topics covered. Konstantin Vyacheslavovich's course is devoted to machine learning. This is a fairly young scientific field, but over the years of its existence a certain academic canon of teaching has already formed: first the simplest methods are explained, then more complex methods are built on top of them, and somewhere near the end we reach the state-of-the-art techniques that give really high-quality results in applied problems. Roughly speaking, machine learning is taught the way mathematical analysis is taught.

In our specialization, we are trying to give a broader and more complete picture of data science, in which machine learning is one of the most important components, but not the only one. There is no canonical body of data science topics today, but as practitioners, my colleagues and I have some idea of the things one has to deal with in applied problems one way or another, and those are exactly what we want to cover. For example, we will have a separate course devoted to techniques for designing experiments for data collection and methods for interpreting modeling results; this is where statistics is applied. As for machine learning itself, our specialization expands the range of topics covered in the HSE course and pays great attention, for example, to unsupervised learning, which also includes many important problem statements that are actively used in industry: clustering, anomaly detection, extracting structure from texts. Some important topics, such as ensembles of algorithms, will be covered in greater detail, in keeping with their practical significance.

We see applied problems as the starting point of all the training. We will consider the most important problem statements that occur most often in data science, regardless of the specific application area. Problems such as building recommender systems or forecasting time series can be solved with different machine learning methods; sometimes some of them perform better, sometimes others do. We want to teach students to see how such problems are reduced to mathematical formulations, which analysis methods are worth trying, and how to choose the best one in the end.

Evgeny Sokolov is head of the unstructured data analysis group at Yandex Data Factory. In 2013 he graduated from Moscow State University, where he is currently writing a dissertation on matrix decompositions. He leads practical classes in machine learning and lectures at the HSE Faculty of Computer Science. He is also a lecturer at the Yandex School of Data Analysis.

When the HSE machine learning course was launched, it became clear to us that many people need a smoother immersion in the subject. The course turned out to be difficult for many, because the single-course format made it very dense. Some people complained about too much complex mathematics or about the need to know Python well. A specialization consists of several courses and lets you make the learning curve gentler. The first course helps people get started and teaches Python and the necessary mathematics (so that no one is afraid of the words "derivative" and "vector"). The part where we talk about core machine learning consists of two courses. In addition, the specialization format allowed us to cover other areas of data analysis that are needed in practice. There is also one big project and additional courses.

Emeli Dral is a leading analyst at Yandex Data Factory. She graduated from the Faculty of Physics, Mathematics and Natural Sciences of RUDN University, Department of Information Technology. She has developed educational materials and taught courses such as "Technologies for developing software systems", "Object-oriented approach to the development of software systems", and "Methods of intelligent search". At MIPT she leads seminars for the "Machine Learning" course at FIVT, in the Department of Algorithms and Programming Technologies.

The specialization and the course differ in the problems they solve. I really like the HSE course; it is quite fundamental. It gives formal mathematical statements of the problems and describes how the algorithms are constructed and the mathematics behind them. This course, in my opinion, suits a fairly well-prepared learner who is not just going to use machine learning algorithms, but also wants to understand how they work. To do this, you need a solid command of the relevant mathematics.

The specialization gives us the opportunity to cover even simple questions before moving on to complex ones, which helps both those who have no theoretical knowledge or practical experience and those who have forgotten something. We review useful facts from linear algebra, mathematical analysis and statistics, and, for example, talk about hypothesis testing. Many people forget these things because they studied them long ago and never used them in practice. Our pace is slower, but the entry threshold is lower as well.

In addition, the material in the specialization is presented a little differently. We try to make sure that everything we use is intuitive.

Course 1. Mathematics and Python for data analysis

In this course, you will become familiar with the fundamental mathematical concepts needed for data analysis and gain basic programming skills in Python. The course consists of two large parts. The first part is practical and is dedicated to the Python programming language. You will get acquainted with the syntax and philosophy of the language and learn how to write simple programs. You will also learn about libraries that are often used in practice for data analysis, such as NumPy, SciPy, Matplotlib and Pandas. The second part of the course is devoted to linear algebra, mathematical analysis, optimization methods and probability theory. The emphasis is on explaining mathematical concepts and applying them in practice, not on deriving complex formulas or proving theorems.

Course 2. Learning from labeled data

Our focus will be on classification and regression algorithms that are applied successfully in practice: linear models, neural networks, decision trees, and so on. We will place particular emphasis on a powerful technique, the construction of ensembles (compositions of algorithms), which can significantly improve on the quality of individual algorithms and is widely used in solving applied problems. In particular, we will learn about random forests and the gradient boosting method.

Building predictive algorithms is only part of the work in solving a data analysis problem. We will also deal with the other stages: evaluating how well algorithms generalize, selecting model parameters, and choosing and computing quality metrics.
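As an illustration of what evaluating generalization can look like, here is a minimal sketch of k-fold cross-validation in plain Python. It is not taken from the course materials: the function names `k_fold_indices` and `cross_validated_mse` are hypothetical, the "model" is just a baseline that predicts the mean of the training targets, the folds are contiguous with no shuffling, and `n_folds` is assumed to be at least 2 and no larger than the number of samples.

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validated_mse(ys, n_folds=5):
    """Average held-out squared error of a 'predict the training mean' baseline."""
    fold_errors = []
    for train, test in k_fold_indices(len(ys), n_folds):
        # "Fit" on the training part only, then score on the held-out part.
        prediction = mean(ys[i] for i in train)
        fold_errors.append(mean((ys[i] - prediction) ** 2 for i in test))
    return mean(fold_errors)
```

The point of the sketch is that the error is always measured on samples the model did not see during fitting; the same loop could wrap any other fit-and-predict pair.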

Course 3. Finding structure in data

In this course, you will learn about data clustering algorithms, which can be used, for example, to find groups of similar customers of a mobile operator. You will learn how to build matrix decompositions and solve topic modeling problems, reduce the dimensionality of data, detect anomalies, and visualize multidimensional data.

Course 4. Drawing conclusions from data

Does knowledge of data analysis methods affect salary? Does the bank's credit scoring system work? Is the new banner really better than the old one? To answer such questions, you need to collect data. Data almost always contains noise, so the statements that can be made on its basis are not certainly true, but true only with some probability. Statistical methods help you draw the most accurate conclusions and quantify the degree of confidence in them.

How can one estimate the unknown parameters of a system from a small number of observations? How can the accuracy of such estimates be measured? What data is needed to answer your question, and what questions can be answered using existing data? You will learn everything you need to turn data into conclusions successfully: the design of experiments, A/B testing, general-purpose methods for estimating parameters and testing hypotheses, correlations, and causal relationships.
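To make the banner question concrete, here is a minimal sketch of one standard way to compare two click-through rates, a two-sided two-proportion z-test. It is an illustration rather than the course's own method: the function name and the click and view counts are hypothetical, and the test assumes independent views and samples large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: both banners have the same click rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under H0 the best estimate of the shared rate pools both samples.
    # Assumes 0 < pooled < 1, otherwise the standard error is zero.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: old banner A versus new banner B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value says that a difference this large would be unlikely if the banners performed equally well; it does not by itself say how large the real improvement is.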

Course 5. Applied data analysis tasks

In this course, we will work through applied problems from various areas of data analysis: text analysis and information retrieval, collaborative filtering and recommender systems, business analytics, and time series forecasting. Using these examples, you will learn how to extract features from heterogeneous data, what problems arise along the way, and how to solve them. You will learn how to reduce a customer's task to a formal machine learning problem statement and how to check the quality of the resulting model both on historical data and in an online experiment. For each problem, we study the pros and cons of the machine learning algorithms involved.

After completing this course, you will be familiar with common types of applied problems and will understand the typical patterns for solving them.

Data Analysis: Final Project

Unlike assignments built on toy data, work on a real-life project will give you the opportunity to go through all the stages of data analysis, from data preparation to building the final model and assessing its quality. As a result, you will have a project of your own that you can use in practice and continue to develop independently.

The ultimate goal of our specialization is to enable the learner to pass an interview for a Data Scientist position at a level that matches their professional experience. You will master data science and learn how to solve analytical problems using its methods, from collecting data to building an optimal model and assessing its quality. More details and enrollment are on the Coursera specialization page.

Source: https://habr.com/ru/post/277427/
