Machine learning: a Yandex course for those who want to spend the New Year holidays productively
The New Year holidays are a good time not only for rest but also for self-education. You can take your mind off everyday tasks and devote a few days to learning something new that will serve you for the whole year ahead (or maybe more than one). So this weekend we decided to publish a series of posts with lectures from the first-semester courses of the School of Data Analysis.
Today's post is about the most important of them: machine learning, without which modern data analysis is impossible to imagine. The course covers the main problems of learning from examples: classification, clustering, regression, and dimensionality reduction. It studies methods for solving them, both classical ones and methods developed in the last 10–15 years. The emphasis is on a deep understanding of the mathematical foundations, the relationships between the methods, and their advantages and limitations. Selected theorems are presented with proofs.
The lectures are given by Konstantin Vyacheslavovich Vorontsov, senior researcher at the Computing Centre of the Russian Academy of Sciences, Deputy Director for Science at ZAO Forexis, deputy head of the Intelligent Systems department at the Faculty of Control and Applied Mathematics (FUPM) of MIPT, associate professor of the Mathematical Methods of Forecasting department of the Moscow Institute of Physics and Technology, an expert at Yandex, and Doctor of Physical and Mathematical Sciences.
Basic concepts: the model of algorithms, the learning method, the loss function and the quality functional, the empirical risk minimization principle, generalization ability, and cross-validation.
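To make the empirical risk functional concrete, here is a minimal sketch (not from the course materials; the threshold classifier, the zero-one loss, and the toy data are illustrative assumptions): the quality of an algorithm on a training sample is measured as its average loss over the sample's objects.

```python
import numpy as np

def empirical_risk(predict, loss, X, y):
    """Average loss of a predictor over a finite training sample."""
    return np.mean([loss(predict(x), yi) for x, yi in zip(X, y)])

# Toy example: zero-one loss for a hand-made threshold classifier.
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, 1])
risk = empirical_risk(lambda x: 1 if x > 0 else -1,
                      lambda a, b: int(a != b), X, y)
print(risk)  # fraction of misclassified training objects; here 0.0
```

Empirical risk minimization then means choosing, within the model of algorithms, the one that minimizes this quantity.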
Probabilistic formulation of the classification problem. Basic concepts: prior probability, posterior probability, the likelihood function of a class.
The expected (average) risk functional. Type I and type II errors.
Optimal Bayes classifier.
Distribution density estimation: three main approaches.
Naive Bayes classifier.
Non-parametric Parzen–Rosenblatt estimation of the distribution density (a code sketch follows below). Choosing the kernel function. Choosing the window width; variable window width. The Parzen window method.
Non-parametric naive Bayes classifier.
Robust density estimation. Censoring the sample (filtering out outlier objects).
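As an illustration of the Parzen–Rosenblatt estimate mentioned above, here is a minimal sketch assuming a Gaussian kernel and a hand-picked window width h (both choices are illustrative, not the course's prescription):

```python
import numpy as np

def parzen_density(x, sample, h):
    """Parzen-Rosenblatt estimate of a one-dimensional density at point x.

    sample: 1-D array of observed values
    h:      window width (bandwidth), chosen by hand here
    """
    # Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    u = (x - sample) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h

sample = np.random.normal(loc=0.0, scale=1.0, size=200)
print(parzen_density(0.0, sample, h=0.4))   # close to 1/sqrt(2*pi) ≈ 0.399
```

Plugging such class-wise density estimates into the Bayes decision rule gives the non-parametric naive Bayes classifier from the list above.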
The multivariate normal distribution: geometric interpretation; sample estimates of its parameters, the mean vector and the covariance matrix.
The quadratic discriminant. The form of the separating surface. The plug-in algorithm, its drawbacks, and ways to eliminate them (a sketch follows below).
Fisher linear discriminant.
The problems of multicollinearity and overfitting. Regularization of the covariance matrix.
Dimensionality reduction.
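Below is a hedged sketch of a plug-in Gaussian classifier with a regularized covariance matrix; the shrinkage-toward-identity scheme and the coefficient tau are illustrative assumptions, not the exact regularization discussed in the lectures.

```python
import numpy as np

def fit_gaussian_classes(X, y, tau=0.1):
    """Sample estimates of per-class parameters for a plug-in Gaussian classifier.

    tau shrinks each covariance matrix toward the identity, which guards
    against multicollinearity (an illustrative regularization scheme).
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + tau * np.eye(X.shape[1])
        params[c] = (mu, cov, len(Xc) / len(X))
    return params

def classify(x, params):
    """Assign x to the class with the largest Gaussian log-density plus log prior."""
    def score(mu, cov, prior):
        diff = x - mu
        return (-0.5 * diff @ np.linalg.solve(cov, diff)
                - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
    return max(params, key=lambda c: score(*params[c]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(classify(np.array([2.5, 2.5]), fit_gaussian_classes(X, y)))   # expected: 1
```

With class-specific covariance matrices the separating surface is quadratic; forcing a shared covariance matrix would make it linear, as in Fisher's discriminant.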
Distribution mixture model.
The EM algorithm: the basic idea, the concept of hidden variables, the E-step and the M-step. Constructive derivation of the M-step formulas (without a convergence proof).
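Here is a constructive sketch of the EM algorithm for a one-dimensional mixture of Gaussians; the initialization and the fixed number of iterations are illustrative simplifications, and, matching the derivation above, no convergence analysis is attempted.

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100):
    """EM for a 1-D mixture of k Gaussians (sketch, no convergence checks).

    Hidden variables g[i, j]: posterior probability that point i came from component j.
    """
    w = np.full(k, 1.0 / k)                       # mixture weights
    mu = np.random.choice(x, k, replace=False)    # initial means
    sigma = np.full(k, x.std())                   # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibilities of the components for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        g = w * dens
        g /= g.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nj = g.sum(axis=0)
        w = nj / len(x)
        mu = (g * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((g * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return w, mu, sigma

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gaussian_mixture(x, k=2))
```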
The quadratic loss function, the least squares method, and its connection with Fisher's linear discriminant.
The stochastic gradient method and its special cases: the ADALINE adaptive linear element, the Rosenblatt perceptron, and the Hebb rule (a sketch follows below).
Drawbacks of the stochastic gradient method and ways to eliminate them. Accelerating convergence, "shaking" the weights out of local minima. The overfitting problem; weight decay.
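A minimal sketch of the stochastic gradient method with a quadratic loss, essentially the ADALINE rule mentioned above, with weight decay against overfitting; the learning rate, the decay coefficient, and the toy data are illustrative.

```python
import numpy as np

def sgd_adaline(X, y, lr=0.01, decay=1e-4, n_epochs=50):
    """Stochastic gradient training of a linear classifier with quadratic loss.

    Weight decay shrinks the weights at each step to counter overfitting.
    X: objects (n x d); y: answers in {-1, +1}.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):        # visit objects in random order
            error = X[i] @ w - y[i]              # ADALINE: gradient of (w.x - y)^2 / 2
            w = (1 - lr * decay) * w - lr * error * X[i]
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(sgd_adaline(X, y))
```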
The hypothesis that the class likelihood functions belong to the exponential family.
The theorem on the linearity of the optimal Bayesian classifier.
Estimating posterior class probabilities with the sigmoid activation function.
Logistic regression. Maximum likelihood principle and logarithmic loss function.
The stochastic gradient method and its analogy with the Hebb rule.
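A minimal sketch of logistic regression trained by stochastic gradient on the logarithmic loss, with the sigmoid used to estimate posterior class probabilities; the learning rate and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=200):
    """Logistic regression trained by stochastic gradient on the log loss.

    X: objects (n x d); y: class labels in {0, 1}.
    Returns weights w; sigmoid(X @ w) estimates posterior class probabilities.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            p = sigmoid(X[i] @ w)           # estimated P(y = 1 | x)
            w -= lr * (p - y[i]) * X[i]     # gradient of the log loss for one object
    return w

X = np.array([[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w = sgd_logistic_regression(X, y)
print(sigmoid(X @ w))                        # posterior probabilities of class 1
```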
The optimal separating hyperplane. The concept of the margin between classes. The linearly separable and linearly non-separable cases.
Relationship to the minimization of regularized empirical risk. Piecewise linear loss function.
The problem of quadratic programming and the dual problem. The concept of support vectors.
Recommendations for choosing the constant C.
Kernel functions, the feature space induced by a kernel, Mercer's theorem.
Constructive ways of building kernels. Examples of kernels.
Comparison of an SVM with a Gaussian kernel and an RBF network.
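The course does not tie the material to any particular library; as a hedged illustration, here is how an SVM with a Gaussian (RBF) kernel might be evaluated by cross-validation using scikit-learn (the values of C and gamma are arbitrary and would normally be tuned):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

# Two-class toy data that is not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# SVM with a Gaussian (RBF) kernel; C and gamma are illustrative values.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
print(cross_val_score(clf, X, y, cv=5).mean())
```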
The structure of a multilayer neural network. Activation functions.
The completeness problem. The exclusive-or (XOR) problem. Completeness of two-layer networks in the space of Boolean functions.
The error backpropagation algorithm. Choosing the initial approximation. The problem of network paralysis (a sketch follows below).
Methods for optimizing the network structure. Choosing the number of layers and the number of neurons in the hidden layer. Gradually growing the network. Optimal pruning of the network (optimal brain damage).
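A minimal sketch of a two-layer network with sigmoid activations trained by error backpropagation on the XOR problem mentioned above; the hidden-layer size, learning rate, and random initial approximation are illustrative choices.

```python
import numpy as np

def train_xor_network(n_hidden=3, lr=1.0, n_epochs=5000, seed=0):
    """Two-layer sigmoid network trained by backpropagation on XOR."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    W1 = rng.normal(scale=0.5, size=(2, n_hidden))   # small random initial weights
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))   # help avoid network paralysis
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(n_epochs):
        h = sigmoid(X @ W1)                  # forward pass: hidden layer
        out = sigmoid(h @ W2)                # forward pass: output layer
        d_out = (out - y) * out * (1 - out)  # backward pass: output error
        d_h = (d_out @ W2.T) * h * (1 - h)   # error propagated to the hidden layer
        W2 -= lr * h.T @ d_out
        W1 -= lr * X.T @ d_h
    return sigmoid(sigmoid(X @ W1) @ W2)

# Should approach [0, 1, 1, 0]; the result depends on the random initial approximation.
print(train_xor_network().round(2))
```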
Problems and criteria for selecting a learning method: the problem of choosing a model or a learning method, empirical cross-validation estimates, analytical estimates, and regularization criteria.
The theory of generalization ability: the probability of overfitting and VC theory, Occam's razor, and the combinatorial theory of overfitting.
Feature selection methods: exhaustive search and greedy algorithms, depth-first and breadth-first search, and stochastic search.
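Finally, a hedged sketch combining a cross-validation quality estimate with greedy forward feature selection; the fit/score interface and the nearest-centroid toy classifier are assumptions made purely for illustration.

```python
import numpy as np

def cv_score(X, y, features, fit, score, n_folds=5):
    """Cross-validation estimate of quality using only the given feature subset."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[np.ix_(train, features)], y[train])
        scores.append(score(model, X[np.ix_(test, features)], y[test]))
    return np.mean(scores)

def greedy_forward_selection(X, y, fit, score):
    """Greedily add one feature at a time while the cross-validation score improves."""
    selected, best = [], -np.inf
    candidates = list(range(X.shape[1]))
    while candidates:
        trials = {f: cv_score(X, y, selected + [f], fit, score) for f in candidates}
        f_best = max(trials, key=trials.get)
        if trials[f_best] <= best:
            break                      # no candidate improves the estimate; stop
        best = trials[f_best]
        selected.append(f_best)
        candidates.remove(f_best)
    return selected, best

# Toy usage with a nearest-centroid classifier (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # only features 0 and 2 are informative

def fit(Xtr, ytr):
    return {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}

def score(model, Xte, yte):
    pred = [min(model, key=lambda c: np.linalg.norm(x - model[c])) for x in Xte]
    return np.mean(np.array(pred) == yte)

print(greedy_forward_selection(X, y, fit, score))
```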