Machine learning: a Yandex course for those who want to spend the New Year holidays productively
The New Year holidays are a good time not only for rest but also for self-education. You can take your mind off everyday tasks and devote a few days to learning something new that will serve you for the whole year ahead (or maybe more than one). So this weekend we decided to publish a series of posts with lectures from the first-semester courses of the School of Data Analysis.
Today's post is about the most important of them: machine learning, without which modern data analysis is impossible to imagine. The course covers the main problems of learning from examples: classification, clustering, regression, and dimensionality reduction. It studies methods for solving them, both classical ones and methods developed in the last 10–15 years. The emphasis is on a deep understanding of the mathematical foundations, the relationships between the methods, and their advantages and limitations. Selected theorems are presented with proofs.
The lectures are given by Konstantin Vyacheslavovich Vorontsov, senior researcher at the Computing Centre of the Russian Academy of Sciences, Deputy Director for Science at ZAO Forexis, deputy head of the Intelligent Systems department at the Faculty of Control and Applied Mathematics (FUPM) of MIPT, associate professor of the Mathematical Methods of Forecasting department of the Moscow Institute of Physics and Technology, an expert at Yandex, and Doctor of Physical and Mathematical Sciences.
Basic concepts: the model of algorithms, the learning method, the loss function and the quality functional, the empirical risk minimization principle, generalization ability, and cross-validation.
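To make the empirical risk functional concrete, here is a minimal sketch (not from the course materials; the threshold classifier, the zero-one loss, and the toy data are illustrative assumptions): the quality of an algorithm on a training sample is measured as its average loss over the sample's objects.

```python
import numpy as np

def empirical_risk(predict, loss, X, y):
    """Average loss of a predictor over a finite training sample."""
    return np.mean([loss(predict(x), yi) for x, yi in zip(X, y)])

# Toy example: zero-one loss for a hand-made threshold classifier.
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, 1])
risk = empirical_risk(lambda x: 1 if x > 0 else -1,
                      lambda a, b: int(a != b), X, y)
print(risk)  # fraction of misclassified training objects; here 0.0
```

Empirical risk minimization then means choosing, within the model of algorithms, the one that minimizes this quantity.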
Probabilistic formulation of the classification problem. Basic concepts: prior probability, posterior probability, the likelihood function of a class.
The expected (average) risk functional. Type I and type II errors.
Optimal Bayes classifier.
Distribution density estimation: three main approaches.
Naive Bayes classifier.
Non-parametric Parzen–Rosenblatt estimation of the distribution density (a code sketch follows below). Choosing the kernel function. Choosing the window width; variable window width. The Parzen window method.
Non-parametric naive Bayes classifier.
Robust density estimation. Censoring the sample (filtering out outlier objects).
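As an illustration of the Parzen–Rosenblatt estimate mentioned above, here is a minimal sketch assuming a Gaussian kernel and a hand-picked window width h (both choices are illustrative, not the course's prescription):

```python
import numpy as np

def parzen_density(x, sample, h):
    """Parzen-Rosenblatt estimate of a one-dimensional density at point x.

    sample: 1-D array of observed values
    h:      window width (bandwidth), chosen by hand here
    """
    # Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    u = (x - sample) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h

sample = np.random.normal(loc=0.0, scale=1.0, size=200)
print(parzen_density(0.0, sample, h=0.4))   # close to 1/sqrt(2*pi) ≈ 0.399
```

Plugging such class-wise density estimates into the Bayes decision rule gives the non-parametric naive Bayes classifier from the list above.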
The multivariate normal distribution: geometric interpretation; sample estimates of its parameters, the mean vector and the covariance matrix.
The quadratic discriminant. The form of the separating surface. The plug-in algorithm, its drawbacks, and ways to eliminate them (a sketch follows below).
Fisher linear discriminant.
The problems of multicollinearity and overfitting. Regularization of the covariance matrix.
Dimensionality reduction.
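Below is a hedged sketch of a plug-in Gaussian classifier with a regularized covariance matrix; the shrinkage-toward-identity scheme and the coefficient tau are illustrative assumptions, not the exact regularization discussed in the lectures.

```python
import numpy as np

def fit_gaussian_classes(X, y, tau=0.1):
    """Sample estimates of per-class parameters for a plug-in Gaussian classifier.

    tau shrinks each covariance matrix toward the identity, which guards
    against multicollinearity (an illustrative regularization scheme).
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + tau * np.eye(X.shape[1])
        params[c] = (mu, cov, len(Xc) / len(X))
    return params

def classify(x, params):
    """Assign x to the class with the largest Gaussian log-density plus log prior."""
    def score(mu, cov, prior):
        diff = x - mu
        return (-0.5 * diff @ np.linalg.solve(cov, diff)
                - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
    return max(params, key=lambda c: score(*params[c]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(classify(np.array([2.5, 2.5]), fit_gaussian_classes(X, y)))   # expected: 1
```

With class-specific covariance matrices the separating surface is quadratic; forcing a shared covariance matrix would make it linear, as in Fisher's discriminant.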
Distribution mixture model.
The EM algorithm: the basic idea, the concept of hidden variables, the E-step and the M-step. Constructive derivation of the M-step formulas (without a convergence proof).
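Here is a constructive sketch of the EM algorithm for a one-dimensional mixture of Gaussians; the initialization and the fixed number of iterations are illustrative simplifications, and, matching the derivation above, no convergence analysis is attempted.

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100):
    """EM for a 1-D mixture of k Gaussians (sketch, no convergence checks).

    Hidden variables g[i, j]: posterior probability that point i came from component j.
    """
    w = np.full(k, 1.0 / k)                       # mixture weights
    mu = np.random.choice(x, k, replace=False)    # initial means
    sigma = np.full(k, x.std())                   # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibilities of the components for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        g = w * dens
        g /= g.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nj = g.sum(axis=0)
        w = nj / len(x)
        mu = (g * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((g * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return w, mu, sigma

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gaussian_mixture(x, k=2))
```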
The quadratic loss function, the least squares method, and its connection with Fisher's linear discriminant.
The stochastic gradient method and its special cases: the ADALINE adaptive linear element, the Rosenblatt perceptron, and the Hebb rule (a sketch follows below).
Drawbacks of the stochastic gradient method and ways to eliminate them. Accelerating convergence, "shaking" the weights out of local minima. The overfitting problem; weight decay.
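A minimal sketch of the stochastic gradient method with a quadratic loss, essentially the ADALINE rule mentioned above, with weight decay against overfitting; the learning rate, the decay coefficient, and the toy data are illustrative.

```python
import numpy as np

def sgd_adaline(X, y, lr=0.01, decay=1e-4, n_epochs=50):
    """Stochastic gradient training of a linear classifier with quadratic loss.

    Weight decay shrinks the weights at each step to counter overfitting.
    X: objects (n x d); y: answers in {-1, +1}.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):        # visit objects in random order
            error = X[i] @ w - y[i]              # ADALINE: gradient of (w.x - y)^2 / 2
            w = (1 - lr * decay) * w - lr * error * X[i]
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(sgd_adaline(X, y))
```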
The hypothesis that the class likelihood functions belong to the exponential family.
The theorem on the linearity of the optimal Bayesian classifier.
Estimating posterior class probabilities with the sigmoid activation function.
Logistic regression. Maximum likelihood principle and logarithmic loss function.
The stochastic gradient method and its analogy with the Hebb rule.
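A minimal sketch of logistic regression trained by stochastic gradient on the logarithmic loss, with the sigmoid used to estimate posterior class probabilities; the learning rate and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=200):
    """Logistic regression trained by stochastic gradient on the log loss.

    X: objects (n x d); y: class labels in {0, 1}.
    Returns weights w; sigmoid(X @ w) estimates posterior class probabilities.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            p = sigmoid(X[i] @ w)           # estimated P(y = 1 | x)
            w -= lr * (p - y[i]) * X[i]     # gradient of the log loss for one object
    return w

X = np.array([[1.0, 0.5], [2.0, 1.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w = sgd_logistic_regression(X, y)
print(sigmoid(X @ w))                        # posterior probabilities of class 1
```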
The optimal separating hyperplane. The concept of the margin between classes. The linearly separable and linearly non-separable cases.
Relationship to the minimization of regularized empirical risk. Piecewise linear loss function.
The problem of quadratic programming and the dual problem. The concept of support vectors.
Recommendations for choosing the constant C.
Kernel functions, the feature space induced by a kernel, Mercer's theorem.
Constructive ways of building kernels. Examples of kernels.
Comparison of an SVM with a Gaussian kernel and an RBF network.
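The course does not tie the material to any particular library; as a hedged illustration, here is how an SVM with a Gaussian (RBF) kernel might be evaluated by cross-validation using scikit-learn (the values of C and gamma are arbitrary and would normally be tuned):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

# Two-class toy data that is not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# SVM with a Gaussian (RBF) kernel; C and gamma are illustrative values.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
print(cross_val_score(clf, X, y, cv=5).mean())
```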
The structure of a multilayer neural network. Activation functions.
The completeness problem. The exclusive-or (XOR) problem. Completeness of two-layer networks in the space of Boolean functions.
The error backpropagation algorithm. Choosing the initial approximation. The problem of network paralysis (a sketch follows below).
Methods for optimizing the network structure. Choosing the number of layers and the number of neurons in the hidden layer. Gradually growing the network. Optimal pruning of the network (optimal brain damage).
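A minimal sketch of a two-layer network with sigmoid activations trained by error backpropagation on the XOR problem mentioned above; the hidden-layer size, learning rate, and random initial approximation are illustrative choices.

```python
import numpy as np

def train_xor_network(n_hidden=3, lr=1.0, n_epochs=5000, seed=0):
    """Two-layer sigmoid network trained by backpropagation on XOR."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    W1 = rng.normal(scale=0.5, size=(2, n_hidden))   # small random initial weights
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))   # help avoid network paralysis
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(n_epochs):
        h = sigmoid(X @ W1)                  # forward pass: hidden layer
        out = sigmoid(h @ W2)                # forward pass: output layer
        d_out = (out - y) * out * (1 - out)  # backward pass: output error
        d_h = (d_out @ W2.T) * h * (1 - h)   # error propagated to the hidden layer
        W2 -= lr * h.T @ d_out
        W1 -= lr * X.T @ d_h
    return sigmoid(sigmoid(X @ W1) @ W2)

# Should approach [0, 1, 1, 0]; the result depends on the random initial approximation.
print(train_xor_network().round(2))
```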
Problems and criteria for selecting a learning method: the problem of choosing a model or a learning method, empirical cross-validation estimates, analytical estimates, and regularization criteria.
The theory of generalization ability: the probability of overfitting and VC theory, Occam's razor, and the combinatorial theory of overfitting.
Feature selection methods: exhaustive search and greedy algorithms, depth-first and breadth-first search, and stochastic search.
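Finally, a hedged sketch combining a cross-validation quality estimate with greedy forward feature selection; the fit/score interface and the nearest-centroid toy classifier are assumptions made purely for illustration.

```python
import numpy as np

def cv_score(X, y, features, fit, score, n_folds=5):
    """Cross-validation estimate of quality using only the given feature subset."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[np.ix_(train, features)], y[train])
        scores.append(score(model, X[np.ix_(test, features)], y[test]))
    return np.mean(scores)

def greedy_forward_selection(X, y, fit, score):
    """Greedily add one feature at a time while the cross-validation score improves."""
    selected, best = [], -np.inf
    candidates = list(range(X.shape[1]))
    while candidates:
        trials = {f: cv_score(X, y, selected + [f], fit, score) for f in candidates}
        f_best = max(trials, key=trials.get)
        if trials[f_best] <= best:
            break                      # no candidate improves the estimate; stop
        best = trials[f_best]
        selected.append(f_best)
        candidates.remove(f_best)
    return selected, best

# Toy usage with a nearest-centroid classifier (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # only features 0 and 2 are informative

def fit(Xtr, ytr):
    return {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}

def score(model, Xte, yte):
    pred = [min(model, key=lambda c: np.linalg.norm(x - model[c])) for x in Xte]
    return np.mean(np.array(pred) == yte)

print(greedy_forward_selection(X, y, fit, score))
```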