
About machine learning, history and life with Dmitry Vetrov



As part of our open course on machine learning, we continue talking with prominent representatives of the field. Our first guests were Alexander Dyakonov, Konstantin Vorontsov, and Evgeny Sokolov — see the videos on the course's YouTube channel. This time we talked with Dmitry Vetrov.



Good day! Our guest today is Dmitry Petrovich Vetrov — research professor at the HSE Faculty of Computer Science and head of the Bayesian Methods Research Group. Dmitry Petrovich also continues to teach courses at the Department of Mathematical Methods of Forecasting at VMK MSU (the Faculty of Computational Mathematics and Cybernetics of Moscow State University), where he himself once studied. So it happens, perhaps by chance, that we have already talked with Alexander Dyakonov, Konstantin Vorontsov, and Evgeny Sokolov, and now you are the fourth representative of the Department of Mathematical Methods of Forecasting of VMK MSU. Tell us, how was it when you were all there together? Did you work together on any project? How has life scattered you, and do you still keep in touch with your colleagues?



Of course, we keep in touch actively. We all came from the same scientific school — that of Academician Zhuravlev, who founded the Department of Mathematical Methods of Forecasting at VMK in 1997. Konstantin Vyacheslavovich, Alexander Gennadyevich, Evgeny Andreevich, and your humble servant all taught there, and some still do. As for joint research — I think that was my main mistake. I spent five years as the department's scientific secretary, handling its day-to-day management, and in all that time we never worked on a single research project together. It seemed there was plenty of time, that we would still get around to it, but it turned out that life began to scatter us. Konstantin Vorontsov concentrated on his work at PhysTech, Alexander Dyakonov went into industry, and Evgeny Sokolov and I focused our main efforts on the Faculty of Computer Science of the Higher School of Economics. As a result, we worked in the same department for several years, taught together, talked — but never did any joint research. Now I regret it. I learned a lot from my colleagues, but I could have learned more…



So your collaboration was mostly in teaching, then?



Yes. And even so, despite the fact that life has scattered us, the Department of Mathematical Methods of Forecasting is, in my opinion, the strongest at VMK today, and I try to help it however I can. We still teach courses there: Dyakonov, Sokolov, Vorontsov, and Vetrov. That is, although we have lost our formal affiliation with the department, we continue to take part in its life — though, of course, not as before.



Well, probably, if we are talking about theoretical machine learning, the Department of Mathematical Methods of Forecasting gives the best education today. If we compare it with, say, the FCS, where many courses are more practical than theoretical… Here one can recall Konstantin Vyacheslavovich's article on MachineLearning.ru about how machine learning should be taught in general. My subjective feeling is that the strongest theoretical foundation is given precisely at VMK.



I would not put it so categorically. Today there are many places where machine learning is taught very well: at that same VMK, at the FCS in the "Machine Learning and Applications" specialization, at the Yandex School of Data Analysis, and at PhysTech. It is hard for me to say where the program is more theoretical and where it is more applied.



But you moved from MSU to HSE — what if we compare these two universities in general? HSE is perhaps criticized by many for its Western orientation and, in general, for aligning with the Western scientific system and the Web of Science and Scopus citation databases… Actually, HSE plays a kind of double game: on the one hand, there are many state-commissioned research projects; on the other, a race for publications in the best journals and at the best Western conferences. You seem to manage it all: you publish in top journals and go to the top machine learning conferences. How would you answer this rather philosophical question: how do we catch up with the West? Should we orient ourselves toward their values, toward publications in their journals? Or, if we are forever catching up, will we never overtake them?



Look, first, I would correct you. You keep saying "Western, Western…". This is not Western at all — it has long ceased to be Western; it is the global trend in the development of science. Second, the specificity of how science developed in our country, for both objective and subjective reasons, was that many branches of science were isolated from global trends — and machine learning, unfortunately, was one of them. It seems to me that any form of scientific isolation is harmful to the community that isolates itself from the rest of the world. Therefore, I fully advocate maximum integration — not with the Western community, I repeat, but with the world one. There are many world-class researchers in China, India, Japan… If we want to advance world science and be at its forefront, then of course we need to follow international conferences and journals and, of course, publish there. In my opinion, integration into the global scientific community is what will create the opportunity to reach those advanced positions and perhaps even become leaders in certain areas. It has now become obvious to everyone that the Russian scientific community in machine learning is 10–20 years behind world trends. This is very sad. In effect, it means this area of science must be rebuilt from scratch, and the main reason for the lag was self-isolation from the global scientific community. We have to catch up — there is no choice anyway. And yes, humanity has not yet invented anything better to be guided by than the world's scientific standards of research: strict adherence to the scientific method, competent experiment design, anonymous peer review, continuous reading of scientific papers to stay current, and so on. Any attempt to oppose these standards leads to falling behind and gradual degradation. At the same time, we have our competitive advantages: a high level of mathematical training among applicants and students, and a number of industry initiatives aimed at teaching modern machine learning methods. There are new projects, such as school and student data analysis competitions. These are very good developments that give grounds for cautious optimism. It is a pity that all these undertakings happen not thanks to, but often in spite of, the Russian Academy of Sciences, which, it would seem, was supposed to lead this trend. So I believe that science in the field of artificial intelligence in Russia must be rebuilt from scratch. There are places that will turn you into a competent specialist and a solver of applied problems, but there are practically no places that will turn you into a developer of new machine learning technology. Yet it seems to me that merely reimplementing technologies developed at Google, as many companies do, is boring, and I have a feeling that we can do more.

As for my publishing at leading conferences… I think I publish too little; I am not satisfied with my current publication output. I want to do this much more intensively, and we are actively working on it.



And yet it now often happens that even a scientist's salary and scientific reputation depend on citation counts, in particular in Web of Science and Scopus. It seems this system has the same drawbacks as standardized exams — the same Unified State Exam. Despite the shortcomings, should one still orient oneself toward publications and indexing in citation databases?



Please explain…



It seems to me that the scientific community will soon learn to assess scientists' contributions better somehow — say, with something based on the PageRank algorithm. After all, right now even the context of a citation and its sentiment are not taken into account. Suppose I cite you now, but say that I disagree with what you wrote and that it is all nonsense. Under the current system, that still counts as +1 to the citation count of your articles. What options do you see for improving the system of assessing a researcher's contribution?
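For the curious reader: the idea floated here — ranking by the structure of the citation graph rather than by raw counts — is easy to prototype. Below is a minimal sketch over a tiny, made-up citation graph (an edge a → b means "paper a cites paper b"); PageRank weights each citation by the influence of the citing paper, which is exactly the refinement being suggested.

```python
# Minimal sketch: rank papers by PageRank over a hypothetical citation graph
# instead of raw citation counts. The papers and edges here are made up.
import networkx as nx

citations = [
    ("paper_A", "paper_C"),  # paper_A cites paper_C
    ("paper_B", "paper_C"),
    ("paper_C", "paper_D"),
    ("paper_A", "paper_D"),
]
g = nx.DiGraph(citations)

# A citation from an influential paper counts for more than one from an
# obscure paper -- unlike a plain "+1 per citation" counter.
scores = nx.pagerank(g, alpha=0.85)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```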



Even if you cite me with negative emotions, the very fact of the citation means that my research somehow influenced your work. Citation measures a very simple thing: that what a person has done is needed by someone, that someone uses it — even with negative emotions. That is better than no citations at all. That is the first point. Second, the salary of HSE employees does not depend on citation counts; it is determined by the level of the venue in which your work is published. You can game citations in all sorts of ways — for example, through self-citation. But raising the level of the venue in which your article appears is impossible in principle: not for any money, nor through connections. You cannot "ask" to be published at a leading conference — there you have to break through a strict system of reviewing and selection. Incidentally, determining a scientist's salary by the level of the venues where he publishes is not unique to HSE; it is the same at Moscow State University and at PhysTech. The next question is how to determine which venues count as good and which as bad. A critical question. Any errors here lead to researchers aiming at the wrong target: for example, instead of growing professionally and publishing at ever more prestigious conferences, they start chasing a growing number of publications in junk journals. And I have questions, for example, about the criteria introduced at Moscow State University. They hardly encourage a scientist's professional growth; rather, they encourage imitating it. I can see how the system can be gamed — for example, by making a low-quality publication in order to receive a large bonus. And this happens all the time. The HSE system is much harder to game because it is built around venue ratings, though I admit it is also possible.



If we talk about international conferences at the level of ICML and IJCAI, one of your works with colleagues on Bayesian sparsification of deep networks ("Variational Dropout Sparsifies Deep Neural Networks", arXiv), presented at ICML, received many responses from the scientific community. Could you tell us about it — is it a small gradient step in the development of science, or a revolutionary thing? How will it help the development of deep learning, theoretically and practically? And, more generally, you could talk about Bayesian methods in deep learning. Or in the depths :)



Let's not talk about a revolutionary contribution — truly revolutionary articles can be counted on one's fingers. We took a step in the right direction, a direction that is, in my view, technically important and has significant prospects. This is what our group is trying to do: to cross the Bayesian approach to machine learning with deep neural networks. And the work you mention did indeed arouse some interest in the scientific community. We took the well-known neural network regularization procedure — dropout — and, building on the work of our colleagues from the University of Amsterdam, who showed that dropout can be viewed as a Bayesian procedure, we proposed a generalization of it. It includes ordinary dropout as a special case, but also allows the dropout rates to be tuned automatically using variational Bayesian inference. That is, the probability with which each weight or each neuron is dropped from the network is chosen not by eye and not by cross-validation, but automatically. Once we learned to do this automatically, it became possible to introduce an individual dropout rate for every weight in the network and to optimize the objective over all these parameters. In the end, this procedure leads to amazing results. It turns out that over 99% of the weights can simply be removed from the network (i.e., their dropout rates go to one) while the quality on the test set does not degrade. That is, we retain high generalization ability and low test error, but the network can be compressed 100 or even 200 times.
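To make the mechanism concrete for readers, here is a heavily simplified sketch of such a layer in PyTorch, in the spirit of "Variational Dropout Sparsifies Deep Neural Networks" (Molchanov et al., ICML 2017). This is an illustration, not the authors' reference implementation; the initialization and the pruning threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    """Linear layer with a learned per-weight dropout rate (simplified sketch)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        # One noise parameter per weight: log sigma^2, learned jointly with theta.
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))

    @property
    def log_alpha(self):
        # alpha = sigma^2 / theta^2 plays the role of a per-weight dropout rate;
        # alpha -> infinity corresponds to dropping the weight entirely.
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample pre-activations, not weights.
            mean = F.linear(x, self.theta)
            var = F.linear(x ** 2, torch.exp(self.log_sigma2))
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        # At test time, prune weights whose dropout rate got close to one.
        mask = (self.log_alpha < 3.0).float()
        return F.linear(x, self.theta * mask)

    def kl(self):
        # Polynomial approximation of the KL term from the paper.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

Training would minimize the negative evidence lower bound: the usual data loss (scaled to the size of the training set) plus the sum of `kl()` over all such layers, so the per-weight dropout rates are fitted by the same optimizer as the weights.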



Does this mean the dropout rate can even be chosen analytically?



Not analytically, of course — by ordinary optimization. A rigorously specified objective arises naturally from the Bayesian inference procedure. Our result suggests we are moving in the right direction. It is well known that modern neural networks are highly redundant, but it has been unclear how to eliminate this redundancy. There have been attempts, of course — for example, simply taking a smaller network — but quality suffered. So for now the more correct way seems to be: take a redundant neural network, train it, and then eliminate the redundancy using the Bayesian dropout procedure.
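Continuing the hypothetical sketch above: after training, "eliminating redundancy" amounts to thresholding the learned log-alpha values, and the fraction of dropped weights gives the compression factor mentioned earlier.

```python
# Post-training step for the sketch above: the share of weights whose
# learned dropout rate passed the threshold determines the compression.
def compression_factor(layer, threshold=3.0):
    dropped = (layer.log_alpha >= threshold).float().mean().item()
    return 1.0 / max(1.0 - dropped, 1e-12)  # 99% dropped => ~100x compression
```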



I see. But a more general question: how do you see the prospects of Bayesian methods in relation to deep learning? What problems might arise here?



Modern deep neural networks are trained, in essence, by maximum likelihood — a method known from statistics to be optimal under certain conditions. The problem is that the situation arising when training deep networks does not satisfy the conditions that guarantee the optimality of maximum likelihood. The conditions are very simple: the number of training examples used to fit the parameters of the learning algorithm must be much larger than the number of those parameters. In modern deep networks this is not the case, so maximum likelihood can still be applied — but at your own risk and without any guarantees. It turns out that in such a situation, when the number of weights is comparable to or even exceeds the size of the training set, Bayesian statistics comes to replace the frequentist approach with its classical estimation methods. Bayesian methods can be used for any sample size, down to zero. It can be shown that as the sample size relative to the number of estimated parameters tends to infinity, the Bayesian approach converges to maximum likelihood. That is, the classical and Bayesian approaches do not contradict each other; on the contrary, Bayesian statistics can be viewed as a generalization of the classical approach to a wider class of problems. Applying the Bayesian approach to deep learning gives a neural network a number of additional advantages.
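Before turning to those advantages, it is worth writing out the standard relationship being invoked here — this is textbook Bayesian statistics, not anything specific to the group's work:

```latex
p(w \mid D) = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w')\, p(w')\, dw'},
\qquad
\hat{w}_{\mathrm{ML}} = \arg\max_{w} \sum_{i=1}^{N} \log p(x_i \mid w)
```

As the number of examples N grows relative to the number of parameters, the likelihood term dominates the prior and the posterior p(w | D) concentrates around the maximum likelihood point — the limiting correspondence described above.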



First, it becomes possible to work with missing data — that is, when the values of some features are missing for some examples in the training set. The most appropriate tool for such a situation is a Bayesian probabilistic model.



Second, training a Bayesian neural network can and should be viewed as inferring a distribution over a space of different networks, to which the ensembling technique can be applied. That is, we are able to average the predictions of many neural networks drawn from the posterior distribution over weights. Such an ensemble, in full accordance with Bayesian statistics, improves quality relative to using a single (even the best) neural network.
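A minimal sketch of that averaging, under the assumption that you already have some way of drawing a network from the approximate posterior (the `sample_model` callable below is a hypothetical stand-in for it):

```python
import torch

def posterior_predictive(x, sample_model, n_samples=10):
    """Monte Carlo estimate of E_{w ~ p(w|D)}[ p(y | x, w) ]."""
    with torch.no_grad():
        # Each draw is a different network from the posterior; averaging their
        # predictive distributions is the Bayesian ensemble described above.
        probs = [sample_model()(x).softmax(dim=-1) for _ in range(n_samples)]
    return torch.stack(probs).mean(dim=0)
```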



Third, Bayesian neural networks are far more resistant to overfitting. Overfitting is now one of the most acute problems in machine learning, and publications from 2016–17 show that modern neural architectures are catastrophically prone to it. Bayesian neural networks, in contrast, practically do not overfit. It is especially noteworthy how our ideas about regularization change as Bayesian methods develop. Classical regularization simply adds an extra term, the regularizer, to the optimized functional — for example, the norm of the tunable parameters. The regularizer shifts the optimum and partially helps cope with overfitting. Now we understand that regularization can (and should) be done differently: by adding noise to the optimization process itself, which prevents stochastic optimization methods from converging to the exact optimum. The most successful regularization methods today — dropout and batch normalization, for example — work exactly this way. This is not adding a regularizer to the loss function, but a controlled injection of noise into the problem. It is a completely different view of regularizing machine learning algorithms! But what should the intensity of this noise be, and where should it be added? These questions can be answered correctly by applying stochastic variational inference in a Bayesian model of the neural network.
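The two views of regularization contrasted here can be put side by side in a toy sketch (a hypothetical illustration, not a recipe from the interview):

```python
import torch
import torch.nn.functional as F

def loss_additive(w, x, y, lam=1e-4):
    # Classical view: shift the optimum with an explicit penalty term (L2 here).
    return F.mse_loss(x @ w, y) + lam * (w ** 2).sum()

def loss_noise_injected(w, x, y, p=0.5):
    # The newer view: no explicit penalty; dropout-style multiplicative noise
    # (with inverted scaling) is injected into the computation itself, so
    # stochastic optimization cannot settle into a sharp optimum.
    mask = (torch.rand_like(x) > p).float() / (1 - p)
    return F.mse_loss((x * mask) @ w, y)
```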



Fourth, potential robustness to so-called adversarial attacks, where examples are artificially crafted to mislead a neural network. One network can be fooled, and ten networks can be fooled, but it is not so easy to fool the continuum of neural networks obtained via Bayesian inference during training. I think the combination of neural networks and the Bayesian approach is extremely promising: there is beautiful mathematics, astonishing effects, and good practical results. For now our Bayesian toolkit is insufficient for performing Bayesian inference efficiently, but the required scalable methods of approximate Bayesian inference are being actively developed around the world.
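For concreteness, the simplest such attack — the fast gradient sign method of Goodfellow et al. — fits in a few lines; `model` here is any differentiable classifier (this is a standard textbook example, not code from Vetrov's group):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    """One-step adversarial perturbation: move x in the direction that
    most increases the loss; often enough to flip the prediction."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```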



But just a clarification: is it true that dropout can be viewed as passing to a distribution over neural networks, so that the result of training with dropout is an ensemble of networks?



Yes. In its original formulation, dropout also leads to an ensemble of neural networks, but it is unclear where that ensemble comes from. If, however, we reformulate dropout in terms of Bayesian inference, everything falls into place: it becomes clear how to set it up and how to select the dropout rate automatically. Moreover, we immediately gain a number of ways to generalize and modify the original dropout model.



Can Bayesian methods offer some understanding of what happens when training neural networks? In particular, configuring a network's hyperparameters is currently a heuristic procedure: by trial and error we somehow figure out that in one situation BatchNorm should be added, and in another the dropout rate should be lowered a little. That is, we are still far from a theoretical understanding of how the numerous hyperparameters affect the training of neural networks. Can Bayesian methods offer a new perspective?



Let me clarify: is this a question about our understanding of how neural networks make decisions, or of how they solve an optimization problem? That is an important distinction.



First, what the hyperparameters are responsible for and how they affect training — that is our first gap in understanding. Second, are any theoretical guarantees on generalization error possible in the case of neural networks? As far as I know, computational learning theory still applies to perceptrons and networks with one hidden layer, but it is powerless as soon as we turn to deep networks. In particular, those same adversarial attacks show that we still understand poorly how neural networks are capable of generalizing. Literally one pixel is changed, and the network now says the penguin is not a penguin but a tractor. That is a disaster, if you think about it — even despite the excellent results of convolutional networks on ImageNet. Can Bayesian methods offer something here?



Many questions — let's take them in order. I have already spoken about robustness to adversarial examples: Bayesian neural networks are more robust to such attacks, although problems remain. The cause of the problem is, in fact, clear. All adversarial examples are extremely atypical with respect to the underlying population (to which the network is fitted, Bayesian or not). The fact that we see no visual difference from the original image does not mean no difference exists. And on atypical objects, the answer of any machine learning algorithm can be arbitrary. From this a way to combat adversarial examples logically follows — but that is a completely different story…



As for statistical learning theory and guarantees on generalization ability: the situation today is that the theory's results do not transfer to modern neural networks. Everyone understands this, so experts in statistical learning theory are actively working to make new methods applicable to deep networks. I hope we will see this in the coming years. Can network architecture be determined using Bayesian methods? Hypothetically, yes; in practice, the first steps are now being taken around the world. Bayesian sparsification can also be viewed as a way of choosing network architecture. To answer this question more fully, new tools are needed; in particular, other regularization techniques — batch normalization, for example — need to be translated into Bayesian language. The need for this is obvious, and so is the desire. Such work is underway, but success has not yet been achieved. I hope it is a matter of time.



And, in fact, the main advantage of the Bayesian approach is automatic tuning of hyperparameters. The more procedures for building neural networks we move onto Bayesian rails, the more opportunities appear for automatically selecting the network topology. As for the last question — why a neural network makes one decision or another — that is a question we can hardly expect a comprehensive answer to in the near future. From my point of view, one of the most promising techniques for understanding what happens inside neural networks is what is called the attention mechanism. Part of this mechanism is also built on Bayesian principles, but these methods are still fairly raw. I hope that in the near future we will reach a level at which it becomes clear what is happening with neural networks. Nevertheless, a number of indirect experiments, including some carried out in our group, indicate that the computer understands the meaning of the data far better than is commonly believed. In some cases you can get a computer to express its understanding in human language. I will talk about one such model, and the incredible effects we have observed in it, at my next public appearance. I think this may be one of the ways to understand a neural network's logic: it should generate an explanation of why it made a particular decision.



Okay. Do Bayesian methods draw any inspiration from observations of the human brain? In particular, far from all neural connections in our brain are active at any time, and this could have motivated the dropout technique. Do you know of cases where research in neurophysiology served as a source of new ideas in Bayesian statistics?



Well, first, let me immediately dispel the popular misconception that artificial neural networks supposedly simulate the workings of the human brain. No. This is not true. They have nothing to do with the human brain. More precisely, early on, when artificial neural networks first appeared, they were associated with the human brain, but by now we understand far more both in machine learning and in neurophysiology, and we can safely say these are different mechanisms. An artificial neural network is a model in its own right that has no more in common with the biological brain than, say, a decision tree. On the other hand, there are many psychological studies showing that the human brain works largely on Bayesian principles. I am not ready to comment on this in detail, but such a view exists.



Well, let me steer the conversation to a different area. At school I studied, of course, mathematics, physics, and other sciences, and quickly realized that formulas stuck in my head instantly, once and for all. If I once learned what momentum is, I never again needed to recall whether momentum is mass times velocity or velocity squared. And we had amazing history lecturers both at school and at university. At PhysTech, for example, it could happen that a specialty lecture — computer architecture, say — drew fifteen people, while the history lecture right after it packed the entire hall. The reason, of course, was the lecturer: he was also a brilliant actor, and people practically brought popcorn to his lectures — each one was a performance. But, unfortunately, historical information stuck with me very poorly: in one ear and out the other. I must have gone through both Russian and world history three times, straight from the Rurikids to the Romanovs, but it all evaporated instantly. I know that you lecture on history, both at Moscow State University and at the FCS. Tell us, how did you realize that you could study both history and applied mathematics — that these two worlds could coexist in your head? And how do you sustain your interest in history now?



[The remainder of the interview did not survive the extraction of the original Russian post; only scattered fragments remain. Judging by those fragments, the conversation went on to cover Vetrov's history lectures; a question about whether history moves in cycles, with a proposed 12-year cycle in 20th-century Russia (1905, 1917, 1929, 1941, 1953, 1965, …); HSE's Data Culture project; citation databases such as Web of Science; a project involving Samsung; and, in closing, recommended reading — Christopher Bishop's "Pattern Recognition and Machine Learning" and Kevin Murphy's "Machine Learning: A Probabilistic Perspective".]



P.S. Dmitry Petrovich is open to your questions here in the comments.




Source: https://habr.com/ru/post/350806/


