Hi, Habr!
We don't often decide to publish translations of texts that are two years old, contain no code, and are clearly academic in focus, but today we are making an exception. We hope that the dilemma in the title of the article concerns many of our readers, and that you have either already read the original work on evolution strategies that this post argues with, or will read it now. Welcome under the cut!

In March 2017, OpenAI created a stir in the deep learning community by publishing the paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." The paper presented impressive results suggesting that reinforcement learning (RL) is not the only game in town, and that when training complex neural networks it is worth trying other methods. A discussion then flared up over how important reinforcement learning really is and whether it deserves the status of a "must-have" technology for teaching machines to solve problems. Here I want to argue that these two technologies should not be treated as competitors, one of which is definitely better than the other; on the contrary, they ultimately complement each other. Indeed, if we think a little about what it takes to build general AI, systems that throughout their existence would be capable of learning, judgment and planning, we will almost certainly conclude that some kind of combined solution is required. Incidentally, it is precisely a combined solution that nature arrived at, endowing mammals and other higher animals with complex intelligence over the course of evolution.
Evolutionary strategies
The main thesis of the OpenAI article was that instead of using reinforcement learning combined with traditional backpropagation, they successfully trained a neural network to solve complex problems using a so-called "evolution strategy" (ES). The ES approach maintains a distribution over the network's weight values, with many agents operating in parallel using parameters sampled from that distribution. Each agent acts in its own copy of the environment, and after a specified number of episodes or episode steps the algorithm returns the cumulative reward, expressed as a fitness score. Given this value, the parameter distribution can be shifted toward the more successful agents and away from the less successful ones. By repeating this operation millions of times with hundreds of agents, you can move the weight distribution into a region of parameter space that yields a high-quality policy for the agents' task. Indeed, the results presented in the paper are impressive: it is shown that if you run a thousand agents in parallel, anthropomorphic bipedal locomotion can be learned in less than half an hour (whereas even the most advanced RL methods need more than an hour for this). For more details, I recommend reading the excellent post by the authors of the experiment, as well as the scientific article itself.
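To make the loop described above concrete, here is a minimal sketch of the general ES idea in Python with NumPy. This is not the authors' implementation, and the hyperparameters (population size, noise scale, learning rate) are illustrative only:

```python
import numpy as np

def evolution_strategy(fitness_fn, dim, n_agents=100, sigma=0.1, lr=0.02, n_iters=1000):
    """Minimal ES loop: sample perturbations of a central parameter vector,
    score each perturbed agent, and shift the center toward the better ones."""
    theta = np.zeros(dim)                                   # center of the weight distribution
    for _ in range(n_iters):
        noise = np.random.randn(n_agents, dim)              # one perturbation per agent
        rewards = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
        # normalize the rewards so that a few outliers do not dominate the update
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # move the center in the direction of perturbations that scored well
        theta += lr / (n_agents * sigma) * noise.T.dot(advantages)
    return theta
```

In the setup described in the paper, each call to `fitness_fn` would be a full episode rollout executed on a separate worker, and only the scalar rewards (plus the random seeds of the perturbations) need to be communicated back, which is what makes the method so easy to parallelize.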
Various gaits for anthropomorphic upright walking, learned with OpenAI's ES method.
Black box
The great advantage of this method is that it is easily parallelized. Whereas RL methods such as A3C require exchanging information between worker processes and a parameter server, ES only needs fitness estimates and summary information about the parameter distribution. It is thanks to this simplicity that the method scales far beyond modern RL methods. However, all of this does not come for free: the network has to be optimized as a black box. Here "black box" means that during training the internal structure of the network is completely ignored; only the overall result (the reward per episode) is used, and it alone determines whether the weights of a particular network will be inherited by subsequent generations. In situations where we do not get strong feedback from the environment, and in many traditional RL problems the reward stream is very sparse, the problem goes from being "partly a black box" to "completely a black box." In that case, ignoring what little gradient information there is can seriously improve performance, so the compromise is, of course, justified. "Who needs gradients if they are hopelessly noisy anyway?" is the general sentiment.
However, in situations where the feedback is richer, things start to go badly for ES. The OpenAI team describes how a simple MNIST classification network was trained with ES, and this time training was 1000 times slower. The point is that the gradient signal in image classification is extremely informative about how to teach the network to classify better. So the problem lies not so much with the RL methodology as with sparse rewards in environments that produce noisy gradients.
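As a toy illustration of why the gradient signal is so much more informative (this is not the MNIST experiment itself, just a made-up logistic-regression example), compare one backpropagation step with one ES step on the same classifier: backprop receives a per-weight error signal from every example, while ES only sees one scalar score per perturbed candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))                  # toy input batch
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # toy labels
w = np.zeros(20)

def loss(w):
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))

# Backprop step: every example contributes a per-weight error signal.
p = 1 / (1 + np.exp(-X @ w))
grad = X.T @ (p - y) / len(y)
w_backprop = w - 0.5 * grad

# ES step: each perturbed candidate is judged by a single scalar score.
sigma, n = 0.1, 50
eps = rng.normal(size=(n, 20))
scores = np.array([-loss(w + sigma * e) for e in eps])   # higher is better
adv = (scores - scores.mean()) / (scores.std() + 1e-8)
w_es = w + 0.5 / (n * sigma) * eps.T @ adv
```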
The solution found by nature
If we try to learn from nature's example when thinking about ways to develop AI, this can in some cases be seen as a problem-oriented approach. After all, nature operates under constraints that computer scientists simply do not have. There is a view that a purely theoretical approach to a given problem can yield more effective solutions than empirical alternatives. Nevertheless, I still think it is worth examining how a dynamical system operating under certain constraints (the Earth) produced agents (animals, mammals in particular) capable of flexible and complex behavior. While some of these constraints do not apply in simulated data science worlds, others are entirely relevant.
Looking at the intelligent behavior of mammals, we see that it arises from the complex interplay of two closely interrelated processes: learning from the experience of others and learning from one's own experience. The first is often equated with evolution by natural selection, but here I use a broader term that also covers epigenetics, microbiomes and other mechanisms that enable the exchange of experience between organisms that are not genetically related. The second process, learning from one's own experience, covers all the information an animal manages to learn during its lifetime, and this information is determined directly by the animal's interaction with the outside world. This category spans everything from learning to recognize objects to mastering the communication involved in the learning process itself.
Roughly speaking, these two processes occurring in nature can be compared with the two ways of optimizing neural networks. Evolution strategies, where fitness information is used to update the organism's parameters, correspond to learning from the experience of others. Similarly, gradient methods, where a particular experience leads to one or another change in the agent's behavior, are comparable to learning from one's own experience. If we think about the kinds of intelligent behavior or the abilities that each of these two approaches gives animals, the comparison becomes even clearer. In both cases, "evolutionary methods" encourage the learning of reactive behaviors that provide a certain level of fitness (enough to stay alive). Learning to walk or to escape from captivity is in many cases equivalent to the more "instinctive" behaviors that are "hard-wired" into many animals at the genetic level. In addition, this example confirms that evolutionary methods are applicable when the reward signal is extremely rare (such as, for example, the fact of successfully raising offspring). In such a case, there is no way to correlate the reward with any specific set of actions that may have been performed many years before it occurred. On the other hand, if we consider the case where ES fails, namely image classification, the results are remarkably comparable to the results of animal training achieved in countless behavioral psychology experiments over more than a hundred years.
Animal learning
The methods used in reinforcement learning have in many cases been taken directly from the psychological literature on operant conditioning, and operant conditioning has been studied chiefly on the material of animal psychology. Incidentally, Richard Sutton, one of the two founders of reinforcement learning, has a bachelor's degree in psychology. In operant conditioning, animals learn to associate reward or punishment with specific behavioral patterns. Animal trainers and researchers can manipulate this reward association in various ways, provoking animals to demonstrate intelligence or particular behaviors. However, operant conditioning as used in animal research is nothing but a more pronounced form of the same conditioning through which animals learn throughout their lives. We constantly receive positive reinforcement signals from the environment and adjust our behavior accordingly. In fact, many neurophysiologists and cognitive scientists believe that people and other animals actually go even further and are constantly learning to predict the outcomes of their behavior in future situations, counting on potential rewards.
The central role of prediction in learning from one's own experience changes the dynamics described above in the most significant way. The signal that previously seemed very sparse (the episodic reward) turns out to be very dense. In theory, the situation looks roughly like this: at every moment in time, the mammalian brain is computing predictions of outcomes from a complex stream of sensory stimuli and actions, while the animal is simply immersed in this stream. The animal's actual behavior thus provides a dense signal against which predictions are adjusted and behavior is developed. The brain uses all of these signals to optimize its predictions (and, accordingly, the quality of its actions) in the future. An overview of this approach is given in the excellent book "Surfing Uncertainty" by the cognitive scientist and philosopher Andy Clark. If we extrapolate this reasoning to the training of artificial agents, a fundamental flaw in reinforcement learning emerges: the signal used in this paradigm is hopelessly weak compared with what it could (or should) be. In cases where it is impossible to increase the density of the signal (perhaps because it is weak by definition or tied to low-level reactivity), it is probably better to prefer a training method that parallelizes well, such as ES.
More intensive neural network training
Drawing on how the mammalian brain works, constantly engaged in prediction, some progress has recently been made in reinforcement learning toward taking the importance of such predictions into account. Offhand, I can point to two such works.
In both of these papers, the authors supplement the usual policy of their neural networks with predictions about the future state of the environment. In the first paper, prediction is applied to a set of measurement variables; in the second, to changes in the environment and in the behavior of the agent itself. In both cases, the sparse signal associated with positive reinforcement becomes much richer and more informative, enabling both faster learning and the acquisition of more complex behaviors. Such improvements are available only with methods that use a gradient signal, not with methods that operate on the black-box principle, such as ES.
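As a schematic illustration of this general idea (not the exact architecture of either paper; the module names and the weighting coefficient below are made up for the sketch), a policy network can be given an auxiliary head that predicts future measurements, and its supervised prediction loss is added to the sparse policy-gradient term:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyWithPrediction(nn.Module):
    """Policy network with an auxiliary head that predicts future measurements.
    The dense prediction loss supplements the sparse episodic reward."""
    def __init__(self, obs_dim, n_actions, pred_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)    # action logits, as usual
        self.predict_head = nn.Linear(hidden, pred_dim)    # forecast of future measurements

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy_head(h), self.predict_head(h)

def combined_loss(logits, predictions, actions, returns, future_targets, beta=0.5):
    # Sparse policy-gradient term driven by the episodic reward...
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * returns).mean()
    # ...plus a dense supervised term on the predicted future measurements.
    aux_loss = F.mse_loss(predictions, future_targets)
    return pg_loss + beta * aux_loss
```

The key point is only that the auxiliary targets are available at every step, so the gradient update no longer depends solely on the occasional reward.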
In addition, learning from one's own experience and gradient methods are much more data-efficient. Even in those cases where a problem could be learned faster with ES than with reinforcement learning, the gain was achieved because the ES strategy consumed many times more data than RL. Reflecting on how animals learn, note that the result of learning from someone else's example shows up only after many generations, whereas sometimes a single event experienced first-hand is enough for an animal to learn the lesson forever. While this kind of learning from a single example does not quite fit into traditional gradient methods yet, it is far more within their reach than within that of ES. There are, for example, approaches such as neural episodic control, where Q-values are stored during training and the program consults them before acting. The result is a gradient method that learns to solve problems much faster than before. In the paper on neural episodic control, the authors mention the human hippocampus, which can retain information about an event after experiencing it only once and therefore plays a crucial role in recall. Such mechanisms require access to the agent's internal organization, which is by definition impossible in the ES paradigm.
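For intuition, here is a drastically simplified sketch of the episodic idea. The real neural episodic control method uses learned embeddings and a differentiable neural dictionary; the class below is just a plain nearest-neighbour table invented for illustration:

```python
import numpy as np

class EpisodicMemory:
    """Store (embedding, discounted return) pairs and estimate a Q-value
    by averaging the returns of the nearest stored neighbours."""
    def __init__(self, k=5):
        self.keys, self.values = [], []
        self.k = k

    def write(self, embedding, discounted_return):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(float(discounted_return))

    def lookup(self, embedding):
        if not self.keys:
            return 0.0
        keys = np.stack(self.keys)
        dists = np.linalg.norm(keys - embedding, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.values[i] for i in nearest]))

# Acting greedily over per-action memories (one memory per discrete action).
def act(embedding, memories):
    return int(np.argmax([m.lookup(embedding) for m in memories]))
```

Because a single written entry immediately influences future lookups, one strongly rewarded event can change behavior right away, which is exactly the kind of one-shot effect that a black-box ES update cannot exploit.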
So why not combine them?
Most of this article has probably left the impression that I am advocating for RL methods. In fact, however, I believe that in the long run the best solution will be a combination of both methods, with each used in the situations where it fits best. Obviously, for many reactive policies, or in situations with very sparse reward signals, ES has the advantage, all the more so if you have the computing power to run massively parallel training. On the other hand, gradient methods using reinforcement learning or supervised learning will be useful when extensive feedback is available and the problem must be learned quickly with less data.
Turning to nature, we find that the first method essentially lays the foundation for the second. That is why, over the course of evolution, mammals developed brains that allow them to learn extremely effectively from the complex signals coming from the environment. So the question remains open. Perhaps evolution strategies will help us invent effective learning architectures that will also be useful for gradient-based learning methods. After all, the solution found by nature has indeed proved very successful.