
Evolution strategies as a scalable alternative to reinforcement learning

The article starts from the observation that evolution strategies (ES), a long-known family of optimization methods, can rival and even surpass reinforcement learning (RL) algorithms. The advantages of evolution strategies are discussed below.


This work continues the current trend of achieving strong results with decades-old ideas. For example, in 2012 the AlexNet paper showed how to construct, scale, and train a convolutional neural network (CNN) to obtain outstanding accuracy on image recognition. Similarly, in 2013 it was shown that combining the long-known Q-learning algorithm with a convolutional neural network can successfully teach a machine to play Atari games. AlexNet revived interest in neural networks at a time when most researchers believed CNNs were unpromising for computer vision. In the same way, this work demonstrates that evolution strategies can achieve strong results on reinforcement learning benchmarks, dispelling the common belief that ES is not applicable to high-dimensional problems.

ES is easy to implement and scale. Our implementation trains a 3D MuJoCo humanoid to walk on a cluster of 80 machines (1,440 CPU cores) in just 10 minutes (A3C on 32 cores takes about 10 hours). On 720 cores we also obtain Atari performance comparable to A3C, while cutting the training time from 1 day to 1 hour.

Reinforcement learning


A brief refresher on RL. Suppose we are given some environment (for example, a game) in which we want to train an agent. To describe the agent's behavior, we define a policy function (the agent's "brain"), which computes how the agent should act in each particular situation. In practice the policy is usually a neural network that takes the current state of the game as input and outputs the probability of taking each of the allowed actions. A typical policy function might have about 1 million parameters, so our task is to find values of these parameters for which the policy plays well (that is, the agent wins most games).


In a game of Pong, the policy looks at the pixels on the screen and computes the probability of moving the paddle down, up, or keeping it still.
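
As an illustration, here is a minimal sketch of what such a policy could look like as a tiny NumPy network (the 80x80 input, the layer sizes, and the variable names are made up for the example and are not taken from the article):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((200, 80 * 80))   # hidden layer weights
W2 = 0.01 * rng.standard_normal((3, 200))         # output layer weights

def policy(pixels):
    h = np.maximum(0, W1 @ pixels)     # ReLU hidden layer
    return softmax(W2 @ h)             # probabilities of down / up / stay

probs = policy(rng.standard_normal(80 * 80))       # a fake screen observation
action = rng.choice(3, p=probs)                    # sample one of the three actions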

The learning process works as follows. Starting from a random initialization of the parameters, the agent interacts with the game world for some time and accumulates episodes of interaction (for example, each episode is one game of Pong). We thus obtain a complete record of what happened: which states we encountered, which actions we took in each state, and what reward we received at each step. The recipe for improving the policy: whatever the agent did that led to a higher reward is good; whatever led to a lower reward is bad. We can use backpropagation to compute small changes to the network parameters that make the good actions more likely and the bad actions less likely in the future. The process repeats iteratively: accumulate a new batch of episodes, update the parameters, repeat.
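
To make this update recipe concrete, here is a toy sketch of it on a 3-armed bandit instead of a full game (the bandit, its reward values, and all names are invented for illustration; the article itself trains agents on Atari and MuJoCo):

import numpy as np

true_rewards = np.array([0.2, 1.0, 0.5])   # toy "environment": arm 1 pays best

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # policy parameters (action logits)
alpha = 0.1                                # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)             # sample an action (noise in action space)
    r = true_rewards[a] + 0.1 * rng.standard_normal()   # noisy reward

    # REINFORCE gradient for a softmax policy: grad log pi(a) = onehot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += alpha * r * grad_logp         # make rewarded actions more likely

print(softmax(theta))                      # probability mass concentrates on arm 1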

Search in policy space through noise injection


RL usually uses stochastic policies, since we only compute the probabilities of taking each action. During training the agent may therefore find itself in the same state many times, but each time it will take different actions according to the probabilistic sampling. This provides the signal for learning: some of these actions lead to a higher reward, others to a lower one. In other words, we inject noise into the agent's actions to search the policy space, by sampling from the distribution over actions at every moment in time. This is different from ES, as described below.

Evolution strategies


Evolution


Before diving into ES, note that the word "evolution" here has little to do with biology. The mathematical machinery is so far abstracted from its biological analogue that it is best to think of ES simply as a class of black-box stochastic optimization methods.

Black-box optimization


Let us forget about the agent, the game world, neural networks, and time for a moment. ES treats the problem as a black box: 1 million numbers go in (the parameters of the policy function) and 1 number comes out (the total reward). We just want to find the best values of these 1 million numbers.

The evolution strategy algorithm


Intuitively, the optimization is a "guess and check" process. We start with a random set of parameters and then repeatedly:

1) randomly tweak the parameters slightly;
2) move the parameters slightly in the direction of whichever tweaks worked better.

Specifically, at each step we take the parameter vector w and generate a population of, say, 100 slightly different vectors w1 ... w100 by adding Gaussian noise. We then evaluate each of the 100 candidates independently by running the agent in the environment for a while and summing up the rewards in each case. The updated parameter vector is the weighted sum of the 100 vectors, where each weight is proportional to the total reward (i.e., more successful candidates get a higher weight). Mathematically, this is equivalent to estimating the gradient of the expected reward in parameter space, except that we estimate it using only 100 randomly chosen directions.
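
For reference, the identity behind this update (a standard result for Gaussian smoothing, not spelled out in the article) is

\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ F(\theta + \sigma \epsilon) \big] \;=\; \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ F(\theta + \sigma \epsilon) \, \epsilon \big],

so averaging the reward-weighted noise vectors over the population gives a Monte Carlo estimate of this gradient.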

Code example


To clarify the algorithm, here is a short example of optimizing a quadratic function using ES.

# simple example: minimize a quadratic around some solution point
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # reward: negative squared distance to the solution (higher is better)
    return -np.sum((w - solution)**2)

npop = 50      # population size
sigma = 0.1    # noise standard deviation
alpha = 0.001  # learning rate

w = np.random.randn(3)  # initial guess
for i in range(300):
    N = np.random.randn(npop, 3)              # sample Gaussian perturbations
    R = np.zeros(npop)
    for j in range(npop):
        w_try = w + sigma * N[j]               # perturb the parameters
        R[j] = f(w_try)                        # evaluate the candidate
    A = (R - np.mean(R)) / np.std(R)           # standardize the rewards
    w = w + alpha / (npop * sigma) * np.dot(N.T, A)   # gradient-estimate step

Injecting noise into the parameters


Note that the objective is the same as in RL: optimizing the expected reward. However, in RL the noise is injected into the action space and backpropagation is used to compute the parameter updates, while in ES the noise is injected directly into the parameter space. Because of this, ES can use deterministic policies (which is what we did in our experiments). The two approaches can also be combined.
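
As a minimal sketch of the difference, here is the same toy 3-armed bandit as above, this time solved with a deterministic policy and noise in the parameters rather than in the actions (again, the bandit and all names are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 1.0, 0.5])    # same toy 3-armed bandit as before

def act(theta):
    return int(np.argmax(theta))            # deterministic policy: no action sampling

theta = np.zeros(3)
sigma, alpha, npop = 0.1, 0.05, 20

for step in range(200):
    noise = rng.standard_normal((npop, 3))                         # noise in parameter space
    rewards = np.array([true_rewards[act(theta + sigma * n)] for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # standardize
    theta += alpha / (npop * sigma) * noise.T @ rewards            # ES update

print(act(theta))                           # settles on arm 1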

Comparing reinforcement learning and evolution strategies


Some advantages of ES over RL follow from the above: there is no need for backpropagation, the method is trivially parallelizable across many workers, and it works with deterministic policies.


ES is not free of shortcomings either. In practice, the random noise in parameter space must lead to different outcomes, otherwise there is no training signal. For example, in Montezuma's Revenge it is almost impossible to get the key on the first level with a random network, while it is occasionally possible with random actions.

A detailed description of the experiments and a comparison with the best RL algorithms are given in the paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning.

The source code is published on GitHub.

Source: https://habr.com/ru/post/330342/

