
Evolution strategies as a scalable alternative to reinforcement learning

The article starts from the observation that evolution strategies (ES), a long-known family of optimization methods, can rival and even surpass reinforcement learning (RL) algorithms. The advantages of evolution strategies are discussed below.


This work continues the current trend of achieving strong results with decades-old ideas. For example, in 2012 the AlexNet paper showed how to construct, scale, and train a convolutional neural network (CNN) to obtain outstanding accuracy on image recognition. Similarly, in 2013 it was shown that combining the long-known Q-learning algorithm with a convolutional neural network can successfully teach a machine to play Atari games. AlexNet revived interest in neural networks at a time when most researchers believed CNNs were unpromising for computer vision. In the same way, this work demonstrates that evolution strategies can achieve strong results on reinforcement learning benchmarks, dispelling the common belief that ES is not applicable to high-dimensional problems.

ES is easy to implement and scale. Our implementation trains a 3D MuJoCo humanoid to walk on a cluster of 80 machines (1,440 CPU cores) in just 10 minutes (A3C on 32 cores takes about 10 hours). On 720 cores we also obtain Atari performance comparable to A3C, while cutting the training time from 1 day to 1 hour.

Reinforcement learning


A brief refresher on RL. Suppose we are given some environment (for example, a game) in which we want to train an agent. To describe the agent's behavior, we define a policy function (the agent's "brain"), which computes how the agent should act in each particular situation. In practice the policy is usually a neural network that takes the current state of the game as input and outputs the probability of taking each of the allowed actions. A typical policy function might have about 1 million parameters, so our task is to find values of these parameters for which the policy plays well (that is, the agent wins most games).


In a game of Pong, the policy looks at the pixels on the screen and computes the probability of moving the paddle down, up, or keeping it still.
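
As an illustration, here is a minimal sketch of what such a policy could look like as a tiny NumPy network (the 80x80 input, the layer sizes, and the variable names are made up for the example and are not taken from the article):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((200, 80 * 80))   # hidden layer weights
W2 = 0.01 * rng.standard_normal((3, 200))         # output layer weights

def policy(pixels):
    h = np.maximum(0, W1 @ pixels)     # ReLU hidden layer
    return softmax(W2 @ h)             # probabilities of down / up / stay

probs = policy(rng.standard_normal(80 * 80))       # a fake screen observation
action = rng.choice(3, p=probs)                    # sample one of the three actions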

The learning process works as follows. Starting from a random initialization of the parameters, the agent interacts with the game world for some time and accumulates episodes of interaction (for example, each episode is one game of Pong). We thus obtain a complete record of what happened: which states we encountered, which actions we took in each state, and what reward we received at each step. The recipe for improving the policy: whatever the agent did that led to a higher reward is good; whatever led to a lower reward is bad. We can use backpropagation to compute small changes to the network parameters that make the good actions more likely and the bad actions less likely in the future. The process repeats iteratively: accumulate a new batch of episodes, update the parameters, repeat.
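
To make this update recipe concrete, here is a toy sketch of it on a 3-armed bandit instead of a full game (the bandit, its reward values, and all names are invented for illustration; the article itself trains agents on Atari and MuJoCo):

import numpy as np

true_rewards = np.array([0.2, 1.0, 0.5])   # toy "environment": arm 1 pays best

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # policy parameters (action logits)
alpha = 0.1                                # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)             # sample an action (noise in action space)
    r = true_rewards[a] + 0.1 * rng.standard_normal()   # noisy reward

    # REINFORCE gradient for a softmax policy: grad log pi(a) = onehot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += alpha * r * grad_logp         # make rewarded actions more likely

print(softmax(theta))                      # probability mass concentrates on arm 1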

Search in policy space through noise injection


RL usually uses stochastic policies, since we only compute the probabilities of taking each action. During training the agent may therefore find itself in the same state many times, but each time it will take different actions according to the probabilistic sampling. This provides the signal for learning: some of these actions lead to a higher reward, others to a lower one. In other words, we inject noise into the agent's actions to search the policy space, by sampling from the distribution over actions at every moment in time. This is different from ES, as described below.

Evolution strategies


Evolution


Before diving into ES, note that the word "evolution" here has little to do with biology. The mathematical machinery is so far abstracted from its biological analogue that it is best to think of ES simply as a class of black-box stochastic optimization methods.

Black-box optimization


Let us forget about the agent, the game world, neural networks, and time for a moment. ES treats the problem as a black box: 1 million numbers go in (the parameters of the policy function) and 1 number comes out (the total reward). We just want to find the best values of these 1 million numbers.

The evolution strategy algorithm


Intuitively, the optimization is a "guess and check" process. We start with a random set of parameters and then repeatedly:

1) randomly tweak the parameters slightly;
2) move the parameters slightly in the direction of whichever tweaks worked better.

Specifically, at each step we take the parameter vector w and generate a population of, say, 100 slightly different vectors w1 ... w100 by adding Gaussian noise. We then evaluate each of the 100 candidates independently by running the agent in the environment for a while and summing up the rewards in each case. The updated parameter vector is the weighted sum of the 100 vectors, where each weight is proportional to the total reward (i.e., more successful candidates get a higher weight). Mathematically, this is equivalent to estimating the gradient of the expected reward in parameter space, except that we estimate it using only 100 randomly chosen directions.
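
For reference, the identity behind this update (a standard result for Gaussian smoothing, not spelled out in the article) is

\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ F(\theta + \sigma \epsilon) \big] \;=\; \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ F(\theta + \sigma \epsilon) \, \epsilon \big],

so averaging the reward-weighted noise vectors over the population gives a Monte Carlo estimate of this gradient.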

Code example


To clarify the algorithm, here is a short example of optimizing a quadratic function using ES.

# simple example: minimize a quadratic around some solution point
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # reward: negative squared distance to the solution (higher is better)
    return -np.sum((w - solution)**2)

npop = 50      # population size
sigma = 0.1    # noise standard deviation
alpha = 0.001  # learning rate

w = np.random.randn(3)  # initial guess
for i in range(300):
    N = np.random.randn(npop, 3)              # sample Gaussian perturbations
    R = np.zeros(npop)
    for j in range(npop):
        w_try = w + sigma * N[j]               # perturb the parameters
        R[j] = f(w_try)                        # evaluate the candidate
    A = (R - np.mean(R)) / np.std(R)           # standardize the rewards
    w = w + alpha / (npop * sigma) * np.dot(N.T, A)   # gradient-estimate step

Injecting noise into the parameters


Note that the objective is the same as in RL: optimizing the expected reward. However, in RL the noise is injected into the action space and backpropagation is used to compute the parameter updates, while in ES the noise is injected directly into the parameter space. Because of this, ES can use deterministic policies (which is what we did in our experiments). The two approaches can also be combined.
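
As a minimal sketch of the difference, here is the same toy 3-armed bandit as above, this time solved with a deterministic policy and noise in the parameters rather than in the actions (again, the bandit and all names are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 1.0, 0.5])    # same toy 3-armed bandit as before

def act(theta):
    return int(np.argmax(theta))            # deterministic policy: no action sampling

theta = np.zeros(3)
sigma, alpha, npop = 0.1, 0.05, 20

for step in range(200):
    noise = rng.standard_normal((npop, 3))                         # noise in parameter space
    rewards = np.array([true_rewards[act(theta + sigma * n)] for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # standardize
    theta += alpha / (npop * sigma) * noise.T @ rewards            # ES update

print(act(theta))                           # settles on arm 1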

Comparing reinforcement learning and evolution strategies


Some advantages of ES over RL follow from the above: there is no need for backpropagation, the method is trivially parallelizable across many workers, and it works with deterministic policies.


ES is not free of shortcomings either. In practice, the random noise in parameter space must lead to different outcomes, otherwise there is no training signal. For example, in Montezuma's Revenge it is almost impossible to get the key on the first level with a random network, while it is occasionally possible with random actions.

A detailed description of the experiments and a comparison with the best RL algorithms are given in the paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning.

The source code is published on GitHub.

Source: https://habr.com/ru/post/330342/

