In the previous article, I described several evolution strategies (ES) algorithms that can optimize the parameters of a function without explicitly computing gradients. In reinforcement learning (RL) problems, these algorithms can be used to search for suitable sets of model parameters for a neural network agent. In this article I will discuss applying ES to several RL tasks, and also describe methods for finding more stable and robust policies.
Unlike RL algorithms, which require a reward signal to be given to the agent at every timestep, ES only cares about the final cumulative reward the agent receives after a run in the environment. In many problems we only know the outcome at the end of the task: did the agent succeed, did the robot pick up the object, did the agent survive, and so on. In such tasks ES can be more efficient than traditional RL. Below is a short piece of Python pseudocode that encapsulates an agent's rollout in an OpenAI Gym environment; here we only care about the cumulative reward:
```python
def rollout(agent, env):
    obs = env.reset()
    done = False
    total_reward = 0
    while not done:
        a = agent.get_action(obs)
        obs, reward, done, info = env.step(a)
        total_reward += reward
    return total_reward
```
You can treat `rollout` as an objective function that maps an agent's model parameters to a fitness score, and use an ES solver to find a suitable set of parameters, as described in the previous article:
```python
env = gym.make('worlddomination-v0')

# use our favourite ES
solver = EvolutionStrategy()

while True:
    # ask the ES to give us a set of candidate parameters
    solutions = solver.ask()

    # create an array to hold the results
    fitlist = np.zeros(solver.popsize)

    # evaluate each given solution
    for i in range(solver.popsize):
        # init the agent with a solution
        agent = Agent(solutions[i])
        # rollout the env with this agent
        fitlist[i] = rollout(agent, env)

    # give the fitness scores back to the ES
    solver.tell(fitlist)

    # get the best parameters and fitness from the ES
    bestsol, bestfit = solver.result()

    # see if our task is solved
    if bestfit > MY_REQUIREMENT:
        break
```
The agent's observation of the environment is its input, and its output is the action it takes at each step of the rollout. We can model the agent however we want, using anything from hand-coded rules, decision trees, and linear functions to recurrent neural networks. In this article I will use a simple feed-forward network with two hidden layers that maps the agent's observation of the environment (a vector x) directly to an action (a vector y):
h_1 = f_h(W_1 x + b_1)

h_2 = f_h(W_2 h_1 + b_2)

y = f_out(W_out h_2 + b_out)
The activation functions f_h and f_out can be tanh, sigmoid, relu, or anything else. In all my experiments I used tanh. If needed, f_out in the output layer can simply be a pass-through (identity) function with no nonlinearity. If we concatenate all the weights and bias parameters into a single vector W, we can see that the neural network described above is a deterministic function y = F(x, W). We can then use ES to search for a solution W with the search loop described in the previous article.
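To make this concrete, here is a minimal sketch of such a policy (not the exact code from my experiments; the hidden size of 40 is an arbitrary choice): a flat parameter vector W is unpacked into the weight matrices and bias vectors of a two-hidden-layer tanh network.

```python
import numpy as np

def make_shapes(obs_dim, hidden_dim, act_dim):
    # shapes of W1, b1, W2, b2, W_out, b_out for a two-hidden-layer network
    return [(hidden_dim, obs_dim), (hidden_dim,),
            (hidden_dim, hidden_dim), (hidden_dim,),
            (act_dim, hidden_dim), (act_dim,)]

def unpack(W, shapes):
    # split the flat solution vector W into the individual parameter tensors
    params, idx = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(W[idx:idx + size].reshape(shape))
        idx += size
    return params

def policy(x, W, shapes):
    # deterministic policy y = F(x, W), tanh everywhere
    W1, b1, W2, b2, Wo, bo = unpack(W, shapes)
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return np.tanh(Wo @ h2 + bo)

# example: an environment with 24 inputs and 4 outputs
shapes = make_shapes(24, 40, 4)
n_params = sum(int(np.prod(s)) for s in shapes)  # length of the vector the ES searches over
```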
But what if we don't want the agent's policy to be deterministic? For some tasks, even one as simple as rock-paper-scissors, the optimal policy is a random action, so the agent must learn a stochastic policy. One way to turn y = F(x, W) into a stochastic policy is to make W itself random: each model parameter w_i ∈ W can be a random value drawn from a normal distribution N(μ_i, σ_i).
This kind of stochastic neural network is called a Bayesian neural network: a network with a prior distribution over its weights. In our case, the model parameters we are solving for are the vectors μ and σ, not the weights W, and on each forward pass we sample a new W from N(μ, σI). There is a lot of interesting work on applying Bayesian neural networks to different problems, and on the difficulties of training them. ES can be used to directly search for a stochastic policy by making μ and σ, rather than W, the solution space.
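As a rough sketch of what that looks like (my own illustration, not code from an existing library; I parameterize σ through log σ to keep it positive), the ES solution vector could be the concatenation [μ, log σ], and a fresh W is sampled on every forward pass:

```python
def sample_W(solution, rng=np.random):
    # solution = [mu, log_sigma], each half the length of W
    n = len(solution) // 2
    mu, log_sigma = solution[:n], solution[n:]
    return mu + np.exp(log_sigma) * rng.randn(n)   # W ~ N(mu, diag(sigma))

# usage inside a rollout, reusing policy() from the earlier sketch:
# W = sample_W(solution)
# y = policy(x, W, shapes)
```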
Stochastic networks appear frequently in the RL literature. For example, in the Proximal Policy Optimization (PPO) algorithm, the final layer outputs a set of parameters μ and σ, and the action is sampled from N(μ, σI). Adding noise to the parameters can also encourage the agent to explore the environment and escape local optima.
I found that when an agent needs to explore the environment, we often don't need the entire vector W to be random; making just the bias parameters random is enough. In difficult locomotion tasks, for example the roboschool environments, I often use ES to find a stochastic policy in which only the bias parameters are drawn from a normal distribution.
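A sketch of this bias-only variant, under the same hypothetical parameter layout as above: only the bias vectors get a μ and σ, while the weight matrices stay deterministic.

```python
def sample_biases(mu_b, log_sigma_b, rng=np.random):
    # only b1, b2, b_out are stochastic; W1, W2, W_out stay fixed
    return mu_b + np.exp(log_sigma_b) * rng.randn(len(mu_b))
```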
This is one of the areas where ES is useful for finding robust policies: I want to control the trade-off between data efficiency and policy robustness over many random trials. I experimented with ES in the excellent BipedalWalkerHardcore-v2 environment, developed by Oleg Klimov. The environment uses the Box2D physics engine, the same engine that powered Angry Birds.
Our agent solves BipedalWalkerHardcore-v2.
In this environment, the agent has to learn a policy for walking across a randomly generated obstacle course, within a time limit and without falling over. There are 24 inputs: 10 lidar readings, joint angles, and ground-contact indicators. The agent does not know where it is on the track. The action space consists of four continuous values controlling the torques of the four motors. The total reward is based on the distance the agent travels; if it completes the whole course, it earns more than 300 points. Some points are deducted in proportion to the applied torque, so energy consumption is also a constraint.
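For reference, the observation and action dimensions can be checked directly in the environment (assuming the classic gym API and the -v2 version of that time):

```python
import gym

env = gym.make('BipedalWalkerHardcore-v2')
print(env.observation_space.shape)  # (24,) -- lidar readings, angles, contacts
print(env.action_space.shape)       # (4,)  -- torques for the four motors
```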
BipedalWalkerHardcore-v2 is considered solved if the agent scores an average of 300 or more points over 100 consecutive random trials. While it is relatively easy to train an agent to complete the course with an RL algorithm, it is much harder to get it to do so efficiently and consistently, which makes the task quite interesting. As far as I know, as of October 2017 my agent completes the course better than anything else.
Start. Learning to walk.
Learning to recover from mistakes, though progress is slower...
Since each trial generates a new random course, some trials are easy and some are very hard. We do not want natural selection to pass agents with weak policies, which simply got lucky with an easy course, into the next generation. We also want agents with good policies to be able to prove they are genuinely better than the rest. So I defined an agent's episode as the average over 16 random rollouts, and used the mean cumulative reward over those 16 rollouts as its fitness score, as sketched below.
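Here is a minimal sketch of that averaging, reusing the Agent and rollout definitions from the earlier code (the helper name average_fitness is mine):

```python
def average_fitness(solution, env, n_rollouts=16):
    # evaluate one candidate on several randomly generated tracks and average
    agent = Agent(solution)
    rewards = [rollout(agent, env) for _ in range(n_rollouts)]
    return np.mean(rewards)

# inside the ES loop, instead of a single rollout:
# fitlist[i] = average_fitness(solutions[i], env)
```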
Another way to look at it: although the final evaluation runs the agent over 100 trials, during training each candidate was only evaluated on single rollouts, so the test task does not match the training task we optimized for. By averaging each agent in the population over many rollouts in this stochastic environment, we can shrink the gap between the training and test sets. And if you can overfit to the training set, why not overfit to the test set as well — in RL nobody seems to mind :)
Of course, data efficiency is now 16 times worse, but the final policy became much more robust. When I tested the final policy over 100 consecutive random trials, it scored an average of more than 300 points, solving the environment. Without this averaging method, the best agent scored only ~220-230 points over 100 trials. As far as I know, our solution was the first to solve this environment (as of October 2017).
The winning solution, trained with PEPG using an average over 16 rollouts per episode.
I also tried PPO, an excellent policy-gradient RL algorithm, and tuned it as well as I could for this task, but only managed to reach ~240-250 points over 100 random trials. I am sure, though, that someone will eventually solve this environment with PPO or another RL algorithm.
In real-world settings where we need robust policies, the ability to control the trade-off between data efficiency and policy robustness is a useful and powerful feature. In principle, given enough computing power, one could even average over the required 100 rollouts and optimize the bipedal walker directly at the level of the specification. Engineers designing real devices routinely have to satisfy quality-control requirements and safety margins, and these will need to be taken into account when training agents whose policies can affect the real world around them.
Several solutions found by ES:
CMA-ES solution.
OpenAI-ES solution.
I also trained the agent with a stochastic-policy network initialized with high noise parameters, so the agent saw noise everywhere and even its own movements were noisy. As a result, it learned to solve the task even without being able to fully trust the accuracy of its inputs and outputs (although it could not average above 300 points):
Bipedal walker using a stochastic policy.
I also tried the combination of ES and the averaging technique on a simpler problem with the Kuka robot arm. This environment is available in pybullet, and the Kuka model used in the simulator corresponds to the real Kuka arm. In this task the agent is given the coordinates of the object to grasp.
In more advanced RL environments the agent may have to act from pixel inputs, but in principle we could combine this simplified task with a pretrained convolutional neural network that estimates the object's coordinates from pixels.
The grasping task with a stochastic policy.
If the agent successfully grasps the object it receives 10,000 points, otherwise 0, and some points are deducted for energy use. Averaging the rewards over 16 random trials lets ES optimize for robustness. In the end, however, I only obtained policies that grasp the object in roughly 70-75% of cases, for both deterministic and stochastic policies. There is still room for improvement.
When we learn to perform several difficult tasks at once, we often get better at the individual tasks too. For example, Shaolin monks who lift weights while standing on the tops of poles balance much better without the weights. If you can avoid spilling a cup of water while driving at 140 km/h on a mountain road, you will be an excellent driver for illegal street racing. We can likewise train agents on several tasks at once, and they will end up with more robust policies.
Shaolin agents.
Drift training.
Recent work on self-play shows that agents that have mastered a difficult task like sumo wrestling (a sport that requires many skills) can perform simpler tasks, such as walking against wind, without any additional training. Erwin Coumans recently experimented with placing a duck on the back of a Minitaur that is learning to walk: if the duck falls off, the episode doesn't count. The hope is that such additions to the task will help transfer the learned policies from the simulator to a real Minitaur. I took one of his examples and experimented with the Minitaur and the duck, using ES for training. One way to express the extra condition is sketched below.
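Encoding "if the duck falls off, the episode does not count" can be done with an ordinary wrapper around the environment. This is only a hypothetical sketch, not Erwin Coumans's code; duck_fell_off is a placeholder for whatever check the simulator actually exposes.

```python
class DuckTaskWrapper:
    """End the episode, with no net reward, if the duck falls off the Minitaur."""
    def __init__(self, env, duck_fell_off):
        self.env = env
        self.duck_fell_off = duck_fell_off  # placeholder callable: env -> bool
        self.total = 0.0

    def reset(self):
        self.total = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.total += reward
        if self.duck_fell_off(self.env):
            # cancel out everything earned so far and stop the episode
            reward -= self.total
            done = True
        return obs, reward, done, info
```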
A walking policy trained with CMA-ES in pybullet.
A real Minitaur from Ghost Robotics .
The Minitaur model in pybullet is modeled after the real Minitaur. However, policies learned in an idealized virtual environment usually do not work in the real world; they may not even generalize to small modifications of the task inside the simulator. For example, in the previous video the Minitaur was trained (with CMA-ES) simply to walk forward, and this policy does not reliably carry a duck across the room if the duck is placed on top of the robot in the simulator.
The plain walking policy sort of works with a duck on board.
A policy trained with the duck.
The policy learned on plain walking without the duck still more or less works when you put the duck on the robot, so the duck does not make the task much harder. The duck is stable on its own, so it was not difficult for the Minitaur to avoid dropping it. I replaced the duck with a ball to make the task considerably harder.
Learning to cheat.
But this did not immediately produce a robust balancing policy. Instead, CMA-ES found a policy that technically moves the ball: it rolls the ball into the hollow between the legs and holds it there. The conclusion is that an objective-driven search algorithm will learn to exploit any structural weaknesses of the environment in order to accomplish its task.
A stochastic policy learned with the ball.
The same policy, but with a duck.
After I made the ball smaller, CMA-ES was able to find a stochastic policy that walks while balancing the ball, and this policy also transferred to the simpler duck task. I hope such task-augmentation methods will eventually be useful for transferring experience to real robots.
One of the important properties of ES is that the computation can be parallelized across many workers running in different threads, on different CPU cores, or even on different machines.
Python's multiprocessing library allows you to run processes in parallel, but I prefer to launch a separate Python process for each job using the Message Passing Interface (MPI) and mpi4py. This bypasses the global interpreter lock and ensures that each process gets its own numpy and gym instances, which matters when initializing the random number generators.
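A minimal sketch of this pattern with mpi4py (my own illustration, not estool's actual implementation): rank 0 runs the ES loop and scatters one candidate to each rank, every rank evaluates rollouts with its own env instance, and the scores are gathered back.

```python
from mpi4py import MPI
import numpy as np
import gym

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# every process gets its own env instance and its own RNG seed
env = gym.make('BipedalWalkerHardcore-v2')
np.random.seed(rank)

# the ES solver lives on rank 0 only, e.g. solver = EvolutionStrategy(),
# with popsize equal to the number of MPI ranks for this simple sketch
while True:
    if rank == 0:
        chunks = list(solver.ask())   # one candidate solution per rank
    else:
        chunks = None

    my_solution = comm.scatter(chunks, root=0)        # hand out candidates
    my_fitness = average_fitness(my_solution, env)    # rollouts on this worker
    fitlist = comm.gather(my_fitness, root=0)         # collect the scores

    if rank == 0:
        solver.tell(np.array(fitlist))
    # (checking bestfit and broadcasting a stop signal is omitted for brevity)
```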
Roboschool Hopper, Walker, Ant.
Roboschool Reacher.
Agents trained with estool on various roboschool tasks.
I implemented a simple tool called estool that uses the es.py library described in the previous article to train simple feed-forward policy networks on gym-based continuous-control RL tasks. I used estool to simplify training in all the experiments described above, as well as in various continuous-control tasks in gym and roboschool. estool uses MPI for distributed processing, so it takes very little effort to spread the workers across several machines.
In addition to the gym and roboschool environments, estool works well with most pybullet gym environments. You can also modify an existing pybullet environment to better suit your task; for example, it took me little effort to make an environment in which the Minitaur carries a ball (in the custom_envs directory of the repository). The ease of setting up environments makes it easy to test new ideas. If you want to bring in 3D models from other software packages such as ROS or Blender, you can create new and interesting pybullet environments and challenge others to solve them.
Many of the models and environments in pybullet, such as Kuka and Minitaur, are modeled after real robots precisely so that trained policies can be transferred to them. Indeed, many recent studies (1, 2, 3, 4) use pybullet for sim-to-real transfer experiments.
But you do not need to buy expensive robots to experiment with transferring policies from simulation to real hardware. pybullet includes a racecar model based on the open-source MIT racecar hardware kit. There is even a pybullet environment that mounts a virtual camera on the virtual racecar, so the agent receives the rendered camera frames as its observation.
We will start with a simpler version, where the car only needs to learn a policy for driving towards a giant ball. In the RacecarBulletEnv-v0 environment, the agent receives the relative coordinates of the ball as input and outputs continuous actions controlling the motor speed and the steering direction. The task is simple: training takes about 5 minutes (50 generations) on a 2014 MacBook Pro (eight cores). With estool, the following command launches training over eight processes with 4 jobs per process, for 32 workers in total, using CMA-ES to evolve the policy:
```
python train.py bullet_racecar -o cma -n 8 -t 4
```
The current training progress, along with the best model parameters found so far, is stored in the log subdirectory. Run this command to visualize the agent in the environment using the best policy found:
```
python model.py bullet_racecar log/bullet_racecar.cma.1.32.best.json
```
The pybullet racecar environment, based on the MIT Racecar.
In the simulator, you can drag the ball, and even the car, around with the mouse.
Using the IPython notebook plot_training_progress.ipynb, you can plot the training history across generations: for each generation it shows the best and worst scores, as well as the average over the whole population.
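If you prefer not to use the notebook, a few lines of matplotlib give the same kind of plot. The history structure below is a made-up format for illustration, not estool's actual log layout.

```python
import matplotlib.pyplot as plt

# history: one (best, worst, mean) tuple per generation -- dummy data here
history = [(10, -80, -30), (60, -60, 0), (150, -40, 40), (280, -20, 120)]

best, worst, mean = zip(*history)
plt.plot(best, label='best')
plt.plot(worst, label='worst')
plt.plot(mean, label='population mean')
plt.xlabel('generation')
plt.ylabel('cumulative reward')
plt.legend()
plt.show()
```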
pybullet also provides standard locomotion tasks similar to those in roboschool, such as Inverted Pendulum, Hopper, Walker, HalfCheetah, Ant, and Humanoid. I trained an Ant policy that reaches 3000 points in about an hour using PEPG on a multi-core machine, with a population of 256 agents:
```
python train.py bullet_ant -o pepg -n 64 -t 4
```
An example rollout on AntBulletEnv. With gym.wrappers.Monitor you can record rollouts to MP4 video.
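For instance, a recorded rollout might look roughly like this (using the classic gym Monitor API of that era; agent stands for any trained policy object with a get_action method, as in the rollout code earlier):

```python
import gym
from gym import wrappers
import pybullet_envs  # registers the Bullet gym environments such as AntBulletEnv-v0

env = gym.make('AntBulletEnv-v0')
env = wrappers.Monitor(env, '/tmp/ant_video', force=True)  # writes MP4 files here

obs = env.reset()
done, total_reward = False, 0
while not done:
    obs, reward, done, info = env.step(agent.get_action(obs))  # agent: any trained policy
    total_reward += reward
env.close()
print(total_reward)
```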
In this article, I explained how to use ES to train feed-forward neural network policies for various continuous-control RL tasks defined through the gym interface. I also described estool, which lets me quickly test ES algorithms with different settings on a distributed computing setup using MPI.
So far I have only described methods that train agents through trial and error, learning from scratch; this is called model-free reinforcement learning. In the next article (if I ever write it) I will talk about model-based learning, where the agent learns to use a previously trained model of the environment to perform the current task. And yes, I will again use evolutionary algorithms.
Estool
Stable or robust? What is the difference?
OpenAI Gym documentation
Evolutionary strategies as a scalable alternative to reinforcement learning
Edward: a library for probabilistic modeling, inference, and criticism
BipedalWalkerHardcore-v2
roboschool
pybullet
Emergent Complexity
GraspGAN
Source: https://habr.com/ru/post/343008/