Training a deep neural network with reinforcement in TensorFlow.js: tricks
Training deep neural networks from scratch is not an easy task.
It takes a lot of data and time, but a few tricks can speed up the process, which I will describe below.
A demonstration of the agent solving a simple maze using these tricks. Training time: 1 hour 6 minutes. The recording is sped up 8 times.
For each task you need to develop your own set of tricks to speed up network learning. I will share a few tricks that helped me train the network much faster.
For the theory, I recommend the sim0nsays channel. Here I will describe my own modest success in training neural networks.
Formulation of the problem
Approximate the target function with a deep neural network by minimizing a quadratic loss function via backpropagation of error.
I had a choice of training strategy: reward only the successful completion of the task, or reward the agent as it gets closer to completing the task.
I chose the second method for two reasons:
The probability that the network will ever reach the finish on its own is very small, so it would be doomed to receive mostly negative reinforcement. That would drive the neuron weights toward zero and leave the network incapable of further learning.
Deep neural networks are powerful. I do not rule out that the first method would have succeeded given enormous computing power and plenty of training time. I went the cheaper route and developed a few tricks instead.
Neural network architecture
The architecture was developed experimentally, based on the designer's experience and a bit of luck.
Architecture for solving the problem:
3 input neurons - the agent's coordinates and the value of the cell it stands on (normalized to the range 0 to 1).
2 hidden layers of 256 and 128 neurons (the layer dimensions decrease toward the network output).
1 dropout layer that randomly disables neurons, for training stability.
4 output neurons - the probabilities of choosing each direction for the next step.
Neuron activation function: sigmoid. Optimizer: adam.
The sigmoid outputs four probabilities in the range from 0 to 1; taking the maximum gives the direction of the next step: [jumpTop, jumpRight, jumpBottom, jumpLeft].
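A minimal sketch of this architecture in TensorFlow.js is shown below. The layer sizes, activations, and optimizer follow the description above; the dropout rate of 0.2 and the mean squared error loss are my assumptions based on the dropout layer and the quadratic loss mentioned in the problem statement.

```javascript
const tf = require('@tensorflow/tfjs');

// Sketch of the architecture described above.
function buildModel() {
  const model = tf.sequential();
  // 3 inputs: agent x, agent y, value of the current cell (all normalized to 0..1)
  model.add(tf.layers.dense({ inputShape: [3], units: 256, activation: 'sigmoid' }));
  model.add(tf.layers.dense({ units: 128, activation: 'sigmoid' }));
  // Dropout layer for training stability (rate 0.2 is an assumption)
  model.add(tf.layers.dropout({ rate: 0.2 }));
  // 4 outputs: probabilities for [jumpTop, jumpRight, jumpBottom, jumpLeft]
  model.add(tf.layers.dense({ units: 4, activation: 'sigmoid' }));
  // Quadratic loss + adam, as in the problem statement
  model.compile({ optimizer: 'adam', loss: 'meanSquaredError' });
  return model;
}
```

Taking the index of the largest of the four outputs gives the direction of the next step.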
Architecture development
Overfitting occurs with overly complex models.
The network memorizes the training data and performs poorly on new data it has not seen, because it never had to look for generalizations: it had enough capacity to simply memorize.
Underfitting occurs with models that are not complex enough, or when the network had too little training data to find generalizations.
Conclusion: the more layers and neurons in them, the more data is needed for training.
Playing field
Rules of the game
0 - Entering this cell destroys the agent.
1..44 - Cell values increase with each step: the farther the agent has gone, the greater the reward it receives.
45 - The finish. Training happens only when all agents have been destroyed; reaching the finish is the exception, which simply uses the already trained network for the next run from the very beginning of the maze.
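A sketch of how these rules could translate into code. The function name and the agent object fields are mine, for illustration only.

```javascript
// Hypothetical helper: map the value of the cell the agent stepped on
// to the outcome described by the rules above.
const FINISH = 45;

function applyCellRules(agent, cellValue) {
  if (cellValue === 0) {
    agent.alive = false;      // entering this cell destroys the agent
  } else if (cellValue === FINISH) {
    agent.finished = true;    // finish: no training, just restart from the maze entrance
  } else {
    agent.reward = cellValue; // 1..44: the farther the agent got, the larger the reward
  }
  return agent;
}
```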
Description of parameters
The agent has "antennae" in four directions around it; they act as environment sensors and, together with the agent's coordinates and the value of the cell it stands on, form the description of its state.
This description is used to predict the agent's next direction of movement. In other words, the agent scans ahead so that the network learns to move toward increasing cell values and to stay within the bounds of permitted moves.
The goal of the neural network: earn more reward. The goal of training: reward correct actions; the closer the agent gets to completing the task, the higher the reward for the network.
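A possible sketch of turning the agent's description into the three normalized network inputs. The field dimensions, the maximum cell value, and the helper name are assumptions for illustration.

```javascript
// Hypothetical encoding of the agent's state into the 3 network inputs.
// FIELD_WIDTH, FIELD_HEIGHT and MAX_CELL_VALUE are assumptions about the maze size.
const FIELD_WIDTH = 10;
const FIELD_HEIGHT = 10;
const MAX_CELL_VALUE = 45;

function encodeState(agent, field) {
  const cellValue = field[agent.y][agent.x];
  // Normalize everything to the 0..1 range, as the architecture requires
  return [
    agent.x / (FIELD_WIDTH - 1),
    agent.y / (FIELD_HEIGHT - 1),
    cellValue / MAX_CELL_VALUE,
  ];
}
```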
Tricks
The first attempts at training without these tricks took several hours and still did not reach completion. With the techniques below, the result was achieved in just one hour and six minutes!
Agent looping
During training, the network started making moves back and forth between two cells, the classic exploitation problem. Both moves give a positive reward, which stopped the exploration of the maze and kept the network stuck in a local minimum.
The first attempted solution was to limit the number of agent moves, but this was not optimal, since the agent spent a lot of time looping before self-destructing. The better solution was to destroy the agent if it moved to a cell with a lower value than the one it stood on, effectively forbidding backward moves.
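A sketch of this anti-looping rule; the function and field names are illustrative.

```javascript
// Destroy the agent if it steps onto a cell with a lower value than the
// one it is standing on: this forbids moving backwards and breaks loops.
function moveAgent(agent, field, nextX, nextY) {
  const currentValue = field[agent.y][agent.x];
  const nextValue = field[nextY][nextX];
  if (nextValue < currentValue) {
    agent.alive = false;   // backward move: self-destruct immediately
    return agent;
  }
  agent.x = nextX;
  agent.y = nextY;
  agent.reward = nextValue;
  return agent;
}
```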
Explore or exploit
To explore the paths around the agents' current positions, a simple trick was used: at every step, 5 of the agents become "volunteer" explorers. Their moves are chosen randomly rather than by the neural network's prediction.
This increases the likelihood that one of the five will advance farther than the rest and help train the network on better results.
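A sketch of the volunteer-explorer trick. The article only says that 5 agents per step move randomly; how those 5 are chosen (here, uniformly at random) is my assumption.

```javascript
// At each step, 5 of the agents move randomly instead of following the
// network's prediction, to keep exploring the maze.
const EXPLORERS_PER_STEP = 5;
const DIRECTIONS = ['jumpTop', 'jumpRight', 'jumpBottom', 'jumpLeft'];

function chooseDirections(agents, predictions) {
  // Pick 5 random indices to be explorers this step (assumption: uniform choice)
  const explorers = new Set();
  while (explorers.size < Math.min(EXPLORERS_PER_STEP, agents.length)) {
    explorers.add(Math.floor(Math.random() * agents.length));
  }
  return agents.map((agent, i) => {
    if (explorers.has(i)) {
      return DIRECTIONS[Math.floor(Math.random() * DIRECTIONS.length)];
    }
    // Otherwise take the direction with the maximum predicted probability
    const best = predictions[i].indexOf(Math.max(...predictions[i]));
    return DIRECTIONS[best];
  });
}
```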
Genetic algorithm
Each generation (epoch) on the playing field involves 500 agents. Prediction is performed for all agents at once as a single batch, and the computation is delegated to the GPU. This makes more efficient use of the computer's computing power and reduces the time needed to run the network's prediction for all 500 agents simultaneously.
Prediction is faster than training, so the network is more likely to advance further through the maze in less time and with a better result.
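A sketch of running the prediction for all 500 agents as one batch; TensorFlow.js delegates the tensor math to the GPU via WebGL. The encodeState helper is the hypothetical one from the earlier sketch.

```javascript
const tf = require('@tensorflow/tfjs');

// Predict moves for all agents at once: one [500, 3] tensor instead of
// 500 separate calls, so the work is done on the GPU in a single pass.
async function predictAll(model, agents, field) {
  const states = agents.map((agent) => encodeState(agent, field)); // 500 x 3
  const input = tf.tensor2d(states);           // shape [500, 3]
  const output = model.predict(input);         // shape [500, 4]
  const probabilities = await output.array();  // back to plain JS arrays
  input.dispose();
  output.dispose();
  return probabilities;                        // one [top, right, bottom, left] per agent
}
```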
Training on the best in a generation
Throughout the generation, the progress of all 500 agents through the maze is recorded. When the last agent is destroyed, the top 5 agents out of 500 are selected: those that advanced farthest through the maze.
The neural network is then trained on the results of the best agents of the generation.
This also reduces memory usage, since we neither store nor train on agents that do not help the network advance.
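A sketch of training on the best agents of a generation. The structure of the per-agent history and the way targets are built (one-hot vectors of the moves the best agents actually made) are assumptions; the article only says the network is trained on the top 5 of 500.

```javascript
const tf = require('@tensorflow/tfjs');

// After all 500 agents are destroyed, keep only the 5 that got farthest
// and train the network on the moves they made.
const TOP_AGENTS = 5;

async function trainOnBest(model, history) {
  // history: per-agent records { progress, states: [[x, y, cell], ...], moves: [0..3, ...] }
  const best = [...history]
    .sort((a, b) => b.progress - a.progress)
    .slice(0, TOP_AGENTS);

  const xs = [];
  const ys = [];
  for (const record of best) {
    record.states.forEach((state, i) => {
      xs.push(state);
      // One-hot target for the move the agent actually made (assumed target encoding)
      const target = [0, 0, 0, 0];
      target[record.moves[i]] = 1;
      ys.push(target);
    });
  }

  await model.fit(tf.tensor2d(xs), tf.tensor2d(ys), { epochs: 1, shuffle: true });
}
```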
Conclusion
Not being a specialist in this field, I still managed to achieve some success in training a neural network, and so can you. Go for it!
Strive to learn faster than computers, while we still do it better than they do.