
Reinforcement learning with neural networks. Theory

I previously published the article "The Problem of 'Two or More Teachers'. First Strokes", trying to show one difficult unsolved problem. But those first strokes turned out to be a bit hard to follow, so I decided to chew through a little theory for readers. Alas, these topics are now apparently taught in a somewhat stereotyped way, as if each task has its own fixed set of methods.

So I was told: for the classification task there are neural networks (supervised learning); genetic algorithms are unsupervised learning, a clustering task; and there is also reinforcement learning (Q-learning), the task of an agent that wanders around and does something. And many people judge by exactly such patterns.

Let's try to figure out what neural networks can contribute to a task that, as some say, they cannot solve - namely, reinforcement learning.
And at the same time we will analyze the dissertation of M. S. Burtsev, "Investigation of New Types of Self-Organization and the Emergence of Behavioral Strategies", in which unpretentious neural networks were applied, quite elegantly, to the reinforcement learning problem.

Theory

Genetic algorithms and reinforcement learning methods (for example, Q-learning) share one significant problem: you need to specify a fitness function, and it is given explicitly by a formula. Sometimes it is given indirectly, so it seems that there is no explicit function (we will return to this when analyzing Burtsev's dissertation). And all the "wonders" of agent behavior come only from this formula.

But what is a formula? A function, i.e. a mapping of inputs to outputs. And what does a neural network / perceptron do? Exactly that: it learns to map inputs to outputs.

Take a deliberately simplified theoretical example. An agent is some kind of organism that wants to survive. To do so, it needs to eat. It hunts two kinds of animals - say, hares and mice. Accordingly, it has two input parameters: the number of kilograms of hare meat and of mouse meat eaten :). The organism wants to estimate how full it is. Then, depending on how full it is, it can run with more or less speed and eagerness. But that is another task; here we will focus on estimating satiety.

We have no way to evaluate satiety other than through the kilograms of both. Therefore, the first natural estimate is simply to add the kilograms, i.e. we introduce a simple fitness function c = a + b. Fine, but with such a rigid, hard-coded evaluation function we cannot later correct our behavior.

This is where a neural network comes in. It is first taught to add these two numbers. After training, the network can add accurately, and the agent uses its output to understand how full it is.

But then misfortune strikes: the agent estimated its satiety at 7 points after eating 4 kilos of hare and 3 of mouse, ran off thinking it was well fed, overexerted itself and almost died. It turned out that 4 kilos of hare plus 3 of mouse is not the same as 7 kilos of hare: mouse meat does not give the same satiety, and in fact the sum should be taken as 4 + 3 = 6. The agent feeds this conclusion back into the neural network as exact data. The network retrains, and its fitness function is no longer simple addition; it takes on a completely different form. Thus, having a neural network, we can adjust the fitness function - something we cannot do in the other algorithms.
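
This two-stage story can be sketched with a minimal one-neuron "network" (a hypothetical illustration of the idea, not code from the article): first it learns plain addition by gradient descent, then the corrected sample about mouse meat is fed back and it retrains.

```python
import numpy as np

# A minimal hypothetical sketch: the satiety "network" is one linear neuron
# c = w1*a + w2*b, trained by per-sample gradient descent on squared error.

def train(samples, epochs=2000, lr=0.01):
    w = np.zeros(2)
    for _ in range(epochs):
        for x, target in samples:
            x = np.asarray(x, dtype=float)
            err = float(w @ x) - target   # prediction error
            w -= lr * err * x             # gradient step
    return w

# Stage 1: teach the network plain addition, c = a + b.
addition = [((a, b), float(a + b)) for a in range(5) for b in range(5)]
w = train(addition)
print(round(float(w @ [4, 3]), 2))        # estimates 4 kg + 3 kg as ~7

# Stage 2: experience shows 4 kg of hare + 3 kg of mouse gives only 6 points
# of satiety; the corrected sample is fed back and the network retrains.
corrected = addition + [((4, 3), 6.0)] * 50
w2 = train(corrected)
print(float(w2 @ [4, 3]))                 # the retrained estimate drops below 7
```

After retraining, the mapping is no longer pure addition: the weights shift so that the (4, 3) input maps closer to 6 while the other points are still fit approximately.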

You may object that in the other algorithms you simply should have chosen something other than addition. But what exactly? You have no suitable parameters or principles from which to derive the desired dependency. You simply could not formalize such a task, because you could not describe the state space.

Practice on the example of M. S. Burtsev

His model environment is as follows:

A population P of agents A lives in a one-dimensional cellular environment closed into a ring. In each cell, with some probability, a resource appears that agents need in order to perform actions. A cell can hold only one agent at a time. A descendant agent can appear only as a result of the crossing of two parent agents. Each agent has 9 inputs:

1, 2, 3 - 1 if there is a resource in the corresponding cell of the field of view (left, current, right), 0 otherwise;
4, 5 - 1 if there is an agent in the left / right cell, 0 otherwise;
6, 7 - the crossing motivation of the left / right neighbor;
8 - the agent's own motivation to search for food;
9 - the agent's own motivation to cross.

The motivations for crossing and for searching for food are determined by the ratio involving two coefficients chosen by the experimenter: r0, the value of the internal resource required for saturation, and r1, the value required for crossing.

There are seven actions: cross with the neighbor on the right, cross with the neighbor on the left, jump, move one cell to the right, move one cell to the left, consume the resource, and rest.
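
The environment described above can be sketched in code. This is my own illustrative reconstruction; the names, the ring representation, and especially the motivation formulas (some ratio built from r0, r1 and the agent's internal resource) are assumptions, not the dissertation's actual code.

```python
from enum import Enum

# Illustrative sketch of the model environment described above.
# Names and the motivation formulas are my assumptions.

class Action(Enum):
    CROSS_RIGHT = 0
    CROSS_LEFT = 1
    JUMP = 2
    MOVE_RIGHT = 3
    MOVE_LEFT = 4
    CONSUME = 5
    REST = 6

def sensor_vector(ring, agents, pos, r0, r1, resource_level):
    """Build the 9-component input vector for the agent at cell `pos`.

    ring: list of bools, True if the cell holds a resource (closed ring).
    agents: dict cell -> that agent's crossing motivation.
    """
    n = len(ring)
    left, right = (pos - 1) % n, (pos + 1) % n      # ring topology
    food_motivation = r0 / (resource_level + 1.0)   # assumed form
    cross_motivation = resource_level / r1          # assumed form
    return [
        1 if ring[left] else 0,       # 1-3: resource left / here / right
        1 if ring[pos] else 0,
        1 if ring[right] else 0,
        1 if left in agents else 0,   # 4-5: agent in left / right cell
        1 if right in agents else 0,
        agents.get(left, 0.0),        # 6-7: neighbors' crossing motivation
        agents.get(right, 0.0),
        food_motivation,              # 8: own motivation to find food
        cross_motivation,             # 9: own motivation to cross
    ]

ring = [True, False, False, True, False]   # resources in cells 0 and 3
agents = {1: 0.4}                          # a neighbor at cell 1
v = sensor_vector(ring, agents, pos=0, r0=2.0, r1=4.0, resource_level=1.0)
print(v)
```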

How it was solved:

Initially an ANN was set up; of course, it is hard to call it a network, since it has no hidden layer, but let it be. Its coefficients determine how the inputs are related to the actions (outputs). Then a genetic algorithm was applied, which adjusted the neural network's coefficients and rejected organisms with bad behavior.
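
A minimal sketch of that scheme (hypothetical code, not Burtsev's): the single-layer network is just a 9x7 weight matrix, and a toy genetic algorithm mutates the weights, rejecting the worse half of the population each generation.

```python
import numpy as np

# A hypothetical sketch: the "network" is a 9x7 weight matrix mapping the
# 9 inputs to scores for the 7 actions; a genetic algorithm mutates the
# weights and keeps the fittest individuals each generation.

rng = np.random.default_rng(0)
N_IN, N_ACT, POP = 9, 7, 20

def act(weights, inputs):
    """The single-layer network: pick the action with the highest score."""
    return int(np.argmax(weights.T @ inputs))

def evolve(fitness, generations=30):
    pop = [rng.normal(size=(N_IN, N_ACT)) for _ in range(POP)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # rank by behavior quality
        elite = pop[: POP // 2]                      # reject the worse half
        children = [elite[rng.integers(len(elite))] +
                    0.1 * rng.normal(size=(N_IN, N_ACT))   # mutated copies
                    for _ in range(POP - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

# Toy fitness: reward networks that consume (action index 5) when input 2
# signals a resource in the agent's own cell.
x_food_here = np.zeros(N_IN)
x_food_here[1] = 1.0
def toy_fitness(w):
    scores = w.T @ x_food_here
    return scores[5] - scores.max()   # 0 iff "consume" already wins

best = evolve(toy_fitness)
print(act(best, x_food_here))         # the evolved net picks action index 5
```

Here the toy fitness only rewards consuming when food is present; in the dissertation itself, of course, fitness is implicit in survival and reproduction rather than given by a single formula.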

In fact, genetic algorithms are not really needed here. One could just as well randomly enumerate different strategies of behavior and fix the suitable ones by correcting the neural network.

The real problem with Q-learning and genetic algorithms is that new behavior is searched for randomly. And randomness means an equiprobable search over the entire space of possible states, i.e. there is no directed search. If the space of possible states is large, even elementary strategies will never be found.
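
The scale of this problem is easy to feel with a toy experiment (my own illustration, not from the article): blind random search for a single target strategy among 2**n binary strategies needs on the order of 2**n guesses, so the search time explodes with strategy length.

```python
import random

# Toy illustration: count how many random guesses it takes to hit one
# target strategy among 2**n binary strategies. The expected count is 2**n.

random.seed(1)

def trials_to_find(n_bits):
    """Count random guesses until the target n-bit strategy is hit."""
    target = tuple(random.getrandbits(1) for _ in range(n_bits))
    trials = 1
    while tuple(random.getrandbits(1) for _ in range(n_bits)) != target:
        trials += 1
    return trials

def avg_trials(n_bits, runs=200):
    return sum(trials_to_find(n_bits) for _ in range(runs)) / runs

small, big = avg_trials(4), avg_trials(10)
print(small, big)   # averages near 2**4 = 16 and 2**10 = 1024
```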

Therefore, in general, different strategies should be enumerated not randomly but purposefully (but that should be discussed in the next article, once the first strokes are understood :)).

Conclusion

I went through this quickly, because only the conclusions are important to us here, not the details. But you may want to read more about it yourself.

As a result, the agent's fitness function is represented by a neural network - both in the theory above and in Burtsev's dissertation.

But we must distinguish it from the fitness function of the environment. Here the latter is given indirectly, through the coefficients r0 and r1. Moreover, this environment fitness function is known to the experimenter. Therefore, there is no trickery when Burtsev discovers that the agent's fitness function begins to approach the fitness function of the environment.

In the article "The problem of 'two or more teachers'. First strokes" our situation is worse: even the experimenter does not know the fitness function of the environment, or, more precisely, the form of this function. That is exactly the problem - it is not given even indirectly. Yes, the final fitness function is known, the maximum of money, but inside it there is a subfunction that is not known. The agent's fitness function must strive toward this subfunction, and it cannot be computed.

Hopefully things are now a little clearer.

Source: https://habr.com/ru/post/148830/

