TL;DR: Reinforcement learning (RL) has always been hard. Don't worry if standard deep learning techniques don't crack it.
The article by Alex Irpan does a good job of outlining many of the current problems of deep RL. But most of them are not new: they have always existed. In fact, these are the fundamental problems that have underpinned RL since its inception.
In this article, I hope to get two points across:
- Most of the flaws described by Alex come down to two core problems of RL.
- Neural networks help solve only a small part of those problems, while creating new ones of their own.
Note: this article in no way disproves Alex's claims. On the contrary, I support most of his conclusions, and I believe that researchers should explain the existing limitations of RL more clearly.
Two main problems of RL
At the highest level, reinforcement learning is the maximization of some form of long-term return from acting in a given environment. There are two fundamental difficulties in solving RL problems: the exploration-versus-exploitation trade-off and long-term credit assignment.
As noted on the first page of the first chapter of Sutton and Barto's book on reinforcement learning, these problems are unique to reinforcement learning.
There are related flavors of the basic RL problems that come with scary monsters of their own, such as partial observability, multi-agent environments, learning from and with humans, and so on. We will set all of that aside for now.
The permanent state of the RL researcher. [Caption: "This is normal"]
Supervised learning, on the other hand, deals with the problem of generalization. Generalization means assigning labels to unseen data, given that we already have a pile of seen, labeled data. Parts of the fundamental problems of RL can be solved by good generalization: if the model generalizes well to unseen states, then not as much exploration is needed. This is where deep learning usually comes in.
As we will see, reinforcement learning is a different and fundamentally harder problem than supervised learning. There is nothing strange about the fact that an extremely successful supervised learning method such as deep learning does not fully solve all of its problems. In fact, while deep learning improves generalization, it brings in demons of its own.
What is really strange is the surprise at the current limitations of RL. That DQN fails over long horizons, or needs a million steps of interaction with the environment to learn, is nothing new, and it is not some mysterious quirk of deep reinforcement learning. It follows from the very nature of the problem, and it has always been this way.
Let's take a closer look at these two fundamental problems, and then it will become clear that there is nothing surprising about reinforcement learning not working yet.
Exploration vs. exploitation
Sample inefficiency, reproducibility, and escaping local optima
From the very beginning, every agent has to learn to answer one question: keep following the strategy that already gives good results, or take some relatively suboptimal actions that might increase the payoff in the future? The question is so hard because there is no single right answer; there is always a trade-off.
A good start
The Bellman equations guarantee convergence to the optimal value function only if every state is visited infinitely many times and every action is tried in it infinitely many times. So, from the very start, we need an infinite number of training samples, and we need them everywhere!
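For reference, here is the condition in symbols. This is the standard Bellman optimality equation from Sutton and Barto, not something specific to this article; tabular methods converge to its fixed point only when every state-action pair keeps being visited:

```latex
Q^{*}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,\Big[ R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \Big]
```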
You might ask: “Why dwell so much on optimality?”
Fair enough. In most cases, if a successful strategy is found relatively quickly and does not break too many things along the way, that is enough. In practice we are sometimes happy simply that a good policy can be learned in a finite number of steps (20 million is a lot less than infinity). But it is hard to pin down these subjective notions without a number to maximize or minimize, and even harder to guarantee anything. More on that later.
So, let's agree that we would be happy with an approximately optimal solution (whatever that means). The number of samples needed to reach the same quality of approximation grows exponentially with the size of the state and action spaces.
But hey, it gets worse
If you make no assumptions, the best way to explore is at random. You can add heuristics such as curiosity, and in some cases they work well, but so far we have no complete solution. In the end, you have no reason to believe that an action in a certain state will bring more or less reward until you try it.
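Without extra assumptions, "try it and see" usually means something like epsilon-greedy action selection. A minimal sketch, assuming a dict-backed value table `Q`; the names and the 10% exploration rate are illustrative choices, not anything prescribed by the article:

```python
import random

EPSILON = 0.1  # probability of taking a purely random, exploratory action

def epsilon_greedy(Q, state, actions):
    """Explore at random with probability EPSILON, otherwise exploit.

    Q is assumed to be a dict mapping (state, action) pairs to value estimates.
    """
    if random.random() < EPSILON:
        return random.choice(actions)  # explore: try something we have no reason to trust yet
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit the current estimates
```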
Moreover, model-free reinforcement learning algorithms usually try to solve the problem in its most general form: we make few assumptions about the underlying distributions, the environment's transition dynamics, or the optimal policy (see, for example, this paper).
And that makes sense. A single large reward does not mean you will get it every time you take the same actions in that state. The only sensible behavior is not to trust any particular reward too much and to slowly update your estimate of how good that action is in that state.
So you make small, conservative updates to functions that try to approximate expectations of arbitrarily complex probability distributions across an arbitrarily large number of states and actions.
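Concretely, those small, conservative updates look like a one-step temporal-difference rule with a small learning rate, so that no single (possibly lucky) reward moves the estimate very far. A sketch under the usual tabular Q-learning assumptions; the constants are arbitrary:

```python
ALPHA = 0.05  # small learning rate: trust any single sample only a little
GAMMA = 0.99  # discount factor

def q_update(Q, s, a, r, s_next, actions):
    """Nudge Q(s, a) toward the bootstrapped target r + GAMMA * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + GAMMA * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (target - Q.get((s, a), 0.0))
```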
But hey, it really gets worse.
Let's talk about continuous states and actions.
At our scale, the world looks mostly continuous. But for RL this is a problem. How do you visit an infinite number of states an infinite number of times and try an infinite number of actions an infinite number of times in each of them? Only by generalizing what you have already learned to unseen states and actions. Supervised learning to the rescue!
Let me explain a little.
Generalization in RL means function approximation. Function approximation captures the idea that states and actions can be fed into a function that computes their values, so there is no need to store the value of every state and action in a giant table. You fit a function to data, which is essentially supervised learning. Mission accomplished.
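In deep RL, "a function that computes their values" is typically a small neural network standing in for the giant table. A minimal sketch in PyTorch; the architecture and sizes are placeholder choices for illustration:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one estimated value per discrete action,
    replacing a state-action lookup table with a learned function."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# q = QNetwork(state_dim=4, n_actions=2)  # e.g. a CartPole-sized problem
```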
Not so fast
Even this is not straightforward in RL.
To begin with, let's not forget that neural networks have a sample inefficiency of their own, due to the slow pace of gradient descent.
But hey, the situation is actually even worse.
In RL, the training data for the network arrives on the fly, during interaction with the environment. As exploration proceeds and the collected data changes, the estimate of the Q-value function changes too.
Unlike supervised learning, the ground-truth labels here are not fixed! Imagine that at the start of ImageNet training an image is labeled as a cat, but as training goes on the label keeps changing to dog, car, tractor, and so on. The only way to get closer to the true objective is to keep exploring.
In fact, even in the training set you will never see samples of the true objective, which is the optimal value function or policy. And yet you are still able to learn!
That, in fact, is a big part of why reinforcement learning is so appealing.
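To make the "moving label" problem concrete: in value-based deep RL the regression target is itself computed from the network (or a slowly updated copy of it), so the labels shift as learning and exploration proceed. A generic sketch of that bootstrapped target, assuming the QNetwork sketched above and batched float tensors; this is the standard idea, not a specific published implementation:

```python
import torch

def td_targets(q_frozen, rewards, next_states, dones, gamma=0.99):
    """Compute r + gamma * max_a' Q(s', a'), the "label" the network regresses on.

    q_frozen is a periodically updated copy of the Q-network; unlike a fixed
    ImageNet label, this target changes whenever that network changes.
    rewards and dones are assumed to be float tensors of shape (batch,).
    """
    with torch.no_grad():
        next_values = q_frozen(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_values
```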
So now we have two very unstable things that must both be changed slowly to prevent total collapse. Rapid exploration can cause sudden shifts in the target landscape that the network is so painstakingly trying to fit. This double blow from exploration and network training leads to higher sample complexity than in ordinary supervised learning tasks.
The unstable dynamics of exploration also explain why RL is more sensitive to hyperparameters and random seeds than supervised learning. There is no fixed dataset on which the neural network is trained: the training data depends directly on the network's own outputs, the exploration mechanism used, and the randomness of the environment. So the same algorithm on the same environment can produce completely different training sets in different runs, which leads to a strong difference in performance. And again, generalization (the supervised part of the story) relies on seeing similar state distributions, while the most general algorithms make no assumptions about this.
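One practical takeaway (my own illustration, not from the article): any honest comparison has to average over several random seeds rather than quote a single run. A tiny sketch, where `train_and_evaluate` is a hypothetical stand-in for whatever algorithm and environment are being benchmarked:

```python
import statistics

def report_over_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the same algorithm with different seeds and summarise the spread."""
    returns = [train_and_evaluate(seed=s) for s in seeds]
    mean = statistics.mean(returns)
    spread = statistics.stdev(returns)
    print(f"mean return {mean:.1f} +/- {spread:.1f} over {len(seeds)} seeds")
    return mean, spread
```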
But wait! The situation becomes even ...
For continuous action spaces, on-policy methods are the most popular. These methods can only use samples generated under the policy currently being followed, which means that as soon as you update that policy, all the experience collected in the past immediately becomes unusable. Most of the algorithms mentioned in connection with the strange yellow humanoids and the animals made of bundles of tubes (MuJoCo) fall into the on-policy category.
The tube-model cheetah.
On the other hand, off-policy methods can learn the optimal policy while observing the execution of any other policy. That is obviously much better, but unfortunately we are still not very good at it.
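The practical difference is what happens to old experience. A schematic sketch of the two regimes (not any particular algorithm): an on-policy learner must discard its batch after every policy update, while an off-policy learner can keep replaying transitions gathered under older policies.

```python
import random
from collections import deque

class OnPolicyBuffer:
    """Holds only data gathered by the current policy; emptied after each update."""

    def __init__(self):
        self.batch = []

    def add(self, transition):
        self.batch.append(transition)

    def consume(self):
        data, self.batch = self.batch, []  # once the policy changes, old samples are unusable
        return data

class ReplayBuffer:
    """Off-policy: transitions from any past policy remain reusable."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, k=32):
        return random.sample(list(self.memory), min(k, len(self.memory)))
```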
But wait!
No, actually that is it for now. It does get worse after all, but that is for the next chapter.
It is starting to look simple.
Summing up: all of these issues arise from the central problem of reinforcement learning, and more broadly of all AI systems: exploration.
RainbowDQN needs 83 hours of training because it has no prior knowledge of what a video game is, that enemies shoot bullets at you, that bullets are bad, that the bunch of pixels on the screen that always move together is a bullet, that bullets exist in the same world as other objects, or that the world is organized according to some principles rather than just being a maximum-entropy distribution. All these priors help us humans drastically narrow exploration down to a small set of high-quality states. DQN has to learn all of this through random exploration. The fact that after training it can beat real masters and surpass centuries of accumulated game wisdom, as in the case of AlphaZero, still seems astonishing.
Long-term credit assignment
Reward functions, their design, and credit assignment
Do you know how some people only scratch lottery tickets with their lucky coin, because they once did it that way and won a lot of money? RL agents essentially play the lottery at every step, trying to figure out exactly what they did to hit the jackpot. They maximize a single number that results from actions taken over many steps, mixed with a large dose of environmental randomness. Figuring out which specific actions actually produced the high reward is the credit assignment problem.
Rewards are supposed to be easy to specify. The promise of reinforcement learning is that you tell the robot when it has done the right thing, and over time it safely learns the right behavior. You do not actually need to know the correct behavior yourself, and you do not need to provide supervision at every step.
In reality, the problem arises because the time scales of rewards in intelligent tasks are far longer than today's algorithms can handle. The robot operates on a much tighter time scale: it has to regulate the velocity of every joint every millisecond, while the human rewards it only once it has made a good sandwich. Many events happen between those rewards, and if the gap between an important choice and the reward is too large, any modern algorithm will simply fail.
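To see the scale mismatch in code, here is a deliberately toy episode loop: thousands of millisecond-level control steps, with a nonzero reward only at the very end. Everything here (`env`, `policy`, the three-value step interface) is a hypothetical placeholder; the point is only how thin the learning signal is.

```python
def run_sparse_episode(env, policy, max_steps=10_000):
    """One sparse-reward episode: the reward is 0.0 on almost every step."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # a joint-velocity command every "millisecond"
        state, reward, done = env.step(action)  # reward stays 0.0 until the sandwich is judged
        total += reward
        if done:
            break
    return total  # e.g. 1.0 for a good sandwich, 0.0 otherwise
```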
There are two options. One is to shrink the time scale of the rewards, that is, to hand them out more densely and more often. But as usual, if you show an optimization algorithm a weak spot, it will start exploiting it relentlessly. If the reward is not carefully designed, this can end in reward hacking.
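A toy illustration of how densifying the reward invites exploitation (my own example, not one from the article): pay the agent a small bonus just for being near the goal, and a reward maximizer can hover inside that radius forever, collecting bonuses without ever finishing the task.

```python
import math

GOAL = (10.0, 10.0)
NEAR_RADIUS = 1.0  # how close counts as "making progress"

def shaped_reward(position, reached_goal):
    """Naively densified reward: a per-step bonus for merely being near the goal.

    Over a long episode the hovering bonus can outweigh the terminal reward,
    which is exactly the kind of exploit meant by "reward hacking".
    """
    bonus = 1.0 if math.dist(position, GOAL) < NEAR_RADIUS else 0.0
    terminal = 100.0 if reached_goal else 0.0
    return bonus + terminal
```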
Ultimately, we fall into this trap because we forget that the agent optimizes over the entire value landscape, not just the immediate reward. So even if the structure of the immediate rewards seems harmless, the overall landscape can be unintuitive and full of such exploits unless it is specified carefully enough.
This begs the question: why use rewards in the first place? Rewards are a way of specifying goals so that optimization machinery can be harnessed to produce good policies. Reward shaping is a way of injecting more domain-specific knowledge from above.
Is there a better way to specify goals? In imitation learning, you can cleverly sidestep the whole RL problem by requesting labels directly from the target distribution, i.e. the optimal policy. There are other ways of learning without a direct reward; for example, it is possible to give agents goals in the form of images (do not miss the ICML workshop on goal specification in RL!).
Another promising way to cope with long horizons (strongly delayed rewards) is hierarchical reinforcement learning. I was surprised that Alex did not mention it in his article, because it is the most intuitively appealing solution to the problem (although I may be biased here!).
Hierarchical RL tries to decompose a long-horizon task into a series of goals and subtasks. By decomposing the problem, we effectively stretch the time scale at which decisions are made. The really interesting things happen when the policies learned for subtasks turn out to be applicable to other goals.
In general, the hierarchy can be as deep as you like. The canonical example is a trip to another city. The first choice is whether to go at all. After that, you have to decide how each stage of the journey will be carried out: boarding the train to the airport, the flight, and the taxi ride to the hotel seem like reasonable stages. For the train stage we can single out subtasks such as checking the timetable and buying tickets. Calling a taxi involves a whole lot of movements to pick up the phone and set your vocal cords vibrating.
A legitimate question in RL research.
Although a bit simplistic, this is a compelling example in the old-school spirit of the 1990s. A single scalar reward for reaching the destination city can be propagated through the Markov chain to the different levels of the hierarchy.
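Schematically (this is the general idea, not a particular published hierarchical RL algorithm), the control loop has two time scales: a high-level policy picks a subgoal such as "board the train to the airport", and a low-level policy issues primitive actions until that subgoal is done. All the callables below are hypothetical placeholders.

```python
def run_hierarchy(env, high_level_policy, low_level_policy, max_subgoals=10):
    """Two-level control loop: subgoals on the long time scale, actions on the short one."""
    state = env.reset()
    for _ in range(max_subgoals):                # long time scale: choose the next stage of the trip
        subgoal = high_level_policy(state)
        subgoal_done = False
        while not subgoal_done:                  # short time scale: primitive actions toward the subgoal
            action = low_level_policy(state, subgoal)
            state, reward, subgoal_done, task_done = env.step(action)
        if task_done:
            break
    return state
```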
The hierarchy promises great benefits, but we are still far from realizing them. Most state-of-the-art systems consider hierarchies only one level deep, and transferring the acquired knowledge to other tasks remains hard to achieve.
Conclusion
My conclusion is generally the same as that of Alex.
I am very pleased that there is so much activity in this area and that we have finally taken on the problems I have always wanted to solve. Reinforcement learning has finally moved beyond primitive simulators!
Don't panic!
I want to add only one thing: do not despair if standard deep learning methods fail to slay the monsters of reinforcement learning. Reinforcement learning has two fundamental difficulties that supervised learning lacks: exploration and long-term credit assignment. They have always been there, and solving them will take more than a really good function approximator. We need much better ways of exploring, reusing samples from past exploration, transferring experience between tasks, learning from other agents (including people), acting at different time scales, and solving hard problems from a scalar reward.
Despite the extremely hard problems in RL, I still think it is the best framework we have today for developing strong artificial intelligence. Otherwise I would not be working on it. When DQN played Atari from raw visual input, and when AlphaGo defeated the world champion in Go, we were actually watching small steps on the path toward strong AI.
I am excited about the future of reinforcement learning and artificial intelligence.