Deep Reinforcement Learning (or what DeepMind was bought for)
I continue the story of DeepMind's successes. This post is about their first public achievement: an algorithm that learns to play Atari games knowing nothing about the games except the pixels on the screen.
Here, in fact, is the main artifact (if you haven't seen this video yet, go watch it right now, it is mind-blowing).
That is roughly all that was publicly known about the company when it was bought for half a billion dollars. Disclaimer: the post is based on lightly edited chat logs from closedcircles.com, hence the conversational style and the clarifying questions.
The seminal paper, with the details of how it works: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
So, a quick and informal introduction to Q-learning.
Suppose we are in some environment that has a current state s, and at each moment we can perform an action a from a discrete set. The action moves the system to a new state (stochastically, i.e. there can be randomness in the system) and may give us a reward or end the game.
The concept of cumulative return over time is introduced: the total reward we can collect from the current moment until the end of the game, with future rewards discounted by a factor γ per time tick. A reward r received on the next step counts as γ·r, a reward two steps ahead as γ²·r, and so on.
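To make the discounting concrete, here is a tiny sketch; the rewards list and the value of gamma are purely illustrative, not anything from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards from the current moment to the end of the game,
    each one discounted by gamma per time step."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# e.g. a reward of 1 arriving two steps from now is worth 0.99**2:
# discounted_return([0, 0, 1]) == 0.9801
```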
So, optimal behavior can be described by a certain function Q*(s, a) that returns a single number. The main thing is to understand what that number means!
It means the following: if we are in state s and take action a, what cumulative reward will we get until the end of the game, provided that after action a we act optimally? If such a function were handed to us from above, playing optimally with it would be easy: evaluate Q*(s, a) for every a and choose the action with the maximum value.
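In code that greedy rule is just an argmax over the actions; a sketch, where `q_star` is a hypothetical oracle for the optimal Q-function and `actions` is the discrete action set:

```python
def greedy_action(q_star, state, actions):
    # Try Q*(s, a) for every a and pick the best one.
    return max(actions, key=lambda a: q_star(state, a))
```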
Is that clear to everyone? I'll go put the kettle on. Fine, your problem.
Next, there is a key recurrence relation for this imaginary Q* function, the so-called Bellman equation. It is beautiful, and it looks roughly like this:

Q*(s, a) = E_{s'}[ r + γ · max_{a'} Q*(s', a') ]

What does it say? It says that if we know with what probabilities the action a takes state s into the next state s', then Q* for the current state s must equal the immediate reward plus the discounted maximum of Q* over the choice of the next action a' in that next state.
In other words: if we know the future, then to figure out what return taking action a in the current state will give, we can look into that future, see which actions are optimal there, and from that compute the cumulative reward.
This expression can, of course, be turned into an iterative improvement step for the function. Once again: if we know from which state into which the system (probabilistically) passes under particular actions, we can use that to improve our current approximation of Q.
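To make that iterative step concrete, here is a minimal tabular sketch; the dictionary Q, the learning rate alpha and the transition tuple are illustrative assumptions of mine, and the paper applies the same idea to a neural network rather than a table:

```python
from collections import defaultdict

# Q[(s, a)] -> current estimate of Q*(s, a); a plain table for illustration.
Q = defaultdict(float)

def bellman_update(Q, s, a, r, s_next, done, actions, gamma=0.99, alpha=0.1):
    # Target = observed reward + discounted value of the best next action.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    # Nudge the current estimate a little toward the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```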
Oh, the kettle is boiling.
Listen, here is what I don't understand at all: how can it then play all the Atari games? It does not play all the games at once. A separate Q-function is trained for each game.
So. The idea itself is simple. Let's represent this Q-function as a CNN! (in the sense of a convolutional neural network) Just an ordinary one, with pixels on the input and with the number of output neurons equal to the number of possible actions. More precisely, the network receives the pixels of several past emulator frames (4 frames in the original implementation), so that it can track moving objects. Each neuron of the last layer outputs Q(s, a) for its corresponding a.
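For concreteness, here is a minimal PyTorch sketch of such a network, roughly following the architecture as I read it in the paper (4 stacked 84×84 frames in, one Q-value per action out); the exact layer sizes are my reconstruction and may differ from the original code:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions, n_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_frames, 16, kernel_size=8, stride=4),  # (4, 84, 84) -> (16, 20, 20)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),         # -> (32, 9, 9)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # one Q(s, a) per possible action
        )

    def forward(self, frames):          # frames: (batch, n_frames, 84, 84)
        return self.net(frames)         # -> (batch, n_actions)
```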
How do we train this network?
Here is the current version of the network, with some weights. A new step happens: we observe a new transition from s to s' caused by action a, together with whatever reward r it gave. Let's simply store every such tuple (s, a, r, s') we have seen during play.
Then we pick a few of them at random and use the equation above to nudge the network weights a little, so that for the current state the network predicts the reward plus the maximum of what the old weights predicted for the next state.
This moment is hard to describe in words, so just to make it clear: we take the old network's predictions for s', go through all possible next actions one by one, take the maximum, and train the network to predict that value (plus the observed reward) directly for (s, a). And that's basically all there is to it.
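A sketch of one such training step, assuming the DQN module above, a list `replay` of (s, a, r, s_next, done) tuples of tensors, and PyTorch; the batch size, discount and loss are illustrative choices of mine, not necessarily the paper's exact settings:

```python
import random
import torch
import torch.nn.functional as F

def train_step(net, old_net, replay, optimizer, batch_size=32, gamma=0.99):
    batch = random.sample(replay, batch_size)          # random transitions, not the latest ones
    s      = torch.stack([t[0] for t in batch])        # (B, 4, 84, 84)
    a      = torch.tensor([t[1] for t in batch])       # (B,) action indices
    r      = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.stack([t[3] for t in batch])
    done   = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    with torch.no_grad():
        # Maximum over actions of the *old* network's prediction for the next state.
        next_max = old_net(s_next).max(dim=1).values
        target = r + gamma * next_max * (1.0 - done)   # just r if the game ended there

    # Q-value the current network assigns to the action actually taken.
    q_sa = net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```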
As a result, we keep in the algorithm's memory all the state transitions the system has seen, and train a neural network that, looking at nothing but the picture, predicts how good each possible next action will be.
The neural network itself has no memory or state at all; the entire memory of the system is in the trained weights. Muscle memory, if you like. (Why do we train on a random sample from the history rather than on the latest transitions? So that a positive feedback loop does not arise, i.e. so that the system adapts to the data as a whole rather than to the consequences of its most recent choices.)
A small detail: they also use the so-called ε-greedy policy. At each step, with probability ε a completely random action is taken, and with probability (1-ε) the action recommended by the current estimate of the Q-function. Over time ε, of course, decreases.
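A minimal sketch of the ε-greedy choice, assuming the DQN module above; the annealing schedule mentioned in the comment is just an illustrative assumption:

```python
import random
import torch

def epsilon_greedy_action(net, state, n_actions, epsilon):
    if random.random() < epsilon:
        return random.randrange(n_actions)         # explore: completely random action
    with torch.no_grad():
        q_values = net(state.unsqueeze(0))         # (1, n_actions)
    return int(q_values.argmax(dim=1).item())      # exploit: best action by current estimate

# e.g. anneal epsilon from 1.0 down to some small value over the first N steps.
```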
Here is the algorithm in pseudocode. And that's all, nothing else is done.
We just train the CNN to predict the Q-function along the way, on an array of states, iteratively.
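Since the pseudocode is easier to read than to retell, here is a rough sketch of the whole loop under the same assumptions as the snippets above; the `env` object with `reset()`/`step()` and all the constants are placeholders of mine, not the paper's exact values:

```python
import copy

def train_dqn(env, net, optimizer, n_actions,
              total_steps=1_000_000, memory_size=100_000,
              sync_every=10_000, epsilon=1.0, epsilon_min=0.1):
    replay = []                                   # list of (s, a, r, s_next, done)
    old_net = copy.deepcopy(net)                  # frozen copy used for the targets
    s = env.reset()                               # assumed to return a (4, 84, 84) tensor
    for step in range(total_steps):
        a = epsilon_greedy_action(net, s, n_actions, epsilon)
        s_next, r, done = env.step(a)
        replay.append((s, a, r, s_next, float(done)))
        if len(replay) > memory_size:
            replay.pop(0)                         # "forget" the oldest transition
        if len(replay) >= 32:
            train_step(net, old_net, replay, optimizer)
        if step % sync_every == 0:
            old_net = copy.deepcopy(net)          # refresh the "old weights"
        epsilon = max(epsilon_min, epsilon - (1.0 - epsilon_min) / total_steps)
        s = env.reset() if done else s_next
    return net
```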
I wonder how long it takes to train. Tens to hundreds of hours.
Oh, and also tell us about experience replay (the paper spends some nice general phrases on it, but in the code everything is very simple). Experience replay is their name for the fact that you memorize all the state transitions seen while playing and train on all of them. Yes, exactly. As I understand it, they didn't bother much with the choice of a "forgetting" strategy: old transitions are simply forgotten by being overwritten. Yep, however much memory there is, that's how much you remember.
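That "forget by overwriting" memory is essentially a fixed-size ring buffer; a minimal sketch, with an arbitrary illustrative capacity:

```python
from collections import deque
import random

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions.
    When full, appending silently overwrites the oldest transition."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```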
About preprocessing: do I understand correctly that the point of packing four (or three) images into φ is simply to save compute, and not an attempt to somehow capture local animation / direction of movement, etc.? They feed the network the past 4 frames of history. As far as I can tell, this is independent of the frame skipping used to save resources, so yes, it's precisely so the network can see the animation.
By the way, during a DeepHack hackathon one of the teams reduced the number of frames to 3 with no visible drop in results. The paper says that 4 was fine everywhere (and better than 3 for training time), except for Space Invaders: there the laser fires every 4 frames and therefore became invisible with that gap, so for that game they used 3.
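A minimal sketch of that frame stacking, assuming 84×84 grayscale frames already arrive from somewhere (the resizing and cropping themselves are left out):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last `n_frames` preprocessed screens and exposes them as
    one (n_frames, 84, 84) array -- the network input described above.
    Assumes at least one frame has been pushed before state() is called."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def push(self, frame_84x84):
        self.frames.append(frame_84x84)

    def state(self):
        # On the very first steps, pad by repeating the oldest frame.
        while len(self.frames) < self.frames.maxlen:
            self.frames.appendleft(self.frames[0])
        return np.stack(self.frames, axis=0).astype(np.float32)
```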
Aaand finally, a dumb question while we're discussing the input frames and the Main Formula with the gamma-discounted reward: where does the reward come from? From the game. Plain game score. In the original version they simply looked at the sign of the change in score.
"from the game" - is it from the emulator with an additional emulator hack? The short answer to the question and at the same time to the unasked questions about the frames on the screen and how to press the buttons - use ready-made libraries like http://www.arcadelearningenvironment.org , otherwise your research on DQN will quickly stop. How good that we live in the XX1 century! Both the rovers and end-of-game signals will be extracted, for example, by this very ALE from more than 50 atari games.
The whole business of interacting with the outside world is interesting, I think. For all this deep reinforcement learning to happen, a lot of things had to come together. First of all, there have to be games.
Games are the main engine of progress (and, apparently, porn too).
Because there are games, we have very powerful GPUs. Because there are Atari games, there is a strong community around the emulator and the rest of the infrastructure. Because there are games and people play them, other games got created: excellent training environments with nicely varying degrees of difficulty, ever more realistic as time goes on. Do not save yourself.