
Curiosity and Procrastination in Reinforcement Learning

Reinforcement learning (RL) is one of the most promising and actively developed machine learning techniques. In RL, the agent receives a positive reward for correct actions and a negative reward for wrong ones. This carrot-and-stick approach is simple and universal: DeepMind used it to teach the DQN algorithm to play old Atari video games and AlphaGoZero to play the ancient game of Go, OpenAI used it to teach the OpenAI Five algorithm to play the modern video game Dota, and Google used it to teach robotic arms to grasp new objects. Despite these successes, many problems still limit the effectiveness of RL.

RL algorithms struggle in environments where the agent receives feedback only rarely, which is typical of the real world. As an example, imagine searching for your favorite cheese in a large, maze-like supermarket. You look and look for the cheese section, but cannot find it. If at each step you receive neither carrot nor stick, there is no way to tell whether you are moving in the right direction. With no reward, what keeps you from wandering around forever? Nothing, except perhaps your curiosity, which motivates you to walk into a section that looks unfamiliar.

The paper "Episodic Curiosity through Reachability" is the result of a collaboration between the Google Brain team, DeepMind and ETH Zurich. In it we propose a new RL reward model based on episodic memory. It acts like curiosity, driving the agent to explore the environment. Since the agent must not only explore the environment but also solve the original task, our model adds a bonus to the initially sparse reward. The combined reward is no longer sparse, which allows standard RL algorithms to learn from it. Our curiosity method thus extends the set of tasks that can be solved with RL.


Episodic curiosity through reachability: observations are added to memory, and the reward is computed from how far the current observation is from similar observations already in memory. The agent receives a larger reward for observations that are not yet represented in memory.
The main idea of the method is to store the agent's observations of the environment in episodic memory and to reward the agent for reaching observations that are not yet represented in memory. "Absence from memory" is the definition of novelty in our method. Seeking out such observations means seeking out the unfamiliar, which leads the agent to new locations, keeps it from wandering around in circles, and ultimately helps it stumble upon the goal. As we discuss later, our formulation also protects the agent from an undesirable behavior that some other formulations are prone to. Much to our surprise, that behavior bears some resemblance to what a layperson would call "procrastination".
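As a rough sketch of how such a memory-based bonus might be computed (the helper names `embed` and `reachability_net`, the aggregation by maximum, and the threshold value are illustrative assumptions rather than the exact implementation from the paper):

```python
import numpy as np

def episodic_curiosity_bonus(observation, memory, embed, reachability_net,
                             novelty_threshold=0.5, bonus_scale=1.0):
    """Illustrative sketch of a memory-based curiosity bonus.

    `embed` maps a raw observation to an embedding vector, and
    `reachability_net(a, b)` returns a similarity score in [0, 1] that is
    high when `b` is reachable from `a` in a few steps. Both are assumed
    to be trained separately.
    """
    e = embed(observation)
    if not memory:
        memory.append(e)
        return bonus_scale  # the very first observation is maximally novel

    # Compare the current embedding against everything stored in memory.
    scores = np.array([reachability_net(e_mem, e) for e_mem in memory])
    similarity = scores.max()  # score of the closest stored observation

    if similarity < novelty_threshold:   # far from everything in memory
        memory.append(e)                 # remember the new observation ...
        return bonus_scale * (1.0 - similarity)  # ... and reward the agent
    return 0.0                           # already "known", no bonus
```

At every step this bonus is added to the (sparse) task reward, so a standard RL algorithm sees a dense reward signal.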

Previous formulations of curiosity


Although there have been many attempts to formalize curiosity in the past [1] [2] [3] [4], in this article we focus on one natural and very popular approach: curiosity through prediction-based surprise. This technique is described in the recent paper "Curiosity-driven Exploration by Self-supervised Prediction" (usually referred to as ICM). To illustrate how surprise relates to curiosity, we again use the analogy of searching for cheese in a supermarket.


Illustration by Indira Pasko, used under the CC BY-NC-ND 4.0 license

Wandering around the store, you try to predict the future ("Now I am in the meat section, so I think the section around the corner is the fish section; they are usually next to each other in this supermarket chain"). If the prediction turns out to be wrong, you are surprised ("Actually, it's the vegetable section. I didn't expect that!") and thus receive a reward. This makes you more motivated to look around the corner again in the future, exploring new places just to check whether your expectations match reality (and, perhaps, to stumble upon the cheese).

Similarly, the ICM method builds a predictive model of the world's dynamics and rewards the agent when the model fails to make good predictions, which serves as a marker of surprise or novelty. Note that exploring unvisited places is not directly part of ICM's formulation of curiosity; for ICM, visiting them is only a way to obtain more "surprise" and thus maximize the overall reward. As it turns out, in some environments there are other ways to generate surprise, and this leads to unexpected results.
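In code, a surprise-based bonus of this kind boils down to using the prediction error of a learned forward model as the intrinsic reward. A minimal sketch, where `embed` and `forward_model` are placeholders for learned networks rather than ICM's exact architecture:

```python
import numpy as np

def surprise_bonus(obs, action, next_obs, embed, forward_model, scale=1.0):
    """Intrinsic reward from prediction error, in the spirit of ICM.

    `forward_model` predicts the embedding of the next observation from the
    current embedding and the action; the worse the prediction, the larger
    the bonus.
    """
    predicted = forward_model(embed(obs), action)
    actual = embed(next_obs)
    error = np.sum((predicted - actual) ** 2)  # squared prediction error
    return scale * error
```

A noisy TV breaks this scheme: the prediction error for a random channel switch never goes to zero, no matter how long the agent watches.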


An agent with surprise-based curiosity gets stuck when it encounters a TV. Animation from Deepak Pathak's video, used under the CC BY 2.0 license

The danger of "procrastination"


In the paper "Large-Scale Study of Curiosity-Driven Learning", the authors of the ICM method, together with researchers from OpenAI, show a hidden danger of maximizing surprise: agents can learn to procrastinate instead of doing something useful for the task. To understand why, consider a thought experiment the authors call the "noisy TV problem". The agent is placed in a maze with the task of finding a very useful item (like the "cheese" in our example). The environment contains a TV, and the agent has a remote control. There is a limited number of channels, each showing its own program, and every press of the remote switches the TV to a random channel. How will the agent behave in such an environment?

If curiosity is formulated in terms of surprise, then switching channels yields a large reward, since every switch is unpredictable and unexpected. Crucially, even after cycling through all available channels, the random channel selection guarantees that every new switch is still unexpected: the agent predicts what the TV will show after the switch, and that prediction will most likely be wrong, causing surprise. Even if the agent has already seen every program on every channel, the switch remains unpredictable. Because of this, instead of looking for the very useful item, the agent eventually stays in front of the TV, which looks a lot like procrastination. How should the formulation of curiosity be changed to prevent such behavior?

Episodic curiosity


In the paper "Episodic Curiosity through Reachability" we explore an episodic-memory-based curiosity model that is less prone to instant gratification. Why? In the example above, after some channel switching all the programs will eventually end up in memory. The TV thus loses its attractiveness: even though the order of the programs on the screen is random and unpredictable, they are all already in memory! This is the main difference from the surprise-based method: our method does not even try to predict the future, which is hard (or even impossible) to predict. Instead, the agent examines the past and checks whether memory already contains observations similar to the current one. Our agent is therefore not drawn to the instant gratification offered by the noisy TV; it has to go and explore the world beyond the TV to obtain more reward.

But how do we decide whether the agent is seeing the same thing that is already stored in memory? Checking for an exact match is pointless: in a realistic environment, the agent rarely sees exactly the same thing twice. For example, even if the agent returns to the same room, it will still see that room from a different angle.

Instead of checking for an exact match, we use a deep neural network trained to measure how similar two experiences are. To train this network, we have it guess whether two observations occurred close together in time. Temporal proximity is a good indicator of whether two observations should be considered part of the same experience. This training leads to a general notion of novelty through reachability, illustrated below.


The reachability graph determines novelty. In practice this graph is not available, so we train a neural-network approximator to estimate the number of steps between observations.
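A minimal sketch of how such a comparator could be trained from agent trajectories, assuming a generic siamese-style head on precomputed embeddings in PyTorch (the architecture, the step threshold `K_STEPS`, and the sampling scheme are illustrative, not the exact ones from the paper):

```python
import random
import torch
import torch.nn as nn

K_STEPS = 5  # observations at most K_STEPS apart are labeled "reachable"

class ReachabilityNet(nn.Module):
    """Predicts whether two observation embeddings are a few steps apart."""
    def __init__(self, embedding_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * embedding_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, emb_a, emb_b):
        # Returns a logit; sigmoid(logit) is the probability of reachability.
        return self.head(torch.cat([emb_a, emb_b], dim=-1))

def sample_training_pair(trajectory):
    """Sample (emb_a, emb_b, label) from one trajectory of embeddings."""
    i = random.randrange(len(trajectory) - 1)
    if random.random() < 0.5:  # positive pair: temporally close
        j, label = min(i + random.randint(1, K_STEPS), len(trajectory) - 1), 1.0
    else:                      # negative pair: temporally distant
        j, label = min(i + random.randint(2 * K_STEPS, 10 * K_STEPS),
                       len(trajectory) - 1), 0.0
    return trajectory[i], trajectory[j], label

# Training setup: binary cross-entropy on reachable / not-reachable pairs.
net = ReachabilityNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
```

Once trained, such a comparator plays the role of the reachability approximator above: the agent treats an observation as novel when the network judges it unreachable within a few steps from everything in memory.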

Experimental Results


To compare the performance of different formulations of curiosity, we tested them in two visually rich 3D environments: ViZDoom and DMLab. In these environments the agent was given various tasks, such as finding a goal in a maze, or collecting good objects and avoiding bad ones. The DMLab environment equips the agent by default with a sci-fi gadget, a laser-tag device, and if the gadget is not needed for a particular task, the agent is free not to use it. Interestingly, the surprise-based ICM agent actually used the laser very often, even when it was useless for the task! As with the TV, instead of searching for the valuable item in the maze, it preferred to spend time tagging walls, because that produced a lot of surprise reward. In theory the result of tagging a wall should be predictable, but in practice it is too hard to predict, probably requiring a deeper knowledge of physics than a standard agent has.


A surprise-based ICM agent keeps tagging the wall instead of exploring the maze

In contrast, our agent learned reasonable exploration behavior in the same environment. This is because it does not try to predict the result of its actions, but rather seeks observations that are "farther" from those already in episodic memory. In other words, the agent implicitly pursues goals that require more effort than simply tagging a wall.


Our method demonstrates reasonable exploration behavior.

It is interesting to see how our reward penalizes an agent running in circles: after completing the first loop, the agent encounters no new observations and thus receives no reward:


Reward visualization: red corresponds to negative reward, green to positive. From left to right: the reward map, the map of locations in memory, the first-person view

At the same time, our method encourages good exploration of the environment:


Reward visualization: red corresponds to negative reward, green to positive. From left to right: the reward map, the map of locations in memory, the first-person view

We hope that our work contributes to a new wave of research that goes beyond surprise-based techniques and teaches agents more intelligent behavior. For an in-depth analysis of our method, please take a look at the preprint of our paper.

Acknowledgments:


This project is the result of a collaboration between the Google Brain team, DeepMind and ETH Zurich. The core research team: Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap and Sylvain Gelly. We would like to thank Olivier Pietquin, Carlos Riquelme, Charles Blundell and Sergey Levine for discussions of this work. We are grateful to Indira Pasko for her help with the illustrations.

References:


[1] "Count-Based Exploration with Neural Density Models", Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, Rémi Munos
[2] "#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning", Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
[3] "Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration", Alexandre Péré, Sébastien Forestier, Olivier Sigaud, Pierre-Yves Oudeyer
[4] "VIME: Variational Information Maximizing Exploration", Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel

Source: https://habr.com/ru/post/427847/

