
Multi-armed bandits: introduction and the UCB1 algorithm

This is the first post from the Surfingbird blog that I am cross-posting to the general algorithms and artificial intelligence hubs; to be honest, it simply had not occurred to me to do so before. If you are interested, come over and read the earlier texts: I do not know what would happen if I simply added new hubs to posts that are several months old.

The previous installments of this series were about recommender systems.

This time we are starting a new topic: multi-armed bandits. Bandits are the simplest, and for that reason all the more important, problem setting in so-called reinforcement learning...


We always say that what we do is machine learning. But think about it: how do you yourself actually learn? Do we learn touch typing by fitting a regression of finger movements to observations of typists? Does a child, in order to learn to walk, first collect and generalize a corpus of examples of its parents walking upright? Of course not: in such cases we simply try to do something, see whether it worked, and then (usually unconsciously) adjust our behavior based on the result.

This problem setting is called reinforcement learning: there is an agent acting in an environment; the agent tries to do something, and the environment responds to its actions by handing out treats or, on the contrary, zapping it with electric shocks. The agent tries to collect more treats and fewer shocks. In short, the typical mouse in a maze.

Today, however, we will talk about a mouse that has not even made it into a maze, but into a perfectly ordinary Skinner box: let us assume that the state of the environment and of the agent does not change from action to action. In other words, the agent has a fixed set of possible actions; it chooses one of them, receives some reward for it (a random variable), and can then choose again from the same set of actions. In machine learning this problem is called the multi-armed bandit problem: you are sitting in a room in front of several slot machines and must try to win as much as possible.
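To make the setting concrete, here is a minimal sketch of such an environment in Python (the Bernoulli rewards, the class name, and the payout probabilities are illustrative assumptions, not something from the original post):

```python
import random

class BernoulliBandit:
    """A bandit with several arms; pulling arm j pays 1 with probability probs[j], else 0."""
    def __init__(self, probs):
        self.probs = probs  # true success probability of each arm (unknown to the agent)

    def pull(self, arm):
        # the environment's response: a random reward for the chosen action
        return 1 if random.random() < self.probs[arm] else 0

# three "slot machines" with different hidden payout rates
bandit = BernoulliBandit([0.05, 0.10, 0.02])
reward = bandit.pull(1)
```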


(picture from Microsoft Research)

Why is this needed in real life? Imagine, for example, that you are Yahoo! or, say, mail.ru. You have a homepage visited by millions of people every day. You want to place links on it so that people click them as often as possible. How do you choose the links that give the maximum CTR (click-through rate)? Here each display of a link to a user corresponds to pulling an arm; each click is a success, positive reinforcement from the environment; no click is a failure. The algorithm's task is to figure out as quickly as possible which news item is "hot" and start showing it (yes, this is a slightly modified setting, one in which arms die off from time to time and new ones are born).

Another example: imagine that you want to test a set of changes to your site's interface. This is usually done with so-called A/B testing: you select an experimental group (for each change separately) and a control group for which nothing changes, then collect statistics and assess whether the improvement is significant. However, this scheme by itself does not answer the main question: how long should the statistics be collected? Here bandits can also come to the rescue, and the problem statement is exactly a bandit one: showing one variant or another corresponds to pulling an arm, and the user's actions determine the environment's response (see, for example, this post).

But back to the problem statement. At first glance, choosing the optimal arm seems very simple: pull each arm a thousand times and then pick the arm with the highest average. Such an algorithm will indeed most likely make the right choice, but think how much money it will lose on the suboptimal arms! Many of those losses could have been avoided. In general, the best metric for judging the quality of a bandit algorithm is its regret: the expected difference between the income of the "always pull the optimal arm" strategy and that of the strategy under consideration.
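Written out in standard notation (the symbols here are my own shorthand, not the post's): if $\mu^{*}$ denotes the expected reward of the best arm and $r_t$ the reward received at step $t$, then the regret after $T$ steps is

$$R_T \;=\; T\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right].$$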

As an example, let me immediately give an algorithm with some of the best known guarantees for precisely this measure of loss. It is the UCB1 algorithm; the algorithm itself is very simple, while the proof of its optimality is, of course, not so simple, but we will not go into it here; I refer interested readers to the article [Auer et al., 2002], where you can also find a more "advanced" version with improved constants.
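Concretely, the standard UCB1 rule from [Auer et al., 2002] is: first pull each arm once; after that, at every step pull the arm $j$ that maximizes the priority

$$\bar{x}_j + \sqrt{\frac{2\ln n}{n_j}},$$

where $\bar{x}_j$ is the average reward observed so far on arm $j$, $n_j$ is the number of times arm $j$ has been pulled, and $n$ is the total number of pulls.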



Although a formal proof of why the bonus should contain exactly twice the natural logarithm ($2\ln n$), and not something else, is beyond the scope of this article, it is intuitively clear what happens in this algorithm: we have come up with a heuristic, a priority estimate by which we choose which arm to pull. Naturally, the priority is higher the higher the average income, but not only that: arms that we have pulled rarely (compared with the others) also receive a bonus. Since the numerator of the fraction grows much more slowly than the denominator, as $n \to \infty$ the heuristic looks more and more like simply choosing the arm with the best average. However, if some $n_j$ stops growing entirely (we have become completely disappointed in one of the arms), sooner or later even the logarithmic numerator will force us to double-check it.

Implementing this algorithm in your favorite language is, of course, no trouble at all: you only need to remember how many times each arm has been pulled. The main point here is that such a simple heuristic works well: you can simply choose the arm with the highest priority and pay no attention to anything else. This is related to the so-called Gittins theorem: it turns out that, under fairly general assumptions, choosing the optimal strategy in the multi-armed bandit problem reduces to computing independent priorities for each arm separately. True, the Gittins theorem by itself yields only an extremely complex and slow algorithm for computing those priorities (the so-called Gittins indices), and the UCB algorithm is not just a special case of the general theorem but a separate result.
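For instance, a minimal sketch in Python (the class and method names are my own; rewards are assumed to lie in [0, 1], as in [Auer et al., 2002]):

```python
import math

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # n_j: how many times each arm was pulled
        self.sums = [0.0] * n_arms    # total reward collected from each arm
        self.total = 0                # n: total number of pulls so far

    def select_arm(self):
        # first pull every arm once
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # then choose the arm with the highest priority: average + exploration bonus
        def priority(arm):
            mean = self.sums[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(self.total) / self.counts[arm])
        return max(range(len(self.counts)), key=priority)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.total += 1
```

Paired with the BernoulliBandit sketch above, the whole experiment is just a loop: arm = agent.select_arm(), reward = bandit.pull(arm), agent.update(arm, reward).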

In the next installment we will continue looking at bandit problems and generalize them to trend-following tasks, where the incomes of different arms can change over time and our job is to notice the change as quickly as possible.

Source: https://habr.com/ru/post/168611/

