Probabilistic programming

Probabilistic modeling is one of the most powerful tools for a data analyst. Unfortunately, for its use it is necessary not only to confidently own the apparatus of probability theory and mathematical statistics, but also to know the details of the work of the approximate Bayesian inference algorithms, which makes the threshold of entry very high. In this lecture you will learn about a relatively young paradigm in machine learning - probabilistic programming. His task is to make all the power of probabilistic modeling available to anyone with programming experience and minimal data analysis experience.

The lecture was delivered by Boris hr0nix Yangel at the Faculty of Computer Science, opened at the Higher School of Economics with the support of Yandex. Boris himself graduated from the Moscow State University Moscow University and the School of Yandex Data Analysis. He worked at Microsoft Research Cambridge in the Christopher Bishop group on the Infer.NET framework. Now Boris is the lead search developer for Yandex.
')
Under the cut - decoding the story.

My task today is to tell you what probabilistic programming is, and how this highly underestimated machine learning paradigm allows you to put all the power of probabilistic modeling into the service of anyone who can program and has basic knowledge of probability theory, mathematical statistics and the basics of modeling.

What do we actually do in machine learning as a matter of fact? In fact, we want to somehow talk about the processes in the real world and how these processes are organized. There are some objects that are involved in these processes. For example, a user who views search results and decides whether a search result is interesting to him or not, click on it, see the next result, or leave the search page altogether. This is an example of objects. Objects have some properties. For example, a user may have some immediate information needs of him, his interests as a whole. Accordingly, there is an environment in which these objects exist and which also has some properties. And there is, in fact, a very complex process: the process of throwing a coin or the process of viewing a user of search results, which we want to talk about. And, as a rule, we have some observable manifestations of this process, we want to draw it and understand what were the properties of objects and environments that led to what we observed. For example, having looked several times at whether a coin fell by an eagle or a tail, we want to know whether the coin was fair or not. Or, looking several times at where the user clicked, we want to understand whether he left the search happy or unhappy, he found what he needed, or not, and whether those documents that we showed him were relevant or not.

There is a small problem. The complex process that we want to study, as a rule, is rather difficult to describe in advance as it happens. In the case of our user, these are some complex biochemical processes that take place in his brain, and which ultimately force him to make this or that decision. It is clear that to convert it is an incredibly complex computational task (even if it can be described), and no one will be engaged in this. Therefore, we resort to the reception, which is called probabilistic modeling and which consists in the following.

First, we build object models, focusing our attention only on those properties that, in our opinion, are important in this process. And those who have little effect on him, we ignore. For example, we can say that the coin is described by one number of its honesty, or we can say that the user is described by a vector of numbers, where each component of the vector shows his interest in a particular topic. For example, the first component tells how interesting it is for him to find a video, the second one how interesting the photos with funny cats are, and so on.

Then we have a complicated process, which we replace with a certain rounded version of it, which, on the one hand, has something in common with what is happening in reality, but, on the other hand, is simple enough to work with. In our hypothetical example with the user, we could say that he looks at the issue from top to bottom, and if the next result satisfies his information need, corresponds to his current interests, then he clicks and stops, and if not, he goes further. A small problem immediately arises: if we look at the real clicks of users, it turns out that the models that I just suggested are not satisfied in the sense that the model cannot actually produce all of the click patterns that we see. For example, users sometimes click on something that they are completely uninteresting. Why? He wanted it that way, or maybe he saw some result that surprised him, and wanted to click and see what would be there. Accordingly, our model cannot produce such data, and this is the problem.

What to do? Here a trick is applied, because of which these models are called probabilistic: we add a bit of noise to the model, which is designed to simulate those complex phenomena that we cannot directly model. We believe that they are so complex that they are very indistinguishable from noise, but adding this noise will allow our model to extricate all possible observations from itself and then it will be possible to reverse it.

In the example with the user, we can say that when he sees another result, he throws up a coin and, depending on how it fell, decides to see this result for him, if he likes it, or go further, or get tired and stop watching the issue. Thus, all click patterns are possible within this model. Although some are more likely. And the constructed model still makes sense and correlates with the real physical process that we want to simulate.

If we have built such a model, we can formally say that we have built a joint probability distribution on the properties of objects and observable data. Then we can reverse this process with the help of an important theorem, which is called the Bayes theorem, which just tells us how from our a priori knowledge about the objects we want to study, having observed some data, refine this knowledge and move on to the posterior probability distribution . That's basically the essence of probabilistic modeling.

With probabilistic modeling, unfortunately, there are a number of problems. The model itself is relatively easy to develop for any subject area. Not in the sense that it is easy to develop a model that best exists. This, as a rule, is rather difficult, because in certain subject areas people have been doing what they think up for years and decades. But if, for example, you are faced with some kind of new process, a new subject area, in which you more or less understand and want to model, and have some basic concepts about how probabilistic models are built, then come up with a reasonable model. probably will not take much time. Maybe hours or days. Then you need to make a deduction in the model, use the Bayes theorem to find the a posteriori distribution on the variable of interest under the observation conditions. Earlier it was said that the Bayes theorem gives a computational recipe that will allow you to do this for any model. On the one hand, this is true, but on the other hand, this computational recipe can be very difficult. Roughly speaking, it is simply not computationally possible to calculate all terms in the Bayes theorem for a slightly complicated model. Therefore, for several decades, approximate methods of Bayesian inference have been developed, which allow us to find some approximation for the posterior distribution, which under certain conditions will be a good approximation. In general, if you want to do probabilistic inference of complex probabilistic models, then you will inevitably have to resort to such methods of approximate inference.

Here are several such methods, it is EP, VMP, MCMC. All these methods are quite complex in implementation, conceptual and not very simple. As a result, to write a conclusion in some of your probabilistic models, you need to spend a lot of time to figure it out. At the same time, truly complex models will require knowledge of the PhD level in machine learning, and not just in machine learning, but rather specialized in the Bayesian output. Then you have the chance to write the output code in this model correctly and without errors, which will actually do what you think.

Accordingly, all this greatly complicates the experimentation. For example, it will take several weeks to demand one model, and it is not known whether it will work out or not. What to say about trying several alternative models - few people have enough patience and strength for that. There is such a tendency that few people are ready to try complex models in their tasks, because, as a rule, the more complex the model, the more difficult the output code is, and few people can write it. Therefore, all limited to simple models. However, very often, more complex models have greater predictive power, and potentially a lot of profit can be obtained from their use.

Scientists thought how to deal with this problem, and as one of the solutions was proposed such a paradigm as probabilistic programming. This paradigm is based on two ideas. First, you don't have to write probabilistic inference code yourself. It is difficult, and it is better to leave all this to the experts. All you have to do is set probabilistic models, say that you have observed, a posteriori distributions for which variables in your model you are interested in. Then the black box, the engine of probabilistic inference, must make a conclusion for you. The second idea is that we want as many people as possible to use all the power of probabilistic modeling. We need such a method to formally describe probabilistic models so that, on the one hand, it can be understood by the probabilistic inference engine, and on the other hand, a sufficiently large number of people can use it. What kind of formal languages are known to the overwhelming majority of people who may be doing this? These are programming languages.

The idea of probabilistic programming is as follows. You describe a probabilistic model using a program that actually implements your physical process model. That is, it is a program that, conditionally speaking, accepts input or generates properties of objects of interest to you at the very beginning, and then with the help of some algorithm generates observable data, periodically referring to a random number sensor to add some noise there. Such a program sets a probabilistic model, sets a joint distribution, the engine can analyze the program and understand how to draw a conclusion. Sounds, probably, a little utopian, but at present there are indeed a number of engines that may not always work and not for all models, but allow us to make a conclusion in interesting models of medium and high complexity. Several are listed on this slide.

What does a probabilistic program look like? Let us first look at a very simple and useless, from a practical point of view, example, which, however, will illustrate the concepts we are talking about.

The model is very simple: we pop two coins, see if they fell by tail or tail, then look if they fell together with tail or tail, or not.

How to set such a model using the program? First, we will add a logical variable to each coin. These variables will take the value of "true" if the coin fell eagle, and "false" if the coin fell tails. And we will get another variable, which will take the value "true", only when both coins fell by an eagle.

The function, here called Bernoulli Sample, is simply a call to a random number generator that generates a sample from the Bernoulli distribution with a given parameter of 0.5. That is, this function with a probability of 0.5 will return the value “true” and with a probability of 0.5 it will return the value “false”. Her call is equivalent to a coin flip.

Here is our model - now we can make some requests for this model. The code of probabilistic programs that will be shown is some kind of pseudo-code that does not correspond, in fact, to the code of any of the probabilistic inference engines that were discussed. Because the syntax of these engines is not intended to be shown on slides and makes it very difficult to understand the material, but all the code shown can be easily transferred to existing engines without much effort and changes to this code, this is a purely automatic and syntactic transformation.

Let's return to the program. We can ask some kind of request for this model. For example, we can say that we have observed that the value of the variable bothHeads is false, that is, both coins did not fall out of the eagle. Now we can ask, what is the probability that the first coin fell out of an eagle? What answer should the probabilistic programming engine give in this case? In our process there are four tail and eagle configurations. We know that the eagle-eagle configuration is excluded, and in the remaining three configurations only in one of them the first coin falls out of the eagle, true. The correct answer, which any probabilistic programming engine in this case should give, is that the posterior distribution on coin1Heads is the Bernoulli distribution with the value of the parameter 1/3. With probability 1/3 - an eagle, 2/3 - tails. Very simple probabilistic program.

What are the semantics of these programs, how can you think about them? You can imagine that we take this program and run it an infinite number of times, each time referring to the random number sensor, which generates numbers from some distribution, and we have two more operators: infer - lets you ask questions, observe - says that we know. The idea is that when we endlessly run this program, every time we reach the infer operator, it keeps somewhere the current value of the variable somewhere and, relatively speaking, remembers all the values of the variable so that when you ask it, return the distribution to all values. And the observe operator looks at whether the value of a variable in the current program execution is equal to what it should be equal to. And if it is not equal, then it interrupts the current execution of the program and tells all the infer operators that it was not necessary to memorize anything in this round, throw it all out, this is a bad performance, it does not correspond to the limitations we know.

In fact, no probabilistic inference engine works that way. It would be extremely inefficient, but you can think about it this way. In fact, such a program sets a joint distribution to the values of all variables in this program, if we consider all the infinite sets of program executions. Excluding all those performances that do not comply with the restrictions.

Let's consider the not very practical, but already more complex example of a probabilistic program. Namely, such a simple model as linear regression. Let's see how it could be written in the probabilistic programming language. What is linear regression? We have some observable value, and we consider that it depends on some characteristics approximately linearly. Characteristics (they are features): let's say we have some points, each of them has a feature vector that affects linearly the observed value. That is, we consider its scalar product with a certain vector of scales, we add a bit of noise so that this linear dependence is not exact, but approximate. And so we get the observed value. How to write this in a probabilistic programming language? To begin with, we will declare an array of attributes, which is denoted here by X, - this is a two-dimensional array, where the first dimension includes the objects of our sample, and the second one - the values of attributes. This array is always observable, we can, for example, always call the GetFeatures function, which counts it from a file, as in ordinary programming.

Now we need a weight vector, the actual parameters of this model. This is a random variable and we will assign a normal prior distribution to it. How? We simply generate all the weights from the normal distribution with the expectation 0 and variance 1, calling a function that simply generates a number from the distribution with the given parameters. So, in fact, an a priori distribution is specified - we simply write the code that generates from it. Now let's get the vector Y for the regression value and fill it. Its values are equal to the scalar product of the weight vector and the feature vector plus some random noise. The string Y [n] initializes the value of Y to random noise with a mean of 0 and a variance of 0.1, and then this cycle adds to the value we initialized, the scalar product of the feature vector of the corresponding object with the weight vector.

Now we can say that we have observed the values of Y, having counted them from a file or something else, and ask the probabilistic programming engine what a posteriori distribution on W with the known Y.

It was a couple of simple examples that were supposed to illustrate the concept. Now let's move on to interesting examples - what can you easily do with probabilistic programming?

First, consider the model True Skill. This model is needed if you have a certain two-player game in which the outcome - either won the first one, or won the second. , - . , , , , . ? : , .

True Skill Microsoft Xbox Live, Halo , , , . , . , - .

, , , - , - , — , , , , .. . – True Skill. , .

: , . , , . , , , ? , , , , . , , .

? . , , . 0 PlayerCount-1, — 0 GameCount-1. , , 0 PlayerCount-1. , , , , , .

. , , . , , , .

. , , , , , , . . «» — , , , , . , . . , , , . .

. Microsoft Research , True Skill through time.

, , . , - — «» . 20 , 20- . , . .

– DARE. ? , , - , - , , , , . , , , — , .

? , -, . , , , . , , , , . , , , . , . , . , . , , . , . , , . , . , , 0,7 . ? : Whitehil, BCC, DARE.

, – . , , . . -, – , . -, discrimination – . discrimination , , , . , , , . – . discrimination .

? . , - . , , . — . -, . - – . discrimination . 1 . , , , , , .

. , True Skill, — . : discrimination, , . , . – . , discrimination, .

DARE. IQ- DARE, , IQ . IQ. , . : — IQ, , – IQ. , , , , , .

DARE ? , - , , . , - , , , . , , , , , , , . , , – , . , - , , , discrimination. , .

. : , , , , - — , .

, . , , , , . , , , — ( ).

– . - , . , , . , , , . , - . : , , , , , , — , ?

, , . . , , , , , .

. . - , – . , , . , , , . , , , , , Google.

, , - , . , . , , , . , . - . , , , .

. , . . , , - . - , , – - , . . ? , , , . ? – ( , Google ..) , , , , , — . - , . .

– , . , . – , , . , - . , , . , , , , .

***
– , , . , — , , , MIT, Microsoft Research, . . , .

Source: https://habr.com/ru/post/263991/

All Articles

Probabilistic programming

More articles: