
Probabilistic programming in Scala

Hello, dear readers. Today we are publishing an unusual translation: a review article by the brilliant Noel Welsh on the principles of probabilistic programming. The article is published at the request of readers, who keep raising the bar for our blog - and that is definitely great!

At the Typelevel Summit in Philadelphia I had the opportunity to give a talk on probabilistic programming, something I have been interested in for a while. Probabilistic programming sits at the intersection of two research areas that I find fascinating and that fit together remarkably well: functional programming and machine learning (more specifically, Bayesian inference). In this article I will try to explain the key ideas of probabilistic programming. I assume that you, dear reader, are more of a programmer than a statistician, but that numbers do not frighten you either, so I will focus more on the programming than on the machine learning.

Probabilistic models and inference

To understand probabilistic programming, you first need to understand probabilistic models and inference. A probabilistic model describes how things we can observe may depend on things we cannot observe. It is usually assumed that this relationship is not deterministic, so the dependence between the observed and the unobserved is governed by some probability distribution. A model of this type is sometimes called generative.
Suppose a doctor is trying to make a diagnosis. They observe various symptoms - for example fever, a rash, shortness of breath - and on that basis can make an informed guess about what disorder could cause them. Of course, this process is imperfect: different diseases can produce very similar symptoms, and the symptoms of the same disease often vary, so this diagnostic model implies a probability distribution over outcomes.

Let's look at another example in more detail. Suppose we want to sort documents by the topics they cover - perhaps to recommend reading that people might like. In this case we can model the following generative process:

  1. Choose several topics and assign weights to them;
  2. Each topic has its own distribution of words;
  3. From these word distributions, choose the specific words that end up on the page, in proportion to the weights assigned to each topic.

A model of this type is called a topic model and is widely used. Some of the more unusual applications include classifying aesthetic preferences at Etsy and trying to work out what the rooms in the surviving houses of Pompeii were used for.

I hope these brief examples are enough to convey the idea of probabilistic models. Many other generative processes can be imagined: the properties of a celestial body influence how we observe it, a user's behaviour online depends on their interests, and the activity of an NGO affects the state of health in the region where it operates.

Generative models work well when we have prior knowledge about the structure of the world that we can build into the model. They are also great fun to work with, because they let you generate fake data - for example, automatically generated scientific papers - with which to confuse and amuse your colleagues.

Once we have created a probabilistic model (and had plenty of fun with fake data), we can use it for inference. Inference is the process of "running the model backwards": we look at the real data we have actually observed and try to work out what the parts we could not observe might look like. For example, we can take the text of a document and try to infer which topics it covers, or collect astronomical data and try to determine which star systems are most likely to support life. Since we cannot know the state of these unobserved elements of our model with certainty, we construct a probability distribution over all the possibilities. Producing such a distribution is exactly what distinguishes Bayesian inference from alternatives such as maximum likelihood estimation, which yields only a single assignment of the unobserved properties.

Why do we need probabilistic programming

We have briefly described generative models and the problem of Bayesian inference. This area of statistics and machine learning has been studied for a long time, and over the years researchers came to an important realization:


How great it would be if we could derive an inference algorithm from nothing more than a description of the model. The situation is comparable to programming in assembly language versus a high-level language. In assembly we optimise everything by hand, implement our own data and control structures, and so on, all in the context of the specific problem in front of us. The same thing happens when we craft an inference algorithm for one specific generative model. When programming in a high-level language we use ready-made constructs that the compiler translates into assembly for us. Programmer productivity goes up significantly at the cost of a small loss of machine efficiency, one we often do not even notice. The goal of probabilistic programming is exactly that.

Of course, this is a monad!

The goal of probabilistic programming seems worthwhile and noble, but to achieve it we first need to work out how to structure our generative models. It should come as no surprise to an experienced functional programmer that we use a monad for this. Let me explain exactly how.

Let's go back to the example of the generative model for creating a document. To create a document, you need:

  1. choose several topics and their corresponding weights;
  2. each topic determines its own distribution of words;
  3. from these word distributions, choose the specific words that end up on the page, in proportion to the weights of the corresponding topics.

We can annotate this model with types to show how the problem can be solved. Let Distribution[A] represent a distribution over values of type A. Then the ingredients for generating a document look like this:
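A minimal sketch, in which the names Topic, Words, topics and wordsFor are illustrative assumptions:

  // Hypothetical types: a document is a bag of words drawn from topics.
  type Words = String
  sealed trait Topic

  // The two primitives of the model: a distribution over topics (with their
  // weights baked in) and, for each topic, a distribution over the words
  // that the topic tends to use.
  val topics: Distribution[Topic] = ???                 // defined elsewhere
  val wordsFor: Topic => Distribution[Words] = ???      // defined elsewhere

  // A document is built by combining these two pieces; exactly how is the
  // question addressed below.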


(Here we are trying to balance simplicity and accuracy, so our model takes no account of grammar or of the number of words in a document. You may be able to extend it yourself, but if you get stuck, have a look at my talk, which works through a simpler but more complete example.)

The key part of this model is connecting a Distribution[Topic] and a Topic => Distribution[Words] to obtain the Distribution[Words] from which the document is built. So, what should replace the ??? for the following identity to hold?

 Distribution[Topic] ??? Topic => Distribution[Words] = Distribution[Words] 

The answer is flatMap; in other words, we have a monad. (You can check for yourself that the monad laws hold, although in this case we first need to define the semantics of flatMap; see below.)
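A minimal sketch of what such a monad's interface might look like in Scala (an assumption for illustration, not any particular library's API):

  // A minimal probability-monad interface: flatMap chains a distribution
  // with a distribution-valued function.
  trait Distribution[A] {
    def flatMap[B](f: A => Distribution[B]): Distribution[B]
    def map[B](f: A => B): Distribution[B]
  }

  // Combining the two pieces of the document model:
  // Distribution[Topic] flatMap (Topic => Distribution[Words]) gives Distribution[Words].
  def word(topics: Distribution[Topic],
           wordsFor: Topic => Distribution[Words]): Distribution[Words] =
    topics.flatMap(wordsFor)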

If you have ever worked with ScalaCheck or a similar system, then you have already used a probability monad of this kind.
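For instance, ScalaCheck's Gen supports exactly this pattern (a small illustrative example with made-up topics and words):

  import org.scalacheck.Gen

  // Gen behaves like a distribution: flatMap chains dependent random choices.
  val topic: Gen[String] = Gen.oneOf("cooking", "cycling")
  val word: Gen[String] = topic.flatMap {
    case "cooking" => Gen.oneOf("flour", "oven", "whisk")
    case _         => Gen.oneOf("wheel", "chain", "helmet")
  }

  // word.sample returns an Option[String], e.g. Some("chain")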

Building inference algorithms

There are many ways to implement the probability monad. If we are dealing only with discrete domains, we can represent a distribution as a List[(A, Probability)] (where Probability might be an alias for Double) and compute results exactly. In real applications the benefit of this approach is limited, because we will most likely be dealing with continuous domains, or with discrete ones that are simply too large, and the size of the representation grows exponentially with each flatMap. An example implementation is shown here.
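As a rough sketch of the idea (not the implementation linked above), an exact enumeration-based monad might look like this:

  // Exact representation for small discrete domains: a list of
  // (value, probability) pairs. Can be pasted into the Scala REPL.
  final case class Enumerated[A](weights: List[(A, Double)]) {
    // Each flatMap multiplies out every pair of outcomes, so the
    // list grows multiplicatively with every step.
    def flatMap[B](f: A => Enumerated[B]): Enumerated[B] =
      Enumerated(for {
        (a, p) <- weights
        (b, q) <- f(a).weights
      } yield (b, p * q))

    def map[B](f: A => B): Enumerated[B] =
      Enumerated(weights.map { case (a, p) => (f(a), p) })
  }

  object Enumerated {
    def uniform[A](values: List[A]): Enumerated[A] =
      Enumerated(values.map(a => (a, 1.0 / values.size)))
  }

  // Example: the distribution of the sum of two fair dice, computed exactly.
  val die     = Enumerated.uniform((1 to 6).toList)
  val twoDice = die.flatMap(a => die.map(b => a + b))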

Alternatively, we can use a sampling-based representation. This approach works with continuous domains, and the user can draw as many samples as they like to reach the desired accuracy. Still, this option is not ideal either. Inference algorithms have been studied for many years, and this work resembles compiler optimisation: we would like to build a system that inspects the structure of a probabilistic program and applies the algorithm best suited to it. Just as with a compiler, we can represent a probabilistic program as an abstract syntax tree, which optimisation algorithms can then manipulate.
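A sampling-based representation might be sketched like this (again an illustrative assumption, not a particular library's code):

  import scala.util.Random

  // A distribution is just a way of drawing one sample; this also works
  // for continuous domains, where exact enumeration cannot.
  trait Sampled[A] { self =>
    def sample(): A

    def flatMap[B](f: A => Sampled[B]): Sampled[B] =
      new Sampled[B] { def sample(): B = f(self.sample()).sample() }

    def map[B](f: A => B): Sampled[B] =
      new Sampled[B] { def sample(): B = f(self.sample()) }
  }

  object Sampled {
    private val rng = new Random()

    def normal(mean: Double, stdDev: Double): Sampled[Double] =
      new Sampled[Double] { def sample(): Double = mean + stdDev * rng.nextGaussian() }

    def uniform[A](values: Vector[A]): Sampled[A] =
      new Sampled[A] { def sample(): A = values(rng.nextInt(values.size)) }
  }

  // Draw as many samples as needed for the desired accuracy, e.g. a simple mixture:
  // val mixture = Sampled.uniform(Vector(0.0, 10.0)).flatMap(m => Sampled.normal(m, 1.0))
  // val samples = List.fill(10000)(mixture.sample())

To enable the kind of optimisation described above, the same interface can instead be reified as an abstract syntax tree (one common approach is a free monad) that inference algorithms inspect rather than execute directly.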

At this point we could start building inference algorithms, but this is where I will stop explaining, for two reasons. First, I expect that a reader who is not a machine learning specialist will not be terribly interested in the details of inference algorithms. Second, and more importantly, this is still the cutting edge of research. Current systems implement general-purpose inference algorithms, but they make almost no problem-specific optimisations. The general principle, as with compilers, is that the more information you preserve, the more you can optimise, yet most optimisations are useful only in a handful of programs. It remains largely an open question how far we can rely on generalised inference algorithms plus raw computing power to make inference fast enough, and how much it is worth labouring over intricate mathematics to optimise particular cases.

Conclusion and further work

I am very excited about probabilistic programming because I believe it can radically simplify inference, and that would be useful in a great many fields; I have already given a few examples above. In most domains the question of speed is not so critical, so I am more interested in studying generalised inference algorithms and parallel/distributed implementations than in intricate mathematics. However, I like to ground things in practice first, so I am looking for people who have real problems to solve.

Probabilistic programming also sits nicely at the intersection of my interests in machine learning and programming languages. If you are interested, here are the talk, the slides, and the code.

Source: https://habr.com/ru/post/322760/

