A bit of preamble
More than a year and a half ago, news went around that "DARPA intends to revolutionize machine learning." Of course, DARPA merely allocated money for a research program related to probabilistic programming. Probabilistic programming itself has existed and developed without DARPA for quite a long time; research has been conducted both in leading universities, such as MIT, and in large corporations, such as Microsoft. And it is not for nothing that DARPA, Microsoft, MIT, and others pay close attention to this area: it is genuinely promising for machine learning, and perhaps for artificial intelligence in general. It is said that probabilistic programming will play the same role for machine learning that high-level languages played for ordinary programming. We would draw another parallel: with the role Prolog played for good old-fashioned AI. So far, the Russian-language web contains only scattered references on this topic, mostly limited to descriptions of general principles. Perhaps this is because the potential of probabilistic programming is only beginning to unfold and it has not yet become the main trend. So what are probabilistic languages capable of, now or in the future?
Two main classes of probabilistic programming languages can be distinguished: languages that allow specifying generative models only in the form of Bayesian networks (or other graphical probabilistic models), and Turing-complete languages.
A typical representative of the first class is Infer.NET, developed by Microsoft. Because its generative models are Bayesian networks, the efficient inference methods known for those models can be applied. Naturally, using a well-known class of models with well-known inference methods does not make it possible to solve fundamentally new problems (and even such generative models as deep learning networks based on restricted Boltzmann machines are not representable in such languages), but it does give a quite practical tool. As the developers say, with this tool a non-trivial probabilistic model, such as a full Bayesian version of principal component analysis, can be implemented in a couple of hours and a couple dozen lines of code, whereas a separate implementation of an efficient inference procedure in an ordinary language would require significantly more knowledge and several weeks of work. Thus, probabilistic programming makes the use of graphical models much simpler and more accessible.
However, Turing-complete probabilistic languages have much greater potential. They allow you to go beyond the class of problems that existing machine learning methods can already solve. Naturally, such languages face the problem of efficient inference, which is still far from solved and which leads to poor scalability on real-world problems. Nevertheless, this direction is being actively developed, and there are a number of works showing how to achieve efficient inference for interesting practical problems in general-purpose probabilistic languages. One can hope that in the near future these solutions will become available in specific languages. In addition, Turing-complete probabilistic languages are already very useful in research related to cognitive modeling and artificial general intelligence. For these reasons, we will consider the basic principles of probabilistic programming using the example of Turing-complete languages, among which we chose Church, an extension of Lisp (more precisely, of its dialect Scheme). The convenience of this language (at least for an initial acquaintance) lies in the existence of a web implementation (web-church) that one can experiment with without installing any additional software.
So, to the point
A program in a probabilistic language may, at first glance, be no different from a program in an ordinary language. That is how Church works. As in regular Lisp, one can define variables and functions that perform deterministic computations. For example, the following program defines a function of one argument that computes the factorial by the recursive formula n! = n * (n - 1)!, and calls this function for n = 10:
(define (f n)
  (if (= n 0) 1 (* n (f (- n 1)))))
(f 10)
A program in this language may also contain calls to (pseudo)random functions. For example, the call (flip 0.3) returns #t with probability 0.3 and #f with probability 0.7. Such a function is trivially implemented in Lisp as
(define (flip p) (< (random) p))
Church, like other probabilistic languages, includes many built-in functions that return random values drawn from one distribution or another. For example, (gaussian x0 s) returns a real random value distributed as a Gaussian with the given parameters. Other distributions typically implemented include the uniform, multinomial, Dirichlet, beta, and gamma distributions. None of these distributions is particularly difficult to implement by hand in an ordinary language, and so far there is no fundamental difference between Church and Lisp.
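For instance, a Gaussian sampler can be built by hand from a uniform source via the Box-Muller transform. A minimal sketch in plain Scheme, assuming (as in the flip definition above) that (random) returns a uniform value in [0, 1); the name my-gaussian is ours:

(define (my-gaussian x0 s)
  ;; Box-Muller: turn two uniform draws into one Gaussian draw
  ;; (ignoring the rare case u1 = 0)
  (let ((u1 (random)) (u2 (random)))
    (+ x0 (* s (sqrt (* -2 (log u1))) (cos (* 2 3.141592653589793 u2))))))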
However, besides the usual semantics, a Church program has probabilistic semantics, within which it is assumed that a program containing calls to random functions does not merely generate some specific values of random variables when it runs, but defines a probability distribution over them. Thus, (gaussian x0 s) is not just a function returning a particular value of a Gaussian-distributed random variable, but the Gaussian distribution itself.
But how can one obtain the probability distributions defined by a program? Imagine, for example, the program
(if (flip 0.4) (flip 0.1) (flip 0.6))
That is, with probability 0.4 the value of this expression is distributed as P(#t) = 0.1, P(#f) = 0.9, and with probability 0.6 as P(#t) = 0.6, P(#f) = 0.4. The resulting distribution defined by the whole expression is therefore P(#t) = 0.4 * 0.1 + 0.6 * 0.6 = 0.4 and P(#f) = 0.6. But how do we obtain such distributions in practice? This probabilistic semantics is most often implemented through sampling: we can simply run the program many times and build a sample of the results of its execution. Such a procedure, of course, is also easy to implement in an ordinary language (indeed, even Simula-67 was regularly used in exactly this way for modeling stochastic processes).
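As a sketch (using the repeat and hist functions that appear a little later in this article), the distribution of this expression can be estimated simply by running it many times:

(hist (repeat 1000 (lambda () (if (flip 0.4) (flip 0.1) (flip 0.6)))))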
However, modern probabilistic languages go further and add to the sampling process conditions imposed on the results of the program run. This idea leads to the simplest rejection sampling, which in Church is implemented by the rejection-query function. This function takes as input a probabilistic program (a set of define's) whose penultimate expression computes the return value and whose last expression is a condition (predicate) that must be true in that run. Consider the program
(rejection-query
  (define A (flip 0.4))
  (define B (flip 0.6))
  B
  (or A B))
rejection-query executes the program given to it until the final condition is satisfied - here (or A B) - and returns (once) the value of the penultimate expression - here B. To obtain a sample of values, you can use the repeat function. Church also has built-in functions for plotting histograms. Consider a slightly extended program:
(define (get-sample)
  (rejection-query
    (define A (flip 0.4))
    (define B (flip 0.6))
    B
    (or A B)))
(hist (repeat 1000 get-sample))
Running it, we get roughly the following result: #f - 21%, #t - 79% (the numbers vary slightly from run to run). This means that the value of B is #t with a probability slightly below 0.8. Where did this probability come from, if in the program B is a binary random variable with P(#t) = 0.6? The point, obviously, is the imposed condition (or A B). During sampling we keep only those runs in which A or B (or both) is true. In effect, we are computing the posterior probability P(B | A+B), where "+" denotes logical OR. We could apply Bayes' rule to calculate this probability by hand:
P(B | A+B) = P(A+B | B) P(B) / P(A+B) =
= (P(A | B) + P(B | B) - P(A | B) P(B | B)) P(B) / (P(A) + P(B) - P(A) P(B)) =
= (P(A) + 1 - P(A)) P(B) / (P(A) + P(B) - P(A) P(B)) = 0.6 / (0.4 + 0.6 - 0.4 * 0.6) = 0.789.
However, even for such an elementary program, manually applying Bayes' rule takes some time, and for non-trivial programs an analytical calculation may not be possible at all.
So, sampling allows us to compute posterior probabilities of the random variables of interest under given conditions. It replaces Bayes' rule, which is widely used in machine learning for model selection and prediction. At the same time, for many people, writing a program in a probabilistic language may be much clearer than applying Bayes' rule. Of course, rejection sampling itself can be implemented very simply in a conventional programming language, but probabilistic languages are not limited to it.
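For illustration, a hand-rolled version of this rejection scheme in plain Scheme might look like the following minimal sketch (my-rejection-sample is a hypothetical helper, not part of Church; the thunk returns a pair of the value of interest and the condition flag computed in the same run):

(define (my-rejection-sample sample-thunk)
  (let ((draw (sample-thunk)))
    (if (cdr draw)                         ; keep the run only if the condition held
        (car draw)                         ; return the value of interest
        (my-rejection-sample sample-thunk)))) ; otherwise retry

;; one posterior draw of B given (or A B), as in the example above
(my-rejection-sample
 (lambda ()
   (let ((A (flip 0.4)) (B (flip 0.6)))
     (cons B (or A B)))))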
Church, in particular, implements another query function, enumeration-query. Run the program
(enumeration-query
  (define A (flip 0.4))
  (define B (flip 0.6))
  B
  (or A B))
The output is ((#t #f) (0.7894736842105263 0.2105263157894737)). Here the exact values (up to finite machine precision, of course) of the probabilities P(B | A+B) are produced. enumeration-query no longer simply runs the program many times; it analyzes the paths of its execution and enumerates all possible combinations of values of the random variables, taking their probabilities into account. Of course, such "sampling" only works when the set of possible combinations of values of the random variables is not too large.
There is also a more advanced replacement for rejection sampling, based on MCMC (Markov chain Monte Carlo), namely the Metropolis-Hastings algorithm - hence the name of the procedure, mh-query. This query procedure generates a specified number of samples at once (and also takes one additional input parameter, the lag). This procedure is non-trivial to implement, so using a ready-made probabilistic language (rather than one's own implementation of simple sampling procedures in an ordinary language) makes sense.
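For the same toy example, a call might look like this (a sketch: 1000 samples with a lag of 100, by analogy with the mh-query calls used later in the article):

(hist
 (mh-query 1000 100
   (define A (flip 0.4))
   (define B (flip 0.6))
   B
   (or A B)))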
However, the main thing that probabilistic programming gives is the style of thinking.
From basics to application
Different developers find different uses for probabilistic programming. Many use it directly to solve machine learning problems. The authors of the Church language, Noah D. Goodman and Joshua B. Tenenbaum, in their web book Probabilistic Models of Cognition, show the use of probabilistic programming for cognitive modeling. It is also known how conveniently planning problems can be represented as inference in probabilistic languages. It also turns out to be applicable to knowledge representation and inference over it, as well as to computer perception tasks (including image recognition). All these applications are still more or less fragmented, but the existence of a common framework for all of them suggests that probabilistic programming could become a "grand unification theory" for AI. Let's look at the simplest examples of possible use.
One of the most classic applications of expert systems is medical diagnostics. In particular, the MYCIN system was built on rules of the form:
Rule 52:
If
- THE SITE OF THE CULTURE IS BLOOD
- THE GRAM OF THE ORGANISM IS NEG
- THE MORPHOLOGY OF THE ORGANISM IS ROD
- THE BURN OF THE PATIENT IS SERIOUS
Then there is weakly suggestive evidence (0.4) that
- THE IDENTITY OF THE ORGANISM IS PSEUDOMONAS
Obviously, rules of this type are easily expressed in a language like Church. Moreover, there is no need to implement the inference procedure as well - it is enough simply to write down the system of rules. Here is an example from the book Probabilistic Models of Cognition mentioned above:
(define samples
  (mh-query 1000 100
    (define lung-cancer (flip 0.01))
    (define TB (flip 0.005))
    (define cold (flip 0.2))
    (define stomach-flu (flip 0.1))
    (define other (flip 0.1))
    (define cough
      (or (and cold (flip 0.5)) (and lung-cancer (flip 0.3))
          (and TB (flip 0.7)) (and other (flip 0.01))))
    (define fever
      (or (and cold (flip 0.3)) (and stomach-flu (flip 0.5))
          (and TB (flip 0.2)) (and other (flip 0.01))))
    (define chest-pain
      (or (and lung-cancer (flip 0.4)) (and TB (flip 0.5))
          (and other (flip 0.01))))
    (define shortness-of-breath
      (or (and lung-cancer (flip 0.4)) (and TB (flip 0.5))
          (and other (flip 0.01))))
    (list lung-cancer TB)
    (and cough fever chest-pain shortness-of-breath)))
(hist samples "Joint inferences for lung cancer and TB")
In this program, the prior probabilities that the patient has lung cancer, tuberculosis, a cold, and so on are defined first. Then the probabilities of observing cough, fever, chest pain, and shortness of breath given particular diseases are defined. The return value is a pair of boolean values indicating whether the patient has lung cancer and/or tuberculosis. Finally, the condition is the combination of observed symptoms (that is, sampling is performed under the condition that the variables cough, fever, chest-pain, and shortness-of-breath are all #t).
The result of the program will be as follows: (#f #f) - 4%, (#f #t) - 58%, (#t #f) - 37%, (#t #t) - 1%.
It is easy to turn samples into a function that takes a list of symptoms and uses them as the condition in mh-query, which would make it possible to produce diagnoses for different patients. Of course, this example is greatly simplified, but it shows that in the probabilistic programming style one can represent knowledge and perform inference over it.
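A possible sketch of such a parameterized query (the diagnose wrapper and its argument names are ours, not from the book; the model itself is the one above):

(define (diagnose cough-obs fever-obs chest-pain-obs breath-obs)
  (mh-query 1000 100
    (define lung-cancer (flip 0.01))
    (define TB (flip 0.005))
    (define cold (flip 0.2))
    (define stomach-flu (flip 0.1))
    (define other (flip 0.1))
    (define cough
      (or (and cold (flip 0.5)) (and lung-cancer (flip 0.3))
          (and TB (flip 0.7)) (and other (flip 0.01))))
    (define fever
      (or (and cold (flip 0.3)) (and stomach-flu (flip 0.5))
          (and TB (flip 0.2)) (and other (flip 0.01))))
    (define chest-pain
      (or (and lung-cancer (flip 0.4)) (and TB (flip 0.5))
          (and other (flip 0.01))))
    (define shortness-of-breath
      (or (and lung-cancer (flip 0.4)) (and TB (flip 0.5))
          (and other (flip 0.01))))
    (list lung-cancer TB)
    ;; condition on the symptoms observed for this particular patient
    (and (equal? cough cough-obs) (equal? fever fever-obs)
         (equal? chest-pain chest-pain-obs)
         (equal? shortness-of-breath breath-obs))))

;; e.g. a patient with cough and fever but no chest pain or shortness of breath
(hist (diagnose #t #t #f #f) "Diagnoses for this patient")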
Naturally, machine learning problems can be solved in the same style. The only difference is that the unknown quantities will be the parameters of the model itself, and the condition for sampling will be that the model generates the training sample. For example, in the program above we would replace the numbers in lines of the form (define lung-cancer (flip 0.01)) with variables declared as random, for example (define p-lung-cancer (uniform 0 1)), and then the value of lung-cancer for each patient in the training set would be drawn with probability p-lung-cancer.
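As a minimal sketch of this idea (the five-patient training sample and the variable names here are hypothetical, purely for illustration):

(define p-samples
  (mh-query 1000 100
    (define p-lung-cancer (uniform 0 1))
    ;; one lung-cancer variable per patient in the training set
    (define lung-cancer-per-patient
      (repeat 5 (lambda () (flip p-lung-cancer))))
    p-lung-cancer
    ;; condition: in this toy training set only the first patient had cancer
    (equal? lung-cancer-per-patient (list #t #f #f #f #f))))
(hist p-samples "posterior over p-lung-cancer")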
Let us consider this possibility using a simple example: estimating the parameters of a polynomial from a set of points. In the following program, the calc-poly function computes the value of a polynomial with coefficients ws at the point x. The generate function applies calc-poly to each value in the given list xs and returns the list of corresponding ordinates. The noisy-equals? procedure "approximately" compares two given values: if they are equal, it returns #t with probability 1; if they are not, then the more they differ, the less likely it is to return #t.
(define (calc-poly x ws)
  (if (null? ws)
      0
      (+ (car ws) (* x (calc-poly x (cdr ws))))))
(define (generate xs ws)
  (map (lambda (x) (calc-poly x ws)) xs))
(define (noisy-equals? x y)
  (flip (exp (* -3 (expt (- x y) 2)))))
(define (samples xs ys)
  (mh-query 1 100
    (define n-coef 4)
    (define ws (repeat n-coef (lambda () (gaussian 0 3))))
    ws
    (all (map noisy-equals? (generate xs ws) ys))))
(samples '(0 1 2 3 4) '(0.01 1.95 6.03 12.01 20.00))
Inside the mh-query call, the n-coef parameter determines the number of coefficients of the polynomial (that is, its degree plus one); ws is a list of random values generated according to the normal distribution. The return value is the list of polynomial coefficients. The condition for sampling is the "approximate" equality of all the given values ys to the ordinates generated by the polynomial with coefficients ws. Here we request only one realization that satisfies the condition (since it is not very convenient to build a histogram for a vector of parameters). The result of this query may be, for example, the list (2.69 1.36 0.53 -0.10), which defines the polynomial 2.69 + 1.36x + 0.53x^2 - 0.10x^3.
In general, inference over models with real-valued parameters is not the strongest side of the Church language (but this should not be considered a flaw of probabilistic programming in general). Still, in this example mh-query works after a fashion. To verify this, instead of asking for the parameter values in the query, you can ask for a prediction at some point. Rewrite the last code fragment like this:
(define (samples xs ys)
  (mh-query 100 100
    (define n-coef 4)
    (define ws (repeat n-coef (lambda () (gaussian 0 3))))
    (calc-poly 5 ws)
    (all (map noisy-equals? (generate xs ws) ys))))
(hist (samples '(0 1 2 3 4) '(0.01 1.95 6.03 12.01 20.00)))
That is, we request the most probable (given the available data) value at the point x = 5. For different runs, the maximum of the histogram unfortunately falls on slightly different values (the MCMC method theoretically guarantees convergence to the true distribution, but only in the limit), yet usually these values are quite reasonable. Note that here we got a full Bayesian prediction "for free" (by replacing one line): instead of choosing the best model and predicting with it alone, we obtained the posterior distribution of values at x = 5, averaged over many models weighted by their own probabilities.
But that is not all. Replacing one more line - (define n-coef 4) -> (define n-coef (random-integer 5)) - we can make the choice between models with different numbers of parameters automatic. Moreover, sampling the value of n-coef shows (although not very stably) that the most likely value is n-coef = 3 (that is, a parabola, which is indeed what the given set of points was generated from). With this modification, the prediction becomes more stable; in other words, there is no overfitting! Why are polynomials of higher degree not chosen, even though they can pass through the given points more precisely? The point is that, during sampling, it is easier to "guess" suitable parameter values for a polynomial of lower degree than for one of higher degree, so the probability of generating parameters that pass the test is higher for a second-degree polynomial than for a third-degree one. At the same time, a first-degree polynomial gives large deviations, for which the probability that noisy-equals? fires drops sharply.
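For reference, the modified query with this replacement, asking for n-coef itself, might look as follows (a sketch, assuming the calc-poly, generate, and noisy-equals? definitions from the snippet above):

(define (samples xs ys)
  (mh-query 100 100
    (define n-coef (random-integer 5))
    (define ws (repeat n-coef (lambda () (gaussian 0 3))))
    n-coef
    (all (map noisy-equals? (generate xs ws) ys))))
(hist (samples '(0 1 2 3 4) '(0.01 1.95 6.03 12.01 20.00)))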
Let's look at another application, which may seem unexpected within the framework of probabilistic programming: solving "deductive" problems. Take the factorial function given at the beginning, but instead of calling it with a fixed argument, assume that the argument is a random variable while the value of the factorial itself is subject to a constraint:
(define (f n)
  (if (= n 0) 1 (* n (f (- n 1)))))
(enumeration-query
  (define n (random-integer 20))
  n
  (equal? (f n) 120))
In response we get n = 5 with probability 1. If we specify 100 instead of 120, the program will not hang (unlike in the case of rejection-query or mh-query, which would loop forever - arguably a disadvantage of those), but will simply return an empty set. And the condition need not be a strict equality; some other restriction can be imposed.
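For instance, a range constraint instead of an exact equality (a sketch; with the bounds chosen here the query should return n = 5 and n = 6, since 5! = 120 and 6! = 720 are the only factorials strictly between 100 and 1000):

(define (f n)
  (if (= n 0) 1 (* n (f (- n 1)))))
(enumeration-query
  (define n (random-integer 20))
  n
  (and (> (f n) 100) (< (f n) 1000)))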
In the same way, more complex problems can be solved. Suppose we want to solve the subset sum problem: find a subset of a given set of numbers whose sum equals a given number (usually that number is zero and the subset is required to be non-empty; to avoid checking for the trivial solution, we take a nonzero sum instead). What does probabilistic programming have to do with this, you might ask? Random variables are simply unknown quantities (for which prior probabilities are given), and in any problem we need to find something unknown, including in the subset sum problem. Consider the following elementary program (it could even be simplified by writing summ through a fold; a shorter variant is sketched after the explanation below).
(define (solution xs v)
  (rejection-query
    (define ws (repeat (length xs) flip))
    (define (summ xs ws)
      (if (null? xs)
          0
          (+ (if (car ws) (car xs) 0)
             (summ (cdr xs) (cdr ws)))))
    ws
    (equal? (summ xs ws) v)))
(solution '(-1 3 7 5 -9 -1) 1)
Here ws is a list of random boolean values. The summ procedure computes the sum of those elements of xs for which the corresponding elements of ws are true. Then we query the values of ws for which the resulting sum equals the given number v. Running this program, you may get, for example, the result (#f #t #t #f #t #f), which is indeed correct (since 3 + 7 - 9 = 1).
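The shorter variant of summ mentioned above might look like this (a sketch using the multi-list map already used in the polynomial example and a plain apply with +; the behaviour is assumed to be equivalent to the recursive version):

(define (summ xs ws)
  ;; keep an element of xs when the matching flag in ws is true, then add up
  (apply + (map (lambda (x w) (if w x 0)) xs ws)))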
Naturally, Church does not work miracles, and it will not cope if the dimensionality of this problem is increased. Still, it is remarkable that such different AI problems can at least be posed (and partially solved) in the same language. As for the problem of efficient inference, it remains what it was; in probabilistic languages it is at least exposed in its purest form.