
Bayesian analysis in Python

This post is a logical continuation of my first post on Bayesian methods, which can be found here.
I would like to discuss in detail how to carry out such an analysis in practice.

As I already mentioned, the most popular tool for Bayesian analysis is the R language with the JAGS and/or BUGS packages. For an ordinary user, the packages differ in that JAGS is cross-platform (I have first-hand experience of a conflict between Ubuntu and BUGS), and in that JAGS lets you define your own functions and distributions (a need that arises infrequently, if ever). By the way, an excellent and convenient IDE for R is, for example, RStudio.
But in this post I will write about an alternative to R - Python with the pymc module.
As a convenient IDE for Python, I can offer spyder .
I prefer Python because, firstly, I see no reason to learn a language as uncommon as R just to solve a single statistics-related problem. Secondly, from my point of view, Python with its modules is not inferior to R in simplicity, clarity or functionality.

I propose to solve a simple problem: finding the coefficients of a linear dependence in data. Similar parameter-fitting problems are common in the most diverse fields of knowledge, so I would say it is quite illustrative. At the end we will be able to compare the result of the Bayesian analysis with that of the least squares method.

Software installation


First of all, we need to install Python (for those who have not done so yet). I did not test Python 3.3, but with 2.7 everything works fine.
Then you need to install all the necessary modules for Python: numpy , Matplotlib .
If you wish, you can also install additional modules: scipy , pyTables , pydot , IPython and nose .
All of these packages (except Python itself) are easiest to install through setuptools.
And now you can install pymc itself (it can also be installed via setuptools).
A detailed guide on pymc can be found here .

Task statement


We have data obtained in a hypothetical experiment that depend linearly on some quantity x. The data arrive with noise whose variance is unknown. We need to find the coefficients of the linear dependence.

Solution


First, we import the modules that we have installed and which we will need:

    import numpy
    import pymc

Then we need to generate our hypothetical linear data.
To do this, we decide how many points we want to have (in this case 20), distribute them evenly on the interval [0, 10], and set the true coefficients of the linear dependence. Then we add Gaussian noise to the data:

    #Generate data with noise
    number_points = 20
    true_coefficients = [10.4, 5.5]
    x = numpy.linspace(0, 10, number_points)
    noise = numpy.random.normal(size = number_points)
    data = true_coefficients[0]*x + true_coefficients[1] + noise

So, we have the data, and now we need to think about how we want to conduct the analysis.
First, we know (or assume) that the noise is Gaussian, which means the likelihood function will be Gaussian. It has two parameters: mean and variance. Since the mean of the noise is zero, the mean of the likelihood is given by the model value (and the model is linear, so it has two parameters). The variance is unknown to us, so it becomes another parameter.
As a result, we have three parameters and a Gaussian likelihood function.
We know nothing about the values of the parameters, so we assume a priori that they are uniformly distributed with arbitrary boundaries (these boundaries can be pushed out as far as desired).
When specifying a prior distribution, we must give a label by which we will later look up the posterior values of the parameter (the first argument), as well as the boundaries of the distribution (the second and third arguments). All of these arguments are required (there are optional arguments, which can be found in the documentation).

    sigma = pymc.Uniform('sigma', 0., 100.)
    a = pymc.Uniform('a', 0., 20.)
    b = pymc.Uniform('b', 0., 20.)

Now we need to define our model. In pymc there are two most frequently used classes: deterministic and stochastic. If, given the input data, it is possible to uniquely determine the value(s) that the model returns, then it is a deterministic model. In our case, given the coefficients of the linear dependence, we can uniquely determine the result for any point, so it is a deterministic model:

    @pymc.deterministic(plot=False)
    def linear_fit(a=a, b=b, x=x):
        return a*x + b

And finally, we define the likelihood function, in which the mean is the model value, sigma is the parameter with the prior we defined, and data is our experimental data:

 y = pymc.Normal('y', mu=linear_fit, tau=1.0/sigma**2, value=data, observed=True) 

So, the whole model.py file looks like this:

    import numpy
    import pymc

    #Generate data with noise
    number_points = 20
    true_coefficients = [10.4, 5.5]
    x = numpy.linspace(0, 10, number_points)
    noise = numpy.random.normal(size = number_points)
    data = true_coefficients[0]*x + true_coefficients[1] + noise

    #PRIORs:
    #as sigma is unknown we define it as a parameter:
    sigma = pymc.Uniform('sigma', 0., 100.)
    #fitting the line y = a*x + b, hence the coefficients are parameters:
    a = pymc.Uniform('a', 0., 20.)
    b = pymc.Uniform('b', 0., 20.)

    #define the model: if a, b and x are given, the return value is determined, hence the model is deterministic:
    @pymc.deterministic(plot=False)
    def linear_fit(a=a, b=b, x=x):
        return a*x + b

    #LIKELIHOOD
    #normal likelihood with observed data (with noise), model value and sigma
    y = pymc.Normal('y', mu=linear_fit, tau=1.0/sigma**2, value=data, observed=True)


Now I propose to make a small theoretical digression and discuss what pymc actually does.

From a mathematician's point of view, we need to solve the following problem:
p(a, b, sigma | Data) = p(Data | a, b, sigma) * p(a, b, sigma) / p(Data)

Since a, b and sigma are independent, we can rewrite the equation as follows:
p(a, b, sigma | Data) = p(Data | a, b, sigma) * p(a) * p(b) * p(sigma) / p(Data)

On paper, the problem looks very simple, but when we solve it numerically (and keep in mind that we want to solve any problem of this class numerically, not only those with particular types of distributions), difficulties arise.

p(Data) is, as discussed in my previous post, a constant.
p(Data | a, b, sigma) is given explicitly (that is, with known a, b and sigma we can uniquely compute the probability of the data we have).
But instead of p(a), p(b) and p(sigma) we, in fact, only have generators of pseudo-random numbers distributed according to the laws we specified.
How do we get the posterior distribution out of all this? That's right: generate (sample) a, b and sigma, and then compute p(Data | a, b, sigma). As a result we get a chain of values which is a sample from the posterior distribution. But this raises the question of how to draw this sample correctly. If our posterior distribution has several modes ("hills"), how can we generate a sample that covers all of them? In other words, the task is to draw a sample that "covers" the whole distribution in the smallest number of iterations. There is a family of algorithms for this, the most widely used being MCMC (Markov chain Monte Carlo). A Markov chain is a sequence of random events in which each element depends on the previous one, but not on the elements before it. I will not describe the algorithm itself (that may be the topic of a separate post); I will just note that pymc implements such an algorithm and as a result produces a Markov chain, which is a sample from the posterior distribution. Generally speaking, if we do not want the chain to be Markov (i.e. we want to reduce the correlation between successive elements), we simply need to "thin" it, for example by keeping only every second element.
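
As a tiny illustration of such thinning (my own sketch, not part of the article's scripts): once a chain is available as a NumPy array, keeping every second element is ordinary slicing.

    import numpy

    # a hypothetical chain of sampled values (random numbers here, just for illustration)
    chain = numpy.random.normal(size=1000)

    # keep every second element to weaken the correlation between neighbouring samples
    thinned_chain = chain[::2]

    print len(chain), len(thinned_chain)  # 1000 500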
So, we create a second file, let's call it run_model.py, in which we will generate the Markov chain. The model.py and run_model.py files must be in the same folder; otherwise you need to add the following code to run_model.py:

    from sys import path
    path.append("/////model.py/")

First we import some modules that will be useful to us:

    from numpy import polyfit
    from matplotlib.pyplot import figure, plot, show, legend
    import pymc
    import model

polyfit implements the least squares method; we will compare the Bayesian analysis with it.
figure, plot, show, legend are needed to build the final plot.
model is, in fact, our model.

Then we create the MCMC object and run the sample:

    D = pymc.MCMC(model, db = 'pickle')
    D.sample(iter = 10000, burn = 1000)

D.sample takes two arguments (in fact, you can specify more): the number of iterations and burn-in (let's call it the "warm-up period"). The warm-up period is the number of initial iterations that are discarded. The point is that MCMC initially depends on the starting point (this is a property of the method), so we need to cut off this period of dependence.
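
Incidentally (a small sketch assuming pymc 2.x, where sample also accepts a thin argument), the thinning mentioned earlier can be requested directly when sampling:

    # 10000 iterations, discard the first 1000 as burn-in, then keep every 2nd remaining sample
    D.sample(iter=10000, burn=1000, thin=2)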

This completes our analysis.
Now we have an object D that holds the sample and has various methods (functions) for computing the parameters of that sample (mean, most probable value, variance, etc.).
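
For example (a minimal sketch assuming the pymc 2.x trace API), the chain for each parameter can be pulled out of D and summarised with NumPy, or via D.stats():

    import numpy

    a_samples = D.trace('a')[:]   # the chain of sampled values for parameter a
    print numpy.mean(a_samples)   # posterior mean of a
    print numpy.std(a_samples)    # posterior standard deviation of a
    print D.stats()['a']          # pymc's own summary statistics for a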

To compare the results, we also perform a least squares fit:

 chisq_result = polyfit(model.x, model.data, 1) 

Now we print all the results:

 print "\n\nResult of chi-square result: a= %f, b= %f" % (chisq_result[0], chisq_result[1]) print "\nResult of Bayesian analysis: a= %f, b= %f" % (Davalue, Dbvalue) print "\nThe real coefficients are: a= %f, b= %f\n" %(model.true_coefficients[0], model.true_coefficients[1]) 

Now we plot the standard pymc graphs:

 pymc.Matplot.plot(D) 

And finally, we build our final plot:

    figure()
    plot(model.x, model.data, marker='+', linestyle='')
    plot(model.x, D.a.value * model.x + D.b.value, color='g', label='Bayes')
    plot(model.x, chisq_result[0] * model.x + chisq_result[1], color='r', label='Chi-squared')
    plot(model.x, model.true_coefficients[0] * model.x + model.true_coefficients[1], color='k', label='Data')
    legend()
    show()

Here is the full content of the run_model.py file:

    from numpy import polyfit
    from matplotlib.pyplot import figure, plot, show, legend
    import pymc
    import model

    #Define MCMC:
    D = pymc.MCMC(model, db = 'pickle')
    #Sample MCMC: 10000 iterations, burn-in period is 1000
    D.sample(iter = 10000, burn = 1000)

    #compute chi-squared fitting for comparison:
    chisq_result = polyfit(model.x, model.data, 1)

    #print the results:
    print "\n\nResult of chi-squared fit: a= %f, b= %f" % (chisq_result[0], chisq_result[1])
    print "\nResult of Bayesian analysis: a= %f, b= %f" % (D.a.value, D.b.value)
    print "\nThe real coefficients are: a= %f, b= %f\n" % (model.true_coefficients[0], model.true_coefficients[1])

    #plot graphs from MCMC:
    pymc.Matplot.plot(D)

    #plot noised data, true line and two fitted lines (bayes and chi-squared):
    figure()
    plot(model.x, model.data, marker='+', linestyle='')
    plot(model.x, D.a.value * model.x + D.b.value, color='g', label='Bayes')
    plot(model.x, chisq_result[0] * model.x + chisq_result[1], color='r', label='Chi-squared')
    plot(model.x, model.true_coefficients[0] * model.x + model.true_coefficients[1], color='k', label='Data')
    legend()
    show()

In the terminal we see the following output:

Result of chi-squared fit: a= 10.321533, b= 6.307100

Result of Bayesian analysis: a= 10.366272, b= 6.068982

The real coefficients are: a= 10.400000, b= 5.500000

Note that since we are dealing with a random process, the values you see on your machine may differ from those above (except for the last line).
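
If you want the run to be reproducible, one option (my own addition, not in the scripts above) is to fix the seed of NumPy's pseudo-random generator at the top of model.py, before the noise is generated; pymc also draws its random numbers from numpy.random, so the chain should then be reproducible as well:

    import numpy

    # fix the seed so that numpy.random.normal (and the sampler) produce the same numbers on every run
    numpy.random.seed(42)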

And in the folder with run_model.py we will see the following plots.
For parameter a:
[image]
For parameter b:
[image]
For parameter sigma:
[image]
On the right we see the histogram of the posterior distribution, and the two panels on the left relate to the Markov chain itself.
I will not dwell on them now. Let me just say that the bottom plot is the autocorrelation plot (you can read more here). It gives an idea of the convergence of the MCMC.
The top plot shows the trace of the sample, i.e. how the sampled values changed over the iterations. The mean of this trace is the mean of the sample (compare the vertical axis of this plot with the horizontal axis of the histogram on the right).
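
If you want to compute the autocorrelation yourself rather than rely on pymc's built-in plot, a rough NumPy sketch (my own illustration, not part of the original script) could look like this:

    import numpy

    def autocorr(chain, lag):
        #sample autocorrelation of a 1-D chain at a given positive lag
        chain = numpy.asarray(chain, dtype=float)
        centred = chain - chain.mean()
        return numpy.sum(centred[:-lag] * centred[lag:]) / numpy.sum(centred**2)

    a_samples = D.trace('a')[:]
    for lag in (1, 5, 10, 50):
        print lag, autocorr(a_samples, lag)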

In conclusion, I will mention one more interesting option.
If you also install the pydot module and add the following line to run_model.py:

 pymc.graph.dag(D).write_png('dag.png') 

then it will create the following image in the folder with run_model.py:

[image]

This is a directed acyclic graph representing our model. The white ellipses are the stochastic nodes (a, b and sigma), the triangles are deterministic nodes, and the darkened ellipse is our pseudo-experimental data.
That is, we can see that the values a and b enter our model (linear_fit), which is itself a deterministic node, and then go to the likelihood function y. Sigma is first defined by a stochastic node, but since the parameter of the likelihood function is not sigma but tau = 1/sigma^2, the stochastic value sigma is first squared (the upper right triangle), and then tau is computed from it; tau then enters the likelihood function together with our generated data.
I think this graph is very useful both for explaining a model and for checking its logic.

Source: https://habr.com/ru/post/170633/

