Is it easy? I tried to find out.
Alexey Kuzmin, director of development and data at DomKlik and a Data Science instructor at Netology, translated Rahul Agarwal's article about how Markov chain Monte Carlo (MCMC) methods help solve problems with a large state space.
Anyone involved in Data Science has heard of Markov chain Monte Carlo (MCMC) methods at some point. Sometimes the topic comes up while studying Bayesian statistics, sometimes when working with tools like Prophet.
But MCMC is hard to understand. Whenever I read about these methods, I noticed that the essence of MCMC was buried under deep layers of mathematical noise, and behind that noise it was difficult to make out. It took me many hours to grasp the concept.
In this article I attempt to explain Markov chain Monte Carlo methods so that it becomes clear what they are used for. I will discuss a few more ways to apply them in my next post.
So, let's begin. The term MCMC consists of two parts: Monte Carlo and Markov chains. Let's talk about each of them.
Monte Carlo
In the simplest terms, Monte Carlo methods can be defined as simple simulation.
Monte Carlo methods get their name from the Monte Carlo casino in Monaco. In many card games you need to know the probability of winning against the dealer. Sometimes calculating this probability is mathematically complex or intractable. But we can always run a computer simulation that plays the whole game many times and treat the probability as the number of wins divided by the number of games played.
That's all you need to know about Monte Carlo methods. Yes, it's just a simple simulation technique with a fancy name.
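To make this concrete, here is a minimal sketch of the idea in Python. The "game" is made up purely for illustration: you and the dealer each draw one card, and the strictly higher card wins.

import random

def play_once():
    # Made-up simplified game: you and the dealer each draw a card from 2..14
    # (ace high); you win if your card is strictly higher than the dealer's.
    you = random.randint(2, 14)
    dealer = random.randint(2, 14)
    return you > dealer

n_games = 100_000
wins = sum(play_once() for _ in range(n_games))
print("Estimated probability of winning:", wins / n_games)  # exact answer is 6/13, about 0.46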
Markov chains
Since the term MCMC consists of two parts, we also need to understand what Markov chains are. But before moving on to Markov chains, let's talk a little about the Markov property.
Suppose there is a system with M possible states, and you move from one state to another. Don't let anything confuse you yet. A specific example of such a system is the weather, which varies between hot, cold and moderate. Another example is the stock market, which jumps between bearish, bullish and stagnant states.
The Markov property says that for a process that is in state X_n at a particular point in time, the probability that X_{n+1} = k (where k is any of the M states the process can move to) depends only on the current state, and not on how the process reached that state.
Mathematically speaking, we can write it in the form of the following formula:
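P(X_{n+1} = k | X_n = k_n, X_{n-1} = k_{n-1}, ..., X_1 = k_1) = P(X_{n+1} = k | X_n = k_n)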
For clarity: you do not care about the sequence of states the market passed through to become “bullish”. The probability that the next state will be “bearish” is determined only by the fact that the market is currently in a “bullish” state. This also makes sense in practice.
A process with the Markov property is called a Markov process. Why are Markov chains important? Because of their stationary distribution.
What is a stationary distribution?
I will try to explain the stationary distribution by calculating it for the example below. Suppose you have a Markov process for the stock market, as shown below.
You have a transition probability matrix that determines the probability of moving from state X_i to state X_j.
Transition probability matrix, Q
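(Rows are the current state, columns are the next state, in the order [bull, bear, stagnant].)

             bull     bear     stagnant
bull         0.90     0.075    0.025
bear         0.15     0.80     0.05
stagnant     0.25     0.25     0.50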
In the transition probability matrix Q above, the probability that the next state will be “bull” given the current state “bull” is 0.9; the probability that the next state will be “bearish” given the current state “bull” is 0.075. And so on.
Well, let's start from some particular state. Our state will be represented by the vector [bull, bear, stagnant]. If we start in the “bearish” state, the vector is [0, 1, 0]. We can calculate the probability distribution of the next state by multiplying the current state vector by the transition probability matrix.
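[0, 1, 0] · Q = [0.15, 0.80, 0.05]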
Notice that the probabilities sum to 1.
The following state distribution can be found by the formula:
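s_{n+1} = s_n · Q, so the distribution after n steps is [0, 1, 0] · Q^n.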
And so on. Eventually you will reach a stationary state in which the distribution stabilizes:
For the matrix of transition probabilities Q described above, the stationary distribution s is:
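s = [0.625, 0.3125, 0.0625]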
You can get the stationary distribution with the following code:
import numpy as np

Q = np.matrix([[0.9, 0.075, 0.025],
               [0.15, 0.8, 0.05],
               [0.25, 0.25, 0.5]])
init_s = np.matrix([[0, 1, 0]])
epsilon = 1
while epsilon > 10e-9:
    next_s = np.dot(init_s, Q)
    epsilon = np.sqrt(np.sum(np.square(next_s - init_s)))
    init_s = next_s
print(init_s)
------------------------------------------------------------------
matrix([[0.62499998, 0.31250002, 0.0625    ]])
You can also start from any other state and you will reach the same stationary distribution. Change the initial state in the code if you want to verify this.
Now we can answer the question of why the stationary distribution is so important.
The stationary distribution is important because it can be used to determine the probability of a system being in a certain state at a random time.
For our example, we can say that in 62.5% of cases the market will be in a “bullish” state, in 31.25% of cases in a “bearish” state, and in 6.25% of cases in stagnation.
Intuitively, you can think of it as a random walk along a chain.
Random walk
You are at a certain point and choose the next state according to the probability distribution of the next state given the current state. We visit some states more often than others, depending on the probabilities of those states.
This is how Google solved the search problem at the dawn of the Internet. The problem was ranking pages according to their importance, and Google solved it with the PageRank algorithm. In Google's PageRank, each state is a page, and the probability of a page in the stationary distribution is its relative importance.
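As an illustration only (a toy sketch, not Google's actual implementation), here is a PageRank-style computation on a made-up link graph; the damping factor 0.85 is the commonly used default:

import numpy as np

# Pages are states, links define transition probabilities, and the stationary
# distribution gives each page's relative importance.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # made-up link graph: page -> pages it links to
n = 4
damping = 0.85

# Build the transition matrix of a "random surfer" who follows links uniformly.
P = np.zeros((n, n))
for page, outlinks in links.items():
    for target in outlinks:
        P[page, target] = 1.0 / len(outlinks)

# With probability (1 - damping) the surfer jumps to a random page instead.
G = damping * P + (1 - damping) / n

# Power iteration: repeatedly multiply by G until the distribution stabilizes.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = rank @ G
print(rank)  # stationary distribution ~ relative importance of each page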
Now let's turn directly to MCMC methods.
What are Markov chain Monte Carlo (MCMC) methods?
Before answering what MCMC is, let me ask one question. We know about the beta distribution and we know its probability density function. But can we draw a sample from this distribution? Can you think of a way to do it?
Think ...
MCMC allows you to sample from any probability distribution. This is especially important when you need to sample from a posterior distribution.
The figure shows Bayes' theorem:
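p(H | D) = p(D | H) · p(H) / p(D)

Here p(H) is the prior, p(D | H) is the likelihood, and p(D) is the evidence (the normalizing constant).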
For example, you need to sample from the posterior distribution. But is it easy to compute the posterior along with the normalizing constant (the evidence)? In most cases you can express it through the likelihood and the prior probability, but computing the normalizing constant p(D) does not work out. Why? Let's look at this in more detail.
Suppose H takes only 3 values:
p(D) = p(H = H1) · p(D | H = H1) + p(H = H2) · p(D | H = H2) + p(H = H3) · p(D | H = H3)
In this case, p(D) is easy to calculate. But what if H is continuous? Would it be just as easy, especially if H could take infinitely many values? You would have to solve a complex integral:
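p(D) = ∫ p(D | H) · p(H) dH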
We want to sample from the posterior distribution while treating p(D) simply as a constant.
Wikipedia writes:
Markov chain Monte Carlo methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution. The quality of the sample improves as the number of steps increases.
Let's work through an example. Suppose you need a sample from the beta distribution. Its density is:
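f(x) = C · x^(α - 1) · (1 - x)^(β - 1),   x ∈ [0, 1],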
where C is the normalizing constant. It is actually a function of α and β, but I want to show that it is not needed to sample from the beta distribution, so we will treat it simply as a constant.
Sampling from the beta distribution directly is a genuinely hard, if not practically intractable, problem. In practice you may have to work with even more complex distribution functions, and sometimes you will not know the normalizing constants.
MCMC methods make life easier by providing algorithms that construct a Markov chain with the beta distribution as its stationary distribution, given that we can sample from a uniform distribution (which is relatively simple).
If we start from a random state and repeatedly move to the next state according to a certain algorithm, we will eventually build a Markov chain whose stationary distribution is the beta distribution. The states we find ourselves in after a long time can then be used as a sample from the beta distribution.
One such MCMC algorithm is the Metropolis-Hastings algorithm.
Metropolis-Hastings Algorithm
Intuition:
So what is the purpose?
Intuitively, we want to wander around some bumpy surface (our Markov chain) in such a way that the amount of time we spend in each location is proportional to the height of the surface at that location (the desired probability density from which we want to sample).
For example, we would like to spend twice as much time on top of a hill that is 100 meters high as on a nearby 50-meter hill. The nice thing is that we can do this even if we do not know the absolute heights of points on the surface: all we need to know are the relative heights. For example, if the top of hill A is twice as high as the top of hill B, we would like to spend twice as much time at A as at B.
There are more sophisticated schemes for proposing new locations and rules for accepting them, but the basic idea is this:
- Select a new "proposed" location.
- Find out how much higher or lower this location is compared to the current one.
- Move to the new location or stay where you are, with a probability that depends on the relative heights of the two locations.
The goal of MCMC is to sample from some probability distribution without having to know its exact height at any point (there is no need to know C).
If the “wandering” process is set up correctly, you can make sure that this proportionality (between the time spent and the height of the distribution) is achieved.
Algorithm:
Now let's define and describe the problem more formally. Let s = (s_1, s_2, ..., s_M) be the desired stationary distribution. We want to create a Markov chain with that stationary distribution. We start with an arbitrary Markov chain on M states with transition matrix P, where p_ij denotes the probability of moving from state i to state j.
Intuitively, we know how to wander around this Markov chain, but the chain does not have the required stationary distribution. It has some stationary distribution of its own (which is not the one we need). Our goal is to change the way we wander around the chain so that it has the desired stationary distribution.
To do this:
- Start with a random initial state i.
- Randomly pick a new proposed state by looking at the transition probabilities in the i-th row of the transition matrix P.
- Calculate a quantity called the acceptance probability, defined as a_ij = min(s_j · p_ji / (s_i · p_ij), 1).
- Now flip a coin that comes up heads with probability a_ij. If it comes up heads, accept the proposal, that is, move to the proposed state; otherwise, reject the proposal and remain in the current state.
- Repeat many times.
After a large number of iterations, this chain converges and has stationary distribution s. We can then use the states of the chain as a sample from that distribution.
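Here is a minimal sketch of these steps in Python for a finite state space, assuming the desired distribution s and a proposal matrix P are given (the function name and parameters are illustrative):

import numpy as np

def metropolis_hastings(s, P, n_steps=100_000, start=0):
    # s: desired stationary distribution (length M), P: proposal matrix (M x M).
    rng = np.random.default_rng(0)
    M = len(s)
    state = start
    visits = np.zeros(M)
    for _ in range(n_steps):
        proposed = rng.choice(M, p=P[state])               # propose j from row i of P
        a = min(s[proposed] * P[proposed, state] /
                (s[state] * P[state, proposed]), 1.0)      # acceptance probability a_ij
        if rng.random() < a:                               # the "coin flip" with probability a_ij
            state = proposed
        visits[state] += 1
    return visits / n_steps                                # fraction of time spent in each state

# For example, targeting the stock-market distribution from above with a
# uniform proposal, the visit frequencies approach s = [0.625, 0.3125, 0.0625]:
s = np.array([0.625, 0.3125, 0.0625])
P = np.full((3, 3), 1 / 3)
print(metropolis_hastings(s, P))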
When doing this for the beta distribution, the only place where the probability density is used is in computing the acceptance probability, where we divide s_j by s_i (so the normalizing constant C cancels).
Sample from beta distribution
We now turn to the problem of sampling from the beta distribution.
The beta distribution is a continuous distribution on [0, 1], so it can take infinitely many values on [0, 1]. Suppose we have an arbitrary Markov chain with infinitely many states on [0, 1] and a transition matrix P such that p_ij = p_ji (all elements of the matrix are equal).
We do not actually need the matrix P, as we will see later, but I want the description of the problem to stay as close as possible to the algorithm proposed above.
- Start with a random initial state i, obtained from a uniform distribution on (0,1).
- Randomly pick a new proposed state by looking at the transition probabilities in the i-th row of the transition matrix P. Suppose we draw another Unif(0, 1) value as the proposed state j.
- Calculate the acceptance probability: a_ij = min(s_j · p_ji / (s_i · p_ij), 1), which simplifies to a_ij = min(s_j / s_i, 1), since p_ji = p_ij, and where s_i = C · x_i^(α - 1) · (1 - x_i)^(β - 1) is the beta density at state x_i (the constant C cancels in the ratio).
- Now flip a coin that comes up heads with probability a_ij. If it comes up heads, accept the proposal, that is, move to the proposed state. Otherwise, reject the proposal and remain in the current state.
- Repeat many times.
Code:
It's time to move from theory to practice. Let's write our beta sampler in Python.
import random
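# A minimal sketch of the sampler described above. The function and variable
# names (beta_density_unnormalized, beta_mcmc, burn_in) are illustrative, and
# we assume a, b >= 1 so the unnormalized density is well defined on (0, 1).

def beta_density_unnormalized(x, a, b):
    # Unnormalized beta density; the constant C is not needed because it
    # cancels in the ratio s_j / s_i.
    return x ** (a - 1) * (1 - x) ** (b - 1)

def beta_mcmc(n_samples, a, b, burn_in=1000):
    states = []
    current = random.uniform(0, 1)          # random initial state i ~ Unif(0, 1)
    for step in range(n_samples + burn_in):
        proposed = random.uniform(0, 1)     # proposed state j ~ Unif(0, 1)
        accept_prob = min(beta_density_unnormalized(proposed, a, b) /
                          beta_density_unnormalized(current, a, b), 1)
        if random.uniform(0, 1) < accept_prob:   # the "coin flip" with probability a_ij
            current = proposed
        if step >= burn_in:                 # discard the first steps while the chain settles
            states.append(current)
    return states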
Check the results with the actual beta distribution.
import numpy as np
import pylab as pl
import scipy.special as ss
%matplotlib inline
pl.rcParams['figure.figsize'] = (17.0, 4.0)
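# A possible way to compare the MCMC sample with the exact beta density.
# This reuses the illustrative beta_mcmc sketch above; a = 2, b = 5 and the
# number of samples are arbitrary choices made only for the comparison.

def beta_pdf(x, a, b):
    # Exact (normalized) beta density, using the beta function from scipy.special.
    return x ** (a - 1) * (1 - x) ** (b - 1) / ss.beta(a, b)

a, b = 2, 5
samples = beta_mcmc(100_000, a, b)
xs = np.linspace(0.001, 0.999, 500)
pl.hist(samples, bins=60, density=True, label="MCMC sample")
pl.plot(xs, [beta_pdf(x, a, b) for x in xs], label="Exact Beta(2, 5) density")
pl.legend()
pl.show()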
As you can see, the sampled values are very similar to the actual beta distribution. Thus, our MCMC chain has reached its stationary state.
In the code above we created a beta sampler, but the same approach applies to any other distribution we want to sample from.
Conclusions
This was a long post. Congratulations if you have read it to the end.
In essence, MCMC methods can be complex, but they give us a lot of flexibility. You can sample from any distribution function using MCMC. Typically, these methods are used to sample from posterior distributions.
You can also use MCMC to solve problems with a large state space, for example the knapsack problem or decryption. I will try to show you more interesting examples in the next post. Stay tuned.