Probability theory has never ceased to amaze me, from the moment I first encountered it to this day. At different times and to different degrees I was struck by what I'll call "wow effects": jolts to the cerebellum that opened a kind of third eye, after which the world was never the same again.
I experienced the first "wow effect" from the Central Limit Theorem. Take a bunch of random variables, send their number to infinity, and you get a normal distribution. It doesn't matter how these variables are distributed, whether they come from coin tosses or raindrops on a window, solar flares or coffee grounds: their sum always tends to normality. All you need is independence and a finite variance (I later learned there is a theorem even for extremely heavy-tailed distributions with infinite variance). Back then this paradox kept me awake for a long time.
At some point at university, subjects such as discrete mathematics and functional analysis merged together and resurfaced under the guise of the expression "almost surely." A standard example: you pick a number uniformly at random between 0 and 1. What is the probability that you hit a rational number (hello, Dirichlet function)? Spoiler: 0. Zero, Karl! A countable set has measure zero, even though it is infinite. You have infinitely many options, yet you will not pick any of them. You will not pick 0, or 1, or 1/2, or 1/4. And you will not pick 3/2 either.
Yes, yes, whether you aim for 1/2 or for 3/2, the probability is zero. But you will certainly not hit 3/2, since it lies outside the interval, while 1/2 you will not hit... almost surely. The notions of "almost everywhere" and "almost surely" amuse mathematicians and make the layman twirl a finger at his temple. Many break their brains trying to classify the different kinds of zero, but the result is worth it.
The third "wow effect" in order, but not in strength, overtook me already at the advanced level, while reading books on stochastic calculus. The culprit was Itô's lemma. Ever since school, when our virgin eyes were first shown a derivative, we have never doubted the correctness of the formula
$$dX^2 = 2X \cdot dX.$$
And it is true. But only if $X$ is not a random process. A hellish mixture of the properties of the normal distribution and "almost surely" shows that otherwise this formula is, in general, wrong. A volume of mathematical analysis with solutions of ordinary differential equations can now be thrown into the furnace. People in the know giggle softly; the rest eagerly leaf through the wiki articles on Itô calculus.
But most recently I experienced the fourth "wow effect". It is not a single fact but a whole theory, which I am going to present in a series of articles. And if the previous feints of probability theory no longer surprise you, then welcome under the cut (I know, you are already here).
Hermite Polynomials
Let's start with ordinary algebra and define the "probabilistic" Hermite polynomials (they differ slightly from the "physical" ones):
$$H_n(x) = (-1)^n e^{x^2/2} \frac{d^n}{dx^n}\left(e^{-x^2/2}\right), \quad n \in \mathbb{N}_0.$$
The first few polynomials are $H_0(x) = 1$, $H_1(x) = x$, $H_2(x) = x^2 - 1$, $H_3(x) = x^3 - 3x$, $\dots$
Hermite polynomials have the following properties:
$$H_n'(x) = n H_{n-1}(x),$$
$$H_n(-x) = (-1)^n H_n(x),$$
$$H_n(x) = x H_{n-1}(x) - (n-1) H_{n-2}(x).$$
The last relation will help us compute the first $n$ Hermite polynomials for a given $x$. We will program in Haskell, because it lets mathematicians express themselves in their usual language: Haskell is pure, rigorous and beautiful, like mathematics itself.
-- | 'hermite' is an infinite list of Hermite polynomials for given x
hermite :: (Enum a, Num a) => a -> [a]
hermite x = s
  where s@(_:ts) = 1 : x : zipWith3 (\hn2 hn1 n1 -> x * hn1 - n1 * hn2) s ts [1..]
The hermite function takes a parameter $x$ as input and outputs an infinite list of polynomials $H_n(x)$ for $n = 0, 1, \dots$ If you are not familiar with the concept of lazy evaluation, I strongly advise you to read up on it. For those who know the concept but are not yet fully at home with functional programming: what happens here? Imagine that we already have an infinite list with all the values of the Hermite polynomials:
s = [1, x, x^2-1, x^3-3x, x^4-6x^2+3, ... ]
The tail of this list (everything but the first element):
ts = [x, x^2-1, x^3-3x, x^4-6x^2+3, ... ]
Then we take another list, of natural numbers:
[1, 2, 3, ... ]
The zipWith3 function combines these three lists element by element using the operator given to it:
[x*x - 1*1, x*(x^2-1) - 2*x, x*(x^3-3x) - 3*(x^2-1), ... ] = [x^2-1, x^3-3x, x^4-6x^2+3, ... ]
Prepend 1 and x, and we get the full list of Hermite polynomials. In other words, we obtained the list of polynomial values by using the very list of those values, that is, the list we are trying to build. Rumor has it that a full awareness of the beauty and power of FP is akin to the ability to look into your own ear.
Let's check: the first 6 values for $x = 1$:
Prelude> take 6 (hermite 1)
[1,1,0,-2,-2,6]
What we expected to see.
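As a small sanity check of the parity property $H_n(-x) = (-1)^n H_n(x)$ listed above, here is a sketch of mine (not part of the original exposition; the test point and the tolerance are arbitrary) that reuses the hermite list:

-- Sanity check (my own addition): verify H_n(-x) == (-1)^n * H_n(x)
-- for the first n polynomials at a point x.
parityHolds :: Double -> Int -> Bool
parityHolds x n = and $ take n $
    zipWith3 (\hMinus hPlus sign -> abs (hMinus - sign * hPlus) < 1e-9)
             (hermite (-x)) (hermite x) (iterate negate 1)
-- e.g. parityHolds 1.5 20 should evaluate to True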
Hilbert space
Let's wander off into slightly different territory and recall the definition of a Hilbert space. In scientific terms, it is a complete metric linear space with a scalar product $\langle X, Y\rangle$ defined on it. In this space, each element is assigned a real number, called the norm and equal to
$$\|X\| = \sqrt{\langle X, X\rangle}.$$
Nothing extraordinary. When I try to picture a Hilbert space, I start from the simple and gradually work my way toward the complex.
The simplest example is the space of real numbers: $H = \mathbb{R}$. In this case, for the scalar product of two numbers $X$ and $Y$ we have
$$\langle X, Y\rangle = XY.$$
Next I move on to Euclidean space $H = \mathbb{R}^n$. Now
$$\langle X, Y\rangle = \sum_{i=1}^{n} X_i Y_i.$$
This space can be extended to the space of complex vectors $H = \mathbb{C}^n$, for which the scalar product is
$$\langle X, Y\rangle = \sum_{i=1}^{n} X_i \overline{Y_i}$$
(the top bar denotes complex conjugation).
Well, finally I arrive at the grown-up space, a space of infinite dimension. In our case it will be the space of square-integrable functions defined on some set $\Omega$ with a given measure $\mu$. We denote it $H = L^2(\Omega, \mu)$. The scalar product on it is defined as follows:
$$\langle X, Y\rangle = \int_\Omega (X \cdot Y)\, d\mu.$$
Usually the set $\Omega$ is taken to be an interval $[a, b]$, and the measure $\mu$ to be the uniform (Lebesgue) measure, i.e. $d\mu = \mu(d\omega) = d\omega$. Then the scalar product is written as the ordinary Lebesgue integral
$$\int_a^b X(\omega) Y(\omega)\, d\omega.$$
If we think in terms of probability theory, then $\Omega$ is a space of elementary events, $X = X(\omega)$ and $Y = Y(\omega)$ are random variables, and $\mu$ is a probability measure. Such a measure has a density function $\rho$, which may differ from a constant, so that $d\mu = \rho(\omega)\, d\omega$, and the scalar product coincides with the expectation:
$$\langle X, Y\rangle = \int_\Omega X(\omega) Y(\omega) \rho(\omega)\, d\omega = \mathbb{E}[XY].$$
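Just to make the definition tangible, here is a tiny numerical sketch of mine (the midpoint rule and the helper name innerProductL2 are my own choices, not part of the article's code): it approximates $\langle X, Y\rangle = \int_a^b X(\omega) Y(\omega)\, d\omega$ on an interval.

-- A rough approximation of the L2 inner product on [a, b] with the midpoint rule
-- (illustrative sketch, my addition; not used in the rest of the article).
innerProductL2 :: (Double -> Double) -> (Double -> Double) -> Double -> Double -> Int -> Double
innerProductL2 f g a b n = sum [ f w * g w * dw | k <- [0 .. n - 1]
                               , let w = a + (fromIntegral k + 0.5) * dw ]
  where dw = (b - a) / fromIntegral n
-- e.g. innerProductL2 sin cos 0 pi 10000 should be close to 0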
It is time to introduce an element of chance into our considerations. Suppose we have a Hilbert space $H$. We will call $\{W(h)\}_{h \in H}$ an (isonormal) Gaussian process if
the random vector $(W(h_1), \dots, W(h_n))$ is normally distributed with zero mean for any $h_1, \dots, h_n \in H$, and
for any $h, g \in H$
$$\mathbb{E}[W(h) \cdot W(g)] = \langle h, g\rangle.$$
In its mathematical essence, $W(h)$ is a mapping from one Hilbert space to another: from some $H$ into $L^2(\Omega, \mathcal{F}, \mathbb{P})$, the probability space of random variables with finite variance given by the triple $\Omega$ (the set of elementary events), $\mathcal{F}$ (a sigma-algebra) and $\mathbb{P}$ (a probability measure). It is easy to show that this mapping is linear:
$$W(a h + b g) = a\, W(h) + b\, W(g)$$
(in the sense of equality "almost surely", hello "wow effect" #2).
Example. Let $H = L^2((0, \infty), \lambda)$, where $\lambda$ is the uniform (Lebesgue) measure. The dot product on it is
$$\langle f, g\rangle = \int (f \cdot g)\, d\lambda.$$
Let $h(s) = \mathbf{1}_{[0,t]}(s)$ be the indicator function of the interval $[0, t]$. Then $\|h\|^2 = \int \mathbf{1}_{[0,t]}(s)\, ds = t$, and
$$B(t) = W(\mathbf{1}_{[0,t]}) \sim \mathcal{N}(0, t)$$
is none other than Brownian motion (the Wiener process). Moreover,
$$\int_0^t f(s)\, dB(s) = W(\mathbf{1}_{[0,t]} f)$$
is called the Itô integral of the function $f$ with respect to $B$.
To implement the Gaussian process, I will use packages that noble people have already written for us.
import Data.Random.Distribution.Normal
import Numeric.LinearAlgebra.HMatrix as H

-- | 'gaussianProcess' samples from Gaussian process
gaussianProcess :: Seed                    -- random state
                -> Int                     -- number of samples m
                -> Int                     -- number of dimensions n
                -> ((Int, Int) -> Double)  -- function that maps indices of the matrix into dot products of its elements
                -> [Vector Double]         -- m n-dimensional samples of the Gaussian process
gaussianProcess seed m n dotProducts = toRows $ gaussianSample seed m mean cov_matrix
  where
    mean       = vector (replicate n 0)
    cov_matrix = H.sym $ (n><n) $ map (\i -> dotProducts (quot i n, rem i n)) [0..]
The gaussianProcess function takes a seed parameter (standard for generators), nSamples, the sample size, dim, the dimension of the vector $(h_1, \dots, h_n)^T$, and dotProducts, a function that accepts an index $(i, j)$ of the covariance matrix and returns the corresponding scalar product $\langle h_i, h_j\rangle$. On output, gaussianProcess produces nSamples vectors $(W(h_1), \dots, W(h_n))$.
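As a usage example, here is a sketch of mine that ties back to the Brownian-motion example above (the uniform grid on $[0, 1]$ and the helper name brownianPaths are my assumptions, not part of the original code): since $B(t) = W(\mathbf{1}_{[0,t]})$ and $\langle \mathbf{1}_{[0,t_i]}, \mathbf{1}_{[0,t_j]}\rangle = \min(t_i, t_j)$, feeding gaussianProcess the covariance function min yields discretized Brownian paths.

-- Sampling discretized Brownian motion paths on a uniform grid of [0, 1]
-- by passing <h_i, h_j> = min(t_i, t_j) to 'gaussianProcess' (sketch, my addition).
brownianPaths :: Seed -> Int -> Int -> [Vector Double]
brownianPaths seed nPaths nSteps = gaussianProcess seed nPaths nSteps cov
  where
    dt         = 1 / fromIntegral nSteps    -- grid step
    t i        = fromIntegral (i + 1) * dt  -- grid point t_i
    cov (i, j) = min (t i) (t j)            -- covariance of Brownian motion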
It is time to put together all the knowledge we have gained. But before that, it is worth mentioning one useful property of Hermite polynomials and the normal distribution taken together. Let $F(t, x) = \exp(tx - t^2/2)$. Then, using the Taylor expansion,
$$F(t, x) = \sum_{n=0}^{\infty} H_n(x) \frac{t^n}{n!}.$$
Take $X, Y \sim \mathcal{N}(0, 1)$, two standard normally distributed random variables. Via this generating function and the normal distribution, we can pull out the following relation:
$$\mathbb{E}[F(s, X) \cdot F(t, Y)] = \exp(st\, \mathbb{E}[XY]).$$
Take the $(n+m)$-th partial derivative $\frac{\partial^{n+m}}{\partial s^n \partial t^m}$ of both sides of the equation above, set $s = t = 0$, and we get
$$\mathbb{E}[H_n(X) \cdot H_m(Y)] = \begin{cases} n!\,(\mathbb{E}[XY])^n, & n = m, \\ 0, & n \neq m. \end{cases}$$
What does this tell us? First, we got the norm $\|H_n(X)\|^2 = n!$ for $X \sim \mathcal{N}(0, 1)$, and second, we now know that distinct Hermite polynomials of normal random variables are orthogonal to each other. Now we are ready to build something bigger.
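Before moving on, here is a rough Monte Carlo check of this orthogonality relation, a sketch of my own that reuses gaussianProcess and hermite from above (the seed and sample size are arbitrary):

-- Estimate E[H_n(X) * H_m(X)] for X ~ N(0,1); it should be close to n! when
-- n == m and close to 0 otherwise (sketch, my addition).
hermiteMoment :: Seed -> Int -> Int -> Int -> Double
hermiteMoment seed nSamples n m = sum products / fromIntegral nSamples
  where
    xs       = concatMap toList (gaussianProcess seed nSamples 1 (\_ -> 1))
    products = [ hermite x !! n * hermite x !! m | x <- xs ]
-- e.g. hermiteMoment 42 100000 3 3 should land near 3! = 6,
--      hermiteMoment 42 100000 3 2 near 0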
Decomposing the space into chaos
Let $\mathcal{H}_n = \overline{\operatorname{span}}\big\{\, H_n(W(h)) \;\big|\; \|h\| = 1 \,\big\}$ be the $n$-th Wiener chaos. Then
$$L^2(\Omega, \mathcal{F}, \mathbb{P}) = \bigoplus_{n=0}^{\infty} \mathcal{H}_n,$$
where $\mathcal{F}$ is the sigma-algebra generated by $\{W(h)\}_{h \in H}$.
Whoa, whoa, easy! Let's take this decomposition theorem apart piece by piece and translate it from mathematical into human. We will not go into much detail, only explain intuitively what it is about. The symbol $\operatorname{span}(X)$ denotes the linear span of a subset $X$ of a Hilbert space $H$: the intersection of all subspaces of $H$ containing $X$. Simply put, it is the set of all linear combinations of elements of $X$. The bar over $\operatorname{span}$ denotes the closure of the set. If $\overline{\operatorname{span}}(X) = H$, then $X$ is called a complete set (roughly, "$X$ is dense in $H$"). Consequently, $\overline{\operatorname{span}}\{ H_n(W(h)) \mid \|h\| = 1 \}$ is the closure of the linear span of Hermite polynomials of the Gaussian process over the unit hypersphere.
That more or less settles the notation. Now, what is Wiener chaos? Let's start simple: $\mathcal{H}_0$ contains all linear combinations of Hermite polynomials of degree 0, that is, various multiples $a \cdot 1$, that is, the whole space of real numbers. Consequently, $\mathcal{H}_0 = \mathbb{R}$. Moving on, it is easy to see that $\mathcal{H}_1 = \{ W(h) \mid h \in H \}$, that is, the space made up of the Gaussian process itself. It turns out that all centered normal random variables belong to $\mathcal{H}_1$. If we also add $\mathcal{H}_0$, they are joined by the remaining normal random variables, those with non-zero expectation. The further sets $\mathcal{H}_n$ already deal with $n$-th powers of $W(h)$.
Example. Let $H = L^2((0, \infty), \lambda)$ and $X = B(t)^2$, the square of Brownian motion. Then
$$B(t)^2 = t\, H_2\!\left(\frac{B(t)}{\sqrt{t}}\right) + t.$$
The first term belongs to $\mathcal{H}_2$, the second to $\mathcal{H}_0$. This is called the Wiener chaos decomposition.
We showed earlier that $\mathcal{H}_n \perp \mathcal{H}_m$ for $n \neq m$. The decomposition theorem states that these sets are not only orthogonal to each other but also form a complete system in $L^2(\Omega, \mathcal{F}, \mathbb{P})$. What does this mean in practice? It means that any random variable $X$ with finite variance can be approximated by a polynomial function of normally distributed random variables.
In fact, such a decomposition is useful when the distribution of $X$ is, in some sense, close to the normal distribution, for example if we are dealing with Brownian motion or with a lognormal distribution. And it was not for nothing that we mentioned that $\mathcal{F}$ is generated by $W(h)$: this is a very important condition. Indeed, the density of the normal distribution,
$$\rho(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},$$
looks very much like the definition of the Hermite polynomial.
If the distribution of $X$ is far from Gaussian, you can try other orthogonal polynomials. For example, the density of the gamma distribution is
$$\rho(x) = \frac{x^{n-1} e^{-x}}{\Gamma(n)}.$$
Does it remind you of anything? Right, the Laguerre polynomials:
$$L_n(x) = \frac{e^x}{n!} \frac{d^n}{dx^n}\left(x^n e^{-x}\right).$$
The Legendre polynomials correspond to the uniform distribution, the Kravchuk polynomials to the binomial distribution, and so on. The theory that develops the idea of decomposing a probability space into orthogonal polynomials is known in the English-language literature as "polynomial chaos expansion".
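For completeness, here is a sketch of mine in the same spirit as hermite (not from the original article): an infinite list of Laguerre polynomials built from their standard three-term recurrence $(n+1)L_{n+1}(x) = (2n+1-x)L_n(x) - nL_{n-1}(x)$.

-- | 'laguerre' is an infinite list of Laguerre polynomials for given x (my addition)
laguerre :: (Enum a, Fractional a) => a -> [a]
laguerre x = s
  where s@(_:ts) = 1 : (1 - x) :
          zipWith3 (\ln1 ln n -> ((2*n + 1 - x) * ln - n * ln1) / (n + 1)) s ts [1..]
-- e.g. take 3 (laguerre 1.0) gives [1.0, 0.0, -0.5]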
Example. Now take $H = \mathbb{R}$, a function $f$, and define a random variable $X$ such that
$$X = f(\xi) \in L^2(\Omega, \mathcal{F}, \mathbb{P}),$$
where $\xi = W(1) \sim \mathcal{N}(0, 1)$. By the decomposition theorem, we can represent it as a weighted sum of Hermite polynomials:
$$X = \sum_{n=0}^{\infty} f_n H_n(\xi).$$
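The weights $f_n$ can be estimated numerically. Here is a rough Monte Carlo sketch of mine (the helper name chaosCoefficients is an assumption; it reuses gaussianProcess and hermite from above) based on the relation $f_n = \mathbb{E}[f(\xi) H_n(\xi)] / n!$, which follows from the orthogonality of the $H_n$:

-- Monte Carlo estimate of the first chaos coefficients f_n = E[f(xi) H_n(xi)] / n!
-- (sketch, my addition; accuracy depends on the sample size).
chaosCoefficients :: Seed -> Int -> Int -> (Double -> Double) -> [Double]
chaosCoefficients seed nSamples nCoeffs f =
    [ mean [ f x * hermite x !! k | x <- xs ] / factorial k | k <- [0 .. nCoeffs - 1] ]
  where
    xs          = concatMap toList (gaussianProcess seed nSamples 1 (\_ -> 1))
    mean ys     = sum ys / fromIntegral (length ys)
    factorial k = fromIntegral (product [1 .. k])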
Congratulations! Now, if you have a function of a standard normally distributed random variable, you can decompose it in the basis of Hermite polynomials. For example, tossing a fair 0-1 coin can be represented as
$$X = \mathbf{1}_{(0, \infty)}(\xi).$$
After a little conjuring with the mathematics (we leave the simple integrals to the reader), we get the decomposition:
$$X = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \sum_{k=0}^{\infty} \frac{(-1)^k}{2^k\, k!\, (2k+1)} H_{2k+1}(\xi).$$
Note that every second coefficient in the basis decomposition is zero.
-- | 'second' function takes a list and gives every second element of it
second (x:y:xs) = y : second xs
second _        = []

-- | 'coinTossExpansion' is a Wiener chaos expansion for coin-toss rv to n-th element
coinTossExpansion :: Int    -- number of elements in the sum
                  -> Double -- gaussian random variable
                  -> Double -- the sum
coinTossExpansion n xi = sum (take n $ 0.5 : zipWith (*) fn (second $ hermite xi))
  where fn = 1.0 / (sqrt $ 2 * pi) :
             zipWith (\fn1 k -> -fn1 * k / ((k + 1) * (k + 2))) fn [1, 3..]
The coinTossExpansion function returns the sum obtained by decomposing the coin-toss random variable into Wiener chaos, truncated at the $n$-th term, for a given $\xi$. The graph shows the gradual convergence for a randomly chosen $\xi$ as $n$ grows.
Judging by this plot, somewhere after $n \approx 10$ we can truncate the sum, round the result, and return it as $X$.
-- | 'coinTossSequence' is a coin-toss sequence of given size
coinTossSequence :: Seed  -- random state
                 -> Int   -- size of resulting sequence
                 -> [Int] -- coin-toss sequence
coinTossSequence seed n = map (round . coinTossExpansion 100) normals
  where normals = concatMap toList (gaussianProcess seed n 1 (\(i, j) -> 1))
Let's check what a sequence of 20 flips looks like.
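For example, a GHCi call of this kind (the seed 42 is my arbitrary choice; the resulting zeros and ones depend on it, so no specific output is shown):

Prelude> coinTossSequence 42 20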
Now, when you are asked to generate coin flips, you know what to show them.
Well, jokes aside: we computed something and decomposed something, but what is the use of it all, you ask. Do not rush to feel cheated. In subsequent articles we will show how this decomposition lets us take a derivative of a random variable (in a certain sense), expand stochastic integration (and your consciousness), and find a practical application for all of this in machine learning.