Passing infinity: do-it-yourself t-test

In this post we will discuss the implementation of the procedure for calculating the value of the Student’s distribution function without using any special mathematical libraries. Only Java (or C / C ++, the code is quite universal).

T-test is ...

To start, let us recall a little theory: the t-test, also known as the Student’s test, is used to test statistical hypotheses about the equality of the mathematical expectations of two samples, or about the equality of a certain value of the mathematical expectation of one sample.

T-test happens:

single-sample or two-sample (see above);
one-way or two-way (the estimated value is strictly positive or may have negative values, respectively, see below);
exact or asymptotic (estimated samples have a normal distribution or only tend to it, respectively).

For the test, we need to calculate the value of t-statistics.

For two samples:
$t = \ frac {\ overline X_1 - \ overline X_2} {\ sqrt {\ frac {D_1} {n_1} + \ frac {D_2} {n_2}}}$ where $\ overline X_1$ and $\ overline X_2$ - estimated mathematical expectations of samples, $D_1$ and $D_2$ - sample variance, and $N_1$ and $N_2$ - the number of elements in the first and second sample, respectively.

For one sample:
$t = \ frac {\ overline X - m} {\ sqrt {D / N}}$ where $\ overline X$ and D is the calculated expectation and variance of the sample, N is the number of elements, m is the value, the equality of which is the expected value in the test.

The calculated value of t has a t-distribution if the source data is distributed in Gaussian, or it tends to a t-distribution in the case of non-Gaussian samples.

The probability distribution is described by the function $F_n (t) = {P (x <t)}$ where n is the number of degrees of freedom: $n = N-1$ in the case of one sample and $n = N_1 + N_2 - 2$ in the case of two samples.

Like any probability distribution, $F_n$ monotonously tends to 0 or to 1 with an argument tending to negative or positive infinity, respectively.

We use the distribution using the example of a two-sided estimate of the expectation for one sample: suppose that the real value of the expectation is m, then the calculated statistics t must have a value from neighborhood 0. That is, there is some value T, such that: $P_0 (-T <t <T) -> 1$ . The probability that a value is accidentally out of this range is: where $& \ alpha &$ - A value called the level of significance. If the calculated statistic value t is outside the interval (-T; T), then with probability $P_1$ it can be argued that $\ overline X$ different from m.

In the case of a one-sided test, only one of the “tails” of the distribution and $P_1 = 1 - & \ alpha &$ .

As for the value of T, it can be taken from a table (for example, here ) or calculated as the value of the inverse function to $F_n$ for the required $P_1$ . Table values are not for all possible levels of significance and quantities of degrees of freedom, but calculating the inverse function is very laborious.

In many cases, it is easier to do: use the monotonicity property of the distribution function. Calculating the probability value $P_t = F_n (t)$ can be compared with $P_1$ . If it turns out that $P_t> P_1$ , it can be argued that $t> t$ . The reverse is also true.

The formula of the function itself $F_n (t)$ will be considered further.

At the junction of the infinite

To describe the student’s probability function, we need to define two additional functions:

Euler gamma function ;
hypergeometric function , it is important to note that the above formula is correct for values .

Student's distribution function: .

In the numerical implementation of this formula for large values of n, there is a problem with the multiplier . Despite the fact that the relation itself has rather small values, the numerator and the denominator separately can take values for which the floating point type is not enough. Numerically, we get the type uncertainty .

To resolve this problem, we use the following property of the gamma function:

Given this property, you can rewrite the ratio of gamma functions as follows:

Suppose that there is some value Q, such that the ratio of gamma functions is correctly calculated directly, if the arguments of the numerator and denominator do not exceed Q (we assume that we work only with positive real arguments). Then consider the function which is calculated recursively:

Taking into account the definitions introduced, you can rewrite the probability function:

and move on to implementation.

Implementation

All the procedures I consider below are implemented as a static method of a specialized class in Java in order to use them without creating instances. For implementation in C / C ++, the "static" specifier is not relevant (although not criminal).

Euler gamma function

Let's start with the Euler gamma function: based on it, many statistical quantities are calculated. We assume that the gamma function will be calculated for valid positive arguments.

The formula for the numerical calculation can be found, for example, here . Java code is as follows:

// Euler's gamma-function static double gamma(double x) { double tmp = (x - 0.5) * Math.log(x + 4.5) - (x + 4.5); double ser = 1.0 + 76.18009173 / (x + 0.0) - 86.50532033 / (x + 1.0) + 24.01409822 / (x + 2.0) - 1.231739516 / (x + 3.0) + 0.00120858003 / (x + 4.0) - 0.00000536382 / (x + 5.0); return Math.exp(tmp + Math.log(ser * Math.sqrt(2 * Math.PI))); }

A couple of easy moves to remove all the “Math.” And we get the code in C / C ++.

The ratio of gamma functions

The function takes two arguments as input: the first is the argument of the numerator, the second is the denominator. If the maximum of the arguments is less than or equal to 100, then we calculate the relationship directly. Otherwise, we calculate recursively in parts, dividing the arguments in two. The number 100 is chosen empirically.

By the way, such a calculation of the ratio of gamma functions has a logarithmic complexity of the maximum value of the argument (for large values).

  // gamma(x) / gamma(y) static double gammaRatio(double x, double y) { double m = Math.abs(Math.max(x, y)); if (m <= 100.0) return gamma(x) / gamma(y); return Math.pow(2, x - y) * gammaRatio(x * 0.5, y * 0.5) * gammaRatio(x * 0.5 + 0.5, y * 0.5 + 0.5); }

The search for the maximum of the arguments is performed for greater versatility of the function. When calculating the student’s t-distribution function, the numerator argument is always greater than the denominator argument. Without "Math." We get C / C ++ code.

Hypergeometric function

Since the function is calculated through the sum of a convergent series, we need to pass an additional argument to the function: the required number of terms.

  // hypergeometric function static double hyperGeom(double a, double b, double c, double z, int deep) { double S = 1.0, M, d; for (int i = 1; i <= deep; i++) { M = Math.pow(z, (double)i); for (int j = 0; j < i; j++) { d = (double)j; M *= (a + d) * (b + d) / ( (1.0 + d) * (c + d) ); } S += M; } return S; }

By default, we will use 20 terms.

  // hypergeometric function with default deep value= 20 static double hyperGeom(double a, double b, double c, double z) { return hyperGeom(a, b, c, z, 20); }

When used in C / C ++ code, we simply change the signature of the first function:

  // hypergeometric function double hyperGeom(double a, double b, double c, double z, int deep = 20) { ...

The second function is no longer needed.

By the way: the proposed algorithm for calculating the hypergeometric function for calculating the value of the Student's t-distribution can be improved. Some arguments always have the same values, therefore, part of the calculations can be previously performed once for a fixed value of deep (I am waiting for your suggestions in the comments).

Student’s distribution function

Finally, we put everything together:

  // Student's distribution static double tDist(double x, int n) { double dN = (double)n; return 0.5 + x * gammaRatio(0.5 * (dN + 1.0), 0.5 * dN) * hyperGeom(0.5, 0.5 * (dN + 1.0), 1.5, -x*x / dN) / Math.sqrt(Math.PI * dN); }

After compilation, you can compare the calculated values with the table values. To do this, call the function with the statistics value from the table, the number of degrees of freedom from the first column, compare the obtained value with the value in the table header (after taking this value away from one).

And yet, it must be remembered that the given implementation gives correct results for the number of degrees of freedom exceeding the value of the square of the estimated statistics (which happens in real problems with large amounts of data). This is due to the limitation of the hypergeometric function argument. .

Otherwise, you can simply reject the null hypothesis without calculating the statistics, or look for a realization, for example, with the integration of the Student's density distribution function.

Source: https://habr.com/ru/post/307712/

All Articles