
Bootstrap, or application statistics with almost no formulas

In university, students are taught to integrate analytically, and then it turns out that in practice almost all integrals are computed by numerical methods. Or, at the very least, the analytical solution is checked that way.

Statistics also has a cheat method that lets you get an approximate answer to many practical questions without any analysis, by brute computational force: the bootstrap. It was invented and published in 1979 by Bradley Efron.

Suppose we have an online store where we sell all sorts of things and attract customers in various ways. Naturally, we are constantly testing something: the arrangement of pictures and buttons on a page, advertising copy, AdWords, banners on partner sites, and so on. And here are the latest results: in the test group, 34 out of 893 visitors bought something; in the control group, 28 out of 923.

The question arises: do we go to management and say "in the test group the conversion rate is 3.81%, in the control group 3.03%, that's a 26% improvement, where is my bonus?" Or do we keep collecting data, because a difference of six people is not statistics yet?
This problem is easy to solve analytically. We have two random variables (the conversion rates in the test and control groups). With a large number of observations, the binomial distribution is close to the normal one. We are interested in the difference. The normal distribution is infinitely divisible: we subtract the expectations and add the variances, getting an expectation of 34/893 - 28/923 = 0.77% and a variance of (34/893) * (1 - 34/893) / 893 + (28/923) * (1 - 28/923) / 923. The standard deviation is the square root of the variance, in our case 0.85%. With 95% probability the true value lies within plus or minus two standard deviations of the expectation, that is, between -0.93% and 2.48%.

So no bonus yet; we have to keep collecting data.

Now let's solve the same problem with the bootstrap. The basic idea is this: it would be nice to repeat our experiment many times and look at the distribution of the results. We can't do that, so we cheat: we draw samples from the data we already have and pretend that each of them is the result of repeating our experiment.

Algorithm:


  1. Pick one observation at random from those available.
  2. Repeat step 1 as many times as we have observations. Some observations will be picked several times and some not at all; that is fine.
  3. Compute the metric of interest on this new sample and record the result.
  4. Repeat steps 1-3 many times, for example 10 thousand. Fewer is possible, but accuracy suffers; more is possible, but it takes longer.


Now we have a distribution that we can inspect or compute things from: a confidence interval, the median, the standard deviation.

Note that we make no assumptions about the shape of any distribution. Distributions can be asymmetric, heavy-tailed, or generally odd-shaped; the algorithm does not change.

Miracles, however, do not happen. If the distribution has no expected value (such distributions exist), the bootstrap will not find one. Or rather, it will find the expectation of the sample, but not of the general population. The same applies when the sample is unrepresentative or simply too small.

The bootstrap is easy to implement. The following example is written in C (it doesn't get any simpler), and besides I/O it uses only two library functions: a pseudo-random number generator and sorting. Let's go through it in parts.

The preamble needs no special comment. Plain C has no bool, so we store the data in an int.

 #include <stdio.h>
 #include <stdlib.h>

 typedef int Data_t;

 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))


A function that draws a resample on the fly and computes the conversion rate from it.
A numerically more careful algorithm for computing the mean would be more correct,
but for our example it doesn't matter.

 static double bootstrap(const Data_t* data, unsigned n)
 {
     unsigned i;
     double sum = 0;
     for (i = 0; i < n; i++) {
         sum += data[rand() % n];
     }
     return sum / n;
 }


Comparison function for sorting the results.

 static int compare(const void* a, const void* b)
 {
     if (*(double*)a > *(double*)b) return 1;
     if (*(double*)a < *(double*)b) return -1;
     return 0;
 }


Initial data

 int main(int argc, char* argv[])
 {
     Data_t test[893] = { 0 };
     Data_t control[923] = { 0 };
     unsigned i;
     for (i = 0; i < 34; i++) {
         test[i] = 1;
     }
     for (i = 0; i < 28; i++) {
         control[i] = 1;
     }


We initialize the pseudo-random number generator with a parameter from the command
line. If we did everything right, the results should not drift much when this
parameter changes.

     if (argc == 2) {
         srand(atoi(argv[1]));
     }


We will store the results here.

  double t_minus_c[10000]; 


Main loop

     for (i = 0; i < ARRAY_SIZE(t_minus_c); i++) {
         t_minus_c[i] = bootstrap(test, ARRAY_SIZE(test))
                      - bootstrap(control, ARRAY_SIZE(control));
     }


We determine the 95% confidence interval: sort the results, discard 2.5% from the bottom and the same from the top, and print what remains.

     qsort(t_minus_c, ARRAY_SIZE(t_minus_c), sizeof(double), compare);
     printf("LCL=%g%%\n", 100. * t_minus_c[250]);
     printf("UCL=%g%%\n", 100. * t_minus_c[9750]);
     return 0;
 }


Checking:

  $ gcc -Wall -o bootstrap bootstrap.c
  $ ./bootstrap
  LCL=-0.891368%
  UCL=2.43898%


A couple more times with other pseudo-random numbers:

  $ ./bootstrap 42
  LCL=-0.86589%
  UCL=2.43898%
  $ ./bootstrap 2013
  LCL=-0.959673%
  UCL=2.52184%


This agrees with the theoretical result (from -0.93% to 2.48%).

So why all the fuss?


This problem has a simple analytical solution, but many real problems either have no analytical solution at all, or have one that is very hard to derive. Imagine that instead of the conversion rate we are interested in the ratio of the profit from a client to the cost of acquiring that client. The distribution of such a metric is unlikely to be normal, and the formulas would no longer fit in a couple of lines. But the bootstrap works exactly the same way: it is enough to change Data_t to double and put the new data in.

Primary sources


  1. Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, 1–26.
  2. Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming, chapter 4.2.2, page 232. Addison-Wesley, Boston, third edition, 1998.

Source: https://habr.com/ru/post/192000/

