If you go around your colleagues and ask how many cell phones they have, it turns out that on average they have about 2.5, even though the vast majority of them own no more than one. Many questions arise at once, starting with why the average is suddenly not an integer, and ending with how one can reasonably estimate how many phones an average person has.

For such purposes a median estimate is suitable: a statistic such that half of the sample values are smaller than it and half are larger. More formally: let us sort the sample values $(x_1, \dots, x_n)$ in ascending order $(x_{[1]}, \dots, x_{[n]})$ and pick the one with index $\lceil n/2 \rceil$. Such an estimate has several advantages: it is less sensitive to erroneous data, and its value is always one that actually occurred in the sample. But it also has unpleasant drawbacks, the main one being the difficulty of computation: even for fairly common distributions there is no convenient general formula (or rather, it is hard to apply in practice, see the distribution of order statistics).
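To make the definition concrete, here is a toy illustration in Python (the phone counts are made up for the sake of the example above): the mean is pulled up by a single outlier, while the median stays at a value that actually occurs in the sample.

```python
def exact_median(sample):
    """Median as the order statistic with (1-based) index ceil(n / 2)."""
    ordered = sorted(sample)
    return ordered[(len(ordered) + 1) // 2 - 1]

phones = [0, 0, 1, 1, 1, 1, 2, 2, 12]   # made-up counts of phones per person
print(sum(phones) / len(phones))         # mean: ~2.2, skewed by the single "12"
print(exact_median(phones))              # median: 1
```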

For a specific sample we can always sort the data and take the middle element, but the trouble is that this requires having all of the sample values at once. Munro and Paterson (1980) state a theorem saying that nothing better can be devised: computing the median exactly in a single pass essentially requires memory on the order of the stream length, so we might as well all go home.

But what do we do if we cannot afford to keep a hundred thousand million values? On the one hand, we can take on the problem of sorting a million ints in 2 MB of RAM. On the other hand, the article mentioned above offers a simple algorithm that, under certain assumptions, arrives at the correct answer with some probability. Namely, the following.
Munro-Paterson Method
Let there be a data stream of length $N$ (in this algorithm it is desirable to know the stream length in advance) that we can read sequentially, one value at a time. We have $s$ memory cells, with $s \ll N$, and the input stream is arranged so that every permutation of the $N$ elements is equally likely. In that case, if we are lucky, the algorithm outputs the value of the median.
The algorithm is very simple. We keep two counters, $l$ and $h$, and a subsample of at most $s$ elements of the input data. The counter $l$ stores the number of elements seen so far that are smaller than the minimum of the subsample, and $h$ the number of elements that are larger than its maximum. The subsample itself is kept as a sorted list, so finding the minimum and the maximum takes constant time and inserting an element is linear in $s$ (balanced trees would make all operations logarithmic, at the small price of some extra memory). When the list runs out of room, one of its extreme elements is pushed out so as to keep the counters balanced: if $l < h$, the minimum element is evicted and $l$ is incremented, and vice versa.
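Below is a minimal Python sketch of this scheme as I understand it from the description above (the function and variable names are mine, and the sorted list is maintained with `bisect` for simplicity rather than a balanced tree):

```python
import bisect
import random


def munro_paterson_median(stream, s):
    """One-pass median in O(s) memory; assumes the stream order is random.

    Returns the median if it stayed inside the kept window,
    otherwise None (the "unlucky" case).
    """
    window = []   # sorted subsample of at most s elements
    lo = 0        # values seen that are below the window minimum
    hi = 0        # values seen that are above the window maximum
    n = 0

    for x in stream:
        n += 1
        if len(window) < s:
            bisect.insort(window, x)
        elif x < window[0]:
            lo += 1
        elif x > window[-1]:
            hi += 1
        else:
            bisect.insort(window, x)
            # push out the extreme on the side with the smaller counter
            if lo < hi:
                window.pop(0)
                lo += 1
            else:
                window.pop()
                hi += 1

    k = (n + 1) // 2        # rank of the median, ceil(n / 2)
    idx = k - lo - 1        # its position inside the window
    return window[idx] if 0 <= idx < len(window) else None


print(munro_paterson_median((random.random() for _ in range(100_000)), s=2_000))
# usually prints something close to 0.5, the median of U(0, 1)
```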
So, after a thousand random numbers with some fixed $s$, we can observe the following picture.

If you look closely at the implementation, the meaning of "if we are lucky" becomes clear: the median must land inside the range of values stored in memory. The article estimates that this happens fairly often when $s$ is proportional to $\sqrt{N}$, which again means that we need to know the length of the data stream in advance, and even then we may end up very far from the true value.
This algorithm is good as a first approximation, but it is not clear what to do if the computation fails, nor is it obvious how to parallelize it.
Further research moved towards approximate computation of the median, and the idea of the approximation is very simple: the $k$-th order statistic is almost the $(k-1)$-th order statistic, or the $(k+1)$-th, which, as we remember, are also not far from the one we need. The error of the median estimate is measured as the fraction of indices by which we are allowed to "miss", relative to $n$, the number of input values seen. This error is written as $\varepsilon n$, where $\varepsilon$ is a fixed permissible error: for example, $\varepsilon = 0.005$ with $n = 1000$ means that if the entire sample had been kept, the resulting median estimate would lie between the 495th and 505th sorted values. If we keep only 9 values, the grouping might look like the following picture.

Approximate Greenwald-Khanna Method
This method borrows the idea of using $s$ memory cells, but uses the allocated memory to estimate all quantiles at once, uniformly, with a given accuracy $\varepsilon$ that depends on the amount of memory allowed.
The data is collected into a Cartesian tree (remarkably described in a well-known series of articles), which, upon reaching its size limit, filters the values by creating aggregated nodes, so that the stated approximation accuracy is preserved.
A tree node is a triple: a value from the original sequence, the number of grouped values, and the "generation" at which this value appeared in the original sample. Since we are talking about a Cartesian tree, we must also define the key and the priority: the key is the value itself, and the priority is the number of aggregated values in the node (mixed with a little randomness so that nodes with equal aggregate counts are spread more evenly across subtrees).
For such a tree, the following important operations can be defined (a simplified sketch follows the list):
- adding a new value in $O(\log s)$;
- computing the median (and, as is easy to see, an arbitrary quantile) in linear time;
- merging two trees with the same $\varepsilon$, again in linear time.
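The structure in the paper is treap-based, but to show the flavor of $\varepsilon$-approximate quantile summaries, here is a simplified, list-based sketch of the classic Greenwald-Khanna summary (without the "band" optimization from the paper, so it may use somewhat more memory); the class and method names are mine:

```python
import math


class GKQuantiles:
    """Simplified eps-approximate quantile summary (Greenwald-Khanna style).

    Stores entries [value, g, delta] sorted by value: the minimal rank of a
    value is the running sum of g's, and its maximal rank exceeds that by delta.
    A query for quantile phi returns a value whose rank differs from
    phi * n by at most eps * n.
    """

    def __init__(self, eps=0.005):
        self.eps = eps
        self.n = 0
        self.tuples = []                             # [value, g, delta]
        self.period = max(1, int(1.0 / (2 * eps)))   # how often to compress

    def insert(self, v):
        i = 0      # linear search for clarity; a binary search is used in practice
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # new extreme values have exactly known ranks, hence delta = 0
        delta = 0 if i in (0, len(self.tuples)) else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % self.period == 0:
            self._compress()

    def _compress(self):
        """Merge neighbouring entries while the error guarantee still holds."""
        limit = 2 * self.eps * self.n
        i = 0
        while i + 1 < len(self.tuples):
            _, g, _ = self.tuples[i]
            _, g2, d2 = self.tuples[i + 1]
            if g + g2 + d2 < limit:
                self.tuples[i + 1][1] = g + g2   # fold entry i into its neighbour
                del self.tuples[i]
            else:
                i += 1

    def query(self, phi):
        """Value whose rank is within eps * n of phi * n (phi = 0.5 for the median)."""
        target = phi * self.n + self.eps * self.n
        rmin, answer = 0, self.tuples[0][0]
        for v, g, d in self.tuples:
            rmin += g
            if rmin + d > target:
                break
            answer = v
        return answer
```

With $\varepsilon = 0.005$, calling `query(0.5)` after a million inserted values would return an element whose rank lies within 5000 positions of the true median.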
The article states (without proof) a theorem that, on a random data set, the number of tree nodes grows proportionally to $\log(\varepsilon N)/\varepsilon$, which means the algorithm should be very compact and fast.
Examples
As a small example, let us consider a symmetric and an asymmetric distribution with known medians. Without looking far, the normal and the log-normal distributions will do. Consider the following median estimates for a sample of one million values:
- ordering the values (full sort);
- the Munro-Paterson method with a fixed parameter $s$;
- the Greenwald-Khanna method with a fixed parameter $\varepsilon$.
All estimates for each of the distributions are computed on the same data sets and repeated 100 times.

For the normal distribution, the mean and the median coincide; for the log-normal distribution they are quite different, and it makes no sense to use one to estimate the other (better never do that). The paired plots show that full sorting very often agrees exactly with the Munro-Paterson method, and that is how it should be, but there is still a noticeable share of cases, about 8% as stated in the article, where the Munro-Paterson result differs from the true value. The Greenwald-Khanna method gives a decent, though still approximate, result. Below you can see that the spread of all the methods around the true value is roughly the same.
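As a quick sanity check of the claim about the log-normal distribution (the parameters $\mu = 0$, $\sigma = 1$ here are my choice, not necessarily those of the experiment above): the mean tends to $e^{\mu + \sigma^2/2} \approx 1.65$, while the median is $e^{\mu} = 1$.

```python
import random

random.seed(1)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(1_000_000)]

print(sum(sample) / len(sample))         # mean, close to exp(0 + 1/2) ≈ 1.65
print(sorted(sample)[len(sample) // 2])  # median, close to exp(0) = 1.0
```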

The Greenwald-Khanna method should be quite economical both in memory and in running time: the number of elements in the tree is proportional to the logarithm of the length of the input sequence and inversely proportional to the error $\varepsilon$, and the time to fill the structure is proportional to $N \log s$, where $s$ is the size of the tree.
PS
Implementation examples can be found at the link to bitbucket, though it is not the most optimized implementation.
While preparing this text, I discovered that these very articles and methods are covered in the PDMI course on streaming data algorithms.
Thanks to parpalak for the editor.