
Efficient median estimation

Suppose you have a data stream - a large one - or an already collected data set, and you want to determine some of its characteristics. Even a non-programmer can come up with an algorithm for finding the minimum and maximum values. Calculating the average is a little harder, but still presents no difficulty: keep a running sum and increment a counter for each new value. The standard deviation is more of the same, just with different numbers. But what about the median?
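For example, a minimal Perl sketch of such a streaming average and standard deviation (reading one value per line from standard input; the name stats_update is only for illustration) might look like this:

#!/usr/bin/perl
use strict;
use warnings;

# running accumulators - one pass over the stream is enough
my ($count, $sum, $sum_sq) = (0, 0, 0);

sub stats_update {
    my ($value) = @_;
    $count++;
    $sum    += $value;
    $sum_sq += $value * $value;
}

# read one value per line from standard input
while (my $line = <STDIN>) {
    chomp $line;
    stats_update($line);
}

die "no input\n" unless $count;
my $avg = $sum / $count;
# standard deviation from the two running sums: sqrt(E[x^2] - E[x]^2)
my $std = sqrt($sum_sq / $count - $avg * $avg);
print "average: $avg\nstd dev: $std\n";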

For those who have forgotten what it is: the median (the 50th percentile) of a data sample is the value that splits the sample in half - the values in one half are not less than the median, and those in the other half are not greater. Its usefulness is that it does not depend on the magnitude of random outliers, which can strongly distort the average: in the sample {1, 2, 1000}, for instance, the median is 2 while the average is about 334.

Strictly speaking, it follows from the definition that in order to compute the exact value of the median we have to store the entire sample; otherwise there is no guarantee that we computed what we intended. But for continuous, large data streams the exact value does not mean much anyway - it is one thing now and another after 100 new samples arrive. So an efficient method of estimating the median, one that does not require much memory or CPU and gives an accuracy on the order of one percent or better, is exactly what we need.

A warning right away: the proposed method has several limitations. In particular, it works very poorly on sorted samples (but works very well on more or less evenly distributed ones). In what follows, the simpler case of non-negative values is considered; the general case requires additional calculations.
The idea of the method is to build a computation that converges to the actual value of the median. If we have already processed some amount of data and have some estimate of the median, then after processing a new batch (with almost the same median, which is important) our estimate should improve. More precisely, it should be more likely to improve than to degrade.

One could use various kinds of windows for the median calculation - for example, compute the exact median of the last 100 values and average it, with a steadily decreasing weight, into the previous estimate. Such a method works well, but there is a much simpler one that is almost as accurate: simply shift the current estimate of the median towards each new value by some small constant delta. If the previous estimate was inaccurate, then as new data are processed it will move towards the actual value, i.e. become more accurate. And if the estimate is already accurate, then on new data half of the values will shift it in one direction and the other half in the other, so the estimate keeps returning to the exact value.
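A minimal sketch of this simplest variant, with a hand-picked constant delta (the name median_update_fixed and the value of DELTA are only illustrative; the adaptive choice of the delta is discussed below):

# nudge the estimate towards each new value by the same fixed amount
use constant DELTA => 0.01;    # hand-picked step, must suit the scale of the data

my $median = 0;                # current estimate of the median

sub median_update_fixed {
    my ($value) = @_;
    if ($value < $median) { $median -= DELTA; }
    else                  { $median += DELTA; }
}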

It is important that the delta have the same absolute value for shifts upwards and downwards; otherwise we end up with something other than the 50th percentile. But now there is the problem of choosing the delta: too small a value gives slow convergence, too large a value gives a large error, especially if the delta is comparable to the median itself. Computing the delta automatically strips it of its title of constant, but that hardly matters if we end up with a better result.

It makes sense to make the delta decrease steadily, to improve convergence - but not too fast, otherwise under unfavourable conditions the estimate risks never catching up with the real value of the median. A rate of 1/count fits perfectly: it is easy to compute, and the series sum(1/n) diverges, so in the end the estimate will reach the real median no matter how bad it was initially.
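To spell out why this works: the largest total distance the estimate can travel over the first N steps is roughly proportional to

\sum_{n=1}^{N} \frac{1}{n} \approx \ln N,

which grows without bound even though each individual step tends to zero; a faster schedule such as 1/n^2 would cap the total possible movement at a finite value and could leave a bad initial estimate stranded.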

Tying the delta to the current median estimate is risky: under unlucky conditions, an estimate that is too low will have a hard time growing. It is better to derive the delta from another reasonably stable characteristic of the sample - the average value. With the 1/count factor, the delta will decrease steadily (except for the rare cases when the current average grows too much compared to the previous one). For data that can also take negative values this no longer works directly, but it is easy to get around: keep two averages, one of the positive values and one of the negative values, and use the sum of their absolute values.
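One possible sketch of that general-case variant (the name median_update_signed is mine; the script below handles only the non-negative case): keep separate running sums of the positive and of the negative values, and build the delta from the sum of their absolute values, which reduces to the original sum/count/count when all values are non-negative.

my ($count, $sum_pos, $sum_neg, $median) = (0, 0, 0, 0);

sub median_update_signed {
    my ($value) = @_;
    $count++;
    if ($value >= 0) { $sum_pos += $value; }
    else             { $sum_neg += $value; }

    # scale of the data: sum of the absolute values of the two accumulators;
    # for non-negative data this is exactly the original sum / count / count
    my $delta = ($sum_pos + abs($sum_neg)) / $count / $count;

    if ($value < $median) { $median -= $delta; }
    else                  { $median += $delta; }
}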

The final algorithm for estimating the median is almost as simple as the algorithm for calculating the average value:
sum += value
count++
delta = sum / count / count    // delta = average / count
if value < median { median -= delta } else { median += delta }


The error and the speed of convergence are, in my opinion, excellent: on a sample of 100 values the error is usually around 10-20%, by 1000 values it is already around 1-3%, and after that it keeps decreasing.

For those who want to assess the quality of the algorithm on their own, here is a small Perl script:
#!/usr/bin/perl -s

$N = shift || 100000;

# generate a test sample; other distributions can be substituted here
for (1 .. $N) {
    push @a, (rand(1)*rand(1))**2;
}

# exact median of the whole sample, for comparison
@t = sort { $a <=> $b } @a;
$MEDIAN = $t[$N/2-1];

median_init();
for (@a) {
    median_update($_);
}

$diff = ($MEDIAN - $median) / $MEDIAN * 100;
$avg  = $sum / $count;
print "Real:\t\t$MEDIAN\nEstimated:\t$median\nDifference:\t$diff \%\nAverage: $avg\n";

sub median_init() {
    $count = $sum = $median = 0;
}

sub median_update($) {
    $value = $_[0];
    $sum += $value;
    $count++;
    $delta = $sum / $count / $count;    # delta = average / count

    if ($median <= $value) {
        $median += $delta;
        $s2 = '+';
    } else {
        $median -= $delta;
        $s2 = '-';
    }

    # with -debug, trace every step: side of the true median, value,
    # step number, current estimate, direction and size of the shift
    if ($debug) {
        $s = ($value > $MEDIAN) ? '+' : '-';
        printf "%s %.5f\t%d\t%.7f\t %s %e\n", $s, $value, $count, $median, $s2, $delta;
    }
}
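Assuming the script is saved as, say, median_test.pl (any name will do), it can be run as

perl median_test.pl 1000000
perl -s median_test.pl -debug 10000

where the optional first argument is the sample size (100000 by default) and the -debug switch makes it print every update step.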


Statistics for different distributions:

rand(1) - median 0.5, average 0.5:
sample size   deviation from the exact median in percent, on 10 different samples
100       -0.06568 -0.06159 -0.02009 -0.00372 -0.00177 -0.00012 0.00441 0.00503 0.01287 2.46360
1000      -0.01196 -0.00462 -0.00415 -0.00240 0.00187 0.00948 0.01446 0.01787 0.02734 0.04150
10,000    -0.04025 -0.03712 -0.00983 -0.00512 0.00632 0.00768 0.01212 0.01267 0.05422 0.07043


rand(1)^2 - median 0.25, average 1/3:
sample size   deviation from the exact median in percent, on 10 different samples
100       -0.80813 -0.38749 -0.38348 -0.32599 -0.31828 -0.13034 -0.12811 0.06157 0.28336 6.07339
1000      -2.98383 -0.32917 -0.27234 -0.22337 -0.09080 0.03338 0.06851 0.30257 0.30286 0.31563
10,000    -0.97752 -0.51781 -0.44970 -0.39834 -0.17945 0.02517 0.06729 0.13353 0.31976 0.44967


rand(1)*rand(1) - median 0.187.., average 0.25:
sample size   deviation from the exact median in percent, on 10 different samples
100       -3.84694 -0.08139 -0.06980 -0.04138 -0.02477 0.01441 0.02042 0.03998 0.16999 0.17690
1000      -0.38108 -0.08865 -0.06287 -0.00716 0.01326 0.02373 0.05510 0.06785 0.09153 0.14421
10,000    -0.44392 -0.26293 -0.10081 -0.05271 0.02870 0.03629 0.09319 0.14522 0.14807 0.20980


-log(rand(1)) - median 0.69..., average 1:
sample size   deviation from the exact median in percent, on 10 different samples
100       -15.40835 -0.07608 -0.06993 -0.05958 -0.03355 -0.02186 -0.00486 0.01619 0.10614 0.11928
1000      -3.05399 -0.06109 -0.05757 -0.04758 -0.01636 -0.01522 0.01376 0.03155 0.05036 0.06091
10,000    -0.24191 -0.10548 -0.09042 -0.04264 -0.03507 -0.01219 -0.01158 0.00481 0.04048 0.07969


-log(rand(1)^4) - median 2.76.., average 4:
sample size   deviation from the exact median in percent, on 10 different samples
100       -4.20326 -0.06970 -0.05433 -0.04191 -0.01110 -0.00353 0.02933 0.05837 0.06381 0.08264
1000      -0.01198 -0.01090 -0.00719 0.01424 0.01454 0.01905 0.03147 0.03237 0.05578 0.89456
10,000    -0.45039 -0.06332 -0.02954 -0.02589 -0.01358 -0.01100 0.00229 0.01811 0.02675 0.05882

Source: https://habr.com/ru/post/228575/

