A little background. I have written this post for two purposes. First, to check that a post written in Markdown plus a formula-markup converter comes out in readable form. Secondly, to tell you about an interesting problem from data streaming. Towards the end of writing I discovered a post about LogLog from four years ago. Luckily for me, the author of that post focused on the implementation; I am going to say more about the mathematics.

Let's imagine that we have a router. Many packets addressed to different hosts pass through it. We would like statistics on how many distinct addresses take part in the communication. There are a couple of problems.
- There are so many packets that we cannot remember them all; telling a packet "Come back, all is forgiven!" will not work either.
- There are far more possible addresses than the router has memory for.

The task
There is a sequence of integers $a_1, a_2, \dots$, in which every number takes a value from $1$ to $n$. We need to count, in a single pass, the number of distinct values using $O(\log n)$ memory.
I will describe the probabilistic, approximate Flajolet–Martin algorithm. Its key characteristics:
- it uses $O(\log n)$ memory;
- it works on any input;
- it finds an answer that differs from the exact one by no more than a factor of $3$: each direction of error happens with probability at most $\sqrt{2}/3$ (shown in the analysis below), and the probability is taken over the random bits of the algorithm.

At the end of the post I will explain why exact deterministic algorithms require $\Omega(n)$ memory.
The Flajolet–Martin algorithm
Imagine that we have the segment of real numbers $[0, 1]$. We throw $k$ points onto the segment independently and uniformly at random. What is the distance between the leftmost point and zero?

Intuitively, the points divide the segment into $k + 1$ smaller sub-segments of approximately equal length. If we carefully write out the expectation of this distance and take the integral, we get

$$\mathbb{E}[\text{distance}] = \frac{1}{k + 1}.$$

So if someone throws some number of points onto the segment at random and tells us only the distance $X$ from zero to the leftmost point, we can estimate that the total number of points is of the order of $1/X$.
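As a quick sanity check of the $1/(k+1)$ formula, here is a tiny Monte Carlo simulation (my own addition; the helper name `mean_leftmost` and the parameters are arbitrary):

```python
import random

def mean_leftmost(k, trials=100_000):
    """Average distance from zero to the leftmost of k uniform points on [0, 1]."""
    total = 0.0
    for _ in range(trials):
        total += min(random.random() for _ in range(k))
    return total / trials

for k in (1, 3, 10, 100):
    # The two columns should be close to each other.
    print(k, round(mean_leftmost(k), 4), round(1 / (k + 1), 4))
```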
The idea of the Flajolet–Martin algorithm is to randomly throw all the numbers of the sequence onto the segment $[0, 1]$ and then look at the distance from zero to the leftmost point. If identical numbers always land at the same point, while different numbers are distributed over the segment independently, we will be able to estimate the answer.
2-independent hash functions
To throw numbers onto the segment we will use a random hash function.

Definition. A family of hash functions $H = \{h \colon [n] \to [m]\}$ (where $[m]$ denotes $\{0, 1, \dots, m - 1\}$) is called 2-independent if for any two distinct keys $x \neq y$ and any values $a, b \in [m]$

$$\Pr_{h \in H}\bigl[h(x) = a \text{ and } h(y) = b\bigr] = \frac{1}{m^2}.$$
The meaning of the definition is as follows. Fix any two keys $x$ and $y$, with $x \neq y$, and look at the random variables $h(x)$ and $h(y)$, where the randomness comes from the choice of the function $h$. Then, by the definition, $h(x)$ and $h(y)$ behave like independent random variables. As a consequence, if you take just one key $x$, the value $h(x)$ is uniformly distributed over $[m]$.
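The last claim is a one-line consequence of the definition: summing over all possible values of $h(y)$ gives

$$\Pr[h(x) = a] = \sum_{b \in [m]} \Pr\bigl[h(x) = a,\ h(y) = b\bigr] = m \cdot \frac{1}{m^2} = \frac{1}{m}.$$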
For example, take a prime number $p$ and let both keys and values range over $\mathbb{Z}_p = \{0, 1, \dots, p - 1\}$. Let $H$ be the family of all linear mappings modulo $p$:

$$h_{a,b}(x) = (a x + b) \bmod p$$

for $a, b \in \mathbb{Z}_p$. Then, for fixed distinct keys $x \neq y$ and target values $u, v$, the conditions $h_{a,b}(x) = u$ and $h_{a,b}(y) = v$ form a system of two linear equations in the unknowns $a$ and $b$:

$$a x + b \equiv u, \qquad a y + b \equiv v \pmod{p}.$$

Since $x - y \neq 0$ is invertible modulo $p$, the system has exactly one solution among the $p^2$ possible pairs $(a, b)$:

$$\Pr_{a,b}\bigl[h_{a,b}(x) = u,\ h_{a,b}(y) = v\bigr] = \frac{1}{p^2}.$$

Let's note two important points. First, storing such a function takes $O(\log p)$ memory, which is very little. Second, if you look closely, you can see that all the computations take place in the field $\mathbb{Z}_p$, so the construction generalizes to any finite field.

In the test code I use a Galois field of the form $GF(2^k)$. In the description below we may assume that we have a family of hash functions $h \colon [n] \to [m]$, where $m$ is a power of two. Storing one function takes $O(\log n)$ memory.
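To make this concrete, here is a small Python sketch of the linear family modulo a prime (my own illustration, not the post's test code; the class name `LinearHash` and the particular prime are arbitrary choices, and a real implementation over $GF(2^k)$ would use carry-less arithmetic instead):

```python
import random

class LinearHash:
    """A random hash h_{a,b}(x) = ((a*x + b) mod p) mod m from the linear family."""

    P = (1 << 61) - 1   # a Mersenne prime, assumed to be much larger than n and m

    def __init__(self, m):
        assert m & (m - 1) == 0, "m is expected to be a power of two"
        self.m = m
        self.a = random.randrange(self.P)
        self.b = random.randrange(self.P)

    def __call__(self, x):
        # Reducing modulo m maps the p hash values onto [m]; for p >> m the
        # result is only approximately uniform / 2-independent, which is
        # good enough for an illustration.
        return (self.a * x + self.b) % self.P % self.m
```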
Algorithm
Let $m$ be a power of two. Before starting, the algorithm picks a random hash function $h \colon [n] \to [m]$ from a 2-independent family.

We throw the elements of the sequence onto the segment $[0, 1]$: taking the next value $a$, we write down "zero, point, $h(a)$" in binary. For example, with $m = 16$ and $h(a) = 3$ we would get the number $0.0011$.

Denote by $\mathrm{zero}(h(a))$ the number of leading zeros in this binary representation of $h(a)$ (written with $\log_2 m$ bits), and let $z$ be the maximum of $\mathrm{zero}(h(a))$ over the whole sequence. We know that the minimum, i.e. leftmost, value lies between $2^{-z-1}$ and $2^{-z}$.

The algorithm's answer: $2^{z + 1/2}$.
```python
def init():
    h = H.sample()   # draw a random hash function from the 2-independent family
    z = 0

def process(a):
    z = max(z, zero(h(a)))

def answer():
    return 2 ** (z + 0.5)
```
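For completeness, here is a self-contained, runnable sketch of the same idea (my own code, not the author's test implementation). It reuses the `LinearHash` sketch from above; `m_bits` and the usage example are illustrative choices:

```python
import random

class FlajoletMartin:
    """A sketch of the estimator; relies on the LinearHash class defined above."""

    def __init__(self, m_bits=32):
        self.m_bits = m_bits                # m = 2**m_bits, an assumed choice
        self.h = LinearHash(1 << m_bits)    # random hash from the 2-independent family
        self.z = 0                          # max number of leading zeros seen so far

    def _leading_zeros(self, value):
        # Leading zero bits of `value` when written with exactly m_bits bits.
        return self.m_bits - value.bit_length()

    def process(self, a):
        self.z = max(self.z, self._leading_zeros(self.h(a)))

    def answer(self):
        return 2 ** (self.z + 0.5)

# Usage: a long stream with roughly 1000 distinct values.
fm = FlajoletMartin()
for x in (random.randrange(1000) for _ in range(100_000)):
    fm.process(x)
print(fm.answer())   # typically within a small constant factor of 1000
```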
Analysis
I plan to show that the algorithm's answer exceeds the true value by more than a factor of $3$ with probability at most $\sqrt{2}/3$. Similarly, the algorithm returns an answer more than $3$ times smaller than the true one with probability at most $\sqrt{2}/3$. If you do not feel like going into the mathematics, feel free to skip to the next part.

Denote by $S$ the set of all distinct numbers of the sequence, and by $d = |S|$ their count.
To analyze the algorithm we need two families of random variables.
- $X_{r,a}$ takes the value $1$ if the number of leading zeros in the binary representation of $h(a)$ is at least $r$, and $0$ otherwise.
- $Y_r = \sum_{a \in S} X_{r,a}$. The quantity $Y_r$ is greater than zero exactly when the variable $z$ at the end of the algorithm is at least $r$.
Note that the probability

$$\Pr[X_{r,a} = 1] = 2^{-r}:$$

the value $h(a)$ is uniformly distributed over $[m]$; $m$ is a power of two; and there are exactly $m / 2^r$ numbers in $[m]$ with at least $r$ leading zeros.
The expectation is $\mathbb{E}[X_{r,a}] = 2^{-r}$. We bound the variance from above:

$$\mathrm{Var}[X_{r,a}] \le \mathbb{E}[X_{r,a}^2] = 2^{-r}.$$

Note that the variance is additive over the variables $X_{r,a}$. For any two random variables $U$ and $V$

$$\mathrm{Var}[U + V] = \mathrm{Var}[U] + \mathrm{Var}[V] + 2\,\mathrm{Cov}(U, V),$$

and if $U$ and $V$ are independent, then $\mathrm{Cov}(U, V) = 0$, so $\mathrm{Var}[U + V] = \mathrm{Var}[U] + \mathrm{Var}[V]$. Moreover, $\mathrm{Cov}(X_{r,a}, X_{r,b}) = 0$ for $a \neq b$, because the values $h(a)$ and $h(b)$ are 2-independent.

Now consider the quantity $Y_r$:
- $\mathbb{E}[Y_r] = d \cdot 2^{-r}$ by linearity of expectation;
- $\mathrm{Var}[Y_r] \le d \cdot 2^{-r}$ by additivity of variance for 2-independent variables.
Let $r_1$ be the minimal number such that $2^{r_1 + 1/2} > 3d$. The event "the algorithm gave an answer more than $3$ times the true one" is equivalent to the event $z \ge r_1$, which in turn is equivalent to the event $Y_{r_1} > 0$. Applying Markov's inequality, we bound the probability:

$$\Pr[Y_{r_1} > 0] = \Pr[Y_{r_1} \ge 1] \le \mathbb{E}[Y_{r_1}] = \frac{d}{2^{r_1}} < \frac{\sqrt{2}}{3}.$$
Let $r_2$ be the maximal number such that $2^{r_2 + 1/2} < d/3$. Similarly, the event "the algorithm gave an answer more than $3$ times smaller than the true one" is equivalent to the event $z \le r_2$, which in turn is equivalent to the event $Y_{r_2 + 1} = 0$. Applying Chebyshev's inequality, we get

$$\Pr[Y_{r_2+1} = 0] \le \Pr\bigl[\,\lvert Y_{r_2+1} - \mathbb{E}[Y_{r_2+1}]\rvert \ge \mathbb{E}[Y_{r_2+1}]\,\bigr] \le \frac{\mathrm{Var}[Y_{r_2+1}]}{\mathbb{E}[Y_{r_2+1}]^2} \le \frac{2^{r_2+1}}{d} < \frac{\sqrt{2}}{3}.$$
Final chord: median
It remains to understand how to reduce the error. Consider the case when the algorithm returns an answer that is too large. Let's run the algorithm $t$ times in parallel and return the median of the answers. I claim that for $t = O(\log(1/\delta))$ the median exceeds $3d$ with probability no more than $\delta$. Bounding the error in the other direction in the same way, we get that the median lies within a factor of $3$ of the true answer with probability at least $1 - 2\delta$.
Why does the median work like this? Because of the Chernoff bound. Introduce random variables $B_1, \dots, B_t$: the value $B_i$ equals one if the answer of the algorithm on the $i$-th run is less than $3d$. The probability of this event is at least $1 - \sqrt{2}/3 > 0.52$.

If the median of the $t$ runs is greater than $3d$, then at least half of the runs produced an answer of at least $3d$, i.e. $\sum_i B_i \le t/2$, while $\mathbb{E}\bigl[\sum_i B_i\bigr] \ge 0.52\,t$. Then, by the Hoeffding–Chernoff inequality,

$$\Pr\Bigl[\sum_i B_i \le t/2\Bigr] \le e^{-c t},$$

where $c$ is some constant. The other case is handled similarly.
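A minimal sketch of the median trick, built on the `FlajoletMartin` class from the earlier sketch (again my own illustration; the number of runs is an arbitrary choice rather than a tuned value):

```python
import random
from statistics import median

def count_distinct(stream, runs=15):
    """Median of several independent Flajolet-Martin estimates."""
    estimators = [FlajoletMartin() for _ in range(runs)]
    for x in stream:
        for fm in estimators:
            fm.process(x)
    return median(fm.answer() for fm in estimators)

# Usage: a stream with roughly 1000 distinct values.
print(count_distinct(random.randrange(1000) for _ in range(100_000)))
```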
Lower bound for exact algorithm
Let's imagine that someone actually came up with a deterministic algorithm that finds the exact number of distinct elements in a single pass. We will show that such an algorithm must use $\Omega(n)$ memory.
Take a set $S \subseteq \{1, \dots, n\}$ of size $n/2$ and place its elements at the beginning of the sequence. We feed this part to the algorithm and look at its memory.

From the memory of the algorithm alone one can reconstruct the entire set $S$: if we feed the algorithm, in its current state, a number $a \in S$, its answer does not change; if $a \notin S$, the answer increases by $1$. Hence each set $S$ must correspond to its own unique memory state.
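To make the extraction step concrete, here is a sketch of how one could recover $S$ from a snapshot of such an algorithm. The interface (`copy`, `process`, `answer`) is assumed purely for illustration and does not come from the original post:

```python
def recover_set(state, n):
    """Recover the set S fed to an exact one-pass distinct counter from its memory state.

    The `state` object is assumed to expose copy(), process(a) and answer();
    this interface is hypothetical and only illustrates the argument.
    """
    baseline = state.answer()
    recovered = set()
    for a in range(1, n + 1):
        probe = state.copy()            # clone the current memory state
        probe.process(a)
        if probe.answer() == baseline:  # answer unchanged => a was already in S
            recovered.add(a)
    return recovered
```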
There are about $2^n / \sqrt{n}$ distinct subsets of $\{1, \dots, n\}$ of size $n/2$ (more precisely, $\binom{n}{n/2} \ge 2^n / (n + 1)$). If we want to assign a distinct bit string (memory state) to each such set, we will need at least $\log_2 \binom{n}{n/2} = n - O(\log n) = \Omega(n)$ bits of memory.
What to read
- "Probabilistic Counting Algorithms for Data Base Applications", Flajolet, Martin, 1983, link.
- "The Space Complexity of Approximating the Frequency Moments", Alon, Matias, Szegedy, 1999, link.