
How many numbers are in the array

A little background. I wrote this post for two purposes. First, to try out a converter from Markdown + $\LaTeX$ markup into a readable form. Second, to describe an interesting problem from data streaming. By the time I finished writing, I had found a post about LogLog from four years ago. Luckily for me, the author of that post focused on implementation. Relying on $\LaTeX$, I will instead tell you more about the mathematics.

Let's imagine that we have a router. Many packets addressed to different destinations pass through it. We want statistics on how many distinct addresses are involved in the communication. There are a couple of problems.


Task. Given a sequence of integers $a_1, a_2, \ldots, a_m$ in which every number takes a value from $0$ to $n - 1$, count the number of distinct numbers in one pass using $O(\log n)$ memory.

I will describe the Flajolet-Martin algorithm, which is probabilistic and approximate.


At the end of the post I will explain why exact deterministic algorithms require $\Omega(n)$ memory.

Flajolet-Martin Algorithm


Imagine that we have the segment of real numbers $[0, 1]$. We throw $k$ points onto the segment independently and uniformly at random. What is the distance between zero and the leftmost point?

One can guess that the points divide the segment into $k + 1$ subsegments of approximately the same length. If we carefully write out the expectation of the distance and take the integral, we get

$$E\left[\min_i x_i\right] = \int_0^1 (1 - t)^k \, dt = \frac{1}{k + 1}.$$

Now suppose someone has thrown an unknown number of points onto the segment at random, and $\rho$ is the distance from zero to the leftmost point. Then we can estimate that the total number of points is of order $1 / \rho$.
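The claim about the leftmost point can be checked empirically. Below is a minimal simulation sketch that is not part of the original post; the function name `mean_leftmost`, the trial count and the seed are illustrative assumptions. It averages the leftmost of $k$ uniform points over many trials and prints the result next to $1 / (k + 1)$.

```python
import random


def mean_leftmost(k, trials=20000, seed=1):
    """Average distance from 0 to the leftmost of k uniform points on [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One experiment: throw k points and keep the smallest coordinate.
        total += min(rng.random() for _ in range(k))
    return total / trials


for k in (1, 9, 99):
    # The two printed numbers should be close to each other.
    print(k, mean_leftmost(k), 1 / (k + 1))
```

Inverting the observed distance gives a rough estimate of the number of points, which is the intuition the algorithm builds on.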

The idea of the Flajolet-Martin algorithm is to throw all the numbers of the sequence $a_1, \ldots, a_m$ randomly onto the segment $[0, 1]$ and then find the distance from zero to the leftmost point. If equal numbers always land on the same point, and different numbers are distributed independently over the segment, we will be able to estimate the answer.

2-independent hash functions


To throw numbers onto the segment, we will use a random hash function.

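Definition. A family of hash functions $H = \{h : X \to Y\}$ is called 2-independent if for any $x_1 \ne x_2 \in X$ and any $y_1, y_2 \in Y$

$$\Pr_{h \in H}\left[h(x_1) = y_1 \wedge h(x_2) = y_2\right] = \frac{1}{|Y|^2}.$$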

The meaning of the definition is as follows. Fix any two different keys $x_1$ and $x_2$ and look at the random variables $h(x_1)$ and $h(x_2)$. The randomness comes from the choice of the function $h$. Then, by definition, $h(x_1)$ and $h(x_2)$ behave as independent random variables.

As a consequence, if you take just one key $x$, the value $h(x)$ is uniformly distributed over $Y$.

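For example, take a prime number $p$. Let $X = Y = \{0, 1, \ldots, p - 1\}$, and let $H$ be the family of all linear maps modulo $p$:

$$h_{a,b}(x) = (a x + b) \bmod p$$

for $a, b \in \{0, 1, \ldots, p - 1\}$. Then

$$\Pr_{h \in H}\left[h(x_1) = y_1 \wedge h(x_2) = y_2\right] = \Pr_{a, b}\left[a x_1 + b \equiv y_1 \wedge a x_2 + b \equiv y_2 \pmod{p}\right].$$

Since $x_1 \ne x_2$, the system has exactly one solution $(a, b)$ among the $p^2$ possible pairs:

$$\Pr_{h \in H}\left[h(x_1) = y_1 \wedge h(x_2) = y_2\right] = \frac{1}{p^2}.$$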

Note two important points. First, storing such a function takes $O(\log p)$ memory, which is very little. Second, on closer inspection, all the calculations take place in the field $\mathbb{F}_p$ and generalize to any finite field.
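For a small prime the 2-independence of the linear family can be verified by brute force. The sketch below is an illustration added for this purpose; the function name `check_two_independence` and the concrete values $p = 7$, $x_1 = 2$, $x_2 = 5$ are assumptions. It enumerates all $p^2$ functions and counts how often each pair of hash values occurs.

```python
from collections import Counter
from itertools import product


def check_two_independence(p, x1, x2):
    """Count how often each pair (h(x1), h(x2)) occurs over all h(x) = (a*x + b) % p."""
    counts = Counter()
    for a, b in product(range(p), repeat=2):
        counts[((a * x1 + b) % p, (a * x2 + b) % p)] += 1
    return counts


counts = check_two_independence(p=7, x1=2, x2=5)
# Every one of the 7 * 7 pairs (y1, y2) is hit by exactly one function.
print(len(counts), set(counts.values()))  # prints: 49 {1}
```

Each pair appearing exactly once among $p^2$ functions is the probability $1 / p^2$ required by the definition.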

In the test code I will use a Galois field $GF(2^k)$. In the description below, we can assume that we have a family of hash functions $H = \{h : [0, n - 1] \to [0, n - 1]\}$ where $n$ is a power of two. Storing one function takes $O(\log n)$ memory.

Algorithm


Let $n$ be a power of two.
Before starting, the algorithm randomly selects a hash function $h$ from a 2-independent family.

We will throw the elements of the sequence onto the segment $[0, 1]$. Take the next value $h(a_i)$ and write: zero, point, then $h(a_i)$ in binary form, padded with leading zeros to $\log_2 n$ bits. For example, if $h(a_i) = 0110_2$ then we get the number $0.0110_2$.

Denote by $\mathrm{zeros}(h(a_i))$ the number of leading zeros in the binary form of $h(a_i)$. Let $z = \max_i \mathrm{zeros}(h(a_i))$. We know that the minimum value lies between $2^{-z-1}$ and $2^{-z}$.

The algorithm's answer: $2^{z + 0.5}$.

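    def init():
        global h, z
        h = H.sample()  # random hash function from a 2-independent family H
        z = 0           # largest number of leading zeros seen so far

    def process(a):
        global z
        # zero(x) returns the number of leading zeros in the binary form of x
        z = max(z, zero(h(a)))

    def answer():
        return 2 ** (z + 0.5)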
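The pseudocode above leaves the family `H` and the function `zero` abstract. Here is one possible runnable sketch, under assumptions that are not from the original post: the class name `FlajoletMartin` is hypothetical, and instead of a Galois field of characteristic two it uses the linear family modulo the Mersenne prime $2^{61} - 1$, so hash values are only approximately uniform over 61-bit strings. Keys are assumed to be non-negative integers smaller than that prime.

```python
import random

P = 2**61 - 1  # assumption: Mersenne prime used as the field size
BITS = 61      # hash values are read as BITS-bit binary strings


class FlajoletMartin:
    def __init__(self, rng=random):
        # h(x) = (a*x + b) mod P, a random member of the linear family.
        self.a = rng.randrange(P)
        self.b = rng.randrange(P)
        self.z = 0  # largest number of leading zeros seen so far

    @staticmethod
    def zeros(v):
        """Number of leading zeros of v written with BITS binary digits."""
        return BITS - v.bit_length()

    def process(self, x):
        v = (self.a * x + self.b) % P
        self.z = max(self.z, self.zeros(v))

    def answer(self):
        return 2 ** (self.z + 0.5)


fm = FlajoletMartin(random.Random(42))
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:  # 7 distinct values
    fm.process(x)
print(fm.answer())
```

Repeated values hash to the same point, so they never change `z`; only the set of distinct values affects the answer.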

Analysis



I plan to show that the algorithm's answer exceeds the true value by a factor of 3 or more with probability less than $0.48$. Similarly, the algorithm returns an answer at least 3 times smaller than the true value with probability less than $0.48$. If you do not plan to dig into the mathematics, feel free to skip to the next part.

Denote by $S$ the set of all distinct numbers of the sequence $a_1, \ldots, a_m$, and by $d$ their number.

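To analyze the algorithm, we need two sets of random variables.

  1. $X_{r,a}$ for every $a \in S$ and $r \ge 0$: equals 1 if $h(a)$ has at least $r$ leading zeros, and 0 otherwise.
  2. $Y_r = \sum_{a \in S} X_{r,a}$: the number of distinct values whose hash has at least $r$ leading zeros.

Note that $\Pr[X_{r,a} = 1] = 2^{-r}$: the value $h(a)$ is uniformly distributed over $[0, n - 1]$; $n$ is a power of two; and there are exactly $n / 2^r$ numbers with at least $r$ leading zeros.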
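The counting step behind this probability is easy to confirm by enumeration. The sketch below is an added illustration with a hypothetical helper `count_with_leading_zeros`; it takes $n = 2^k$ and checks that exactly $n / 2^r$ of the $k$-bit values have at least $r$ leading zeros.

```python
def count_with_leading_zeros(k, r):
    """How many k-bit values (0 .. 2**k - 1) have at least r leading zeros."""
    return sum(1 for v in range(2**k) if k - v.bit_length() >= r)


k = 10  # n = 2**10 = 1024
for r in range(k + 1):
    # A uniform k-bit value therefore has at least r leading zeros with probability 2**-r.
    assert count_with_leading_zeros(k, r) == 2**k // 2**r
```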

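Hence the expectation is $E[Y_r] = d / 2^r$. Let us bound the variance from above:

$$Var[X_{r,a}] \le E[X_{r,a}^2] = E[X_{r,a}] = 2^{-r}.$$

Note that the variance is additive over the variables $X_{r,a}$. For any two random variables $X$ and $Y$

$$Var[X + Y] = E[(X + Y)^2] - (E[X + Y])^2 = Var[X] + Var[Y] + 2\left(E[XY] - E[X]E[Y]\right).$$

If $X$ and $Y$ are independent then $E[XY] = E[X]E[Y]$. So

$$Var[X + Y] = Var[X] + Var[Y].$$

Moreover, $Var[Y_r] = \sum_{a \in S} Var[X_{r,a}] \le d / 2^r$ because the variables $X_{r,a}$ are 2-independent.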

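Now consider the value $Y_r$. By the definition of $z$, some hash has at least $r$ leading zeros exactly when $z \ge r$:

$$Y_r > 0 \iff z \ge r.$$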

Let $u$ be the minimum number such that $2^{u + 0.5} \ge 3d$. The event "the algorithm's answer is at least 3 times larger than the true one" is equivalent to the event $z \ge u$, which in turn is equivalent to the event $Y_u > 0$. Applying Markov's inequality, we bound the probability:

$$\Pr[Y_u > 0] = \Pr[Y_u \ge 1] \le E[Y_u] = \frac{d}{2^u} \le \frac{\sqrt{2}}{3} < 0.48.$$

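Let $v$ be the maximum number such that $2^{v + 0.5} \le d / 3$. Similarly, the event "the algorithm's answer is at least 3 times smaller than the true one" is equivalent to the event $z \le v$, which in turn is equivalent to the event $Y_{v+1} = 0$. Applying the Chebyshev inequality, we get

$$\Pr[Y_{v+1} = 0] \le \Pr\left[\,|Y_{v+1} - E[Y_{v+1}]| \ge E[Y_{v+1}]\,\right] \le \frac{Var[Y_{v+1}]}{E[Y_{v+1}]^2} \le \frac{2^{v+1}}{d} \le \frac{\sqrt{2}}{3} < 0.48.$$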

Final chord: median


It remains to understand how to reduce the error. Take the case when the algorithm returns too large an answer. Let's run the algorithm $t$ times in parallel and return the median of the answers. I claim that if $t = O(\log \frac{1}{\varepsilon})$, the algorithm errs with probability at most $\varepsilon$. Bounding the error in the other direction in the same way, we get

$$\Pr\left[\frac{d}{3} < \text{answer} < 3d\right] \ge 1 - 2\varepsilon.$$

Why does the median work like this? By the Chernoff inequality. Introduce random variables $Z_1, \ldots, Z_t$. The variable $Z_i$ equals one if the answer of the $i$-th run is less than $3d$. The probability of this event is at least 0.52.

If the median of $t$ runs is at least $3d$, then at least half of the runs returned an answer of at least $3d$, so $\sum_i Z_i \le t / 2$. Then by the Hoeffding-Chernoff inequality

$$\Pr\left[\sum_{i=1}^{t} Z_i \le \frac{t}{2}\right] \le \Pr\left[\sum_{i=1}^{t} Z_i - E\left[\sum_{i=1}^{t} Z_i\right] \le -0.02\,t\right] \le e^{-c t},$$

where $c$ is some constant. The other case is shown similarly.
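The median trick can be sketched in a few lines. This is an illustration under assumptions, not the author's code: the names `single_estimate` and `median_estimate` are hypothetical, the hash is the linear family modulo the prime $2^{61} - 1$ rather than a field of characteristic two, and for brevity the independent runs re-read a stored list instead of sharing one pass over a stream.

```python
import random
from statistics import median

P = 2**61 - 1  # assumption: Mersenne prime as the field size
BITS = 61


def single_estimate(stream, rng):
    """One run: pick a random linear hash and return 2**(z + 0.5)."""
    a, b = rng.randrange(P), rng.randrange(P)
    z = max(BITS - ((a * x + b) % P).bit_length() for x in stream)
    return 2 ** (z + 0.5)


def median_estimate(stream, runs, rng):
    """Median over independent runs, each with its own hash function."""
    return median(single_estimate(stream, rng) for _ in range(runs))


stream = list(range(1, 1001)) * 2  # 1000 distinct values, each repeated twice
print(median_estimate(stream, runs=101, rng=random.Random(7)))
```

In a real streaming setting the $t$ copies would each keep their own pair `(a, b)` and counter `z` and be updated together on every incoming element, so the memory grows by a factor of $t$.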

Lower bound for exact algorithm


Let's imagine that someone really came up with a deterministic algorithm that finds the exact number of distinct elements in one pass. We show that such an algorithm must use $\Omega(n)$ memory.

Take a set $S$ of size $n / 2$ and put it at the beginning of the sequence. Feed this part to the algorithm and look at its memory.

From the memory of the algorithm alone, one can extract the entire set $S$. If we feed a number $x \in S$ to the current state, the algorithm's answer will not change; if $x \notin S$, it will increase by 1. Hence each set $S$ must correspond to its own unique memory state.

The number of distinct subsets of $\{0, \ldots, n - 1\}$ of size $n / 2$ is about $2^n / \sqrt{n}$. If we want to assign a separate bit string to each set, we will need $\log_2 \left(2^n / \sqrt{n}\right) = \Omega(n)$ bits.
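To see how close this count comes to $n$ bits, one can compute the logarithm of the binomial coefficient directly. The helper `bits_needed` below is a hypothetical name used only for this illustration.

```python
from math import comb, log2


def bits_needed(n):
    """Bits required to give every n/2-element subset of an n-element universe its own state."""
    return log2(comb(n, n // 2))


for n in (16, 64, 256, 1024):
    # The value stays between n - log2(n) and n, so it grows linearly in n.
    print(n, round(bits_needed(n), 1))
```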

What to read


  1. “Probabilistic Counting Algorithms for Data Base Applications”, Flajolet, Martin, 1983.
  2. “The space complexity of approximating the frequency moments”, Alon, Matias, Szegedy, 1999.

Source: https://habr.com/ru/post/263211/

