A little background. I have written this post for two purposes. First, to check that a post written in Markdown plus a formula-markup converter comes out in readable form. Secondly, to tell you about an interesting problem from data streaming. Towards the end of writing I discovered a post about LogLog from four years ago. Luckily for me, the author of that post focused on the implementation; I am going to say more about the mathematics.

Let's imagine that we have a router. Many packets addressed to different hosts pass through it. We would like statistics on how many distinct addresses take part in the communication. There are a couple of problems.
- There are so many packets that we cannot remember them all; telling a packet "Come back, all is forgiven!" will not work either.
- There are far more possible addresses than the router has memory for.

The task
There is a sequence of integers $a_1, a_2, \dots$, in which every number takes a value from $1$ to $n$. We need to count, in a single pass, the number of distinct values using $O(\log n)$ memory.
I will describe the probabilistic, approximate Flajolet–Martin algorithm. Its key characteristics:
- it uses $O(\log n)$ memory;
- it works on any input;
- it finds an answer that differs from the exact one by no more than a factor of $3$: each direction of error happens with probability at most $\sqrt{2}/3$ (shown in the analysis below), and the probability is taken over the random bits of the algorithm.

At the end of the post I will explain why exact deterministic algorithms require $\Omega(n)$ memory.
The Flajolet–Martin algorithm
Imagine that we have the segment of real numbers $[0, 1]$. We throw $k$ points onto the segment independently and uniformly at random. What is the distance between the leftmost point and zero?

Intuitively, the points divide the segment into $k + 1$ smaller sub-segments of approximately equal length. If we carefully write out the expectation of this distance and take the integral, we get

$$\mathbb{E}[\text{distance}] = \frac{1}{k + 1}.$$

So if someone throws some number of points onto the segment at random and tells us only the distance $X$ from zero to the leftmost point, we can estimate that the total number of points is of the order of $1/X$.
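As a quick sanity check of the $1/(k+1)$ formula, here is a tiny Monte Carlo simulation (my own addition; the helper name `mean_leftmost` and the parameters are arbitrary):

```python
import random

def mean_leftmost(k, trials=100_000):
    """Average distance from zero to the leftmost of k uniform points on [0, 1]."""
    total = 0.0
    for _ in range(trials):
        total += min(random.random() for _ in range(k))
    return total / trials

for k in (1, 3, 10, 100):
    # The two columns should be close to each other.
    print(k, round(mean_leftmost(k), 4), round(1 / (k + 1), 4))
```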
The idea of the Flajolet–Martin algorithm is to randomly throw all the numbers of the sequence onto the segment $[0, 1]$ and then look at the distance from zero to the leftmost point. If identical numbers always land at the same point, while different numbers are distributed over the segment independently, we will be able to estimate the answer.
2-independent hash functions
To throw numbers onto the segment we will use a random hash function.

Definition. A family of hash functions $H = \{h \colon [n] \to [m]\}$ (where $[m]$ denotes $\{0, 1, \dots, m - 1\}$) is called 2-independent if for any two distinct keys $x \neq y$ and any values $a, b \in [m]$

$$\Pr_{h \in H}\bigl[h(x) = a \text{ and } h(y) = b\bigr] = \frac{1}{m^2}.$$
The meaning of the definition is as follows. Fix any two keys $x$ and $y$, with $x \neq y$, and look at the random variables $h(x)$ and $h(y)$, where the randomness comes from the choice of the function $h$. Then, by the definition, $h(x)$ and $h(y)$ behave like independent random variables. As a consequence, if you take just one key $x$, the value $h(x)$ is uniformly distributed over $[m]$.
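The last claim is a one-line consequence of the definition: summing over all possible values of $h(y)$ gives

$$\Pr[h(x) = a] = \sum_{b \in [m]} \Pr\bigl[h(x) = a,\ h(y) = b\bigr] = m \cdot \frac{1}{m^2} = \frac{1}{m}.$$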
For example, take a prime number $p$ and let both keys and values range over $\mathbb{Z}_p = \{0, 1, \dots, p - 1\}$. Let $H$ be the family of all linear mappings modulo $p$:

$$h_{a,b}(x) = (a x + b) \bmod p$$

for $a, b \in \mathbb{Z}_p$. Then, for fixed distinct keys $x \neq y$ and target values $u, v$, the conditions $h_{a,b}(x) = u$ and $h_{a,b}(y) = v$ form a system of two linear equations in the unknowns $a$ and $b$:

$$a x + b \equiv u, \qquad a y + b \equiv v \pmod{p}.$$

Since $x - y \neq 0$ is invertible modulo $p$, the system has exactly one solution among the $p^2$ possible pairs $(a, b)$:

$$\Pr_{a,b}\bigl[h_{a,b}(x) = u,\ h_{a,b}(y) = v\bigr] = \frac{1}{p^2}.$$

Let's note two important points. First, storing such a function takes $O(\log p)$ memory, which is very little. Second, if you look closely, you can see that all the computations take place in the field $\mathbb{Z}_p$, so the construction generalizes to any finite field.

In the test code I use a Galois field of the form $GF(2^k)$. In the description below we may assume that we have a family of hash functions $h \colon [n] \to [m]$, where $m$ is a power of two. Storing one function takes $O(\log n)$ memory.
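To make this concrete, here is a small Python sketch of the linear family modulo a prime (my own illustration, not the post's test code; the class name `LinearHash` and the particular prime are arbitrary choices, and a real implementation over $GF(2^k)$ would use carry-less arithmetic instead):

```python
import random

class LinearHash:
    """A random hash h_{a,b}(x) = ((a*x + b) mod p) mod m from the linear family."""

    P = (1 << 61) - 1   # a Mersenne prime, assumed to be much larger than n and m

    def __init__(self, m):
        assert m & (m - 1) == 0, "m is expected to be a power of two"
        self.m = m
        self.a = random.randrange(self.P)
        self.b = random.randrange(self.P)

    def __call__(self, x):
        # Reducing modulo m maps the p hash values onto [m]; for p >> m the
        # result is only approximately uniform / 2-independent, which is
        # good enough for an illustration.
        return (self.a * x + self.b) % self.P % self.m
```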
Algorithm
Let $m$ be a power of two. Before starting, the algorithm picks a random hash function $h \colon [n] \to [m]$ from a 2-independent family.

We throw the elements of the sequence onto the segment $[0, 1]$: taking the next value $a$, we write down "zero, point, $h(a)$" in binary. For example, with $m = 16$ and $h(a) = 3$ we would get the number $0.0011$.

Denote by $\mathrm{zero}(h(a))$ the number of leading zeros in this binary representation of $h(a)$ (written with $\log_2 m$ bits), and let $z$ be the maximum of $\mathrm{zero}(h(a))$ over the whole sequence. We know that the minimum, i.e. leftmost, value lies between $2^{-z-1}$ and $2^{-z}$.

The algorithm's answer: $2^{z + 1/2}$.
```python
def init():
    h = H.sample()   # draw a random hash function from the 2-independent family
    z = 0

def process(a):
    z = max(z, zero(h(a)))

def answer():
    return 2 ** (z + 0.5)
```
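For completeness, here is a self-contained, runnable sketch of the same idea (my own code, not the author's test implementation). It reuses the `LinearHash` sketch from above; `m_bits` and the usage example are illustrative choices:

```python
import random

class FlajoletMartin:
    """A sketch of the estimator; relies on the LinearHash class defined above."""

    def __init__(self, m_bits=32):
        self.m_bits = m_bits                # m = 2**m_bits, an assumed choice
        self.h = LinearHash(1 << m_bits)    # random hash from the 2-independent family
        self.z = 0                          # max number of leading zeros seen so far

    def _leading_zeros(self, value):
        # Leading zero bits of `value` when written with exactly m_bits bits.
        return self.m_bits - value.bit_length()

    def process(self, a):
        self.z = max(self.z, self._leading_zeros(self.h(a)))

    def answer(self):
        return 2 ** (self.z + 0.5)

# Usage: a long stream with roughly 1000 distinct values.
fm = FlajoletMartin()
for x in (random.randrange(1000) for _ in range(100_000)):
    fm.process(x)
print(fm.answer())   # typically within a small constant factor of 1000
```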
Analysis
I plan to show that the algorithm's answer exceeds the true value by more than a factor of $3$ with probability at most $\sqrt{2}/3$. Similarly, the algorithm returns an answer more than $3$ times smaller than the true one with probability at most $\sqrt{2}/3$. If you do not feel like going into the mathematics, feel free to skip to the next part.

Denote by $S$ the set of all distinct numbers of the sequence, and by $d = |S|$ their count.
To analyze the algorithm we need two families of random variables.
- $X_{r,a}$ takes the value $1$ if the number of leading zeros in the binary representation of $h(a)$ is at least $r$, and $0$ otherwise.
- $Y_r = \sum_{a \in S} X_{r,a}$. The quantity $Y_r$ is greater than zero exactly when the variable $z$ at the end of the algorithm is at least $r$.
Note that the probability

$$\Pr[X_{r,a} = 1] = 2^{-r}:$$

the value $h(a)$ is uniformly distributed over $[m]$; $m$ is a power of two; and there are exactly $m / 2^r$ numbers in $[m]$ with at least $r$ leading zeros.
The expectation is $\mathbb{E}[X_{r,a}] = 2^{-r}$. We bound the variance from above:

$$\mathrm{Var}[X_{r,a}] \le \mathbb{E}[X_{r,a}^2] = 2^{-r}.$$

Note that the variance is additive over the variables $X_{r,a}$. For any two random variables $U$ and $V$

$$\mathrm{Var}[U + V] = \mathrm{Var}[U] + \mathrm{Var}[V] + 2\,\mathrm{Cov}(U, V),$$

and if $U$ and $V$ are independent, then $\mathrm{Cov}(U, V) = 0$, so $\mathrm{Var}[U + V] = \mathrm{Var}[U] + \mathrm{Var}[V]$. Moreover, $\mathrm{Cov}(X_{r,a}, X_{r,b}) = 0$ for $a \neq b$, because the values $h(a)$ and $h(b)$ are 2-independent.

Now consider the quantity $Y_r$:
- $\mathbb{E}[Y_r] = d \cdot 2^{-r}$ by linearity of expectation;
- $\mathrm{Var}[Y_r] \le d \cdot 2^{-r}$ by additivity of variance for 2-independent variables.
Let $r_1$ be the minimal number such that $2^{r_1 + 1/2} > 3d$. The event "the algorithm gave an answer more than $3$ times the true one" is equivalent to the event $z \ge r_1$, which in turn is equivalent to the event $Y_{r_1} > 0$. Applying Markov's inequality, we bound the probability:

$$\Pr[Y_{r_1} > 0] = \Pr[Y_{r_1} \ge 1] \le \mathbb{E}[Y_{r_1}] = \frac{d}{2^{r_1}} < \frac{\sqrt{2}}{3}.$$
Let $r_2$ be the maximal number such that $2^{r_2 + 1/2} < d/3$. Similarly, the event "the algorithm gave an answer more than $3$ times smaller than the true one" is equivalent to the event $z \le r_2$, which in turn is equivalent to the event $Y_{r_2 + 1} = 0$. Applying Chebyshev's inequality, we get

$$\Pr[Y_{r_2+1} = 0] \le \Pr\bigl[\,\lvert Y_{r_2+1} - \mathbb{E}[Y_{r_2+1}]\rvert \ge \mathbb{E}[Y_{r_2+1}]\,\bigr] \le \frac{\mathrm{Var}[Y_{r_2+1}]}{\mathbb{E}[Y_{r_2+1}]^2} \le \frac{2^{r_2+1}}{d} < \frac{\sqrt{2}}{3}.$$
Final chord: median
It remains to understand how to reduce the error. Consider the case when the algorithm returns an answer that is too large. Let's run the algorithm $t$ times in parallel and return the median of the answers. I claim that for $t = O(\log(1/\delta))$ the median exceeds $3d$ with probability no more than $\delta$. Bounding the error in the other direction in the same way, we get that the median lies within a factor of $3$ of the true answer with probability at least $1 - 2\delta$.
Why does the median work like this? Because of the Chernoff bound. Introduce random variables $B_1, \dots, B_t$: the value $B_i$ equals one if the answer of the algorithm on the $i$-th run is less than $3d$. The probability of this event is at least $1 - \sqrt{2}/3 > 0.52$.

If the median of the $t$ runs is greater than $3d$, then at least half of the runs produced an answer of at least $3d$, i.e. $\sum_i B_i \le t/2$, while $\mathbb{E}\bigl[\sum_i B_i\bigr] \ge 0.52\,t$. Then, by the Hoeffding–Chernoff inequality,

$$\Pr\Bigl[\sum_i B_i \le t/2\Bigr] \le e^{-c t},$$

where $c$ is some constant. The other case is handled similarly.
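A minimal sketch of the median trick, built on the `FlajoletMartin` class from the earlier sketch (again my own illustration; the number of runs is an arbitrary choice rather than a tuned value):

```python
import random
from statistics import median

def count_distinct(stream, runs=15):
    """Median of several independent Flajolet-Martin estimates."""
    estimators = [FlajoletMartin() for _ in range(runs)]
    for x in stream:
        for fm in estimators:
            fm.process(x)
    return median(fm.answer() for fm in estimators)

# Usage: a stream with roughly 1000 distinct values.
print(count_distinct(random.randrange(1000) for _ in range(100_000)))
```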
Lower bound for exact algorithm
Let's imagine that someone actually came up with a deterministic algorithm that finds the exact number of distinct elements in a single pass. We will show that such an algorithm must use $\Omega(n)$ memory.
Take a set $S \subseteq \{1, \dots, n\}$ of size $n/2$ and place its elements at the beginning of the sequence. We feed this part to the algorithm and look at its memory.

From the memory of the algorithm alone one can reconstruct the entire set $S$: if we feed the algorithm, in its current state, a number $a \in S$, its answer does not change; if $a \notin S$, the answer increases by $1$. Hence each set $S$ must correspond to its own unique memory state.
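To make the extraction step concrete, here is a sketch of how one could recover $S$ from a snapshot of such an algorithm. The interface (`copy`, `process`, `answer`) is assumed purely for illustration and does not come from the original post:

```python
def recover_set(state, n):
    """Recover the set S fed to an exact one-pass distinct counter from its memory state.

    The `state` object is assumed to expose copy(), process(a) and answer();
    this interface is hypothetical and only illustrates the argument.
    """
    baseline = state.answer()
    recovered = set()
    for a in range(1, n + 1):
        probe = state.copy()            # clone the current memory state
        probe.process(a)
        if probe.answer() == baseline:  # answer unchanged => a was already in S
            recovered.add(a)
    return recovered
```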
There are about $2^n / \sqrt{n}$ distinct subsets of $\{1, \dots, n\}$ of size $n/2$ (more precisely, $\binom{n}{n/2} \ge 2^n / (n + 1)$). If we want to assign a distinct bit string (memory state) to each such set, we will need at least $\log_2 \binom{n}{n/2} = n - O(\log n) = \Omega(n)$ bits of memory.
What to read
- "Probabilistic Counting Algorithms for Data Base Applications", Flajolet, Martin, 1983, link.
- "The Space Complexity of Approximating the Frequency Moments", Alon, Matias, Szegedy, 1999, link.