
People meet, people quarrel, people add and remove friends on social networks. This post is about mathematics and algorithms, about beautiful theory, about love and hate in this unstable world. This post is about finding connected components in dynamic graphs.
The big world generates big data, and so a big graph has landed on our heads. So big that we can keep its vertices in memory, but not its edges. On top of that, a stream of updates to the graph arrives: which edge to add, which one to delete. Every such update we see for the first and last time. Under these conditions we need to find the connected components.
Depth-first or breadth-first search will not work here, simply because the whole graph does not fit in memory. A disjoint-set union structure would help a lot if edges were only added. What to do in the general case?
Task. We are given an undirected graph $G$ on $n$ vertices. Initially the graph is empty. The algorithm receives a sequence of updates, which can be read in the given order exactly once. Each update is a command to add or delete an edge between a pair of vertices $u$ and $v$. It is guaranteed that an edge is never deleted more times than it has been added up to that point. After reading the sequence, the algorithm must output all connected components with a given probability of success. It is permitted to use $O(n \log^c n)$ memory, where $c$ is some constant.
The solution of the problem consists of three ingredients.
- Incidence matrix as a representation.
- Graph contraction as an algorithm.
- $\ell_0$-sampling as an optimization.
You can find the implementation on GitHub:
link .
Incidence Matrix as Representation
The first data structure for storing the graph will be extremely non-optimal. We take an incidence matrix $A$ of size $n \times \binom{n}{2}$, which will consist mostly of zeros. Each row of the matrix corresponds to a vertex, and each column corresponds to a possible edge. Let $u < v$. For a pair of vertices $(u, v)$ connected by an edge, we define $A_{u,(u,v)} = 1$ and $A_{v,(u,v)} = -1$; all other values are zero.
As an example, look at the graph in the picture below.
For it, the incidence matrix looks like this.
Even the naked eye can see that this representation has a serious drawback: its size is $O(n^3)$. We will optimize it, but later.
There is also an implicit advantage. If you take a set of vertices $S$ and add up all the matrix row vectors $A_v$ with $v \in S$, then the edges between the vertices of $S$ cancel out, and only the edges connecting $S$ with the rest of the graph remain.
For example, if we take a suitable set $S$ in the graph from the picture and add up the corresponding vectors, the non-zero values end up exactly at the three edges crossing the cut between $S$ and the remaining vertices.
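To see the cancellation in action, here is a tiny Python sketch, my own illustration with a made-up 5-vertex graph rather than the example from the picture or the code from the repository: summing the incidence rows of a subset $S$ leaves non-zero entries only at the edges crossing the cut.

```python
from collections import defaultdict

def incidence_rows(n, edges):
    """Row vectors of the incidence matrix: vertex -> {edge column (u, v): +1/-1}."""
    rows = {v: defaultdict(int) for v in range(n)}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        rows[u][(u, v)] += 1      # A_{u,(u,v)} = +1
        rows[v][(u, v)] -= 1      # A_{v,(u,v)} = -1
    return rows

def sum_rows(rows, subset):
    """Add the row vectors of `subset` coordinate-wise and drop the zeros."""
    total = defaultdict(int)
    for v in subset:
        for col, val in rows[v].items():
            total[col] += val
    return {col: val for col, val in total.items() if val != 0}

rows = incidence_rows(5, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
# Edges inside S = {0, 1, 2} cancel; only the cut edge (2, 3) survives.
print(sum_rows(rows, {0, 1, 2}))   # {(2, 3): 1}
```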
Graph contraction as an algorithm
Let us understand what edge contraction is. We have two vertices $u$ and $v$ with an edge between them; from $u$ and $v$ there may also be other edges. Contracting the edge $(u, v)$ is the procedure where we merge the vertices $u$ and $v$ into a single one, call it $w$: the edge $(u, v)$ itself is deleted, and all remaining edges incident to $u$ and $v$ are redirected to the new vertex $w$.
An interesting feature: in terms of the incidence matrix, to contract the edge $(u, v)$ it is enough to add the corresponding row vectors. The edge $(u, v)$ itself cancels out, and only the edges going outside remain.
Now the algorithm. Take the graph and for each non-isolated vertex choose a neighbour.
Contract the corresponding edges.
Repeat the iteration $O(\log n)$ times.
Note that after each iteration the connected components of the new graph correspond one-to-one to the components of the old one. We can even record which vertices were merged into each resulting vertex, in order to reconstruct the answer later.
Note also that after each iteration any connected component of at least two vertices shrinks at least twofold. This is natural, since at least two vertices of the old component were merged into each vertex of the new one. So after $O(\log n)$ iterations only isolated vertices will remain in the graph.
We go through all the isolated vertices and reconstruct the answer from the merge history.
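Here is a minimal, deliberately memory-hungry Python sketch of this contraction procedure over explicit incidence rows (no sketches yet); it is my own illustration rather than the linked implementation. Each round, every non-isolated super-vertex picks one incident edge, the endpoints are merged with a small union-find, and the merge history yields the components.

```python
from collections import defaultdict

def connected_components(n, edges):
    # Incidence rows: vertex -> {edge column: +1/-1}, as defined above.
    rows = {v: defaultdict(int) for v in range(n)}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        rows[u][(u, v)] += 1
        rows[v][(u, v)] -= 1
    comp = {v: {v} for v in range(n)}        # super-vertex -> merged original vertices
    while True:
        owner = {v: c for c, members in comp.items() for v in members}
        parent = {c: c for c in comp}        # union-find over current super-vertices
        def find(c):
            while parent[c] != c:
                parent[c] = parent[parent[c]]
                c = parent[c]
            return c
        merged_any = False
        for c in comp:
            if not rows[c]:                  # isolated super-vertex
                continue
            u, v = next(iter(rows[c]))       # any non-zero column is an incident edge
            a, b = find(owner[u]), find(owner[v])
            if a != b:
                parent[b] = a
                merged_any = True
        if not merged_any:
            break
        # Rebuild super-vertices; summing rows cancels the edges inside each group.
        new_comp, new_rows = {}, {}
        for c in comp:
            r = find(c)
            new_comp.setdefault(r, set()).update(comp[c])
            row = new_rows.setdefault(r, defaultdict(int))
            for col, val in rows[c].items():
                row[col] += val
        comp = new_comp
        rows = {c: {col: val for col, val in row.items() if val != 0}
                for c, row in new_rows.items()}
    return list(comp.values())

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))   # [{0, 1, 2}, {3, 4}]
```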
$\ell_0$-sampling as optimization
Everything would be great, but the algorithm above needs on the order of $n^3$ time and memory. To optimize it, we will build a sketch: a special data structure for a compact representation of the row vectors.
We need three properties from the sketch.
First, compactness. If we build a sketch for a vector $a$ of size $O(n^2)$, then the sketch $\sigma(a)$ itself must have size $O(\log^c n)$.
Second, sampling. At each iteration of the algorithm we need to choose a neighbour, so we want a way to get from the sketch the index of at least one non-zero element, if there is one.
Third, linearity. If we have built sketches $\sigma(a)$ and $\sigma(b)$ for two vectors $a$ and $b$, there must be an efficient method to obtain the sketch $\sigma(a + b) = f(\sigma(a), \sigma(b))$. This is what will let us contract edges.
We will use a solution of the $\ell_0$-sampling problem.
Task. We are given a zero vector $a$ of dimension $n$. The algorithm receives a sequence of updates of the form $(i, \Delta)$: add $\Delta$ to the value $a_i$. Here $\Delta$ can be a positive or negative integer. The resulting vector may have non-zero values at some positions; denote the set of these positions by $I$. It is required to output any position from $I$ equiprobably. All updates must be processed in one pass, and $O(\log^c n)$ memory may be used. It is guaranteed that the maximum absolute value in $a$ fits into $O(\log n)$ bits.
1-sparse vector
To begin with, let us solve a simpler problem. Suppose we have a guarantee that the final vector contains exactly one non-zero position. We say that such a vector is 1-sparse. We will maintain two variables $S_1 = \sum_i a_i$ and $S_2 = \sum_i i \cdot a_i$. Maintaining them is simple: on each update $(i, \Delta)$ we add $\Delta$ to the first and $i \cdot \Delta$ to the second.
Denote the desired position by $j$. If it is the only one, then $S_1 = a_j$ and $S_2 = j \cdot a_j$. To find the position, we compute $j = S_2 / S_1$.
We can probabilistically check whether the vector is 1-sparse. To do this, take a prime number $p > n$, a random integer $z \in [p]$, and maintain the variable $T = \sum_i a_i z^i \bmod p$. The vector passes the 1-sparsity test if $S_1 \neq 0$ and $T = S_1 \cdot z^{S_2 / S_1} \bmod p$.
Obviously, if the vector really is 1-sparse, then $T = a_j z^j = S_1 z^{S_2/S_1}$ and it will pass the test. Otherwise, the probability of passing the test is small (at most $n/p$).
Why? In terms of polynomials, a vector passes the test if the value of the polynomial $q(z) = \sum_i a_i z^i - S_1 z^{S_2/S_1}$ at a randomly chosen point $z$ equals zero. If the vector is not 1-sparse, then $q(z)$ is not identically zero, so if we passed the test, we guessed a root. The degree of the polynomial is at most $n$, which means it has no more than $n$ roots, so the probability of guessing one of them is at most $n/p$.
If we want to improve the accuracy of the check to an arbitrary failure probability $\delta$, we need to compute the value $T$ at $O(\log \delta^{-1})$ random points $z$.
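To make the decoder tangible, here is a small Python sketch, my own illustration rather than the post's repository code, that maintains $S_1$, $S_2$ and the fingerprint $T$, and applies the 1-sparsity test; the particular prime, the zero-based indexing and the class name are assumptions made for the example.

```python
import random

class OneSparseDecoder:
    def __init__(self, n, p=(1 << 61) - 1):
        self.n, self.p = n, p
        self.z = random.randrange(1, p)   # random evaluation point for the fingerprint
        self.s1 = 0                       # S1 = sum of a_i
        self.s2 = 0                       # S2 = sum of i * a_i
        self.t = 0                        # T  = sum of a_i * z^i  (mod p)

    def update(self, i, delta):
        self.s1 += delta
        self.s2 += i * delta
        self.t = (self.t + delta * pow(self.z, i, self.p)) % self.p

    def decode(self):
        """Return (i, a_i) if the vector looks 1-sparse, otherwise None."""
        if self.s1 == 0 or self.s2 % self.s1 != 0:
            return None
        i = self.s2 // self.s1
        if not (0 <= i < self.n):
            return None
        # 1-sparsity test: T must equal S1 * z^i (mod p).
        if self.t != (self.s1 * pow(self.z, i, self.p)) % self.p:
            return None
        return i, self.s1

d = OneSparseDecoder(10)
d.update(7, 3); d.update(7, -1)
print(d.decode())      # (7, 2): a single non-zero position is recovered exactly
d.update(2, 5)
print(d.decode())      # None: two non-zero positions, the test rejects
```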
$s$-sparse vector
Now let us solve the problem for an $s$-sparse vector, i.e. one containing at most $s$ non-zero positions. We will need hashing and the previous method. The general idea is to recover the entire vector and then pick one of its elements at random.
Take a random 2-independent hash function $h : [n] \rightarrow [2s]$. This is a function that distributes any two distinct keys uniformly and independently.
More about hash functions
In algorithms there are two fundamentally different approaches to hash functions. The first relies on the data itself being random enough, so that our favourite hash function will mix it properly. The second assumes that a malicious adversary can craft the data, and the only way to save the algorithm from unnecessary work and other troubles is to choose the hash function randomly. Yes, sometimes this can work out badly, but we can measure exactly how badly. Here we are talking about the second approach.
The hash function is selected from a family of functions. For example, if we want to shuffle all the numbers from $0$ to $p - 1$, we can take random $a$ and $b$ and define the hash function $h(x) = (ax + b) \bmod p$.
The functions for all possible $a, b \in [p]$ form a family of hash functions. To choose a hash function at random is essentially to choose $a$ and $b$ at random; they are usually chosen equiprobably from the set $[p]$.
The remarkable property of the family above is that any two distinct keys are hashed independently. Formally, for any distinct $x_1, x_2 \in [p]$ and any, possibly equal, $y_1, y_2 \in [p]$, the probability $\Pr[h(x_1) = y_1 \wedge h(x_2) = y_2] = 1/p^2$.
This property is called 2-independence. Sometimes the probability may be not $1/p^2$ but $C/p^2$, where $C$ is some reasonably small constant.
There is a generalization of 2-independence: $k$-independence. This is when the function distributes not 2 but $k$ arbitrary keys independently. Polynomial hash functions with $k$ random coefficients have this property.
$k$-independence has one unpleasant property: as $k$ grows, the function takes up more memory. This does not depend on any particular implementation; it is simply an information-theoretic fact. Therefore, the less independence we can get away with, the easier life is.
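To make the sidebar concrete, here is a small Python sketch (my own illustration, not the post's code) of both families: the 2-independent $h(x) = (ax + b) \bmod p$ and its polynomial generalization with $k$ random coefficients. The prime and the final reduction to a table of size $2s$ are illustrative assumptions.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; an illustrative choice of field size

def random_2_independent(p=P):
    """h(x) = (a*x + b) mod p with a, b drawn uniformly from [p]."""
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p

def random_k_independent(k, p=P):
    """A polynomial with k random coefficients, evaluated by Horner's rule."""
    coeffs = [random.randrange(p) for _ in range(k)]
    def h(x):
        acc = 0
        for c in coeffs:
            acc = (acc * x + c) % p
        return acc
    return h

# Reducing the value modulo the table size (here 2s) keeps the distribution
# close to uniform, which is where the C/p^2 caveat above comes from.
s = 8
h = random_2_independent()
bucket = lambda x: h(x) % (2 * s)
print(bucket(12345), bucket(67890))
```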
Take a hash table of size $2s$. In each cell of the table we place the algorithm for a 1-sparse vector. When an update $(i, \Delta)$ arrives, we forward it to the algorithm in cell $h(i)$.
One can calculate that, for a cell containing a non-zero coordinate, the probability that at least one other non-zero coordinate collides with it is less than $1/2$.
Why? The probability that a particular other non-zero element does not fall into our cell is $(1 - \frac{1}{2s})$. The probability that none of them falls into our cell is $(1 - \frac{1}{2s})^{s-1}$. The probability that at least one falls in is $1 - (1 - \frac{1}{2s})^{s-1}$. This function is monotonically increasing in $s$ and in the limit tends to $1 - e^{-1/2} \approx 0.39 < 1/2$.
Now suppose we want to recover all the coordinates with probability of success $1 - \delta$, that is, with probability of failure $\delta$. Take not one hash table but $O(\log(s / \delta))$ of them at once. It is easy to see that the probability of failing to decode a particular coordinate becomes $\delta / s$, so the probability of failing to decode at least one of the $s$ coordinates is at most $\delta$. If, in addition, the decoders for 1-sparse vectors together fail with probability at most $\delta$, then we have won.
The final algorithm is as follows. Take $O(\log(s / \delta))$ hash tables of size $2s$. Each cell holds its own decoder for a 1-sparse vector, with probability of failure $O(\delta / (s \log(s/\delta)))$.
Every update $(i, \Delta)$ is processed in each hash table separately, by the algorithm in cell $h(i)$.
Upon completion, we extract a coordinate from every successfully finished 1-decoder and merge them into one list.
In total, at most $O(s \log(s / \delta))$ 1-decoders are affected, so the total probability that some 1-decoder works incorrectly does not exceed $\delta$. Likewise, the probability that at least one coordinate is not recovered does not exceed $\delta$. Overall, the probability of failure of the algorithm is $O(\delta)$.
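Below is a sketch of this $s$-sparse recovery structure in the same spirit, reusing `OneSparseDecoder` from the snippet above. The number of tables and the way the hash functions are drawn are simplified assumptions of this illustration; a careful implementation would set the per-cell failure probabilities as in the analysis.

```python
import math, random

class SSparseRecovery:
    def __init__(self, n, s, delta=0.01):
        self.n, self.s = n, s
        self.tables = max(1, math.ceil(math.log2(s / delta)))   # O(log(s/delta)) tables
        self.width = 2 * s
        p = (1 << 61) - 1
        self.hashes = [self._make_hash(p) for _ in range(self.tables)]
        self.cells = [[OneSparseDecoder(n) for _ in range(self.width)]
                      for _ in range(self.tables)]

    def _make_hash(self, p):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % self.width

    def update(self, i, delta):
        for t, h in enumerate(self.hashes):
            self.cells[t][h(i)].update(i, delta)

    def recover(self):
        """Collect every coordinate that some cell's 1-decoder could extract."""
        out = {}
        for table in self.cells:
            for cell in table:
                res = cell.decode()
                if res is not None:
                    out[res[0]] = res[1]
        return out

r = SSparseRecovery(n=1000, s=8, delta=0.01)
for i, d in [(3, 1), (500, -2), (3, 4), (77, 7)]:
    r.update(i, d)
print(r.recover())   # all three non-zero coordinates, with high probability
```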
Another hash for the general case
The last step in $\ell_0$-sampling is to understand what to do in the general case. We will use hashing once again. Take an $O(\log \delta^{-1})$-independent hash function $h : [n] \rightarrow [2^k]$ for some $k$.
Let us say that an update $(i, \Delta)$ is $j$-interesting if $h(i) \bmod 2^j = 0$. In other words, the binary representation of $h(i)$ ends with $j$ zeros.
Run the algorithm for the $s$-sparse vector in parallel on $k$ levels. At level $j$ we take into account only the $j$-interesting updates. It is easy to see that the larger $j$ is, the smaller the chance (namely $2^{-j}$) that an update is taken into account.
Find the level with the largest $j$ at which at least some updates arrived, and try to recover the vector at that level. If the recovery succeeds, return a uniformly random recovered element.
There are several points that I want to briefly pay attention to.
First, how to choose $s$. In general, a vector may have more than $s$ non-zero positions, but with every increase of $j$ by one, the expected number of surviving non-zero positions drops exactly by a factor of 2. We can therefore choose the level at which this expectation is of order $s$. Then, by a Chernoff bound and the $O(\log \delta^{-1})$-independence of the hash function, the probability that the vector at this level is zero or has more than $s$ non-zero positions turns out to be exponentially small.
This determines the choice $s = \Theta(\log \delta^{-1})$, where $\delta$ is the permissible probability of failure.
Second, the $O(\log \delta^{-1})$-independence of the hash function implies that every position has the same probability of passing the filter. Since we can already recover $s$-sparse vectors, obtaining a uniform distribution over the recovered positions is trivial.
So we have learned how to pick a random non-zero position according to the uniform distribution.
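Here is a condensed sketch of the $\ell_0$-sampler built from the pieces above (again my own illustration). It keeps one `SSparseRecovery` per level and routes an update to level $j$ only if $h(i)$ ends in $j$ zero bits. Two simplifications to flag: the hash is a plain 2-independent one rather than the $O(\log \delta^{-1})$-independent function the analysis requires, and the sampling step simply takes the highest non-empty level instead of picking the level where the expectation is of order $s$.

```python
import random

class L0Sampler:
    def __init__(self, n, s=16, delta=0.01):
        self.levels = max(2, n.bit_length())          # roughly log2(n) levels
        p = (1 << 61) - 1
        a, b = random.randrange(1, p), random.randrange(p)
        self.h = lambda x: ((a * x + b) % p) % (1 << self.levels)
        self.sketches = [SSparseRecovery(n, s, delta) for _ in range(self.levels)]

    def update(self, i, delta):
        hv, j = self.h(i), 0
        while True:
            self.sketches[j].update(i, delta)         # level j sees i iff h(i) has >= j trailing zeros
            if hv & 1 or j + 1 == self.levels:
                break
            hv >>= 1
            j += 1

    def sample(self):
        """Return a (roughly) uniform non-zero position, or None if everything fails."""
        for j in reversed(range(self.levels)):        # sparsest levels first
            recovered = self.sketches[j].recover()
            if recovered:
                return random.choice(sorted(recovered))
        return None

sampler = L0Sampler(10**6)
sampler.update(5, 1); sampler.update(9, 3); sampler.update(5, -1)
print(sampler.sample())   # 9: the only remaining non-zero position
```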
Mix, but do not shake
It remains to understand how to put everything together. The graph has $n$ vertices. For each vertex, or rather for each row vector $A_v$ of the incidence matrix, we build $\log n$ sketches $\sigma_1(A_v), \sigma_2(A_v), \dots, \sigma_{\log n}(A_v)$ for $\ell_0$-sampling. At the $i$-th iteration of the algorithm we will use the sketches $\sigma_i$ for sampling.
When an edge $(u, v)$ is added, we add to all the sketches of the vertices $u$ and $v$ the corresponding $+1$ and $-1$ respectively; when it is deleted, the signs are flipped.
When the edges run out and we are asked about the components, we run the contraction algorithm. At the $i$-th iteration, using $\ell_0$-sampling from the sketch $\sigma_i(A_v)$, we find a neighbour for each vertex $v$. To contract an edge $(u, v)$, we add up all the sketches corresponding to $u$ and $v$. At each new vertex we keep the list of vertices that were merged into it.
That's it. At the end we simply go through the isolated vertices and reconstruct the answer from the merge history.
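Finally, a condensed driver showing how the whole construction fits together. This is my own schematic illustration, not the implementation from the GitHub link: `make_sketch` and the `L0Sketch` interface it is assumed to return (`update(col, delta)`, `merge(other)`, `sample()` yielding an edge column or `None`) are hypothetical; in particular, the sketches of different vertices must share randomness per level so that `merge` really corresponds to adding the underlying row vectors, which the classes above do not arrange. Per vertex we store $\log n$ sketches of polylogarithmic size, which is where the $O(n \log^c n)$ memory bound from the problem statement comes from.

```python
import math

def dynamic_connected_components(n, updates, make_sketch):
    """updates: iterable of (sign, u, v), sign = +1 to add the edge, -1 to delete it."""
    rounds = max(1, math.ceil(math.log2(n)))
    # log(n) independent l0-sampling sketches per vertex, over the edge-column space.
    sketches = [[make_sketch(level) for level in range(rounds)] for _ in range(n)]

    for sign, u, v in updates:
        u, v = min(u, v), max(u, v)
        for level in range(rounds):
            sketches[u][level].update((u, v), +sign)   # row A_u gets +1 / -1
            sketches[v][level].update((u, v), -sign)   # row A_v gets the opposite sign

    comp = {v: {v} for v in range(n)}      # super-vertex -> merged original vertices
    owner = list(range(n))                 # original vertex -> its super-vertex
    for level in range(rounds):
        merged_any = False
        for c in list(comp):
            if c not in comp:              # already absorbed earlier in this round
                continue
            edge = sketches[c][level].sample()
            if edge is None:               # isolated super-vertex (or sampling failed)
                continue
            a, b = owner[edge[0]], owner[edge[1]]
            if a == b:
                continue
            for l in range(rounds):        # contracting = adding the sketches of a and b
                sketches[a][l].merge(sketches[b][l])
            for x in comp[b]:
                owner[x] = a
            comp[a] |= comp[b]
            del comp[b]
            merged_any = True
        if not merged_any:
            break
    return list(comp.values())
```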
Who is to blame and what to do again
In fact, the topic of this post did not come up by chance. In January, Ilya Razenshtein (@ilyaraz), a PhD student at MIT, visited us at the St. Petersburg CS Club and spoke about algorithms for big data. There was a lot of interesting material (see the
course description ). In particular, Ilya managed to cover the first half of this algorithm. I decided to finish the job and present the whole result on Habr.
In general, if you are interested in the mathematics of computational processes, aka Theoretical Computer Science, come to our Academic University, to the
Computer Science track. They teach complexity theory, algorithms and discrete mathematics there. Real science starts from the very first semester. You can also go out and attend courses at the
CS Club and
CS Center . If you are not from St. Petersburg, there is a dormitory. A great chance to move to the Northern Capital.
Apply , get ready and come to us.
Sources