People meet, people quarrel, people add and remove friends in social networks. This post is about mathematics and algorithms, beautiful theory, love and hate in this unstable world. This post is about finding connected components in dynamic graphs.
The big world generates big data, and one day a big graph falls on our head — so big that its vertices fit in memory but its edges do not. On top of that, updates to the graph keep arriving: add this edge, delete that one. Each update is seen exactly once, for the first and last time. Under these conditions we need to find the connected components.
Depth-first or breadth-first search will not work here simply because the whole graph does not fit in memory. A disjoint-set union would help a lot if edges were only ever added. What to do in the general case?
Task. Given an undirected graph on n vertices. Initially the graph is empty. The algorithm receives a sequence of updates, which it can read in the given order exactly once. Each update is a command to add or delete an edge between a pair of vertices u and v. It is guaranteed that at any point in time no more edges are deleted between a pair of vertices than were added. After reading the sequence, the algorithm must output all connected components with high probability of success. It is allowed to use O(n log^c n) memory, where c is some constant.

The solution of the problem consists of three ingredients.
- Incidence matrix as a representation.
- Graph contraction as an algorithm.
- L0-sampling as an optimization.
You can see an implementation on GitHub: link.
Incidence Matrix as Representation
The first data structure for storing the graph will be extremely non-optimal. We take a signed incidence matrix of size n × n(n−1)/2, which will consist mostly of zeros. Each row of the matrix corresponds to a vertex, and each column to a possible edge. Let u < v. For a pair of vertices (u, v) connected by an edge e_uv, we set A[u][e_uv] = 1 and A[v][e_uv] = −1; otherwise the values are zero.
As an example, look at the graph in the picture below.
For it, the incidence matrix looks like this.
The naked eye can see that this representation has a serious drawback: its size is O(n³). We will optimize it, but later.
There is also an implicit advantage. If you take a set of vertices S and add up all the matrix row vectors corresponding to S, then the edges between the vertices of S cancel out, and only those connecting S with the rest of the graph remain.
For example, if you take the set of vertices highlighted in the picture and add up the corresponding vectors, the non-zero values end up exactly at the edges crossing the cut — those with one endpoint inside the set and the other outside.
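To make the cancellation concrete, here is a small demonstration on a hypothetical 4-vertex graph (the vertices and edges here are made up for illustration; they are not the graph from the pictures):

```python
from itertools import combinations

n = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
# one column per possible edge (u, v) with u < v
cols = {e: k for k, e in enumerate(combinations(range(n), 2))}

# A[u][col] = +1 and A[v][col] = -1 for every present edge (u, v), u < v
A = [[0] * len(cols) for _ in range(n)]
for u, v in edges:
    A[u][cols[(u, v)]] = 1
    A[v][cols[(u, v)]] = -1

# Sum the rows of S = {0, 1}: the internal edge (0, 1) cancels out,
# while the boundary edges (0, 2) and (1, 2) survive.
S = [0, 1]
summed = [sum(A[u][c] for u in S) for c in range(len(cols))]
boundary = [e for e, c in cols.items() if summed[c] != 0]
print(sorted(boundary))  # → [(0, 2), (1, 2)]
```

The internal edge contributes +1 from one row and −1 from the other, so it vanishes in the sum; that is the whole trick the algorithm rests on.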
Shrinking the graph as an algorithm
Let us recall what edge contraction is. We have two vertices u and v with an edge e between them. Out of u and v there may be other edges. Contracting the edge e is the procedure in which we merge the vertices u and v into a single new vertex, say w; we delete the edge e itself, and redirect all remaining edges incident to u and v to the new vertex w.
An interesting feature: in terms of the incidence matrix, to contract an edge it is enough to add the corresponding row vectors. The edge itself cancels out; only the edges going outside remain.
Now the algorithm. Take the graph and, for each non-isolated vertex, choose a neighbor. Contract the corresponding edges. Repeat the iteration O(log n) times.
Note that after each iteration the connected components of the new graph map one-to-one onto the components of the old one. We can even record which vertices were merged into each resulting vertex, in order to restore the answer later.
Note also that after each iteration, any connected component of at least two vertices shrinks by at least a factor of two. This is natural, since at least two vertices of the old component are merged into every vertex of the new one. So after O(log n) iterations only isolated vertices remain in the graph.
We go through all the isolated vertices and restore the answer from the merge history.
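Ignoring the memory constraint for a moment, the contraction iterations can be sketched in plain Python. This is only an in-memory illustration of the control flow (the function name and the DSU-based merge bookkeeping are my own choices, not from the original):

```python
def connected_components(n, edges):
    # Plain-memory sketch of the contraction iterations.
    # A small DSU records which vertices have been merged together.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    cur = {tuple(sorted(e)) for e in edges}
    while cur:
        # each non-isolated super-vertex chooses one neighbour...
        pick = {}
        for u, v in cur:
            pick.setdefault(u, v)
            pick.setdefault(v, u)
        # ...and the chosen edges are contracted
        for u, v in pick.items():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
        # rebuild the edge set between the new super-vertices;
        # edges inside a super-vertex (self-loops) disappear
        cur = {tuple(sorted((find(u), find(v))))
               for u, v in cur if find(u) != find(v)}
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return sorted(comps.values())

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
# → [[0, 1, 2], [3, 4], [5]]
```

Each pass shrinks every multi-vertex component by at least half, which is exactly why O(log n) iterations suffice.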
L0-sampling as optimization
Everything would be great, but the above algorithm runs in O(n³) time and memory. To optimize it, we will construct a sketch — a special data structure for a compact representation of the row vectors.

We need three properties from the sketch.

First, compactness. If we build a sketch for a vector of size O(n²), the sketch itself must have size O(log^c n).
Second, sampling. At each iteration of the algorithm we need to choose a neighbor, so we want a way to get the index of at least one non-zero element, if there is one.
Third, linearity. If we have built sketches s(a) and s(b) for two vectors a and b, there must be an efficient way to obtain the sketch s(a + b). This is what lets us contract edges.
We will use a solution to the L0-sampling problem.
Task. Given a zero vector a of dimension n. The algorithm receives a sequence of updates of the form (i, Δ): add Δ to a_i. Δ can be any positive or negative integer. The resulting vector may have non-zero values at some positions; denote the set of these positions by I. It is required to output any position from I equiprobably. All updates must be processed in one pass, using O(log^c n) memory. It is guaranteed that the maximum value in a fits into O(log n) bits.

1-sparse vector
To begin, we solve a simpler problem. Suppose we have a guarantee that the final vector contains exactly one non-zero position. We say such a vector is 1-sparse. We will maintain two variables: S = Σ a_i and W = Σ i·a_i. Maintaining them is simple: on each update (i, Δ) we add Δ to the first and i·Δ to the second.
Denote the desired position by j. If it is the only one, then S = a_j and W = j·a_j. To find the position, we compute j = W / S.
We can probabilistically check whether the vector is 1-sparse. For this we take a prime p, a random integer z from [0, p), and maintain the variable T = Σ a_i·z^i mod p. The vector passes the 1-sparsity test if W / S is an integer position and T = S·z^(W/S) mod p.
Obviously, if the vector really is 1-sparse, then T = S·z^(W/S) mod p and it passes the test. Otherwise, the probability of passing the test is at most n/p.
Why? In terms of polynomials, a vector passes the test if the polynomial q(x) = Σ a_i·x^i − S·x^(W/S) equals zero at the randomly selected point z. If the vector is not 1-sparse, then q is not identically zero, so if we passed the test, we guessed a root. The degree of the polynomial is at most n, which means it has at most n roots, so the probability of guessing one is at most n/p.
If we want to increase the accuracy of the check to an arbitrary failure probability δ, we need to compute the value T for O(log(1/δ)) independent random z.
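The whole 1-sparse decoder fits in a few lines. Here is a minimal sketch in Python; the class name is mine, and I use S for the running sum of values, W for the index-weighted sum, and T for the polynomial fingerprint of the 1-sparsity test:

```python
import random

class OneSparse:
    # Streaming decoder for a (presumably) 1-sparse vector with
    # positions 1..n, with a probabilistic 1-sparsity test.
    def __init__(self, n, p=2**61 - 1):   # p is a Mersenne prime > n
        self.n, self.p = n, p
        self.z = random.randrange(1, p)   # random evaluation point
        self.S = self.W = self.T = 0

    def update(self, i, delta):           # a_i += delta
        self.S += delta
        self.W += i * delta
        self.T = (self.T + delta * pow(self.z, i, self.p)) % self.p

    def decode(self):
        # Returns (position, value) if the 1-sparsity test passes,
        # otherwise None.
        if self.S == 0 or self.W % self.S != 0:
            return None
        i = self.W // self.S
        if not 1 <= i <= self.n:
            return None
        if self.T != self.S * pow(self.z, i, self.p) % self.p:
            return None
        return i, self.S

d = OneSparse(100)
d.update(7, 3)
d.update(7, 2)
print(d.decode())  # → (7, 5)
```

Note that updates may cancel each other: the decoder only sees S, W and T, so deletions cost nothing extra — this is the linearity we need later.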
s-sparse vector
Now let us try to solve the problem for an s-sparse vector, i.e., one containing at most s non-zero positions. We will need hashing and the previous method. The general idea is to recover the entire vector and then select one of its elements at random.
Take a random 2-independent hash function h mapping positions to table cells. This is a function that distributes any two distinct keys uniformly and independently.
More about hash functions
In algorithms there are two fundamentally different approaches to hash functions. The first relies on the data being random enough, so that our favorite hash function will mix it properly. The second assumes that an evil adversary can craft the data, and the only way to save the algorithm from excessive computation and other problems is to choose the hash function at random. Yes, sometimes it may work out badly, but we can measure exactly how badly. Here we are talking about the second approach.
The hash function is selected from a family of functions. For example, if we want to hash all the numbers from 0 to p − 1 for a prime p, we can take random a and b and define the hash function h(x) = (a·x + b) mod p. The functions for all possible a and b form the family of hash functions. Randomly selecting a hash function essentially means randomly selecting a and b, usually equiprobably from the set {0, 1, …, p − 1}.
The remarkable property of the example above is that any two distinct keys are distributed independently. Formally, for any distinct x and y and any, possibly equal, values α and β, the probability Pr[h(x) = α and h(y) = β] = 1/p². This property is called 2-independence. Sometimes the probability may be not exactly 1/p², but 1/p² + ε, where ε is some reasonably small value.
There is a generalization of 2-independence: k-independence. This is when the function distributes not 2 but k arbitrary keys independently. Polynomial hash functions with k random coefficients have this property.
k-independence has one unpleasant property: as k grows, the function takes up more memory. This does not depend on any particular implementation; it is simply an information-theoretic fact. Therefore, the less independence we can get away with, the easier life is.
Take a hash table of size 2s. In each cell of the table there will be an instance of the algorithm for a 1-sparse vector. When we receive an update (i, Δ), we forward it to the algorithm in cell h(i).
One can calculate that for a single cell, the probability of a collision along at least two non-zero coordinates is at most about 0.4.

Why? The probability that one particular other element does not fall into our cell is 1 − 1/(2s). The probability that none of the others do is (1 − 1/(2s))^(s−1). The probability that at least one falls in is 1 − (1 − 1/(2s))^(s−1). One may notice that this function is monotonically increasing in s, and in the limit it approaches 1 − 1/√e ≈ 0.4.
Suppose we want to recover all the coordinates with success probability 1 − δ, i.e., with failure probability δ. Take not one hash table, but O(log(s/δ)) of them at once. It is easy to see that the probability of failing to decode a particular coordinate becomes 0.4^O(log(s/δ)) = O(δ/s). The probability of failing to decode at least one of the s coordinates is then at most δ. If the decoding of the 1-sparse vectors also works with failure probability O(δ/s), then we have won.
The final algorithm is as follows. Take O(log(s/δ)) hash tables of size 2s. Each cell gets its own 1-sparse decoder with failure probability O(δ/(s·log(s/δ))).
Each update (i, Δ) is processed in every hash table separately by the algorithm in cell h_t(i), where h_t is that table's hash function.
Upon completion, we extract a coordinate from every successfully terminated 1-sparse decoder and merge them all into one list.
In total, at most O(s·log(s/δ)) 1-sparse decoders are affected. Therefore the total probability that at least one of them works incorrectly does not exceed O(δ). Likewise, the probability that at least one coordinate is not recovered does not exceed δ. Overall, the failure probability of the algorithm is O(δ).
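Putting the pieces together, here is an illustrative s-sparse recovery sketch: several hash tables whose cells are 1-sparse decoders. The (S, W, T) decoder is the one from the previous section; the hash family h(x) = ((a·x + b) mod q) mod width and the concrete table count are my own illustrative choices:

```python
import random

Q = 2**61 - 1  # prime modulus for both the fingerprint and the hashes

class Cell:
    # 1-sparse decoder: running sum S, index-weighted sum W,
    # polynomial fingerprint T at a random point z.
    def __init__(self):
        self.z = random.randrange(1, Q)
        self.S = self.W = self.T = 0

    def update(self, i, delta):
        self.S += delta
        self.W += i * delta
        self.T = (self.T + delta * pow(self.z, i, Q)) % Q

    def decode(self, n):
        if self.S == 0 or self.W % self.S != 0:
            return None
        i = self.W // self.S
        if not 1 <= i <= n or self.T != self.S * pow(self.z, i, Q) % Q:
            return None
        return i, self.S

class SSparse:
    def __init__(self, n, s, tables=10):
        self.n = n
        self.width = 2 * s
        # one 2-independent hash (a*x + b) mod Q mod width per table
        self.hashes = [(random.randrange(1, Q), random.randrange(Q))
                       for _ in range(tables)]
        self.cells = [[Cell() for _ in range(self.width)]
                      for _ in range(tables)]

    def update(self, i, delta):
        for (a, b), row in zip(self.hashes, self.cells):
            row[(a * i + b) % Q % self.width].update(i, delta)

    def recover(self):
        # union of the outputs of all successful 1-sparse decoders
        out = {}
        for row in self.cells:
            for cell in row:
                hit = cell.decode(self.n)
                if hit is not None:
                    out[hit[0]] = hit[1]
        return out

random.seed(7)  # deterministic demo
sk = SSparse(1000, 5)
for i, v in [(10, 3), (500, -2), (999, 7)]:
    sk.update(i, v)
print(sk.recover())
```

Every non-zero coordinate is recovered as soon as it lands alone in a cell of at least one table, which happens with overwhelming probability for this many tables.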
Another hash for the general case
The last step in L0-sampling is to figure out what to do in the general case. We will use hashing again. Take a k-independent hash function g mapping positions to bit strings, for some k.
Let us say that an update (i, Δ) is j-interesting if g(i) mod 2^j = 0. In other words, the binary representation of g(i) ends with at least j zeros.
Run the algorithm for the s-sparse vector in parallel at O(log n) levels. At level j we take into account only the j-interesting updates. It is easy to see that the larger j is, the less chance (namely, 2^(−j)) an update has of being taken into account.
Find the level with the largest j at which at least some updates arrived, and try to recover it. If the recovery succeeds, we return a randomly selected element.
There are several points that I want to briefly pay attention to.
First, how to choose s. In general, the vector at a level may have more than s non-zero positions, but with every unit increase of j the expected number of non-zero positions drops exactly by a factor of 2. One can choose a level at which the expectation lies between s/4 and s/2. Then, by a Chernoff bound and the k-independence of the hash function, the probability that the vector at this level is zero or has more than s non-zero positions turns out to be exponentially small. This determines the choice s = O(log(1/δ)), where δ is the permissible failure probability.
Second, the k-independence of the hash function implies that every position has an equal probability of passing the filter. Since we can already recover s-sparse vectors, obtaining a uniform distribution from there is trivial.
So we learned how to choose a random non-zero position according to a uniform distribution.
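As a quick sanity check of the level filter, the following toy experiment (with mock hash values instead of a real k-independent g) shows the expected halving of surviving positions per level:

```python
import random

# An update on position i is j-interesting when g(i) has at least
# j trailing zero bits, so each level keeps about half the positions
# of the previous one.
random.seed(1)
g = {i: random.getrandbits(32) for i in range(1, 10001)}  # mock hash

def level_count(j):
    # how many of the 10000 positions survive the level-j filter
    return sum(1 for v in g.values() if v % (1 << j) == 0)

for j in range(6):
    print(j, level_count(j))
```

The printed counts drop roughly as 10000, 5000, 2500, … — exactly the geometric decay the level structure relies on.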
Mix, but do not shake
It remains to understand how to combine everything. We know that the graph has n vertices. For each vertex — more precisely, for each row vector of the incidence matrix — we maintain O(log n) sketches for L0-sampling. At the i-th iteration of the contraction algorithm we will use the i-th batch of sketches for sampling; fresh sketches at every iteration keep the samples independent of the previous choices.

When adding an edge (u, v), we add to all sketches of the vertices u and v the corresponding +1 and −1 respectively.
When the edges run out and we are asked about the components, we run the contraction algorithm. At the i-th iteration, via L0-sampling from the i-th sketch, we find a neighbor for each vertex. To contract an edge (u, v), we add up all the corresponding sketches of u and v. At each new vertex we keep the list of vertices that were merged into it.
That's all. At the end we simply go through the isolated vertices and restore the answer from the merge history.
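To see how the pieces fit together, here is a plain-memory mock of the whole pipeline, in which each "sketch" is replaced by the exact row vector (a dict from edge to its coefficient). Linearity then becomes dict addition and sampling becomes picking any non-zero entry; the class and method names are my own illustration, not from the source:

```python
from collections import defaultdict

class DynamicConnectivity:
    def __init__(self, n):
        self.n = n
        self.row = [defaultdict(int) for _ in range(n)]  # exact "sketches"

    def update(self, u, v, sign):
        # sign = +1 adds the edge (u, v), sign = -1 deletes it
        u, v = min(u, v), max(u, v)
        self.row[u][(u, v)] += sign
        self.row[v][(u, v)] -= sign

    def components(self):
        head = list(range(self.n))              # super-vertex of a vertex
        members = {v: [v] for v in range(self.n)}   # merge history
        vec = {v: dict(self.row[v]) for v in range(self.n)}
        while True:
            merged = False
            for s in list(vec):
                if s not in vec:
                    continue                    # already merged away
                live = [e for e, c in vec[s].items() if c != 0]
                if not live:
                    continue                    # no boundary edges left
                a, b = live[0]                  # "sample" a boundary edge
                other = head[a] if head[a] != s else head[b]
                # contract: add the row vectors (internal edges cancel)
                for e, c in vec[other].items():
                    vec[s][e] = vec[s].get(e, 0) + c
                for m in members[other]:
                    head[m] = s
                members[s] += members[other]
                del vec[other], members[other]
                merged = True
            if not merged:
                break
        return sorted(sorted(m) for m in members.values())

dc = DynamicConnectivity(6)
for u, v in [(0, 1), (1, 2), (3, 4), (2, 0)]:
    dc.update(u, v, +1)
dc.update(2, 0, -1)                 # the deletion simply cancels out
print(dc.components())              # → [[0, 1, 2], [3, 4], [5]]
```

The real algorithm differs only in that `vec` holds O(log n) compact L0-sampling sketches per vertex instead of the exact vectors, which is what brings the memory down to O(n·polylog n).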
Who is to blame and what to do again
In fact, the topic of this post did not arise by chance. In January, Ilya Razenshtein (@ilyaraz), a graduate student at MIT, came to us at the St. Petersburg CS Club and talked about algorithms for big data. There was a lot of interesting material (see the course description). In particular, Ilya managed to cover the first half of this algorithm. I decided to finish the job and present the whole result on Habr.
In general, if you are interested in the mathematics of computational processes, aka Theoretical Computer Science, come to our Academic University in the Computer Science track. There they teach complexity, algorithms, and discrete mathematics. Real science starts from the first semester. You can also go out and listen to courses at the CS Club and the CS Center. If you are not from St. Petersburg, there is a dormitory. A great chance to move to the Northern Capital. Apply, prepare, and come to us.