Once again about finding the greatest concentration in a cloud of points

Once again, I came across a task - to find the place of their greatest concentration in a cloud of points. This time the situation was this:

There is a certain amount (it can be considered that no more than 16 million) of measurements of a parameter set The number of parameters in the set is from 2 to 5.
measurement of parameters can be relatively successful - then their result will be close to the true (the parameters and type of distribution are unknown), or not successful - then the result will be random (again with unknown distribution parameters). It is impossible to determine by single measurement whether it was successful.
We can assume that the point of condensation exists. If there are several of them with similar quality (which is not formally defined), any one can be issued.
the answer must be given in one pass on the source data: to recalculate them or to save them entirely - expensive.
And, as usual, I want the algorithm to look simpler and work faster.

For simplicity, we can assume that the measured values of the parameters are real numbers from 0 to 1. I would like to get the result with a resolution of 3 decimal places for the case of 5 parameters, and 4 or more - for 4 or less parameters.

It is clear that a large number of reference points make the methods based on the processing of pairs of points questionable. Even an effective search for the distance to the nearest point will require considerable effort to implement.

The construction of histograms is also not easy: even the desired cell size is not known in advance. It is possible, for example, that all measurements will give results in a ball with a diameter of 0.1, and successful ones will be localized in a region of size 0.003. And it will be necessary to give the position of this area (in the N-dimensional space!)
')
It would be possible to construct histograms of the distribution of individual parameters, hoping that the condensation of N-dimensional points can be restored according to the positions of the condensations of the projections of points to individual coordinates, but it’s better to leave this option as a spare: parasitic areas of thickening.

The variant that seemed to me the most promising is the use of kd tree . If we build a binary tree, each node of which corresponds to dividing a region of space into two parts by one of the coordinates (the coordinates are iterated over the cycle - x ₁ , x ₂ , ..., x _k , x ₁ , x ₂ , ...), and find the number of points caught in each of the areas you can do so.

Let the points be N. Choose a value of K less than N, for example, K = [sqrt (N)] or K = [N ^2/3 ]. Let us find the region of the smallest volume (that is, lying at the greatest depth of the tree) containing at least K points. If there are several such areas, take the one with the most points. After that, we will begin to divide it in half (go down the tree further), each time choosing half, in which there are more points. When we have reached the sheet (for example, a separate point), we give it out as an answer.

You can find examples in which this algorithm misses the main condensation area and chooses some local anomaly instead, or as a response gives a point far from the center of condensation, but you can expect that in most cases its answers will be more or less adequate. Unfortunately, building a kd tree is expensive, and it takes a lot of memory. I am ready to allocate 8 bytes per point, but it would be undesirable to spend more.

To save time and memory, I decided to build a kd tree implicitly.

Suppose we have a point (x ₁ , x ₂ , ..., x _k ). We write its coordinates in binary
x ₁ = 0.x ₁₁ x ₁₂ x ₁₃ ...
x ₂ = 0.x ₂₁ x ₂₂ x ₂₃ ...
x ₃ = 0.x ₃₁ x ₃₂ x ₃₃ ...
...

And let's build a 64-bit integer
P = x ₁₁ x ₂₁ ... x _k1 x ₁₂ ... x _k2 ...

This number represents the path through the tree to a depth of 64 levels. In addition, it allows you to restore the coordinates of the point with the required accuracy.

If we sort the codes P for all points in ascending order, we get points in the tree traversal order. For any two points P _a and P _b, it is easy to find the deepest node, the common descendants of which they are: it is enough to find the leading nonzero bit in the number P _a ^ P _b . And all the necessary fragments of the problem on such a representation are solved in several lines:

Search for the level at which the smallest point is located containing at least K points:

ulong mask=ulong.MaxValue; K--; for(int i=K;i<m_np;i++) mask=Math.Min(mask,m_Arr[i]^m_Arr[iK]);

Here, m_Arr is an array of codes, m_np is the number of filled elements in it, and mask is a number, the most significant bit of which defines the desired level.

Search for a region at the found level containing the maximum number of points (we know that there are at least K):

  mask=InvMask(mask); int ms=0,me=0; for(int i=0;i<m_np-K;i++){ ulong a=m_Arr[i]; int h=i+K; if(((a^m_Arr[h])&mask)==0){ while(h<m_np-1 && ((a^m_Arr[h+1])&mask)==0) h++; K=hi; ms=i; me=i=h; } }

Here InvMask (mask) is the calculation of a mask containing 1 in all digits older than the most significant non-zero bit of the mask. Calculates ms - the beginning of the desired area and me - its end.

Search for a heavier descendant:

  int h=(ms+me)/2; ulong samp=m_Arr[h]; ulong cb=samp^m_Arr[ms],ce=samp^m_Arr[me]; if(cb>ce) { ms=h; ce=InvMask(ce); while(ms>0 && ((m_Arr[ms-1]^samp)&ce)==0) ms--; } else { me=h; cb=InvMask(cb); while(me<m_np-1 && ((m_Arr[me+1]^samp)&cb)==0) me++; }

Here the descent is not necessarily one level.
Thus, the task of searching for thickening is solved.

The whole program was packed in 100 lines, the source code (in C #) can be found here .

The real problem has a slightly more complicated wording. The range of change of parameters is not known in advance, in addition, the measurements in the sample may be more than the planned 16 million, and “good” measurements may go closer to the end. But small modifications of the algorithm can cope with these problems. For example, if a place in the array is over, and the points continue to flow, we can sort and thin out the array of already typed codes - this will not affect the position and quality of the thickening.

In conclusion, there are several examples of work (two parameters, 10 ^ 5 and 10 ^ 7 points - for one real and several synthetic examples). It can be seen that the algorithm still needs to be improved (to clarify the position found), and that for the real case it is too prudent, simpler solutions would suffice. But I want to be a little safe.

Source: https://habr.com/ru/post/137652/

All Articles

Once again about finding the greatest concentration in a cloud of points

More articles: