
Projections? No thanks


Under the cut is a short note on applying a Z-curve-based spatial index to point data located on a sphere, together with a benchmark in PostgreSQL and a comparison with the similar (but completely different)
index based on the R-tree.

Continuing the topic (1, 2, 3, 4, 5, 6).
Actually, we return to the very beginning: the idea of indexing geographic coordinates by placing them on the surface of a sphere. Straightforward indexing of the latitude/longitude pair leads to distortions far from the equator, and working with projections is not universal. The idea of translating geographic coordinates into three-dimensional space therefore looks quite elegant.
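The mapping itself is simple; here is a minimal Python sketch of the idea (an illustration only, not the code that PGSphere or the z-curve extension actually uses):

```python
import math

def lonlat_to_xyz(lon_deg, lat_deg):
    """Place a (longitude, latitude) pair on the unit sphere."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))
```

Points that are close on the sphere stay close in (x, y, z), without the polar distortions of a raw lat/lon index.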

By itself this idea is not new; the PostgreSQL extension PGSphere, for example, works the same way, using a 3-dimensional R-tree for indexing. It is with PGSphere that we will compare.

Data preparation


PGSphere



ZORDER



Test preparation


PGSphere


For testing, you need the following awk script:

 #--- gentest.awk -------
 BEGIN{
     pi = 3.1415926535897932;
     degra = pi/180.0;
     rad = 180.0/pi;
     Grad = 1000000.;
 }
 {
     x = $1; y = $2; z = $3;
     r3 = sqrt(x*x + y*y + z*z);
     x *= Grad / r3;
     y *= Grad / r3;
     z *= Grad / r3;
     r2 = sqrt(x*x + y*y);
     lat = atan2(z, r2) * rad;
     lon = 180. + atan2(y, x) * rad;
     # EXPLAIN (ANALYZE,BUFFERS)
     printf ("select count(1) from spoint_data where sp @'<(%14.10fd,%14.10fd),.316d>'::scircle;\n", lon, lat);
 }

This script is completely symmetrical to the one with which we prepared the data. Note the number .316: it is the angular radius (in degrees) of the circle, centred at the computed random point, within which we look for data.

Preparing a series of requests is done like this:

 ./random 1000000 100 1023 | gawk -f gentest.awk >tests1.sql 

Here 100 is the size of the test series, 1023 is the seed of the randomizer.
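For illustration, here is a Python transcription of what gentest.awk does with each input line (it mirrors the awk script above and is not part of the test harness; the table and operator come from the script itself):

```python
import math

Grad = 1_000_000.0  # scale used by the data-preparation scripts

def xyz_to_query(x, y, z):
    """Normalise (x, y, z) to the sphere of radius Grad and emit a
    PGSphere search query with a 0.316-degree scircle, as gentest.awk does."""
    r3 = math.sqrt(x*x + y*y + z*z)
    x, y, z = Grad / r3 * x, Grad / r3 * y, Grad / r3 * z
    r2 = math.sqrt(x*x + y*y)
    lat = math.degrees(math.atan2(z, r2))
    lon = 180.0 + math.degrees(math.atan2(y, x))
    return ("select count(1) from spoint_data "
            f"where sp @'<({lon:.10f}d,{lat:.10f}d),.316d>'::scircle;")
```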

ZCURVE


For testing, you also need an awk script:

 #--- gentest2.awk -------
 BEGIN{
     pi = 3.1415926535897932;
     degra = pi/180.0;
     rad = 180.0/pi;
     Grad = 1000000.;
 }
 {
     x = $1; y = $2; z = $3;
     r3 = sqrt(x*x + y*y + z*z);
     x *= Grad / r3;
     y *= Grad / r3;
     z *= Grad / r3;
     ix = int(x + 0.5 + Grad);
     iy = int(y + 0.5 + Grad);
     iz = int(z + 0.5 + Grad);
     # EXPLAIN (ANALYZE,BUFFERS)
     lrad = int(0.5 + Grad * sin(.316 * degra));
     print "select count(1) from zcurve_3d_lookup_tidonly('zcurve_test_points_3d', "ix-lrad","iy-lrad","iz-lrad","ix+lrad","iy+lrad","iz+lrad");";
 }

This script is completely symmetrical to the one with which we prepared the data. Again note the number .316: it determines (via Grad * sin(.316°)) half the side of the cube, centred at the computed random point, within which we look for data.

Preparing a series of requests is done like this:

 ./random 1000000 100 1023 | gawk -f gentest2.awk >tests2.sql 

Here 100 is the size of the test series and 1023 is the seed of the randomizer; it should coincide with the one used for PGSphere.
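The integer cube bounds computed by gentest2.awk can be sketched in Python as follows (a mirror of the awk logic above, for illustration only):

```python
import math

Grad = 1_000_000.0  # sphere radius in integer grid units

def cube_bounds(x, y, z, radius_deg=0.316):
    """Shift the normalised point into the positive integer grid and build
    the axis-aligned search cube for the zcurve_3d_lookup call."""
    r3 = math.sqrt(x*x + y*y + z*z)
    ix = int(Grad / r3 * x + 0.5 + Grad)
    iy = int(Grad / r3 * y + 0.5 + Grad)
    iz = int(Grad / r3 * z + 0.5 + Grad)
    # half the side of the cube: the chord matching the angular radius
    lrad = int(0.5 + Grad * math.sin(math.radians(radius_deg)))
    return (ix - lrad, iy - lrad, iz - lrad,
            ix + lrad, iy + lrad, iz + lrad)
```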

Benchmark


As before, the measurements were taken on a modest virtual machine with two cores and 4 GB of RAM, so the absolute times carry little meaning, but the numbers of pages read can still be trusted.

The times shown are those of second runs, on a warmed-up server and virtual machine. The numbers of buffers read were obtained on a freshly started server.
Radius   AVG NPoints        Nreq    Type    Time (ms)  Reads    Hits
.01°     1.17               10,000  zcurve  .37        1.4397   9.5647
         0.7631 (0.7615)            rtree   .46        2.1165   3.087
.0316°   11.6               10,000  zcurve  .39        2.0466   20.9707
         7.6392 (7.6045)            rtree   .67        3.0944   2.7769
.1°      115.22             1,000   zcurve  .44        4.4184   82.8572
         76.193 (76.15)             rtree   2.75 *     6.073    2.469
.316°    1145.3             1,000   zcurve  .59        15.2719  401.791
         758.37 (760.45)            rtree   18.3 *     21.706   1.62
1°       11310              100     zcurve  7.2        74.9544  1651.45
         7602 (7615)                rtree   94.5 *     132.15   1.12
Where
Radius - the size of the search area in degrees
NPoints - the average number of points in the result; in parentheses, the theoretically expected number
(100,000,000 points over an area of 41,252.96 square degrees gives ~2424 points per square degree)
Nreq - the number of queries in the series
Type - 'zcurve' is the index described here, 'rtree' is PGSphere
Time (ms) - the average query time
Reads - the average number of reads per query
Hits - the average number of buffer hits per query

* at some point the performance of the R-tree starts to degrade
abruptly; this happens because this tree reads far more
pages and its working set no longer fits in the cache (apparently).

Note that zcurve finds more data, which is logical: it searches inside a cube, whereas PGSphere searches inside a sphere. Post-filtering in the spirit of the haversine formula is therefore required. But here we compare only the performance of the indexes.
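Such a post-filter could look like this (a sketch with hypothetical helper names; the article itself stops at the raw index comparison):

```python
import math

def haversine_deg(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points, in degrees of arc."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def post_filter(points, center, radius_deg=0.316):
    """Drop cube hits that fall outside the spherical search radius."""
    clon, clat = center
    return [(lon, lat) for lon, lat in points
            if haversine_deg(lon, lat, clon, clat) <= radius_deg]
```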

Let's try to evaluate the difference. In general, finding the area of the piece of a sphere that falls inside a cube is a nontrivial task. Let us attempt an estimate.
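As a rough check, the cube/circle ratio can be read straight off the table; the flat-geometry baseline below is elementary geometry added for comparison, not part of the original derivation:

```python
import math

# Average hit counts from the benchmark table: (zcurve cube, rtree circle)
pairs = [(1.17, 0.7631), (11.6, 7.6392), (115.22, 76.193),
         (1145.3, 758.37), (11310, 7602)]

# Observed cube/circle ratio: roughly 1.5 at every radius
ratios = [cube / circle for cube, circle in pairs]

# Naive flat-plane baseline: a square of side 2r versus a circle of
# radius r gives 4/pi ~ 1.27; the observed ratio is larger because the
# sphere's surface generally cuts the cube in an oblique (hexagonal)
# cross-section rather than a square one.
flat_ratio = 4 / math.pi

print([round(r, 2) for r in ratios], round(flat_ratio, 2))
```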

Source: https://habr.com/ru/post/338088/

