Below is a short note on applying a spatial index based on a z-curve to point data located on a sphere, together with a benchmark for PostgreSQL and a comparison with the same (but completely different)
index based on an R-tree.
In development of the topic (1, 2, 3, 4, 5, 6).
In fact, we return to the very beginning: the idea of indexing geographic coordinates by placing them on the surface of a sphere. The usual indexing of a latitude/longitude pair leads to distortions far from the equator, and working with projections is not universal. Therefore, the idea of translating geographic coordinates into three-dimensional space looks quite elegant.
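As a sketch of this idea (an illustration of ours, not code from either extension), here is how a (lon, lat) pair maps to a point on the unit sphere in 3D; the helper name `lonlat_to_xyz` is hypothetical:

```python
import math

def lonlat_to_xyz(lon_deg, lat_deg, r=1.0):
    """Map geographic coordinates onto a sphere of radius r in 3D."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

# Near the poles there is no longitude compression anymore:
# equal angular distances map to equal chord distances in 3D.
```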
The idea itself is not new; the PostgreSQL extension PGSphere, for example, works in a similar way, using a 3-dimensional R-tree for indexing. It is with PGSphere that we will compare.
Data preparation
PGSphere
- First you have to download, build, and install the extension (the author used the current version 1.1.5)
gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config
sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install
- Next, load it (psql)
CREATE EXTENSION pg_sphere;
- Create a table for test data
CREATE TABLE spoint_data (sp spoint);
- We need a source of random data.
The first parameter of the program is the radius, the second is the number of results.
The only subtlety: the program generates points uniformly distributed inside a ball of the given radius; projecting them onto the surface then yields a uniform distribution on the sphere, which would not be obtained otherwise.
- The random data is passed through an awk script that turns it into geo-coordinates
#--- gendata.awk ------
BEGIN {
    pi = 3.1415926535897932;
    degra = pi / 180.0;
    rad = 180.0 / pi;
    Grad = 1000000.;
}
{
    x = $1; y = $2; z = $3;
    r3 = sqrt(x*x + y*y + z*z);
    x *= Grad / r3;
    y *= Grad / r3;
    z *= Grad / r3;
    r2 = sqrt(x*x + y*y);
    lat = atan2(z, r2) * rad;
    lon = 180. + atan2(y, x) * rad;
    printf ("(%14.10fd, %14.10fd)\n", lon, lat);
}
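For readers who prefer something other than awk, here is a hedged Python sketch of the same pipeline; `random_in_ball` is our stand-in for what the ./random program is assumed to do (uniform sampling inside a ball), and `to_lonlat` mirrors the transform in gendata.awk:

```python
import math, random

def random_in_ball(radius):
    """Rejection-sample a point uniformly inside a ball of the given radius
    (an assumption about what ./random does)."""
    while True:
        p = [random.uniform(-radius, radius) for _ in range(3)]
        if p[0]**2 + p[1]**2 + p[2]**2 <= radius**2:
            return p

def to_lonlat(x, y, z):
    """Project the point onto the sphere and convert to (lon, lat) degrees,
    as gendata.awk does."""
    r2 = math.sqrt(x*x + y*y)
    lat = math.degrees(math.atan2(z, r2))
    lon = 180.0 + math.degrees(math.atan2(y, x))
    return lon, lat
```

Projecting uniform-in-ball points radially onto the surface is what makes the resulting directions, and hence the (lon, lat) pairs, uniform on the sphere.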
- The actual creation of the data; the radius does not matter here, what matters is that both pgsphere and zcurve receive the same data. Sorting is highly desirable, as it speeds up indexing.
./random 1000000 100000000 | gawk -f gendata.awk | sort > pgsphere.txt
- Load the data into our table
COPY spoint_data (sp) FROM '/home/.../pgsphere.txt';
- Indexing
CREATE INDEX sp_idx ON spoint_data USING gist (sp);
ZORDER
- First you have to download, build, and install the extension
gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config
sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install
- Create a table for test data
create table test_points_3d (x integer,y integer, z integer);
- We need the same source of random data.
- The random data is passed through an awk script that places it inside a cube with a side of 2,000,000
#--- gendata2.awk ------
BEGIN {
    pi = 3.1415926535897932;
    degra = pi / 180.0;
    rad = 180.0 / pi;
    Grad = 1000000.;
}
{
    x = $1; y = $2; z = $3;
    r3 = sqrt(x*x + y*y + z*z);
    x *= Grad / r3;
    y *= Grad / r3;
    z *= Grad / r3;
    ix = int(x + 0.5 + Grad);
    iy = int(y + 0.5 + Grad);
    iz = int(z + 0.5 + Grad);
    print ix"\t"iy"\t"iz;
}
- The actual creation of the data; here the radius does matter. Sorting is optional.
./random 1000000 100000000 | gawk -f gendata2.awk > zcurve.txt
- Load the data into our table
COPY test_points_3d FROM '/home/.../zcurve.txt';
- Indexing
create index zcurve_test_points_3d on test_points_3d(zcurve_num_from_xyz(x,y,z));
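The key produced by zcurve_num_from_xyz is, presumably, a Morton code: the bits of the three coordinates interleaved into one integer (coordinates up to 2,000,000 need 21 bits each, so 63 bits in total, which fits a bigint). A minimal sketch of such interleaving, not the extension's actual code:

```python
def zcurve_num_from_xyz(x, y, z, bits=21):
    """Interleave the low `bits` bits of x, y, z into a single Morton key:
    bit i of x goes to position 3*i, of y to 3*i+1, of z to 3*i+2.
    The actual extension function may differ in bit order and details."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i) \
             | ((y >> i) & 1) << (3 * i + 1) \
             | ((z >> i) & 1) << (3 * i + 2)
    return key
```

Points that are close in 3D tend to get numerically close keys, which is what lets a B-tree over this key serve range queries on cubes.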
Test preparation
PGSphere
For testing, you need this awk script
#--- gentest.awk -------
BEGIN {
    pi = 3.1415926535897932;
    degra = pi / 180.0;
    rad = 180.0 / pi;
    Grad = 1000000.;
}
{
    x = $1; y = $2; z = $3;
    r3 = sqrt(x*x + y*y + z*z);
    x *= Grad / r3;
    y *= Grad / r3;
    z *= Grad / r3;
    r2 = sqrt(x*x + y*y);
    lat = atan2(z, r2) * rad;
    lon = 180. + atan2(y, x) * rad;
    # EXPLAIN (ANALYZE,BUFFERS)
    printf ("select count(1) from spoint_data where sp @'<(%14.10fd,%14.10fd),.316d>'::scircle;\n", lon, lat);
}
This script is completely symmetrical to the one used to prepare the data. Note the number .316: it is the radius (in degrees) of the circle centered at the computed random point within which we search for data.
Preparing a series of requests is done like this:
./random 1000000 100 1023 | gawk -f gentest.awk >tests1.sql
Here 100 is the size of the test series, and 1023 is the seed of the random number generator.
ZCURVE
For testing, you also need an awk script
#--- gentest2.awk -------
BEGIN {
    pi = 3.1415926535897932;
    degra = pi / 180.0;
    rad = 180.0 / pi;
    Grad = 1000000.;
}
{
    x = $1; y = $2; z = $3;
    r3 = sqrt(x*x + y*y + z*z);
    x *= Grad / r3;
    y *= Grad / r3;
    z *= Grad / r3;
    ix = int(x + 0.5 + Grad);
    iy = int(y + 0.5 + Grad);
    iz = int(z + 0.5 + Grad);
    # EXPLAIN (ANALYZE,BUFFERS)
    lrad = int(0.5 + Grad * sin(.316 * degra));
    print "select count(1) from zcurve_3d_lookup_tidonly('zcurve_test_points_3d', "ix-lrad","iy-lrad","iz-lrad","ix+lrad","iy+lrad","iz+lrad");";
}
This script is completely symmetrical to the data-preparation one. Again, pay attention to the number .316: from it, half the side of the search cube centered at the computed random point is derived.
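A quick sanity check of the lrad value computed by gentest2.awk (a Python restatement of the same formula; for such a small angle, sin of the angle is practically the angle itself in radians):

```python
import math

Grad = 1000000.0   # the coordinate scaling used by the awk scripts
angle = 0.316      # the search radius in degrees

# half the side of the search cube, in integer grid units,
# matching gentest2.awk's lrad = int(0.5 + Grad * sin(.316 * degra))
lrad = int(0.5 + Grad * math.sin(math.radians(angle)))
print(lrad)
```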
Preparing a series of requests is done like this:
./random 1000000 100 1023 | gawk -f gentest2.awk >tests2.sql
Here 100 is the size of the test series, and 1023 is the seed of the random number generator; it is better if it coincides with the one used for pgsphere.
Benchmark
As before, the measurements were carried out on a modest virtual machine with two cores and 4 GB of RAM, so the times have no absolute value, but the numbers of pages read can still be trusted.
The times are shown for second runs, on a warmed-up server and virtual machine. The numbers of buffers read were measured on a freshly started server.
| Radius | Type | AVG NPoints | Nreq | Time (ms) | Reads | Hits |
|---|---|---|---|---|---|---|
| .01° | zcurve | 1.17 | 10,000 | .37 | 1.4397 | 9.5647 |
| .01° | rtree | 0.7631 (0.7615) | 10,000 | .46 | 2.1165 | 3.087 |
| .0316° | zcurve | 11.6 | 10,000 | .39 | 2.0466 | 20.9707 |
| .0316° | rtree | 7.6392 (7.6045) | 10,000 | .67 | 3.0944 | 2.7769 |
| .1° | zcurve | 115.22 | 1,000 | .44 | 4.4184 | 82.8572 |
| .1° | rtree | 76.193 (76.15) | 1,000 | 2.75 * | 6.073 | 2.469 |
| .316° | zcurve | 1145.3 | 1,000 | .59 | 15.2719 | 401.791 |
| .316° | rtree | 758.37 (760.45) | 1,000 | 18.3 * | 21.706 | 1.62 |
| 1° | zcurve | 11310 | 100 | 7.2 | 74.9544 | 1651.45 |
| 1° | rtree | 7602 (7615) | 100 | 94.5 * | 132.15 | 1.12 |
Where
- Radius - the size of the search area in degrees
- AVG NPoints - the average number of points in the output; in parentheses is the theoretically expected number (the sphere has an area of 41252.96 square degrees; with 100,000,000 points that gives ~2424 points per square degree)
- Nreq - the number of queries in the series
- Type - 'zcurve' is our index, 'rtree' is PGSphere
- Time (ms) - the average query time
- Reads - the average number of reads per query
- Hits - the average number of buffer cache hits per query
* at some point the performance of the R-tree starts to degrade abruptly; this is apparently because this tree reads far more pages and its working set ceases to fit in the cache.
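The theoretically expected counts in the legend can be checked directly: the area of a small spherical cap of angular radius r is approximately pi * r^2 square degrees, and the expected count is that area times the point density. A quick Python check, whose results match the parenthesized numbers in the table:

```python
import math

total_points = 100_000_000
sphere_area_sq_deg = 41252.96                 # 4 * pi * (180/pi)^2
density = total_points / sphere_area_sq_deg   # ~2424 points per square degree

def expected_npoints(radius_deg):
    """Small-cap approximation: cap area ~ pi * r^2 square degrees."""
    return math.pi * radius_deg**2 * density

for r in (.01, .0316, .1, .316, 1.0):
    print(r, round(expected_npoints(r), 4))
```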
Note that zcurve finds more data, which is logical: it searches inside a cube, not a sphere as PGSphere does. Therefore, post-filtering, for example with the HAVERSINE formula, is required. But here we compare only the performance of the indexes.
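Such a post-filter could look as follows; `haversine_deg` and `post_filter` are illustrative helpers of ours, not part of either extension:

```python
import math

def haversine_deg(lon1, lat1, lon2, lat2):
    """Angular distance in degrees between two points on a sphere,
    computed with the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2)**2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2)**2
    return math.degrees(2 * math.asin(math.sqrt(a)))

def post_filter(rows, lon, lat, radius=0.316):
    """Keep only (lon, lat) rows that really fall inside the search circle."""
    return [r for r in rows if haversine_deg(r[0], r[1], lon, lat) <= radius]
```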
Let's try to evaluate the difference. In general, the task of finding the area of the piece of a sphere that falls inside a cube is nontrivial. Let's attempt an estimate.
- Suppose that our cube, on average, cuts out of the sphere the same area as a sphere of equal volume would
- The volume of the unit sphere is 4/3 * 3.14 ≈ 4.19
- The volume of the cube with side 2 is 8
- Then the cube root of 8 / 4.19 ≈ 1.24 is the ratio of the radius of the imaginary sphere to the real one
- The ratio of their surface areas is 1.24 * 1.24 ≈ 1.54
- From the experimental data we have 1.17 / 0.7631 = 1.5332
Bingo!
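The arithmetic of this estimate is easy to verify (a Python restatement of the steps above):

```python
# volume of the unit sphere
v_sphere = 4.0 / 3.0 * 3.14159265        # ~4.19
# volume of the enclosing cube with side 2
v_cube = 2.0**3                          # 8
# radius ratio of the equal-volume imaginary sphere to the real one
r_ratio = (v_cube / v_sphere) ** (1.0 / 3.0)   # ~1.24
# surface areas scale as the square of the radius
area_ratio = r_ratio ** 2                       # ~1.54
# measured ratio of output sizes from the table (radius .01°)
measured = 1.17 / 0.7631                        # ~1.533
print(round(area_ratio, 3), round(measured, 3))
```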