
Projections? No thanks


Under the cut is a short note on applying a Z-curve-based spatial index to point data located on a sphere, together with a benchmark in PostgreSQL and a comparison with the similar (but completely different)
index based on the R-tree.

Continuing the topic (1, 2, 3, 4, 5, 6).
Actually, we return to the very beginning: the idea of indexing geographic coordinates by placing them on the surface of a sphere. Straightforward indexing of the latitude/longitude pair leads to distortions far from the equator, and working with projections is not universal. The idea of translating geographic coordinates into three-dimensional space therefore looks quite elegant.
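The mapping itself is simple; here is a minimal Python sketch of the idea (an illustration only, not the code that PGSphere or the z-curve extension actually uses):

```python
import math

def lonlat_to_xyz(lon_deg, lat_deg):
    """Place a (longitude, latitude) pair on the unit sphere."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))
```

Points that are close on the sphere stay close in (x, y, z), without the polar distortions of a raw lat/lon index.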

By itself this idea is not new; the PostgreSQL extension PGSphere, for example, works the same way, using a 3-dimensional R-tree for indexing. It is with PGSphere that we will compare.

Data preparation


PGSphere



ZORDER



Test preparation


PGSphere


For testing, you need the following awk script:

 #--- gentest.awk -------
 BEGIN{
     pi = 3.1415926535897932;
     degra = pi/180.0;
     rad = 180.0/pi;
     Grad = 1000000.;
 }
 {
     x = $1; y = $2; z = $3;
     r3 = sqrt(x*x + y*y + z*z);
     x *= Grad / r3;
     y *= Grad / r3;
     z *= Grad / r3;
     r2 = sqrt(x*x + y*y);
     lat = atan2(z, r2) * rad;
     lon = 180. + atan2(y, x) * rad;
     # EXPLAIN (ANALYZE,BUFFERS)
     printf ("select count(1) from spoint_data where sp @'<(%14.10fd,%14.10fd),.316d>'::scircle;\n", lon, lat);
 }

This script is completely symmetrical to the one with which we prepared the data. Note the number .316: it is the angular radius (in degrees) of the circle, centred at the computed random point, within which we look for data.

Preparing a series of requests is done like this:

 ./random 1000000 100 1023 | gawk -f gentest.awk >tests1.sql 

Here 100 is the size of the test series, 1023 is the seed of the randomizer.
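For illustration, here is a Python transcription of what gentest.awk does with each input line (it mirrors the awk script above and is not part of the test harness; the table and operator come from the script itself):

```python
import math

Grad = 1_000_000.0  # scale used by the data-preparation scripts

def xyz_to_query(x, y, z):
    """Normalise (x, y, z) to the sphere of radius Grad and emit a
    PGSphere search query with a 0.316-degree scircle, as gentest.awk does."""
    r3 = math.sqrt(x*x + y*y + z*z)
    x, y, z = Grad / r3 * x, Grad / r3 * y, Grad / r3 * z
    r2 = math.sqrt(x*x + y*y)
    lat = math.degrees(math.atan2(z, r2))
    lon = 180.0 + math.degrees(math.atan2(y, x))
    return ("select count(1) from spoint_data "
            f"where sp @'<({lon:.10f}d,{lat:.10f}d),.316d>'::scircle;")
```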

ZCURVE


For testing, you also need an awk script:

 #--- gentest2.awk -------
 BEGIN{
     pi = 3.1415926535897932;
     degra = pi/180.0;
     rad = 180.0/pi;
     Grad = 1000000.;
 }
 {
     x = $1; y = $2; z = $3;
     r3 = sqrt(x*x + y*y + z*z);
     x *= Grad / r3;
     y *= Grad / r3;
     z *= Grad / r3;
     ix = int(x + 0.5 + Grad);
     iy = int(y + 0.5 + Grad);
     iz = int(z + 0.5 + Grad);
     # EXPLAIN (ANALYZE,BUFFERS)
     lrad = int(0.5 + Grad * sin(.316 * degra));
     print "select count(1) from zcurve_3d_lookup_tidonly('zcurve_test_points_3d', "ix-lrad","iy-lrad","iz-lrad","ix+lrad","iy+lrad","iz+lrad");";
 }

This script is completely symmetrical to the one with which we prepared the data. Again note the number .316: it determines (via Grad * sin(.316°)) half the side of the cube, centred at the computed random point, within which we look for data.

Preparing a series of requests is done like this:

 ./random 1000000 100 1023 | gawk -f gentest2.awk >tests2.sql 

Here 100 is the size of the test series and 1023 is the seed of the randomizer; it should coincide with the one used for PGSphere.
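The integer cube bounds computed by gentest2.awk can be sketched in Python as follows (a mirror of the awk logic above, for illustration only):

```python
import math

Grad = 1_000_000.0  # sphere radius in integer grid units

def cube_bounds(x, y, z, radius_deg=0.316):
    """Shift the normalised point into the positive integer grid and build
    the axis-aligned search cube for the zcurve_3d_lookup call."""
    r3 = math.sqrt(x*x + y*y + z*z)
    ix = int(Grad / r3 * x + 0.5 + Grad)
    iy = int(Grad / r3 * y + 0.5 + Grad)
    iz = int(Grad / r3 * z + 0.5 + Grad)
    # half the side of the cube: the chord matching the angular radius
    lrad = int(0.5 + Grad * math.sin(math.radians(radius_deg)))
    return (ix - lrad, iy - lrad, iz - lrad,
            ix + lrad, iy + lrad, iz + lrad)
```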

Benchmark


As before, the measurements were taken on a modest virtual machine with two cores and 4 GB of RAM, so the absolute times carry little meaning, but the numbers of pages read can still be trusted.

The times shown are those of second runs, on a warmed-up server and virtual machine. The numbers of buffers read were obtained on a freshly started server.
Radius   AVG NPoints        Nreq    Type    Time (ms)  Reads    Hits
.01°     1.17               10,000  zcurve  .37        1.4397   9.5647
         0.7631 (0.7615)            rtree   .46        2.1165   3.087
.0316°   11.6               10,000  zcurve  .39        2.0466   20.9707
         7.6392 (7.6045)            rtree   .67        3.0944   2.7769
.1°      115.22             1,000   zcurve  .44        4.4184   82.8572
         76.193 (76.15)             rtree   2.75 *     6.073    2.469
.316°    1145.3             1,000   zcurve  .59        15.2719  401.791
         758.37 (760.45)            rtree   18.3 *     21.706   1.62
1°       11310              100     zcurve  7.2        74.9544  1651.45
         7602 (7615)                rtree   94.5 *     132.15   1.12
Where
Radius - the size of the search area in degrees
NPoints - the average number of points in the result; in parentheses, the theoretically expected number
(100,000,000 points over an area of 41,252.96 square degrees gives ~2424 points per square degree)
Nreq - the number of queries in the series
Type - 'zcurve' is the index described here, 'rtree' is PGSphere
Time (ms) - the average query time
Reads - the average number of reads per query
Hits - the average number of buffer hits per query

* at some point the performance of the R-tree starts to degrade
abruptly; this happens because this tree reads far more
pages and its working set no longer fits in the cache (apparently).

Note that zcurve finds more data, which is logical: it searches inside a cube, whereas PGSphere searches inside a sphere. Post-filtering in the spirit of the haversine formula is therefore required. But here we compare only the performance of the indexes.
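Such a post-filter could look like this (a sketch with hypothetical helper names; the article itself stops at the raw index comparison):

```python
import math

def haversine_deg(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points, in degrees of arc."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def post_filter(points, center, radius_deg=0.316):
    """Drop cube hits that fall outside the spherical search radius."""
    clon, clat = center
    return [(lon, lat) for lon, lat in points
            if haversine_deg(lon, lat, clon, clat) <= radius_deg]
```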

Let's try to evaluate the difference. In general, finding the area of the piece of a sphere that falls inside a cube is a nontrivial task. Let us attempt an estimate.
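As a rough check, the cube/circle ratio can be read straight off the table; the flat-geometry baseline below is elementary geometry added for comparison, not part of the original derivation:

```python
import math

# Average hit counts from the benchmark table: (zcurve cube, rtree circle)
pairs = [(1.17, 0.7631), (11.6, 7.6392), (115.22, 76.193),
         (1145.3, 758.37), (11310, 7602)]

# Observed cube/circle ratio: roughly 1.5 at every radius
ratios = [cube / circle for cube, circle in pairs]

# Naive flat-plane baseline: a square of side 2r versus a circle of
# radius r gives 4/pi ~ 1.27; the observed ratio is larger because the
# sphere's surface generally cuts the cube in an oblique (hexagonal)
# cross-section rather than a square one.
flat_ratio = 4 / math.pi

print([round(r, 2) for r in ratios], round(flat_ratio, 2))
```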

Source: https://habr.com/ru/post/338088/

