📜 ⬆️ ⬇️

We attach the spatial index to the unsuspecting OpenSource database.

image
I always liked it when the headline clearly says what will happen next, for example, “Texas Chainsaw Massacre”. Therefore, under the cut, we really will add spatial search to the database, in which it was not originally.

Introductory
Let's try to do without common words about the importance and usefulness of spatial search.
When asked why it was impossible to take an open DBMS with an already built-in spatial search, the answer would be: “if you want to do something well, do it yourself” (C). But seriously, we will try to show that most practical spatial problems can be solved on a completely ordinary desktop without overvoltage and special expenses.

As the experimental DBMS, we will use the open edition of OpenLink Virtuoso . Spatial indexing is available in its paid version, but this is a completely different story. Why this particular DBMS? The author likes it. Fast, easy, powerful. Copied-launched-everything works. We will use only standard means, no C-plugins or special assemblies, only an official assembly and a text editor.

Index type
We will index the points using a pixel-code (block) index.
What is the pixel-code index? This is a compromise version of the sweeping curve, allowing you to work with non-point objects in a 2-dimensional search space. Let's unite in this class all the methods that:

The most well-known methods are:

With all the drawbacks, the main one of which, perhaps, is non-adaptability, the need to customize the index for a specific task, block indexes also have advantages: simplicity and omnivorousness. In essence, this is an ordinary tree with an additional semantic load. We somehow convert the geometry into numbers when indexing and in the same way we get the keys to search in the index.
In this paper, we will use the line scan (+1 ystein ) because of the simplicity: the coordinates in the number can be cheaply converted by means of PL / SQL. Using HTM or the Hilbert curve without involving C-plugins would be difficult.
')
Data type
Why point? As already noted, Block indices are just the same as indexing. At the same time, the rasterizer for points is trivial, and for polygons it would have to be written, and this is completely irrelevant for the disclosure of the topic.

Actually the data
Without further ado,
For the index, it is not important how the points are located inside the cell, and such regularity will allow us to control the correctness of the results.

To generate the data, we will use the following AWK script (let’s be 'data.awk'):
BEGIN { cnt = 1; sz = 4000; # Set the checkpoint interval to 100 hours -- enough to complete most of experiments. print"checkpoint_interval (6000);"; print"SET AUTOCOMMIT MANUAL;"; for (i=0; i < sz; i++) { for (j=0; j < sz; j++) { print "insert into \"xxx\".\"YYY\".\"_POINTS\""; print " (\"oid_\",\"x_\", \"y_\") values ("cnt","i","j");"; cnt++; } print"commit work;" } print"checkpoint;" print"checkpoint_interval (60);"; exit; } 


Server


Data schema
First, create a table of points, nothing more:
 create user "YYY"; create table "xxx"."YYY"."_POINTS" ( "oid_" integer, "x_" double precision, "y_" double precision, primary key ("oid_") ); 

Now the actual index:
 create table "xxx"."YYY"."_POINTS_spx" ( "node_" integer, "oid_" integer references "xxx"."YYY"."_POINTS", primary key ("node_", "oid_") ); 
where node_ is the space cell identifier. Note that, although the index is created as a table, both fields are present in the primary key and, therefore, are packed in a tree.

Configure index options:
 registry_set ('__spx_startx', '0'); registry_set ('__spx_starty', '0'); registry_set ('__spx_nx', '4000'); registry_set ('__spx_ny', '4000'); registry_set ('__spx_stepx', '10'); registry_set ('__spx_stepy', '10'); 

The system function registry_set allows you to write string name / value pairs to the system registry — it is much faster than keeping them in auxiliary table and is still covered in transactions.

Write trigger:
 create trigger "xxx_YYY__POINTS_insert_trigger" after insert on "xxx"."YYY"."_POINTS" referencing new as n { declare nx, ny integer; nx := atoi (registry_get ('__spx_nx')); declare startx, starty double precision; startx := atof (registry_get ('__spx_startx')); starty := atof (registry_get ('__spx_starty')); declare stepx, stepy double precision; stepx := atof (registry_get ('__spx_stepx')); stepy := atof (registry_get ('__spx_stepy')); declare sx, sy integer; sx := floor ((n.x_ - startx)/stepx); sy := floor ((n.y_ - starty)/stepy); declare ixf integer; ixf := (nx / stepx) * sy + sx; insert into "xxx"."YYY"."_POINTS_spx" ("oid_","node_") values (n.oid_,ixf); }; 

We put all this in, for example, 'sch.sql' and execute it by
 isql.exe localhost:1111 dba dba <sch.sql 

Data loading
Once the circuit is ready, you can fill in the data.
 gawk -f data.awk | isql.exe localhost:1111 dba dba >log 2>errlog 
It took the author 45 minutes (~ 6000 per second) on the Intel i7-3612QM 2.1GHz with 8Gb of memory.
The memory size occupied by the server was about 1 Gb, the size of the database on the disk was 540Mb, i.e. ~ 34 bytes per point and its index.
Now we need to make sure that the result is correct, we enter in isql
 select count(*) from "xxx"."YYY"."_POINTS"; select count(*) from "xxx"."YYY"."_POINTS_spx"; 
Both requests must give out 16,000,000.

As a guideline, you can use the data published here :
image .
They are received on PostgreSQL (Linux), 2.2GHz Xeon.
On the one hand, the data was obtained 9 years ago. On the other hand, in this case disk operations are a bottleneck, and random disk access has not accelerated much over the past period.
And here we can learn that the time of spatial indexing is commensurate with the time of filling the actual data.
It should be noted that filling data with subsequent indexing occurs (for obvious reasons) much faster than indexing on the fly. But we will be more interested in indexing on the fly because for practical purposes, a spatial database is more interesting, rather than a spatial search system.

And yet, it should be noted that when filling in spatial data, the order of their filling is very important.
Both in this paper and in the aforementioned order, the point feeding was perfect (or close to it) in terms of spatial index. If the data is mixed, the performance will fall quite dramatically, regardless of the selected DBMS.
If the data arrives arbitrarily, it can be recommended to accumulate them in a settling tank, as it overflows, sort and fill with portions.

We are developing a search
To do this, we use such a remarkable feature of Virtuoso, as a procedural view.
 create procedure "xxx"."YYY"."_POINTS_spx_enum_items_in_box" ( in minx double precision, in miny double precision, in maxx double precision, in maxy double precision) { declare nx, ny integer; nx := atoi (registry_get ('__spx_nx')); ny := atoi (registry_get ('__spx_ny')); declare startx, starty double precision; startx := atof (registry_get ('__spx_startx')); starty := atof (registry_get ('__spx_starty')); declare stepx, stepy double precision; stepx := atof (registry_get ('__spx_stepx')); stepy := atof (registry_get ('__spx_stepy')); declare sx, sy, fx, fy integer; sx := floor ((minx - startx)/stepx); fx := floor (1 + (maxx - startx - 1)/stepx); sy := floor ((miny - starty)/stepy); fy := floor (1 + (maxy - starty - 1)/stepy); declare mulx, muly integer; mulx := nx / stepx; muly := ny / stepy; declare res any; res := dict_new (); declare cnt integer; for (declare iy integer, iy := sy; iy < fy; iy := iy + 1) { declare ixf, ixt integer; ixf := mulx * iy + sx; ixt := mulx * iy + fx; for select "node_", "oid_" from "xxx"."YYY"."_POINTS_spx" where "node_" >= ixf and "node_" < ixt do { dict_put (res, "oid_", 0); cnt := cnt + 1; } } declare arr any; arr := dict_list_keys (res, 1); gvector_digit_sort (arr, 1, 0, 1); result_names(sx); foreach (integer oid in arr) do { result (oid); } }; create procedure view "xxx"."YYY"."v_POINTS_spx_enum_items_in_box" as "xxx"."YYY"."_POINTS_spx_enum_items_in_box" (minx, miny, maxx, maxy) ("oid" integer); 
In essence, this procedure is very simple, it finds out which cells in the block index the search extent falls into and makes a sub-request for each line of these cells, writes record identifiers to the hash map, eliminating duplicates and giving the resulting set of identifiers as a result.
How to use the spatial index? For example:
 select count(*), avg(x.x_), avg(x.y_) from "xxx"."YYY"."_POINTS" as x, "xxx"."YYY"."v_POINTS_spx_enum_items_in_box" as a where a.minx = 91. and a.miny = 228. and a.maxx = 139. and a.maxy = 295. and x."oid_" = a.oid and x.x_ >= 91. and x.x_ < 139. and x.y_ >= 228. and x.y_ < 295.; 
Here is the join between the table of points and the synthetic recordset from the procedural view by identifiers. Here the procedural view plays the role of a coarse filter that works up to the cell. Therefore, in the last line we placed a thin filter.
Expected Result:
This is how we performed a spatial search with one query within a single transaction.

Benchmark
To prepare the request flow, use the following script on AWK:
 BEGIN { srand(); for (i=0; i < 100; i++) { for (j=0; j < 100; j++) { startx = int(rand() * 3900); finx = int(startx + rand() * 100); starty = int(rand() * 3900); finy = int(starty + rand() * 100); print "select count(*), avg(x.x_), avg(x.y_) from"; print "\"xxx\".\"YYY\".\"_POINTS\" as x, " print "\"xxx\".\"YYY\".\"v_POINTS_spx_enum_items_in_box\" as a" print "where a.minx = "startx". and a.miny = "starty"." print " and a.maxx = "finx". and a.maxy = "finy". and" print "x.\"oid_\" = a.oid and x.x_ >= "startx". and x.x_ < "finx"." print " and x.y_ >= "starty". and x.y_ < "finy".;" } } exit; } 
As you can see, it turns out 10,000 requests in random places of random size (0 ... 100). Let's prepare 4 such sets and we will execute three series of tests - we start one isql with tests, 2 in parallel and 4 in parallel. On the eight-core processor, there is no more parallelizing. Measure performance.

You can try to compare these times with those obtained by the spinal cord by means of two ordinary indices - x_ & y_. We index the table for these fields and run a similar series of tests with queries like
 select count(*), avg(x.x_), avg(x.y_) from "xxx"."YYY"."_POINTS" as x where x.x_ >= 91. and x.x_ < 139. and x.y_ >= 228. and x.y_ < 295.; 
The result is as follows: a series of 10,000 requests is executed in 417 seconds, we have 24 requests per second.

Conclusion
The presented method has a number of drawbacks, the main of which, as already noted, is the need to tune it for a specific task. Its performance is highly dependent on the settings, which can no longer be changed after creating the index. However, this is true for most spatial indices in general.

But even this index written on the knee gives a decent performance. In addition, since we completely control the process, on this basis it is possible to make more complex structures, for example, multi-table indexes, which are difficult to build using standard methods. But more about that next time.

Source: https://habr.com/ru/post/196682/


All Articles