
Z-order in 8D


Indexes based on self-similar space-filling curves are suitable not only for organizing searches over spatial data. They are also efficient on heterogeneous data laid out on an integer lattice.

Under the cut, we will check whether the Z-curve can be used to implement an 8-dimensional index, with an eye toward the OLAP cube.
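
To make "heterogeneous data on an integer lattice" concrete, here is a hypothetical sketch (the fact table and all of its columns are invented for illustration; zcurve_num_from_8coords is the key-building function used later in this post): OLAP-style dimensions are dictionary-encoded into integers, after which a single Z-curve key can cover all of them at once.

     -- A made-up OLAP-style fact table; every heterogeneous dimension is
     -- dictionary-encoded into an integer, which puts the rows on an integer lattice.
     create table sales_fact (
         day_id      integer,   -- days since some epoch
         region_id   integer,   -- dictionary-encoded region
         product_id  integer,   -- dictionary-encoded product
         customer_id integer,   -- dictionary-encoded customer
         channel_id  integer,   -- dictionary-encoded sales channel
         amount      integer    -- a measure, usable as a range coordinate too
     );
     -- one Z-curve key covers all the dimensions; unused slots are padded with zeros
     create index zcurve_sales_fact on sales_fact (
         zcurve_num_from_8coords(day_id, region_id, product_id,
                                 customer_id, channel_id, amount, 0, 0));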

Summary of the previous series.


Earlier (1, 2, 3, 4, 5) it was shown that an index based on the Z-curve is quite suitable for organizing spatial search. In speed it is very close (in the absence of pressure on the buffer cache) to the traditionally used R-tree, while being noticeably smaller, faster to build, unafraid of random data, and needing fewer reads during the search.

Concerns.


Technically, there is nothing difficult in implementing such an index; all it took was handling an 8-dimensional key. The dimension 8 was chosen for technological reasons. If you need a 7-dimensional index, you can safely pad one of the coordinates with zeros; this has almost no effect on speed, except that the key becomes longer. And if necessary, nothing prevents making a longer key, for example a 16-dimensional one. A sketch of the zero-padding trick follows.
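
A minimal sketch, assuming a hypothetical 7-column table test_points_7d and placeholder bounds lo1 ... hi7 (zcurve_num_from_8coords and zcurve_8d_lookup_tidonly are the functions used throughout this post): the eighth coordinate is fixed at zero both in the index expression and in the search extent.

     -- hypothetical 7-dimensional data indexed through the 8-coordinate key;
     -- the eighth slot is constantly zero
     create index zcurve_test_points_7d on test_points_7d (
         zcurve_num_from_8coords(p[1], p[2], p[3], p[4], p[5], p[6], p[7], 0));

     -- the search box gets the degenerate range [0 ... 0] in the padded slot
     -- (lo1 ... hi7 are placeholders for the real query bounds)
     select * from zcurve_8d_lookup_tidonly('zcurve_test_points_7d',
         lo1, lo2, lo3, lo4, lo5, lo6, lo7, 0,
         hi1, hi2, hi3, hi4, hi5, hi6, hi7, 0);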
It is worth considering in advance what pitfalls may await us and what to do about them.


Benchmark


The following table summarizes the results for several query types (described below).
The data are 100,000,000 random points uniformly distributed in an 8-dimensional cube with sides [0 ... 1,000,000].

Measurements (as before) were conducted on a virtual machine with two cores and 4 GB of RAM, so the absolute times carry little meaning, but the numbers of pages read can still be trusted.
The times are taken from second runs, on a warmed-up server and virtual machine. The numbers of buffers read are from a freshly started server.
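
The post does not show how the reads and cache hits were counted; one standard way to obtain such per-query numbers (an assumption, not necessarily the author's method) is EXPLAIN with the BUFFERS option:

     explain (analyze, buffers)
     select c from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
         0, 0, 0, 0, 0, 0, 0, 0,
         10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000) as c;
     -- the plan comes back annotated with lines like
     -- "Buffers: shared hit=... read=..."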

Details
  • Here is the commit (SHA-1: 36dd ... 76c7)
  • gawk -f gendata.awk > data.csv
     BEGIN {
       for (i = 0; i < 100000000; i++)
       {
         x = int(1000000 * rand());
         y = int(1000000 * rand());
         z = int(1000000 * rand());
         a = int(1000000 * rand());
         b = int(1000000 * rand());
         c = int(1000000 * rand());
         print "{" x "," y "," z "," a "," b "," c "," a "," b "}";
       }
     }
    note that “a” and “b” each occur twice: the data are degenerate, the real dimension is 6
  • create table test_points_8d (p integer[8]);
  • COPY test_points_8d from '/home/.../postgresql/contrib/zcurve/data.csv';
    10 minutes
  • \dt+
     Schema |      Name      | Type  |  Owner   |  Size   | Description
     -------+----------------+-------+----------+---------+-------------
     public | test_points_8d | table | postgres | 8056 MB |

  • create index zcurve_test_points_8d on test_points_8d (zcurve_num_from_8coords (p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8]));
    15 minutes
  • \di+
     Schema |          Name          | Type  |  Owner   |     Table      |  Size
     -------+------------------------+-------+----------+----------------+---------
     public | zcurve_test_points_8d | index | postgres | test_points_8d | 4743 MB

  • verification query
     select c, t_row from (
       select c, (select p from test_points_8d t where c = t.ctid) t_row
       from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
            237000,291000,845000,152000,585000,193000,152000,585000,
            247000,301000,855000,162000,595000,203000,162000,595000) as c
     ) x;
    returns
       c   |                           t_row
     ------+-----------------------------------------------------------
     (0,1) | {237787,291065,845813,152208,585537,193475,152208,585537}

    as required; the same ctid join rewritten with LATERAL is sketched just after this list
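
The ctid-join pattern above, rewritten with LATERAL; an equivalent formulation (not from the original post) that some may find easier to read:

     select t.p
     from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
             237000,291000,845000,152000,585000,193000,152000,585000,
             247000,301000,855000,162000,595000,203000,162000,595000) as c,
          lateral (select p from test_points_8d tp where tp.ctid = c) as t(p);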


 Request type                       | NPoints | Nreq    | Time (ms) | Reads   | Shared hits
 -----------------------------------+---------+---------+-----------+---------+------------
 all by 10,000                      | 1.1e-4  | 100,000 | 0.46      | 2.0175  | 15.3919
 all by 100,000                     | 73.9    | 10,000  | 47        | 57.3063 | 1604.18
 3 × 100,000 + 5 × 10,000           | 0.0837  | 10,000  | 0.9       | 8.7637  | 73.8852
 2 × 100,000 + 6 × 10,000           | 0.0096  | 10,000  | 0.66      | 5.1771  | 40.9089
 1 whole + 1 × 100,000 + 6 × 10,000 | 0.0978  | 10,000  | 1.52      | 4.2122  | 199.33
 2 whole + 6 × 10,000               | 0.9663  | 10,000  | 95        | 115.202 | 1050.8
Where:
Request type - the shape of the query box: the extent of the search range along each of the 8 coordinates ("whole" means the entire range [0 ... 1,000,000])
NPoints - the average number of points found per query
Nreq - the size of the series of queries
Time (ms) - the average query time
Reads - the average number of reads per query
Shared hits - the average number of buffer cache hits per query
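
For reference, here is how the request types translate into lookup calls (the placement of the boxes below is an arbitrary example; the first 8 arguments are the lower bounds, the next 8 the upper bounds, and a "whole" coordinate simply gets the full range):

     -- "all by 10,000": every extent is 10,000
     select count(*) from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
         500000,500000,500000,500000,500000,500000,500000,500000,
         510000,510000,510000,510000,510000,510000,510000,510000) as c;

     -- "2 whole + 6 × 10,000": two coordinates span the entire [0 ... 1,000,000]
     select count(*) from zcurve_8d_lookup_tidonly('zcurve_test_points_8d',
               0,      0,500000,500000,500000,500000,500000,500000,
         1000000,1000000,510000,510000,510000,510000,510000,510000) as c;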

Findings


As expected, the surface ("perimeter") of the query box plays an important role in the cost of the query: the larger the perimeter, the more pages the box potentially crosses, the more has to be read, and the lower the cache efficiency.

However, for cubic queries the cost is quite acceptable. For a query cube with a side of 100,000, an average of 57 pages are read to produce 74 rows of output; this agrees with the uniform-density estimate of 10^8 × (10^5/10^6)^6 = 100 expected points per query (recall that the real dimension is 6), with the measured 73.9 of the same order. To fetch those 74 rows themselves, most likely about 74 more heap pages have to be read.

In addition, these figures were obtained on uniformly distributed data; on clustered data it is expected that the same output cardinality would be reached with much smaller query boxes, and with correspondingly fewer pages read.
It would be interesting to check; a possible generator for such data is sketched below.
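
A sketch of such a generator, under assumed parameters (1,000 random cluster centres with 100,000 points each, spread ±5,000 around the centre in every coordinate; all the numbers are arbitrary):

     -- clustered counterpart of the uniform data set: points gathered
     -- around random centres, clamped to the cube [0 ... 1,000,000]
     insert into test_points_8d
     select (select array_agg(least(greatest(x + (random()*10000 - 5000)::int,
                                             0), 1000000) order by ord)
             from unnest(c.centre) with ordinality as u(x, ord))
     from (select array(select (random()*1000000)::int
                        from generate_series(1, 8)) as centre
           from generate_series(1, 1000)) c,
          generate_series(1, 100000);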

PS: the title picture shows an octeract, the 8-dimensional hypercube.
PPS: thanks to Andrei Borodin (Octonica, Yekaterinburg) for the tip.

Source: https://habr.com/ru/post/331420/

