Using quadtree when calculating 2GIS traffic jams

Even without being a navigator, 2GIS collects and displays information about traffic jams. Firstly, it is necessary to build optimal routes, and secondly, such data is very necessary for users in large cities.

In 2GIS service of traffic jams appeared in September 2011 and today works in five cities ( Novosibirsk , St. Petersburg , Krasnoyarsk , Ufa , Kazan ). Plans for the near future - to run traffic jams in all cities with a population of one million.

Under the cut a story about what problems we faced and how they were solved.

The basis of traffic jams is data, namely points. A single point carries the following basic information:

project identifier (city);
registration time;
coordinates;
instant speed;
driving direction at the time of registration.

We receive such points from our data providers, as well as users of the mobile version of 2GIS. Before participating in the calculation, the data gets into the raw data storage, based on MongoDB .
')
What is the "draw points" and why it is needed:
Traffic jam does not exist by itself, but is usually located on some road. The points themselves do not carry information on which section the device was passing at the time of registration, therefore they need to be superimposed on the road network.

Thus, the draw points - the process of correlating each point of the road segment on which it was registered, taking into account its speed and direction of movement.

Let's take one point (figure a), and try to determine on which part of the road it was registered.

It is clear that the desired section of the road is somewhere near the point, so the task immediately becomes to find the nearest segment of the road network in a certain radius (for example, 20 meters).

In this case, the forward and reverse directions of the nearest segment (Figure b) are very different from the direction that the device recorded at the time of registration of the point (indicated by the arrow), therefore this segment is not what we were looking for. From the figure it is clear that the correct answer was nearby. Such situations may arise, for example, due to the substantial error of the devices (trackers). At more complex (perhaps multilevel) interchanges, they occur constantly, so this search will not work.
Modify it as follows: we will not search for the nearest segment, but all segments within a radius. As a result of such a search, we will receive several segments at once, and already, taking into account their direction, distance to them, and other attributes, we will be able to choose the correct segment more correctly.

Everything is good, but the network in some cities is very large, hundreds of thousands of segments are getting expensive, and the points that need to be pulled are millions in a few minutes, so the quality of the search raises the same level as the solution quality.

What was before

Initially, there was no task to read online traffic jams, but the task was to process the data for the month so that in our offline products we set average monthly speeds on the roads and more realistic to show the travel time.
Data from suppliers aggregated in a central repository.

Since the central data warehouse uses PostgreSQL , this problem was also solved at the early stages using the PostGIS extension.
Simplistically, the algorithm was about the following:

At the preparation stage, polygons were built around all the edges of the graph (the ST_Buffer function), the width (different for different classes of roads) of which determined the tolerance.
The point was checked for hit in these polygons and admissible edges were selected.
A projection of a point was built on all admissible edges (the ST_ClosestPoint function) and we obtained the sorting of admissible edges by distance.
Using sorting by distance, checking the azimuth of the point and the azimuth of the edge segment, the class and width of the road and the previous drawn points from this device, we determined the probable edge of the graph and the projection of the point on this edge was considered a drawn point and recorded in the base.

The whole kitchen worked on stored procedures, although not very quickly, but there was no urgency in these data, since the problem was solved for offline products.

On an average server, the speed was about 2,000 points per second.

With the advent of the “Online” mode, this solution was not enough: the data is updated every five minutes, so the full data processing cycle should fit into this time. With the increase in the amount of data, this stage began to be performed so long that it was difficult to enter the whole cycle in five minutes even on a very powerful hardware.

The time has come to implement an idea that has long been brewing up - the implementation of calculations in C ++.

Finding quick fixes

In parallel, we began to develop 2 solutions - one based on the GEOS library, and the second on a completely from scratch based on quadtrees.

About geos

This solution appeared first because it used the ready-made GEOS library, which, by the way, uses PostGIS.

Quadtree indexes from this library were used to search for the nearest segments.

To do this, all segments of the road graph are loaded into the tree, the hit area is indicated around a point and at the output we have a set of segments that cross the hit area, then the segments are sorted by distance from the point and the closest one is selected.

Own implementation of quadtrees

This implementation appeared second, but (running a little ahead) turned out to be a little more successful.

First, a few general definitions to add clarity.

A quad tree is a tree in which each internal node has four children.
The quadtree allows recursively split two-dimensional space into four quadrants (regions).

Quad-trees are able to store information about different types of data, but we are interested in storing segments (segments), since the road network can be simply represented using segments.

For this, we used the PM Quadtree version - Polygonal Map Random Quadtree optimized for segment storage.

PMR Quadtree:

Uses probabilistic splitting rule;
Each node contains a variable number of segments;
Each segment is inserted into all nodes that it intersects;
If a node contains a number of segments that exceeds the limit, then splitting occurs (only once) into 4 child nodes;
Well suited for the task of finding the nearest segment.

Minuses:

Possible unbalance wood.

Each tree node represents a certain area (for example, square) and contains the following information:

Depth;
Area coordinates;
Four children, if the node is internal;
Segments, if the node is external.

The node stores only those segments that intersect its area. The number of segments in a node is variable.

Building

Building a data structure is the process of sequentially inserting segments into a tree. Initially, the tree is empty.
Similar to the algorithms for building many other hierarchical data structures, the PMR Quadtree algorithm is based on top-down traversal . In other words, starting from the root node, we visit all the child nodes intersecting the segment, and add this segment to all external nodes encountered.

A key aspect of PMR Quadtree is the partitioning rule, that is, the condition under which a node is divided. For this purpose, some parameter t is used , which limits the number of segments that can be contained in a node.
If the number of segments that intersects with a region exceeds t and if the maximum depth level is not reached, then this region is divided into four subsidiaries, and all segments that intersect with it are inserted into new subsidiary nodes. It is worth noting that immediately after splitting the child nodes are not split again, even if the number of segments in the node exceeds t . This avoids excessive partitioning and leads to probabilistic behavior in the sense that the order in which objects are inserted affects the structure of the resulting tree.

Visualization

Search in the tree

For the search we will use the queue with priority .
In the queue we will put objects of three types:

internal nodes;
external nodes;
segments.

The key in the queue is the distance to the starting point. Initially, the queue contains only the root node.

Now at each step we will take from the queue an object with a minimum distance and, depending on its type, carry out the corresponding operations:

Internal node We add to the queue each of the four children, if the distance from it to the point does not exceed R
External node Add to the queue all segments contained in it, the distance to which from the point does not exceed R
Segment. Found the answer to the problem.

Similarly, you can solve the problem of several nearest segments in the radius.

Comparison

Comparing the implementation on GEOS and our own, we were pleasantly surprised:
On the same gland for 300,000 segments:

	Geos	PMR Quadtree
Building a tree	2 seconds	0.2 seconds
Memory consumption	200 MB	50 MB
Draw points	35 500 points per second	383,500 points per second

As a result, the own decision laid the foundation for the new draw service of points, and allows you to draw points from all suppliers throughout Russia on an average server.

Conclusion

As a result, moving from PostGIS to a highly specialized solution in C ++, we received an acceleration from 2k to 380k (190 times).

Write highly specialized bikes!

How the traffic jams service works can be viewed right now , or by installing 2GIS on iOS , or Android .

Source: https://habr.com/ru/post/205742/

All Articles