The task was to organize administrative centers into a clear hierarchy on the basis of matryoshka, for example, Ukraine - Crimea - YBK - Yalta, and correct existing errors in the current database.
In this article I will explain how I solved this problem with the help of KML files framing the boundaries and Postgres + Postgis.
The fact is that the base we use for
our project is not commercial (user generated, open source) and there are errors in it. For example, the most frequent case is that many cities are assigned to a country, but they do not belong to any of its regions and regions, we call them
orphaned cities .
')
Plus, we have a tourist business, so the administrative and political fragmentation of countries does not always fit, sometimes no-no and you have to add tourist regions manually. For example, there is no such administrative region as the “Southern Coast of Crimea”, but there is such a tourist area where tourists choose where to go - they are looking for “houses on the South Coast”, rather than “houses in Yalta, Gaspra, Gurzuf and in general somewhere ".
The question is how to automatically find a parent (Crimea) for such an administrative region and ask all its children (cities like Yalta and Sudak).
By the way: our country consists of regions, regions consist of regions, regions of subregions. Just like the death of Koscheev ...
To solve this problem, it was decided to use the data with the borders of the regions, and compare them with parents and children.
Preparing data
The database of the boundaries of the regions of the part of the world that interests you can be easily found on the Internet. I used files for Google Earth - either KML (XML with coordinates) or KMZ (KML in a zip archive), it is convenient to look at them directly in Google Earth.
KMZ didn’t turn into KML automatically - the Unicode characters in the names of toponyms break down, so I opened the KMZ files in Google Earth and saved them as KML. Of course, “it is unworthy for a talented person to spend hours, like a slave, on calculations, which, of course, could be entrusted to any person if you used a machine at the same time,” but if there were more files, you would look for an automation tool.
The second way, if you need custom areas like the South Coast, draw with your hands in Google Earth and save to a KML file.
The data files that I had had different regions in each region, but there were no relationships between parents and children. For example, a separate file "all regions of the country," and separately - "all areas of the country." In principle, this is the task that we need to solve, but for this we now have all the data.
Cooking platform
At first I wanted to use MySQL, since they have support for spatial data types and operations on them, but then it turned out that these very operations are implemented in a highly simplified form and are unsuitable for real tasks. A simplified view means that MySQL only works with the MBR, a rectangle describing a polygon. It turns out that instead of the beautiful borders of Crimea, operations will be performed with a blunt rectangle around it, and the calculation of the real area would be replaced simply by the area of ​​this rectangle (which will be used to solve this problem). This makes MySQL unsuitable for this task.
We use Postgres and install PostGis library on it to work with geographic data types.
Load the data
I prepared several tables -
country
,
region
,
area
,
subarea
. For each of them, tables with the suffix "
_boundary
" were also made, since the same region can have more than one border (for example, islands).
The data itself can be downloaded in 2 ways:
- Being unfamiliar with the PostGis arsenal, I wrote my KML handler - in fact, XML nodes with coordinates parsed and turned into SQL polygons,
- alternative to manual work - the ST_GeomFromKML function built into
ST_GeomFromKML
We clean the data
The obtained boundaries came out too detailed, with a large number of points. This means longer processing, as well as more data to be sent to the client via AJAX in the browser, if these polygons need to be drawn on the map. Looking ahead, I will say that in this way the size of the data was reduced by a factor of 10, and ajax requests began to “fly”.
Therefore, I decided to simplify polygons, good for this in PostGis is the
ST_Simplify
function - it should be given a polygon and a smoothing value at the input.
Having played with the test data, I selected the smoothing parameter to be 0.001.
The request turned out like this:
UPDATE "subarea_boundary" SET "path" = ST_Simplify("path", 0.001);
Compare yourself before and after:


We select the parent
There are various functions in PostGis to clarify the relationship between polygons, but I came to the conclusion that the following idea is best to use: a
territory is considered subsidiary if the subsidiary territory is more than half the parent territory.In the form of a formula, it looks like this:
[intersection area] / [subsidiary area]> 0.5In terms of PostGis functions, it looks like this:
ST_Area(ST_Intersection(region_path, area_path)) / ST_Area(area_path) > 0.5
Next, I wrote several similar stored procedures to find the right parent for a given territory.
Here is an example of finding a parent region for a region:
CREATE OR REPLACE FUNCTION "SuggestRegion" ("area_path" geometry) RETURNS integer AS 'DECLARE admId integer := 0; BEGIN SELECT INTO admId "parent_id" FROM "region_boundary" WHERE ST_Area(ST_Intersection(region_path, area_path))/ST_Area(area_path) > 0.5 LIMIT 1; RETURN admId; END;' LANGUAGE "plpgsql" COST 100 VOLATILE RETURNS NULL ON NULL INPUT SECURITY INVOKER
To get the ID of the region into which this region belongs, you need to drive the polygons of this region through this function:
SELECT "SuggestRegion" ("path") AS parent_id FROM area_boundary WHERE area_id = XXX LIMIT 1;
If the received ID is greater than zero, then you can save the resulting region as the parent region of this region.
For cities it is even simpler, since the city is set only by the center coordinate - we use the function of checking the full entry
ST_Within
.
I hope this article will be useful for other lovers of geodesy.