Hello, dear readers of Habr. In this article we will talk about:
- Addresses, and data stores with a fuzzy schema
- Processing geodata in Java, namely with the Java Topology Suite
- The price of "simplicity" for the developer
- A pure-Java NoSQL document database / full-text search engine: Elasticsearch
For Java-hub readers unfamiliar with OSM:
OSM (OpenStreetMap) is an open geospatial database with highly detailed global coverage.
For a user, it is simply a map that can be viewed online, downloaded to a navigator or phone, edited, printed, and generally used in any other imaginable way, including for commercial gain.
From a developer's point of view, it is a very peculiar database.
- OSM has no rigid data schema and no usual GIS-style division into layers. An object's type, its properties, and in a certain sense its geometry are specified via tags: textual key-value pairs.
- Geometry is stored only for points. A line is an array of point IDs. A polygon is a set of references to lines, with an indication of whether each line belongs to the outer or an inner boundary.
- This allows the community to model new types of data on the fly. Want to map sidewalks? Agree on a combination of tags and go ahead. All this without the drama (well, almost) of some obscure moderator deleting the data dear to your heart with the comment "you are cluttering the map, nobody needs your work".
Naturally, for developers this freedom easily turns into a nightmare.
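To make the storage model above concrete: turning a way into drawable geometry means substituting coordinates from a node index into the way's list of node IDs. A minimal sketch (the names and types here are my illustration, not any OSM tool's actual API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class WayGeometry {
    /** Resolve a way's node-ID list into [lon, lat] coordinates via a node index. */
    static double[][] resolve(long[] nodeRefs, Map<Long, double[]> nodeIndex) {
        double[][] coords = new double[nodeRefs.length][];
        for (int i = 0; i < nodeRefs.length; i++) {
            double[] c = nodeIndex.get(nodeRefs[i]);
            if (c == null) {
                throw new NoSuchElementException("way references missing node " + nodeRefs[i]);
            }
            coords[i] = c;
        }
        return coords;
    }

    public static void main(String[] args) {
        // Toy index with two nodes; a real one holds millions of entries.
        Map<Long, double[]> index = new HashMap<>();
        index.put(1L, new double[]{ -0.126, 51.501 });
        index.put(2L, new double[]{ -0.125, 51.502 });
        double[][] line = resolve(new long[]{ 1L, 2L }, index);
        System.out.println(line.length + " points resolved");
    }
}
```

The node index is exactly the structure that, at planet scale, dominates memory consumption, as discussed later in the article.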
Addresses
Most of us go through life having changed cities once or twice, and we simply never get the chance to notice the variety of postal addressing schemes.
It starts small: the order in which an address is written, from largest unit to smallest or from smallest to largest.
Then it turns out that in some places the street is not used in the address at all:
house 521, Kotor, Montenegro; or, in Russia, the settlement Energetik, house 15.
Then, that one house may have several addresses:
Best City on Earth, Straight Street 22, which is also Best City on Earth, Perpendicular Street 12.
Then, that dual addressing occurs not only at corner houses, and that two addresses are far from the limit:
Tallinn, Kashtanovaya Street 13, 15, 17
Tallinn, Cherry Street 1, 3
* The example is from my own life; I have simply forgotten the original street names.
Addresses are assigned not only to the houses themselves, but also to entrances, and even to sections of a house that are not outwardly visible in any way. (This is from the Tallinn example: the addresses on the same street inside the same building are internal sections that once belonged to different owners.)
And finally, exotica such as
Naberezhnye Chelny, 11/1
This is residential complex 11, house 1. The houses in Naberezhnye Chelny actually have ordinary street addresses too, but local residents do not use them.
Now multiply all this by:
- 4 fundamentally different ways to specify a settlement in OSM
- 3-4 ways to specify multiple addresses
- 3 ways to specify the street for an address
Moreover, if you merely enter data, a couple of schemes from this zoo are enough for you. If you want to write a decent geocoder, it is not enough to simply remember that the zoo exists: you have to handle all of it.
For example, let's see what it takes to match a house to a city.
Let's start with the project's homeland: the UK.
Here, for example, is a house in Westminster (www.openstreetmap.org/way/46138969):
addr:housenumber = 1
addr:street = Derby Gate
building = yes
Westminster itself is tagged like this:
name = Westminster
name:ru = …
place = town
What else is important to remember is that this is practically the very center of London, and London essentially consists of several dozen smaller towns. To which city should the house be assigned: to London or to Westminster?
In fact, this question has no unequivocal answer; a building carries no indication of which city it formally belongs to. But to me, for example, it is obvious that a good geocoder should find this house for the query "London, Derby Gate, 1" as well as for "Westminster, Derby Gate, 1" and "London, Westminster, Derby Gate, 1".
How I see the implementation: with each house (address point) I associate all the polygonal boundaries it falls into (city boundaries end up here, if they are mapped) and an array of surrounding settlements; here London lands as the nearest city and Westminster as the nearest town, along with their neighbours.
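The "nearest settlement of each kind" half of that association can be sketched in a few lines. This toy version does a linear scan and keeps only the single nearest place per type; the real pipeline uses spatial indexes and also records the neighbours (all names and classes here are mine, for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NearestPlaces {
    static class Place {
        final String name, type; // type: "city", "town", ...
        final double lon, lat;
        Place(String name, String type, double lon, double lat) {
            this.name = name; this.type = type; this.lon = lon; this.lat = lat;
        }
    }

    /** For an address point, pick the nearest settlement of each place type. */
    static Map<String, Place> nearestByType(double lon, double lat, List<Place> places) {
        Map<String, Place> best = new HashMap<>();
        for (Place p : places) {
            Place cur = best.get(p.type);
            if (cur == null || dist2(lon, lat, p) < dist2(lon, lat, cur)) {
                best.put(p.type, p);
            }
        }
        return best;
    }

    // Squared planar distance; good enough for comparing nearby candidates.
    private static double dist2(double lon, double lat, Place p) {
        double dx = p.lon - lon, dy = p.lat - lat;
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        List<Place> places = new ArrayList<>();
        places.add(new Place("London", "city", -0.128, 51.507));
        places.add(new Place("Westminster", "town", -0.135, 51.500));
        Map<String, Place> near = nearestByType(-0.126, 51.501, places);
        System.out.println(near.get("city").name + " / " + near.get("town").name);
    }
}
```

For the Derby Gate house this yields London as the nearest city and Westminster as the nearest town, which is exactly the pair of associations described above.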
Here is how it looks in the JSON data schema for Elasticsearch.
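A condensed sketch of such a document for the Westminster house (the structure follows the house listing shown further below; the `nearbyPlaces` field name and the admin levels are my illustration, not necessarily the project's actual schema):

```json
{
  "id": "adrpnt-…-w46138969",
  "ftype": "adrpnt",
  "properties": { "addr:street": "Derby Gate", "addr:housenumber": "1" },
  "boundaries": {
    "parts": [
      { "lvl": "boundary:2", "names": { "name": "United Kingdom" } },
      { "lvl": "boundary:8", "names": { "name": "Westminster" } }
    ]
  },
  "nearestCity":  { "properties": { "name": "London", "place": "city" } },
  "nearbyPlaces": [ { "properties": { "name": "Westminster", "place": "town" } } ]
}
```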
Accordingly, a search will find the house for a query with any of these cities; the difference will only be in relevance between an exact city match and a neighbouring city. Each such association costs 1 or 2 spatial joins.
Geocoder
As the well-known OSMer Zverik said:
"Lately everyone is busy with nothing but writing the next geocoder."
I will be no exception: everyone writes a geocoder, so why should I be left out? I, too, want my own home-built bicycle of a geocoder.
What do people want from OSM with regard to addresses?
- Export the addresses for me to CSV or SQLite
- Please geocode this list of 100,500 addresses
- An online geocoding service (OSM already has online geocoders, which is why this item is not in first place)
- I entered the address of a house or organization; why can't I find it through the online geocoder?
- I want my own local search over OSM data; what would you advise?
The answer usually begins like this:
- You will need Postgres with PostGIS (90% usually give up at this part of the quest)
- You will need to load the OSM data into that database using osmosis or osm2pgsql; if you suddenly decide to pull this trick off at planetary scale, please stock up on RAM (16-32 GB) and patience (several days if you have plenty of memory, several weeks if you have little). A single country is also not very fast, and not very gentle on the hardware.
- Process the data with scripts in Postgres. (Depending on the desired result, 10-20% of those who wanted to deal with OSM data processing are still with us.)
- Add search over the address database, or export what you need to CSV/JSON
Only a few complete the whole quest. Adding each new feature to a geocoder turns into hours, or even days, of processing spatial joins in Postgres, plus the gigabytes of RAM needed to load the extra data into it.
And when you realize that in Russia a house's city is determined by its falling inside the city polygon, while in the UK it is determined by proximity to settlement points, so you need both the city boundaries as adopted in Russia and Voronoi polygons plus neighbours for the UK, all mixed together, your eyes fill with blood, and the next addressing scheme built on a pile of relations provokes a storm of emotions.
Geocoding can be viewed as two almost independent tasks:
- For every object (address, street, city, point of interest), derive the supporting data: city boundaries, neighbouring settlement points, surrounding streets.
- Organize search over this data. In the simplest case, it is just full-text search.
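For the second task, the simplest Elasticsearch variant is a plain full-text match over the pre-built address strings. A sketch of such a query (the field path follows the house listing later in the article; treat it as illustrative, not as the project's actual query):

```json
{
  "query": {
    "match": { "addresses.text": "London Derby Gate 1" }
  }
}
```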
I concentrated on solving task 1: getting the data into a form suitable for search indexing. In principle, once this data has been handed to Google or Yandex for indexing, the second half of the geocoder can be outsourced.
Attempt # 1
My first attempt is here: github.com/kiselev-dv/osm-addresses-pgsql
I load simplified data into Postgres and then, resorting to various tricks to stay within reasonable time, process it there.
During development it turned out to be faster and more flexible to process the geometry in Java and load ready-made geometry into Postgres.
Computational complexity in computational geometry
Usually nobody reads the middle of an article; I hope an obvious tautology will cheer up the bored reader.
Why does loading 6 GB of data, packed as zipped XML, require 6+ GB of RAM?
OSM data does not contain complete geometry; the only geo-referenced OSM object is the node. (A sheer triumph of data normalization.) To obtain geometry, you have to substitute node coordinates into the node-ID lists of the ways. The bulk of the memory is eaten by the node index. To avoid going back to disk, tags are usually kept in it as well. The planet dump now contains about 3 billion nodes; add the overhead of keeping them in an index, and on an ordinary home computer a JavaHeapSpace error starts to loom.
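A back-of-the-envelope estimate shows why the whole planet's node index cannot fit into an ordinary heap. This counts only the raw payload per node; hash-map or tree structures add substantial per-entry overhead on top:

```java
public class NodeIndexEstimate {
    /** Raw payload of an in-memory node index, ignoring all structure overhead:
     *  one long id plus two double coordinates per node. */
    static long rawGiB(long nodes) {
        long perNode = 8 + 8 + 8; // id + lat + lon, in bytes
        return nodes * perNode / (1024L * 1024 * 1024);
    }

    public static void main(String[] args) {
        // ~3 billion nodes in the planet dump (figure from the article text)
        System.out.println(rawGiB(3_000_000_000L) + " GiB minimum, before index overhead");
    }
}
```

Even the bare minimum lands far beyond a home machine's RAM, which is why a country-sized extract already pushes a 6 GB heap to its limit.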
Suppose we somehow managed to load all this into a database and build the spatial indexes. How do we speed up query execution?
The main part of the work is the point location problem. Taking Russia as an example, we need to compute the containment of one and a half million points in half a million boundary polygons. It is also important to understand that, for example, the border of Russia alone contains 40,000 segments, and all the million points must be checked against it. Despite the existence of an index, the complexity is O(N*M): using the index, PostGIS quickly determines that point n of the million address points possibly falls inside Russia, after which, to decide whether the point actually lies inside the polygon, it performs on the order of M per-segment tests, where M is the number of segments in the polygon boundary.
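That inner O(M) step is easy to see in a minimal point-in-polygon check. This is a ray-casting sketch of what any database has to do for every candidate the index returns; each query walks all M edges (PostGIS's real implementation is, of course, more elaborate):

```java
public class PointInPolygon {
    /** Classic ray-casting containment test, O(M) in the number of edges.
     *  polygon is an array of [x, y] vertices, without a closing duplicate. */
    static boolean contains(double[][] polygon, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does a horizontal ray from (x, y) cross the edge (j -> i)?
            if ((yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        double[][] square = { {0, 0}, {10, 0}, {10, 10}, {0, 10} };
        System.out.println(contains(square, 5, 5) + " " + contains(square, 15, 5));
    }
}
```

With M = 40,000 border segments and N = 1.5 million candidate points, these per-edge tests are exactly where the hours go.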
By the time I realized this, I had already done almost all the preparatory work with the Java Topology Suite (www.vividsolutions.com/jts/jtshome.htm): parsing the polygons of administrative boundaries, extracting and simplifying street geometry, building geometry for houses with addresses. I was essentially using PostGIS only to compute the spatial joins and as a general I/O framework.
Another bottleneck with PostGIS is that query execution is difficult to parallelize. And in my case it is not hundreds of queries per second, but one fat query running for several hours.
The best bicycle: attempt #2
Truth be told, in describing the complexity of data processing I am exaggerating a little. The OSM community, for example, copes quite well with converting the map into various navigation software formats, and does so daily, using both dedicated servers and participants' ordinary home computers.
At the same time, the Russian community processes OSM data with its local specifics in mind, and the German community with its own.
For the second attempt I decided to go the same way: to write an application that most OSMers could run on their home PCs.
What this requires:
- Most importantly, ease of installation. Nobody will set up an environment for you, or install extra libraries, let alone a database engine. (Goodbye, Postgres.)
- Customizability. If you want the community to help you, let the community encode and implement local traditions and practices.
- Versatility. Again, let the local community solve a wide range of tasks with your tool. Give them the chance to export the address registry in exactly the form they need.
Since I already had some experience after attempt #1, I had to partially reimplement in pure Java the pipeline previously built on PostGIS. In fact, all I need are the spatial indexes (R-tree and quadtree), which are implemented in JTS, plus a couple of utility methods for working with geometry, such as splitting a line geometry at a point.
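One of those utility methods, projecting a point onto a segment, is small enough to sketch without JTS (JTS provides this via its distance and linear-referencing machinery; the code below is just the underlying idea, with names of my own choosing):

```java
public class SegmentProjection {
    /** Closest point on segment (ax, ay)-(bx, by) to point (px, py).
     *  This is the building block behind both "split a line at a point"
     *  and nearest-street distance calculations. */
    static double[] closestOnSegment(double ax, double ay, double bx, double by,
                                     double px, double py) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        // Parameter t of the perpendicular projection, clamped to the segment
        double t = len2 == 0 ? 0 : ((px - ax) * dx + (py - ay) * dy) / len2;
        t = Math.max(0, Math.min(1, t));
        return new double[] { ax + t * dx, ay + t * dy };
    }

    public static void main(String[] args) {
        double[] p = closestOnSegment(0, 0, 10, 0, 3, 4);
        System.out.println(p[0] + " " + p[1]);
    }
}
```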
So, I process the OSM data in 2 stages.
1. I extract the geometry from the OSM file and cut it into pieces: strips 0.1 degrees wide. This breaks the task into parts in a natural way and at the same time simplifies the boundary geometry. That is, instead of 1 Russian border with 40,000 segments, I get 1,500 pieces, each containing a couple of dozen segments. The number of boundary polygons per strip is also on the order of a few dozen.
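The strip assignment itself is a one-liner; here is a hedged sketch of how a point maps to a 0.1-degree band (a real implementation additionally clips line and polygon geometry that spans several bands):

```java
public class Strips {
    /** Index of the 0.1-degree longitude strip a point falls into.
     *  Multiplying by 10 instead of dividing by 0.1 avoids floating-point
     *  surprises at exact strip boundaries (0.1 is not representable). */
    static int stripIndex(double lon) {
        return (int) Math.floor((lon + 180.0) * 10.0);
    }

    public static void main(String[] args) {
        System.out.println(stripIndex(-180.0) + " " + stripIndex(0.0) + " " + stripIndex(179.95));
    }
}
```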
2. In the second stage I compute the containment of points in boundary polygons and look for the nearest streets and nearest cities. Each "strip" is processed separately, so this stage parallelizes across threads/nodes.
As a result I get a JSON description for points of interest, streets, cities, boundaries, and houses.
The description of a POI looks like this (Cyrillic string values did not survive the translation and are shown as "…"; the listing is truncated):
{
  "id": "poipnt-3372930671-n600379459",
  "ftype": "poipnt",
  "timestamp": "2014-05-14T15:22:11.785Z",
  "poiTypes": ["police"],
  "boundaries": {
    "index_name": "…",
    "text": "…",
    "langs": [],
    "longText": "…",
    "boundariesHash": -410546539,
    "parts": [{
      "lvl": "boundary:2",
      "names": { …
The description of a house (address point) looks like this (again with lost Cyrillic values shown as "…", and truncated):
{
  "id": "adrpnt-0642964090-w129096772",
  "ftype": "adrpnt",
  "timestamp": "2014-05-14T15:31:38.778Z",
  "nearbyStreets": [
    { "id": "hghway-3750745956-w128994466",
      "properties": { "highway": "unclassified", "name": "…" } },
    { "id": "hghway-1683350308-w129293134",
      "properties": { "kladr:user": "87005000001000100", "surface": "concrete",
                      "highway": "tertiary", "name": "…", "oneway": "no" } },
    { "id": "hghway-0059099168-w129293131",
      "properties": { "highway": "unclassified", "name": "…" } },
    { "id": "hghway-0069632087-w129293121",
      "properties": { "kladr:user": "87005000001000100", "surface": "concrete",
                      "highway": "tertiary", "name": "…", "oneway": "no" } },
    { "id": "hghway-4111943004-w128995822",
      "properties": { "highway": "residential", "name": "…" } }
  ],
  "properties": {
    "building": "yes", "building:levels": "2", "sport": "swimming",
    "amenity": "swimming_pool", "addr:street": "…", "addr:housenumber": "14"
  },
  "nearestCity": {
    "id": "plcdln-2842041450-n1575069719",
    "properties": {
      "name:fr": "Providenia", "addr:postcode": "689251", "name:ru": "…",
      "name:en": "Provideniya", "name:nl": "Providenia", "name:pl": "Prowidienija",
      "name:it": "Providenija", "name:es": "Providéniya", "addr:country": "RU",
      "name": "…", "name:de": "Prowidenija", "addr:region": "…", "place": "town"
    }
  },
  "type": "Feature",
  "addresses": [{
    "index_name": "…",
    "text": "…",
    "langs": [],
    "longText": "…",
    "parts": [
      { "lvl": "postcode", "name": "…", "lvl-size": 55 },
      { "lvl": "boundary:2", "names": { …
The intermediate data storage format is JSON (gzipped). Firstly, this greatly simplifies debugging: I can inspect intermediate results with plain (z)cat and (z)grep. Secondly, it simplifies life for third-party developers, should any appear.
As output I generate CSV for a PostGIS + Sphinx bundle (PostGIS may be superfluous here, but it is convenient for building a reverse geocoder in this combination) and JSON for Elasticsearch (reverse geocoding can be done by Elasticsearch itself).
Processing Russia takes about 2 hours, 6 GB of RAM, and approximately 5 GB of disk space. You can squeeze into less RAM by sacrificing time. The analogues known to me spend a couple of days on this.
Many stages can be customized (I plan to increase the number of extension points) by passing implementations of my classes to the extensions from Groovy.
The data-processing application is a single executable jar and does not require you to install any additional dependencies.
At the moment the launch syntax is rather convoluted, with a lot of command-line parameters; in the future I plan a one-line launch, java -jar gazetteer.jar russia.config, and better still, running as a worker node of a cluster.
Link to the bicycle: github.com/kiselev-dv/gazetteer
I am not strong in search technology, so the search over the data loaded into Elasticsearch is itself rather lame. If there are people on Habr who would be interested in working with this kind of data and experimenting with search-engine settings over geodata: welcome.
The project consists of two parts.
Gazetteer: the data processing.
GazetteerWeb: an API over Elasticsearch (currently on Tomcat; I plan to move to Netty).