
Earth is flat

More precisely, it is not flat, but it is not a ball either. Not even an ellipsoid. Rather, a polyhedron. More precisely, a 56-hedron. Even more precisely: what follows is a proposed new format for recording geo-coordinates.

First, some general considerations. There are three types of data in OSM: node, way, and relation. Nodes contain coordinates in latitude-longitude format; ways let you draw lines on the map (more precisely, polylines, including closed ones) and are simply ordered arrays of node identifiers; relations may contain as children not only nodes but also data of the other types. In addition, all three types may carry descriptions in the EAV (Entity-Attribute-Value) model. Structurally, there is nothing else in the database.
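
For orientation, a minimal sketch of this data model in Python (the field names are illustrative, not the actual OSM XML/PBF schema):

    # A minimal sketch of the three OSM data types; names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: int
        lat: float                                    # latitude, degrees
        lon: float                                    # longitude, degrees
        tags: dict = field(default_factory=dict)      # EAV-style description

    @dataclass
    class Way:
        id: int
        node_ids: list = field(default_factory=list)  # ordered array of node ids
        tags: dict = field(default_factory=dict)

    @dataclass
    class Relation:
        id: int
        members: list = field(default_factory=list)   # (type, id, role): children of any type
        tags: dict = field(default_factory=dict)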

The Earth is a ball, and the viewing window is flat. The coordinates of objects in the database are given as latitude-longitude, and that does not seem entirely reasonable to us. Map volumes (at least at the smallest scales) are simply huge, yet the user must be given efficient access to any corner of the planet at any map size, both for reading and for editing. What follows from all this? In my opinion, the obvious: each map is cut into slices of a more or less tolerable size (in terms of download speed).
The cutting is done on the server, i.e. the same pieces are delivered identically to all users. Another point: although coordinates are numbers and could therefore be stored in binary format, the browser can only be sent text. The solution to this problem has long been known: binary data is transmitted in base64 encoding, where each byte carries only 6 of its 8 bits of information. Still, 6 bits per character is reasonably compact. We will use this base-64 numeral system.

The coordinates of objects on the map are best specified so that the volume of transmitted data is minimal. A solution we like: latitude and longitude (or, on the plane, X and Y) are set by a single character (in base64), which positions the element like a cell of an 8 * 8 chessboard (in units of length of the appropriate scale). With two characters we can position an element on a 64 * 64 map, with three on a 512 * 512 map, and with eight characters (plus a ninth for the map code itself, which, as shown below, also positions like a board cell) we reach a precision of 134217728 * 134217728, which is more than enough even for the most accurate map (for planet Earth this means positioning objects with an accuracy of less than half a meter). Not enough? Add 2-3 more characters! Personally, I would even drop one: the accuracy is ample a couple of characters earlier, too.
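
A minimal sketch of such an encoding, assuming the standard base64 alphabet and a 3-bits-of-Y, 3-bits-of-X layout per character (the article fixes only the "one character per 8 * 8 board" idea, not these details):

    # One character = one cell of an 8 x 8 board: 3 bits of X and 3 bits of Y.
    ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789+/")

    def encode_xy(x, y, chars):
        """Pack grid coordinates 0 <= x, y < 8**chars into `chars` symbols."""
        out = []
        for i in reversed(range(chars)):
            xi = (x >> (3 * i)) & 0b111   # 3 bits of X at this refinement level
            yi = (y >> (3 * i)) & 0b111   # 3 bits of Y
            out.append(ALPHABET[(yi << 3) | xi])
        return "".join(out)

    def decode_xy(code):
        x = y = 0
        for ch in code:
            v = ALPHABET.index(ch)
            x = (x << 3) | (v & 0b111)
            y = (y << 3) | (v >> 3)
        return x, y

    # One character -> an 8 * 8 board; nine -> 134217728 * 134217728.
    assert decode_xy(encode_xy(12345, 67890, 9)) == (12345, 67890)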

How do we get the initial set of (flat) maps of the coarsest scale? Very simply!

  1. We cut the Earth at the equator and get rid of negative latitudes. From now on, a single bit determines whether the hemisphere is northern or southern (we assign the equator to the northern hemisphere so as not to duplicate objects lying exactly on it).
  2. We make two more pairs of cuts: at the 32nd and 64th degrees of latitude. Now the planet is divided into 6 regions: two "polar caps", two "equatorial" regions, and two "mid-latitude" zones.
  3. We look at each "polar cap" from above (this is called an azimuthal projection). We see a circle centered on the pole whose boundary runs along the 64th parallel. We circumscribe a square around this circle so that it touches the circle at the intersections of this parallel with the Greenwich and 90th meridians, and these meridians themselves (straight lines in the azimuthal projection) cut the square into 4 equal parts. These will be the maps of the coarsest scale.
  4. The remaining "barrel" (the Earth without the "polar caps") is cut along the meridians in 30-degree steps, starting from Greenwich. Thus the planet is cut into (12 + 12 + 4) * 2 = 56 nearly flat maps, each of which (like each of its children) is treated, for further positioning, as an 8 * 8 square. A child map's code is likewise transmitted in one character, i.e. a child map relates to its parent map as a chessboard cell relates to the board itself. (A sketch of dispatching a coordinate to one of the 56 maps follows this list.)
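
To make the cutting concrete, here is a rough sketch of that dispatch; the map codes and the treatment of boundary parallels are assumptions of mine, since the article fixes only the cuts themselves:

    # Dispatch a coordinate to one of the 56 coarsest maps. Assumed codes;
    # fixed by the article: the equator, the 32nd and 64th parallels,
    # 30-degree meridian steps, and 4 quadrants per polar cap.
    def top_level_map(lat, lon):
        hemi = "N" if lat >= 0 else "S"      # the equator goes to the north
        alat = abs(lat)
        if alat > 64:                        # polar cap: 4 quadrants
            return "%scap%d" % (hemi, int((lon % 360) // 90))
        band = "mid" if alat > 32 else "eq"  # mid-latitude or equatorial band
        return "%s%s%02d" % (hemi, band, int((lon % 360) // 30))
    # (12 + 12 + 4) * 2 = 56 distinct codes in total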

And now let's see how precisely the coordinates of objects in the OSM database are currently specified. MAMMA MIA! Seven decimal places! Gentlemen, who are you kidding?! Do you really mean to say that the outlines of houses, the coastlines of continents and everything else are surveyed down to a centimeter? I refuse to believe it! But fine, suppose it is not utter nonsense but the honest truth: then tell me, who needs such insane accuracy, and what for? This is idiocy! No, we will go the other way! (c)

Now recall that the Earth's surface is very heterogeneous. Something tells me that the information density of a square kilometer of map in the center of Paris is slightly higher than that of the same kilometer in the middle of the Pacific Ocean. So what do we do?

Exactly: it is not at all necessary for every slice of this "chocolate bar" to exist. If a map would contain too few objects, they are simply registered on the parent map, and the child map itself is neither stored in the database nor delivered to the user. How do we know that some information on a map is really destined for one of its children? Elementary! The format for recording coordinates is the same on a map of any scale, so if an object belonging to a child map is specified on the parent's map, its code is simply extended by one byte (child map ID relative to the parent + coordinates), for a grandchild by one more, and so on down to the map of the finest scale. Thus, for a map of "decimeter accuracy", the length of an object identifier in "absolute" coordinates (for the entire database as a whole) does not exceed 9 bytes, which is even less than the length of most node identifiers in the database today. But our identifier now carries all the information about the object's geographic location! Therefore the next two fields of the node structure (lat / lon) are no longer needed, and the vast majority of nodes become literal objects: not independently addressable in the database and existing only as parts of their parent objects, the trajectories. We no longer need to dig through the database to get a node's coordinates from its identifier; the volume of the node store shrinks roughly threefold, and the whole database accordingly shrinks by about a third. For free! Finally, this automatically eliminates errors where nodes with different identifiers have identical coordinates or, conversely, different coordinates are given for the same identifier in different places in the database.
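
The payoff, as a sketch (same assumed alphabet and bit layout as above): decoding an identifier needs nothing but the identifier itself, with no database lookup at all.

    # A shorter ID simply means the object is registered on a coarser parent map.
    def id_to_position(object_id):
        """Return (x, y, levels) for an identifier string."""
        x = y = 0
        for ch in object_id:
            v = ALPHABET.index(ch)   # ALPHABET as in the earlier sketch
            x = (x << 3) | (v & 0b111)
            y = (y << 3) | (v >> 3)
        return x, y, len(object_id)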

An important nuance: a point specified by coordinates on the current map, or on any of its parents as described above, is the same point: it is meant for a map of one and the same scale. But trajectories defined for maps of different scales are different geometric objects that live their own lives, even when they describe the same logical-level object. Example: on the most detailed map, the outline of a lake may have to be described by relation-type data, for instance because of the limit on the number of points (no more than 2000) in way-type data. On a coarser map the same lake may already "fit" into a single way, on a coarser map still it may degenerate into a point, and from other maps it may disappear altogether. But the (non-geometric) description of this object should, of course, be a single object associated with its geometric descriptions on each of the maps where it is present.

According to our idea, node identifiers will henceforth contain, in themselves, all the necessary information about their coordinates. This immediately reveals a number of errors in the data.


Node type


The bulk of the nodes is converted into literal values of way-type trajectories. Several outcomes are possible here.


After the conversion is complete, besides a set of "defective" trajectories we may also end up with an array of "unneeded" nodes that belong to no trajectory. This does not yet mean an error in the data: such nodes may belong to relation-type containers or be independent database elements with descriptions of their own. Only if no such connections are found should these nodes be removed as "garbage".
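
This sweep reduces to plain set arithmetic; a sketch, with assumed container shapes:

    # A node is garbage only if no way uses it, no relation holds it,
    # and it has no description of its own.
    def find_garbage_nodes(all_nodes, ways, relation_members, described):
        used = set()
        for node_ids in ways.values():  # ways: {way_id: [node_id, ...]}
            used.update(node_ids)
        used |= relation_members        # node ids held by relations
        used |= described               # node ids with their own EAV rows
        return all_nodes - used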

Oddly enough, when converting way-type trajectories into the new data format, not a single defective one was found! This testifies to the highest data quality, assured, of course, by the OSM software: people cannot work that flawlessly! With other data, things are much worse. In particular, 312057 way-type trajectories were found that not only have no descriptions but are referenced by nothing at all. What to do with them? Throw them away? Or mark them as defective and sic the volunteers on them? They can still be drawn on the map as lines, after all. All right, let them be for now: the dummy nodes will be discarded, and the dummy trajectories will stay. In general, though, an object without a description is an anomaly, an error. And there are not so few such objects in the database: way-trajectories without descriptions alone number more than 12 million!

Amusingly, it turned out that there is not a single trajectory in Monaco or in the Azores that is not duplicated in some other file. That is understandable: try drawing a map of France without catching Monaco! And the Azores are, of course, Portugal. It is equally clear why Antarctica, Cuba, Madagascar and the Maldives have not a single duplicate in other cuts. But the trajectory with ID rel = 148838 is duplicated in as many as 64 files! It is not hard to guess that it is called the "administrative boundaries of the United States".

In most cases the data duplicated in different files coincide completely. But there are differences! Which means these duplicates are physically distinct records that live their own lives and require synchronization of changes (which is not always possible). Incidentally, adopting the proposed coordinate format removes errors of this type automatically.

How are we going to fix these errors? Very simply! We ignore references to elements missing from the database and bluntly merge the data from the different cuts into one: either they absorb each other completely (no mismatch errors), or we get a slight "jitter" when drawing the trajectory (if the copies merely resemble each other). If they are completely different, no matter: we have already marked the trajectory as defective.

All these checks, although rather cumbersome to execute, have very simple and obvious algorithms: essentially they are uniqueness checks on certain identifiers (ignoring the values of the information fields), sometimes combined with intersections or unions of certain sets. The last check is a little more involved: it tests the uniqueness of the trajectory identifier together with the node's sequence number within the trajectory (again ignoring values).
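
Sketched in Python, assuming the rows are already projected down to their key fields:

    from collections import Counter

    def duplicated(keys):
        """Keys occurring more than once: the basic uniqueness check."""
        return {k for k, n in Counter(keys).items() if n > 1}

    # Basic check: duplicated(node_ids_read_from_the_dump)
    # Trickier check: duplicated(rows), where each row is a
    # (trajectory_id, sequence_number) pair read from the dump.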

Obviously, the main (and possibly the only) value of relation-type trajectories is that they work as containers of child elements. The attempt to push edge information into them (if we view the database as a graph) via the shoehorned role attribute has clearly failed: suffice it to say that the lion's share of member elements in these trajectories (61.5%) carry an empty value (role = ""). Database editors plainly do not know what to do with it either: one frequently finds descriptions in it (things that belong in the EAV data, e.g. URL) and outright garbage. The most common attribute values are outer, inner, forward, stop, from, to, via, whose information value tends to zero. There are also plenty of clearly descriptive values (house, platform, street, admin_centre, subarea). All right, let's set it aside for now; perhaps we will later fold a few of these into descriptions.

Scaling


In the new coordinate format, preparing maps of different scales is trivial: it consists merely of processing way-type data. Indeed, the location of point objects on a map of any scale is already contained in their identifiers, and relation-type trajectories (to say nothing of descriptions) do not care about the map at all: they simply group child objects by affiliation. The algorithm for generating trajectories for the next map scale is much simpler and faster than even the conversion of data into this format. Note that not only is a polygon essentially no different from any other line (just with identical first and last elements), but a single node is likewise a polyline of one element. Thus we have got rid of the node data type altogether: from now on all the "geometry" is represented solely by way-type elements, which, when necessary, are gathered into relation-type containers and carry descriptions in EAV format. Their identifiers are the last thing that [still] connects our database with the original OpenStreetMap database.

The algorithm for generating the full set of map scales is fully automated and looks like this: for the elements of the way-type trajectories of the current map scale, the last character is cut off, after which the resulting duplicates (running in a row) are removed. Then the trajectories that have collapsed to a point are extracted from the overall array, along with the group of "segments": trajectories in which exactly two nodes remain. Neither is carried over to the coarser-scale map (unless convincing arguments are found for keeping them on the map being formed). The remaining trajectories (unless convincing arguments are found for not carrying them over to the new scale) form the new map, and the process repeats. The set of maps is ready!
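
One pass of this algorithm, sketched under the assumption that a trajectory is stored as a list of equal-length coordinate codes as in the sketches above:

    def coarsen(trajectory):
        coarse = [code[:-1] for code in trajectory]  # cut off the last character
        deduped = [coarse[0]]
        for code in coarse[1:]:
            if code != deduped[-1]:                  # drop duplicates running in a row
                deduped.append(code)
        return deduped

    def next_scale(ways):
        points, segments, survivors = {}, {}, {}
        for wid, traj in ways.items():
            c = coarsen(traj)
            if len(c) == 1:
                points[wid] = c       # collapsed to a point: not carried over
            elif len(c) == 2:
                segments[wid] = c     # a bare "segment": not carried over
            else:
                survivors[wid] = c    # forms the coarser map; repeat from here
        return points, segments, survivors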

I know that you know that "it was smooth on paper"! The simplest (and knockout) question: "If scaling is so simple, why does no such set of maps exist yet?" And indeed it does not. We have so dashingly described (and even implemented) the algorithm: trajectories that collapse to a single node, and "segments", are not carried over to the next map scale. But in reality? Is it not obvious that St. Basil's Cathedral should be far more prominent on any map than any Khrushchev-era block or modern reinforced-concrete box, even a much bigger one? So the decision on the existence of any object, point objects included, must be made individually for the map of each scale, and the algorithms behind those decisions are not so simple. Say, some shriveled Uchkuduk with its "three wells" is a vitally important landmark in the "hot desert", but who needs it anywhere else? Nevertheless, the technical part of preparing maps of different scales really is as simple as described.

A point has the parameters "latitude" and "longitude" (among the desired properties it is proposed to add a third parameter: "height").

Oh no you don't! Are you going to store relief in raster format? Have you gone completely mad? Vector only! Besides, all these contour lines are fundamentally no different from any other lines as far as the underlying mathematics is concerned. And since there is no elevation in the database anyway, we do not need it yet. Our scheme is flexible, and relief (if it ever appears at all) can quietly be attached at any moment as a separate visibility layer.

And now let's do a little work on the EAV data. Our task is to cut off the most annoying garbage and to structure whatever can be structured at all. We process by data type, though the general principle is the same: we pull the data under the most popular tags out of the common heap and send it either to the dump or into some structure. And not even by tags, but simply by data! Who said, in fact, that the metadata lives in the tags? As they say, it should, but it is not obliged to. So we check.
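
What this triage could look like, as a sketch with assumed containers and an assumed popularity threshold:

    # Pull rows under the most popular keys out of the common EAV heap and
    # route each row to a structure or to the dump.
    from collections import Counter

    def triage(eav_rows, keep, min_count=1000):
        """eav_rows: (entity, key, value) triples; keep: keys worth structuring."""
        freq = Counter(key for _, key, _ in eav_rows)
        structured, dump = [], []
        for row in eav_rows:
            _, key, _ = row
            if key in keep and freq[key] >= min_count:
                structured.append(row)   # popular and wanted: structure it
            else:
                dump.append(row)         # long tail or garbage
        return structured, dump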

Here is an illustration (part of the description of the node with ID = 1027246737 under the tiger prefix):

tlid = 147661845: 147661846: 147661853
upload_uuid = bulk_upload.pl-9e2fc80e-9312-407e-9603-ca030ab40414

Tell me honestly: do you feel like digging through data like this in the hope of fishing out something useful? Neither do I. Especially since the sloppiness of the database's compilers has been demonstrated repeatedly. On the other hand, there is quite a lot of data (street_num, city_id, etc.) that is garbage for the end user but may be useful for structuring (identifying and correcting errors in the data).

So: seat belts fastened, bones counted... we dive! But not in this article.

Source: https://habr.com/ru/post/448752/
