
"Who is on the first base" - a new geographical reference from Mapzen

Small version




All the administrative units! For now everything is raw and rough around the edges! But that is only for now!

Big version


Mapzen is building a geographical directory of administrative units. Not all of them, but the overwhelming majority and, we hope, most of their types. A geographical directory is a large list of administrative units, each of which has a permanent identifier and a number of properties describing its location. It is interesting to think of the directory as a space where debates about administrative units are held, but not resolved. We call our directory “Who's On First”, or “WOF” for short.

According to Wikipedia, Who's on First is:
... a comedy routine made famous by Abbott and Costello. The premise is that Abbott is naming the players on a baseball team for Costello, but their names and nicknames can be taken as non-responsive answers to Costello's questions. For example, the first baseman is named "Who"; so, to the ear, "Who's on first" is ambiguous between a question ("Which player is on first base?") and an answer ("The name of the player on first base is 'Who'"). "Who's On First" descends from burlesque sketches of the last century that played on words and names. Examples include "The Baker Scene" (the shop is located on Watt Street, which sounds like "What Street") and "Who Dyed" (the owner is named Who). In the 1930 film Cracked Nuts, the comedians Bert Wheeler and Robert Woolsey study a map of a mythical kingdom with dialogue like this: "What is next to Which." "What is the name of the town next to Which?" "Yes." In English music halls (the British equivalent of vaudeville theaters) in the early 1930s, the comedian Will Hay performed a routine as a schoolmaster interviewing a schoolboy named Howe, who came from Ware but now lives in Wye.


The name neatly underlines one of the “problems” of geography. Of course, it would be easier if we could treat the world as nothing more than a set of coordinates. But we cannot, and the burden of the “administrative unit”, with everything that term implies, remains with us to this day.

Our directory is very far from complete, in terms of both data coverage and data quality, so do not expect too much from it in the near future.

We are publishing the data now because it is important to us not only to state our goals and intentions, but also to turn them into tangible results. Treat this blog post and today's data release as a reflection of the direction of our efforts, not the final destination.



Our directory is not the first (and, we hope, not the last). Many were created before us. The most notable of them are:



We use Quattroshapes as the basis for our dataset, since it has the most information about the most places over the longest period of time.

We supplement it with relevant geometry and metadata from the Natural Earth, GeoPlanet, GeoNames and Zetashapes projects, as far as their licenses permit.

People familiar with the problem are probably thinking: does the sparse coverage of certain types of administrative units (many districts and villages around the world) mean that the whole process will get off to a slow start? The answer is yes. One of our short-term goals is to find out which administrative units in the GeoPlanet dataset have no counterpart in Quattroshapes. Once we have that list, we will be able to import the names and hierarchies for these units from GeoPlanet, and data coverage will improve immediately. True, many of the imported units will come without coordinates, but we are confident that this problem will be solved in time.

We are not the first to attempt a comprehensive open dataset about the whole world. We see each such project as a unique contribution to a common cause and, by combining these contributions, we hope to start building something greater than their sum.

All data is available under the Creative Commons Zero license.

A number of the open sources we use require attribution. We have listed these sources here.

Huge version




The huge version is huge. It makes sense to pour yourself a cup of coffee, or maybe something stronger, if you plan to wander as deep into the weeds as we do.

In the end, what is a geographical directory?


As we said earlier, a geographical directory is a large list of administrative units, each of which has a permanent identifier and a number of properties that describe its location. It is interesting to think of the directory as a space where debates about administrative units are held, but not resolved.

The simplest and friendliest way to illustrate this idea is to consider how differently people can refer to the same place. Sometimes two people call a place by the same name but spell it differently. Recall how picky geocoders still are about the exact wording of a query, and how comical the results can be when a query is misspelled. Now multiply that by the number of languages in the world.

The easiest way to explain the purpose of the directory is to say that it solves the spelling problem. If every place has a simple (often numeric but, most importantly, permanent) identifier, then you and I can refer to the same place in any context by using that identifier, without losing time over linguistic subtleties.

For example, I might call a city "Montreal" (in English), you might call it "몬트리올" (in Korean), and someone else "Montréal" (in French) or "MTL" (a common abbreviation), and so on. It would be great if the directory were the space in which all these representations of one administrative unit (or, if you prefer, of one shared hallucination) could coexist.

The more complex idea, that the directory exists for debates, comes from the fact that the “administrative unit” is a hard problem: it often becomes a matter of dispute on a social, political and frequently very emotional level. This problem is not new. People have argued, and at times fought, over who a territory belongs to and where its borders lie for as long as anyone can remember.



After so many years, different minds within one book, one paragraph, even one sentence still cannot, do not try to, or do not want to agree on a given territory. Perhaps this is... futile, but I want to believe that some list of key-value pairs, no matter how complete, created in an attempt to capture all the nuances of such a territory, will help to resolve a long-running dispute.

Basic principles


The directory rests on a number of basic principles:

Mapzen has an opinion


It is important that Mapzen has an opinion, not about each specific administrative unit, but about the nature of a unit as such. This is a necessary step: it marks out our boundaries and gives us an understanding of what our project is and, critically, what it is not.

Reflect all points of view that fall within the project boundary.


The world is a complicated place, and we would like the geographical directory to be a kind of stage for opinions about it, some of them contradictory. We intend to reflect as many opinions or decisions about a given unit as we can, for applications and for users. How this will play out in practice remains to be seen, but it is the goal we have set ourselves.

Portability


The canonical source of information about an administrative unit is a GeoJSON text file with a unique 64-bit numeric ID. All computers can operate with text files and numbers. Text files can be viewed or corrected in any old text editor. Text files can be printed on the printer. Numbers are quickly and simply indexed by databases.

We use text files because three things matter especially for our data: ease of use, reliability and portability over time. The benefits of the good old text format outweigh the benefits of the alternatives.

For example, Google's Protocol Buffers are great, but they require installing a lot of other Google software. ESRI's Shapefiles are just as wonderful, and their prevalence and long history confirm the convenience of the format; however, they too need special software for even a small edit.

This does not mean that text or static files are always the best choice. It all depends on the task at hand and, where necessary, we will convert the data into a lighter and more convenient format, but you will always have access to the plain text files.
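
To illustrate how little is needed to work with such a file, here is a minimal Python sketch that uses only the standard library. The record is a hypothetical, heavily trimmed example (the name and coordinates are placeholders, not real WOF data), and the function name and the exact range check are assumptions made for illustration.

```python
import json

# Hypothetical, heavily trimmed record; real files carry many more properties.
RAW = """
{
  "type": "Feature",
  "id": 101736545,
  "properties": {"wof:id": 101736545, "wof:name": "Example place"},
  "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}
}
"""

def load_record(text):
    """Parse one GeoJSON Feature and check that its ID fits in 64 bits."""
    feature = json.loads(text)
    wof_id = feature["properties"]["wof:id"]
    # Assumption: IDs are non-negative integers that fit a signed 64-bit field.
    if not isinstance(wof_id, int) or not (0 <= wof_id < 2**63):
        raise ValueError(f"not a valid 64-bit ID: {wof_id!r}")
    return feature

record = load_record(RAW)
print(record["properties"]["wof:id"], record["geometry"]["type"])
```

Nothing here depends on a GIS library, which is the point of the plain-text choice.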

GeoJSON


We use GeoJSON as the primary exchange format for two complementary reasons:



Tell me more (complex things)




If for now you are only interested in simple things (such as names, geometries and the necessary minimum of related properties), then skip ahead.

Concordances


When dealing with other directories (and we want to interact with all of them: obsolete, current and planned), a good way to start is to look for concordances between them.

A concordance in this case is the basis for asserting that, for example, “their Boston” and “our Boston” are one and the same. The details each side records may be completely different, because they serve other tasks and views. Having different points of view is good. Concordances let anyone work on the things that interest them while taking the work of others into account, and they provide a mechanism for interaction.

Each WOF record has a wof:concordances property, a set of key/value pairs that serve as pointers to the same object in other databases. For example:

 "wof:id": 101736545, "wof:concordances": { "fct:id": "03c06bce-8f76-11e1-848f-cfd5bf3ef515", "gn:id": "6077243", "gp:id": "3534" } 


At the time of publication, we have concordances with GeoNames (159,359 objects), GeoPlanet (135,399), Quattroshapes (115,550), Factual (80,973), various airport classifiers (ICAO, IATA, FAA and OurAirports), Wikipedia (for now only for airports) and even with the countries in Mapzen Borders. More are on the way.
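
One way an application might use concordances is to build a reverse index, so that a foreign identifier (say, a GeoNames ID) can be turned back into a WOF ID. The sketch below is only an illustration: the function name is hypothetical, and the sample record reuses the values from the example above.

```python
def build_concordance_index(records):
    """Map (source key, foreign ID) -> wof:id for a list of property dicts."""
    index = {}
    for props in records:
        for key, foreign_id in props.get("wof:concordances", {}).items():
            # Foreign IDs are stored as strings so lookups stay uniform.
            index[(key, str(foreign_id))] = props["wof:id"]
    return index

records = [
    {
        "wof:id": 101736545,
        "wof:concordances": {
            "fct:id": "03c06bce-8f76-11e1-848f-cfd5bf3ef515",
            "gn:id": "6077243",
            "gp:id": "3534",
        },
    },
]

index = build_concordance_index(records)
print(index[("gn:id", "6077243")])  # → 101736545
```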

Types of administrative units


For any hierarchy of administrative units, we have defined three "classes", and every type of unit belongs to one of them. This does not mean that there can be no other classes (or types of administrative units). We just decided to start with this set.

Common (C)


These units are common to all hierarchies and all administrative units in Who's On First.

An important point: this means that any object must have one or more common ancestors of this class (for example, a country, a continent or sometimes just the planet Earth). This does not rule out additions to the hierarchy that are specific to a particular place, as long as they fit into the existing common hierarchy.

Common-optional (CO)


Units of this class are expected to be part of the general hierarchy, but may be missing because they are not relevant or because we do not have the data. An example of this class is the "county".

Optional (O)


These parts of the hierarchy are specific, usually to a particular country or region. For example, the many nested departments in France or Germany. The only rule is that optional (O) types must sit somewhere within the common (C) hierarchy.

The minimum list of types of administrative units for the most general hierarchy looks like this:

 - continent (C)
   - country (C)
     - region (C)
       - "county" (CO)
         - locality (C)
           - neighborhood (C)


A more detailed version might be:

 - continent (C)
   - empire (CO)
     - country (C)
       - region (C)
         - "county" (CO)
           - "metro area" (CO)
             - locality (C)
               - macrohood (O)
                 - neighborhood (C)
                   - microhood (O)
                     - campus (CO)
                       - building (CO)
                         - address (CO)
                           - venue (C)


Venues! Buildings! Neighborhoods! Empires!

So many new types, and that is not all. What you see here is only the common skeleton. On GitHub there is a whole repository dedicated to types, including a discussion (and canonical links) for each type listed above.
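
The class system can be sketched as a small lookup table. The example below copies the minimal list above and reports which common (C) ancestors a record lacks. Treating every C type above a place as required is a stricter reading than the text demands (it only asks for "one or more"), so take that rule, and the function name, as assumptions.

```python
# Placetypes and classes copied from the minimal list above, outermost first.
MINIMAL_PLACETYPES = [
    ("continent", "C"),
    ("country", "C"),
    ("region", "C"),
    ("county", "CO"),
    ("locality", "C"),
    ("neighborhood", "C"),
]

def missing_common_ancestors(placetype, ancestors):
    """Return the common (C) placetypes above `placetype` not in `ancestors`."""
    order = [name for name, _ in MINIMAL_PLACETYPES]
    depth = order.index(placetype)
    return [
        name
        for name, cls in MINIMAL_PLACETYPES[:depth]
        if cls == "C" and name not in ancestors
    ]

# A locality with no region is flagged; a missing county (CO) is tolerated.
print(missing_common_ancestors("locality", {"continent", "country"}))
# → ['region']
```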


Hierarchies


Hierarchies in Who's On First are represented as a list, each element of which is a dictionary containing one complete hierarchy. Like this:

 "wof:hierarchy": [ { "country_id": "85633147", "region_id": "85683255", "county_id": "102072387", "locality_id": "101750223", "neighbourhood_id": "85794581" } ] 


This is because a place in Who's On First may be part of several different hierarchies. Take, for example, a type such as the metro area (the "San Francisco Bay Area" in and around San Francisco, or "New York", which includes all five boroughs and even parts of New Jersey, and so on), which often spans several parent units of the same type. Disputed territories are another case. Why and how we came to this decision is a topic for a separate article, but in short:

Why this solution is good:

  • it is explicit
  • it is easy to compare multiple hierarchies
  • it does not require users to rack their brains to reconstruct the full hierarchy, or to support whatever "insight" strikes us next
  • it is easier to make changes during development (... we say, before the "official" launch)


Why this solution is bad, or seems bad:

  • if we support metro areas, then many other units (neighborhoods, districts, venues) may have several hierarchies wherever a unit extends beyond a single parent unit
  • file size, disk space and bandwidth all suffer as a consequence of the first item; like whitespace and coordinates with more than 6 decimal places in GeoJSON files, this can quickly get heavy
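
To make the trade-off concrete, here is a small sketch of how a consumer might flatten the list of hierarchies into "every ancestor per type". The function name is hypothetical, and the second hierarchy in the sample (with county ID "0") is invented purely to show a place with two parents of the same type.

```python
def ancestors_by_type(props):
    """Collect every ancestor ID per placetype across all hierarchies."""
    found = {}
    for hierarchy in props.get("wof:hierarchy", []):
        for key, value in hierarchy.items():
            # "county_id" -> "county"; keys without the suffix are kept as-is.
            placetype = key[:-3] if key.endswith("_id") else key
            found.setdefault(placetype, set()).add(str(value))
    return found

place = {
    "wof:hierarchy": [
        {"country_id": "85633147", "region_id": "85683255", "county_id": "102072387"},
        # Hypothetical second hierarchy: same country and region, different county.
        {"country_id": "85633147", "region_id": "85683255", "county_id": "0"},
    ]
}

shared = ancestors_by_type(place)
print(sorted(t for t, ids in shared.items() if len(ids) > 1))  # → ['county']
```

Because each hierarchy is stored in full, no tree walking is needed to answer this kind of question, at the cost of the repetition noted above.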



Disputed territory


Although all regions and many localities are “disputed” at the level of friendly banter, disputes over some territories take a very serious turn, since they involve two or more states (and sometimes so-called non-state actors in international law). Such disputes carry the risk of violence and of consequences far removed from "friendly banter".

For as long as the dispute lasts, we assign such territories the disputed type. Disputed territories, by definition, have two or more parent states in their hierarchy. This approach does not reflect every fact of the current situation in each dispute. On the other hand, it lets us single out the parties to the dispute and, as we said above, decide how to reflect the controversy in the context of a given task.

Parental IDs and Parental Rights


Even though a territory can belong to different hierarchies, we assume that in most cases it is de facto “controlled” by a single party. For example, the Golan Heights are claimed by both Syria and Israel, which is reflected in the hierarchy, but they are currently under the control of Israel.

 "wof:hierarchy": [ { "continent_id": "102191569", "country_id": "85632315", "disputed_id": "85632221" }, { "continent_id": "102191569", "country_id": "85632413", "disputed_id": "85632221" } ], "wof:id": "85632221", "wof:name": "Golan Heights", "wof:parent_id": "85632315", 


In some cases we cannot say for sure who controls a territory, or we are not certain because the dispute started recently and we are still checking the data. Then we set the parent ID to -1.

Sometimes we also assign -2. This should be read as ":shrug: the world is a complicated place." For example, the Baikonur cosmodrome in Kazakhstan.
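
An application reading these records needs to treat the two sentinel values differently from ordinary IDs. The following sketch reuses the Golan Heights example above; the function names and the short status labels ("unknown", "complicated", "known") are my own paraphrases, not WOF terminology.

```python
golan = {
    "wof:hierarchy": [
        {"continent_id": "102191569", "country_id": "85632315", "disputed_id": "85632221"},
        {"continent_id": "102191569", "country_id": "85632413", "disputed_id": "85632221"},
    ],
    "wof:id": "85632221",
    "wof:name": "Golan Heights",
    "wof:parent_id": "85632315",
}

def parent_status(props):
    """Classify wof:parent_id: -1 and -2 are sentinels, anything else is an ID."""
    parent_id = int(props["wof:parent_id"])
    if parent_id == -1:
        return "unknown"      # not sure yet, or still checking the data
    if parent_id == -2:
        return "complicated"  # the "shrug" case
    return "known"

def claimants(props):
    """List the distinct parent countries found across all hierarchies."""
    return sorted(
        {h["country_id"] for h in props["wof:hierarchy"] if "country_id" in h}
    )

print(parent_status(golan), claimants(golan))
# → known ['85632315', '85632413']
```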

Supersedes / superseded by


One of the big and complex philosophical questions that arises when working with geography: How to distinguish a simple adjustment from a fundamental change?

This problem is not unique to geography, but geography is where it is most glaring. For example, Poland, France and Germany have spent a good hundred years taking and ceding (and sometimes taking back) the same territory. The boundaries, and the periods during which those boundaries existed, are important contextual information not only for cartography but for many other fields. Take works of art that were begun when a territory belonged to Germany but finished when it belonged to Poland. How would you consistently identify a place that keeps changing?

Another example. In New York City there is a not-quite-neighborhood called “BoCoCa”. BoCoCa is short for Boerum Hill, Cobble Hill and Carroll Gardens, three adjacent neighborhoods south of downtown Brooklyn. BoCoCa is not a name in the usual sense, and not a neighborhood, as most people see it. On the other hand, in plenty of maps and datasets it is a neighborhood (and a name). Whatever we think, BoCoCa "exists".

In Who's On First, we made BoCoCa a “macrohood” that contains the three neighborhoods from which its name is derived.

The type of an administrative unit is a very important property, and it is obviously used by various applications. We do not need to know how or why applications handle the properties of a place. So if we once decided to treat WOF ID 85892915 as a neighborhood (which it was when imported from Quattroshapes), we probably should not change that lightly, on a whim.

True, we do not consider BoCoCa a neighborhood. We have a firm opinion about this. BoCoCa was recorded as a neighborhood, but from our point of view that is no longer the case. Our way of settling a problem like this is to create a new record with a new type (BoCoCa as a macrohood) and update the other records to indicate that one has been replaced by the other.

For example, the record for BoCoCa as a neighborhood looks like this:

 "wof:superseded_by": [102147495], "wof:supersedes": [], 


While the record for BoCoCa as a macrohood looks like this:

 "wof:superseded_by": [], "wof:supersedes": [85892915] 


Applications need to decide how (and whether) superseded records should be handled. A search engine, for example, might rank superseded records separately or exclude them from processing altogether.
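
One possible policy is to always resolve an old ID to the record that currently replaces it. The sketch below follows wof:superseded_by links using the two BoCoCa records above. The function name, the hop limit and the choice to follow only the first entry of each list are assumptions made for illustration.

```python
def resolve_current(wof_id, records, max_hops=32):
    """Follow wof:superseded_by links until reaching a record nothing replaces."""
    seen = set()
    current = wof_id
    while records[current].get("wof:superseded_by"):
        if current in seen or len(seen) >= max_hops:
            raise ValueError(f"supersede chain loops or is too long at {current}")
        seen.add(current)
        # Assumption: when several successors are listed, follow the first.
        current = records[current]["wof:superseded_by"][0]
    return current

records = {
    85892915: {"wof:superseded_by": [102147495], "wof:supersedes": []},
    102147495: {"wof:superseded_by": [], "wof:supersedes": [85892915]},
}

print(resolve_current(85892915, records))  # → 102147495
```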

Breaches


Each record has a wof:breaches list. At the time you read this article, most of these lists may still be empty. A "breach" occurs when the geometry of one unit intersects the geometry of another unit of the same type.

These lists serve as a signal to Who's On First users, both that there are errors in the data (as a rule, countries do not overlap their neighbors) and that there is a difference of opinion about the boundaries of a territory (for example, a region).

Like many other signals, its value, importance, and processing method are left to the end applications.

Tell me more (simple things)




Remember what is meant by "simple things" ...

Names


All names are originally taken from the Quattroshapes and Natural Earth sets. However, GeoPlanet (GP) is generally better in terms of multilingual and colloquial names.

GP attaches two properties to each name:

  1. An ISO 639-3 language code
  2. A name "type" drawn from a well-defined list of descriptions compiled by the excellent people at GP:


The Name_Type field is a single letter code that takes the following values:

  • P - preferred name in English
  • Q - preferred name in other languages
  • V - common (but unofficial) version of the name (for example, “New York City” for New York)
  • S - synonym or colloquial name (“Big Apple” for New York)
  • A - abbreviation or code for an administrative unit (“NYC” for New York)


GP also distinguishes between a name and an alias, so in their world you can find the following:

 Name: Montréal
 Language: FRE
 Alias (ENG_P): Montreal
 Alias (KOR_Q): 몬트리올


GP does not take into account that some countries have several official languages. We thought about all this and decided:



For example:

 { "wof:lang": ["eng", "fre"], "name:eng_p": "Montreal", "name:eng_a": "YMQ", "name:fre_p": "Montréal", "name:kor_p": "몬트리올" } 
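
The key pattern in this example (name:, then a language code, an underscore and a one-letter type) is easy to take apart in code. Here is a minimal sketch that groups the names by language; the function name is hypothetical, and it assumes each key holds a single string, as in the example above.

```python
def names_by_language(props):
    """Group name:<lang>_<type> properties as {lang: {type: name}}."""
    grouped = {}
    for key, value in props.items():
        if not key.startswith("name:"):
            continue
        # "eng_p" -> ("eng", "_", "p")
        lang, _, name_type = key[len("name:"):].rpartition("_")
        grouped.setdefault(lang, {})[name_type] = value
    return grouped

montreal = {
    "wof:lang": ["eng", "fre"],
    "name:eng_p": "Montreal",
    "name:eng_a": "YMQ",
    "name:fre_p": "Montréal",
    "name:kor_p": "몬트리올",
}

names = names_by_language(montreal)
print(names["eng"])  # → {'p': 'Montreal', 'a': 'YMQ'}
```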


Geometries


"Consensus" geometry



Each administrative unit will have one “consensus” geometry. The concept of "consensus" has not yet been defined. And in general, the use of this word is fraught with problems. It will be replaced by a more accurate term.

All “other” geometries



In addition, each unit will have an “alternate” file containing various named geometries. These are meant to hold disputed geometries, simplified geometries, or geometries optimized for specific tasks (for example, geocoding).

The main thing here is: THE SOURCE OF EVERY GEOMETRY.

The source of geometry, consensus (sic) or alternative, is included in each record. For example:

 { "src:geom": "zetashapes", "src:geom_alt": ["quattroshapes", "naturalearth"] } 


Centroids


Each record can have several centroids. Admittedly, "several centroids" sounds like an oxymoron. By "centroid" we mean the focal point of a geometry. Different centroids are distinguished by prefixes that indicate their intended use. For example:



Required minimum of properties


The Flickr API is designed according to the principle: “What is the minimum data set that the API should return to any request related to photos?”

The essence of Flickr's answer is its "standard photo response", namely: "The minimum set of data should make it possible to load or build a URL that points to the photo page on Flickr."

In Mapzen's case, the answer to the same question would be: "The dataset should make it possible to display the API response on a map."

For example, it should be possible to take a response from Pelias (or any other API), simply pass it to Leaflet as a GeoJSON layer and see the result on the map.
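
As a rough sketch of what "displayable on a map" implies, the Python below wraps one record in a GeoJSON FeatureCollection with a point geometry. The record, its coordinates, the function name and the set of required keys are all placeholders and assumptions; the text does not say which properties are mandatory.

```python
import json

# Assumed subset of properties to insist on; not taken from the text.
REQUIRED_KEYS = ("wof:id", "wof:name", "wof:placetype")

def to_feature_collection(props, lon, lat):
    """Wrap one record in a FeatureCollection with a point geometry."""
    missing = [key for key in REQUIRED_KEYS if key not in props]
    if missing:
        raise KeyError(f"missing properties: {missing}")
    feature = {
        "type": "Feature",
        "id": props["wof:id"],
        "properties": props,
        # GeoJSON coordinate order is longitude, then latitude.
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }
    return {"type": "FeatureCollection", "features": [feature]}

# Placeholder record and coordinates, for illustration only.
example = {"wof:id": 1, "wof:name": "Example place", "wof:placetype": "locality"}
collection = to_feature_collection(example, 0.0, 0.0)
print(json.dumps(collection, indent=2))
```

The printed JSON is the kind of object a Leaflet GeoJSON layer accepts without further conversion.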

Given all this, the “minimum set of properties” might look like this:

 { "wof:id": 85922583, "wof:name": "San Francisco", "wof:fullname": "San Francisco, California US", "wof:placetype": "locality", "wof:parent_id": 85688637, "wof:quality": 9, "wof:score": 100 } 


A few words about the example above:



Future magic (under development)


What follows is a list of properties that are not currently supported, or whose support is still so raw that it is simpler to say there is none. We mention these properties because they must (and will) be supported in the future.

Score(s)


Quality


How complete or reliable we believe the data in this record to be.

Coverage


By “coverage” we mean the number of attributes a record has. A record can have a wonderful set of official and alternative names, but very little metadata (population, elevation and so on).

Dates


When was the administrative unit established or, in some cases, abolished? This information becomes especially important in a dataset where one record can supersede another.

In general, dates are a rather messy space, and we intend to start with simple forms and gradually add complexity to describe historical and modern realities. The Library of Congress is working on an Extended Date/Time Format (EDTF); it is worth a look if this kind of thing interests you.

Wait... so where can I get this data?




The first (and very, very, very important) thing we ask you to understand is that Who's On First is still in development, which means that:



The goal of the current release was not to trumpet the dawn of a new era of excellent data, but to back up in public what we have been saying, to have a dataset that can confirm or disprove our hypotheses, and to learn from the practice of working with this data.

If you do not have the time or the temperament (personal or collective) to push through the rough patches with the same slightly reckless zeal, then it is probably too early for you to dive into our data. We plan to keep you informed and to take part in open discussions about the project, so follow the blog and let us know what needs to be improved.

The raw data in the GeoJSON format is in two places: the AWS S3 public access point and the GitHub repository with a bunch of small files. Their URLs, respectively:



Note: there is no point in opening the S3 link above in a browser, since it is an access point and only people who know how to work with S3 will be able to do anything with it. If those words sound like gibberish to you, do not click the S3 link. Go straight to the GitHub link.

There is no publicly available tool for browsing the data. We have an internal "spelunker" tool whose code we plan to open (along with libraries for working with Who's On First data), but that has not happened yet.

The GitHub repository also has “meta” files, collected in a directory cleverly named meta. These are mostly CSV files with a minimum of data about administrative units of a given type. Like everything else, the meta files are in development, but they offer a minimal way to browse the data without loading the entire dataset into a database.
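
Browsing such a CSV needs nothing beyond the standard library. In the sketch below the column names (id, name, parent_id) are hypothetical, since the text does not list the actual columns of the meta files, and the sample row is built from IDs that appear elsewhere in this article.

```python
import csv
import io

# Hypothetical column layout; the real meta files may differ.
SAMPLE = "id,name,parent_id\n85922583,San Francisco,85688637\n"

def load_meta(fileobj):
    """Index the rows of a meta CSV by integer ID."""
    return {int(row["id"]): row for row in csv.DictReader(fileobj)}

meta = load_meta(io.StringIO(SAMPLE))
print(meta[85922583]["name"])  # → San Francisco
```

With a real file, the io.StringIO wrapper would be replaced by an open file handle.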

Did you say... "venues"?


Yes. We have not included venues in this release, but we are working on them. Venues make up a very large part of Who's On First, but they are either complex or numerous, and often both. So we will move from the simple to the complex.

A few words about Git (and GitHub)


We do not recommend getting too attached to the Who's On First data on GitHub (or to Git in general). Right now we have little idea what the best way is to distribute data and accept corrections and suggestions from the community at the same time.

Although the good people at GitHub continue to do excellent work making Git easier to use, in reality Git remains a barrier for many people. In the absence of a more formal decision on an alternative, GitHub at least makes it possible to outline the basic requirements:



And once again: do not get too attached to Git when working with Who's On First data. We needed it to demonstrate the idea of the project.

What's next?



There is still a lot of work to do.

In more detail: releasing tools and libraries for working with Who's On First; releasing the internal "spelunker" web application that we use for digging through and validating data; building prototype services on top of our data; finishing (and in places starting) the documentation for everything above; and fixing all the bugs.

Stay tuned!

Image author - Aaron Cope.

Source: https://habr.com/ru/post/265661/

