Concepts of natural language versus formal classifications in OpenStreetMap

Anyone at least somewhat familiar with the OpenStreetMap project has probably heard of two principles on which it is based: “any tags you like”, and the idea that the content of the cartographic database is primary, not the way the Standard style on osm.org renders that database. But given the first principle, is everything really so rosy with the semantic structure of this database? After reading the Russian-language section of the OSM forum, I decided to look into the situation and describe it here.

First, some history and facts. The OSM project originated in the UK, so the main language for tags, which in most cases are just words, is British English. That is why the tags for a sports centre and a neighbourhood are written leisure=sports_centre and place=neighbourhood respectively. German words are also used: for example, one of the (non-recommended) values for types of megalithic structures is megalith_type=grosssteingrab, from the German Großsteingrab (dolmen).

Traditionally, a tag consists of a key (what is to the left of the equals sign) and a value (what is to the right). This seems to suggest the principle that a key corresponds to a class of objects or a property, and a value to a specific object type or property value. Sometimes keys and values use namespace syntax. This is most often done when several tags form a so-called tagging scheme, in which the general designation of an object is supplemented with properties specific to it. For example:
social_facility=day_care
social_facility:for=senior
Such a pair of tags means "social facility, day care, for the elderly." This approach works very well, because the namespace named after the root key can hold any number of other keys corresponding to qualifying properties.
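
As a rough illustration of why namespaced keys are easy to process, here is a minimal Python sketch that groups an object's tags by root key. The function name group_by_namespace and the output layout are assumptions made for this example, not part of any OSM tool.

```python
def group_by_namespace(tags):
    """Group tags by root key; 'root:sub' keys become qualifiers of 'root'.

    Hypothetical helper for illustration only.
    """
    grouped = {}
    for key, value in tags.items():
        # Split only on the first colon: 'social_facility:for' -> root, sub
        root, _, sub = key.partition(":")
        entry = grouped.setdefault(root, {"value": None, "qualifiers": {}})
        if sub:
            entry["qualifiers"][sub] = value
        else:
            entry["value"] = value
    return grouped


tags = {"social_facility": "day_care", "social_facility:for": "senior"}
result = group_by_namespace(tags)
print(result)
```

Any number of further social_facility:* keys would land in the same qualifiers dictionary without changing the code.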

Another common method is qualifying tags without using a namespace. For example:
barrier=bollard
bollard=removable
This means: "an artificial barrier, a bollard, removable." From the point of view of natural language, “bollard=removable” is a rather absurd construction, but keep in mind that in OSM all these words stand for abstractions that should ideally be clearly described in the project Wiki (which does not always happen). The disadvantage of this approach is that there can be only one qualifying tag specific to the bollard, since in OSM you cannot assign two tags with the same key to one object. Non-specific qualifying tags, which are also used for other objects, can be added in any number: for example, this bollard can have a material and a height: material=concrete, height=0.7.
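
The one-value-per-key limitation can be sketched by modelling an object's tags as a Python dictionary, which is an assumption of this example rather than a description of how OSM stores data. The helper set_tag and the second value "foldable" are hypothetical and serve only to show the overwrite.

```python
def set_tag(tags, key, value):
    """Set a tag and return the value it displaced (None if the key was new)."""
    previous = tags.get(key)
    tags[key] = value
    return previous


bollard = {"barrier": "bollard", "bollard": "removable"}

# A second bollard-specific property cannot coexist under the same key:
# it replaces the first one.
displaced = set_tag(bollard, "bollard", "foldable")

# Non-specific qualifiers use their own keys, so any number can be added.
set_tag(bollard, "material", "concrete")
set_tag(bollard, "height", "0.7")

print(displaced)
print(bollard)
```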

So far everything seems quite logical and understandable. But, as you know, any good thing is easy to spoil. Obviously, in order to store data in a database while keeping it easy to parse, to select subsets by the required attributes, and to find very specific data and objects, the database must maintain reasonably well-defined data semantics. Otherwise it turns into weakly structured text. But remember that the OSM database, being a cartographic one, has to store information about the real world, which many people perceive “as is”, as indivisible objects, without singling out any particular properties in advance. People are simply used to talking about what they see. Usually, in large projects with large databases of objects, such as online stores, the typical use of the database is retrieving data for the user. In some cases this is a parametric search, in others a “smart” search that matches free-form queries against sets of properties of the objects (goods).

In OSM the situation is the reverse: the project participants contribute data to the database, and everyone does so to the best of their skills, including their ability to identify the essential features of objects. Combined with the “any tags you like” principle, which is meant to guarantee the extensibility of the tagging system and the project’s ability to store a variety of data for various needs, this reliance on everyday natural language sometimes leads, if not to catastrophic results for semantics, then at least to something that deserves the epithet “extreme uncertainty.”

Think about it and answer honestly: if you want to buy a particular product, will you look for a store where that product is sure to be, or one where it might be found only with some relatively small probability? There are probably few people who want to spend time running around places where the desired item might turn up only by chance. But imagine: in OSM there are tags that denote a store where it is unknown what is sold. One example is shop=kiosk. As its description shows, you can find anything there, from cigarettes to newspapers. Or you may find nothing of the kind. The only clear characteristic of such a store is its size, because you cannot even say for sure whether a kiosk is a small free-standing sales pavilion or is built into a building. And in some countries the word kiosk can simply mean a small store.

In fact, this tag simply migrated into the tagging scheme from natural language. For that you can “thank” a man named Etric Celine. As you can see, he writes quite honestly on his page on the project’s Wiki that he does not care about tag proposals (the formal procedure of proposing a tag and discussing it before acceptance) or any such order, but considers it very important that everyone “does at least something”. So he did something: he put into use a tag that means almost nothing. Do you know how many of these “shops selling who knows what” there are in the OSM database? Almost fifty thousand. And many people who have just joined the project and do not understand the importance of describing the properties of objects hang this tag on any small sales pavilion when they do not know a better designation for it, although such designations exist for tobacco shops, for places that sell newspapers, for places that sell ice cream, and for many others. What guides them? They do not understand the importance of structured information in the database, but they know very well that they are used to calling such establishments “kiosks”.

To an experienced developer or database architect, a situation in which there are numerous designations such as “supermarket”, “kiosk” and “mall” but no universal means of describing the assortment may seem extremely strange, but that is the reality. Of course, there are tags for bookstores or do-it-yourself stores. But what a supermarket sells cannot be described in any way, however much you want to. The difference between a supermarket and a mall, by the way, is also the subject of long disputes, because it is impossible to draw a clear line: although a mall is, by definition, “a building housing many shops, entertainment venues, cafes and restaurants”, a supermarket can also have space inside for tenants. So at what point does a supermarket turn into a mall, and, most importantly, does this distinction matter at all?

Many common tags lack clear definitions and limits of applicability. For example, it is impossible to formulate a clear difference between a restaurant and a cafe. The difference is not size, not table service, not the range of dishes, not opening hours, not the way visitors are seated, not the need to reserve a table, not prices, and not anything else. These are just the words “cafe” and “restaurant”. Of course, in some extreme cases the word “cafe” clearly does not fit a very high-class establishment. But where is the clear boundary? It does not exist. Therefore, assigning the tags amenity=cafe and amenity=restaurant is a fuzzy procedure that, strictly speaking, contradicts another important principle of OSM: verifiability. This principle states that any designation entered into the database should be such that another project participant could confirm it, that is, would unambiguously tag the object in the same way. The presence of the word “restaurant” in the name of a place is not a criterion, because it is Russian that happens to have the borrowed words “cafe” and “restaurant”. What should be done with the Czech hospoda or the Polish tawerna? Nothing, because the approach “every word (linguistic concept) must have its own tag” is wrong. Using abstraction as a thinking tool, you need to find the properties that objects share and those in which they differ, and then tag these properties, disregarding habit. That also makes things easier for the user of a map or directory based on OSM data: they will not have to guess what the mapper meant by marking something as a restaurant rather than a cafe. Parametric search, or showing the desired properties in a list, is definitely more user-friendly than dumping a list of all places to eat with almost no explanation.
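
To make the idea of parametric search concrete, here is a minimal Python sketch that filters places by individual properties instead of by the words “cafe” or “restaurant”. The sample data is invented, and the property keys (cuisine, reservation, outdoor_seating) are used here only as illustrative assumptions.

```python
# Invented sample data; keys and values are illustrative assumptions.
places = [
    {"name": "A", "cuisine": "coffee_shop", "reservation": "no", "outdoor_seating": "yes"},
    {"name": "B", "cuisine": "russian", "reservation": "required", "outdoor_seating": "no"},
    {"name": "C", "cuisine": "coffee_shop", "reservation": "no", "outdoor_seating": "no"},
]


def find(places, **wanted):
    """Return the places whose tags match every requested key=value pair."""
    return [
        place for place in places
        if all(place.get(key) == value for key, value in wanted.items())
    ]


matches = find(places, cuisine="coffee_shop", outdoor_seating="yes")
names = [place["name"] for place in matches]
print(names)
```

The query never asks whether a place is a cafe or a restaurant; it only asks about properties that another mapper could verify on the spot.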

Sometimes attempts at classification are made, but natural language and everyday knowledge make it difficult to create a correct one. Almost from the very beginning of the OSM project, forests had two tags for specifying the type of trees: wood=coniferous and wood=deciduous (literally, “cone-bearing trees” and “trees that shed their leaves”). These two words, coniferous and deciduous, are everyday words in English, and people are used to contrasting them. In Russian the usual pair translates roughly as “needle-bearing” and “leaf-bearing”, which is somewhat more correct in terms of biology, but still not entirely. In fact, there are trees that shed their foliage seasonally and trees that do not (evergreens). Independently of that, there are trees with leaves and trees with needles. So there can be a tree with needles and cones that sheds its needles for the winter (European larch), or a tree with leaves that is evergreen (cherry laurel). There are also other, less common properties. Not long ago, the original scheme was replaced by one with two keys, one for the seasonal cycle of the leaves and one for their shape.
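
The two-key scheme can be sketched in Python as two independent properties that combine freely. The key names leaf_type and leaf_cycle and their values are given here as I recall them from the OSM Wiki, not from this article, so treat them as assumptions; the example trees are the ones mentioned above.

```python
from itertools import product

# Assumed key values, as recalled from the OSM Wiki (not stated in this article).
LEAF_TYPES = ("broadleaved", "needleleaved")
LEAF_CYCLES = ("deciduous", "evergreen")


def tags_for(leaf_type, leaf_cycle):
    """Build the tag pair for one combination of leaf shape and seasonal cycle."""
    return {"leaf_type": leaf_type, "leaf_cycle": leaf_cycle}


# Two independent keys cover all four combinations,
# which the single wood=coniferous / wood=deciduous pair could not.
combos = [tags_for(t, c) for t, c in product(LEAF_TYPES, LEAF_CYCLES)]

larch = tags_for("needleleaved", "deciduous")        # European larch
cherry_laurel = tags_for("broadleaved", "evergreen")  # cherry laurel

for combo in combos:
    print(combo)
```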

Another case where the authors of tags lacked sufficiently rigorous knowledge of the subject area, which gave rise to vague and contradictory descriptions, is that of towers and masts, which are values of the OSM key for man-made objects. In structural engineering, the field that covers all types of stationary man-made structures, towers are tall vertical structures that stand solely by resting on their own foundations. Masts are those held up by guy wires, each of which is attached to an anchor. So everything is quite simple, and moreover this classification is international. But in other technical fields these terms may be used differently. In power engineering, for instance, a “mast” can be something that a builder would call a tower. The result is that in OSM these tags are assigned to man-made objects quite loosely.

The most curious situations (and, if such practices spread, unpleasant ones, because of semantic divergence, that is, the drifting apart of the meanings of tags) arise when tags use words that have completely different meanings in different languages. A recent example is the proposal by one Russian-speaking participant to introduce a tag for places where you can get a “business lunch”. The curious thing is that probably only in Russia does “business lunch” (a term coined sometime in the 1990s, when everything with the prefix “business” sounded more respectable) mean a set of dishes at a fixed price that is available at certain times of the day. In the rest of the world this is called by the French table d'hôte, prix fixe, or another local term, whereas business lunch in any case means something associated with negotiations over a meal, not a particular type of restaurant service. Of course, the words used in tags are conventions. But they should be understandable to the other project participants, at least to the extent that there is no doubt about which subject area a tag belongs to. Therefore, adopting designations that will mislead any English speaker who is not from Russia is unacceptable.

The reverse situation is more common. Borrowed words often change their meaning, so those for whom the culture of English-speaking countries is a dark forest frequently make mistakes by interpreting tags according to the meaning of a similar-sounding borrowed word rather than what the word means in the original. Similar-sounding words can also exist in different languages independently. Thus, Russian-speaking participants of the OSM project are often misled by the highway=alley tag. The English word alley sounds similar to the French allée and the Russian word borrowed from it. The Russian word therefore means the same as the French one: a road for walking, lined with trees. The English alley is usually a narrow service passage along the side or rear wall of buildings, or a passage between private plots of land arranged in one or two rows. This word is closer to the Russian for “back lane”. But inexperienced participants often try to tag a tree-lined avenue as highway=alley.

Even within the English-speaking community, agreement is not always reached, because of cultural differences. For example, a typical American drugstore sells, besides drugs, a range of household goods, cosmetics, food and drinks. And the prescription counter may, unexpectedly, be inside a supermarket. The British idea of a pharmacy is somewhat closer to what residents of Russia are used to.

Another example is the use of words such as cabin and hut as values of the key building=*. According to the key, these tags should indicate the type of building. However, there is no clear difference between them. What there is instead are associations with purpose. For example, an American is more likely to associate a cabin with something like a summer house, that is, a small private or rented house for outdoor recreation. A resident of Norway, seeing the word hut, may recall the shelters belonging to the tourist association Den Norske Turistforening, which its members are entitled to use. A similar association may arise for a German-speaking Swiss, only with the mountain shelters of the Schweizer Alpen Club. That is, people may well compensate for the lack of a clear definition of the building type by associating the word with the building's purpose. And in Russia, until recently, these tags could be used for a traditional log hut, which is wrong, because it can be described as “a building built of logs” using the tags building=yes, material=log.

Certainly, the semantic chaos that stems from natural language does not dominate the project, although it is quite noticeable. There are strong and successful precedents of attempts to create reliable, consistent and clear classifications that replace vaguely defined tags with sets of keys, each responsible for its own individual property. One of these schemes, quite well known but not yet officially approved, is Healthcare 2.0. It was created to provide tools for describing medical facilities of various types without the ambiguities inherent in, for example, the amenity=doctors tag. With it, one can describe both a large hospital and a private doctor's office. One of the Russian participants in the project did quite a lot of work on a scheme for describing the condition of forests. Unfortunately, it went no further than a description placed on the outdated tags page.

New schemes that are well thought out turn out to be practically independent of cultural and linguistic context. The most that may be required is the addition of one or two values for a key. For example, Healthcare 2.0 makes it possible to describe Russia-specific medical institutions such as a feldsher's station, although its authors had no idea such an institution existed. This is the power of using elementary properties that can be freely combined.

The saddest thing in this situation, it seems to me, is that even many experienced project participants either do not understand this problem, or understand it but claim that it is insignificant, or that solving it would significantly raise the entry threshold and scare away the proverbial newcomers (whom they love to talk about, attributing to them whatever qualities suit their argument).

Practice shows that, first, when a new, more specific scheme appears, people readily start using it, since they can now tag what was previously impossible or inconvenient to tag but worth tagging. Second, OSM exists to create a map of the world that is as faithful to reality and as complete as possible while remaining free to use, not to create a hobby club (although that is not a bad thing either). And if someone finds it difficult to grasp the principle behind fairly obvious tagging, but easy to apply vague tags thoughtlessly, what contribution can that person make to creating quality data?

Source: https://habr.com/ru/post/269733/
