
How Yandex made the most detailed Russian-language map of the world

Today, Yandex announced a major update to Yandex.Maps. The service now offers a detailed world map, with coverage down to individual buildings and with routing. All the main names on it are shown in two languages at once: the local language and Russian.
In addition, Yandex now fully owns its maps of Russia, Ukraine, Kazakhstan and Belarus.

The service now runs on a single platform that lets us maintain and independently update any amount of data. Yandex cartographers update the maps of Russia, Ukraine, Belarus and Kazakhstan every month. The maps of all other countries, which are drawn by Yandex's partner Navteq, are updated every three months.

image
Navteq's most detailed coverage is of Europe and North America: all the main streets and buildings in cities, plus a detailed road network over which the service can build driving routes.

Most toponyms on the Navteq maps were written in Latin characters, although for some languages (for example, Thai and Arabic) the original scripts were used. To make navigation easier for users, Yandex automatically translated the foreign names of cities and popular tourist sites into Russian. The task was to translate more than 7 million place names from 237 countries and 37 languages.

In this post we will describe in detail how we chose translation methods and used them in practice.

Today's news is the result of a year and a half of work. We realized long ago that relying only on data providers was a mistake and that we needed our own maps. The first was Moscow, drawn by our cartographers in 2011. Now we have our own detailed maps of Russia, Ukraine, Belarus and Kazakhstan. They are combined with detailed maps of the other countries from our partner, so that users can "cross" borders seamlessly. We had to not only stitch the country maps together but also organize storage and fast processing of all the world map data, so we completely rewrote the core of the service. In addition, we built our own software for making changes to the maps quickly, because electronic maps need frequent updates and the existing programs on the market cannot handle our data volumes. Yandex is now a cartographic company on a completely new level.

At first, to translate the toponyms on the world map, we thought of using Wikipedia, because it has many articles about localities with exact coordinates. We would simply take the title of the equivalent Russian article and get the translation we needed. There were indeed many articles, but nowhere near enough for our needs: this approach covered no more than 5-7 percent of toponyms. It became clear that we could only solve the problem ourselves, that is, by creating transcription rules for each of the languages used. Linguists solved this problem long ago, of course: any foreign-language textbook begins with reading rules. However, those rules are written for people, so we had to restate them in a form a machine could apply, which is not so easy. In addition, almost every rule has exceptions, which also had to be taken into account and tracked manually. Nor could we forget the well-established Russian forms of many toponyms. For example, if we transcribe the French name Paris into Russian by the rules, we get "Пари" (Pari). But in Russian this city already has a different, historically established name: "Париж" (Parizh).
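The priority described above, where an established name wins over rule-based transcription, can be sketched in Python as follows. This is a minimal illustration, not the code used in the service: the `ESTABLISHED_NAMES` table, `translate_toponym` and the `toy_transcribe` stub are all hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical table of historically established Russian names.
ESTABLISHED_NAMES = {
    "Paris": "Париж",
}


def translate_toponym(name: str, transcribe: Callable[[str], str]) -> str:
    """Return the established Russian name if one exists, else transcribe."""
    established = ESTABLISHED_NAMES.get(name)
    if established is not None:
        return established
    return transcribe(name)


def toy_transcribe(name: str) -> str:
    """Stub standing in for a real rule-based transcriber (assumption)."""
    toy_results = {"Paris": "Пари", "Lyon": "Лион"}
    return toy_results.get(name, name)


print(translate_toponym("Paris", toy_transcribe))
print(translate_toponym("Lyon", toy_transcribe))
```

Checking the exceptions table first means the transcriber is never even called for names such as Paris, so a rule change cannot accidentally overwrite an established form.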

image

The same goes for the Hudson River. In English the name is spelled Hudson, just like the surname of the famous Mrs. Hudson, Sherlock Holmes's landlady. The Russian spelling "Хадсон" (Khadson), used for her surname, is an example of phonetic transfer (transcription), while the historical "Гудзон" (Gudzon), used for the river, was more likely obtained by transliteration, that is, letter-by-letter transfer. There are quite a few such examples, since phonetic transfer was not always used when translating toponyms. In addition, some toponyms do not follow the reading rules of the most common language in their area, because they were named by other peoples living in the same territory.

image

How we translated

So, to cover the entire world map, we needed to convert the transcription rules for 37 languages (plus various variants and dialects) into a machine-readable form. The rules had to take context into account: how a particular letter or syllable is read very often depends on what surrounds it, up to and including relationships between words.

Once the rules are formulated, we create a Perl script that transcribes every string passed to it, taking all the rules and contexts into account. First, it breaks the source word into segments (groups of adjacent letters): separate runs of vowels and consonants. Then, for each segment, it selects the most probable transcription option. Naturally, this takes context into account: which segments lie to the left and right of the one being transcribed.
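The segmentation step can be sketched as follows. The original tool was a Perl script; this Python version is only an illustration, and both the `segment` function and the simplified vowel set `VOWELS` are assumptions made for the example. A real implementation would need a per-language vowel inventory and separate handling of spaces and hyphens.

```python
from itertools import groupby
from typing import List

# Assumption: a simplified set of Latin vowels; real rules are per-language.
VOWELS = set("aeiouy")


def segment(word: str) -> List[str]:
    """Split a word into maximal runs of vowels and runs of consonants."""
    runs = groupby(word.lower(), key=lambda ch: ch in VOWELS)
    return ["".join(run) for _, run in runs]


print(segment("Reuleaux"))
```

For "Reuleaux" this yields the segments r, eu, l, eau and x, which is the same split that appears in the rule listing.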

The following symbols are used for marking:

 -  
   -   
$ - end of word
^ - start of word
* - "empty" symbol: the letter is not pronounced
// - comment


:

 (0.95) *(0.03) (0.01) (0.01) //  
(^)r => (0.99) (0.01) //    1, ^ -  
r(eu) => (1.00) //    1
r(eu l) => (1.00) //    2


eu => (0.39) (0.21) (0.16) (0.14) (0.10)
(r)eu => (0.55) (0.18) (0.17) (0.10) (0.10)
eu(l) => (0.45) (0.30) (0.20) (0.05)
(^ r)eu => (0.25) (0.25) (0.17) (0.17) (0.16)
eu(l eau) => (1.00)
 
l => (0.86) (0.13) *(0.01)
(eu)l => (0.80) (0.20)
l(eau) => (1.00)
(r eu)l => (0.50) (0.50) //    2
l(eau x) => (1.00)
 
eau => (0.95) (0.02) (0.01) (0.01) (0.01)
(l)eau => (0.86) (0.14)
eau(x) => (1.00)
(eu l)eau => (1.00)
eau(x $) => (1.00)
 
x => (0.52) *(0.24) (0.21) (0.03)
(eau)x => *(1.00)
x($) => *(0.59) (0.38) (0.02) (0.01) //    1, 
(l eau)x => *(1.00) 


Result:

[r eu l eau x] => [р ё л о *] // final result
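The selection step behind a listing like this can be sketched in Python: for each segment, find the matching rule with the widest context and take its most probable option. This is a simplified illustration under several assumptions. The `RULES` table, its Russian outputs and its probabilities are made up for the example and are not the values used in the service. The order in which contexts are tried (wider before narrower, left before right) is also an assumption. The empty string plays the role of the `*` symbol.

```python
from typing import Dict, List, Tuple

Context = Tuple[str, ...]
Options = List[Tuple[str, float]]  # (Russian output, probability)

# Illustrative rules only: outputs and probabilities are assumptions.
# Key: (left context, segment, right context). "" means "not pronounced".
RULES: Dict[Tuple[Context, str, Context], Options] = {
    ((), "r", ()): [("р", 0.95), ("", 0.05)],
    ((), "eu", ()): [("э", 0.6), ("ё", 0.4)],
    (("r",), "eu", ()): [("ё", 0.7), ("э", 0.3)],
    ((), "l", ()): [("л", 0.86), ("ль", 0.14)],
    ((), "eau", ()): [("о", 0.95), ("у", 0.05)],
    ((), "x", ()): [("кс", 0.6), ("", 0.4)],
    ((), "x", ("$",)): [("", 0.6), ("кс", 0.4)],
}


def most_probable(options: Options) -> str:
    """Pick the output with the highest probability."""
    return max(options, key=lambda opt: opt[1])[0]


def transcribe_segment(segments: List[str], i: int, max_width: int = 2) -> str:
    """Transcribe segment i, preferring rules with the widest context."""
    padded = ["^"] + segments + ["$"]  # word-boundary markers
    pos = i + 1                        # index of the segment in `padded`
    seg = padded[pos]
    for width in range(max_width, 0, -1):
        left = tuple(padded[max(0, pos - width):pos])
        right = tuple(padded[pos + 1:pos + 1 + width])
        for key in ((left, seg, ()), ((), seg, right)):
            if key in RULES:
                return most_probable(RULES[key])
    options = RULES.get(((), seg, ()))
    # No rule at all: leave the segment as it is.
    return most_probable(options) if options is not None else seg


def transcribe(segments: List[str]) -> str:
    """Transcribe a whole segmented word."""
    return "".join(transcribe_segment(segments, i) for i in range(len(segments)))


print(transcribe(["r", "eu", "l", "eau", "x"]))
```

With this toy table, the context rule `(r)eu` overrides the context-free `eu`, and the word-final `x($)` rule drops the last letter.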




. . , . , . 70 . 20 000 . .

, , Navteq . , .


image


Source: https://habr.com/ru/post/201732/

