
W for Wikipedia

What is "big data"? It is data that you cannot just digest. Or cannot just cook. Or data you only think is impossible to handle.
A particularly strong "imbalance" of this kind hides in web cartography, in the maps on various websites.
And so it happened that for several years I traveled to various conferences and talked about organizing the transfer of data from the server to the map. Sometimes I was asked: "but where do you get all this data of yours?"
Those are not the right questions. The right questions are:
- how to store data
- what data, when and why to transfer to the client
- what server clustering is, how it looks and why it is needed
- what to do with the data
- and why all of this is needed, %username%

And as for where to get the data... there is a children's rhyme about that:

All of it covered,
Absolutely all of it:
The island of Wikipedia
Is out there in the ocean.
And on this island grow about ten million geotagged articles, which we are going to use.
But things are not so simple with the local flora and fauna: the articles grow in different languages and in different places, and there are really a lot of them...
Therefore, like true heroes, we will complicate the task a bit and add some aggregate functions, Levenshtein, Morton codes, esosedi and a bit of common sense.


"Enhancing the connectivity of the Internet..." With roughly these words I imagined the process of integrating Wikipedia into a common pile of geodata.

But there was a problem: Wikipedia, even though it describes the not-so-simple process of cloning itself (Wikipedia:How_to_copy_Wikipedia), did not want to be integrated at all.
Because the wiki is big. Real big data, at least for my SSD. Downloading the necessary data to localhost was hard, and downloading it to a server was out of the question (because all the work described in this article, like the article itself, was done on a train).
But everything can be solved. Literally in 5 minutes (a local meme).

G for geonames
Some time ago I discovered a project called geonames.org which, among other things, provides an API to the "geo" articles of Wikipedia: www.geonames.org/maps/wikipedia.html (and there everything "jumps and bounces").
If you dig a little, you can find the source of their data: the Georeferenzierung project, which, after all its debates and discussions, produces a small file of some 700 megabytes: http://toolserver.org/~kolossos/wp-world/pg-dumps/wp-world/
Please give it a warm welcome: this is a CSV with all the Wikipedia articles we need.

D for Database
Step 1. Import the data.
Download the new_C.gz file from the toolserver. It is an ordinary CSV whose format is described in the wiki.
We do not actually need all the goodness it contains; only the article title, the coordinates and the length of the article body: auto_increment_id, lang:Titel, lat, lon, psize (for ranking).
Based on this data, you can get a list of Wikipedia geo pages.
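To make the step concrete, here is a minimal import sketch. It assumes the dump is a gzipped CSV with the fields in roughly the order listed above and loads them into a local SQLite table named pages; the column indices, table layout and file handling are illustrative, not the exact code behind this article.

# Sketch: load the geo dump into a local table (assumed column order: lang, title, lat, lon, psize).
import csv
import gzip
import sqlite3  # a local MySQL works just as well; SQLite keeps the sketch self-contained

db = sqlite3.connect("wiki_geo.db")
db.execute("""CREATE TABLE IF NOT EXISTS pages (
                 uid INTEGER PRIMARY KEY AUTOINCREMENT,
                 lang TEXT, title TEXT, lat REAL, lon REAL, psize INTEGER)""")

with gzip.open("new_C.gz", "rt", encoding="utf-8", errors="replace") as f:
    for row in csv.reader(f):
        try:
            lang, title = row[0], row[1]
            lat, lon, psize = float(row[2]), float(row[3]), int(row[4])
        except (IndexError, ValueError):
            continue  # rows with broken encoding or missing fields are skipped
        db.execute("INSERT INTO pages (lang, title, lat, lon, psize) VALUES (?, ?, ?, ?, ?)",
                   (lang, title, lat, lon, psize))
db.commit()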

Step 2. The interwiki stage.
Many Wikipedia articles are described in more than one language. That is rather the point. In the source file the data format is roughly this:
lang, pageTitle, ..., en-page, ru-page, de-page ... 273 languages from interwiki in total.

Our task is to unite a group of articles in different languages into a "cluster". In essence, the sequence of 273 references to the different language versions already is the cluster. You can assign it a unique id (the minimum page id in the cluster, for example, or a CRC32 of its components) and then refer to it.
Along with this operation we find out that half of the pages referenced from interwiki are not in the main index.
They did not show up in the first stage.
These articles have no geotag (but they are linked via interwiki), so at the first stage we did not "see" them: they simply never made it into the export file.
In that case these pages should still be added to the main index, with the coordinates set to null (for statistics).
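A rough sketch of that clustering step: a union-find merges pages connected by interwiki links, and the minimum page uid of each group becomes the cluster id (a CRC32 of the member titles would do just as well). The input format, pairs of page uids, is an assumption for illustration.

# Sketch: merge interwiki-linked pages into clusters with a union-find;
# the cluster id is the minimum page uid in the group.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # the smaller uid stays the root

def build_clusters(interwiki_pairs):
    """interwiki_pairs: iterable of (page_uid, linked_page_uid)."""
    for page, linked in interwiki_pairs:
        union(page, linked)
    return {page: find(page) for page in parent}  # page -> cluster id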
PS: that is the first jab at Wikipedia. There is no connectivity at the data level.
PPS: sometimes the file contains titles with broken encoding.

Step 3. Let's have a look!
Let's pause for a moment and finally see what our globe looks like if we overlay the Wikipedia data on it.



You could see almost this very picture a couple of weeks ago in the header of the article about heat maps in the Yandex.Maps API.

In fact, the picture is clickable: it is an interactive example that you can play with on esosedi (it only works well in WebKit!). At that link you can look not only at overall wiki coverage, but also at the data of individual language sections (en, ru, uk, de). And there is also a rather strange map of "incorrectness" distribution (for those who want a little math).

The meaning of this “term” is very simple:
1. We have Wikipedia articles.
2. We have a connection between these articles.
3. We can compute the difference between the coordinates of the articles of one cluster, that is, the difference in the data of one and the same article in different languages.
INSERT INTO deviance
SELECT cluster,
       MAX(lat) - MIN(lat) + MAX(lng) - MIN(lng),
       STD(lat) + STD(lng),
       VARIANCE(lat) + VARIANCE(lng)
FROM pages AS p
LEFT JOIN `pagecluster` AS pc ON pc.page = p.uid
LEFT JOIN cluster AS c ON c.clusterId = pc.cluster
WHERE c.cnt > 1
GROUP BY cluster
PS: no heavy math here, just standard functions with a result for every taste.

What do we get?

Those who read carefully remember (from step 2 of the import) that some articles of a cluster have NO georeference at all. And where there are such gross violations, there may be smaller ones too, for example simply different coordinates in different articles. Because even though the cluster is one, the articles are in fact different, since the wiki language sections are relatively autonomous.
PS: Articles of one cluster may lack not only a georeference but also interwiki links to each other. Many articles have no links back to the articles that link to them. Or they may have outgoing links into a completely different cluster. Do not try to make sense of it; I myself do not understand how this happened.

In practice this means that an article's coordinates may be present or absent, and may simply be wrong. And Sabiha Gokcen Airport in the ru version was "hanging" over Domodedovo, while the en version was fine.
By now this is no longer the case and Sabiha is fine: someone fixed the article back in August. But (!):
SELECT *
FROM `pages` AS p
LEFT JOIN pageCluster ON pageCluster.page = p.uid
LEFT JOIN deviance AS dv ON dv.clusterId = cluster
WHERE `type` = 'airport' AND width > 1
ORDER BY cluster

returns about 500 airports with a coordinate delta of more than one degree.
For example, Ravensthorp and Ravensthorpe_Airport. The coordinates themselves are fine; it is just that the hemispheres got mixed up (33 and -33 degrees).
PS: that was the second jab at Wikipedia.

If you dig a little deeper, it turns out that airports and major universities are almost always a disaster. Sometimes it seems that some strange robot assigned the coordinates. And the same ones everywhere. The luckiest of all was a place called "Gorlovsky State Pedagogical Institute of Foreign Languages": it "absorbed" 22 other educational institutions, outdoing both MSU and the Ivy League.

In fact, there are not that many Wikipedia articles for which "deviance" can be computed at all: many subjects are described in only one language and there is nothing to compare. But the order of magnitude and the error rate roughly hold.
Many articles are pinned slightly off. Sometimes by centimeters, sometimes by meters, sometimes worse. The maximum error found is about 60 degrees.
Finding such places is difficult, but possible. For example, according to Wikipedia, ru:Temple_Vyrupakshi_v_Hampi is not in Hampi at all (an offset of 290 km). Yet it links to the town of Hampi, which is in the right place. And there are many such places for which one can determine at least the region where they should be pinned.
PS: For example, almost all airports and institutes.


But more about that another time. Let's move on to the question of "how to show it".

Z for Z-code

In order to display some data on the map, that data has to be transferred.
But you cannot transfer all of it, because there are many, many megabytes of it (700 to begin with). There is only one salvation: server-side clustering.
In my terminology it comes in three types:

In particular, to export the data for the heatmap example, the spell sounded like this:
SELECT z, lat, lng, SUM(weight) FROM `coordinate_map` AS cm GROUP BY z & 0xFFFF0000

Here lat and lng are the usual "earth" coordinates, and z is the "Z-order curve", aka the Morton code. That is where all the magic hides.



Morton, his colleague Hilbert (the geohash in MongoDB) and the Gray code all do the same thing: they map 2D (or 3D) coordinates onto 1D.
Z-order, Quad-code, Morton order, or Morton code.


In essence, Z is "quadtree addressing". Bing calls these quadkeys and uses them for addressing map tiles (the tile pyramid).
This is exactly the "effect" we need. A Z-code, quadkey, or whatever you call it, "points" at a quadtree node. At a square, at a tile.
A uint32 can "hold" 16 "turns" down the tree, that is, it can address tiles up to zoom 16. That is roughly 300x300 meters. A bigint fits 16 more levels.
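A minimal sketch of how such a z value can be produced, assuming the globe is first projected onto a 2^16 x 2^16 tile grid (zoom 16, the usual Web Mercator tile formula) and the x/y tile bits are then interleaved. The helper names and the projection choice are mine, for illustration only.

# Sketch: lat/lng -> tile x,y at zoom 16 (Web Mercator) -> interleaved Morton code.
import math

ZOOM = 16

def tile_xy(lat, lng, zoom=ZOOM):
    n = 1 << zoom
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def morton(x, y, zoom=ZOOM):
    z = 0
    for bit in range(zoom):
        z |= ((x >> bit) & 1) << (2 * bit)       # even bits come from x
        z |= ((y >> bit) & 1) << (2 * bit + 1)   # odd bits come from y
    return z

z = morton(*tile_xy(55.75, 37.62))  # fits into a uint32 at zoom 16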
For comparison, here is the size of a zoom-16 tile; for clustering and for selecting data this is quite enough.



But the "main" bonus of Z-addressing is fast data selection (a 1D index) and very fast aggregate operations.
GROUP BY z & BITMASK, and that is all. For tens of millions of records the query runs in tens of milliseconds.
With the mask 0xFFFF0000 (16 ones, 16 zeros) we "cover" the tiles up to zoom 8 and "discard" any finer variability. That is the clustering step (see the sketch after this paragraph). For the wiki the output is ~22k "server-clustered" points for our heatmap.
The process itself is not that easy to grasp at first, so read up on the math. In Wikipedia, of course. :P
These space-filling curves are used for data selection, for grouping and for sorting. For several decades now people keep finding new uses for this kind of mechanics, and there is no end in sight.
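To make the masking trick tangible: clearing the 16 low bits (8 quadtree levels) collapses every zoom-16 code onto its zoom-8 ancestor tile, so grouping by the masked value is exactly the server-side clustering. A tiny sketch reusing the morton() helper above (the names are assumptions, not the article's actual code):

# Sketch: GROUP BY (z & 0xFFFF0000) in plain Python.
from collections import defaultdict

MASK = 0xFFFF0000  # keep the 16 high bits (zoom 8), discard the 16 low bits

def cluster(points):
    """points: iterable of (z, weight); returns {zoom-8 tile code: summed weight}."""
    buckets = defaultdict(float)
    for z, weight in points:
        buckets[z & MASK] += weight
    return buckets

# e.g. cluster((morton(*tile_xy(lat, lng)), psize) for lang, title, lat, lng, psize in pages)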
In my other article, about regions, there is a link to a chapter of a book on spatial access methods. Recommended.

You can also take a look at the module for Drupal that works on top of geoHash.org, which (in essence) is also a "Morton number", only for some reason in base32.
The main thing to remember is that strings are not numbers, and you cannot build (normal) B-trees from them. Although if you have MongoDB there is no choice, but it all works; tested (a link to a small REST server doing clustering for the Yandex.Maps API and an example for it).
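To see in what sense a geohash is "also a Morton number, only in base32": each character encodes five interleaved lon/lat bits, so the whole string decodes into one integer. A tiny sketch (the standard geohash alphabet is assumed):

# Sketch: a geohash string is just base32-encoded interleaved bits, i.e. a Morton-style number.
GEOHASH32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_to_int(gh):
    value = 0
    for ch in gh:
        value = (value << 5) | GEOHASH32.index(ch)  # 5 bits per character
    return value

# geohash_to_int("ucfv0") gives an integer you can index, mask and GROUP BY like the z above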

So, at the very beginning of the article some questions were raised.
Q: How to store the data?
A: In a database, as ordinary data, whichever way is more convenient, and augment it with z-coordinates.

Q: What data, when and why should be transferred to the client?
A: Use smart data transfer, such as RemoteObjectManager and friends. There was an article about this recently.

Q: What is server-side clustering, what does it look like and why is it needed?
A: Oh! It is not a scary beast at all; it not only reduces the load on the client but also gives more even data coverage (it also "jumps and bounces").

There are two questions that have not yet been answered:
- what to do with the data
- and why all of this is needed, %username%

So I am starting the story about the last stage: the stage where we decide what to do with this data and how it can be useful.

All covered with tags... Absolutely all... there is a globe.

When I started writing this article it was still August outside. A lot of time has passed since then, the Maps API released ObjectManager, and I wanted to make an endpoint for it so that you could show Wikipedia objects on your own map. But I never did make a public endpoint that could survive the habraeffect.
If you want to get a bit of "experience" in placing and displaying more or less serious amounts of data on a map, this topic, the topic about ObjectManager and our talks (one, two) will help you with that.
But that is not the topic of this story. We are talking about the wiki.

PS: By the way, an interview with the head of Russian Wikipedia was recently published on The Village. That is in fact what prompted me to pull this article out of drafts.


L for Levenshtein

"Enhancing the connectivity of the Internet..." With roughly these words I imagined the process of integrating Wikipedia into the common pile of esosedi geodata.
The essence of the esosedi project is very simple: adding various places to the map, with or without a purpose. That is what it has been doing for a couple of years. And if you pour the wiki data on top of that pile of objects, you get a mess. The connectivity has to be raised somehow.

It was at this point that I called the Levenshtein distance to the rescue:
the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.

The final algorithm looks like this: for every place, find the wiki articles around it, compare its description with the candidates, and write the result into the database.
The "compare" part is not as simple as it seems: roughly 16 measurements have to be taken for each pair. In transliteration, without vowels, without "brackets", without minus-words, through secondary directories...
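A toy version of that comparison, covering only a few of those "measurements" (the raw lower-cased strings, a bracket-free form and a vowel-stripped form); the normalization steps and the threshold here are illustrative, not the ones actually used for esosedi.

# Sketch: compare a place name with a wiki title via Levenshtein distance
# over a couple of normalized forms. Normalizations are illustrative.
import re

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def forms(s):
    s = s.lower()
    no_brackets = re.sub(r"\(.*?\)", "", s).strip()
    no_vowels = re.sub(r"[aeiouyаеёиоуыэюя]", "", no_brackets)
    return [s, no_brackets, no_vowels]

def similar(place_name, wiki_title, threshold=0.25):
    best = min(levenshtein(a, b) / max(len(a), len(b), 1)
               for a, b in zip(forms(place_name), forms(wiki_title)))
    return best <= threshold  # relative distance, smaller is closer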

It is Levenshtein (together with the distance factor) that helps figure out that the "Golden" and the "Golden Lane" are one and the same, just like Fox Gora and Lisa Gora.
Sometimes it can even tell apart Quang Nam, Quang Ngai and Quang Ninh (regions of Vietnam). But Bắc, Bạc and Bắc it cannot, because the transliteration is the same.

The work is further complicated by the fact that the data sets being compared are different. Nobody is perfect. Algorithms, for example. And UGC data especially. We work with what we have.

I for Interwiki
And a good example of errors in UGC is interwiki itself.
Let's open the page ru:ISO_3166-2:SI (Slovenia). Unfortunately, almost no region is described there.
Now open the same thing in another language: en:ISO_3166-2:SI. Not everything is fine there either, of course, but there is at least some hint of data.
For example, there you can find a link to the region SI-060, Municipality_of_Litija, which has a Russian version: ru:Liya_(city). Except that the 3166-2 page in the Russian version linked to "Lithia (Slovenia)", not to "_(city)".
The second point: en:3166:SI, among other things, links to the "statistical regions" of Slovenia. For example, to SI-08 Osrednjeslovenska.
But! Entering Osrednjeslovenska_statistical_region redirects to en:Central_Slovenia_Statistical_Region, and that article does have a Russian version. Yet in the Russian section's 3166:SI page there is simply no information about such an important (probably) thing as SI-08.
Situations like this occur quite often. Taking into account the earlier footnote about problems in interwiki, I will tell you a secret: error: foreign key constraint failed.

T for translate
And there is another problem: I do not know Spanish. And my German is not great. And I know people whose English is bad. But my mother likes to vacation in Sorrento.
Sorrento itself is fine, do not worry. But 99% of the articles about it are exclusively in Italian. And no Levenshtein will help you understand that "Valley of the Mills" and "Vallone dei Mulini" are one and the same.

At the same time, even though this is a world-class attraction, the article exists only in that one language. In our country, or rather in our Wikipedia section, it is the same story everywhere: most information about various places and "objects" is not translated into other languages. Well, except for the UNESCO heritage list.
As a result, you end up having to show information to a person even if that person cannot read it. Simply because there is no other information.
There is no option other than using the Yandex.Translator API.
PS: It is a pity that the Google Translate API is no longer with us.
PPS: Ever since then I keep thinking that I should write my own translator, since the quality of what exists leaves much to be desired...

V for politeness
Personally, it took me about two hours from finding the link to toolserver.org to getting the first result. You can walk the same path; it will not take long and, most importantly, it will bring a lot of pleasure.
Why the moderators and other active bot owners of Wikipedia cannot apply a similar analysis (and a "scientific approach" in general) is an open question.
If you have a Wikipedia activist among your friends, let them know this article exists. I myself am a Wikipedia pessimist.
PS: For several years now Wikipedia has kept any UGC sites, including esosedi, on its spam lists as "non-authoritative" sources. Because they cannot be trusted. Just like Wikipedia itself, in fact.


Results
A long time ago, about 6 years back, during the 2008 crisis, standard Google Maps allowed displaying a wiki overlay. Not anymore.
This year, while adding the wiki layer to esosedi, I understood what had been missing: an avalanche of information about places and about events in those places. Just remember one thing: you cannot simply take Wikipedia and show it.

PS: Many pages that ended up on the map lead nowhere. Because there are "criteria of notability" and such pages simply get deleted.
PPS: I know that the wiki is not a directory and not a catalog. And not a complete enough source of information.

Afterword
Perhaps the world will become better, and the Great_Mosque_in_Samarra will finally end up in Samarra.

Initially there was a lot of assorted SQL here, which is why I added the article to the SQL hub. But then I decided it was unnecessary: the queries are all trivial and interesting only within the framework of this particular problem.
Let this article be an example of what happens when someone forgets about database normal forms, foreign keys and other automatic integrity checks.
For my part, I will just share an SQL dump (MySQL, 150 MB) of my database. Play with it. Check the calculations. Run the bots. If anyone needs it.

In fact, my story with Wikipedia, geonames and Virupaksha in Hampi did not end there. They were joined by OpenStreetMap and a forward and reverse geocoder, and together they... But more about that another time.

Source: https://habr.com/ru/post/239925/

