📜 ⬆️ ⬇️

Internet map

Hello to all!

I want to present you the Internet Map or the result of clustering more than 350 thousand sites in accordance with the transitions of users between them. The size of the circle is determined by the attendance of the site, the color - by nationality, and the position on the map - by its links to other sites. If two sites have a steady stream of users between them, they will “try” to get closer to each other. After the completion of the algorithm, on the map one can observe clusters of sites (clusters) united by common users.

image
')
For example, if you type in the search for habrahabr.ru, you can see that dirty.ru and leprosorium.ru are in the same “constellation”, and even further away livejournal.ru. This suggests that the one who is reading this text now also visits these sites with a high probability (with respect to the average Runet user of course).

An even more interesting example of clustering can be seen at the bottom of the map, between violet Japan and yellowish Brazil: there is a whole pornostran located in sizes comparable to the whole Euronette. Interestingly, being sufficiently competent in the question under consideration, within a large pornocluster, one can distinguish thematic subclusters of a smaller size.

Those who are interested in a brief technical description - welcome under the cat


Engineering part


The entire project is written in C # and consists of three parts: a clustering program, a tile generation program, and a website. Each part deserves separate consideration and, if there is interest, I could tell more about them later.

image

The original data was taken from Alex, they are records of attendance, upstream and downstream transitions of users (upstream - where they came from, downstream - where they left). After normalization, we get a weighted undirected graph with 350 thousand vertices and more than 2 million edges.

Counting such a graph is a complicated computational task, so a special module for the GPU was written, but fortunately it was not needed. After tricky optimizations, the entire calculation took several weeks of continuous work of powerful, but still household iron.

To put it simply, the algorithm was to move the sites on the map in accordance with the forces acting on them. Many user transitions - strong power tries to bring the sites together; if the sites are too close, the repulsive force starts to work, etc. More details can be found in this work reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-189.pdf

The main problem was the enormous computational complexity of such an algorithm. After all, when solving the problem “in the forehead”, at each step it is necessary to calculate the superposition of forces for each site, calculate forces for each pair of sites, and there are about 122 billion of such pairs (not bad for one step, right?). Therefore, a strict optimization of the algorithm and complete parallelization of computations were carried out. Fortunately, the .net platform was surprisingly good for this kind of fun.

The second stage is the generation of tiles. A tile is a small picture of 256x256 pixels of which the image of the map consists of on google maps, yandex and other services. In general, nothing complicated - I generated a big picture, cut it into squares of the right size, business. But there were almost 30 million of such pictures. Even just copying or deleting a directory with tiles in windows takes two days. And pour them on the hosting a separate problem.

The third stage is to fasten the google maps engine, put everything together and make it show the map. Here, in general, there is nothing difficult, although there were some difficulties with the projections and positioning of the map.

The last stage - the choice of hosting and release. Here, too, was not without adventures. Now everything is spinning in the Amazon cloud and it turned out to be much easier and cheaper than I imagined.

In general, I have gained some experience, and I will be happy to share it with a respected community. Of course, in general, nothing special, on Habré there are really interesting projects and non-trivial solutions, but still, I think, many of them may be interested.

Also, I will be glad to any ideas, reviews, criticism - any feedback!

Thank you all for your kind words! For me it was very important to get support from the professional community, from people who not only see the result, but understand how much work has been done to achieve it. Also thanks to Habr for the fact that there is a platform in Runet uniting all of us!

Continued: Web Map Web Architecture

Source: https://habr.com/ru/post/148351/


All Articles