
On the duplication of web map tiles

To run a web map built on the Slippy Map approach, you need either a tile storage in which tiles are pre-rendered (generated) for a predetermined map context, or a set of services that generate tiles on demand, or some hybrid of these two approaches.
The first approach has a drawback: it requires a very large storage for the tiles. According to OpenStreetMap, as of March 2011 the tiles needed 54TB of storage space. By my calculations for the actual data of June 2015, this figure is already about 100TB (this is only an estimate; I did not undertake a real experiment) for storing tiles of zoom levels 0...17. The growth of the estimate is explained by the fact that OpenStreetMap data has increased significantly since then, and areas that were empty in March 2011 have been filled in with detail. One also cannot discount the suboptimal PNG compression in my case compared to OpenStreetMap (my average tile size is 4.63KB versus OpenStreetMap's 633 bytes in March 2011), the complexity of rendering the map with mapnik, and other nuances of my setup. In any case, tile storage requires a lot of space, which not every server can afford. The situation is aggravated by the fact that on block file systems a small tile occupies a whole block (a 103-byte tile may take up, for example, a full 4KB block), which wastes physical disk space. With a very large number of tiles (at large zoom levels) inside a single directory, you may also hit the limit on the number of files or subdirectories the file system allows. With all that, this approach gives a comfortable response time when serving a tile.
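For reference, the ~100TB figure is easy to reproduce with a back-of-envelope calculation. The sketch below simply sums 4^z tiles over zoom levels 0...17 and multiplies by the average tile size; the 4.63KB average is taken from my own cache, so treat the result as a rough estimate rather than an exact figure.

```python
# Back-of-envelope estimate of full-planet tile storage for zooms 0..17,
# assuming an average tile size of 4.63 KB (measured on my own cache).
AVG_TILE_BYTES = 4.63 * 1024

total_tiles = sum(4 ** z for z in range(0, 18))       # 4^z tiles at zoom level z
total_bytes = total_tiles * AVG_TILE_BYTES

print(f"tiles:   {total_tiles:,}")                    # ~22.9 billion tiles
print(f"storage: {total_bytes / 1024 ** 4:.0f} TiB")  # ~99 TiB, i.e. on the order of 100 TB
```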
The second approach, while undemanding of tile server storage capacity, requires setting up and maintaining several services (PostgreSQL, PostGIS, HStore, mapnik, renderd, mod_tile, apache) that must reliably generate a tile and send it to the requesting client, and the tile cache has to be cleaned periodically. In other words, the price of a small hard disk on the tile server is a more complex architecture and a considerable time to serve each individual tile (by my measurements, up to 500ms with only one client; on a highly loaded service this time can grow to unacceptable values).

In this publication I do not touch on the architecture of the tile server. Ultimately it is up to you how to build your own web map service, starting from the hardware you have. In this article I only want to draw attention to some features of tile storage; knowing them, you can build your web map service more optimally.
By the way, I decided to build my web map service on a mixed approach. The point is that the geography of requests from the users of my web service is fairly well defined, i.e. I know in advance the map context that users will request, which lets me pre-render tiles for that context (a sketch of how a known bounding box translates into concrete tile numbers is given after the table below). In my case the volume of required tiles came to 511GB for zoom levels 3...17. For zoom levels 3...13 I generated all the tiles, without restricting myself to the map context I knew in advance. Below are statistics on the number and volume of generated tiles by zoom level.
| Zoom | Generated tiles | Total tiles at zoom (4^zoom) | Share of total tiles, % | Volume of generated tiles | Volume of all tiles at zoom | Share of total volume, % |
|---|---|---|---|---|---|---|
| 3 | 64 | 64 | 100 | 1.3M | 1.3M | 100 |
| 4 | 256 | 256 | 100 | 4.3M | 4.3M | 100 |
| 5 | 1,024 | 1,024 | 100 | 15M | 15M | 100 |
| 6 | 4,096 | 4,096 | 100 | 50M | 50M | 100 |
| 7 | 16,384 | 16,384 | 100 | 176M | 176M | 100 |
| 8 | 65,536 | 65,536 | 100 | 651M | 651M | 100 |
| 9 | 262,144 | 262,144 | 100 | 1.7G | 1.7G | 100 |
| 10 | 1,048,576 | 1,048,576 | 100 | 6.1G | 6.1G | 100 |
| 11 | 4,194,304 | 4,194,304 | 100 | 18G | 18G | 100 |
| 12 | 16,777,216 | 16,777,216 | 100 | 69G | 69G | 100 |
| 13 | 67,108,864 | 67,108,864 | 100 | 272G | 272G | 100 |
| 14 | 279,938 | 268,435,456 | 0.10 | 3.2G | 1.1T | 0.29 |
| 15 | 1,897,562 | 1,073,741,824 | 0.18 | 15G | 4.35T | 0.34 |
| 16 | 5,574,938 | 4,294,967,296 | 0.13 | 34G | 17.4T | 0.19 |
| 17 | 18,605,785 | 17,179,869,184 | 0.11 | 94G | 69.6T | 0.13 |
| Total | 115,842,662 | 366,503,875,925 | 0.51 | 511G | 92.8T | 0.51 |
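As promised above, here is a rough sketch of how a known map context turns into a concrete list of tiles to pre-render. It uses the standard Slippy Map tile-numbering formulas; the bounding box in the example is made up, and in reality you would feed the resulting z/x/y list to your renderer of choice.

```python
import math

def deg2num(lat_deg, lon_deg, zoom):
    """Standard Slippy Map formula: latitude/longitude in degrees -> tile (x, y)."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n)
    return x, y

def tiles_for_bbox(lat_min, lon_min, lat_max, lon_max, zoom):
    """Yield every (z, x, y) covering the bounding box at a single zoom level."""
    x_min, y_max = deg2num(lat_min, lon_min, zoom)   # tile y grows southwards
    x_max, y_min = deg2num(lat_max, lon_max, zoom)
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            yield zoom, x, y

# Hypothetical map context: a bounding box around one city at zoom 14.
tiles = list(tiles_for_bbox(55.5, 37.3, 55.9, 37.9, zoom=14))
print(len(tiles), "tiles to render at zoom 14")
```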

Excessive duplication of tiles


The very first thing I noticed while developing web maps (besides the exorbitant storage requirements) is that the picture often repeats itself. For example, two adjacent tiles in the ocean look equally blue. But visually identical tiles and binary identical tiles are different things.
To test this hypothesis, I compared the MD5 checksums of two adjacent tiles, and they turned out to be identical:
root@map:~# md5sum 10/0/0.png
a99c2d4593978bea709f6161a8e95c03  10/0/0.png
root@map:~# md5sum 10/0/1.png
a99c2d4593978bea709f6161a8e95c03  10/0/1.png

Does this mean that an identical MD5 checksum guarantees identical tiles? Of course not!

There is such a thing as collisions: two visually (and binary) different tiles can end up with the same checksum. The risk exists, although for files of arbitrary size it is small. Since a web map is, as a rule, a resource that does not require absolute accuracy (unlike, say, banking transactions or Forex quotes), we will assume that the low probability of tile collisions is an acceptable risk justifying...
And what, in fact, do we gain from knowing that some tiles are identical under a hash function? You have probably already guessed. Why keep several identical tiles and occupy disk space with them, when you can store just one file and some mapping of all its duplicates onto it (for example, an ordinary symlink)?
So, the low probability of tile collisions is an acceptable risk that justifies reducing the capacity needed for tile storage. But how much do we gain by removing all duplicates? By my calculations, up to 70% of tiles are duplicates, and the larger the zoom level, the higher this figure.
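A minimal sketch of this kind of deduplication is shown below: it walks a z/x/y tile tree, hashes every PNG and replaces later binary-identical copies with relative symlinks to the first one seen. This is an illustration of the idea, not my production code; MD5 is used simply to match the text, and the cache path is hypothetical.

```python
import hashlib
import os

def dedupe_tiles(tile_root):
    """Replace binary-identical tiles under tile_root with symlinks to the
    first copy encountered (illustrative sketch, not production code)."""
    first_seen = {}                                  # md5 digest -> path of the kept file
    for dirpath, _dirnames, filenames in os.walk(tile_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not name.endswith(".png"):
                continue
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            kept = first_seen.get(digest)
            if kept is None:
                first_seen[digest] = path
            else:
                # Point the duplicate at the kept copy via a relative symlink.
                target = os.path.relpath(kept, start=dirpath)
                os.remove(path)
                os.symlink(target, path)

dedupe_tiles("/var/lib/tiles")                       # hypothetical cache location
```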
It should be noted that I was not the first to think of eliminating duplicate tiles via a hash function. The Sputnik team, for example, used this nuance to organize their tile cache optimally, and the widespread MBTiles format also addresses tile deduplication.
The table below shows the statistics on the duplicate tiles found (by MD5).
| Zoom | Generated tiles | Duplicate tiles | Duplicates by count, % | Volume of generated tiles | Volume after creating symlinks | Duplicates by volume, % |
|---|---|---|---|---|---|---|
| 3 | 64 | 0 | 0 | 1.3M | 1.3M | 0 |
| 4 | 256 | 10 | 3.91 | 4.3M | 4.2M | 0.92 |
| 5 | 1,024 | 80 | 7.81 | 14.6M | 14.3M | 2.13 |
| 6 | 4,096 | 659 | 16.09 | 49.7M | 47.1M | 5.18 |
| 7 | 16,384 | 4,058 | 24.77 | 175.4M | 159.6M | 9.04 |
| 8 | 65,536 | 23,031 | 35.14 | 650.3M | 560.3M | 13.83 |
| 9 | 262,144 | 184,668 | 70.45 | 1710M | 989M | 42.18 |
| 10 | 1,048,576 | 767,431 | 73.19 | 6.1G | 3.1G | 48.22 |
| 11 | 4,194,304 | 3,553,100 | 84.67 | 18G | 4.4G | 75.41 |
| 12 | 16,777,216 | 14,797,680 | 88.18 | 69G | 12.4G | 82.01 |
| 13 | 67,108,864 | 60,945,750 | 90.82 | 271.1G | 38.7G | 85.74 |
| 14 | 279,938 | 47,307 | 16.9 | 3.2G | 185M | 5.71 |
| 15 | 1,897,537 | 514,005 | 27.09 | 14.2G | 12.3G | 13.78 |
| 16 | 5,574,938 | 1,934,553 | 34.70 | 33.8G | 26.4G | 21.86 |
| 17 | 18,605,785 | 8,312,466 | 44.68 | 93.8G | 62G | 33.82 |
| Total | 115,842,662 | 91,084,800 | 78.63 | 511G | 164G | 32.07 |


It should be borne in mind that:

A few words about block file system issues


Sooner or later you will face the question of choosing a file system. At first you will simply use whatever file system your system already has. But when you run into large volumes of data, into excessive tile duplication, into long file system response times under parallel queries, and into the risk of disk failure, you will probably start thinking about how to lay out the tile cache.
As a rule, tiles are small, which leads to inefficient use of disk space on block file systems, and a very large number of tiles can exhaust the available inodes. If you reduce the block size, this affects the maximum storage capacity, since any file system is usually limited in its maximum number of inodes. Even after replacing duplicate tiles with symlinks, you will not be able to radically reduce the required capacity of the tile storage. The problem of half-empty blocks can be partly solved with metatiling: several tiles (usually 8x8 or 16x16) are stored in one file with a special header, from which one can tell at which byte the required tile starts. Unfortunately, metatiles undermine deduplication efforts: the probability of encountering two identical metatiles (as N x N blocks of tiles) is much lower, and the header format itself (the first 532 bytes for an 8x8 metatile) includes the metatile's address, which makes each metatile unique and rules out deduplication in principle. Nevertheless, metatiling makes it possible to "anticipate" requests for neighbouring tiles and thereby reduce the response time of the tile server.
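For illustration, the sketch below parses such a metatile header. It assumes the layout used by mod_tile/renderd: a 4-byte 'META' magic, then tile count, x, y and z as 32-bit integers, followed by an offset/size pair per sub-tile, which for an 8x8 metatile gives exactly the 532-byte header mentioned above (20 + 64 * 8 bytes); the file path is hypothetical.

```python
import struct

def read_metatile_index(path):
    """Parse a mod_tile-style metatile header (assumed layout: 'META' magic,
    int32 count, int32 x, int32 y, int32 z, then count pairs of int32 offset/size)."""
    with open(path, "rb") as f:
        magic, count, x, y, z = struct.unpack("<4s4i", f.read(20))
        if magic != b"META":
            raise ValueError("not a metatile: " + path)
        entries = [struct.unpack("<2i", f.read(8)) for _ in range(count)]
    return (x, y, z), entries                        # (offset, size) per sub-tile

# For an 8x8 metatile the header is 20 + 64 * 8 = 532 bytes, as noted above.
(x, y, z), index = read_metatile_index("/var/lib/mod_tile/0.meta")  # hypothetical path
print((x, y, z), len(index), "sub-tiles")
```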
In any case, a tile storage must satisfy a number of conditions:

The file system that best satisfies these requirements is ZFS. It does not have a fixed block size, it deduplicates data at the block level, and it caches frequently used files in RAM. At the same time it has no built-in support in Linux (due to the incompatibility of the GPL and CDDL licenses) and puts a higher load on the CPU and RAM than traditional ExtFS, XFS and the like, which is the price of its full control over the physical and logical media.
The Btrfs file system is friendlier to Linux and supports (offline) deduplication, but it is still too raw for production systems.
There are other solutions for eliminating tile duplication and using disk space as fully as possible. Almost all of them boil down to creating a virtual file system and attaching special services to it that access this virtual file system, deduplicate files on the fly, and place files into and serve them from a cache.
For example, UKSM, LessFS, NetApp and many others implement data deduplication at the service level. But in production, a pile-up of services is fraught with big problems, especially for a highly loaded web service, so the tile cache architecture chosen for production should be extremely fault-tolerant and easy to administer.
The well-known Sputnik (may the developers mentioned above forgive me; this project became a kind of positive example for me, on the basis of which I am building my own web map service) implements its own deduplication algorithm, which also uses a certain hash function to deduplicate tiles, and stores the tiles in the flexible CouchBase.
I also tried to build something similar from tools I trust in production. In my case the choice fell on Redis. My experience has shown that placing the tiles in Redis in memory gives about a 30% reduction in volume compared to placing them in the file system. You may wonder: why Redis at all? It lives in RAM, after all.
There are several reasons for this choice. First, reliability: over a year in production Redis has proved itself a very reliable and fast tool. Second, in theory a response from memory is faster than a response from the file system. Third, RAM for servers has become relatively cheap, while the reliability of hard disks has not improved much in recent years; intensive work with the hard disk (which is what serving tiles amounts to) greatly increases the risk of its failure. In addition, my organization has about 100 servers with 515GB of RAM each (but with small hard disks), which makes it possible to place tiles in memory efficiently (with proper proxying of zxy to a specific server). One way or another, my choice fell on Redis. I do not impose it on the reader; you can decide on the architecture of your own web map service yourself.
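Below is a minimal sketch of the kind of scheme I have in mind: tiles are keyed by z/x/y in Redis, and a trivial hash of the key decides which of several Redis servers holds the tile, standing in for the zxy-to-server proxying mentioned above. The host list and key format are assumptions made up for the example; it uses the redis-py client.

```python
import hashlib
import redis  # pip install redis

# Hypothetical pool of tile servers; in reality a proxy maps zxy -> server,
# here a simple hash of the key stands in for that mapping.
SERVERS = [redis.Redis(host=h, port=6379) for h in ("10.0.0.1", "10.0.0.2")]

def node_for(key):
    digest = hashlib.md5(key.encode()).digest()
    return SERVERS[digest[0] % len(SERVERS)]

def put_tile(z, x, y, png_bytes):
    key = f"{z}/{x}/{y}"
    node_for(key).set(key, png_bytes)

def get_tile(z, x, y):
    key = f"{z}/{x}/{y}"
    return node_for(key).get(key)                    # bytes, or None if missing

# Usage: put_tile(10, 0, 0, open("10/0/0.png", "rb").read()); get_tile(10, 0, 0)
```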
This article had only one goal: to describe some undocumented nuances of web map services. Save your time and money thanks to my, I hope, not useless research!


Source: https://habr.com/ru/post/265807/
