
On the duplication of web map tiles

To run a web map built on the Slippy Map approach, you need either a tile storage in which tiles are pre-rendered (generated) for a predetermined map context, or a set of services that generate tiles on demand, or some hybrid of these two approaches.
The first approach has a drawback: it requires a very large storage for the tiles. According to OpenStreetMap, as of March 2011 the tiles needed 54TB of storage space. By my calculations for the actual data of June 2015, this figure is already about 100TB (this is only an estimate; I did not undertake a real experiment) for storing tiles of zoom levels 0...17. The growth of the estimate is explained by the fact that OpenStreetMap data has increased significantly since then, and areas that were empty in March 2011 have been filled in with detail. One also cannot discount the suboptimal PNG compression in my case compared to OpenStreetMap (my average tile size is 4.63KB versus OpenStreetMap's 633 bytes in March 2011), the complexity of rendering the map with mapnik, and other nuances of my setup. In any case, tile storage requires a lot of space, which not every server can afford. The situation is aggravated by the fact that on block file systems a small tile occupies a whole block (a 103-byte tile may take up, for example, a full 4KB block), which wastes physical disk space. With a very large number of tiles (at large zoom levels) inside a single directory, you may also hit the limit on the number of files or subdirectories the file system allows. With all that, this approach gives a comfortable response time when serving a tile.
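For reference, the ~100TB figure is easy to reproduce with a back-of-envelope calculation. The sketch below simply sums 4^z tiles over zoom levels 0...17 and multiplies by the average tile size; the 4.63KB average is taken from my own cache, so treat the result as a rough estimate rather than an exact figure.

```python
# Back-of-envelope estimate of full-planet tile storage for zooms 0..17,
# assuming an average tile size of 4.63 KB (measured on my own cache).
AVG_TILE_BYTES = 4.63 * 1024

total_tiles = sum(4 ** z for z in range(0, 18))       # 4^z tiles at zoom level z
total_bytes = total_tiles * AVG_TILE_BYTES

print(f"tiles:   {total_tiles:,}")                    # ~22.9 billion tiles
print(f"storage: {total_bytes / 1024 ** 4:.0f} TiB")  # ~99 TiB, i.e. on the order of 100 TB
```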
The second approach, while undemanding of tile server storage capacity, requires setting up and maintaining several services (PostgreSQL, PostGIS, HStore, mapnik, renderd, mod_tile, apache) that must reliably generate a tile and send it to the requesting client, and the tile cache has to be cleaned periodically. In other words, the price of a small hard disk on the tile server is a more complex architecture and a considerable time to serve each individual tile (by my measurements, up to 500ms with only one client; on a highly loaded service this time can grow to unacceptable values).

In this publication I do not touch on the architecture of the tile server. Ultimately it is up to you how to build your own web map service, starting from the hardware you have. In this article I only want to draw attention to some features of tile storage; knowing them, you can build your web map service more optimally.
By the way, I decided to build my web map service on a mixed approach. The point is that the geography of requests from the users of my web service is fairly well defined, i.e. I know in advance the map context that users will request, which lets me pre-render tiles for that context (a sketch of how a known bounding box translates into concrete tile numbers is given after the table below). In my case the volume of required tiles came to 511GB for zoom levels 3...17. For zoom levels 3...13 I generated all the tiles, without restricting myself to the map context I knew in advance. Below are statistics on the number and volume of generated tiles by zoom level.
| Zoom | Generated tiles | Total tiles at zoom (4^zoom) | Share of total tiles, % | Volume of generated tiles | Volume of all tiles at zoom | Share of total volume, % |
|---|---|---|---|---|---|---|
| 3 | 64 | 64 | 100 | 1.3M | 1.3M | 100 |
| 4 | 256 | 256 | 100 | 4.3M | 4.3M | 100 |
| 5 | 1,024 | 1,024 | 100 | 15M | 15M | 100 |
| 6 | 4,096 | 4,096 | 100 | 50M | 50M | 100 |
| 7 | 16,384 | 16,384 | 100 | 176M | 176M | 100 |
| 8 | 65,536 | 65,536 | 100 | 651M | 651M | 100 |
| 9 | 262,144 | 262,144 | 100 | 1.7G | 1.7G | 100 |
| 10 | 1,048,576 | 1,048,576 | 100 | 6.1G | 6.1G | 100 |
| 11 | 4,194,304 | 4,194,304 | 100 | 18G | 18G | 100 |
| 12 | 16,777,216 | 16,777,216 | 100 | 69G | 69G | 100 |
| 13 | 67,108,864 | 67,108,864 | 100 | 272G | 272G | 100 |
| 14 | 279,938 | 268,435,456 | 0.10 | 3.2G | 1.1T | 0.29 |
| 15 | 1,897,562 | 1,073,741,824 | 0.18 | 15G | 4.35T | 0.34 |
| 16 | 5,574,938 | 4,294,967,296 | 0.13 | 34G | 17.4T | 0.19 |
| 17 | 18,605,785 | 17,179,869,184 | 0.11 | 94G | 69.6T | 0.13 |
| Total | 115,842,662 | 366,503,875,925 | 0.51 | 511G | 92.8T | 0.51 |
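As promised above, here is a rough sketch of how a known map context turns into a concrete list of tiles to pre-render. It uses the standard Slippy Map tile-numbering formulas; the bounding box in the example is made up, and in reality you would feed the resulting z/x/y list to your renderer of choice.

```python
import math

def deg2num(lat_deg, lon_deg, zoom):
    """Standard Slippy Map formula: latitude/longitude in degrees -> tile (x, y)."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n)
    return x, y

def tiles_for_bbox(lat_min, lon_min, lat_max, lon_max, zoom):
    """Yield every (z, x, y) covering the bounding box at a single zoom level."""
    x_min, y_max = deg2num(lat_min, lon_min, zoom)   # tile y grows southwards
    x_max, y_min = deg2num(lat_max, lon_max, zoom)
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            yield zoom, x, y

# Hypothetical map context: a bounding box around one city at zoom 14.
tiles = list(tiles_for_bbox(55.5, 37.3, 55.9, 37.9, zoom=14))
print(len(tiles), "tiles to render at zoom 14")
```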

Excessive duplication of tiles


The very first thing I noticed while developing web maps (besides the exorbitant storage requirements) is that the picture often repeats itself. For example, two adjacent tiles in the ocean look equally blue. But visually identical tiles and binary identical tiles are different things.
To test this hypothesis, I compared the MD5 checksums of two adjacent tiles, and they turned out to be identical:
root@map:~# md5sum 10/0/0.png
a99c2d4593978bea709f6161a8e95c03  10/0/0.png
root@map:~# md5sum 10/0/1.png
a99c2d4593978bea709f6161a8e95c03  10/0/1.png

Does this mean that an identical MD5 checksum guarantees identical tiles? Of course not!

There is such a thing as collisions: two visually (and binary) different tiles can end up with the same checksum. The risk exists, although for files of arbitrary size it is small. Since a web map is, as a rule, a resource that does not require absolute accuracy (unlike, say, banking transactions or Forex quotes), we will assume that the low probability of tile collisions is an acceptable risk justifying...
And what, in fact, do we gain from knowing that some tiles are identical under a hash function? You have probably already guessed. Why keep several identical tiles and occupy disk space with them, when you can store just one file and some mapping of all its duplicates onto it (for example, an ordinary symlink)?
So, the low probability of tile collisions is an acceptable risk that justifies reducing the capacity needed for tile storage. But how much do we gain by removing all duplicates? By my calculations, up to 70% of tiles are duplicates, and the larger the zoom level, the higher this figure.
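A minimal sketch of this kind of deduplication is shown below: it walks a z/x/y tile tree, hashes every PNG and replaces later binary-identical copies with relative symlinks to the first one seen. This is an illustration of the idea, not my production code; MD5 is used simply to match the text, and the cache path is hypothetical.

```python
import hashlib
import os

def dedupe_tiles(tile_root):
    """Replace binary-identical tiles under tile_root with symlinks to the
    first copy encountered (illustrative sketch, not production code)."""
    first_seen = {}                                  # md5 digest -> path of the kept file
    for dirpath, _dirnames, filenames in os.walk(tile_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not name.endswith(".png"):
                continue
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            kept = first_seen.get(digest)
            if kept is None:
                first_seen[digest] = path
            else:
                # Point the duplicate at the kept copy via a relative symlink.
                target = os.path.relpath(kept, start=dirpath)
                os.remove(path)
                os.symlink(target, path)

dedupe_tiles("/var/lib/tiles")                       # hypothetical cache location
```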
It should be noted that I was not the first to think of eliminating duplicate tiles via a hash function. The Sputnik team, for example, used this nuance to organize their tile cache optimally, and the widespread MBTiles format also addresses tile deduplication.
The table below shows the statistics on the duplicate tiles found (by MD5).
| Zoom | Generated tiles | Duplicate tiles | Duplicates by count, % | Volume of generated tiles | Volume after creating symlinks | Duplicates by volume, % |
|---|---|---|---|---|---|---|
| 3 | 64 | 0 | 0 | 1.3M | 1.3M | 0 |
| 4 | 256 | 10 | 3.91 | 4.3M | 4.2M | 0.92 |
| 5 | 1,024 | 80 | 7.81 | 14.6M | 14.3M | 2.13 |
| 6 | 4,096 | 659 | 16.09 | 49.7M | 47.1M | 5.18 |
| 7 | 16,384 | 4,058 | 24.77 | 175.4M | 159.6M | 9.04 |
| 8 | 65,536 | 23,031 | 35.14 | 650.3M | 560.3M | 13.83 |
| 9 | 262,144 | 184,668 | 70.45 | 1710M | 989M | 42.18 |
| 10 | 1,048,576 | 767,431 | 73.19 | 6.1G | 3.1G | 48.22 |
| 11 | 4,194,304 | 3,553,100 | 84.67 | 18G | 4.4G | 75.41 |
| 12 | 16,777,216 | 14,797,680 | 88.18 | 69G | 12.4G | 82.01 |
| 13 | 67,108,864 | 60,945,750 | 90.82 | 271.1G | 38.7G | 85.74 |
| 14 | 279,938 | 47,307 | 16.9 | 3.2G | 185M | 5.71 |
| 15 | 1,897,537 | 514,005 | 27.09 | 14.2G | 12.3G | 13.78 |
| 16 | 5,574,938 | 1,934,553 | 34.70 | 33.8G | 26.4G | 21.86 |
| 17 | 18,605,785 | 8,312,466 | 44.68 | 93.8G | 62G | 33.82 |
| Total | 115,842,662 | 91,084,800 | 78.63 | 511G | 164G | 32.07 |


It should be borne in mind that:

A few words about block file system issues


Sooner or later you will face the question of choosing a file system. At first you will simply use whatever file system your system already has. But when you run into large volumes of data, into excessive tile duplication, into long file system response times under parallel queries, and into the risk of disk failure, you will probably start thinking about how to lay out the tile cache.
As a rule, tiles are small, which leads to inefficient use of disk space on block file systems, and a very large number of tiles can exhaust the available inodes. If you reduce the block size, this affects the maximum storage capacity, since any file system is usually limited in its maximum number of inodes. Even after replacing duplicate tiles with symlinks, you will not be able to radically reduce the required capacity of the tile storage. The problem of half-empty blocks can be partly solved with metatiling: several tiles (usually 8x8 or 16x16) are stored in one file with a special header, from which one can tell at which byte the required tile starts. Unfortunately, metatiles undermine deduplication efforts: the probability of encountering two identical metatiles (as N x N blocks of tiles) is much lower, and the header format itself (the first 532 bytes for an 8x8 metatile) includes the metatile's address, which makes each metatile unique and rules out deduplication in principle. Nevertheless, metatiling makes it possible to "anticipate" requests for neighbouring tiles and thereby reduce the response time of the tile server.
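For illustration, the sketch below parses such a metatile header. It assumes the layout used by mod_tile/renderd: a 4-byte 'META' magic, then tile count, x, y and z as 32-bit integers, followed by an offset/size pair per sub-tile, which for an 8x8 metatile gives exactly the 532-byte header mentioned above (20 + 64 * 8 bytes); the file path is hypothetical.

```python
import struct

def read_metatile_index(path):
    """Parse a mod_tile-style metatile header (assumed layout: 'META' magic,
    int32 count, int32 x, int32 y, int32 z, then count pairs of int32 offset/size)."""
    with open(path, "rb") as f:
        magic, count, x, y, z = struct.unpack("<4s4i", f.read(20))
        if magic != b"META":
            raise ValueError("not a metatile: " + path)
        entries = [struct.unpack("<2i", f.read(8)) for _ in range(count)]
    return (x, y, z), entries                        # (offset, size) per sub-tile

# For an 8x8 metatile the header is 20 + 64 * 8 = 532 bytes, as noted above.
(x, y, z), index = read_metatile_index("/var/lib/mod_tile/0.meta")  # hypothetical path
print((x, y, z), len(index), "sub-tiles")
```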
In any case, a tile storage must satisfy a number of conditions:

The file system that best satisfies these requirements is ZFS. It does not have a fixed block size, it deduplicates data at the block level, and it caches frequently used files in RAM. At the same time it has no built-in support in Linux (due to the incompatibility of the GPL and CDDL licenses) and puts a higher load on the CPU and RAM than traditional ExtFS, XFS and the like, which is the price of its full control over the physical and logical media.
The Btrfs file system is friendlier to Linux and supports (offline) deduplication, but it is still too raw for production systems.
There are other solutions for eliminating tile duplication and using disk space as fully as possible. Almost all of them boil down to creating a virtual file system and attaching special services to it that access this virtual file system, deduplicate files on the fly, and place files into and serve them from a cache.
For example, UKSM, LessFS, NetApp and many others implement data deduplication at the service level. But in production, a pile-up of services is fraught with big problems, especially for a highly loaded web service, so the tile cache architecture chosen for production should be extremely fault-tolerant and easy to administer.
The well-known Sputnik (may the developers mentioned above forgive me; this project became a kind of positive example for me, on the basis of which I am building my own web map service) implements its own deduplication algorithm, which also uses a certain hash function to deduplicate tiles, and stores the tiles in the flexible CouchBase.
I also tried to build something similar from tools I trust in production. In my case the choice fell on Redis. My experience has shown that placing the tiles in Redis in memory gives about a 30% reduction in volume compared to placing them in the file system. You may wonder: why Redis at all? It lives in RAM, after all.
There are several reasons for this choice. First, reliability: over a year in production Redis has proved itself a very reliable and fast tool. Second, in theory a response from memory is faster than a response from the file system. Third, RAM for servers has become relatively cheap, while the reliability of hard disks has not improved much in recent years; intensive work with the hard disk (which is what serving tiles amounts to) greatly increases the risk of its failure. In addition, my organization has about 100 servers with 515GB of RAM each (but with small hard disks), which makes it possible to place tiles in memory efficiently (with proper proxying of zxy to a specific server). One way or another, my choice fell on Redis. I do not impose it on the reader; you can decide on the architecture of your own web map service yourself.
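Below is a minimal sketch of the kind of scheme I have in mind: tiles are keyed by z/x/y in Redis, and a trivial hash of the key decides which of several Redis servers holds the tile, standing in for the zxy-to-server proxying mentioned above. The host list and key format are assumptions made up for the example; it uses the redis-py client.

```python
import hashlib
import redis  # pip install redis

# Hypothetical pool of tile servers; in reality a proxy maps zxy -> server,
# here a simple hash of the key stands in for that mapping.
SERVERS = [redis.Redis(host=h, port=6379) for h in ("10.0.0.1", "10.0.0.2")]

def node_for(key):
    digest = hashlib.md5(key.encode()).digest()
    return SERVERS[digest[0] % len(SERVERS)]

def put_tile(z, x, y, png_bytes):
    key = f"{z}/{x}/{y}"
    node_for(key).set(key, png_bytes)

def get_tile(z, x, y):
    key = f"{z}/{x}/{y}"
    return node_for(key).get(key)                    # bytes, or None if missing

# Usage: put_tile(10, 0, 0, open("10/0/0.png", "rb").read()); get_tile(10, 0, 0)
```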
This article had only one goal: to describe some undocumented nuances of web map services. Save your time and money thanks to my, I hope, not useless research!


Source: https://habr.com/ru/post/265807/
