
Disk balancing in Nginx


In this article I will describe an interesting Nginx-based solution for the case when the disk subsystem becomes the bottleneck in content delivery (for example, video).

Problem statement


The task: static files (video) must be served to clients at a total bandwidth of tens of gigabits per second.

For obvious reasons, such bandwidth cannot be served straight from origin storage; caching is required. The set of content that accounts for most of the outgoing traffic is several orders of magnitude larger than the RAM of a single server, so caching in RAM is out of the question and the cache will have to live on disks.
Network channels of sufficient capacity are available a priori, otherwise the task would be unsolvable.

Choosing a solution


In this situation the disks become the problem spot: for the server to push out 20 gigabits of traffic per second (two optical links in aggregate), it must read about 2400 megabytes per second from the disks. On top of that, the disks may also be busy writing to the cache.
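A quick back-of-the-envelope check of that figure (an illustration of mine, not part of the original article):

```python
# Convert 20 Gbit/s of egress into the disk read rate it requires.
# Assumes decimal units (1 Gbit = 1000 Mbit) and ignores protocol overhead.
egress_gbit_s = 20
read_mb_s = egress_gbit_s * 1000 / 8  # bits -> bytes
print(read_mb_s)  # 2500.0 MB/s, i.e. roughly the quoted ~2400 MB/s after overhead
```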
The traditional way to scale disk-system performance is a RAID array with block striping. The bet is that the blocks of a file will land on different disks, so the sequential read speed of a file will on average equal the speed of the slowest disk multiplied by the number of striped disks.
The problem with this approach is that it works well only in the ideal case: reading a sufficiently long file (much larger than the stripe size) laid out in the file system without fragmentation. For parallel reads of many small and/or fragmented files, this approach does not come anywhere near the combined speed of all the disks. For example, a RAID0 of six SSDs with the I/O queue 100% loaded delivered roughly the throughput of two disks.
Practice has shown that it is more profitable to distribute whole files across the disks, each disk carrying its own file system. This guarantees that every disk is fully utilized, since the disks are independent.

Implementation


As mentioned above, we will cache with nginx. The idea is to split the served files evenly across the disks. In the simplest case it is enough to hash the URL to map the set of URLs onto the set of disks. That is roughly what we will do, but first things first.
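As a minimal sketch of that idea (Python here purely for illustration; the function name is mine), hashing a URL onto a disk number can be as simple as:

```python
def disk_for(url: str, n_disks: int) -> int:
    """Map a URL to a disk index with a stable hash (a byte sum, as the
    article's Lua script also uses; any deterministic hash would do)."""
    return sum(url.encode()) % n_disks
```

The same URL always lands on the same disk, which is what preserves cache hits.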
We define one cache zone per disk. In my example there are 10 of them.
In the http section:
  proxy_cache_path /var/www/cache1  levels=1:2 keys_zone=cache1:100m  inactive=365d max_size=200g;
  proxy_cache_path /var/www/cache2  levels=1:2 keys_zone=cache2:100m  inactive=365d max_size=200g;
  ...
  proxy_cache_path /var/www/cache10 levels=1:2 keys_zone=cache10:100m inactive=365d max_size=200g;

A separate disk is mounted in the directory of each cache zone.

The content sources will be three upstreams, with two servers in each group:
  upstream src1 {
      server 192.168.1.10;
      server 192.168.1.11;
  }

  upstream src2 {
      server 192.168.1.12;
      server 192.168.1.13;
  }

  upstream src3 {
      server 192.168.1.14;
      server 192.168.1.15;
  }

This part is incidental; it is here only to make the example realistic.

The server section:

  server {
      listen 80 default;
      server_name localhost.localdomain;

      access_log /var/log/nginx/video.access.log combined buffer=128k;

      proxy_cache_key $uri;
      set_by_lua_file $cache_zone /etc/nginx/cache_director.lua 10 $uri_without_args;
      proxy_cache_min_uses 0;
      proxy_cache_valid 1y;
      proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_404;

      location ~* ^/site1/.*$ {
          set $be "src1";
          include director;
      }
      location ~* ^/site2/.*$ {
          set $be "src2";
          include director;
      }
      location ~* ^/site3/.*$ {
          set $be "src3";
          include director;
      }

      location @cache1 {
          bytes on;
          proxy_temp_path /var/www/cache1/tmp 1 2;
          proxy_cache cache1;
          proxy_pass http://$be;
      }
      location @cache2 {
          bytes on;
          proxy_temp_path /var/www/cache2/tmp 1 2;
          proxy_cache cache2;
          proxy_pass http://$be;
      }
      ...
      location @cache10 {
          bytes on;
          proxy_temp_path /var/www/cache10/tmp 1 2;
          proxy_cache cache10;
          proxy_pass http://$be;
      }
  }

The set_by_lua_file directive picks the appropriate disk for the URL by hashing. For the conditional "site", a backend is chosen and remembered. Then the included director file redirects to an internal named location, which serves the request from the chosen backend, saving the response in the cache assigned to this URL.

Here is the director file:
  if ($cache_zone = 0) { return 481; }
  if ($cache_zone = 1) { return 482; }
  ...
  if ($cache_zone = 9) { return 490; }

  error_page 481 = @cache1;
  error_page 482 = @cache2;
  ...
  error_page 490 = @cache10;

It looks awful, but it is the only way.

The heart of the whole configuration is the URL → disk hashing in cache_director.lua :
  -- Decode `seed` into a permutation of the disk numbers 0..base-1
  -- (factorial number system).
  function shards_vector(base, seed)
      local result = {}
      local shards = {}
      for shard_n = 0, base - 1 do
          table.insert(shards, shard_n)
      end
      for b = base, 1, -1 do
          local chosen = math.fmod(seed, b) + 1
          table.insert(result, shards[chosen])
          table.remove(shards, chosen)
          seed = math.floor(seed / b)
      end
      return result
  end

  function file_exists(filename)
      local file = io.open(filename)
      if file then
          io.close(file)
          return 1
      else
          return 0
      end
  end

  disks = ngx.arg[1]
  url = ngx.arg[2]

  -- Hash the URL: the sum of its byte values.
  sum = 0
  for c in url:gmatch"." do
      sum = sum + string.byte(c)
  end

  -- Walk the URL's permutation of disks and take the first live one
  -- (a disk is considered live if the flag file "ready" exists in its cache).
  sh_v = shards_vector(disks, sum)
  for _, v in pairs(sh_v) do
      if file_exists("/var/www/cache" .. (tonumber(v) + 1) .. "/ready") == 1 then
          return v
      end
  end

In the set_by_lua_file directive mentioned above, this code receives the number of disks and the URL. Mapping a URL directly to a disk works fine until at least one disk fails. Redirecting URLs from a failed disk to healthy ones must be consistent for a given URL (otherwise there would be no cache hits), yet different across URLs, to avoid load imbalance. Both properties must hold even if the replacement disk (and its replacement, and so on) fails as well. Therefore, for a system of n disks, I map each URL to one of the permutations of those n disks and then try the corresponding caches in the order the disks appear in that permutation. The liveness criterion for a disk (cache) is the presence of a flag file in its directory. I have to chattr these files so that nginx does not delete them.
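The permutation trick can be sketched in Python as follows (an illustrative port of the Lua logic above; the function names and the `alive` set are mine):

```python
def shards_vector(base, seed):
    """Decode `seed` into a permutation of 0..base-1 (factorial number system),
    mirroring shards_vector() in cache_director.lua."""
    shards = list(range(base))
    result = []
    for b in range(base, 0, -1):
        chosen = seed % b
        result.append(shards.pop(chosen))
        seed //= b
    return result

def pick_disk(url, n_disks, alive):
    """Return the first live disk in the URL's permutation
    (in nginx, liveness is the presence of the 'ready' flag file)."""
    seed = sum(url.encode())  # byte-sum hash, as in the Lua script
    for d in shards_vector(n_disks, seed):
        if d in alive:
            return d
```

The mapping is stable: with all disks alive, a URL always gets the same disk; if that disk dies, each URL falls back to its own next choice, so the failed disk's load is spread across the survivors instead of piling onto one of them.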

Results


Spreading the content across the disks this way really does let you use the full speed of the disk devices. A server with 6 inexpensive SSDs under production load sustained an output of about 1200 MB/s, which matches the combined speed of the disks; the RAID array, by comparison, hovered around 400 MB/s.

Source: https://habr.com/ru/post/233525/

