
Sooner or later every sysadmin runs into the problem of static content being served slowly.
It looks roughly like this: sometimes a 3 KB image loads as if it weighed 3 MB, and out of the blue CSS and JavaScript start to "stick" (load very slowly). You press Ctrl+Reload and the problem seems gone, only for everything to repeat a few minutes later.
The true cause of the slowdown is not always obvious, and we look suspiciously at nginx, at the hoster, at the "clogged" channel, or at the "slow" or "buggy" browser :)
In fact, the problem is the imperfection of the modern hard drive, which has not yet parted with the mechanical subsystems of spindle rotation and head positioning.
In this article I will offer you my solution to this problem, based on practical experience of using SSD drives in conjunction with the nginx web server.
How do you tell that it is the hard drive that is slow?
In Linux, problems with disk subsystem speed show up in the iowait parameter (the percentage of CPU time spent idle while waiting for I/O operations). Several commands let you monitor it: mpstat, iostat, sar. I usually run iostat 5 (measurements are taken every 5 seconds). I consider a server fine if its average iowait stays below 0.5%. Most likely, on your "distribution" server this parameter will be higher. It makes sense not to postpone optimization if iowait > 10%: the system is spending a lot of time moving the heads across the hard drive instead of reading data, and this can slow down the other processes on the server as well.
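A minimal check, assuming the sysstat package is installed; watch the %iowait column in the output:
iostat -c 5     # CPU-only report every 5 seconds, includes %iowait
mpstat 5        # per-CPU statistics, also prints %iowait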
How to deal with high iowait?
Obviously, if you reduce the number of disk I/O operations, the load on the hard drive drops and iowait falls with it.
Here are some recommendations:
- Disable access_log
- Turn off updating of the last-access time on files and directories, and allow the system to cache disk writes. To do this, mount the file system with the options async,noatime,barrier=0. ('barrier=0' is an unjustified risk if the database lives on the same partition.)
- You can increase the interval between flushes of dirty buffers via vm.dirty_writeback_centisecs in /etc/sysctl.conf. I have set vm.dirty_writeback_centisecs = 15000
- Did you happen to forget about the expires max directive?
- It also will not hurt to enable caching of file descriptors (a combined sketch of these server-side settings follows this list).
- Apply client-side optimizations: CSS sprites, all CSS in one file, all JS in one file
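A combined sketch of the server-side tweaks from this list; the device name, mount point, nginx location and cache sizes are illustrative and should be adapted to your system:
# /etc/fstab: mount the static-content partition without atime updates
/dev/sdb1  /var/www/static  ext4  defaults,async,noatime,barrier=0  0  2

# /etc/sysctl.conf: flush dirty buffers less often
vm.dirty_writeback_centisecs = 15000

# nginx: a typical static-serving location
location /static/ {
    access_log              off;
    expires                 max;
    open_file_cache         max=10000 inactive=60s;
    open_file_cache_valid   120s;
    open_file_cache_errors  on;
}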
This will help a bit and buy you time until the upgrade. If the project keeps growing,
iowait will soon remind you of itself. :)
Upgrading the hardware
- Add more RAM
Perhaps you can start with RAM: Linux uses all of the "free" memory for the disk cache.
- Good old RAID
You can build a software or hardware RAID out of several HDDs. In some cases it makes sense to increase the number of hard drives but not assemble them into a RAID (for example, when distributing ISO disk images, large video files, ...).
- Solid-state drive: try something new
In my opinion, the cheapest upgrade option is to install one or several SSD drives into the system. Today, as you have already guessed, we will talk about this method of acceleration.
A CPU upgrade will not affect the speed of serving static content at all, because the CPU is not what is slowing it down! :)
Why SSD
A year and a half ago, when I wrote the article
“Tuning nginx”, one of the acceleration options I suggested was using SSD drives.
The Habr community showed restrained interest in the technology: there were reports that SSDs may slow down over time, and fears about the limited number of rewrite cycles.
Soon after the article was published, our company got a Kingston SNE125-S2/64GB based on the Intel X25-E SSD, which is still in use on one of our most heavily loaded distribution servers.
After a year of experimenting, a number of drawbacks emerged that I would like to tell you about:
- An advertising trick: if an SSD's advertisement states a maximum read speed of 250 MB/s, it means the average read speed will be about 75% of the stated maximum (~190 MB/s). This was the case for me with both MLC and SLC, expensive and cheap drives alike
- The larger the volume of one SSD, the higher the cost of 1 MB on this disk
- Most file systems are not adapted for use on SSDs and can create an uneven write load on the disk.
- Only the most modern (and, accordingly, the most expensive) RAID controllers are designed to have SSDs connected to them.
- SSD is still expensive technology
Why I use SSD:
- The advertising does not lie: seek-to-seek time really does tend to 0, which can significantly reduce iowait when a large number of files are served in parallel
- Yes, the number of rewrite cycles really is limited, but we know about it and can minimize the amount of rewritten information using the technique described below
- Drives are now available that use SLC (Single-Level Cell) technology with a "smart" controller, whose number of rewrite cycles is an order of magnitude higher than that of ordinary MLC SSDs
- Modern file systems (for example, btrfs ) already know how to work properly with SSD
- As a rule, a caching server requires a small amount of cache space (we have 100-200G), which can fit on 1 SSD. It turns out that it is significantly cheaper than a solution based on a hardware RAID array with several SAS disks.
Configuring SSD cache
File system selection
At the beginning of the experiment, ext4 was installed on the Kingston SNE125-S2/64GB. On the Internet you will find plenty of recommendations on how to "cut off" journaling, last-access times and so on. Everything worked well and for a long time. The one thing that did not suit me: with a large number of small 1-5K photos, less than half of the 64G SSD could be filled, only ~20G. I began to suspect that my SSD was not being used rationally.
I upgraded the kernel to 2.6.35 and decided to try the (still experimental) btrfs, which lets you specify at mount time that an SSD is being mounted. The disk does not need to be split into partitions, as is customary, but can be formatted as a whole.
Example:
mkfs.btrfs /dev/sdb
When mounting, you can disable many features we do not need and enable compression of files and metadata. (In fact, the JPEGs will not be compressed: btrfs is smart, only the metadata will be compressed.) Here is what my
fstab mount line looks like (all on one line):
UUID=7db90cb2-8a57-42e3-86bc-013cc0bcb30e /var/www/ssd btrfs device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,noatime,ssd,nobarrier,compress,nodatacow,nodatasum,noacl,notreelog 1 2
You can get the formatted disk UUID using the command:
blkid /dev/sdb
As a result, more than 41G fit on the disk (2 times more than with ext4). At the same time, serving speed did not suffer (iowait did not increase).
Building a RAID out of SSDs
The moment came when 64G of SSD was no longer enough. I wanted to combine several SSDs into one large volume, and at the same time I wanted to use not only expensive SLC drives but also ordinary MLC SSDs. Here we need to insert a bit of theory:
Btrfs stores three kinds of data on a disk: information about the file system itself, metadata (there are always 2 copies of the metadata on the disk) and the actual data (file contents). Experimentally I found that in our directory structure the "compressed" metadata occupies ~30% of all the data in the volume. Metadata is the most intensively rewritten part: any addition of a file, move of a file or change of access rights causes a metadata block to be rewritten. The area where the data itself is stored is rewritten much less often. Here we come to the most interesting capability of btrfs: it can create software RAID arrays, and you can explicitly specify which drives store the data and which store the metadata. Example:
mkfs.btrfs -m single /dev/sdc -d raid0 /dev/sdb /dev/sdd
As a result, the metadata will be stored on /dev/sdc and the data on /dev/sdb and /dev/sdd, assembled into a striped RAID. Moreover, you can attach additional disks to an existing system, rebalance the data, and so on (an example follows the note below). To have btrfs detect all the devices of the RAID, run:
btrfs device scan
Attention, a peculiarity of working with a btrfs RAID: before each mount of the RAID array (and after loading the btrfs module) you must run the command btrfs device scan. For automatic mounting via fstab you can do without 'btrfs device scan' by adding device options to the mount line. Example: /dev/sdb /mnt btrfs device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,device=/dev/sde
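For example, attaching one more disk to an existing array and rebalancing the data might look like this (the device name and mount point are illustrative); btrfs filesystem df also shows how much space data and metadata actually occupy:
btrfs device add /dev/sde /var/www/ssd      # attach a new disk to the mounted array
btrfs filesystem balance /var/www/ssd       # redistribute data across all devices
btrfs filesystem df /var/www/ssd            # see how much space data and metadata use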
Caching on nginx without proxy_cache
I assume that you have a storage server that holds all the content; it has plenty of space and ordinary "spinning" SATA hard drives, which cannot withstand a large number of concurrent requests.
Between the storage server and the site users there is a "distribution" server whose task is to take the load off the storage server and ensure uninterrupted delivery of static content to any number of clients.
We install one or several SSDs with btrfs on board into the distribution server. An nginx configuration based on proxy_cache immediately comes to mind. But it has a few drawbacks for our system:
- on every restart, proxy_cache gradually scans the entire contents of the cache. For several hundred thousand files this is perfectly acceptable, but if we put a really large number of files into the cache, such behaviour of nginx is an unjustified waste of disk operations
- proxy_cache has no native cache purging mechanism, and third-party modules only allow removing cached files one at a time
- there is a small CPU overhead, since on every request an MD5 hash is computed over the string specified in the proxy_cache_key directive
- but most important for us: proxy_cache does not care about refreshing the cache with the minimum amount of rewritten information. If a file "falls out" of the cache, it is deleted, and if it is requested again it is written to the cache anew
We will take a different approach to caching. The idea came up at one of the HighLoad conferences. In the cache partition we create two directories, cache0 and cache1. When proxying, all files are saved into cache0 (using proxy_store). We make nginx look for the file (and serve it to the client) first in cache0, then in cache1, and if the file is not found in either, go to the storage server for it and then save it into cache0.
After some time (a week / month / quarter), we delete cache1, rename cache0 to cache1 and create an empty cache0. We then analyze the access log of cache1, and the files that are still requested from it are hard-linked into cache0.
This method significantly reduces write operations on the SSD, since re-linking a file writes far less data than rewriting the file in full. In addition, you can build a RAID out of several SSDs, one of which is an SLC drive for the metadata and the others are MLC SSDs for the regular data.
(On our system, metadata takes up about 30% of the total data.) When re-linking, only the metadata gets rewritten!
Nginx configuration example
log_format cache0 '$request';
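Below is a minimal sketch of how the cache0/cache1 scheme described above could be expressed in nginx; the server name, upstream address, paths and log file location are illustrative assumptions:
# goes in the http {} block: log only the request line, so the rotation
# script can later re-link the requested files from cache1 into cache0
log_format cache0 '$request';

server {
    listen 80;
    server_name static.example.com;                  # illustrative
    recursive_error_pages on;                        # allow the 404 chain below

    # 1) try the fresh cache first
    location / {
        root /var/www/ssd/cache0;
        access_log off;
        log_not_found off;
        error_page 404 = @cache1;
    }

    # 2) then the old cache; hits are logged for later re-linking
    location @cache1 {
        root /var/www/ssd/cache1;
        access_log /var/log/nginx/ssd_cache1.log cache0;
        log_not_found off;
        error_page 404 = @storage;
    }

    # 3) finally fetch the file from the storage server and store it in cache0
    location @storage {
        proxy_pass http://storage.example.com;       # illustrative upstream
        proxy_store /var/www/ssd/cache0$uri;
        proxy_store_access user:rw group:rw all:r;
        access_log off;
    }
}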
Scripts for rotating cache0 and cache1
I wrote several bash scripts to help you implement the rotation scheme described above. If the size of your cache is measured in hundreds of gigabytes and the amount of content in it in millions of files, it makes sense to run the ria_ssd_cache_mover.sh script several times in a row right after the rotation, with the following command:
for i in `seq 1 10`; do ria_ssd_cache_mover.sh; done;
Determine experimentally how long this command takes to run; for me it ran for almost a day. The next day, set up a cron job to run ria_ssd_cache_mover.sh every hour.
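The author's scripts themselves are not shown here; a minimal bash sketch of the rotation step and of what a mover script like ria_ssd_cache_mover.sh might do, assuming the cache lives in /var/www/ssd and nginx logs cache1 hits to /var/log/nginx/ssd_cache1.log as in the sketch above, could look like this:
# rotation, run once per period (week / month / quarter)
rm -rf /var/www/ssd/cache1
mv /var/www/ssd/cache0 /var/www/ssd/cache1
mkdir /var/www/ssd/cache0

# mover sketch: hard-link the files still requested from cache1 into cache0
LOG=/var/log/nginx/ssd_cache1.log
mv $LOG $LOG.work && kill -USR1 `cat /var/run/nginx.pid`    # rotate the log, nginx reopens it
# log lines look like: GET /img/photo.jpg HTTP/1.1
awk '{print $2}' $LOG.work | sort -u | while read uri; do
    [ -f "/var/www/ssd/cache1$uri" ] || continue            # not present in the old cache
    [ -e "/var/www/ssd/cache0$uri" ] && continue            # already in the new cache
    mkdir -p "/var/www/ssd/cache0$(dirname "$uri")"
    ln "/var/www/ssd/cache1$uri" "/var/www/ssd/cache0$uri"  # hard link: only metadata is written
done
rm -f $LOG.work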
DoS protection and the storage server
If the storage server is rather weak and there are ill-wishers eager to take your system down, you can use the secure_link module together with the described solution.
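A minimal sketch of such protection with secure_link (the secret, parameter names and expiry handling are illustrative assumptions, not taken from the original setup):
location /files/ {
    # the link must carry a valid md5 signature and an expiry timestamp
    secure_link     $arg_md5,$arg_expires;
    secure_link_md5 "$secure_link_expires$uri my_secret";   # 'my_secret' is a placeholder

    if ($secure_link = "")  { return 403; }   # missing or invalid signature
    if ($secure_link = "0") { return 410; }   # the link has expired

    # ...serve from the SSD cache / proxy to the storage server as described above
}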
UPD1: I still advise using kernel >= 2.6.37, because on 2.6.35 I recently had a major cache crash caused by running out of metadata space on the SSD. As a result, I had to reformat several SSDs and reassemble the btrfs RAID. :(