
Distributing static content: every millisecond counts


Eight years ago I wrote an article about speeding up the delivery of static content; it drew interest from Habr readers and stayed relevant for a long time.

So we decided to accelerate what already works fast and, along the way, share how it turned out. Of course, I will talk about the rakes we stepped on, about where HTTP/2 is not needed, why we buy a 7.68TB NVMe SSD instead of 8x1TB SATA SSDs, and a lot of other highly specialized details.
Let's agree right away that storing content and distributing it are two different tasks, and here we will talk only about distribution (an advanced cache).

Let's start with the hardware...

NVMe SSD


As you have already understood, we keep up with progress and store the cache on a modern 7.68 TB HGST Ultrastar SN260 (HUSMR7676BHP3Y1, NVMe, HH-HL AIC) SSD. Benchmarks of the drive do not paint quite as pretty a picture as the marketing materials, but they are quite optimistic:

 [root@4 www]# hdparm -Tt --direct /dev/nvme1n1
 
 /dev/nvme1n1:
  Timing O_DIRECT cached reads:   2688 MB in 2.00 seconds = 1345.24 MB/sec
  Timing O_DIRECT disk reads:     4672 MB in 3.00 seconds = 1557.00 MB/sec
 
 [root@4 www]# hdparm -Tt /dev/nvme1n1
 
 /dev/nvme1n1:
  Timing cached reads:            18850 MB in 1.99 seconds = 9452.39 MB/sec
  Timing buffered disk reads:     4156 MB in 3.00 seconds = 1385.08 MB/sec

Of course, you choose the size and manufacturer for yourself, and you can also consider a SATA interface, but here we describe what is worth striving for :)

If you did opt for NVMe, install the nvme-cli package to get information about the drive.
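On a CentOS-like system (judging by the prompts in this article) it is installed straight from the repositories; the package name on other distributions may differ:

 yum install nvme-cli

Now we can look at the characteristics of our "workhorse":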

 [root@4 www]# nvme smart-log /dev/nvme1n1
 Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
 critical_warning                    : 0
 temperature                         : 35 C
 available_spare                     : 100%
 available_spare_threshold           : 10%
 percentage_used                     : 0%
 data_units_read                     : 158 231 244
 data_units_written                  : 297 968
 host_read_commands                  : 45 809 892
 host_write_commands                 : 990 836
 controller_busy_time                : 337
 power_cycles                        : 18
 power_on_hours                      : 127
 unsafe_shutdowns                    : 14
 media_errors                        : 0
 num_err_log_entries                 : 10
 Warning Temperature Time            : 0
 Critical Composite Temperature Time : 0
 Temperature Sensor 1                : 35 C
 Temperature Sensor 2                : 27 C
 Temperature Sensor 3                : 33 C
 Temperature Sensor 4                : 35 C

As you can see, the drive feels great; looking ahead, I will say that under load it stays within the same temperature range.

During peak periods we serve about 4000 photos per second (each photo is about 10-100K in size); at half that load, iostat shows no more than 0.1% utilization. RAM also plays a big role here, but I'll get to that below.
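If you want to watch this yourself, extended per-device statistics can be printed every second with iostat from the sysstat package (the 0.1% above presumably refers to the %util column):

 iostat -x 1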

A few words about why we are now betting on an expensive NVMe drive instead of a bag of cheap SATA SSDs. Our tests show that, with a similar server architecture, the same RAM and the same load, the Ultrastar SN260 7.68TB NVMe runs with 10 times less iowait than 8x Samsung SSD 850 PRO 1TB in a striped RAID on an Areca ARC-1882 PCIe RAID controller. The servers differ slightly in core count (26 cores with the NVMe, 24 with the ARC-1882); both have 128G of RAM.

Unfortunately, there was no way to compare power consumption on those two servers. We could, however, compare the NVMe platform with a similarly built AMD system (24 AMD Opteron 6174 cores) running a striped software RAID of 8x Intel SSDSC2BB480G4 480G drives attached to an ARC-1680 PCIe RAID controller: under the same load, the new system draws 2.5 times less power, 113 Watts versus 274 Watts on the AMD. CPU load and iowait there are also an order of magnitude lower (the AMD has no hardware encryption).

File system


Eight years ago we used btrfs and tried XFS, but ext4 behaves more responsively under heavy parallel load, so our choice is ext4. Proper tuning can squeeze even more out of this already excellent filesystem.
Optimization starts at formatting time. For example, if you mostly serve 1-5K files, you can slightly reduce the block size when formatting:

 mkfs.ext4 -b 2048 /dev/sda1 

or even

 mkfs.ext4 -b 1024 /dev/sda1 

To find out the current block size on the file system, use:

 tune2fs -l /dev/sda1 | grep Block 

or

 [root@4 www]# fdisk -l /dev/nvme1n1
 
 Disk /dev/nvme1n1: 7681.5 GB, 7681501126656 bytes, 1875366486 sectors
 Units: sectors of 1 * 4096 = 4096 bytes
 Sector size (logical/physical): 4096 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 4096 bytes
 Disklabel type: dos
 Disk identifier: 0x00000000

Since the sector size on our NVMe SSD is 4K, it makes no sense to set the block size lower than that value, so we format with the default:

 mkfs.ext4 /dev/nvme1n1 

Note that I did not partition the disk but formatted the whole block device. The OS lives on a separate SSD, which is partitioned as usual. The cache filesystem can then be mounted simply as /dev/nvme1n1.

It is advisable to mount the disk in a way that squeezes maximum speed out of it: disable everything unnecessary and mount with the "noatime,barrier=0" options. If the atime attribute matters to you, kernels 4.0 and above offer the lazytime option, which keeps atime updates in RAM and partially solves the problem of frequent access-time writes during reads.
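As a sketch with those options (the mount point /var/cache/ssd matches the nginx example below):

 mount -o noatime,barrier=0 /dev/nvme1n1 /var/cache/ssd

or permanently, via a line in /etc/fstab:

 /dev/nvme1n1  /var/cache/ssd  ext4  noatime,barrier=0  0 0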

RAM


If your static content fits entirely in RAM, forget everything written above and enjoy serving files from RAM.

When serving files, the OS itself pulls frequently requested ones into its cache, and you do not need to configure anything for that. But if the device holding the files is slow and there are a lot of files (hundreds of thousands), it may happen that the OS requests many files from the filesystem at once and your storage device starts to noticeably "slow down". You can try to solve this problem by loading the static content onto a RAM disk and serving it from there.

An example of creating a RAM disk:

 mount -t tmpfs -o size=1G,mode=0700,noatime tmpfs /cache 

If you forget which parameters you mounted with, you can check with findmnt:

 findmnt --target /cache 


You can remount without rebooting:

 mount -o remount,size=4G,noatime /cache 

You can also combine the two: part of the content in RAM (some frequently requested thumbnails), the rest on the SSD.

In nginx it will look something like this:

 location / {
     root /var/cache/ram;
     try_files $uri @cache1;
 }
 
 location @cache1 {
     root /var/cache/ssd;
     try_files $uri @storage;
 }

If the content does not fit in RAM, you have to trust the kernel: we install 128G of RAM with an active cache size of 3-5G and are thinking about increasing it to 256G.

CPU


There are no special requirements for processor frequency; rather, there is a requirement for functionality: if your traffic needs to be encrypted (for example, served over HTTPS), it is important to choose a processor that supports AES-NI hardware encryption (Intel Advanced Encryption Standard New Instructions).

On Linux, you can verify that the processor supports the AES-NI instructions with the command:

 grep -m1 -o aes /proc/cpuinfo
 aes

If "aes" is not printed, the processor does not support these instructions and encryption will devour CPU performance.
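A quick way to feel the difference is openssl's built-in benchmark: the -evp code path picks up AES-NI when it is available, while the legacy path does not, so on AES-NI hardware the first command is typically several times faster:

 openssl speed -evp aes-128-cbc
 openssl speed aes-128-cbc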

We configure nginx


The general ideas are described in the previous article, but a few points about optimization are still worth spelling out.
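For static files, a minimal sketch of typical nginx tuning looks something like this (the directives are standard nginx; the specific values are assumptions, tune them for your own load):

 # zero-copy file serving from the kernel
 sendfile        on;
 tcp_nopush      on;
 
 # cache file descriptors and metadata of hot files
 open_file_cache          max=100000 inactive=20s;
 open_file_cache_valid    30s;
 open_file_cache_min_uses 2;
 open_file_cache_errors   on;
 
 # let browsers cache static content for a long time
 expires 30d;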


What about CDN


A CDN is good if you work across different countries and have a lot of content but little traffic. If there is a lot of traffic, little content, and you work in one specific country, it makes sense to do the math and understand what is more profitable for you. For example, we work in the Ukrainian market, and many of the world's leading CDN providers have no servers in Ukraine, so delivery comes from Germany or Poland. Out of nowhere, we get +30-50ms on the response instead of +3-5ms. Hosting a 2U server in a good Ukrainian DC starts at $18, plus payment for the channel, for example $10 for 100Mbps, $28 in total. The commercial price of CDN traffic in Ukraine is about $0.05/GB, i.e. if we serve more than 560GB/month, we can already consider self-distribution. RIA.com services serve several terabytes of static content per day, so we made the decision in favor of self-distribution long ago.

How to be friends with search engines


For many search engines, the important characteristics are TTFB (time to first byte) and how "close" the content is to the person looking for it; besides that, the text in links to the content, descriptions in Exif tags, uniqueness, content size, etc. also matter.
Everything I am writing about here is mainly about tuning TTFB and being closer to the user. You can use a User-Agent trick to detect search bots and serve them content from a separate server to avoid "jams" or slowdowns during peak periods (bots usually generate a uniform load), thereby making the search engines happy but not the user. We do not do that; besides, there is a suspicion that Google and Yandex trust the page-load timing data that Chrome and Yandex Browser report to them from the client's perspective.

It is also worth noting that the load from various bots can be so substantial that you end up spending almost half your resources on serving them. RIA.com projects serve about 10-15 million requests from bots per day (this includes not only static content but also regular pages), which is not much less than the number of requests from real users.

Optimization for distributing static content


Well, once the distribution process is set up, it's time to think about what can be done with the content itself so that it is more accessible, loads faster, takes up less space, and is attractive to search engines.

Source: https://habr.com/ru/post/353672/

