
Distributing static content: every millisecond counts


Eight years ago I wrote an article about speeding up the delivery of static content; it drew interest from Habr readers and stayed relevant for a long time.

So we decided to accelerate what already works fast and, along the way, share how it turned out. Of course, I will talk about the rakes we stepped on, about where HTTP/2 is not needed, why we buy a 7.68TB NVMe SSD instead of 8x1TB SATA SSDs, and a lot of other highly specialized details.
Let's agree right away that storing content and distributing it are two different tasks, and here we will talk only about distribution (an advanced cache).

Let's start with the hardware...

NVMe SSD


As you have already understood, we keep up with progress and store the cache on a modern 7.68 TB HGST Ultrastar SN260 (HUSMR7676BHP3Y1, NVMe, HH-HL AIC) SSD. Benchmarks of the drive do not paint quite as pretty a picture as the marketing materials, but they are quite optimistic:

 [root@4 www]# hdparm -Tt --direct /dev/nvme1n1
 
 /dev/nvme1n1:
  Timing O_DIRECT cached reads:   2688 MB in 2.00 seconds = 1345.24 MB/sec
  Timing O_DIRECT disk reads:     4672 MB in 3.00 seconds = 1557.00 MB/sec
 
 [root@4 www]# hdparm -Tt /dev/nvme1n1
 
 /dev/nvme1n1:
  Timing cached reads:            18850 MB in 1.99 seconds = 9452.39 MB/sec
  Timing buffered disk reads:     4156 MB in 3.00 seconds = 1385.08 MB/sec

Of course, you choose the size and manufacturer for yourself, and you can also consider a SATA interface, but here we describe what is worth striving for :)

If you did opt for NVMe, install the nvme-cli package to get information about the drive.
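On a CentOS-like system (judging by the prompts in this article) it is installed straight from the repositories; the package name on other distributions may differ:

 yum install nvme-cli

Now we can look at the characteristics of our "workhorse":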

 [root@4 www]# nvme smart-log /dev/nvme1n1
 Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
 critical_warning                    : 0
 temperature                         : 35 C
 available_spare                     : 100%
 available_spare_threshold           : 10%
 percentage_used                     : 0%
 data_units_read                     : 158 231 244
 data_units_written                  : 297 968
 host_read_commands                  : 45 809 892
 host_write_commands                 : 990 836
 controller_busy_time                : 337
 power_cycles                        : 18
 power_on_hours                      : 127
 unsafe_shutdowns                    : 14
 media_errors                        : 0
 num_err_log_entries                 : 10
 Warning Temperature Time            : 0
 Critical Composite Temperature Time : 0
 Temperature Sensor 1                : 35 C
 Temperature Sensor 2                : 27 C
 Temperature Sensor 3                : 33 C
 Temperature Sensor 4                : 35 C

As you can see, the drive feels great; looking ahead, I will say that under load it stays within the same temperature range.

During peak periods we serve about 4000 photos per second (each photo is about 10-100K in size); at half that load, iostat shows no more than 0.1% utilization. RAM also plays a big role here, but I'll get to that below.
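If you want to watch this yourself, extended per-device statistics can be printed every second with iostat from the sysstat package (the 0.1% above presumably refers to the %util column):

 iostat -x 1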

A few words about why we are now betting on an expensive NVMe drive instead of a bag of cheap SATA SSDs. Our tests show that, with a similar server architecture, the same RAM and the same load, the Ultrastar SN260 7.68TB NVMe runs with 10 times less iowait than 8x Samsung SSD 850 PRO 1TB in a striped RAID on an Areca ARC-1882 PCIe RAID controller. The servers differ slightly in core count (26 cores with the NVMe, 24 with the ARC-1882); both have 128G of RAM.

Unfortunately, there was no way to compare power consumption on those two servers. We could, however, compare the NVMe platform with a similarly built AMD system (24 AMD Opteron 6174 cores) running a striped software RAID of 8x Intel SSDSC2BB480G4 480G drives attached to an ARC-1680 PCIe RAID controller: under the same load, the new system draws 2.5 times less power, 113 Watts versus 274 Watts on the AMD. CPU load and iowait there are also an order of magnitude lower (the AMD has no hardware encryption).

File system


Eight years ago we used btrfs and tried XFS, but ext4 behaves more responsively under heavy parallel load, so our choice is ext4. Proper tuning can squeeze even more out of this already excellent filesystem.
Optimization starts at formatting time. For example, if you mostly serve 1-5K files, you can slightly reduce the block size when formatting:

 mkfs.ext4 -b 2048 /dev/sda1 

or even

 mkfs.ext4 -b 1024 /dev/sda1 

To find out the current block size on the file system, use:

 tune2fs -l /dev/sda1 | grep Block 

or

 [root@4 www]# fdisk -l /dev/nvme1n1
 
 Disk /dev/nvme1n1: 7681.5 GB, 7681501126656 bytes, 1875366486 sectors
 Units: sectors of 1 * 4096 = 4096 bytes
 Sector size (logical/physical): 4096 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 4096 bytes
 Disklabel type: dos
 Disk identifier: 0x00000000

Since the sector size on our NVMe SSD is 4K, it makes no sense to set the block size lower than that value, so we format with the default:

 mkfs.ext4 /dev/nvme1n1 

Note that I did not partition the disk but formatted the whole block device. The OS lives on a separate SSD, which is partitioned as usual. The cache filesystem can then be mounted simply as /dev/nvme1n1.

It is advisable to mount the disk in a way that squeezes maximum speed out of it: disable everything unnecessary and mount with the "noatime,barrier=0" options. If the atime attribute matters to you, kernels 4.0 and above offer the lazytime option, which keeps atime updates in RAM and partially solves the problem of frequent access-time writes during reads.
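As a sketch with those options (the mount point /var/cache/ssd matches the nginx example below):

 mount -o noatime,barrier=0 /dev/nvme1n1 /var/cache/ssd

or permanently, via a line in /etc/fstab:

 /dev/nvme1n1  /var/cache/ssd  ext4  noatime,barrier=0  0 0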

RAM


If your static content fits entirely in RAM, forget everything written above and enjoy serving files from RAM.

When serving files, the OS itself pulls frequently requested ones into its cache, and you do not need to configure anything for that. But if the device holding the files is slow and there are a lot of files (hundreds of thousands), it may happen that the OS requests many files from the filesystem at once and your storage device starts to noticeably "slow down". You can try to solve this problem by loading the static content onto a RAM disk and serving it from there.

An example of creating a RAM disk:

 mount -t tmpfs -o size=1G,mode=0700,noatime tmpfs /cache 

If you forget which parameters you mounted with, you can check with findmnt:

 findmnt --target /cache 


You can remount without rebooting:

 mount -o remount,size=4G,noatime /cache 

You can also combine the two: part of the content in RAM (some frequently requested thumbnails), the rest on the SSD.

In nginx it will look something like this:

 location / {
     root /var/cache/ram;
     try_files $uri @cache1;
 }
 
 location @cache1 {
     root /var/cache/ssd;
     try_files $uri @storage;
 }

If the content does not fit in RAM, you have to trust the kernel: we install 128G of RAM with an active cache size of 3-5G and are thinking about increasing it to 256G.

CPU


There are no special requirements for processor frequency; rather, there is a requirement for functionality: if your traffic needs to be encrypted (for example, served over HTTPS), it is important to choose a processor that supports AES-NI hardware encryption (Intel Advanced Encryption Standard New Instructions).

On Linux, you can verify that the processor supports the AES-NI instructions with the command:

 grep -m1 -o aes /proc/cpuinfo
 aes

If "aes" is not printed, the processor does not support these instructions and encryption will devour CPU performance.
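A quick way to feel the difference is openssl's built-in benchmark: the -evp code path picks up AES-NI when it is available, while the legacy path does not, so on AES-NI hardware the first command is typically several times faster:

 openssl speed -evp aes-128-cbc
 openssl speed aes-128-cbc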

We configure nginx


The general ideas are described in the previous article, but a few points about optimization are still worth spelling out.
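For static files, a minimal sketch of typical nginx tuning looks something like this (the directives are standard nginx; the specific values are assumptions, tune them for your own load):

 # zero-copy file serving from the kernel
 sendfile        on;
 tcp_nopush      on;
 
 # cache file descriptors and metadata of hot files
 open_file_cache          max=100000 inactive=20s;
 open_file_cache_valid    30s;
 open_file_cache_min_uses 2;
 open_file_cache_errors   on;
 
 # let browsers cache static content for a long time
 expires 30d;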


What about CDN


A CDN is good if you work across different countries and have a lot of content but little traffic. If there is a lot of traffic, little content, and you work in one specific country, it makes sense to do the math and understand what is more profitable for you. For example, we work in the Ukrainian market, and many of the world's leading CDN providers have no servers in Ukraine, so delivery comes from Germany or Poland. Out of nowhere, we get +30-50ms on the response instead of +3-5ms. Hosting a 2U server in a good Ukrainian DC starts at $18, plus payment for the channel, for example $10 for 100Mbps, $28 in total. The commercial price of CDN traffic in Ukraine is about $0.05/GB, i.e. if we serve more than 560GB/month, we can already consider self-distribution. RIA.com services serve several terabytes of static content per day, so we made the decision in favor of self-distribution long ago.

How to be friends with search engines


For many search engines, the important characteristics are TTFB (time to first byte) and how "close" the content is to the person looking for it; besides that, the text in links to the content, descriptions in Exif tags, uniqueness, content size, etc. also matter.
Everything I am writing about here is mainly about tuning TTFB and being closer to the user. You can use a User-Agent trick to detect search bots and serve them content from a separate server to avoid "jams" or slowdowns during peak periods (bots usually generate a uniform load), thereby making the search engines happy but not the user. We do not do that; besides, there is a suspicion that Google and Yandex trust the page-load timing data that Chrome and Yandex Browser report to them from the client's perspective.

It is also worth noting that the load from various bots can be so substantial that you end up spending almost half your resources on serving them. RIA.com projects serve about 10-15 million requests from bots per day (this includes not only static content but also regular pages), which is not much less than the number of requests from real users.

Optimization for distributing static content


Well, once the distribution process is set up, it's time to think about what can be done with the content itself so that it is more accessible, loads faster, takes up less space, and is attractive to search engines.

Source: https://habr.com/ru/post/353672/

