“The @Cloudflare team has just made changes that significantly improved the performance of our network, especially for the slowest requests. How much faster? We estimate that we are saving the Internet about 54 years of time per day that would otherwise have been spent waiting for sites to load.”
- Tweet by Matthew Prince, June 28, 2018
10 million websites, applications and APIs use Cloudflare to speed up content delivery to their users. At peak we process more than 10 million requests per second across 151 data centers. Over the years we have made many changes to our version of NGINX to cope with this growth. This article is about one of them.
How does NGINX work?
NGINX is one of the programs that use event loops to solve the C10K problem. Each time a network event arrives (a new connection, a request, a notification that we can send more data, etc.), NGINX wakes up, handles the event, and then goes back to whatever else it needs to do (which may be handling other events). When an event arrives, its data is already ready, which lets NGINX handle many simultaneous requests efficiently, without idle waiting.
num_events = epoll_wait(epfd, events, events_len, -1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
// handle event[1]: send out response to GET http://cloudflare.com/
For example, here’s what a code snippet for reading data from a file descriptor might look like:
// we got a read event on fd
while (buf_len > 0) {
    ssize_t n = read(fd, buf, buf_len);
    if (n < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            // try later when we get a read event again
        }
        if (errno == EINTR) {
            continue;
        }
        return total;
    }
    buf_len -= n;
    buf += n;
    total += n;
}
If fd is a network socket, this returns the bytes that have already arrived. The final call will return EWOULDBLOCK, which means the local receive buffer is drained and we should not read from this socket again until more data becomes available.
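For context, here is a minimal sketch (not NGINX's actual code) of how a socket ends up in an event loop like the one above: the file descriptor is made non-blocking and registered with epoll, so the loop only wakes up when there is data to read.

#include <fcntl.h>
#include <sys/epoll.h>

// make fd non-blocking and ask epoll to report when it becomes readable
int watch_socket(int epfd, int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;   // "there is data to read" events
    ev.data.fd = fd;       // so the loop knows which fd woke it up
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}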
Disk I/O is different from network I/O
If fd is a regular file on Linux, then EWOULDBLOCK and EAGAIN never happen, and read() always waits until the whole buffer is read, even if the file was opened with O_NONBLOCK. As the open(2) manual says:

Note that this flag has no effect for regular files and block devices.
In other words, the code above is essentially reduced to this:
if (read(fd, buf, buf_len) > 0) {
    return buf_len;
}
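This behavior is easy to check with a small standalone program (a sketch; the file path is arbitrary): read() on a regular file opened with O_NONBLOCK simply waits for the disk instead of failing with EAGAIN.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    // any regular file will do; the path here is just an example
    int fd = open("/etc/hostname", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    // this never fails with EAGAIN/EWOULDBLOCK: the kernel blocks until
    // the data has been read from disk (or we hit end of file)
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read returned %zd bytes\n", n);
    close(fd);
    return 0;
}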
If a handler needs to read from disk, it blocks the event loop until the read completes, and all subsequent event handlers have to wait.
This is fine for most workloads, because reading from disk is usually fast enough and much more predictable than waiting for a packet from the network, especially now that everyone has SSDs and all our caches are on SSDs. Modern SSDs have very low latency, typically tens of microseconds. On top of that, NGINX can run with multiple worker processes, so a slow event handler does not block requests in other processes. Most of the time you can rely on NGINX to handle requests quickly and efficiently.
SSD performance: not always as promised
As you might have guessed, these rosy assumptions do not always hold. If every read always took 50 μs, then reading 0.19 MB in 4 KB blocks (and we read in even larger blocks) would take only about 2 ms. But our tests showed that the time to first byte (TTFB) is sometimes much worse, especially at the 99th and 999th percentiles. In other words, the slowest read out of every 100 (or 1,000) reads often takes much longer.
Solid-state drives are very fast, but they are also notoriously complex. They have computers inside them that queue and reorder I/O, and they also perform various background tasks such as garbage collection and defragmentation. From time to time, a request slows down noticeably. My colleague Ivan Babrou ran some I/O benchmarks and observed read spikes of up to 1 second. Moreover, some of our SSDs have more of these latency outliers than others. Going forward we will take this metric into account when purchasing SSDs, but in the meantime we needed a solution for our existing hardware.
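Ivan's actual benchmark is not reproduced here, but a minimal sketch of the idea looks like this: read random 4 KB blocks with O_DIRECT (to bypass the page cache) and record the worst-case latency. The device path and scan range are illustrative.

#define _GNU_SOURCE   // for O_DIRECT
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;  // O_DIRECT needs aligned buffers

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  // illustrative device
    if (fd < 0) { perror("open"); return 1; }

    long long worst_us = 0;
    for (int i = 0; i < 100000; i++) {
        // random 4 KB-aligned offset within the first 4 GB of the device
        off_t off = ((off_t)(rand() % (1 << 20))) * 4096;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        if (pread(fd, buf, 4096, off) < 0) { perror("pread"); break; }
        clock_gettime(CLOCK_MONOTONIC, &b);
        long long us = (b.tv_sec - a.tv_sec) * 1000000LL +
                       (b.tv_nsec - a.tv_nsec) / 1000;
        if (us > worst_us) worst_us = us;
    }
    printf("worst 4KB read latency: %lld us\n", worst_us);
    close(fd);
    free(buf);
    return 0;
}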
Uniform load sharing with SO_REUSEPORT
It is hard to avoid having one slow response out of every 1,000 requests, but what we really want to avoid is cascading that delay onto the other 1,000 requests for a full second. Conceptually, NGINX can process many requests in parallel, but it runs only one event handler at a time. So I added a metric to measure this:
gettimeofday(&start, NULL);
num_events = epoll_wait(epfd, events, events_len, -1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
gettimeofday(&event_start_handle, NULL);
// handle event[1]: send out response to GET http://cloudflare.com/
timersub(&event_start_handle, &start, &event_loop_blocked);
The 99th percentile (p99) of event_loop_blocked turned out to exceed 50% of our TTFB. In other words, half of the time taken to serve a request is the result of the event loop being blocked by other requests. event_loop_blocked measures only about half of the blocking (because delayed calls to epoll_wait() are not measured), so the actual ratio of blocked time is much higher.
Each of our machines runs NGINX with 15 worker processes, so one slow I/O operation should block no more than about 6% of requests. But events are not evenly distributed: the busiest worker receives 11% of all requests.
SO_REUSEPORT can solve the problem of uneven distribution. Marek Majkowski has previously written about the downside of this approach in the context of other NGINX instances, but here it can largely be ignored: upstream connections in our cache are long-lived, so the slight increase in latency when opening a connection is negligible. This single configuration change, enabling SO_REUSEPORT, improved peak p99 by 33%.
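For reference, SO_REUSEPORT is enabled in NGINX through the reuseport parameter of the listen directive; a minimal sketch (the port and surrounding server block are illustrative):

http {
    server {
        # each worker process gets its own listening socket and accept
        # queue, so the kernel spreads new connections evenly across workers
        listen 80 reuseport;
    }
}

With reuseport, each worker accepts connections from its own kernel queue instead of competing over a shared one, which is what evens out the distribution.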
Moving read() to a thread pool: not a silver bullet
A solution to this problem is to make read() non-blocking. In fact, this feature is already implemented in stock NGINX! With the following configuration, read() and write() are executed in a thread pool and do not block the event loop:
aio threads;
aio_write on;
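For completeness, stock NGINX also lets you define and size the pool explicitly with the thread_pool directive and point aio at it; the pool name and sizes below are illustrative:

# main context: define a named pool (name and sizes are illustrative)
thread_pool iopool threads=32 max_queue=65536;

# http/server/location context: offload blocking I/O to that pool
aio threads=iopool;
aio_write on;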
But when we tested this configuration, instead of a 33x improvement in response time we saw only a small change in p99, well within the margin of error. We were quite discouraged by the result, so we set this option aside for a while.
There are several reasons why we did not see the significant improvement that the NGINX developers did. In their test they used 200 concurrent connections requesting 4 MB files stored on spinning disks. Spinning disks have far higher I/O latency, so the optimization had a much bigger effect there.
In addition, we are mostly concerned with p99 (and p999) performance. Optimizing the average latency does not necessarily fix the worst-case outliers.
Finally, in our environment typical file sizes are much smaller: 90% of our cache hits are for files smaller than 60 KB. Smaller files mean fewer occasions to block (we usually read the entire file in two reads).
Let's look at the disk I/O that a cache hit performs:
// we got a request for https://example.com whose cache key is 0xCAFEBEEF
fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
// read up to 32 KB of metadata headers
// this is the read that "aio threads" offloads to the thread pool
read(fd, buf, 32*1024);
We do not always read 32 KB. If the headers are small, we need to read only 4 KB (we do not use direct I/O, so the kernel rounds up to 4 KB).
The open() call looks harmless, but it actually costs resources. At a minimum, the kernel must check whether the file exists and whether the calling process has permission to open it. It has to find the inode for /cache/prefix/dir/EF/BE/CAFEBEEF, and to do that it has to look up CAFEBEEF in /cache/prefix/dir/EF/BE/. In short, in the worst case the kernel performs the following lookups:
/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF
That is 6 separate reads performed by open(), compared to just 1 read for read()! Fortunately, in most cases these lookups hit the dentry cache and never reach the SSD. But clearly, handling read() in a thread pool is only half of the picture.
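One way to see this in practice is to time the two calls separately. A minimal sketch (the cache path is the example from above; error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000LL + (b.tv_nsec - a.tv_nsec) / 1000;
}

int main(void) {
    char buf[32 * 1024];
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t2);

    // with a cold dentry cache, the open() column is where the
    // path lookups show up
    printf("open: %lld us, read (%zd bytes): %lld us\n",
           elapsed_us(t0, t1), n, elapsed_us(t1, t2));
    close(fd);
    return 0;
}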
The final piece: non-blocking open() in thread pools
So we patched NGINX so that open(), too, is mostly executed inside the thread pool and does not block the event loop. And here is the result of non-blocking open() and read() together:

On June 26 we rolled out the changes to our 5 busiest data centers, and the next day to all the other 146 data centers around the world. The overall peak p99 TTFB decreased 6x. In fact, if we add up all the time saved while handling 8 million requests per second, we save the Internet 54 years of waiting every day.
Our event loop is still not completely free of blocking. In particular, it still blocks when a file is cached for the first time (both the open(O_CREAT) and the rename()) and when revalidation updates are written. But such cases are rare compared to cache hits. In the future we will consider moving these operations out of the event loop as well, to further improve our p99 latency.
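For illustration, the first-time caching pattern mentioned above roughly follows the classic write-to-a-temporary-file-then-rename() approach (paths and error handling are a sketch, not NGINX's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int cache_store(const char *tmp_path, const char *final_path,
                const char *body, size_t body_len) {
    // both of these calls touch the disk and, today, block the event loop
    int fd = open(tmp_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) return -1;

    write(fd, body, body_len);
    close(fd);

    // atomically publish the finished file under its cache key
    return rename(tmp_path, final_path);
}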
Conclusion
NGINX is a powerful platform, but scaling extremely high I/O loads on Linux can be a daunting task. Stock NGINX can offload reads to separate threads, but at our scale we often need to go one step further.