Note: our colleagues in the trade, specialists from another service that speeds up and protects websites, once ran into a situation where some very slow file downloads by users were suddenly being cut off. Below is a translation of their story about the problem, with our comments.
The problem: some users could not download a binary file of a few megabytes. The connection was being dropped for some reason even though the file was still downloading. We soon became convinced that there was a bug somewhere in our system. The problem could be reproduced with a single curl command, but fixing it took an incredible amount of time and effort.
Problematic downloads
Two things about this problem surprised us: first, only users of mobile devices were affected, and second, the file causing the trouble was fairly large, about 30 megabytes.
After some productive work with tcpdump, one of our engineers was able to reproduce the problem. It turned out to be enough to put a large file on a test domain and download it with curl using the --limit-rate option:
$ curl -v http://example.com/large.bin --limit-rate 10k > /dev/null
* Closing connection #0
curl: (56) Recv failure: Connection reset by peer
Digging through the tcpdump capture showed that there was always an RST packet coming from our server exactly 60 seconds after the connection was established:
$ tcpdump -tttttni eth0 port 80
00:00:00 IP 192.168.1.10.50112 > 1.2.3.4.80: Flags [S], seq 3193165162, win 43690, options [mss 65495,sackOK,TS val 143660119 ecr 0,nop,wscale 7], length 0
...
00:01:00 IP 1.2.3.4.80 > 192.168.1.10.50112: Flags [R.], seq 1579198, ack 88, win 342, options [nop,nop,TS val 143675137 ecr 143675135], length 0
Our server was clearly doing something wrong. An RST packet leaving our server is bad news: the client is behaving well, sending ACK packets and consuming data as fast as it can, and yet we abruptly cut the connection.
Not our problem?
To isolate the problem, we started a plain NGINX server with default settings, and the problem was easily reproduced locally:
$ curl --limit-rate 10k localhost:8080/large.bin > /dev/null
* Closing connection #0
curl: (56) Recv failure: Connection reset by peer
This showed that the problem was not specific to our installation; it was a broader issue with NGINX.
Digging further, we found that we were using the reset_timedout_connection setting, which makes NGINX break connections abruptly. When NGINX wants to close a connection on a timeout, it sets SO_LINGER with a zero linger time on the socket and then calls close().
This triggers an RST packet instead of the normal TCP connection teardown. Here is the strace log from NGINX:
04:20:22 setsockopt(5, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
04:20:22 close(5) = 0
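To make this concrete, here is a minimal C sketch (an illustration, not NGINX's actual code) of the same sequence: with SO_LINGER enabled and a zero linger time, close() drops any queued data and aborts the connection with an RST instead of a normal FIN. The peer address is just a placeholder.

/* Illustrative sketch: closing a TCP socket with SO_LINGER {onoff=1, linger=0}
 * aborts the connection with an RST instead of a normal FIN. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);                  /* placeholder peer */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    /* Zero linger time: close() discards queued data and sends an RST. */
    struct linger lin = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));

    close(fd);                                  /* peer sees "Connection reset by peer" */
    return 0;
}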
We could just disable reset_timedout_connection, but that would not solve the problem. The question was: why does NGINX close this connection at all?
Next, we noticed the send_timeout parameter. Its default value is 60 seconds, exactly the interval we observed in our case:
http {
    send_timeout 60s;
    ...
The send_timeout parameter exists in NGINX to make sure that every connection is eventually terminated. It controls the time allowed between successive send/sendfile calls on each connection. Simply put, it is not acceptable for one connection to occupy a server resource for too long: if a download takes too long, or stalls completely, it is fine for the HTTP server to drop it.
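To illustrate how a write timeout like this typically works (a rough sketch of the general pattern, not NGINX's source), an event loop waits for the socket to become writable and gives up if the wait exceeds the configured timeout:

/* Rough sketch of a generic write timeout (not NGINX code): wait up to
 * timeout_ms for the socket to become writable before sending more data.
 * If poll() never reports POLLOUT within the timeout, abort the connection. */
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

int send_with_timeout(int fd, const char *buf, size_t len, int timeout_ms) {
    size_t off = 0;
    while (off < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int ready = poll(&pfd, 1, timeout_ms);  /* e.g. 60000 ms for send_timeout 60s */
        if (ready <= 0)
            return -1;                          /* timed out or failed: drop the connection */
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    return 0;                                   /* all data queued for sending */
}

In exactly this kind of loop, a socket that never becomes writable looks the same as a dead client, which turns out to be the crux of the bug.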
Not an NGINX problem either
With strace in hand, we looked at what NGINX was actually doing:
04:54:05 accept4(4, ...) = 5
04:54:05 sendfile(5, 9, [0], 51773484) = 5325752
04:55:05 close(5) = 0
In the configuration we told NGINX to use sendfile to transfer the data. The sendfile call succeeds and pushes 5 megabytes of data into the send buffer. Interestingly, this is almost exactly the default write buffer size we had configured:
$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 5242880 33554432
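As a side note, the middle value of net.ipv4.tcp_wmem is the default per-socket send buffer, so a freshly created TCP socket should report roughly that number (a quick check, assuming default Linux behavior and no explicit SO_SNDBUF override):

/* Quick check: print the default send buffer size of a new TCP socket.
 * On Linux this should roughly match the middle value of net.ipv4.tcp_wmem. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    printf("default SO_SNDBUF: %d bytes\n", sndbuf);
    close(fd);
    return 0;
}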
One minute after the first sendfile call, the socket is closed. Let's see what happens if we increase send_timeout to some larger value, for example 600 seconds:
08:21:37 accept4(4, ...) = 5
08:21:37 sendfile(5, 9, [0], 51773484) = 6024754
08:24:21 sendfile(5, 9, [6024754], 45748730) = 1768041
08:27:09 sendfile(5, 9, [7792795], 43980689) = 1768041
08:30:07 sendfile(5, 9, [9560836], 42212648) = 1768041
...
After the first big chunk of data, sendfile is called several more times, transferring roughly 1.7 megabytes on each successive call. Between those calls, each about 180 seconds apart, the socket was constantly being drained by the slow curl, so why didn't NGINX keep refilling it?
Asymmetry
The Unix motto is "everything is a file". Put differently, "everything can be read from or written to with poll". Let's look at how network sockets behave in Linux.
The semantics of reading from a socket are simple:
- A read() call returns whatever data is available in the socket, until it is empty.
- poll reports the socket as readable when there is any data in it.
You might assume that writing to a socket works symmetrically:
- A write() call copies data into the send buffer, until the send buffer is full.
- poll reports the socket as writable if there is at least some free space in it.
Surprisingly, this is NOT true.
Different code paths
It is important to understand that the Linux kernel has two separate code paths here: one actually sends the data, and another checks whether a socket is writable.
For a send() call to succeed, two conditions must be met:
- There must be some free space in the send buffer.
- The amount of queued, unsent data must be below the LOWAT setting. In our case this was never an issue, so we will simply ignore this condition.
On the other hand, the conditions under which poll reports a socket as writable are somewhat stricter:
- There must be some free space in the send buffer.
- The amount of queued, unsent data must be below the LOWAT setting.
- The free space in the send buffer must be greater than half of the used space in the buffer.
The last condition is critical. Once the send buffer is filled to 100%, the socket is reported as writable again only after its fill level drops below roughly 66%.
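Here is a simplified model of that check (a sketch of the logic described above, not the actual kernel source): the socket counts as writable only when the free space is at least half of the data already queued, which for a full 5 MB buffer means waiting until roughly 1.7 MB has drained.

/* Simplified model of when Linux reports a TCP socket as writable to poll():
 * the free space must be at least half of the data already queued.
 * (A sketch of the logic described above, not the actual kernel code.) */
#include <stdio.h>

static int reports_pollout(long sndbuf, long queued) {
    long free_space = sndbuf - queued;
    return free_space >= queued / 2;   /* equivalent to: queued <= ~66% of sndbuf */
}

int main(void) {
    long sndbuf = 5242880;             /* default send buffer from net.ipv4.tcp_wmem */
    for (int pct = 100; pct >= 50; pct -= 10) {
        long queued = sndbuf * pct / 100;
        printf("buffer %3d%% full -> writable: %s\n",
               pct, reports_pollout(sndbuf, queued) ? "yes" : "no");
    }
    return 0;
}

The output flips from "no" to "yes" between 70% and 60% full, consistent with the ~66% threshold.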
Going back to the NGINX trace, in the second experiment with sendfile we saw the following:
08:24:21 sendfile(5, 9, [6024754], 45748730) = 1768041
We successfully sent 1.7 megabytes of data, which is close to 33% of 5 megabytes, our default send buffer size from tcp_wmem.
This threshold was probably chosen in Linux to avoid refilling the buffer too often: there is no need to wake up the sending program after every byte sent.
Solution
Now we can say exactly when the problem occurs:
- The send buffer is filled to at least 66%.
- The client's download speed is low, so the buffer does not drain below 66% within 60 seconds.
- When that happens, the send buffer is never refilled, the socket is never reported as writable, and the connection is closed on the timeout.
For example, at curl's throttled 10 KB/s, draining the roughly 1.7 MB needed to cross the 66% mark takes about 170 seconds, far longer than the 60-second send_timeout.
There are several ways to solve the problem.
One is to increase send_timeout to, say, 280 seconds. Then, given the size of the send buffer, users downloading faster than about 50 Kbps would never be disconnected by the timeout (draining a third of the 5 MB buffer, roughly 1.7 MB, within 280 seconds requires about 6 KB/s, which is roughly 50 Kbps).
Another option is to reduce the tcp_wmem send buffer size.
The last option is to patch NGINX so that it reacts differently to the timeout. Instead of closing the connection immediately, it could look at the amount of data left in the send buffer, which can be obtained with ioctl(TIOCOUTQ). Knowing how fast the buffer is draining, NGINX could give a slowly but steadily progressing connection a little more time.
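A sketch of how that check could look (an illustration of the idea, not the actual patch): query the number of unsent bytes with ioctl(TIOCOUTQ) on each timeout tick and compare it with the previous reading to see whether the client is still making progress.

/* Illustrative helper (not the actual NGINX patch): on a write timeout,
 * ask the kernel how many bytes are still unsent in the send queue
 * (TIOCOUTQ, see tcp(7)). If the number has dropped since the last check,
 * the client is still draining data, so the timeout could be extended
 * instead of resetting the connection. */
#include <sys/ioctl.h>

static int client_is_still_draining(int fd, int *last_outq) {
    int outq = 0;
    if (ioctl(fd, TIOCOUTQ, &outq) != 0)
        return 0;                       /* on error, fall back to closing */
    int draining = (outq < *last_outq);
    *last_outq = outq;
    return draining;
}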
Chris Branch prepared a patch for NGINX that implements this approach with a send_minimum_rate option, which lets you specify how slow a client download is allowed to be.
Conclusions
The Linux network stack is very complex. Although it usually works well, sometimes you can find surprises in it.
Even very experienced programmers do not know all of its internals. While debugging this error, we learned that setting timeouts on the write path requires special care: you cannot simply set write timeouts the same way as read timeouts.
And now we know that the tcp_wmem values can lead to unexpected timeouts when sending data.