
A few more words about Path MTU Discovery Black Hole






Instead of an Introduction




Sooner or later, every real system administrator (or anyone acting as one) faces a moment of truth: he has to set up a router on a computer running GNU/Linux. Those who have already done so know that there is nothing difficult about it and that a couple of commands are enough. So our admin finds these commands, types them into the console, and proudly goes to tell the users that everything is working. But no such luck: the users say that their favorite sites do not open. After he has spent some part of his life finding out the details, it turns out that most sites behave like this:

1. When a page is opened, the title loads and nothing else;

2. The page hangs in this state indefinitely;

3. The browser's status bar shows the whole time that the page is loading;

4. Pings and traceroutes to the site work normally;

5. A telnet connection to port 80 also works normally.

The discouraged admin calls the provider's technical support, but they quickly get rid of him by advising him to try setting up the router on Windows, and if it does not work there either, then ... to buy a hardware router.

I think this situation is familiar to many. Some have been in it themselves, some have seen acquaintances run into it, and some have met such administrators on forums and mailing lists. So, if this is your situation, congratulations! You are facing a Path MTU Discovery Black Hole. This article explains why it happens and how to solve the problem.






Terms Needed to Understand the Article




MTU (Maximum Transmission Unit) is the maximum packet size (in bytes) that can be transmitted at the data link layer of the OSI network model. For Ethernet it is 1500 bytes. If a larger packet arrives (for example, from Token Ring), the data is split into packets no larger than the MTU (that is, no more than 1500 bytes). Splitting packets to fit a different MTU is called fragmentation, and it is costly for the router.
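To make the cost of fragmentation more concrete, here is a minimal sketch of how an IPv4 packet would be cut up to fit a smaller MTU. The function name fragment_sizes is my own, and the sketch assumes a plain 20-byte IP header without options; it only computes sizes and does not build real packets.

```python
def fragment_sizes(total_length, mtu, header_len=20):
    """Return the total length of each IPv4 fragment of a packet.

    Assumes a fixed header of header_len bytes that is repeated in
    every fragment (no IP options).
    """
    if total_length <= mtu:
        return [total_length]
    payload = total_length - header_len
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_chunk = (mtu - header_len) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(max_chunk, payload)
        sizes.append(chunk + header_len)
        payload -= chunk
    return sizes


# A full-size Ethernet packet squeezed through a link with MTU 1400.
print(fragment_sizes(1500, 1400))  # [1396, 124]
```

Note that the router has to build a new header for every fragment, which is part of why fragmentation is expensive.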

PMTU (Path MTU) is the smallest MTU among the links between the source and the destination.

PMTU discovery is a technique for determining the PMTU, designed to reduce the load on routers. It is described in RFC 1191 (1990). Its essence is that the communicating hosts set the DF (don't fragment) flag on their packets, which prohibits fragmentation. As a result, a node whose MTU is smaller than the packet size refuses to forward the packet and sends back an ICMP Destination Unreachable message, with its MTU value attached to the error message. The sending host reduces the packet size and sends it again. This repeats until the packet is small enough to reach the destination host without fragmentation.
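The loop described above can be sketched as a toy simulation. This is not how any real TCP/IP stack is written; discover_pmtu and the list of link MTUs are hypothetical, and each "ICMP reply" is modeled simply as the MTU of the first link that is too small.

```python
def discover_pmtu(link_mtus, first_size):
    """Simulate a sender probing a path whose links have the given MTUs.

    Returns the discovered PMTU and the list of packet sizes that were tried.
    """
    size = first_size
    tried = []
    while True:
        tried.append(size)
        # The first link that cannot carry the packet drops it and
        # reports its own MTU back to the sender.
        blocking = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking is None:
            return size, tried
        size = blocking


pmtu, tried = discover_pmtu([1500, 1400, 1500], first_size=1500)
print(pmtu, tried)  # 1400 [1500, 1400]
```

If there are several bottlenecks along the path, the sender needs one round of this exchange for each of them.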

MSS (Maximum Segment Size) is the largest piece of data that TCP sends to the other end of the connection. It is calculated using the following formula:

MSS = interface MTU - IP header size (20 bytes) - TCP header size (20 bytes).

The total is usually 1460 bytes. When a connection is established, each side can declare its MSS. The smallest value is selected. More details can be found here.
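The formula is simple enough to check by hand, but a small helper makes the numbers used later in the article easy to reproduce. The name mss_for_mtu is mine; the header sizes are the minimum ones, without IP or TCP options.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def mss_for_mtu(mtu, ipv6=False):
    """Return the MSS a host would advertise for an interface MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(mss_for_mtu(1500))  # 1460, plain Ethernet
print(mss_for_mtu(1400))  # 1360, the reduced MTU from the test network
```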

DF (Don't Fragment) flag is a bit in the flags field of the IP packet header that, when set to one, forbids fragmenting the packet. If a packet with this flag is larger than the MTU of the next hop, it is discarded, and the ICMP error "fragmentation needed but DF set" is sent to the sender.
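For those who like to see the bit itself, below is a sketch that checks the DF flag in a raw IPv4 header. The helper has_df and the sample header are made up for illustration; the header is built by hand, and its checksum is left at zero because it does not matter here.

```python
import struct

DF_MASK = 0x4000  # the DF bit inside the 16-bit flags/fragment-offset field


def has_df(ip_header):
    """Return True if the Don't Fragment bit is set in a raw IPv4 header."""
    (flags_and_offset,) = struct.unpack_from("!H", ip_header, 6)
    return bool(flags_and_offset & DF_MASK)


# Hypothetical 20-byte header: version/IHL, TOS, total length, id,
# flags + fragment offset, TTL, protocol, checksum, source, destination.
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 1500, 5177, DF_MASK, 64, 6, 0,
    bytes([192, 168, 0, 1]), bytes([172, 16, 5, 3]),
)
print(has_df(header))  # True
```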



Test ground




This problem is best studied in practice (but not under time pressure, with management shouting in your ear). For this, I built a test network, shown in Figure 1.





Fig. 1. Test network.



This is a simplified version of the global network. Roles:

1. The computer named deb-serv-03 is our Linux router. Note that the MTU on its eth2 interface is reduced to 1400 bytes;

2. deb-serv-05 - client on the local network;

3. deb-home - a router located at the provider;

4. deb-serv - a web server on the Internet with which we want to exchange data. We fetch a 5.9 KB page from www.site.local, which is hosted on it.

Of course, in reality the chain is much longer, but this is enough for an illustrative example. All computers on this network run Debian GNU/Linux 5.0 Lenny. At various points in the network, I monitor the traffic with tcpdump.



Normal PMTU Discovery




To begin, let's see what happens on the network when the page is opened, and how the packets from the web server travel. Here is the output of TCPDUMP # 1 (on eth0 deb-serv):



1 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [S], seq 2947128725, win 5840, options [mss 1460...], length 0

2 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [S.], seq 757312786, ack 2947128726, win 5792, options [mss 1460...], length 0

3 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0

4 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117

5 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], ack 118, win 181, options [...], length 0

6 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:2897, ack 118, win 181, options [...], length 2896

7 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

8 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:1349, ack 118, win 181, options [...], length 1348

9 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1349:2697, ack 118, win 181, options [...], length 1348

10 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556




I quote only the first 10 packets and have cut everything unnecessary from the standard tcpdump output. Let's analyze it:

1. Lines 1 through 3 show the TCP connection being established. The parties exchange SYN, SYN-ACK, and ACK packets. Pay attention to the options field, namely the MSS parameter that the parties exchange. On both sides it is 1460 bytes. This means that the maximum packet size the parties will send to each other is 1460 (MSS) + 20 (TCP header) + 20 (IP header) = 1500 bytes.

2. Line 4 is the request for the web page from deb-serv-05. Line 5 is the acknowledgment of that packet.

3. Line 6 shows the response to the request (i.e., a piece of the web page) being sent. Probably because of how capture works on this interface (most likely segmentation offload), tcpdump sees a single 2948-byte packet, while two packets of 1500 bytes each (1448 bytes of data plus 52 bytes of headers) will actually go out on the network. The more detailed tcpdump output shows that the DF flag is set on this packet (more precisely, on these packets):

IP (tos 0x0, ttl 64, id 5177, offset 0, flags [DF], proto TCP (6), length 2948)

192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:2897, ack 118, win 181, options [nop,nop,TS val 86620459 ecr 4922429], length 2896


4. When these data packets reach deb-serv-03, they are discarded because they cannot pass through the link with MTU 1400 and cannot be fragmented (the DF flag). An ICMP type 3 code 4 message is generated in response: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), which we see in line 7 (line 10 is the message for the second packet). This message carries the required MTU.

5. In lines 8 and 9 we see deb-serv, having learned that MTU = 1400, send the same piece of the web page in 1400-byte packets. These packets reach deb-serv-05, which acknowledges them, and this repeats until the entire page is transferred. No subsequent packet will be larger than 1400 bytes.
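The lengths 1448 and 1348 in the capture may look odd next to the MSS of 1460. As far as I can tell, the difference is the TCP timestamp option visible in the detailed output (nop, nop, TS), which takes 12 bytes of every segment. A rough check, with payload_per_packet being a name I made up:

```python
IP_HEADER = 20
TCP_HEADER = 20
TIMESTAMP_OPTION = 12  # nop + nop + 10-byte timestamp option


def payload_per_packet(mtu, tcp_options=TIMESTAMP_OPTION):
    """Return how many bytes of data fit into one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER - tcp_options


print(payload_per_packet(1500))  # 1448, before the ICMP message arrives
print(payload_per_packet(1400))  # 1348, after the sender learns MTU 1400
```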

This example demonstrates the Path MTU (PMTU) discovery procedure described in RFC 1191. I have presented it in simplified form in Figure 2.





Figure 2. The procedure for determining the PMTU.



Meeting with Path MTU Discovery Black Hole




Now let's imagine that a new specialist has joined the provider and decided (for example, to protect against ICMP flood) to block ICMP packets from passing through deb-home, which is now in his care. Let's see what happens:

TCPDUMP output # 1 (on eth0 deb-serv):



1 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [S], seq 1723325723, win 5840, options [mss 1460...], length 0

2 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [S.], seq 2482933888, ack 1723325724, win 5792, options [mss 1460...], length 0

3 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0

4 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117

5 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], ack 118, win 181, options [...], length 0

6 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:2897, ack 118, win 181, options [...], length 2896

7 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

8 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

9 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

10 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448




TCPDUMP # 2 output (on eth0 deb-serv-03):



1 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [S], seq 1723325723, win 5840, options [mss 1460...], length 0

2 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [S.], seq 2482933888, ack 1723325724, win 5792, options [mss 1460...], length 0

3 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0

4 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117

5 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], ack 118, win 181, options [...], length 0

6 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

7 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

8 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1449:2897, ack 118, win 181, options [...], length 1448

9 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

10 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

11 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

12 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

13 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

14 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

15 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

16 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

17 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

18 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448

19 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556

20 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448




As you can see, the result is what one would expect. The first 6 lines of each output match the normal transfer (see the description in the previous example). After that the differences begin. ICMP type 3 code 4 is still generated on deb-serv-03 (lines 7, 9, 11, 13, 15, 17, and 19 in TCPDUMP # 2), but deb-serv never receives it and keeps sending 1500-byte packets (lines 6 to 10 in TCPDUMP # 1 and lines 6, 8, 10, 12, 14, 16, 18, and 20 in TCPDUMP # 2). The interval between retransmissions grows each time (I dropped the timestamps in these examples, but this is how the TCP retransmission mechanism works). Data in packets larger than the PMTU simply cannot get through. Alas, TCP does not know this and keeps sending packets sized according to the MSS chosen when the connection was established. This situation is called a Path MTU Discovery Black Hole. I have tried to present it in simplified form in Fig. 3.
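Since the timestamps were dropped from the captures, here is a rough sketch of how the gaps between retransmissions grow. It assumes the classic scheme in which the timeout doubles after every unanswered attempt; the starting timeout of 1 second and the 120-second ceiling are illustrative assumptions of mine, not values measured on the test network.

```python
def retransmit_times(initial_rto, attempts, max_rto=120.0):
    """Return the moments (seconds after the first send) of each retransmission.

    The timeout doubles after every unanswered attempt, up to max_rto.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(attempts):
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)
    return times


print(retransmit_times(1.0, 6))  # [1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
```

From the user's point of view this is exactly the page that "hangs indefinitely": the server keeps trying, just more and more rarely.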





Fig. 3. Black hole in PMTU discovery.



This problem is not new at all. It was described in RFC 2923 back in 2000. Nevertheless, it keeps turning up with enviable persistence at many providers. And it is the provider who is to blame for this situation: there is no need to block ICMP type 3 code 4. Moreover, providers usually do not want to listen to the "voice of reason" (i.e., clients who understand what the problem is).



Solving the problem with PMTU




We will not call technical support; instead we will try to solve the problem on our own.

The Linux developers, well aware of the problem, have provided a special target in iptables. A quote from man iptables:



TCPMSS

This target allows to alter the MSS value of TCP SYN packets, to control the maximum size for that connection (usually limiting it to your outgoing interface's MTU minus 40 for IPv4 or 60 for IPv6, respectively). Of course, it can only be used in conjunction with -p tcp. It is only valid in the mangle table. This target is used to overcome criminally braindead ISPs or servers which block "ICMP Fragmentation Needed" or "ICMPv6 Packet Too Big" packets. The symptoms of this problem are that everything works fine from your Linux firewall/router, but machines behind it can never exchange large packets:

1) Web browsers connect, then hang with no data received.

2) Small mail works fine, but large emails hang.

3) ssh works fine, but scp hangs after initial handshaking.

Workaround: activate this option and add a rule to your firewall configuration like:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \

-j TCPMSS --clamp-mss-to-pmtu



--set-mss value

Explicitly set MSS option to specified value.



--clamp-mss-to-pmtu

Automatically clamp MSS value to (path_MTU - 40 for IPv4; -60 for IPv6).



These options are mutually exclusive.






As you can see, a lot has been written there, including the approximate symptoms of the problem. The man page calls such behavior by providers "criminally braindead", and I fully agree. Let's see how this target works in our example. Add a rule modeled on the recommended one to deb-serv-03:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360

And see what happens:

TCPDUMP output # 1 (on eth0 deb-serv):



1 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [S], seq 1484543117, win 5840, options [mss 1360...], length 0

2 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [S.], seq 2230206317, ack 1484543118, win 5792, options [mss 1460...], length 0

3 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0

4 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117

5 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], ack 118, win 181, options [...], length 0

6 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1:2697, ack 118, win 181, options [...], length 2696

7 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1349, win 2184, options [...], length 0

8 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 2697:5393, ack 118, win 181, options [...], length 2696

9 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [FP.], seq 5393:6380, ack 118, win 181, options [...], length 987

10 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 2697, win 2908, options [...], length 0




TCPDUMP # 3 output (on eth0 deb-serv-05):



1 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [S], seq 1484543117, win 5840, options [mss 1460...], length 0

2 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [S.], seq 2230206317, ack 1484543118, win 5792, options [mss 1360...], length 0

3 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0

4 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117

5 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], ack 118, win 181, options [...], length 0

6 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1:1349, ack 118, win 181, options [...], length 1348

7 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1349:2697, ack 118, win 181, options [...], length 1348

8 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1349, win 2184, options [...], length 0

9 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 2697, win 2908, options [...], length 0

10 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 2697:4045, ack 118, win 181, options [...], length 1348




Let's analyze it:

1. Lines 1-3 show the already familiar TCP connection setup. But pay attention to the MSS values. In TCPDUMP # 1 the SYN from deb-serv-05 arrives with the value 1360, while TCPDUMP # 3 shows the same packet leaving with MSS = 1460. This is exactly how the rule with --set-mss 1360 works: it rewrites the MSS value of packets passing through. The value in the returning SYN-ACK packet is rewritten as well.

2. Lines 4 and 5 of both outputs again show the GET request being sent and its receipt acknowledged.

3. In line 6 of TCPDUMP # 1 and lines 6 and 7 of TCPDUMP # 3 we see data packets being sent, but now no packet is larger than 1400 bytes. The same odd artifact appears in TCPDUMP # 1, where one big packet is visible, while TCPDUMP # 3 shows two packets arriving.

4. The rest of the exchange follows the rules of the TCP protocol, and not once does the packet size exceed 1400 bytes.
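What the router does to a passing SYN can be sketched in user space. The function below walks the raw TCP options and lowers the MSS option if it exceeds a limit. This is only an illustration under my own assumptions: clamp_mss_option is a made-up name, the sketch never raises an MSS, and a real implementation must also recalculate the TCP checksum, which is omitted here.

```python
import struct


def clamp_mss_option(options, limit):
    """Return a copy of raw TCP options with the MSS option lowered to limit."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == 0:  # End of Option List
            break
        if kind == 1:  # NOP takes a single byte
            i += 1
            continue
        if i + 1 >= len(out):  # truncated option, stop parsing
            break
        length = out[i + 1]
        if length < 2:  # malformed option, stop parsing
            break
        if kind == 2 and length == 4 and i + 4 <= len(out):
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > limit:
                struct.pack_into("!H", out, i + 2, limit)
        i += length
    return bytes(out)


# Options of a hypothetical SYN: MSS 1460 followed by two NOPs.
syn_options = struct.pack("!BBH", 2, 4, 1460) + bytes([1, 1])
clamped = clamp_mss_option(syn_options, 1360)
print(struct.unpack_from("!H", clamped, 2)[0])  # 1360
```

Only the SYN and SYN-ACK carry the MSS option, which is why the iptables rule matches on the SYN flag and leaves all other packets alone.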



Fig. 4 shows, in simplified form, how the MSS is changed. I did not show the data exchange, since it is the same as in the normal case.



Fig. 4. Changing the MSS on the fly.



Although man iptables describes two options, so far I have used only one. Which option you need depends on the situation. All situations fall into 2 types:



1. Sites open normally on your router, but clients on the local network have problems.

In this case, the smallest MTU along the whole path is on your server. This is usually caused by an encapsulation protocol such as PPPoE, PPTP, and so on. For this situation the --clamp-mss-to-pmtu option is best: it automatically lowers the MSS on all transit packets to fit the path MTU.



2. Websites open neither on your router nor on clients in the local network.

In this case, the smallest MTU is somewhere at the provider, and it is difficult to determine by standard means. For exactly this case I wrote a small Python script (without caring about PEP8 or foolproofing) that helps determine the required MSS:



#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import socket
import sys
import time

# Host used for the check. Pick a site that does not open.
HOST = 'www.site.local'
# Seconds to wait for a response before the attempt counts as failed.
TIMEOUT = 25.0
# Bytes to request from the socket. Must be larger than the MTU.
BUF = 3000
# MTU of the outgoing interface.
MTU = 1500
# The MSS is searched between MTU-LIM-40 and MTU-40.
# 100-200 is usually enough.
LIM = 100
# Pause in seconds between attempts.
TRY_TIME = 0


def set_mss(mss, action='A'):
    # 'A' appends the TCPMSS rule to the OUTPUT chain, 'D' deletes it.
    return os.system("iptables -t mangle -%s OUTPUT -p tcp --tcp-flags "
                     "SYN,RST SYN -j TCPMSS --set-mss %d" % (action, mss))


def check_connection(host):
    # Returns the number of bytes received, or 0 if nothing arrived in time.
    sock = socket.socket()
    sock.connect((host, 80))
    sock.send(('GET / HTTP/1.1\r\nHost: %s\r\n\r\n' % host).encode('ascii'))
    sock.settimeout(TIMEOUT)
    try:
        answer_size = len(sock.recv(BUF))
    except socket.error:
        answer_size = 0
    sock.close()
    return answer_size


def main():
    mss = MTU - 40
    if not check_connection(HOST):
        mss = MTU - 40 - LIM
        set_mss(mss)
        if not check_connection(HOST):
            set_mss(mss, 'D')
            print("Error: Too small LIM")
            sys.exit(1)
        else:
            # Raise the MSS by one until the data stops arriving.
            while check_connection(HOST):
                time.sleep(TRY_TIME)
                set_mss(mss, 'D')
                if mss >= MTU - 40:
                    print("Error in determining MSS")
                    sys.exit(1)
                mss += 1
                set_mss(mss)
            set_mss(mss, 'D')
            mss -= 1
    print('MSS = %d' % mss)


if __name__ == '__main__':
    main()
    sys.exit(0)




The script must be run with superuser rights. Its algorithm is as follows:

1. We try to fetch some data from the site with the normal MSS value.

2. If that fails, we lower the MSS in the iptables OUTPUT chain to MTU - 40 - LIM.

3. If we still cannot receive data, we report that LIM is too small.

4. Increasing the MSS step by step, we look for the point at which data stops arriving. We then print the last working MSS value.

5. If we reach MSS = MTU - 40, we report that the MSS cannot be determined. This situation is contradictory, because step 1 runs the same test, and if the results disagree, something needs a closer look.
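The search itself can be separated from iptables and sockets, which makes it easy to try out without root rights. Below is a sketch of the same linear search in which the network check is replaced by an arbitrary probe function. The names find_working_mss and fake_probe are hypothetical, and the fake probe simply pretends that the path lets through segments with an MSS of up to 1360.

```python
def find_working_mss(probe, mtu=1500, lim=100):
    """Return the largest MSS for which probe(mss) succeeds, or None.

    probe is any callable that takes an MSS and returns True when data
    gets through with that value.
    """
    top = mtu - 40
    if probe(top):
        return top  # nothing is wrong with the normal MSS
    mss = top - lim
    if not probe(mss):
        return None  # even the lowest value fails: lim is too small
    # Step upward while the next value still works.
    while mss < top and probe(mss + 1):
        mss += 1
    return mss


def fake_probe(mss):
    # Stand-in for a real connection test: the path passes MSS <= 1360.
    return mss <= 1360


print(find_working_mss(fake_probe))  # 1360
```

With a real probe every step costs up to a full timeout, so a binary search over the same range would need far fewer attempts; I kept the linear version here because it mirrors the steps listed above.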



Once you have the required MSS, put it into the corresponding rule. You can do without the script by simply lowering the MSS with a margin, but it is better to find the exact value: the larger the MSS, the less overhead per packet.



Forums often advise lowering the MTU on some interface. You need to understand that this is not a panacea, and the result depends on which interface is changed. If we lower the MTU on an interface of one of the endpoints of the TCP connection, it will have an effect, because the advertised MSS will then match the smaller packet size. But if it is not an endpoint but one of the transit routers, there will be no effect without the --clamp-mss-to-pmtu option.



I hope this article helps you solve this problem, both for yourself and for your friends and acquaintances. Once again I appeal to the providers' specialists: UNLESS ABSOLUTELY NECESSARY, DO NOT BLOCK ICMP TYPE 3 CODE 4. By doing so you create problems for your colleagues.

Source: https://habr.com/ru/post/136871/


