MTU and MSS are two terms that are easily confused, and their misconfiguration is often the cause of networking problems. Spend enough time working on production systems that interface with large networks of computers or the Internet and you are almost guaranteed to come across a situation where an interface was configured with the wrong MTU, or a firewall was filtering ICMP, leaving a client unable to transfer large amounts of data even though smaller transfers work fine. This post will walk through MTU, MSS and packet size negotiation for TCP connections, and the common situations where it breaks down. It was inspired by multiple discussions during the course of investigating errors on production systems as part of my role at GitHub.

If you want to take away a simple snippet from this post, the summary is:

The MTU of the interfaces on either side of a physical
or logical link must be equal. Don't block ICMP.

The examples mentioned in this blog post will be reproducible in the lab from theojulienne/blog-lab-mtu - clone this repository and bring up the lab, then poke around at these examples in a real system:

$ git clone https://github.com/theojulienne/blog-lab-mtu.git
$ cd blog-lab-mtu
$ vagrant up
$ vagrant ssh

Ethernet MTU: Maximum Transmission Unit

The MTU (Maximum Transmission Unit) on an Ethernet network specifies the maximum size of the payload that can be carried in a single Ethernet frame, alongside the Ethernet header. Typically this payload is an IP packet, in which case the MTU specifies the maximum combined size of the IP header and IP data.

Ethernet / IP / TCP headers with MTU indicated

The MTU is specified at the interface level as it is a link-level setting, and is typically propagated down to the underlying network card driver. Packets larger than this configured size that appear on the wire are treated as invalid or corrupt and are dropped before any software has the chance to see them. In a valid configuration, hosts connected together via a link will have the same MTU specified:

Client and server connected with matching MTU

If interfaces on either side of a link have mismatching MTU configurations, then the smaller side will treat packets larger than the local MTU as invalid and drop the packets before any software has the chance to see them.

Client and server connected with mismatching MTU

Streams of data larger than the MTU will be broken up into packets that completely fill an Ethernet frame, up to the MTU in each. If the remote end has a smaller MTU configured for the same link, those larger packets will be dropped. The MTU should be configured identically on the interfaces on either side of a link, and so should be considered a bidirectional maximum.
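
As a concrete example of this interface-level configuration, on Linux the MTU can be inspected and changed with the ip tool from iproute2. The interface name eth0 below is just a placeholder, and the output shown is only indicative of what to expect:

# show the current MTU of an interface
$ ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP ...

# change the MTU of that interface (requires root)
$ sudo ip link set dev eth0 mtu 9000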

MTU in the lab

The lab in this blog post can be used to observe this in an example system. In one terminal, bring up the lab hosts inside the Vagrant machine:

$ vagrant ssh -- /vagrant/bin/run-lab

In another terminal, run vagrant ssh, then enable the first scenario from above, with a matching MTU of 1500 on both client and server:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_1500

Log in to the client and server hosts and observe that we can send a packet with 1400 bytes of payload as expected, since both hosts have an MTU of 1500. The -s 1400 argument to ping sets the payload size, and the -M do argument instructs ping to set the DF (Don’t Fragment) bit, ensuring that the whole IP packet must arrive in one piece or not at all.

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server-direct
PING server-direct (172.28.0.40) 1400(1428) bytes of data.
1408 bytes from server-direct (172.28.0.40): icmp_seq=1 ttl=64 time=0.096 ms
1408 bytes from server-direct (172.28.0.40): icmp_seq=2 ttl=64 time=0.080 ms

--- server-direct ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 2ms
rtt min/avg/max/mdev = 0.080/0.088/0.096/0.008 ms
root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client-direct
PING client-direct (172.28.0.10) 1400(1428) bytes of data.
1408 bytes from client-direct (172.28.0.10): icmp_seq=1 ttl=64 time=0.079 ms
1408 bytes from client-direct (172.28.0.10): icmp_seq=2 ttl=64 time=0.071 ms

--- client-direct ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 32ms
rtt min/avg/max/mdev = 0.071/0.075/0.079/0.004 ms
root@server:/# 

Now switch to the second scenario with mismatching MTU, and observe that 1400 byte payloads no longer succeed:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_mismatch

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server-direct
PING server-direct (172.28.0.40) 1400(1428) bytes of data.
ping: local error: Message too long, mtu=1200
ping: local error: Message too long, mtu=1200

--- server-direct ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 10ms

root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client-direct
PING client-direct (172.28.0.10) 1400(1428) bytes of data.

--- client-direct ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 19ms

root@server:/# 

Notice that the client host is immediately able to observe that it cannot send a packet this large, since the MTU on its own interface is 1200. The server host, however, believes the MTU of the link is 1500, so it sends the packet, but the client is unable to receive it. This occurs at such a low level that neither host is aware of the failure - the packet just disappears.
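
One rough way to see this for yourself (assuming the lab interfaces are named eth0 inside each host, which may not be the case) is to leave a capture running on each side while the server pings: the oversized echo requests should be visible leaving the server, but should never appear in the client's capture, since they are dropped before the capture point.

root@server:/# tcpdump -ni eth0 icmp   # the 1428-byte echo requests are visible leaving
root@client:/# tcpdump -ni eth0 icmp   # nothing arrives for those packets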

TCP MSS: Maximum Segment Size

The TCP MSS (Maximum Segment Size) sounds very similar to MTU, and since it also relates to the maximum size of network packets, the two are easy to conflate even though they are quite different. A TCP segment is the TCP header and TCP data that forms part of a single packet. The MSS specifies the maximum size of the data component of this segment that a host expects to be able to receive without the IP packet being fragmented. IP fragmentation is typically disabled for TCP packets on modern networking stacks due to the added complexity and overhead, so the MSS represents the maximum size that the host expects to be able to receive in any given packet.

Ethernet / IP / TCP headers with MTU and MSS indicated

Rather than being an interface-level configuration like MTU, the MSS advertisement forms a part of the typical TCP handshake and is calculated based on the underlying MTU of the interface that a local host will use to communicate with a remote host. MSS can be thought of as a TCP hint around how much data can be included in a single TCP packet, given the current MTU. Each host calculates the MSS it will advertise by taking the local MTU and subtracting the size of the IP and TCP headers, then includes that MSS in the TCP options of the SYN or SYN-ACK packet as part of the TCP three-way handshake.

This is not a negotiation of a single MSS; rather, each host is giving the remote host an indication of the maximum size of a single packet it expects will be possible to send back. This number must be less than the MTU minus the IP/TCP headers, since there's no way any larger packet could arrive given the local MTU. Each host will use the remote host's advertised MSS as a hint for what size individual outgoing packets should be. Since this is a configurable hint, it is also only unidirectional, and although a host may advertise a lower MSS than it could otherwise handle, that doesn't in any way restrict it from sending packets larger than the MSS it advertised (provided the remote host allowed for it).

The simple MSS exchange happens to work around small misconfigurations of MTU, such as the trivial example described above:

Client and server connected with mismatching MTU

In this case, the client would advertise an MSS of 1200 (MTU) - 20 (IP hdr) - 20 (TCP hdr) = 1160, which would cause the server to refrain from sending packets containing more than 1160 bytes of TCP payload, ensuring each packet still fits within the MTU of 1200 once those headers are added back on.

However, the above network is still misconfigured: even though TCP happens to work around it, other protocols will fail since they don't exchange MSS values. MSS is actually intended to allow hosts to work around valid configurations where their own local networks have different MTUs, such as the following:

Client and server connected with different MTUs, but valid configuration

In this example, if the server with a valid MTU of 9000 attempted to send an Ethernet frame containing more than 1500 bytes without fragmentation being allowed, that packet would not be able to make it to the client. The intermediary router, being the first host aware of this problem since it knows the MTU of both links, would send an ICMPv4 "Fragmentation required, but DF set" message or an ICMPv6 "Packet Too Big" message back to the sender, informing it that the packet cannot be forwarded without breaking it up (and that the IP header had the DF, or Don't Fragment, bit set, so breaking it up is not permitted).

However, TCP communication between these hosts will succeed without restriction thanks to the MSS advertisements. The server in this configuration will receive an MSS from the client that ensures no Ethernet frames with a payload larger than 1500 bytes are generated, so they will all be received successfully.
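
On Linux, the values each side ended up with for an established connection can be inspected with ss, which reports the MSS in use, the MSS that was advertised locally, and the current Path MTU. The filter and field values below are only illustrative of the kind of output to expect:

# show TCP internals (mss, advmss, pmtu) for connections to port 80
$ ss -tin 'dport = :80'
...
         cubic wscale:6,6 rto:204 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 ...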

MSS in the lab

Select the scenario from above with the client on a network with 1500 MTU and the server on a network with 9000 MTU:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario client_net_smaller

Running a ping from the side with the larger MTU, we can observe that packets larger than the client’s MTU cause the intermediary router to return an ICMP message since it is unable to forward the packet:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 3000 client
PING client (172.29.0.10) 3000(3028) bytes of data.
From 172.30.0.20 icmp_seq=1 Frag needed and DF set (mtu = 1500)
ping: local error: Message too long, mtu=1500

--- client ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 18ms
pipe 2
root@server:/# 

In a slightly more complex example, open up a few terminals, spin up a simple HTTP server that sends a large payload, and observe in a tcpdump that the MSS advertisements allow the connection to succeed despite the differing MTUs:

# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# sample-http-server 

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# tcpdump -i any icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# curl http://server/

The tcpdump should return something like the following - note the MSS advertised by each side in the first two SYN packets is the MTU minus the 40 bytes of IP and TCP headers: mss 1460 and mss 8960. The packets carrying the HTTP payload are broken into smaller packets with a TCP segment of just 1448 bytes - small enough to fit inside an MTU of 1500 once the IP and TCP headers plus 12 additional bytes of TCP options are added (you can observe those options where it says [nop,nop,TS val 3808245569 ecr 3455879517]).

IP client.51424 > server.80: Flags [S], seq 4195639166, win 64240, options [mss 1460,sackOK,TS val 3456553938 ecr 0,nop,wscale 6], length 0
IP server.80 > client.51424: Flags [S.], seq 3403777541, ack 4195639167, win 62636, options [mss 8960,sackOK,TS val 3808919991 ecr 3456553938,nop,wscale 6], length 0
IP client.51424 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3456553939 ecr 3808919991], length 0
IP client.51424 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val 3456553939 ecr 3808919991], length 70: HTTP: GET / HTTP/1.1
IP server.80 > client.51424: Flags [.], ack 71, win 978, options [nop,nop,TS val 3808919991 ecr 3456553939], length 0
IP server.80 > client.51424: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS val 3808920010 ecr 3456553939], length 113: HTTP: HTTP/1.1 200 OK
IP client.51424 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3456553958 ecr 3808920010], length 0
IP server.80 > client.51424: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 > client.51424: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 > client.51424: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 > client.51424: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 457: HTTP
IP client.51424 > server.80: Flags [.], ack 1562, win 1002, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 > server.80: Flags [.], ack 3010, win 995, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 > server.80: Flags [.], ack 4458, win 984, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 > server.80: Flags [.], ack 4915, win 980, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 > server.80: Flags [F.], seq 71, ack 4915, win 1002, options [nop,nop,TS val 3456553960 ecr 3808920011], length 0
IP server.80 > client.51424: Flags [F.], seq 4915, ack 72, win 978, options [nop,nop,TS val 3808920013 ecr 3456553960], length 0
IP client.51424 > server.80: Flags [.], ack 4916, win 1002, options [nop,nop,TS val 3456553961 ecr 3808920013], length 0

One interesting note is that on many modern network devices, running a packet capture may result in tcpdump and similar tools observing packets that appear larger than the configured MTU, due to Large Receive Offload, Large Send Offload and other technologies that coalesce multiple packets from the same flow into a single pseudo-packet. On receive, the network card coalesces subsequent packets from a stream before passing them to the kernel as a single packet for faster processing. On send, the kernel provides one larger packet that the network card splits appropriately, based on the configured MSS, as it sends over the wire.

This packet coalescing has been intentionally disabled in the lab to make it simpler to observe when packets are being split up on the (virtual) wire; if the same example were run on a real server, the HTTP payload would likely appear to tcpdump as a single larger packet, though it would still be broken up the same way on the wire.
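
If you are capturing on a real machine and want captures to reflect on-the-wire packet sizes, the offload features responsible for this coalescing can be listed and (temporarily, while debugging) disabled with ethtool. Again, eth0 is a placeholder, and turning offloads off has a real CPU cost, so this is a debugging sketch rather than a recommended permanent setting:

# list the segmentation/receive offload features currently enabled
$ ethtool -k eth0 | grep -E 'segmentation|receive-offload'

# temporarily disable coalescing so captures show on-the-wire packet sizes
$ sudo ethtool -K eth0 tso off gso off gro off lro off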

Path MTU: Hidden bottlenecks

Although TCP MSS was able to work around the simple configuration above, where hosts had valid but differing MTUs on their own links, it is still not a complete solution: there may be additional intermediary links involved with an MTU lower than that of either the client or server link.

Client and server connected with valid MTUs, but a hidden lower-MTU intermediary link

In this example, Ethernet payloads larger than 1200 bytes from either side cannot be forwarded past the first router (given IP fragmentation is disabled). However, both client and server will advertise an MSS that allows Ethernet payloads larger than 1200 bytes to be sent.

With full visibility of the network, using a diagram like we have here, we can see that packets can only make it between client and server if they are no more than 1200 bytes including headers. This is the Path MTU: the minimum MTU of all links on the path between communicating hosts. In practice, where hosts are communicating arbitrarily over the Internet and multiple paths may be available between them, we don't have visibility into the full system and therefore cannot put a specific number on the Path MTU up front. Instead, hosts must be able to discover the Path MTU during the course of existing communications, as the need arises.

Path MTU Discovery

Path MTU Discovery is the process by which hosts, working from the local MTU and the remote host's initial MSS advertisement as hints, arrive at the actual Path MTU of the (current) full path between them, in each direction.

The process starts by assuming that the advertised MSS is correct for the full path, after reducing it if the local link's MTU minus the IP/TCP header size is smaller (since we couldn't send a larger packet regardless of the MSS). When a packet is sent that is larger than the smallest MTU along the path, it will at least make it one hop to the first router, since we know the local link MTU is large enough to fit it.

When a router receives a packet on one interface and needs to forward it out another interface that the packet cannot fit on, the router sends back an ICMPv4 "Fragmentation required" message or an ICMPv6 "Packet Too Big" message. The router includes the MTU of the next (smaller) hop in that message, since it knows that value. Upon receipt of that message, the originating host can reduce the calculated Path MTU for communications with that remote host, and resend the data as multiple smaller packets. From then on, packet size is correctly limited by the MTU of the smallest link in the path observed so far.
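
On Linux, the Path MTU learned from these ICMP messages is stored as a per-destination route exception, which can be inspected with ip route get. Using the server address from the upcoming lab scenario as an example (the interface name, timer and exact output layout here are illustrative and vary between kernel versions), the client might see something like this after receiving a "Frag needed" message:

# a cached route exception records the Path MTU learned from the ICMP message
root@client:/# ip route get 172.31.0.40
172.31.0.40 via 172.29.0.20 dev eth0 src 172.29.0.10 ...
    cache expires 598sec mtu 1200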

A full example is below, though note that in practice there may not be complete symmetry in the path in each direction, multiple hops may progressively have smaller MTU values along the way, and the path may even change throughout the lifetime of a single connection:

Client and server individually working out the effective MSS and Path MTU

This example shows how critical it is for TCP that ICMP messages of this type are forwarded correctly. This exchange is where most problems around MTU occur in production systems: firewalls along the path block or throttle ICMP traffic in a way that inhibits Path MTU Discovery. Don't block ICMP - it will break Path MTU Discovery, and with it any TCP connection transferring enough data that the initial MSS advertisement is not sufficient to keep packets under the Path MTU. At the very least, don't block ICMPv4 "Fragmentation required" or ICMPv6 "Packet Too Big", even if you block other ICMP messages.
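
If a firewall really must filter ICMP, a minimal sketch of the exceptions to carve out, written here in iptables/ip6tables syntax (adapt to whatever firewall is actually in use), looks something like this:

# IPv4: always allow "Fragmentation required, but DF set" (a destination-unreachable code)
iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT

# IPv6: always allow "Packet Too Big" - IPv6 routers never fragment, so this is essential
ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT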

The common traceroute utility maps the hops between hosts by manipulating the TTL and listening for TTL Exceeded messages. This can be extended to also discover the Path MTU (as well as the hops along the way), which is the functionality the tracepath utility provides. tracepath sends large packets to a remote host, starting at the maximum sendable on the local link, and shows any ICMP messages and the adjusted Path MTU along the way as it gradually increases the TTL and decreases the packet size. tracepath is a good first place to start when diagnosing issues between two hosts where MTU misconfiguration or ICMP filtering is suspected.

Path MTU Discovery in the lab

Select the scenario from above with the client on a network with 1500 MTU, the server on a network with 9000 MTU, and an additional intermediary network with 1200 MTU that packets must traverse:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario hidden_smaller

Observe that neither side can immediately ascertain the correct Path MTU; each must see an ICMP message from an intermediary router before it becomes aware of the smaller link:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server
PING server (172.31.0.40) 1400(1428) bytes of data.
From vagrant_router-a_1.vagrant_client_router_a (172.29.0.20) icmp_seq=1 Frag needed and DF set (mtu = 1200)
ping: local error: Message too long, mtu=1200

--- server ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2ms

root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client
PING client (172.29.0.10) 1400(1428) bytes of data.
From vagrant_router-b_1.vagrant_router_b_server (172.31.0.30) icmp_seq=1 Frag needed and DF set (mtu = 1200)
ping: local error: Message too long, mtu=1200

--- client ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

root@server:/# 

Bringing up the example HTTP server from earlier, we can also observe the full process of Path MTU Discovery. In this case, note that we take the tcpdump from router-b on its interface towards the server, since it has a better vantage point for observing retransmits.

# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# sample-http-server 

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell router-b # better vantage point
root@router-b:/# tcpdump -i eth1 icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# curl http://server/

The tcpdump will return something like the following:

IP client.51428 > server.80: Flags [S], seq 644598568, win 64240, options [mss 1460,sackOK,TS val 3457600577 ecr 0,nop,wscale 6], length 0
IP server.80 > client.51428: Flags [S.], seq 1840446146, ack 644598569, win 62636, options [mss 8960,sackOK,TS val 3809966630 ecr 3457600577,nop,wscale 6], length 0
IP client.51428 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3457600578 ecr 3809966630], length 0
IP client.51428 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val 3457600578 ecr 3809966630], length 70: HTTP: GET / HTTP/1.1
IP server.80 > client.51428: Flags [.], ack 71, win 978, options [nop,nop,TS val 3809966630 ecr 3457600578], length 0
IP server.80 > client.51428: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600578], length 113: HTTP: HTTP/1.1 200 OK
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3457600598 ecr 3809966650], length 0
IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 457: HTTP
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3457600598 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 900: HTTP
IP client.51428 > server.80: Flags [.], ack 1262, win 993, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 > server.80: Flags [.], ack 2410, win 976, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 > server.80: Flags [.], ack 3558, win 967, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 > server.80: Flags [.], ack 4915, win 970, options [nop,nop,TS val 3457600599 ecr 3809966650], length 0
IP server.80 > client.51428: Flags [F.], seq 4915, ack 71, win 978, options [nop,nop,TS val 3809966651 ecr 3457600599], length 0
IP client.51428 > server.80: Flags [F.], seq 71, ack 4916, win 1002, options [nop,nop,TS val 3457600600 ecr 3809966651], length 0
IP server.80 > client.51428: Flags [.], ack 72, win 978, options [nop,nop,TS val 3809966652 ecr 3457600600], length 0

Note that the advertised MSS values are the same as in the earlier example, and do not reflect the Path MTU, since it is not yet known. Each of the initial large packets sent from the server to the client causes an ICMP fragmentation message:

IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

The server then resends the failing packets, this time respecting the newly calculated Path MTU of 1200:

IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 900: HTTP

We can also use tracepath to perform Path MTU Discovery and observe which routers are responding - the tool starts with the local network’s MTU then discovers the reduced MTU link as it progresses:

# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:~$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# tracepath -n server
 1?: [LOCALHOST]                      pmtu 1500
 1:  172.29.0.20                                           0.088ms 
 1:  172.29.0.20                                           0.032ms 
 2:  172.29.0.20                                           0.030ms pmtu 1200
 2:  172.30.0.30                                           0.046ms 
 3:  172.31.0.40                                           0.058ms reached
     Resume: pmtu 1200 hops 3 back 3 
root@client:/# exit
vagrant@blog-lab-mtu:~$ /vagrant/bin/shell server
root@server:/# tracepath -n client
 1?: [LOCALHOST]                      pmtu 9000
 1:  172.31.0.30                                           0.159ms 
 1:  172.31.0.30                                           0.034ms 
 2:  172.31.0.30                                           0.027ms pmtu 1200
 2:  172.30.0.20                                           0.092ms 
 3:  172.29.0.10                                           0.060ms reached
     Resume: pmtu 1200 hops 3 back 3 
root@server:/# 

The tracepath tool is extremely useful in determining whether a connection-stalling failure is indeed a Path MTU blackhole caused by a router or firewall blocking ICMP packets.

Path MTU Discovery and Anycast

One final complexity occurs when routers have multiple equal-cost paths (ECMP) to multiple hosts that share the same IP address, a common situation with deployments of Anycast. In this case, routers hash packets across the different available paths and attempt to be consistent so that packets from the same connection arrive on the same remote host (and/or travel via the same path).

However, the input to that hash typically does not account for the fact that an ICMP fragmentation or Packet Too Big message is related to the TCP connection that triggered it: its IP source is a router along the path rather than the expected remote host, so it hashes differently to a normal returning packet. This leads to a situation where one host receives the TCP packets for a connection while another, unrelated host receives the ICMP message relating to that connection and disregards it. This introduces a Path MTU blackhole, just as if ICMP were being filtered.

In practice, there are ways to work around this issue. One way is to broadcast those ICMP messages to all hosts. An alternative approach is used by GLB Director, which allows the routers to perform their ICMP-unaware ECMP hashing, then re-hashes correctly at the first software load balancer layer. GLB inspects inside ICMP messages, which contain part of the triggering packet, and hashes them the same way it would hash the original TCP packet that triggered them, ensuring ICMP messages land on the same host as the related TCP connection. In general, any system that hashes or otherwise manipulates TCP packets needs to ensure that ICMP messages relating to the stream are delivered to the appropriate host, as they are a crucial part of the way TCP operates.

Wrapping up

It is often possible to ignore the details of MTU, MSS advertisement and Path MTU Discovery and have things continue to work to a certain extent. When these systems fail, however, connections stall entirely, in a way that is very visible to users. This is often seen only on large transfers, since small transfers keep packets small enough never to trigger the issue. It can also be intermittent, in cases where only one of the paths between hosts has a reduced Path MTU or a router blocking ICMP packets.

Thankfully, the rule for keeping networks functioning correctly with regards to MTU can be summarised simply as:

The MTU of the interfaces on either side of a physical
or logical link must be equal. Don't block ICMP.

Asking whether this rule holds true, both internally and externally, in any trouble ticket that has the pattern of "Why is my connection stalling when (action that transfers large data) but not when (action that transfers small data)?" will almost always surface an MTU misconfiguration or ICMP filtering as the root cause.