<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Theo Julienne</title>
    <description>Theo Julienne is a software &amp; infrastructure engineer.
</description>
    <link>http://theojulienne.io/</link>
    <atom:link href="http://theojulienne.io/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Wed, 15 Apr 2026 03:24:29 +0000</pubDate>
    <lastBuildDate>Wed, 15 Apr 2026 03:24:29 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Ethernet MTU and TCP MSS: Why connections stall</title>
        <description>&lt;p&gt;MTU and MSS are two terms that are easily confused, and their misconfiguration is often the cause of networking problems. Spending enough time working on production systems that interface with large networks of computers or the Internet almost guarantees coming across situations where an interface was configured with the wrong MTU, or a firewall was filtering ICMP. The result is a client that is unable to transfer large amounts of data even though smaller transfers work fine. This post will walk through MTU, MSS and packet size negotiation for TCP connections, and the common situations where it breaks down. It was inspired by multiple discussions during the course of investigating errors on production systems as part of my role at GitHub.&lt;/p&gt;

&lt;p&gt;If you want to take away a simple snippet from this post, the summary is:&lt;/p&gt;

&lt;div class=&quot;important-quote&quot;&gt;
The MTU of the interfaces on either side of a physical&lt;br /&gt;
or logical link must be equal. Don&apos;t block ICMP.
&lt;/div&gt;

&lt;p&gt;The examples mentioned in this blog post will be reproducible in the lab from &lt;i class=&quot;fa fa-github&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://github.com/theojulienne/blog-lab-mtu&quot;&gt;theojulienne/blog-lab-mtu&lt;/a&gt; - clone this repository and &lt;a href=&quot;https://www.vagrantup.com/&quot;&gt;bring up the lab&lt;/a&gt;, then poke around at these examples in a real system:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ git clone https://github.com/theojulienne/blog-lab-mtu.git
$ cd blog-lab-mtu
$ vagrant up
$ vagrant ssh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;ethernet-mtu-maximum-transmission-unit&quot;&gt;Ethernet MTU: Maximum Transmission Unit&lt;/h2&gt;

&lt;p&gt;The MTU (Maximum Transmission Unit) on an Ethernet network specifies the maximum payload size of the data to be transmitted along with an Ethernet header on a network. Typically this payload will be an IP packet, in which case the MTU specifies the maximum combined size of the IP header and IP data.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Ethernet / IP / TCP headers with MTU indicated&quot; src=&quot;/images/mtu-mss-why-connections-stall/mtu-packet.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The MTU is specified at the interface level as it is a link-level setting, and is typically propagated down to the underlying network card driver. The expectation is that any packet larger than this configured size that appears on the wire is invalid or corrupt and should be dropped. In a valid configuration, hosts connected together via a link will have the same MTU specified:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server connected with matching MTU&quot; src=&quot;/images/mtu-mss-why-connections-stall/simple-2-computers-match.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;If interfaces on either side of a link have mismatching MTU configurations, then the side with the smaller MTU will treat packets larger than its local MTU as invalid and drop them before any software has the chance to see them.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server connected with mismatching MTU&quot; src=&quot;/images/mtu-mss-why-connections-stall/simple-2-computers-mismatch.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Streams of data that are larger than the MTU will be broken up into packets that completely fill an Ethernet frame, up to the MTU in each. If the remote end has a smaller MTU configured for the same link, those larger packets will be dropped. MTU should be configured the same on both interfaces on either side of a link, and so the MTU should be considered a bidirectional maximum.&lt;/p&gt;

&lt;h3 id=&quot;mtu-in-the-lab&quot;&gt;MTU in the lab&lt;/h3&gt;

&lt;p&gt;The lab in this blog post can be used to observe this in an example system. In one terminal, bring up the lab hosts inside the Vagrant machine:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ vagrant ssh -- /vagrant/bin/run-lab
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In another terminal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vagrant ssh&lt;/code&gt; then enable the first scenario from above with matching MTU of 1500 on client and server:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_1500
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Log in to the client and server hosts and observe that we can send a packet with 1400 bytes of payload as expected, since both hosts have an MTU of 1500. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s 1400&lt;/code&gt; argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ping&lt;/code&gt; sets the payload size, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-M do&lt;/code&gt; argument instructs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ping&lt;/code&gt; to set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DF&lt;/code&gt; (Don’t Fragment) bit, ensuring that the whole IP packet must arrive in one piece or not at all.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server-direct
PING server-direct (172.28.0.40) 1400(1428) bytes of data.
1408 bytes from server-direct (172.28.0.40): icmp_seq=1 ttl=64 time=0.096 ms
1408 bytes from server-direct (172.28.0.40): icmp_seq=2 ttl=64 time=0.080 ms

--- server-direct ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 2ms
rtt min/avg/max/mdev = 0.080/0.088/0.096/0.008 ms
root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client-direct
PING client-direct (172.28.0.10) 1400(1428) bytes of data.
1408 bytes from client-direct (172.28.0.10): icmp_seq=1 ttl=64 time=0.079 ms
1408 bytes from client-direct (172.28.0.10): icmp_seq=2 ttl=64 time=0.071 ms

--- client-direct ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 32ms
rtt min/avg/max/mdev = 0.071/0.075/0.079/0.004 ms
root@server:/# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
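
&lt;p&gt;The sizes reported by ping follow from the header sizes involved. As a rough sketch, assuming an IPv4 header without options (20 bytes) and the 8 byte ICMP echo header, the arithmetic can be written in Python as follows. The function names are illustrative only and are not part of the lab:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request/reply header


def ping_packet_size(payload):
    # Total IP packet size for a ping sent with the given -s payload size.
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu):
    # Largest -s value that still fits in one unfragmented IP packet.
    return mtu - IPV4_HEADER - ICMP_HEADER


print(ping_packet_size(1400))  # 1428, the value shown as 1400(1428)
print(max_ping_payload(1500))  # 1472
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Under these assumptions a 1400 byte payload produces a 1428 byte IP packet, which fits within an MTU of 1500 with room to spare.&lt;/p&gt;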

&lt;p&gt;Now switch to the second scenario with mismatching MTU, and observe that 1400 byte payloads no longer succeed:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_mismatch

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server-direct
PING server-direct (172.28.0.40) 1400(1428) bytes of data.
ping: local error: Message too long, mtu=1200
ping: local error: Message too long, mtu=1200

--- server-direct ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 10ms

root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client-direct
PING client-direct (172.28.0.10) 1400(1428) bytes of data.

--- client-direct ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 19ms

root@server:/# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;client&lt;/code&gt; host is immediately able to observe that it cannot send a packet this large since the MTU on the interface is 1200. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server&lt;/code&gt; host, however, believes the MTU of the link is 1500, so it sends the packet, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;client&lt;/code&gt; is unable to receive it. This occurs at such a low level that neither host is aware of the failure - the packet just disappears.&lt;/p&gt;

&lt;h2 id=&quot;tcp-mss-maximum-segment-size&quot;&gt;TCP MSS: Maximum Segment Size&lt;/h2&gt;

&lt;p&gt;The TCP MSS (Maximum Segment Size) sounds very similar to MTU, and since it relates to the maximum size of network packets, they are easy to conflate even though they are quite different. A TCP segment is the TCP header and TCP data that form part of a single packet. The MSS specifies the maximum size of the data component of this segment that a host expects to be able to receive without the IP packet being &lt;a href=&quot;https://en.wikipedia.org/wiki/IP_fragmentation&quot;&gt;fragmented&lt;/a&gt;. IP fragmentation is typically disabled for TCP packets on modern networking stacks due to the added complexity and overhead, so the MSS represents the maximum size that the host expects to be able to receive in any given packet.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Ethernet / IP / TCP headers with MTU and MSS indicated&quot; src=&quot;/images/mtu-mss-why-connections-stall/mtu-mss-packet.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Rather than being an interface-level configuration like MTU, the MSS advertisement forms a part of the typical TCP handshake and is calculated based on the underlying MTU of the interface that a local host will use to communicate with a remote host. MSS can be thought of as a TCP hint around how much data can be included in a single TCP packet, given the current MTU. Each host calculates the MSS it will advertise by taking the local MTU and subtracting the size of the IP and TCP headers, then includes that MSS in the TCP options of the SYN or SYN-ACK packet as part of the &lt;a href=&quot;https://en.wikipedia.org/wiki/TCP_handshake&quot;&gt;TCP three-way handshake&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is not a negotiation of a single MSS, but rather each host is giving the remote host an indication of the maximum size of a single packet it expects will be possible to send back. This number can be no larger than the MTU minus IP/TCP headers, since there’s no way any larger packet could arrive given the local MTU. Each host will use the remote host’s advertised MSS as a hint for what size individual outgoing packets should be. Since this is a configurable hint, it is also only unidirectional, and although a host may advertise a lower MSS than it can otherwise handle, that doesn’t in any way restrict it from sending packets larger than the MSS it advertised (provided the remote host allowed for it).&lt;/p&gt;

&lt;p&gt;The simple MSS exchange happens to work around small misconfigurations of MTU, such as the trivial example described above:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server connected with mismatching MTU&quot; src=&quot;/images/mtu-mss-why-connections-stall/simple-2-computers-mismatch.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In this case, the client would advertise an MSS of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1200 (MTU) - 20 (IP hdr) - 20 (TCP hdr) = 1160&lt;/code&gt;, which would cause the server to refrain from sending packets that contained more than 1160 bytes in the TCP payload, which also ensures it would be able to arrive within the bounds of the MTU of 1200 once those headers are added on.&lt;/p&gt;
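
&lt;p&gt;The same calculation can be expressed as a short Python sketch. It assumes IPv4 and TCP headers of 20 bytes each with no options, and the function names are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, base TCP header without options


def advertised_mss(mtu):
    # MSS a host would advertise for an interface with this MTU.
    return mtu - IPV4_HEADER - TCP_HEADER


def largest_packet_for(peer_mss):
    # Size of the biggest IP packet a sender builds when honouring peer_mss.
    return peer_mss + IPV4_HEADER + TCP_HEADER


client_mss = advertised_mss(1200)
print(client_mss)                      # 1160
print(largest_packet_for(client_mss))  # 1200
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the same assumptions, interfaces with an MTU of 1500 and 9000 would advertise an MSS of 1460 and 8960 respectively.&lt;/p&gt;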

&lt;p&gt;However, the above network is still misconfigured, since even though TCP happens to work around it, other protocols will fail since they don’t exchange MSS values. MSS is actually intended to allow hosts to work around valid configurations where their own local networks have different MTU, such as the following:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server connected with different MTUs, but valid configuration&quot; src=&quot;/images/mtu-mss-why-connections-stall/router-different-mtu.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In this example, if the server with a valid MTU of 9000 attempted to send an Ethernet frame containing more than 1500 bytes without fragmentation being allowed, that packet would not be able to make it to the client. The intermediary router, being the first host that is aware of this problem as it is aware of the MTU of both links, would send an ICMPv4 “Fragmentation required, but DF set” message or an ICMPv6 “Packet Too Big” message back to the sender to inform it that forwarding the packet without breaking it up is not possible (and that the IP header had the DF, or Don’t Fragment, bit set).&lt;/p&gt;

&lt;p&gt;However, TCP will succeed in unrestricted communications between these hosts due to the MSS advertisements. The server in this configuration will receive an MSS from the client that will ensure no Ethernet frames with a payload larger than 1500 bytes are generated, so they will be received successfully.&lt;/p&gt;

&lt;h3 id=&quot;mss-in-the-lab&quot;&gt;MSS in the lab&lt;/h3&gt;

&lt;p&gt;Select the scenario from above with the client on a network with 1500 MTU and the server on a network with 9000 MTU:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario client_net_smaller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running a ping from the side with the larger MTU, we can observe that packets larger than the client’s MTU cause the intermediary router to return an ICMP message since it is unable to forward the packet:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 3000 client
PING client (172.29.0.10) 3000(3028) bytes of data.
From 172.30.0.20 icmp_seq=1 Frag needed and DF set (mtu = 1500)
ping: local error: Message too long, mtu=1500

--- client ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 18ms
pipe 2
root@server:/# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a slightly more complex example, open up a few terminals and spin up a simple HTTP server that sends a large payload and observe in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; that the MSS advertisements allow the connection to succeed despite the differing MTU:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# reset everything so Linux doesn&apos;t remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# sample-http-server 

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# tcpdump -i any icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# curl http://server/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; should return something like the following - note the MSS advertised by each side in the SYN and SYN-ACK packets is the MTU minus the IP and TCP header size of 40 bytes - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mss 1460&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mss 8960&lt;/code&gt;. The packets with an HTTP payload are broken into smaller packets with a TCP segment of just 1448 bytes - small enough to fit inside an MTU of 1500 along with an IP and TCP header and 12 additional bytes for TCP options (you can observe those options where it says &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[nop,nop,TS val 3808920011 ecr 3456553958]&lt;/code&gt;).&lt;/p&gt;

&lt;div class=&quot;highlight needs-scroll&quot;&gt;
&lt;pre&gt;
IP client.51424 &amp;gt; server.80: Flags [S], seq 4195639166, win 64240, options [mss 1460,sackOK,TS val 3456553938 ecr 0,nop,wscale 6], length 0
IP server.80 &amp;gt; client.51424: Flags [S.], seq 3403777541, ack 4195639167, win 62636, options [mss 8960,sackOK,TS val 3808919991 ecr 3456553938,nop,wscale 6], length 0
IP client.51424 &amp;gt; server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3456553939 ecr 3808919991], length 0
IP client.51424 &amp;gt; server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val 3456553939 ecr 3808919991], length 70: HTTP: GET / HTTP/1.1
IP server.80 &amp;gt; client.51424: Flags [.], ack 71, win 978, options [nop,nop,TS val 3808919991 ecr 3456553939], length 0
IP server.80 &amp;gt; client.51424: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS val 3808920010 ecr 3456553939], length 113: HTTP: HTTP/1.1 200 OK
IP client.51424 &amp;gt; server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3456553958 ecr 3808920010], length 0
IP server.80 &amp;gt; client.51424: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 &amp;gt; client.51424: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 &amp;gt; client.51424: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 1448: HTTP
IP server.80 &amp;gt; client.51424: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,TS val 3808920011 ecr 3456553958], length 457: HTTP
IP client.51424 &amp;gt; server.80: Flags [.], ack 1562, win 1002, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 &amp;gt; server.80: Flags [.], ack 3010, win 995, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 &amp;gt; server.80: Flags [.], ack 4458, win 984, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 &amp;gt; server.80: Flags [.], ack 4915, win 980, options [nop,nop,TS val 3456553959 ecr 3808920011], length 0
IP client.51424 &amp;gt; server.80: Flags [F.], seq 71, ack 4915, win 1002, options [nop,nop,TS val 3456553960 ecr 3808920011], length 0
IP server.80 &amp;gt; client.51424: Flags [F.], seq 4915, ack 72, win 978, options [nop,nop,TS val 3808920013 ecr 3456553960], length 0
IP client.51424 &amp;gt; server.80: Flags [.], ack 4916, win 1002, options [nop,nop,TS val 3456553961 ecr 3808920013], length 0
&lt;/pre&gt;
&lt;/div&gt;
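
&lt;p&gt;The segment sizes in the capture above can be reproduced with a little arithmetic. The Python sketch below assumes 20 byte IPv4 and TCP headers plus the 12 bytes of TCP timestamp options seen in the capture. The helper names are hypothetical, and the byte count is taken from the sequence range 114:4915:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;IPV4_HEADER = 20           # bytes, assuming no IP options
TCP_HEADER = 20            # bytes, base TCP header
TCP_TIMESTAMP_OPTION = 12  # bytes, the nop,nop,TS options in the capture


def segment_payload(mtu):
    # TCP data bytes that fit in one packet of the given MTU.
    return mtu - IPV4_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION


def split_into_segments(total_bytes, payload_size):
    # Sizes of the segments needed to carry total_bytes of data.
    full, remainder = divmod(total_bytes, payload_size)
    sizes = [payload_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes


data_bytes = 4915 - 114  # sequence range 114:4915 from the capture
print(segment_payload(1500))                                   # 1448
print(split_into_segments(data_bytes, segment_payload(1500)))  # [1448, 1448, 1448, 457]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting sizes match the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;length&lt;/code&gt; values on the four data packets sent by the server.&lt;/p&gt;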

&lt;p&gt;One interesting note is that on many modern network devices, running a packet capture may result in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; and similar tools observing packets that appear larger than the configured MTU due to &lt;a href=&quot;https://en.wikipedia.org/wiki/Large_receive_offload&quot;&gt;Large Receive Offload&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Large_send_offload&quot;&gt;Large Send Offload&lt;/a&gt; and other technologies which coalesce multiple packets that are part of the same flow into a single pseudo-packet. On receive, the network card will coalesce subsequent packets from a stream together before passing them to the kernel as a single packet for faster processing. On send, the kernel will provide one larger packet that the network card will split appropriately as it sends over the wire based on the configured MSS.&lt;/p&gt;

&lt;p&gt;This packet coalescing has been intentionally disabled in the lab to make it simpler to observe when packets are being split up on the (virtual) wire. If the same example was run on a real server, the HTTP payload would likely appear to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; as a single larger packet, though it would still be broken up the same way on the wire.&lt;/p&gt;

&lt;h2 id=&quot;path-mtu-hidden-bottlenecks&quot;&gt;Path MTU: Hidden bottlenecks&lt;/h2&gt;

&lt;p&gt;Although in the above example, TCP MSS was able to work around a simple configuration where hosts had valid but differing MTUs on their links, this is still not a complete solution as there may be additional intermediary links involved with an MTU that is lower than either the client or server link.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server connected with different MTUs, but valid configuration&quot; src=&quot;/images/mtu-mss-why-connections-stall/router-hidden-mtu.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In this example, all Ethernet payloads larger than 1200 bytes from either side will not be able to be forwarded past the first hop (if IP fragmentation is disabled). However, both client and server will advertise an MSS that will allow for Ethernet payloads larger than 1200 bytes to be sent.&lt;/p&gt;

&lt;p&gt;With full visibility of the network, using a diagram like we have here, we can see that packets can only make it between client and server if they are no more than 1200 bytes including headers. This is the Path MTU, or the minimum MTU of all links on the path between communicating hosts. In practice, where hosts are communicating arbitrarily over the Internet and where multiple paths could be available between those hosts, we don’t have visibility into the full system and therefore we are unable to put a specific number on the Path MTU up front. Instead, it must be possible for hosts to discover this Path MTU as needed during existing communications, as the need for it arises.&lt;/p&gt;
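
&lt;p&gt;As a small illustration of that definition, the Python sketch below computes the Path MTU for a list of link MTUs and compares it with what each host would advertise. The link values (1500 on the client link, 1200 on the hidden link, 9000 on the server link) are assumed for the example, and headers are taken to be 40 bytes with no options:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;HEADERS = 40  # IPv4 (20) + TCP (20), assuming no options


def path_mtu(link_mtus):
    # The Path MTU is the smallest MTU of any link along the path.
    return min(link_mtus)


# Assumed link order: client link, hidden link, server link.
links = [1500, 1200, 9000]

client_mss = links[0] - HEADERS        # what the client advertises
server_mss = links[-1] - HEADERS       # what the server advertises
usable_mss = path_mtu(links) - HEADERS # what actually fits end to end

print(path_mtu(links))                     # 1200
print(client_mss, server_mss, usable_mss)  # 1460 8960 1160
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both advertised values are larger than what the path can carry, which is exactly the gap that neither host can see from its own link.&lt;/p&gt;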

&lt;h2 id=&quot;path-mtu-discovery&quot;&gt;Path MTU Discovery&lt;/h2&gt;

&lt;p&gt;Path MTU Discovery is the process of hosts working from the local MTU and the remote initial MSS advertisement as hints, and arriving at the actual Path MTU of the (current) full path between those hosts in each direction.&lt;/p&gt;

&lt;p&gt;The process starts by assuming that the advertised MSS is correct for the full path, after reducing it if the local link’s MTU minus IP/TCP header size is smaller (since we couldn’t send a larger packet regardless of the MSS). When a packet is sent that is larger than the MTU of the smallest link along the path, it will at least make it one hop to the first router, since we know the local link MTU is large enough to fit it.&lt;/p&gt;

&lt;p&gt;When a router receives a packet on one interface and needs to forward it to another interface that the packet cannot fit on, the router sends back an ICMPv4 “Fragmentation required” message or an ICMPv6 “Packet Too Big” message. The router includes the MTU of the next (smaller) hop in that message, since it knows it. Upon receipt of that message, the originating host is able to reduce the calculated Path MTU for communications with that remote host, and resend the data as multiple smaller packets. From then on, packet size is correctly limited by the size of the MTU of the smallest link in the path observed so far.&lt;/p&gt;
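
&lt;p&gt;The steps above can be modelled in a few lines of Python. This is a simplified sketch and not how any real TCP stack is implemented: it ignores TCP options, retransmission timing and route changes, and it assumes every router along the way reports the MTU of its next link. The function name and the example path are made up for illustration:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;HEADERS = 40  # IPv4 (20) + TCP (20), assuming no options


def discover_path_mtu(local_mtu, peer_mss, link_mtus):
    # Start from the smaller of the local MTU and what the peer MSS allows.
    estimate = min(local_mtu, peer_mss + HEADERS)
    icmp_mtus = []  # MTU values reported back by routers, in order
    for link_mtu in link_mtus:
        reduced = min(estimate, link_mtu)
        if reduced != estimate:
            # A router cannot forward onto this link and reports its MTU.
            icmp_mtus.append(link_mtu)
        estimate = reduced
    return estimate, icmp_mtus


# Assumed path from the server side: 9000, then 1200, then 1500.
print(discover_path_mtu(9000, 1460, [9000, 1200, 1500]))  # (1200, [1200])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model the sender only learns about a smaller link when a router reports it, so a path with progressively smaller links produces one report per reduction.&lt;/p&gt;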

&lt;p&gt;A full example is below, though note that in practice there may not be complete symmetry in the path in each direction, multiple hops may progressively have smaller MTU values along the way, and the path may even change throughout the lifetime of a single connection:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Client and server individually working out the effective MSS and Path MTU&quot; src=&quot;/images/mtu-mss-why-connections-stall/router-pmtud.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This example shows how critical it is for TCP that ICMP messages of this type are forwarded correctly. This exchange is where most problems around MTU occur in production systems, when firewalls along the path block or throttle ICMP traffic in a way that inhibits Path MTU Discovery. Don’t block ICMP: doing so breaks Path MTU Discovery, and with it any TCP connection with large data transfers where the initial MSS advertisement alone does not keep packets within the Path MTU. At the very least, don’t block ICMPv4 “Fragmentation required” or ICMPv6 “Packet Too Big”, even if you block other ICMP messages.&lt;/p&gt;

&lt;p&gt;The common &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;traceroute&lt;/code&gt; utility discovers the hops between hosts by &lt;a href=&quot;https://karla.io/2016/06/13/dont-panic.html&quot;&gt;using the TTL to observe each hop via TTL Exceeded messages&lt;/a&gt;. The same approach can be extended to show the Path MTU as well as the hops along the way, which is the functionality that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tracepath&lt;/code&gt; utility provides. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tracepath&lt;/code&gt; sends large packets, starting at the maximum sendable on the local link, to a remote host and shows any ICMP messages and the adjusted Path MTU along the way as it gradually increases TTL and decreases packet size. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tracepath&lt;/code&gt; is a good place to start when diagnosing issues observed between two hosts where MTU misconfiguration or ICMP filtering is suspected.&lt;/p&gt;

&lt;h3 id=&quot;path-mtu-discovery-in-the-lab&quot;&gt;Path MTU Discovery in the lab&lt;/h3&gt;

&lt;p&gt;Select the scenario from above with the client on a network with 1500 MTU, the server on a network with 9000 MTU, and an additional intermediary network with 1200 MTU that packets must traverse:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario hidden_smaller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Observe that neither side can immediately ascertain the correct Path MTU and must see an ICMP message from the intermediary router before they become aware of the smaller link:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client
root@client:/# ping -c 2 -M do -s 1400 server
PING server (172.31.0.40) 1400(1428) bytes of data.
From vagrant_router-a_1.vagrant_client_router_a (172.29.0.20) icmp_seq=1 Frag needed and DF set (mtu = 1200)
ping: local error: Message too long, mtu=1200

--- server ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2ms

root@client:/# exit
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# ping -c 2 -M do -s 1400 client
PING client (172.29.0.10) 1400(1428) bytes of data.
From vagrant_router-b_1.vagrant_router_b_server (172.31.0.30) icmp_seq=1 Frag needed and DF set (mtu = 1200)
ping: local error: Message too long, mtu=1200

--- client ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

root@server:/# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bringing up the example HTTP server from earlier, we can also observe the full process of Path MTU Discovery. In this case, note that we observe the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;router-b&lt;/code&gt; on its interface towards &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server&lt;/code&gt; since it has a better vantage point for observing retransmits.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# reset everything so Linux doesn&apos;t remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server
root@server:/# sample-http-server 

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell router-b # better vantage point
root@router-b:/# tcpdump -i eth1 icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# curl http://server/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; will return something like the following:&lt;/p&gt;

&lt;div class=&quot;highlight needs-scroll&quot;&gt;
&lt;pre&gt;
IP client.51428 &amp;gt; server.80: Flags [S], seq 644598568, win 64240, options [mss 1460,sackOK,TS val 3457600577 ecr 0,nop,wscale 6], length 0
IP server.80 &amp;gt; client.51428: Flags [S.], seq 1840446146, ack 644598569, win 62636, options [mss 8960,sackOK,TS val 3809966630 ecr 3457600577,nop,wscale 6], length 0
IP client.51428 &amp;gt; server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3457600578 ecr 3809966630], length 0
IP client.51428 &amp;gt; server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val 3457600578 ecr 3809966630], length 70: HTTP: GET / HTTP/1.1
IP server.80 &amp;gt; client.51428: Flags [.], ack 71, win 978, options [nop,nop,TS val 3809966630 ecr 3457600578], length 0
IP server.80 &amp;gt; client.51428: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600578], length 113: HTTP: HTTP/1.1 200 OK
IP client.51428 &amp;gt; server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3457600598 ecr 3809966650], length 0
IP server.80 &amp;gt; client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 &amp;gt; client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 &amp;gt; client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 &amp;gt; client.51428: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 457: HTTP
IP client.51428 &amp;gt; server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 3457600598 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP server.80 &amp;gt; client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 900: HTTP
IP client.51428 &amp;gt; server.80: Flags [.], ack 1262, win 993, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 &amp;gt; server.80: Flags [.], ack 2410, win 976, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 &amp;gt; server.80: Flags [.], ack 3558, win 967, options [nop,nop,TS val 3457600599 ecr 3809966650,nop,nop,sack 1 {4458:4915}], length 0
IP client.51428 &amp;gt; server.80: Flags [.], ack 4915, win 970, options [nop,nop,TS val 3457600599 ecr 3809966650], length 0
IP server.80 &amp;gt; client.51428: Flags [F.], seq 4915, ack 71, win 978, options [nop,nop,TS val 3809966651 ecr 3457600599], length 0
IP client.51428 &amp;gt; server.80: Flags [F.], seq 71, ack 4916, win 1002, options [nop,nop,TS val 3457600600 ecr 3809966651], length 0
IP server.80 &amp;gt; client.51428: Flags [.], ack 72, win 978, options [nop,nop,TS val 3809966652 ecr 3457600600], length 0
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Note that the advertised MSS values are the same as in the earlier example: they do not reflect the Path MTU, since it is not yet known. Each of the initial large packets sent from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server&lt;/code&gt; to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;client&lt;/code&gt; triggers an ICMP fragmentation needed message:&lt;/p&gt;
&lt;div class=&quot;highlight needs-scroll&quot;&gt;
&lt;pre&gt;
IP server.80 &amp;gt; client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 &amp;gt; client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 &amp;gt; client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1448: HTTP
IP router-b &amp;gt; server: ICMP client unreachable - need to frag (mtu 1200), length 556
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server&lt;/code&gt; then resends the failing packets, this time respecting the newly calculated Path MTU of 1200:&lt;/p&gt;
&lt;div class=&quot;highlight needs-scroll&quot;&gt;
&lt;pre&gt;
IP server.80 &amp;gt; client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 1148: HTTP
IP server.80 &amp;gt; client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS val 3809966650 ecr 3457600598], length 900: HTTP
&lt;/pre&gt;
&lt;/div&gt;
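
&lt;p&gt;The segment sizes in the capture follow from simple header arithmetic. As a rough sketch, assuming IPv4 and TCP headers of 20 bytes each with no IP options, plus the 12 bytes of TCP options (two NOPs and a timestamp) visible in the capture, the payload per packet can be computed as below. The function names are illustrative only:&lt;/p&gt;

```python
IPV4_HEADER = 20      # bytes, assuming no IP options
TCP_HEADER = 20       # bytes, base TCP header without options
TCP_TIMESTAMP = 12    # bytes: nop, nop, timestamp option as seen in the capture


def tcp_payload_per_packet(mtu):
    """Largest TCP payload that fits in one unfragmented packet of `mtu` bytes."""
    return mtu - IPV4_HEADER - TCP_HEADER - TCP_TIMESTAMP


def split_into_segments(total_bytes, mtu):
    """Split `total_bytes` of payload into per-packet segment lengths."""
    size = tcp_payload_per_packet(mtu)
    full, remainder = divmod(total_bytes, size)
    return [size] * full + ([remainder] if remainder else [])


print(tcp_payload_per_packet(1500))           # 1448
print(tcp_payload_per_packet(1200))           # 1148
print(split_into_segments(4458 - 114, 1200))  # [1148, 1148, 1148, 900]
```

&lt;p&gt;The last line reproduces the four resent segments: the 4344 bytes between sequence numbers 114 and 4458 no longer fit in three packets once the Path MTU drops to 1200.&lt;/p&gt;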

&lt;p&gt;We can also use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tracepath&lt;/code&gt; to perform Path MTU Discovery and observe which routers are responding - the tool starts with the local network’s MTU then discovers the reduced MTU link as it progresses:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# reset everything so Linux doesn&apos;t remember that ICMP frag message from above
vagrant@blog-lab-mtu:~$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client
root@client:/# tracepath -n server
 1?: [LOCALHOST]                      pmtu 1500
 1:  172.29.0.20                                           0.088ms 
 1:  172.29.0.20                                           0.032ms 
 2:  172.29.0.20                                           0.030ms pmtu 1200
 2:  172.30.0.30                                           0.046ms 
 3:  172.31.0.40                                           0.058ms reached
     Resume: pmtu 1200 hops 3 back 3 
root@client:/# exit
vagrant@blog-lab-mtu:~$ /vagrant/bin/shell server
root@server:/# tracepath -n client
 1?: [LOCALHOST]                      pmtu 9000
 1:  172.31.0.30                                           0.159ms 
 1:  172.31.0.30                                           0.034ms 
 2:  172.31.0.30                                           0.027ms pmtu 1200
 2:  172.30.0.20                                           0.092ms 
 3:  172.29.0.10                                           0.060ms reached
     Resume: pmtu 1200 hops 3 back 3 
root@server:/# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tracepath&lt;/code&gt; tool is extremely useful in determining whether a connection stalling failure is indeed a Path MTU blackhole due to a router or firewall blocking ICMP packets.&lt;/p&gt;

&lt;h2 id=&quot;path-mtu-discovery-and-anycast&quot;&gt;Path MTU Discovery and Anycast&lt;/h2&gt;

&lt;p&gt;One final complexity occurs when routers have multiple equal-cost paths (&lt;a href=&quot;https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing&quot;&gt;ECMP&lt;/a&gt;) to multiple hosts that share the same IP address, a common situation with deployments of &lt;a href=&quot;https://en.wikipedia.org/wiki/Anycast&quot;&gt;Anycast&lt;/a&gt;. In this case, routers hash packets across the different available paths and attempt to be consistent so that packets from the same connection arrive on the same remote host (and/or travel via the same path).&lt;/p&gt;

&lt;p&gt;However, the input to the hash typically does not account for the fact that an ICMP fragmentation needed or packet too big message relates to the TCP connection that triggered it: the IP source of the ICMP message is a router along the path rather than the expected remote host. This leads to a situation where one host receives the TCP packets for a connection, and another unrelated host receives the ICMP packet relating to that connection, which gets disregarded. This introduces a Path MTU blackhole, as if ICMP were being filtered.&lt;/p&gt;

&lt;p&gt;In practice, there are ways to work around this issue. One way is to &lt;a href=&quot;https://blog.cloudflare.com/path-mtu-discovery-in-practice/&quot;&gt;broadcast those ICMP messages to all hosts&lt;/a&gt;. An alternative approach is used by &lt;a href=&quot;https://theojulienne.io/2018/08/08/glb-director-open-source-load-balancer.html&quot;&gt;GLB Director&lt;/a&gt; which allows the routers to perform the ICMP-unaware ECMP hashing, but then re-hashes it correctly at the first software load balancer layer. GLB inspects inside ICMP messages, since they contain part of the triggering packet, and hashes those packets the same way they would be hashed if they were the original TCP packet that triggered them, ensuring ICMP messages land on the same host as the related TCP connection. In general, it’s important that any system involving hashing or otherwise manipulating TCP packets ensures that ICMP messages relating to the stream are sent to the appropriate host, as they are a crucial part of the way that TCP operates.&lt;/p&gt;
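
&lt;p&gt;The idea of hashing an ICMP message as if it were the packet that triggered it can be sketched as follows. This is a simplified illustration and not GLB Director’s actual implementation: packets are modelled as plain dictionaries with hypothetical field names, the addresses are examples, and the hash is a stand-in for whatever a real load balancer uses.&lt;/p&gt;

```python
import hashlib


def flow_key(packet):
    """Return the (remote ip, remote port, local ip, local port) tuple to hash.

    `packet` is a hypothetical dict representation of an inbound packet.
    """
    if packet["proto"] == "icmp" and "inner" in packet:
        # The ICMP error embeds the start of the packet that triggered it.
        # That packet was sent by the local host, so its source is our
        # local address and its destination is the remote client: swap them
        # so the key matches inbound TCP packets of the same connection.
        inner = packet["inner"]
        return (inner["dst_ip"], inner["dst_port"], inner["src_ip"], inner["src_port"])
    return (packet["src_ip"], packet["src_port"], packet["dst_ip"], packet["dst_port"])


def pick_backend(packet, backends):
    """Consistently map a packet to one backend based on its flow key."""
    digest = hashlib.sha256(repr(flow_key(packet)).encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


backends = ["host-a", "host-b", "host-c"]
tcp = {"proto": "tcp", "src_ip": "172.29.0.10", "src_port": 51428,
       "dst_ip": "172.31.0.40", "dst_port": 80}
icmp = {"proto": "icmp", "src_ip": "172.30.0.30",  # a router, not the client
        "dst_ip": "172.31.0.40",
        "inner": {"src_ip": "172.31.0.40", "src_port": 80,
                  "dst_ip": "172.29.0.10", "dst_port": 51428}}

print(pick_backend(tcp, backends) == pick_backend(icmp, backends))  # True
```

&lt;p&gt;Hashing on the outer header of the ICMP message instead would use the router’s address as the source, which is why a router that is unaware of the embedded packet can pick a different host.&lt;/p&gt;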

&lt;h2 id=&quot;wrapping-up&quot;&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;It is often possible to ignore the details of MTU, MSS advertisement and Path MTU Discovery and have things continue to work to a certain extent. However, when these systems fail, connections stall entirely, leaving users blocked. This is often seen only on large transfers, since smaller transfers never produce packets large enough to trigger the issue. It can also be intermittent when only one of several paths between hosts has a reduced Path MTU, or only one path has a router blocking ICMP packets.&lt;/p&gt;

&lt;p&gt;Thankfully, the rule for keeping networks functioning correctly with regards to MTU can be summarised simply as:&lt;/p&gt;

&lt;div class=&quot;important-quote&quot;&gt;
The MTU of the interfaces on either side of a physical&lt;br /&gt;
or logical link must be equal. Don&apos;t block ICMP.
&lt;/div&gt;

&lt;p&gt;Asking if this rule holds true both internally and externally in any trouble ticket that has the pattern of “Why is my connection stalling when &lt;i&gt;(action that transfers large data)&lt;/i&gt; but not when &lt;i&gt;(action that transfers small data)&lt;/i&gt;?” will almost always yield an MTU misconfiguration or ICMP filtering as the root cause.&lt;/p&gt;
</description>
        <pubDate>Fri, 21 Aug 2020 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2020/08/21/mtu-mss-why-connections-stall.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2020/08/21/mtu-mss-why-connections-stall.html</guid>
        
        
      </item>
    
      <item>
        <title>Scaling Linux Services: Before accepting connections</title>
        <description>&lt;p&gt;When writing services that accept TCP connections, we tend to think of our work as starting from the point where our service accepts a new client connection and finishing when we complete the request and close the socket. For services at scale, operations can happen at such a high rate that some of the default resource limits of the Linux kernel can break this abstraction and start causing impact to incoming connections outside of that connection lifecycle. This post focuses on some standard resource limitations that exist before the client socket is handed to the application - all of which came up during the course of investigating errors on production systems as part of my role at GitHub (in some cases, multiple times across different applications).&lt;/p&gt;

&lt;p&gt;In its most basic form (ignoring non-blocking variants), listening for TCP connections requires a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; to actually start allowing incoming connections, followed by repeated calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; to take the next pending connection and return a file descriptor that is for that particular client. In C this pattern looks something like:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;server_fd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AF_INET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SOCK_STREAM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// start listening for connections&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;512&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;running&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// block until a connection arrives and then accept it&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client_fd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accept&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// ... handle client_fd ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is often hidden behind further layers of abstraction, and we tend to hide away all the implementation details of accepting connections and view it as a stream of new connections that we pick up and then process in parallel. However, when building a system that runs at scale, this abstraction tends to break down because there are resource limitations introduced in the period where connections are being established but are not yet returned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;. Before that point, those client connections are considered as being part of the server/listen socket and not as independent resources exposed to the application.&lt;/p&gt;

&lt;p&gt;The examples mentioned in this blog post will be reproducible in the lab from &lt;i class=&quot;fa fa-github&quot;&gt;&lt;/i&gt; &lt;a href=&quot;https://github.com/theojulienne/blog-lab-scaling-accept&quot;&gt;theojulienne/blog-lab-scaling-accept&lt;/a&gt; - clone this repository and &lt;a href=&quot;https://www.vagrantup.com/&quot;&gt;bring up the lab&lt;/a&gt;, then poke around at these examples in a real system:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ git clone https://github.com/theojulienne/blog-lab-scaling-accept.git
$ cd blog-lab-scaling-accept
$ vagrant up
$ vagrant ssh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;from-syn-to-accept&quot;&gt;From SYN to accept()&lt;/h2&gt;

&lt;p&gt;The Linux kernel maintains two queues that together hold the backlog of connections not yet &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;ed by the application:&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;LISTEN backlogs&quot; src=&quot;/images/scaling-linux-services-before-accepting-connections/pre-accept-backlogs.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;When a SYN packet is received to initiate a new connection to a listen socket, a SYN-ACK is sent to the client and the half-completed connection state is stored in the “SYN backlog” or “Request Socket Queue”. This represents connections that have not yet been fully validated as having two-way communication between the hosts, where the server hasn’t yet validated that the remote end has successfully received a packet from the server (the SYN could have come from another host spoofing the source IP).&lt;/p&gt;

&lt;p&gt;Once the client responds to the server’s SYN-ACK with an ACK, the connection has then completed the full &lt;a href=&quot;https://en.wikipedia.org/wiki/Handshaking#TCP_three-way_handshake&quot;&gt;TCP 3-way handshake&lt;/a&gt;, and the server knows that two-way communication has been established. At this point, the client connection is ready to be provided to the application at a future call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;, and is added to the appropriately named “accept queue”.&lt;/p&gt;

&lt;h2 id=&quot;syn-backlog&quot;&gt;SYN backlog&lt;/h2&gt;

&lt;p&gt;Connections in the SYN backlog remain there for a period of time relative to the Round Trip Time between the server and the client. If there are N slots in the backlog then you can have at most N connections in this backlog per average RTT, after which the backlog overflows.&lt;/p&gt;
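
&lt;p&gt;As a back-of-the-envelope sketch of that relationship, assuming every handshake takes exactly one RTT to complete and connections arrive evenly (both simplifications), the rate of new connections at which the backlog starts to overflow is roughly the backlog size divided by the RTT. The function name is illustrative:&lt;/p&gt;

```python
def max_new_connections_per_second(backlog_slots, rtt_ms):
    """Approximate connection rate at which the SYN backlog starts to overflow."""
    return backlog_slots * 1000 / rtt_ms


# 128 slots with clients 200ms away, and the same backlog with clients 1ms away
print(max_new_connections_per_second(128, 200))  # 640.0
print(max_new_connections_per_second(128, 1))    # 128000.0
```

&lt;p&gt;The same backlog that is ample for clients in the same datacenter can overflow at a few hundred connections per second once clients are far away.&lt;/p&gt;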

&lt;p&gt;This won’t actually cause the connections to fail by default on Linux, but instead it will cause &lt;a href=&quot;https://en.wikipedia.org/wiki/SYN_cookies&quot;&gt;SYN cookies&lt;/a&gt; to be sent. This is due to the server being unable to validate that the client is who they say they are at the point where only a SYN has been received. Before receiving an ACK that contains the &lt;a href=&quot;https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_establishment&quot;&gt;sequence number&lt;/a&gt; that the server sent to the client in the SYN-ACK, the client could be spoofing packets as coming from a different IP. This is a Denial of Service attack called a &lt;a href=&quot;https://en.wikipedia.org/wiki/SYN_flood&quot;&gt;SYN Flood&lt;/a&gt;, which is so common that the Linux kernel has a built-in mitigation: sending a SYN cookie when there is no room in the SYN backlog for the new connection.&lt;/p&gt;

&lt;p&gt;This means that if the SYN backlog overflows under normal circumstances with no DoS attack, the kernel still allows the connection to move further along in the handshake, and does not store any state for the connection until an ACK completes the handshake. Sending SYN cookies during normal operation does, however, indicate that the rate of connections is probably too high for the default limits, as SYN cookies are really only meant for mitigating SYN floods.&lt;/p&gt;

&lt;p&gt;The circumstances in which the kernel will send SYN cookies are configured via the sysctl &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.ipv4.tcp_syncookies&lt;/code&gt;. By default this is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;, which indicates that SYN cookies should be sent when the SYN backlog overflows, but it can also be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; to disable them entirely or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; to force SYN cookies to be sent 100% of the time.&lt;/p&gt;
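
&lt;p&gt;For scripts that want to check which mode a host is in, a small sketch along these lines may help. It reads the sysctl through its standard procfs path on Linux; the function names and the wording of the descriptions are made up for this example:&lt;/p&gt;

```python
SYNCOOKIES_MODES = {
    0: "disabled: SYNs are dropped when the SYN backlog is full",
    1: "enabled on overflow: cookies are sent only when the SYN backlog is full",
    2: "always: cookies are sent for every SYN",
}


def describe_syncookies(raw_value):
    """Translate the content of the tcp_syncookies sysctl into a description."""
    return SYNCOOKIES_MODES.get(int(raw_value.strip()), "unknown value")


def read_syncookies(path="/proc/sys/net/ipv4/tcp_syncookies"):
    """Read the sysctl via procfs (Linux only) and describe its current mode."""
    with open(path) as f:
        return describe_syncookies(f.read())


print(describe_syncookies("1\n"))
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_syncookies()&lt;/code&gt; inside a container reports the value for that container’s network namespace, which may differ from the host.&lt;/p&gt;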

&lt;p&gt;When SYN cookies are enabled as needed (the default), they are triggered when the number of pending connections in the SYN backlog is more than the configured accept queue backlog size for the socket - the logic for this is &lt;a href=&quot;https://github.com/torvalds/linux/blob/cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2/net/ipv4/tcp_input.c#L6612-L6613&quot;&gt;here&lt;/a&gt;. When SYN cookies are fully disabled, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.ipv4.tcp_max_syn_backlog&lt;/code&gt; configures the number of connections allowed in the SYN backlog separately - the logic for this case is &lt;a href=&quot;https://github.com/torvalds/linux/blob/cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2/net/ipv4/tcp_input.c#L6670-L6672&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the default configuration where the backlog overflows and SYN cookies are sent, the kernel increments the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TCPReqQFullDoCookies&lt;/code&gt; counter and logs this line to the kernel log, which is often confused for an indicator of a real SYN flood even when it’s just due to legitimate connections coming in too fast:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;TCP: request_sock_TCP: Possible SYN flooding on port 8080. Sending cookies.  Check SNMP counters.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If SYN cookies are explicitly disabled, the kernel increments the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TCPReqQFullDrop&lt;/code&gt; counter and the following would be logged instead:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;TCP: request_sock_TCP: Possible SYN flooding on port 8080. Dropping request.  Check SNMP counters.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you see this message for internal service-to-service connections, chances are you’re dealing with a scaling problem or a &lt;a href=&quot;https://en.wikipedia.org/wiki/Thundering_herd_problem&quot;&gt;thundering herd problem&lt;/a&gt;, and not an actual SYN flood or intentional bad actor.&lt;/p&gt;

&lt;p&gt;Those “SNMP Counters” that the kernel mentions are available in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nstat&lt;/code&gt; in the network namespace:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ nstat -a | grep TCPReqQFull
TcpExtTCPReqQFullDoCookies      706                0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
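
&lt;p&gt;The same counters are exposed by the kernel in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/net/netstat&lt;/code&gt;, which can be handy for monitoring agents that cannot shell out to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nstat&lt;/code&gt;. The following is a minimal parsing sketch; the sample text is a made-up, heavily truncated stand-in for the real file, and the function name is illustrative:&lt;/p&gt;

```python
def parse_netstat(text):
    """Parse /proc/net/netstat content into a {"TcpExtName": value} dict."""
    counters = {}
    lines = text.strip().splitlines()
    # The file alternates between a header line of names and a line of values,
    # both starting with the same prefix such as "TcpExt:".
    for names_line, values_line in zip(lines[0::2], lines[1::2]):
        prefix, names = names_line.split(":", 1)
        _, values = values_line.split(":", 1)
        for name, value in zip(names.split(), values.split()):
            counters[prefix + name] = int(value)
    return counters


sample = """TcpExt: SyncookiesSent ListenOverflows TCPReqQFullDoCookies
TcpExt: 706 6811 706
"""
print(parse_netstat(sample)["TcpExtTCPReqQFullDoCookies"])  # 706
```

&lt;p&gt;On a Linux host, passing the result of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;open(&quot;/proc/net/netstat&quot;).read()&lt;/code&gt; instead of the sample returns the absolute counter values for the current network namespace.&lt;/p&gt;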

&lt;h3 id=&quot;syn-backlog-in-the-lab&quot;&gt;SYN backlog in the lab&lt;/h3&gt;

&lt;p&gt;To see how the SYN backlog behaves in the lab, we can simulate latency on local connections with the following:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo tc qdisc add dev lo root netem delay 200ms
vagrant@blog-lab-scaling-accept:~$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=401 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=400 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This means we can now simulate overflowing the SYN backlog. To make it a bit easier to reproduce, reduce the size of the SYN backlog from the default of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128&lt;/code&gt; by reducing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; (because SYN cookies are enabled):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=10
net.core.somaxconn = 10
vagrant@blog-lab-scaling-accept:~$ sudo systemctl restart nginx
vagrant@blog-lab-scaling-accept:~$ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, if more than 10 connections arrive within 400ms (before the SYN-ACK and ACK handshake completes), then 10 connections will be in the SYN backlog, which is the maximum configured. This will trigger SYN cookies to be sent, producing the message and counters above. Let’s test whether this works by running a simulation that opens N concurrent connections:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:80 20
Waiting, Ctrl+C to exit.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And verify that SYN cookies were sent as expected:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo dmesg -c
[ 2571.784749] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies.  Check SNMP counters.
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ReqQ
TcpExtTCPReqQFullDoCookies      10                 0.0
vagrant@blog-lab-scaling-accept:~$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Try disabling the simulated latency with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo tc qdisc del dev lo root&lt;/code&gt; and re-run the same test: connections now move through fast enough that SYN cookies are not sent.&lt;/p&gt;

&lt;p&gt;This also demonstrates the way the Round Trip Time affects the number of connections that can burst into a LISTEN socket - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_send_concurrent_connections.py&lt;/code&gt; does actually attempt to initiate complete connections and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx&lt;/code&gt; on the other end is readily calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; as fast as it can, but because 20 connections are initiated at once, 20 SYN packets arrive before any handshake can continue, and the SYN backlog overflows. You can imagine that in the real world, with some of these values often defaulting to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;128&lt;/code&gt;, and users often being far away from servers (on the other side of the world), it’s pretty easy to accidentally trigger this scenario without a true SYN flood.&lt;/p&gt;

&lt;h2 id=&quot;accept-queue&quot;&gt;Accept queue&lt;/h2&gt;

&lt;p&gt;Once an ACK packet is received and validated, a new client connection is ready for the application to process. This connection is moved into the accept queue, waiting for the application to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; and receive it.&lt;/p&gt;

&lt;p&gt;Unlike the SYN backlog, the accept queue has no backup plan for when it overflows. Back in the original call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt;, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; argument was provided, which indicates how many connections can have completed the 3-way handshake and be waiting in the kernel for the application to accept.&lt;/p&gt;

&lt;p&gt;This is the first common issue: If the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; number provided to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; is not large enough to contain any number of connections that could reasonably complete their handshake between 2 calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;, then a connection will be dropped on the floor, and the application generally won’t even notice - the next call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; will succeed without any indication of a dropped connection! This is particularly likely to happen when some reasonable amount of work can happen between calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; or when the incoming connections tend to arrive at the same time (such as jobs running on many servers with cron, or a thundering herd of reconnection attempts).&lt;/p&gt;

&lt;p&gt;However, even when you specify a high enough &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; value to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt;, there’s another location that silently limits this value. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; sysctl specifies a network-system-wide maximum for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; of any socket. When a larger &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; value is provided, the kernel silently caps it at the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt;. This is the next most common issue: Both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; need to be adjusted appropriately so that the backlog is actually adjusted as expected.&lt;/p&gt;
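
&lt;p&gt;The capping rule can be written down in one line. This is only a model of the behaviour described above, not kernel code, and the function name is illustrative:&lt;/p&gt;

```python
def effective_backlog(listen_backlog, somaxconn):
    """Backlog the kernel actually applies: the smaller of the two values."""
    return min(listen_backlog, somaxconn)


print(effective_backlog(1024, 128))   # 128: silently capped by the sysctl
print(effective_backlog(10, 1024))    # 10: limited by the listen() argument
```

&lt;p&gt;Raising only one of the two values therefore never increases the accept queue beyond the other.&lt;/p&gt;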

&lt;p&gt;As an added complexity, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; sysctl is not system global, but is global to a Linux network namespace. For new network namespaces, such as those used by most Docker containers, the values from the default network namespace are not inherited, but instead set to the built-in kernel default:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo docker run -it ubuntu sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ sudo docker run -it ubuntu sysctl net.core.somaxconn
net.core.somaxconn = 128
vagrant@blog-lab-scaling-accept:~$ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Therein lies the third common issue: when running in containers that have their own network namespace (which is most of those launched by Kubernetes/Docker), even if the host system has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; correctly tweaked, that value will be ignored, and so the container must also have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; tweaked to match the application running inside of it.&lt;/p&gt;

&lt;p&gt;If any of these situations occurs and causes impact, it is visible in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ListenOverflows&lt;/code&gt; counter:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ nstat -a | grep ListenOverflows
TcpExtListenOverflows           6811               0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;accept-queue-in-the-lab&quot;&gt;Accept queue in the lab&lt;/h3&gt;

&lt;p&gt;To see how the accept queue behaves in the lab, we can run a server that is bad at accepting connections (it sleeps a lot) with a backlog of 10:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In another window, send a bunch of connections to the laggy server, then exit after a few seconds:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
(wait a few seconds)
^C
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ListenOverflows
TcpExtListenOverflows           44                 0.0
vagrant@blog-lab-scaling-accept:~$ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, we can see how important the backlog argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; is. Next, let’s review how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; caps this by setting up a server with a larger backlog that we expect to be capped:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=10
net.core.somaxconn = 10
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 1024
Listening with backlog 1024
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All looks good - notice how we effectively called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen(1024)&lt;/code&gt; and nothing went wrong. Running the same commands as we did when we provided the small &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; value, we can observe the same problem when the backlog is silently truncated by the kernel:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
(wait a few seconds)
^C
vagrant@blog-lab-scaling-accept:~$ sudo nstat | grep ListenO
TcpExtListenOverflows           77                 0.0
vagrant@blog-lab-scaling-accept:~$ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
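&lt;p&gt;Because the truncation is silent, an application can check for it at startup. The following is a rough sketch, assuming a non-negative backlog and a Linux system where the namespace-local limit is readable from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/sys/net/core/somaxconn&lt;/code&gt;; the helper names are hypothetical:&lt;/p&gt;

```python
def effective_backlog(requested, somaxconn):
    """Return the accept queue limit the kernel would apply for listen(requested)."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read somaxconn as seen from the current network namespace (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def check_backlog(requested, somaxconn):
    """Return a warning string if the requested backlog would be truncated, else None."""
    limit = effective_backlog(requested, somaxconn)
    if limit != requested:
        return "listen(%d) will be silently capped to %d by somaxconn" % (requested, limit)
    return None
```

&lt;p&gt;For the lab scenario above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_backlog(1024, read_somaxconn())&lt;/code&gt; would return a warning while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; is set to 10.&lt;/p&gt;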

&lt;h2 id=&quot;monitoring-counters&quot;&gt;Monitoring counters&lt;/h2&gt;

&lt;p&gt;A few counters were mentioned above that can be read with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nstat&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TcpExtTCPReqQFullDoCookies&lt;/code&gt; - detecting where SYN cookies were used to mitigate a lack of SYN backlog space&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TcpExtTCPReqQFullDrop&lt;/code&gt; - detecting where SYNs were dropped because SYN cookies were disabled and the SYN backlog was full&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TcpExtListenOverflows&lt;/code&gt; - detecting when a TCP connection completed the 3-way handshake but the accept queue was full or when a SYN was received while the accept queue was full&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These counters are maintained per network namespace and are global within that namespace, so they carry no information about which socket or application caused the issue. If the process runs in a container, its overflows will not appear in the base system counters (in the default network namespace). Every network namespace therefore needs to be inspected or monitored for these counters, and additional work is then needed to trace a change back to the application, potentially using the kernel log lines as a hint in the case of the SYN backlog variant.&lt;/p&gt;
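&lt;p&gt;One way to cover every namespace is to pick a representative process per namespace and read the counters through its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc&lt;/code&gt; entry. The following is a sketch under the assumption that it runs as root on Linux, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/PID/ns/net&lt;/code&gt; identifies the namespace and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/PID/net/netstat&lt;/code&gt; shows that namespace&apos;s counters; the function names are illustrative:&lt;/p&gt;

```python
import glob
import os


def group_pids_by_netns(pid_to_ns):
    """Group PIDs by network namespace identifier, e.g. 'net:[4026531992]'."""
    groups = {}
    for pid, ns in sorted(pid_to_ns.items()):
        groups.setdefault(ns, []).append(pid)
    return groups


def discover_netns(proc="/proc"):
    """Map each readable PID to its network namespace link."""
    pid_to_ns = {}
    for path in glob.glob(os.path.join(proc, "[0-9]*")):
        try:
            pid_to_ns[int(os.path.basename(path))] = os.readlink(os.path.join(path, "ns", "net"))
        except OSError:
            continue  # the process exited or we lack permission
    return pid_to_ns


def listen_overflows_for_pid(pid, proc="/proc"):
    """Read ListenOverflows as seen from the network namespace of the given PID."""
    with open(os.path.join(proc, str(pid), "net", "netstat")) as f:
        lines = f.read().splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            fields = dict(zip(header.split()[1:], values.split()[1:]))
            return int(fields.get("ListenOverflows", 0))
    return 0
```

&lt;p&gt;Iterating over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group_pids_by_netns(discover_netns())&lt;/code&gt; and calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen_overflows_for_pid()&lt;/code&gt; on the first PID of each group yields one reading per namespace. It still does not say which socket overflowed.&lt;/p&gt;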

&lt;h2 id=&quot;detecting-backlog-overflows-with-tracing&quot;&gt;Detecting backlog overflows with tracing&lt;/h2&gt;

&lt;p&gt;The metrics available from the counters above provide a minimal picture of backlog overflows and dropped connections, but ideally we would be able to inspect this situation system-wide, seeing through any network namespaces, and be able to link it back to an application and even socket/port.&lt;/p&gt;

&lt;p&gt;This is possible with kprobes and eBPF tracing via &lt;a href=&quot;https://github.com/iovisor/bcc&quot;&gt;bcc&lt;/a&gt;, by hooking the kernel functions that handle these failure cases and inspecting the context at that point in time. This allows us to extract realtime data from systems in production, unlike the counters, which are at best vague indicators that an issue exists somewhere on the system.&lt;/p&gt;

&lt;p&gt;We know that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ListenOverflows&lt;/code&gt; counter is incremented any time the accept backlog overflows - we can start there and work back to building a tracing program.&lt;/p&gt;

&lt;p&gt;Searching the Linux source code for &lt;a href=&quot;https://github.com/torvalds/linux/search?q=ListenOverflows&amp;amp;unscoped_q=ListenOverflows&quot;&gt;ListenOverflows&lt;/a&gt; shows what this counter is called internally - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LINUX_MIB_LISTENOVERFLOWS&lt;/code&gt;. Searching the &lt;a href=&quot;https://elixir.bootlin.com/linux/latest/ident/LINUX_MIB_LISTENOVERFLOWS&quot;&gt;kernel source tree&lt;/a&gt; shows all the places in the kernel that increment that counter - in this case, the best candidates are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net/ipv4/tcp_input.c&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_v4_syn_recv_sock&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net/ipv4/tcp_ipv4.c&lt;/code&gt;. They handle slightly different points in the connection lifecycle:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; handles a SYN packet that initiates a new connection against a LISTEN socket and places it in the SYN backlog&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_v4_syn_recv_sock&lt;/code&gt; handles an ACK packet that completes a connection and adds it to the accept queue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These map to the earlier diagram:&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;LISTEN backlogs with tcp_* functions&quot; src=&quot;/images/scaling-linux-services-before-accepting-connections/backlog-functions.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; drops connections if the accept queue is full as a safety mechanism, even though it wouldn’t be adding to it yet. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_v4_syn_recv_sock&lt;/code&gt; also drops connections if the accept queue is full, and rightly so, since it would be adding to it. If a SYN packet was accepted and added to the SYN backlog while the accept queue had available space, but the queue was full by the time the ACK arrived, the drop occurs in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_v4_syn_recv_sock&lt;/code&gt; when the ACK is received. If a new SYN arrives while the accept queue is full, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; drops it instead. For local testing in a VM, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; is the easier one to exercise since it doesn’t require careful timing between SYN and ACK to reproduce.&lt;/p&gt;

&lt;p&gt;The code path that increments the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ListenOverflows&lt;/code&gt; counter in each of those functions looks like the following:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tcp_conn_request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;request_sock_ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsk_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tcp_request_sock_ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;af_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_buff&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;skb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_acceptq_is_full&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;NET_INC_STATS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LINUX_MIB_LISTENOVERFLOWS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;drop:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tcp_listendrop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;tcp_v4_syn_recv_sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_buff&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;skb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                  &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;request_sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;req&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                  &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst_entry&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                  &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;request_sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;req_unhash&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                  &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;own_req&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_acceptq_is_full&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exit_overflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;exit_overflow:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;NET_INC_STATS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LINUX_MIB_LISTENOVERFLOWS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/torvalds/linux/blob/4a21185cda0fbb860580eeeb4f1a70a9cda332a4/include/net/sock.h#L917-L920&quot;&gt;function called to check for overflow&lt;/a&gt; is a simple inlined check:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sk_acceptq_is_full&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;READ_ONCE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;READ_ONCE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
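&lt;p&gt;Note that the comparison is strictly greater-than, so the queue is only considered full once it holds more entries than the configured maximum. The following small model of that comparison, as quoted above (other kernel versions may differ), illustrates the effect when nothing calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;; the function names are illustrative:&lt;/p&gt;

```python
def acceptq_is_full(ack_backlog, max_ack_backlog):
    """Mirror the quoted comparison: full only once the count exceeds the maximum."""
    return ack_backlog > max_ack_backlog


def fill_accept_queue(max_ack_backlog, attempts):
    """Simulate handshakes completing with no accept() calls; return (queued, dropped)."""
    queued = 0
    dropped = 0
    for _ in range(attempts):
        if acceptq_is_full(queued, max_ack_backlog):
            dropped += 1
        else:
            queued += 1
    return queued, dropped


print(fill_accept_queue(10, 20))  # prints (11, 9)
```

&lt;p&gt;With a maximum of 10, the model queues 11 connections before it starts dropping, which is why a trace of this condition would be expected to report a backlog of 11 against a maximum of 10.&lt;/p&gt;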

&lt;p&gt;When looking to trace functions that are inlined like this, the easiest way to hook the condition is to attach to the start of the calling function (in this case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_conn_request&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp_v4_syn_recv_sock&lt;/code&gt;) which is not inlined, and perform the same conditional check in our eBPF code. We can do this with kprobes:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* generic handler */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;handle_sk_potential_overflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pt_regs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;syn_recv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* we need to read these using bpf_probe_read to ensure the read is safe */&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bpf_probe_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bpf_probe_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_ack_backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_max_ack_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;cm&quot;&gt;/* handle the condition */&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* when a SYN arrives */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tcp_request_sock_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;kprobe__tcp_conn_request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pt_regs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;request_sock_ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rsk_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tcp_request_sock_ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;af_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk_buff&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;skb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;handle_sk_potential_overflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* when an ACK arrives for a SYN_RECV socket */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;kprobe__tcp_v4_syn_recv_sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pt_regs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* no need for the remaining unused args */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;handle_sk_potential_overflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, we have a hook on the same condition that would trigger incrementing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ListenOverflows&lt;/code&gt; counter in TCP over IPv4. This can be fleshed out to read detailed information about the socket and return it to the userspace tracing program.&lt;/p&gt;

&lt;h3 id=&quot;tracing-backlog-overflows-in-the-lab&quot;&gt;Tracing backlog overflows in the lab&lt;/h3&gt;

&lt;p&gt;One such program is included in the lab. Let’s start up our laggy server again:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In another session, start up the tracing program:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py 
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                 PKT     BL    MAX
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And finally, run our client:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ python /vagrant/test_send_concurrent_connections.py 127.0.0.1:8080 20
Waiting, Ctrl+C to exit.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The result will be traces showing where the backlog overflowed the configured maximum, along with details of the process and whether it occurred during SYN or ACK:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                 PKT     BL    MAX
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       SYN     11     10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As discussed earlier, if many SYN packets are queued while there is room in the accept queue, but the queue then overflows anyway, the ACK packets trigger the overflow instead (run the client again in another window to see this output):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo tc qdisc add dev lo root netem delay 200ms
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_backlog_overflow.py
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                 PKT     BL    MAX
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
blog-lab-scaling-acc  14742 python /vagrant/lagg 4026531992 127.0.0.1:8080       ACK     11     10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;detecting-misconfigured-listen-sockets-with-tracing&quot;&gt;Detecting misconfigured listen sockets with tracing&lt;/h2&gt;

&lt;p&gt;We can also trace calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; to observe when the system places caps on the backlog size, to see a leading indicator for applications launching on the machine that have misconfigured values for the backlog or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt;. This will only work at the time that applications start up, since we’ll be hooking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; as it is actually called, which is typically just when the application first launches.&lt;/p&gt;

&lt;p&gt;The easiest way to find the kernel source responsible for this is to start from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; syscall, which is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__sys_listen()&lt;/code&gt; &lt;a href=&quot;https://github.com/torvalds/linux/blob/cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2/net/socket.c#L1677-L1696&quot;&gt;in the kernel&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__sys_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fput_needed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sockfd_lookup_light&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fput_needed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;somaxconn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;core&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sysctl_somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;security_socket_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;fput_light&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fput_needed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From this snippet alone, it’s easy to observe the way that somaxconn (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sysctl_somaxconn&lt;/code&gt;) from the current network namespace (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sock_net(sock-&amp;gt;sk)&lt;/code&gt;) silently limits the backlog. By tracing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen&lt;/code&gt; syscall itself, we’ll get the originally requested &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt;, but to see what the backlog was eventually set to we need to hook a point after that limit has been applied (we could also read the sysctl ourselves and repeat the logic).&lt;/p&gt;
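
&lt;p&gt;The capping logic can be sketched in a few lines of Python. This is only an illustration of the comparison in the snippet above, not code from the lab; the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;capped_backlog&lt;/code&gt; is hypothetical, and it assumes a 32-bit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsigned int&lt;/code&gt;, as on Linux:&lt;/p&gt;

```python
UINT_RANGE = 2 ** 32  # assumed width of the kernel's unsigned int


def capped_backlog(requested, somaxconn):
    """Return the backlog that __sys_listen() would pass on (hypothetical helper)."""
    # The kernel compares (unsigned int)backlog against somaxconn, so a
    # negative request wraps around to a very large value and gets capped.
    as_unsigned = requested % UINT_RANGE
    if as_unsigned > somaxconn:
        return somaxconn
    return requested


print(capped_backlog(10, 128))
print(capped_backlog(1024, 128))
print(capped_backlog(-1, 128))
```

&lt;p&gt;A request below the limit passes through unchanged, while a larger (or negative) request comes back as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; value.&lt;/p&gt;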

&lt;p&gt;Conveniently, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sock-&amp;gt;ops-&amp;gt;listen(...)&lt;/code&gt; calls the underlying listen operation in the address family, in this case &lt;a href=&quot;https://github.com/torvalds/linux/blob/cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2/net/ipv4/af_inet.c#L196&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inet_listen&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;inet_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inet_listen&lt;/code&gt; is called with the capped backlog. By combining a trace on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt; syscall itself with a trace on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inet_listen&lt;/code&gt;, we can map the original requested value to the actual value used:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;BPF_HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requested_backlogs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;TRACEPOINT_PROBE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;syscalls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys_enter_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;requested_backlogs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;kprobe__inet_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pt_regs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;u64&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requested_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;requested_backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requested_backlogs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requested_backlog&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

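    &lt;span class=&quot;cm&quot;&gt;/* copy the value out before deleting the map entry it points into */&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requested&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requested_backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;requested_backlogs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;delete&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/*
        `requested` contains the argument to listen()
        `backlog` contains the capped backlog
    */&lt;/span&gt;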
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With some extra work we can also peek inside the network namespace to see the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; at the time of the call. From above, we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sock_net(sock-&amp;gt;sk)-&amp;gt;core.sysctl_somaxconn&lt;/code&gt; if we have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct socket *sock&lt;/code&gt; - or since we now have the internal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct sock *sk&lt;/code&gt;, we’ll want &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sock_net(sk)-&amp;gt;core.sysctl_somaxconn&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sock_net&lt;/code&gt; function is &lt;a href=&quot;https://github.com/torvalds/linux/blob/cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2/include/net/sock.h#L2520-L2524&quot;&gt;pretty simple&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read_pnet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* ... elsewhere */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;read_pnet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;possible_net_t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pnet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#ifdef CONFIG_NET_NS
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pnet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#else
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#endif
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’ll need to simulate that function chain with an explicit call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bpf_probe_read&lt;/code&gt; to ensure it’s safe to read that member. Otherwise the eBPF verifier will reject the program, because the dereferencing is a bit too complex for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bcc&lt;/code&gt; to automatically safeguard with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bpf_probe_read&lt;/code&gt; call. That gives us the following:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;safe_sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#ifdef CONFIG_NET_NS
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// was: read_pnet(&amp;amp;sk-&amp;gt;sk_net), which returns sk-&amp;gt;sk_net.net&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bpf_probe_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#else
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#endif
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;kprobe__inet_listen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pt_regs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backlog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;netns_somaxconn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bpf_probe_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;netns_somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;netns_somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;safe_sock_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;core&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sysctl_somaxconn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point we have the requested backlog (the argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt;), the actual backlog calculated (after any somaxconn capping) and the value for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; as well. We can fill this out with further information about the pid and namespace and provide it to the userspace side of the tracing app.&lt;/p&gt;
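
&lt;p&gt;On the userspace side, flagging a capped backlog is then a simple comparison of those three values. The following is a minimal sketch rather than the lab’s actual script; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;describe_listen&lt;/code&gt; and the column widths are assumptions made for illustration:&lt;/p&gt;

```python
def describe_listen(requested, somaxconn, actual):
    """Format REQ / MAX / ACTUAL columns, flagging a silently capped backlog.

    Hypothetical helper: the real tracing app in the lab may format differently.
    """
    # The backlog was capped whenever the value that reached inet_listen
    # differs from the value the application passed to listen().
    flag = "LIMITED" if actual != requested else ""
    row = "{:>6} {:>6} {:>6} {}".format(requested, somaxconn, actual, flag)
    return row.rstrip()


print(describe_listen(10, 128, 10))
print(describe_listen(1024, 128, 128))
```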

&lt;h3 id=&quot;tracing-accept-queue-overflows-in-the-lab&quot;&gt;Tracing capped listen backlogs in the lab&lt;/h3&gt;

&lt;p&gt;A fleshed-out tracing app for the above is also included in the lab. This time, we need to start the tracing app before our server calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo /vagrant/reset-lab.sh
vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py 
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                    REQ    MAX ACTUAL
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we can start up our server with a small backlog, then with a larger backlog, and then with a backlog limited by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 10
Listening with backlog 10
^C
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 128
Listening with backlog 128
^C
vagrant@blog-lab-scaling-accept:~$ python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The tracing app will show all these cases, and flag the one where the backlog was silently limited:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py 
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                    REQ    MAX ACTUAL
blog-lab-scaling-acc  14849 python /vagrant/lagg 4026531992 127.0.0.1:8080           10    128     10 
blog-lab-scaling-acc  14851 python /vagrant/lagg 4026531992 127.0.0.1:8080          128    128    128 
blog-lab-scaling-acc  14853 python /vagrant/lagg 4026531992 127.0.0.1:8080         1024    128    128 LIMITED
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This also works through container network namespaces:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo docker run -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
vagrant@blog-lab-scaling-accept:~$ sudo docker run -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$ sudo docker run --sysctl net.core.somaxconn=1024 -v /vagrant:/vagrant -it python:2.7 python /vagrant/laggy_server.py 1024
Listening with backlog 1024
^C
vagrant@blog-lab-scaling-accept:~$ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somaxconn&lt;/code&gt; value on the base system had no effect inside the container, but setting the value in the container’s own namespace with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker run --sysctl&lt;/code&gt; did change it:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vagrant@blog-lab-scaling-accept:~$ sudo python /vagrant/trace_listen_backlog_capped.py 
CONTAINER/HOST          PID PROCESS              NETNSID    BIND                    REQ    MAX ACTUAL
380944436fe3          15274 python /vagrant/lagg 4026532171 127.0.0.1:8080         1024    128    128 LIMITED
9e82132ad3d6          15387 python /vagrant/lagg 4026532171 127.0.0.1:8080         1024    128    128 LIMITED
af3b584488ea          15495 python /vagrant/lagg 4026532171 127.0.0.1:8080         1024   1024   1024 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;wrapping-up&quot;&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;We’ve seen how scaling something as simple as a TCP service can run into unexpected resource limitations, even before the program accepts a connection. Key takeaways are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;listen&lt;/code&gt; should always be set high enough so that any valid burst of connections that could arrive between calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; will be queued up&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; must be set high enough that any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog&lt;/code&gt; value is not restricted - if not, it will be silently capped&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.core.somaxconn&lt;/code&gt; is namespaced, so must be set in the namespace/container that the socket is created in&lt;/li&gt;
  &lt;li&gt;SYN flood warnings are often logged due to the SYN backlog being too small, and should be treated as a potential scaling issue unless a real SYN flood is suspected&lt;/li&gt;
  &lt;li&gt;The number of connections in the SYN backlog depends on the RTT between the client and server, not just the rate of connections&lt;/li&gt;
&lt;/ul&gt;
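
&lt;p&gt;As a rough illustration of that last point, the occupancy of the SYN backlog can be approximated as the connection rate multiplied by the RTT, since each handshake holds a slot for about one round trip. This is a back-of-the-envelope sketch with a hypothetical function name and made-up example numbers; it ignores retransmissions and SYN cookies:&lt;/p&gt;

```python
def estimated_syn_backlog(connections_per_second, rtt_seconds):
    """Approximate how many handshakes are waiting for the final ACK at once."""
    # Each connection occupies a SYN backlog slot for roughly one round trip,
    # so occupancy is about arrival rate multiplied by RTT.
    return connections_per_second * rtt_seconds


# Same connection rate, very different backlog requirements (example values).
print(estimated_syn_backlog(1000, 0.001))  # clients in the same datacenter
print(estimated_syn_backlog(1000, 0.2))    # clients a long way away
```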

&lt;p&gt;The tracing tools provided here are fairly lightweight in kernel space: they are mostly gated to failure cases and hook relatively low-rate system calls. This makes them reasonably safe to use on production systems, unless CPU usage is an active bottleneck at the time. They may help track down which application or socket is triggering the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ListenOverflow&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TCPReqQFull*&lt;/code&gt; counters, and can monitor applications as they launch for misconfigurations without needing to audit the configuration of each one individually.&lt;/p&gt;
</description>
        <pubDate>Fri, 03 Jul 2020 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2020/07/03/scaling-linux-services-before-accepting-connections.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2020/07/03/scaling-linux-services-before-accepting-connections.html</guid>
        
        
      </item>
    
      <item>
        <title>Debugging network stalls on Kubernetes</title>
        <description>&lt;p&gt;We’ve talked about &lt;a href=&quot;https://github.blog/2017-08-16-kubernetes-at-github/&quot;&gt;Kubernetes&lt;/a&gt; before, and over the last couple of years it’s become the standard deployment pattern at GitHub. We now run a large portion of both internal and public-facing services on Kubernetes. As our Kubernetes clusters have grown, and our targets on the latency of our services have become more stringent, we began to notice that certain services running on Kubernetes in our environment were experiencing sporadic latency that couldn’t be attributed to the performance characteristics of the application itself.&lt;/p&gt;

&lt;p&gt;Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of 100ms or more on connections, which would cause downstream timeouts or retries. Services were expected to be able to respond to requests in well under 100ms, which wasn’t feasible when the connection itself was taking so long. Separately, we also observed that very fast MySQL queries, which we expected to take a matter of milliseconds and which MySQL itself recorded as taking only milliseconds, appeared to take 100ms or more from the perspective of the querying application.&lt;/p&gt;

&lt;p&gt;The problem was initially narrowed down to communications that involved a Kubernetes node, even if the other side of a connection was outside Kubernetes. The simplest reproduction we had was a &lt;a href=&quot;https://github.com/tsenart/vegeta&quot;&gt;Vegeta&lt;/a&gt; benchmark that could be run from any internal host, targeting a Kubernetes service running on a node port, which would observe the sporadically high latency. In this post, we’ll walk through how we tracked down the underlying issue.&lt;/p&gt;

&lt;h2 id=&quot;removing-complexity-to-find-the-path-at-fault&quot;&gt;Removing complexity to find the path at fault&lt;/h2&gt;

&lt;p&gt;Using an example reproduction, we wanted to narrow down the problem and remove layers of complexity. Initially, there were too many moving parts in the flow between Vegeta and pods running on Kubernetes to determine if this was a deeper network problem, so we needed to rule some out.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;vegeta-to-nodeport&quot; src=&quot;/images/debugging-network-latency-kubernetes/vegeta-to-nodeport.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The client, Vegeta, creates a TCP connection to any kube-node in the cluster. Kubernetes runs in our data centers as an &lt;a href=&quot;https://en.wikipedia.org/wiki/Overlay_network&quot;&gt;overlay network&lt;/a&gt; (a network that runs on top of our existing datacenter network) that uses &lt;a href=&quot;https://en.wikipedia.org/wiki/IP_in_IP&quot;&gt;IPIP&lt;/a&gt; (which encapsulates the overlay network’s IP packet inside the datacenter’s IP packet). When a connection is made to that first kube-node, it performs stateful &lt;a href=&quot;https://en.wikipedia.org/wiki/Network_address_translation&quot;&gt;Network Address Translation&lt;/a&gt; (NAT) to convert the kube-node’s IP and port to an IP and port on the overlay network (specifically, of the pod running the application). On return, it undoes each of these steps. This is a complex system with a lot of state, and a lot of moving parts that are constantly updating and changing as services deploy and move around.&lt;/p&gt;

&lt;p&gt;As part of running a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcpdump&lt;/code&gt; on the original Vegeta benchmark, we observed the latency during a TCP handshake (between SYN and SYN-ACK). To simplify some of the complexity of HTTP and Vegeta, we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; to just “ping” with a SYN packet and see if we observe the latency in the response packet—then throw away the connection. We can filter it to only include packets over 100ms and get a simpler reproduction case than a full Layer 7 Vegeta benchmark or attack against the service. The following “pings” a kube-node using TCP SYN/SYN-ACK on the “node port” for the service (30927) with an interval of 10ms, filtered for slow responses:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Our first new observation from the sequence numbers and timings is that this isn’t a one-off, but is often grouped, like a backlog that eventually gets processed.&lt;/p&gt;
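
&lt;p&gt;To make that grouping easier to see, here is a minimal Python sketch (not part of the original investigation) that groups the slow responses above by consecutive sequence number. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;group_bursts&lt;/code&gt; helper and its regular expression are illustrative names, and the only assumption is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; line format shown above:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import re

# Matches the sequence number and round trip time in an hping3 response line.
LINE = re.compile(r'seq=(\d+) .*rtt=([\d.]+) ms')

def group_bursts(lines):
    '''Group slow responses whose sequence numbers are consecutive.'''
    bursts = []
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue
        seq, rtt = int(match.group(1)), float(match.group(2))
        # A response extends the current burst only if it directly follows the previous one.
        if bursts and seq == bursts[-1][-1][0] + 1:
            bursts[-1].append((seq, rtt))
        else:
            bursts.append([(seq, rtt)])
    return bursts

# The six slow responses from the hping3 output above.
SAMPLE = [
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms',
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms',
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms',
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms',
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms',
    'len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms',
]

for burst in group_bursts(SAMPLE):
    print([seq for seq, _ in burst], [rtt for _, rtt in burst])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the six lines above, this yields one burst of four consecutive responses (1485 to 1488) followed by two isolated ones.&lt;/p&gt;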

&lt;p&gt;Next up, we want to narrow down which component(s) were potentially at fault. Is it the kube-proxy iptables NAT rules that are hundreds of rules long? Is it the IPIP tunnel and something on the network handling them poorly? One way to validate this is to test each step of the system. What happens if we remove the NAT and firewall logic and only use the IPIP part:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ipip-only&quot; src=&quot;/images/debugging-network-latency-kubernetes/ipip-only.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Linux thankfully lets you just talk directly to an overlay IP when you’re on a machine that’s part of the same network, so that’s pretty easy to do:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Based on our results, the problem still remains! That rules out iptables and NAT. Is it TCP that’s the problem? Let’s see what happens when we perform a normal ICMP ping:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Our results show that the problem still exists. Is it the IPIP tunnel that’s causing the problem? Let’s simplify things further:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;direct-ping&quot; src=&quot;/images/debugging-network-latency-kubernetes/direct-ping.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Is it possible that it’s every packet between these two hosts?&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Behind the complexity, it’s as simple as two kube-node hosts sending any packet, even ICMP pings, to each other. They’ll still see the latency, if the target host is a “bad” one (some are worse than others).&lt;/p&gt;

&lt;p&gt;Now there’s one last thing to question: we clearly don’t observe this everywhere, so why is it just on kube-node servers? And does it occur when the kube-node is the sender or the receiver? Luckily, this is also pretty easy to narrow down by using a host outside Kubernetes as a sender, but with the same “known bad” target host (from a staff shell host to the same kube-node). We can observe this is still an issue in that direction:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then perform the same from the previous source kube-node to a staff shell host (which rules out the source host, since a ping has both an RX and TX component):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered &apos;rtt=[0-9]{3}\.&apos;
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking into the packet captures of the latency we observed, we get some more information. Specifically, that the “sender” host (bottom) observes this timeout while the “receiver” host (top) does not—see the Delta column (in seconds):&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;sender-observes&quot; src=&quot;/images/debugging-network-latency-kubernetes/sender-observes.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Additionally, by looking at the difference between the ordering of the packets (based on the sequence numbers) on the receiver side of the TCP and ICMP results above, we can observe that ICMP packets always arrive in the same sequence they were sent, but with uneven timing, while TCP packets are sometimes interleaved, but a subset of them stall. Notably, we observe that if you count the ports of the SYN packets, the ports are not in order on the receiver side, while they’re in order on the sender side.&lt;/p&gt;

&lt;p&gt;There is a subtle difference between how modern server &lt;a href=&quot;https://en.wikipedia.org/wiki/Network_interface_controller&quot;&gt;NICs&lt;/a&gt; (like the ones we have in our data centers) handle packets containing TCP vs ICMP. When a packet arrives, the NIC hashes the packet “per connection” and tries to divvy up the connections across receive queues, each (approximately) delegated to a given CPU core. For TCP, this hash includes both source and destination IP and port. In other words, each connection is hashed (potentially) differently. For ICMP, just the IP source and destination are hashed, since there are no ports.&lt;/p&gt;
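
&lt;p&gt;As a rough illustration of that difference, the Python sketch below maps flows to receive queues. It is a toy model under stated assumptions: CRC32 stands in for whatever hash the NIC really uses, and the queue count, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rx_queue&lt;/code&gt; helper and the source port range are all hypothetical. Only the choice of hashed fields follows the description above:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import zlib

NUM_RX_QUEUES = 8  # hypothetical queue count, for illustration only

def rx_queue(src_ip, dst_ip, src_port=None, dst_port=None):
    '''Pick an RX queue from a flow hash (CRC32 is a stand-in for the NIC hash).'''
    fields = [src_ip, dst_ip]
    # TCP contributes its ports to the hash; ICMP has no ports to contribute.
    if src_port is not None:
        fields += [str(src_port), str(dst_port)]
    key = '|'.join(fields).encode()
    return zlib.crc32(key) % NUM_RX_QUEUES

SRC, DST = '172.16.33.44', '172.16.47.27'

# Every ICMP packet between the two hosts has the same hash input, so the same queue.
icmp_queues = {rx_queue(SRC, DST) for _ in range(1000)}

# TCP SYNs sent from many source ports have different hash inputs.
tcp_queues = {rx_queue(SRC, DST, port, 30927) for port in range(40000, 41000)}

print('ICMP queues used:', sorted(icmp_queues))
print('TCP queues used: ', sorted(tcp_queues))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model, all ICMP traffic between a pair of hosts lands on one queue (and so one core), while TCP connections from different source ports are spread across several.&lt;/p&gt;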

&lt;p&gt;Another new observation, from comparing the sequence numbers in ICMP vs TCP, is that ICMP observes stalls on all communications between the two hosts during this period, while TCP does not. This tells us that the RX queue hashing is likely in play, almost certainly indicating the stall is in processing RX packets, not in sending responses.&lt;/p&gt;

&lt;p&gt;This rules out kube-node transmits, so we now know that it’s a stall in processing packets, and that it’s on the receive side on some kube-node servers.&lt;/p&gt;

&lt;h2 id=&quot;deep-dive-into-linux-kernel-packet-processing&quot;&gt;Deep dive into Linux kernel packet processing&lt;/h2&gt;

&lt;p&gt;To understand why the problem could be on the receiving side on some kube-node servers, let’s take a look at how the Linux kernel processes packets.&lt;/p&gt;

&lt;p&gt;Going back to the simplest traditional implementation, the network card receives a packet and sends an &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; to the Linux kernel stating that there’s a packet that should be handled. The kernel stops other work, switches context to the interrupt handler, processes the packet, then switches back to what it was doing.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;irq&quot; src=&quot;/images/debugging-network-latency-kubernetes/irq.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This context switching is slow. That may have been fine on a 10Mbit NIC in the 90s, but a modern server’s 10G NIC can bring in around 15 million packets per second at maximal line rate. On a smaller server with eight cores, that could mean the kernel is interrupted millions of times per second per core.&lt;/p&gt;

&lt;p&gt;Instead of constantly handling interrupts, many years ago Linux added &lt;a href=&quot;https://en.wikipedia.org/wiki/New_API&quot;&gt;NAPI&lt;/a&gt;, the networking API that modern drivers use for improved performance at high packet rates. At low rates, the kernel still accepts interrupts from the NIC in the method we mentioned. Once enough packets arrive and cross a threshold, it disables interrupts and instead begins polling the NIC and pulling off packets in batches. This processing is done in a “softirq”, or &lt;a href=&quot;https://www.kernel.org/doc/htmldocs/kernel-hacking/basics-softirqs.html&quot;&gt;software interrupt context&lt;/a&gt;. This happens at the end of syscalls and hardware interrupts, which are times that the kernel (as opposed to userspace) is already running.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;napi&quot; src=&quot;/images/debugging-network-latency-kubernetes/napi.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the number of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ksoftirqd&lt;/code&gt; (you’ll see one of these in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ps&lt;/code&gt; for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;napi-ksoftirqd&quot; src=&quot;/images/debugging-network-latency-kubernetes/napi-ksoftirqd.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;With an overview of the way the kernel is processing packets, we can see there is definitely opportunity for this processing to become stalled. If the time between softirq processing calls grows, packets could sit in the NIC RX queue for a while before being processed. This could be something deadlocking the CPU core, or it could be something slow preventing the kernel from running softirqs.&lt;/p&gt;

&lt;h2 id=&quot;narrow-down-processing-to-a-coremethod&quot;&gt;Narrow down processing to a core/method&lt;/h2&gt;

&lt;p&gt;At this point, it makes sense that this could happen, and we know we’re observing something that looks a lot like it. The next step is to confirm this theory, and if we do, understand what’s causing it.&lt;/p&gt;

&lt;p&gt;Let’s revisit the slow round trip packets we saw before:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As discussed previously, these ICMP packets are hashed to a single NIC RX queue and processed by a single CPU core. If we want to understand what the kernel is doing, it’s helpful to know where (cpu core) and how (softirq, ksoftirqd) it’s processing these packets so we can catch them in action.&lt;/p&gt;

&lt;p&gt;Now it’s time to use the tools that allow live tracing of a running Linux kernel - &lt;a href=&quot;https://github.com/iovisor/bcc&quot;&gt;bcc&lt;/a&gt; is what was used here. This allows you to write small C programs that hook arbitrary functions in the kernel, and buffer events back to a userspace Python program which can summarize and return them to you. The “hook arbitrary functions in the kernel” is the difficult part, but bcc actually goes out of its way to be as safe as possible to use, because it’s designed for tracing exactly this type of production issue that you can’t simply reproduce in a testing or dev environment.&lt;/p&gt;

&lt;p&gt;The plan here is simple: we know the kernel is processing those ICMP ping packets, so let’s hook the kernel function &lt;a href=&quot;https://github.com/torvalds/linux/blob/v4.19/net/ipv4/icmp.c#L925&quot;&gt;icmp_echo&lt;/a&gt; which takes an incoming ICMP “echo request” packet and initiates sending the ICMP “echo response” reply. We can identify the packet using the incrementing icmp_seq shown by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; above.&lt;/p&gt;

&lt;p&gt;The code for this &lt;a href=&quot;https://gist.github.com/theojulienne/9d78a0cb68dbe56f19a2ae6316bc6846&quot;&gt;bcc script&lt;/a&gt; looks complex, but breaking it down it’s not as scary as it sounds. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;icmp_echo&lt;/code&gt; function is passed a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct sk_buff *skb&lt;/code&gt;, which is the packet containing the ICMP echo request. We can delve into this live and pull out the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;echo.sequence&lt;/code&gt; (which maps to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;icmp_seq&lt;/code&gt; shown by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; above), and send that back to userspace. Conveniently, we can also grab the current process name/id as well. This gives us results like the following, live as the kernel processes these packets:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542    spokes-report-s 779
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One thing to note about this process name is that in a post-syscall &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;softirq&lt;/code&gt; context, you see the process that made the syscall show as the “process”, even though really it’s the kernel processing it safely within the kernel context.&lt;/p&gt;

&lt;p&gt;With that running, we can now correlate back from the stalled packets observed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; to the process that’s handling it. A simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt; on that capture for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;icmp_seq&lt;/code&gt; values with some context shows what happened before these packets were processed. The packets that line up with the above &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt; icmp_seq values have been marked along with the rtt’s we observed above (and what we’d have expected if &amp;lt;50ms rtt’s weren’t filtered out):&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The results tell us a few things. First, these packets are being processed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ksoftirqd/11&lt;/code&gt; which conveniently tells us this particular pair of machines have their ICMP packets hashed to core 11 on the receiving side. We can also see that every time there’s a stall, some packets are first processed in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt;’s syscall softirq context, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ksoftirqd&lt;/code&gt; taking over and processing the backlog, with exactly the number of packets we’d expect the backlog to contain.&lt;/p&gt;
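
&lt;p&gt;The backlog arithmetic can be sketched in a few lines of Python. This is a simplified model, not a measurement: it assumes the normal round trip time is negligible, that pings are sent at a fixed interval, and that everything queued during the stall is processed the moment the stall ends. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backlog_rtts&lt;/code&gt; helper and its parameters are hypothetical names:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def backlog_rtts(stall_ms, interval_ms):
    '''RTTs (ms) of pings sent every interval_ms while the receiving core is stalled for stall_ms.'''
    # A ping sent at time t sits in the RX queue until the stall ends, so it waits stall_ms - t.
    return [stall_ms - sent_at for sent_at in range(0, stall_ms, interval_ms)]

# A stall of roughly 100ms with hping3 sending every 10ms (-i u10000).
print(backlog_rtts(100, 10))
# A shorter stall of roughly 75ms at the same interval.
print(backlog_rtts(75, 10))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first call gives ten round trip times stepping down from 100ms to 10ms, and the second gives eight stepping down from 75ms to 5ms. That is the same shape as the two bursts in the trace above, including the entries that the 50ms filter hid from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hping3&lt;/code&gt;.&lt;/p&gt;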

&lt;p&gt;The fact that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt; is always running immediately prior to this also implicates it in the problem. Ironically, &lt;a href=&quot;https://github.com/google/cadvisor&quot;&gt;cadvisor&lt;/a&gt; “analyzes resource usage and performance characteristics of running containers”, yet it’s triggering this performance problem. As with many things related to containers, it’s all relatively bleeding-edge tooling, which can result in somewhat expected corner cases of bad performance.&lt;/p&gt;

&lt;h2 id=&quot;what-is-cadvisor-doing-to-stall-things&quot;&gt;What is cadvisor doing to stall things?&lt;/h2&gt;

&lt;p&gt;With the understanding of how the stall can happen, the process causing it, and the CPU core it’s happening on, we now have a pretty good idea of what this looks like. For the kernel to hard block and not schedule &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ksoftirqd&lt;/code&gt; earlier, and given we see packets processed under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt;’s softirq context, it’s likely that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt; is running a slow syscall which ends with the rest of the packets being processed:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;stalling-syscall&quot; src=&quot;/images/debugging-network-latency-kubernetes/stalling-syscall.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;That’s a theory but how do we validate this is actually happening? One thing we can do is trace what’s running on the CPU core throughout this process, catch the point where the packets are overflowing budget and processed by ksoftirqd, then look back a bit to see what was running on the CPU core. Think of it like taking an x-ray of the CPU every few milliseconds. It would look something like this:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;sampling-probing-syscall&quot; src=&quot;/images/debugging-network-latency-kubernetes/sampling-probing-syscall.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Conveniently, this is something that’s already mostly supported. The &lt;a href=&quot;https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perf record&lt;/code&gt;&lt;/a&gt; tool samples a given CPU core at a certain frequency and can generate a call graph of the live system, including both userspace and the kernel. Taking that recording and manipulating it using a quick fork of a tool from &lt;a href=&quot;https://github.com/brendangregg/FlameGraph&quot;&gt;Brendan Gregg’s FlameGraph&lt;/a&gt; that retained stack trace ordering, we can get a one-line stack trace for each 1ms sample, then get a sample of the 100ms before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ksoftirqd&lt;/code&gt; is in the trace:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# record 999 times a second, or roughly every 1ms, with some offset so as not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2&amp;gt;/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in the following:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(hundreds of traces that look similar)
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There’s a lot there, but looking through it you can see it’s the cadvisor-then-ksoftirqd pattern we saw from the ICMP tracer above. What does it mean?&lt;/p&gt;

&lt;p&gt;Each line is a trace of the CPU at a point in time. Each call down the stack is separated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;;&lt;/code&gt; on that line. Looking at the middle of the lines we can see the syscall being called is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.... ;do_syscall_64;sys_read; ...&lt;/code&gt; So cadvisor is spending a lot of time in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt; syscall relating to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mem_cgroup_*&lt;/code&gt; functions (the top of the call stack / end of line).&lt;/p&gt;
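
&lt;p&gt;For illustration, here is a small Python sketch of reading one of these lines programmatically. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize&lt;/code&gt; helper is a hypothetical name, the sample line is one of the traces above with the repeated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[cadvisor]&lt;/code&gt; frames shortened, and it assumes (as in the traces above) that the syscall handler is the frame directly after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;do_syscall_64&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def summarize(line):
    '''Return (process, syscall frame, innermost frame) for one collapsed stack line.'''
    frames = line.strip().split(';')
    # The first frame is the process name and the last is where the CPU was when sampled.
    process, leaf = frames[0], frames[-1]
    # Assumption: the syscall handler is the frame right after do_syscall_64.
    syscall = None
    if 'do_syscall_64' in frames:
        syscall = frames[frames.index('do_syscall_64') + 1]
    return process, syscall, leaf

# One of the cadvisor traces above, with the repeated [cadvisor] frames shortened.
SAMPLE = ('cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;'
          'vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages')

print(summarize(SAMPLE))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the sample line, this reports the process as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt;, the syscall frame as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys_read&lt;/code&gt; and the innermost frame as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mem_cgroup_node_nr_lru_pages&lt;/code&gt;.&lt;/p&gt;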

&lt;p&gt;The call stack trace isn’t convenient to see what’s being read, so let’s use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strace&lt;/code&gt; to see what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cadvisor&lt;/code&gt; is doing and find 100ms-or-slower syscalls:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2&amp;gt;&amp;amp;1 | egrep &apos;&amp;lt;0\.[1-9]&apos;
[pid 10436] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.156784&amp;gt;
[pid 10432] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.258285&amp;gt;
[pid 10137] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.678382&amp;gt;
[pid 10384] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.762328&amp;gt;
[pid 10436] &amp;lt;... read resumed&amp;gt; &quot;cache 154234880\nrss 507904\nrss_h&quot;..., 4096) = 658 &amp;lt;0.179438&amp;gt;
[pid 10384] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.104614&amp;gt;
[pid 10436] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.175936&amp;gt;
[pid 10436] &amp;lt;... read resumed&amp;gt; &quot;cache 0\nrss 0\nrss_huge 0\nmapped_&quot;..., 4096) = 577 &amp;lt;0.228091&amp;gt;
[pid 10427] &amp;lt;... read resumed&amp;gt; &quot;cache 0\nrss 0\nrss_huge 0\nmapped_&quot;..., 4096) = 577 &amp;lt;0.207334&amp;gt;
[pid 10411] &amp;lt;... epoll_ctl resumed&amp;gt; )   = 0 &amp;lt;0.118113&amp;gt;
[pid 10382] &amp;lt;... pselect6 resumed&amp;gt; )    = 0 (Timeout) &amp;lt;0.117717&amp;gt;
[pid 10436] &amp;lt;... read resumed&amp;gt; &quot;cache 154234880\nrss 507904\nrss_h&quot;..., 4096) = 660 &amp;lt;0.159891&amp;gt;
[pid 10417] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.917495&amp;gt;
[pid 10436] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.208172&amp;gt;
[pid 10417] &amp;lt;... futex resumed&amp;gt; )       = 0 &amp;lt;0.190763&amp;gt;
[pid 10417] &amp;lt;... read resumed&amp;gt; &quot;cache 0\nrss 0\nrss_huge 0\nmapped_&quot;..., 4096) = 576 &amp;lt;0.154442&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Sure enough, we see the slow &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt; calls. From the content being read and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mem_cgroup&lt;/code&gt; context above, these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt; calls are to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.stat&lt;/code&gt; file which shows the memory usage and limits of a cgroup (the resource isolation technology used by Docker). cadvisor is polling this file to get resource utilization details for the containers. Let’s see if it’s the kernel or cadvisor that’s doing something unexpected by attempting the read ourselves:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat &amp;gt;/dev/null

real    0m0.153s
user    0m0.000s
sys    0m0.152s
theojulienne@kube-node-bad ~ $ 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since we can reproduce it, this indicates that it’s the kernel hitting a pathologically bad case.&lt;/p&gt;

&lt;h2 id=&quot;what-causes-this-read-to-be-so-slow&quot;&gt;What causes this read to be so slow&lt;/h2&gt;

&lt;p&gt;At this point it’s much simpler to find similar issues reported by others. As it turns out, this has been reported to cadvisor as an &lt;a href=&quot;https://github.com/google/cadvisor/issues/1774&quot;&gt;excessive CPU usage problem&lt;/a&gt;; it just hadn’t been observed that latency was also being randomly introduced to the network stack. In fact, some folks internally had noticed cadvisor was consuming more CPU than expected, but it didn’t seem to be causing an issue since our servers had plenty of CPU capacity, and so the CPU usage hadn’t yet been investigated.&lt;/p&gt;

&lt;p&gt;The overview of the issue is that the memory cgroup is accounting for memory usage inside a namespace (container). When all processes in that cgroup exit, the memory cgroup is released by Docker. However, “memory” isn’t just process memory, and although the processes’ memory usage itself is gone, it turns out the kernel also assigns cached content like dentries and inodes (directory and file metadata) to the memory cgroup. From that issue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“zombie” cgroups: cgroups that have no processes and have been deleted but still have memory charged to them (in my case, from the dentry cache, but it could also be from page cache or tmpfs).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rather than iterating over every page in the cache at cgroup release time, which could be very slow, the kernel is lazy: it waits for those pages to be reclaimed when memory is needed, and only cleans up the cgroup once all of them have been reclaimed. In the meantime, the cgroup still needs to be counted during stats collection.&lt;/p&gt;

&lt;p&gt;From a performance perspective, the kernel is trading off time on a slow process by amortizing it over the reclamation of each page, opting to make the initial cleanup fast in return for leaving some cached memory around. That’s fine: when the kernel reclaims the last of the cached memory, the cgroup eventually gets cleaned up, so it’s not really a “leak”. Unfortunately, the search that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.stat&lt;/code&gt; performs, the way it’s implemented on the kernel version (4.9) we’re running on some servers, combined with the huge amount of memory on our servers, means it can take a very long time for the last of the cached data to be reclaimed and for the zombie cgroup to be cleaned up.&lt;/p&gt;

&lt;p&gt;It turns out we had nodes that had such a large number of zombie cgroups that some had reads/stalls of over a second.&lt;/p&gt;

&lt;p&gt;The workaround on that cadvisor issue, freeing the dentries/inodes cache systemwide, immediately stopped the read latency and also the network latency stalls on the host, since dropping the cache also freed the cached pages in the “zombie” cgroups. This isn’t a solution, but it does validate the cause of the issue.&lt;/p&gt;

&lt;p&gt;As it turns out, newer kernel releases (4.19+) have improved the performance of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory.stat&lt;/code&gt; call and so this is no longer a problem after moving to that kernel. In the interim, we had existing tooling that was able to detect problems with nodes in our Kubernetes clusters and gracefully drain and reboot them, which we used to detect the cases of high enough latency that would cause issues, and treat them with a graceful reboot. This gave us breathing room while OS and kernel upgrades were rolled out to the remainder of the fleet.&lt;/p&gt;

&lt;h2 id=&quot;wrapping-up&quot;&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;Since this problem manifested as NIC RX queues not being processed for hundreds of milliseconds, it was responsible for both high latency on short connections and latency observed mid-connection such as between MySQL query and response packets. Understanding and maintaining performance of our most foundational systems like Kubernetes is critical to the reliability and speed of all services that build on top of them. As we invest in and improve on this performance, every system we run benefits from those improvements.&lt;/p&gt;
</description>
        <pubDate>Sat, 06 Jul 2019 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2019/07/06/debugging-network-latency-kubernetes.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2019/07/06/debugging-network-latency-kubernetes.html</guid>
        
        
      </item>
    
      <item>
        <title>GLB: GitHub&apos;s open source load balancer</title>
        <description>&lt;p&gt;At GitHub, we serve tens of thousands of requests every second out of our network edge, operating on &lt;a href=&quot;http://githubengineering.com/githubs-metal-cloud/&quot;&gt;GitHub’s metal cloud&lt;/a&gt;. We’ve previously &lt;a href=&quot;https://githubengineering.com/introducing-glb/&quot;&gt;introduced GLB&lt;/a&gt;, our scalable load balancing solution for bare metal datacenters, which powers the majority of GitHub’s public web and git traffic, as well as fronting some of our most critical internal systems such as &lt;a href=&quot;https://githubengineering.com/mysql-high-availability-at-github/&quot;&gt;highly available MySQL clusters&lt;/a&gt;. Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.&lt;/p&gt;

&lt;p&gt;GLB Director is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Transport_layer&quot;&gt;Layer 4&lt;/a&gt; load balancer which scales a single IP address across a large number of physical machines while attempting to minimise connection disruption during any change in servers. GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB&quot; src=&quot;/images/introducing-glb/glb-logo-dark.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;scaling-an-ip-using-ecmp&quot;&gt;Scaling an IP using ECMP&lt;/h2&gt;

&lt;p&gt;The basic property of a Layer 4 load balancer is the ability to take a single IP address and spread inbound connections across multiple servers. To scale a single IP address to handle more traffic than any single machine can process, we need to not only split amongst backend servers, but also need to be able to scale up the servers that handle the load balancing themselves. This is essentially another layer of load balancing.&lt;/p&gt;

&lt;p&gt;Typically we think of an IP address as referencing a single physical machine, and routers as moving a packet to the next closest router to that machine. In the simplest case where there’s always a single best next hop, routers pick that hop and forward all packets there until the destination is reached.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Next Hop Routing&quot; src=&quot;/images/glb-director-open-source-load-balancer/simple-nexthop-routing.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In reality, most networks are far more complicated. There is often more than a single path available between two machines, for example where multiple ISPs are available or even when two routers are joined together with more than one physical cable to increase capacity and provide redundancy. This is where &lt;a href=&quot;https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing&quot;&gt;Equal-Cost Multi-Path (ECMP) routing&lt;/a&gt; comes into play - rather than routers picking a single best next hop, where they have multiple hops with the same cost (usually defined as the number of &lt;a href=&quot;https://en.wikipedia.org/wiki/Autonomous_system_(Internet)&quot;&gt;ASes&lt;/a&gt; to the destination), they instead hash traffic so that connections are balanced across all available paths of equal cost.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ECMP with the same destination server&quot; src=&quot;/images/glb-director-open-source-load-balancer/ecmp-same-destination.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;ECMP is implemented by hashing each packet to determine a relatively consistent selection of one of the available paths. The hash function used here varies by device, but typically it’s a &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hash&lt;/a&gt; based on the source and destination IP address as well as the source and destination port for TCP traffic. This means that multiple packets for the same ongoing TCP connection will typically traverse the same path, meaning that packets will arrive in the same order even when paths have different latencies. Notably in this case, the paths can change without any disruption to connections because they will always end up at the same destination server, and at that point the path it took is mostly irrelevant.&lt;/p&gt;
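
&lt;p&gt;As a rough illustration of the per-flow hashing described above, the sketch below maps a connection’s source and destination address and port to one of several paths. The function name, the path labels and the use of SHA-256 with a simple modulo are assumptions made for the example; real routers use their own device-specific hash.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib

def pick_path(src_ip, src_port, dst_ip, dst_port, paths):
    """Pick one equal-cost path from the connection's 4-tuple."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Hash-modulo is an illustrative stand-in; the real function varies by device.
    return paths[int.from_bytes(digest[:8], "big") % len(paths)]

paths = ["path-a", "path-b", "path-c"]  # hypothetical labels for three equal-cost paths
flow = ("203.0.113.7", 51000, "192.0.2.10", 443)
first = pick_path(*flow, paths)
again = pick_path(*flow, paths)
print(first == again)  # True: every packet of the flow takes the same path
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;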

&lt;p&gt;An alternative use of ECMP can come into play when we want to shard traffic across multiple &lt;em&gt;servers&lt;/em&gt; rather than to the same server over multiple &lt;em&gt;paths&lt;/em&gt;. Each server can announce the same IP address with &lt;a href=&quot;https://en.wikipedia.org/wiki/Border_Gateway_Protocol&quot;&gt;BGP&lt;/a&gt; or another similar network protocol, causing connections to be sharded across those servers, with the routers blissfully unaware that the connections are being handled in different places, not all ending on the same machine as would traditionally be the case.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ECMP with multiple destination servers&quot; src=&quot;/images/glb-director-open-source-load-balancer/ecmp-shard-traffic.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;While this shards traffic as we had hoped, it has one huge drawback: when the set of servers that are announcing the same IP changes (or any path or router along the way changes), connections must rebalance to maintain an equal balance of connections on each server. Routers are typically stateless devices, simply making the best decision for each packet without consideration for the connection it is a part of, which means some connections will break in this scenario.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ECMP redistribution breaking connections&quot; src=&quot;/images/glb-director-open-source-load-balancer/ecmp-redist-break.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In the above example on the left, we can imagine that each colour represents an active connection. A new proxy server is added to announce the same IP. The router diligently adjusts the consistent hash to move 1/3 of connections to the new server while keeping 2/3 of connections where they were. Unfortunately for the 1/3 of connections that were already in progress, the packets are now arriving on a server that doesn’t know about the connection, and so they fail.&lt;/p&gt;

&lt;h2 id=&quot;split-directorproxy-load-balancer-design&quot;&gt;Split director/proxy load balancer design&lt;/h2&gt;

&lt;p&gt;The issue with the previous ECMP-only solution is that it isn’t aware of the full context for a given packet, nor is it able to store data for each packet/connection. As it turns out, there are commonly used patterns to help out with this situation by implementing some stateful tracking in software, typically using a tool like &lt;a href=&quot;https://en.wikipedia.org/wiki/Linux_Virtual_Server&quot;&gt;Linux Virtual Server (LVS)&lt;/a&gt;. We create a new tier of “director” servers that take packets from the router via ECMP, but rather than relying on the router’s ECMP hashing to choose the backend proxy server, we instead control the hashing and store state (which backend was chosen) for all in-progress connections. When we change the set of proxy tier servers, the director tier hopefully hasn’t changed, and our connection will continue.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ECMP redistribution with LVS director missing state&quot; src=&quot;/images/glb-director-open-source-load-balancer/ecmp-redist-lvs-no-state.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Although this works well in many cases, it does have some drawbacks. In the above example, we add both an LVS director and backend proxy server at the same time. The new director receives some set of packets, but doesn’t have any state yet (or has delayed state), so it hashes each one as a new connection and may get it wrong (and cause the connection to fail). A typical workaround with LVS is to use &lt;a href=&quot;http://www.linuxvirtualserver.org/docs/sync.html&quot;&gt;multicast connection syncing&lt;/a&gt; to keep the connection state shared amongst all LVS director servers. This still requires connection state to propagate, and also still requires duplicate state - not only does each proxy need state for each connection in the Linux kernel network stack, but &lt;em&gt;every&lt;/em&gt; LVS director also needs to store a mapping of connection to backend proxy server.&lt;/p&gt;

&lt;h2 id=&quot;removing-all-state-from-the-director-tier&quot;&gt;Removing all state from the director tier&lt;/h2&gt;

&lt;p&gt;When we were designing GLB, we decided we wanted to improve on this situation and not duplicate state at all. GLB takes a different approach to that described above, by using the flow state already stored in the proxy servers as part of maintaining established Linux TCP connections from clients.&lt;/p&gt;

&lt;p&gt;For each incoming connection, we pick a primary and secondary server that could handle that connection. When a packet arrives on the primary server and isn’t valid, it is forwarded to the secondary server. The hashing to choose the primary/secondary server is done once, up front, and is stored in a lookup table, and so doesn’t need to be recalculated on a per-flow or per-packet basis. When a new proxy server is added, for 1/N connections it becomes the new primary, and the old primary becomes the secondary. This allows existing flows to complete, because the proxy server can make the &lt;a href=&quot;#second-chance-on-proxies-with-iptables&quot;&gt;decisions with its local state&lt;/a&gt;, the single source of truth. Essentially this gives packets a “second chance” at arriving at the expected server that holds their state.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;ECMP redistribution with GLB&quot; src=&quot;/images/glb-director-open-source-load-balancer/ecmp-redist-glb.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Even though the director will sometimes still send packets to the wrong server, that server will then know how to forward the packet on to the correct server. The GLB director tier is completely stateless in terms of TCP flows: director servers can come and go at any time, and will always pick the same primary/secondary server provided their forwarding tables match (but they rarely change). To change proxies, some care needs to be taken, which we describe below.&lt;/p&gt;
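
&lt;p&gt;The “second chance” decision can be pictured with a small sketch. This is only an illustration of the idea, not the GLB implementation: the function name, the packet representation and the choice to drop a packet once no hops remain are all assumptions. A packet is accepted if it starts a new connection (SYN) or matches a flow the proxy already holds, and is otherwise handed to the next server in the list.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def second_chance(packet, local_flows, hops):
    """Decide what a proxy does with a packet handed to it by the director.

    packet is a dict with a "flow" key (any hashable flow identifier) and a
    "syn" flag; hops lists the proxies still to try after this one.
    """
    if packet["syn"] or packet["flow"] in local_flows:
        return ("accept", None)
    if hops:
        return ("forward", hops[0])
    return ("drop", None)  # assumed behaviour once no hops are left

flow = ("203.0.113.7", 51000, "192.0.2.10", 443)
mid_stream = {"flow": flow, "syn": False}

# The new primary has never seen this flow, so it hands the packet on.
print(second_chance(mid_stream, set(), ["10.0.0.1"]))
# The old primary (now secondary) still holds the flow, so it accepts.
print(second_chance(mid_stream, {flow}, []))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;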

&lt;h2 id=&quot;maintaining-invariants-rendezvous-hashing&quot;&gt;Maintaining invariants: rendezvous hashing&lt;/h2&gt;

&lt;p&gt;The core of the GLB Director design comes down to picking that primary and secondary server consistently, and to allow the proxy tier servers to drain and fill as needed. We consider each proxy server to have a state, and carefully adjust the state as a way of adding and removing servers.&lt;/p&gt;

&lt;p&gt;We create a static binary forwarding table, which is generated identically on each director server, to map incoming flows to a given primary and secondary server. Rather than having complex logic to pick from all available servers at packet processing time, we instead use some indirection by creating a table (65k rows), with each row containing a primary and secondary server IP address. This is stored in memory as a flat array of binary data, taking about 512KB per table. When a packet arrives, we consistently hash it (based on packet data alone) to the same row in that table (using the hash as an index into the array), which provides a consistent primary and secondary server pair.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Forwarding Table with active servers&quot; src=&quot;/images/glb-director-open-source-load-balancer/forwarding-table-active.png&quot; /&gt;
&lt;/div&gt;
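
&lt;p&gt;The table size follows from simple arithmetic if we assume each row holds two packed IPv4 addresses: 65,536 rows of 8 bytes each is 524,288 bytes, or 512 KiB. The sketch below builds such a flat array and looks up a row using 16 bits of a flow hash. The row layout, the SHA-256 hash and the function names are assumptions for illustration and not necessarily the on-disk format GLB uses.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib
import ipaddress
import struct

ROWS = 65536                            # 2**16 rows, the "65k" in the text
ROW_FORMAT = "!4s4s"                    # assumed layout: primary, secondary as packed IPv4
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 8 bytes per row

def pack_table(rows):
    """Serialise (primary, secondary) pairs into one flat bytes object."""
    return b"".join(
        struct.pack(ROW_FORMAT,
                    ipaddress.IPv4Address(primary).packed,
                    ipaddress.IPv4Address(secondary).packed)
        for primary, secondary in rows
    )

def lookup(table, flow_key):
    """Hash the flow to a row index and return that row's server pair."""
    digest = hashlib.sha256(flow_key).digest()
    row = int.from_bytes(digest[:2], "big")  # 16 bits give an index from 0 to 65535
    primary, secondary = struct.unpack_from(ROW_FORMAT, table, row * ROW_SIZE)
    return str(ipaddress.IPv4Address(primary)), str(ipaddress.IPv4Address(secondary))

table = pack_table([("10.0.0.1", "10.0.0.2")] * ROWS)
print(len(table))  # 524288 bytes, i.e. 512 KiB
print(lookup(table, b"203.0.113.7:51000-192.0.2.10:443"))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;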

&lt;p&gt;We want each server to appear approximately equally in both the primary and secondary fields, and to never appear in both in the same row. When we add a new server, we desire some rows to have their primary become secondary, and the new server become primary. Similarly, we desire the new server to become secondary in some rows. When we remove a server, in any rows where it was primary, we want the secondary to become primary, and another server to pick up secondary.&lt;/p&gt;

&lt;p&gt;This sounds complex, but can be summarised succinctly with a couple of &lt;a href=&quot;https://en.wikipedia.org/wiki/Invariant_(computer_science)&quot;&gt;invariants&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;As we change the set of servers, the relative order of existing servers should be maintained.&lt;/li&gt;
  &lt;li&gt;The order of servers should be computable without any state other than the list of servers (and maybe some predefined seeds).&lt;/li&gt;
  &lt;li&gt;Each server should appear at most once in each row.&lt;/li&gt;
  &lt;li&gt;Each server should appear approximately an equal number of times in each column.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reading the problem that way, &lt;a href=&quot;https://en.wikipedia.org/wiki/Rendezvous_hashing&quot;&gt;Rendezvous hashing&lt;/a&gt; is an ideal choice, since it can trivially satisfy these invariants. Each server (in our case, the IP) is hashed along with the row number, the servers are sorted by that hash (which is just a number), and we get a unique order for servers for that given row. We take the first two as the primary and secondary respectively.&lt;/p&gt;

&lt;p&gt;Relative order will be maintained because the hash for each server will be the same regardless of which other servers are included. The only information required to generate the table is the IPs of the servers. Since we’re just sorting a set of servers, the servers only appear once. Finally, if we use a good hash function that is pseudo-random, the ordering will be pseudo-random, and so the distribution will be even as we expect.&lt;/p&gt;
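
&lt;p&gt;A minimal sketch of this table generation is shown below. It assumes SHA-256 as the hash and a colon-joined row number and IP as the hash input; both are illustrative choices rather than what GLB Director itself uses. The final check shows the first invariant in action: after adding a server, each row either keeps its primary or gains the new server as primary with the old primary pushed to secondary.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib

def score(server_ip, row):
    """Pseudo-random number for a (row, server) pair, independent of other servers."""
    data = f"{row}:{server_ip}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def build_table(servers, rows=65536):
    """For every row, sort servers by score and keep the first two."""
    table = []
    for row in range(rows):
        ordered = sorted(servers, key=lambda ip: score(ip, row))
        table.append((ordered[0], ordered[1]))  # (primary, secondary)
    return table

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
new = "10.0.0.4"
before = build_table(servers, rows=1024)
after = build_table(servers + [new], rows=1024)

# Either the primary is unchanged, or the new server took over as primary
# and the previous primary moved down to secondary.
order_kept = all(
    a[0] == b[0] or (a[0] == new and a[1] == b[0])
    for b, a in zip(before, after)
)
print(order_kept)
print(sum(1 for row in after if row[0] == new), "of", len(after), "rows now have the new primary")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;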

&lt;h2 id=&quot;draining-filling-adding-and-removing-proxies&quot;&gt;Draining, filling, adding and removing proxies&lt;/h2&gt;

&lt;p&gt;Adding or removing proxy servers requires some care in our design. This is because a forwarding table entry only defines a primary/secondary proxy, so the draining/failover only works with at most a single proxy host in draining. We define the following valid states and state transitions for a proxy server:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Proxy server state machine&quot; src=&quot;/images/glb-director-open-source-load-balancer/glb-proxy-state-machine.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;When a proxy server is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;draining&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filling&lt;/code&gt;, it is included in the forwarding table entries. In a stable state, all proxy servers are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active&lt;/code&gt;, and the rendezvous hashing described above will have an approximately even and random distribution of each proxy server in both the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;secondary&lt;/code&gt; columns.&lt;/p&gt;

&lt;p&gt;As a proxy server transitions to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;draining&lt;/code&gt;, we adjust the entries in the forwarding table by swapping the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;secondary&lt;/code&gt; entries we would have otherwise included:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Forwarding Table with a draining server&quot; src=&quot;/images/glb-director-open-source-load-balancer/forwarding-table-draining.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This has the effect of sending packets to the server that was previously &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;secondary&lt;/code&gt; first. Since it receives the packets first, it will accept SYN packets and therefore take any new connections. For any packet it doesn’t understand as relating to a local flow, it forwards it to the other server (the previous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary&lt;/code&gt;), which allows existing connections to complete.&lt;/p&gt;

&lt;p&gt;This has the effect of draining the desired server of connections gracefully, after which point it can be removed completely, and proxies can shuffle in to fill the empty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;secondary&lt;/code&gt; slots:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Forwarding Table with removed server&quot; src=&quot;/images/glb-director-open-source-load-balancer/forwarding-table-removed.png&quot; /&gt;
&lt;/div&gt;
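
&lt;p&gt;The draining swap described above can be sketched in a few lines. This reflects one reading of the text, namely that only rows whose primary is the draining server are swapped; the function name and the tiny example table are hypothetical.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def apply_draining(table, draining):
    """Swap primary and secondary on rows where the draining server is primary."""
    return [
        (secondary, primary) if primary == draining else (primary, secondary)
        for primary, secondary in table
    ]

table = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.3", "10.0.0.1"),
    ("10.0.0.2", "10.0.0.1"),
]
drained = apply_draining(table, "10.0.0.2")
for row in drained:
    print(row)  # 10.0.0.2 now only ever appears in the secondary column
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;New connections (SYN packets) land on the other server in each swapped row, while packets for existing flows still reach the draining server through the second chance.&lt;/p&gt;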

&lt;p&gt;A node in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filling&lt;/code&gt; looks just like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active&lt;/code&gt;, since the table inherently allows a second chance:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Forwarding Table with filling server&quot; src=&quot;/images/glb-director-open-source-load-balancer/forwarding-table-filling.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This implementation requires that no more than one proxy server at a time is in any state other than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active&lt;/code&gt;, which in practice has worked well at GitHub. The state changes to proxy servers can happen as quickly as the longest connection duration that needs to be maintained. We’re working on extensions to the design that support more than just a primary and secondary, and some components (like the header listed below) already include initial support for arbitrary server lists.&lt;/p&gt;

&lt;h2 id=&quot;encapsulation-within-the-datacenter&quot;&gt;Encapsulation within the datacenter&lt;/h2&gt;

&lt;p&gt;We now have an algorithm to consistently pick backend proxy servers and operate on them, but how do we actually move packets around the datacenter? How do we encode the secondary server inside the packet so the primary can forward a packet it doesn’t understand?&lt;/p&gt;

&lt;p&gt;Traditionally in the LVS setup, an &lt;a href=&quot;https://en.wikipedia.org/wiki/IP_in_IP&quot;&gt;IP over IP (IPIP)&lt;/a&gt; tunnel is used. The client IP packet is encapsulated inside an internal datacenter IP packet and forwarded on to the proxy server, which decapsulates it. We found that it was difficult to encode the additional server metadata inside IPIP packets, as the only standard space available was the &lt;a href=&quot;https://en.wikipedia.org/wiki/Internet_Protocol_Options&quot;&gt;IP Options&lt;/a&gt;, and our datacenter routers passed packets with unknown IP options to software for processing (which they called “Layer 2 slow path”), taking speeds from millions to thousands of packets per second.&lt;/p&gt;

&lt;p&gt;To avoid this, we needed to hide the data inside a different packet format that the router wouldn’t try to understand. We initially adopted raw &lt;a href=&quot;https://lwn.net/Articles/614348/&quot;&gt;Foo-over-UDP (FOU)&lt;/a&gt; with a custom &lt;a href=&quot;https://en.wikipedia.org/wiki/Generic_Routing_Encapsulation&quot;&gt;Generic Routing Encapsulation (GRE)&lt;/a&gt; payload, essentially encapsulating everything inside a UDP packet. We recently transitioned to &lt;a href=&quot;https://tools.ietf.org/html/draft-ietf-nvo3-gue-05&quot;&gt;Generic UDP Encapsulation (GUE)&lt;/a&gt;, which is a layer on top of FOU that provides a standard for encapsulating IP protocols inside a UDP packet. We place our secondary server’s IP inside the private data of the GUE header. From a router’s perspective, these packets are all internal datacenter UDP packets between two normal servers.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+\
|          Source port          |        Destination port       | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ UDP
|             Length            |            Checksum           | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
| 0 |C|   Hlen  |  Proto/ctype  |             Flags             | GUE
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Private data type (0)     |  Next hop idx |   Hop count   |\
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
|                             Hop 0                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ GLB
|                              ...                              | private
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ data
|                             Hop N                             | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
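
&lt;p&gt;To make the layout concrete, the following sketch packs the GUE header and the GLB private data shown in the diagram for a list of IPv4 hops (the UDP header is left out). It rests on a few assumptions that should be checked against the GUE draft and the GLB source: that Hlen counts the 32-bit words following the first four header bytes, that the flags are zero, and that protocol number 4 (IPv4 encapsulation) describes the inner packet. The function name is hypothetical.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import ipaddress
import struct

def pack_glb_gue(hops, proto=4, next_hop_idx=0):
    """Pack the GUE header and GLB private data for a list of IPv4 hops."""
    hop_bytes = b"".join(ipaddress.IPv4Address(hop).packed for hop in hops)
    # Private data: type (16 bits, 0), next hop index (8 bits), hop count (8 bits), hops.
    private = struct.pack("!HBB", 0, next_hop_idx, len(hops)) + hop_bytes
    # Assumed reading of Hlen: 32-bit words that follow the first four header bytes.
    hlen = len(private) // 4
    if hlen not in range(32):
        raise ValueError("too many hops for the 5-bit Hlen field")
    # Version 0 and C=0 leave the top three bits clear, so the first byte is just Hlen.
    return struct.pack("!BBH", hlen, proto, 0) + private

header = pack_glb_gue(["10.0.0.2"])
print(header.hex())  # 02040000000000010a000002
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;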

&lt;p&gt;Another benefit to using UDP is that the source port can be filled in with a per-connection hash so that connections flow within the datacenter over different paths (where ECMP is used within the datacenter), and are received on different RX queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header fields). This is not possible with IPIP because most commodity datacenter NICs are only able to understand plain IP, TCP/IP and UDP/IP (and a few others). Notably, the NICs we use cannot look inside IP/IP packets.&lt;/p&gt;

&lt;p&gt;When the proxy server wants to send a packet back to the client, the packet doesn’t need to be encapsulated or travel back through our director tier; it can be sent directly to the client (often called “Direct Server Return”). This is typical of this sort of load balancer design and is especially useful for content providers where the majority of traffic flows &lt;em&gt;outbound&lt;/em&gt; with a relatively small amount of traffic &lt;em&gt;inbound&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This leaves us with a packet flow that looks like the following:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB second chance packet flow&quot; src=&quot;/images/glb-director-open-source-load-balancer/second-chance.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;dpdk-for-10g-line-rate-packet-processing&quot;&gt;DPDK for 10G+ line rate packet processing&lt;/h2&gt;

&lt;p&gt;Since we first &lt;a href=&quot;https://githubengineering.com/introducing-glb/&quot;&gt;publicly discussed our initial design&lt;/a&gt;, we’ve completely rewritten &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glb-director&lt;/code&gt; to use &lt;a href=&quot;https://www.dpdk.org/&quot;&gt;DPDK&lt;/a&gt;, an open source project that allows &lt;em&gt;very&lt;/em&gt; fast packet processing from userland by bypassing the Linux kernel. This has allowed us to achieve NIC line rate processing on commodity NICs with commodity CPUs, and allows us to trivially scale our director tier to handle as much inbound traffic as our public connectivity requires. This is particularly important during DDoS attacks, where we do not want our load balancer to be a bottleneck.&lt;/p&gt;

&lt;p&gt;One of our initial goals with GLB was that our load balancer could run on commodity datacenter hardware without any server-specific physical configuration. Both GLB director and proxy servers are provisioned like normal servers in our datacenter. Each server has a &lt;a href=&quot;https://en.wikipedia.org/wiki/Link_aggregation&quot;&gt;bonded pair of network interfaces&lt;/a&gt;, and those interfaces are shared between DPDK and Linux on GLB director servers.&lt;/p&gt;

&lt;p&gt;Modern NICs support &lt;a href=&quot;https://en.wikipedia.org/wiki/Single-root_input/output_virtualization&quot;&gt;SR-IOV&lt;/a&gt;, a technology that enables a single NIC to act like multiple NICs from the perspective of the operating system. This is typically used by virtual machine hypervisors to ask the real NIC (“Physical Function”) to create multiple pretend NICs for each VM (“Virtual Functions”). To enable DPDK and the Linux kernel to share NICs, we use &lt;a href=&quot;https://doc.dpdk.org/guides/howto/flow_bifurcation.html&quot;&gt;flow bifurcation&lt;/a&gt;, which sends specific traffic (destined to GLB-run IP addresses) to our DPDK process on a Virtual Function while leaving the rest of the packets with the Linux kernel’s networking stack on the Physical Function.&lt;/p&gt;

&lt;p&gt;We’ve found that the packet processing rates of DPDK on a Virtual Function are acceptable for our requirements. GLB Director uses a &lt;a href=&quot;https://doc.dpdk.org/guides/prog_guide/packet_distrib_lib.html&quot;&gt;DPDK Packet Distributor&lt;/a&gt; pattern to spread the work of encapsulating packets across any number of CPU cores on the machine, and since it is stateless this can be highly parallelised.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB Flow Paths&quot; src=&quot;/images/glb-director-open-source-load-balancer/flow-paths.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;GLB Director supports matching and forwarding inbound IPv4 and IPv6 packets containing TCP payloads, as well as inbound ICMP Fragmentation Required messages used as part of &lt;a href=&quot;https://en.wikipedia.org/wiki/Path_MTU_Discovery&quot;&gt;Path MTU Discovery&lt;/a&gt;, by peeking into the inner layers of the packet during matching.&lt;/p&gt;

&lt;h2 id=&quot;bringing-test-suites-to-dpdk-with-scapy&quot;&gt;Bringing test suites to DPDK with Scapy&lt;/h2&gt;

&lt;p&gt;One problem that typically arises in creating (or using) technologies that operate at high speeds due to using low-level primitives (like communicating with the NIC directly) is that they become significantly more difficult to test. As part of creating the GLB Director, we also created a test environment that supports simple end-to-end packet flow testing of our DPDK application, by leveraging the way DPDK provides an &lt;a href=&quot;https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html&quot;&gt;Environment Abstraction Layer (EAL)&lt;/a&gt; that allows a physical NIC and a libpcap-based local interface to appear the same from the view of the application.&lt;/p&gt;

&lt;p&gt;This allowed us to write tests in &lt;a href=&quot;https://scapy.net/&quot;&gt;Scapy&lt;/a&gt;, a wonderfully simple Python library for reading, manipulating and writing packet data. By creating a Linux &lt;a href=&quot;http://man7.org/linux/man-pages/man4/veth.4.html&quot;&gt;Virtual Ethernet Device&lt;/a&gt;, with Scapy on one side and DPDK on the other, we were able to pass in custom crafted packets and validate what our software would provide on the other side, being a fully GUE-encapsulated packet directed to the expected backend proxy server.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB&apos;s Scapy test setup&quot; src=&quot;/images/glb-director-open-source-load-balancer/scapy-setup.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This allows us to test more complex behaviours such as traversing layers of ICMPv4/ICMPv6 headers to retrieve the original IPs and TCP ports for correct forwarding of ICMP messages from external routers.&lt;/p&gt;
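
&lt;p&gt;To illustrate what traversing those layers involves, here is a minimal sketch in plain Python (standard library only, not Scapy and not the GLB Director code). It builds an ICMPv4 Fragmentation Required message by hand and then recovers the original IPs and TCP ports from the embedded packet. The helper names and the documentation-range addresses are assumptions made for this example.&lt;/p&gt;

```python
import socket
import struct


def build_ipv4_header(src, dst, proto, payload_len):
    # Minimal 20-byte IPv4 header; the checksum is left as zero in this sketch.
    version_ihl = (4 * 16) + 5  # version 4, header length of 5 32-bit words
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, 20 + payload_len, 0, 0, 64, proto, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def inner_flow_from_icmp(packet):
    # The low nibble of the first byte is the IPv4 header length in 32-bit words.
    outer_ihl = (packet[0] % 16) * 4
    icmp = packet[outer_ihl:]
    # Type 3, code 4 is "Destination Unreachable: Fragmentation Required".
    if (icmp[0], icmp[1]) != (3, 4):
        return None
    # After the 8-byte ICMP header comes the original IP header plus
    # at least the first 8 bytes of its payload (here, the TCP ports).
    inner = icmp[8:]
    inner_ihl = (inner[0] % 16) * 4
    src_ip = socket.inet_ntoa(inner[12:16])
    dst_ip = socket.inet_ntoa(inner[16:20])
    src_port, dst_port = struct.unpack("!HH", inner[inner_ihl:inner_ihl + 4])
    return src_ip, src_port, dst_ip, dst_port


# Hypothetical flow: a service IP replying to a client, bounced by a router.
original = build_ipv4_header("192.0.2.10", "198.51.100.7", 6, 8)
original += struct.pack("!HHI", 443, 51000, 0)  # TCP ports and sequence number
icmp_message = struct.pack("!BBHHH", 3, 4, 0, 0, 1400) + original
sample = build_ipv4_header("203.0.113.1", "192.0.2.10", 1, len(icmp_message))
sample += icmp_message

flow = inner_flow_from_icmp(sample)
```

&lt;p&gt;The outer packet comes from the router, so only the embedded header identifies the connection that the message belongs to.&lt;/p&gt;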

&lt;h2 id=&quot;healthchecking-of-proxies-for-auto-failover&quot;&gt;Healthchecking of proxies for auto-failover&lt;/h2&gt;

&lt;p&gt;Part of the design of GLB is to handle server failure gracefully. The current design of having a designated primary/secondary for a given forwarding table entry / client means that we can work around single-server failure by running health checks from the perspective of each director. We run a service called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glb-healthcheck&lt;/code&gt; which continually validates each backend server’s GUE tunnel and arbitrary HTTP port.&lt;/p&gt;

&lt;p&gt;When a server fails, we swap the primary/secondary entries anywhere that server is primary. This performs a “soft drain” of the server, which provides the best chance for connections to gracefully fail over. If the healthcheck failure is a false positive, connections won’t be disrupted, they will just traverse a slightly different path.&lt;/p&gt;
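
&lt;p&gt;The swap can be sketched in a few lines of Python. This is only an illustration under assumed names: the forwarding table is modelled as a list of (primary, secondary) pairs, which is a simplification and not the format used by the real director.&lt;/p&gt;

```python
def soft_drain(table, failed):
    # Swap the roles in every row where the failed server is primary;
    # rows where it is secondary (or absent) are left untouched.
    return [
        (secondary, primary) if primary == failed else (primary, secondary)
        for (primary, secondary) in table
    ]


# Hypothetical three-row table and a failing server.
table = [("proxy1", "proxy2"), ("proxy2", "proxy3"), ("proxy3", "proxy1")]
drained = soft_drain(table, "proxy1")
```

&lt;p&gt;Because the failed server stays in the row as the secondary, a false positive only changes the order in which the two servers see each packet.&lt;/p&gt;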

&lt;h2 id=&quot;second-chance-on-proxies-with-iptables&quot;&gt;Second chance on proxies with iptables&lt;/h2&gt;

&lt;p&gt;The final component that makes up GLB is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Netfilter&quot;&gt;Netfilter&lt;/a&gt; module and &lt;a href=&quot;https://en.wikipedia.org/wiki/Iptables&quot;&gt;iptables&lt;/a&gt; target that runs on every proxy server and allows the “second chance” design to function.&lt;/p&gt;

&lt;p&gt;This module performs a simple task: it decides whether the inner TCP/IP packet inside every GUE packet is valid locally according to the Linux kernel TCP stack, and if it isn’t, forwards the packet to the next proxy server (the secondary) rather than decapsulating it locally.&lt;/p&gt;

&lt;p&gt;In the case where a packet is a SYN (new connection) or is valid locally for an established connection, it simply accepts it locally. We then use the Linux kernel 4.x GUE support provided as part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fou&lt;/code&gt; module to receive the GUE packet and process it locally.&lt;/p&gt;
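
&lt;p&gt;The decision described above can be modelled roughly as follows. This is a sketch, not the Netfilter module: a Python set of known flows stands in for the kernel TCP stack lookup, and the packet dictionary and function name are assumptions made for the example.&lt;/p&gt;

```python
def second_chance(packet, local_flows):
    # A SYN always starts a new connection, so it is accepted locally.
    if packet["syn"]:
        return "accept"
    flow = (packet["src"], packet["sport"], packet["dst"], packet["dport"])
    # A packet for a connection this host already knows is accepted locally.
    if flow in local_flows:
        return "accept"
    # Anything else gets its second chance on the next proxy server.
    return "forward"


# Hypothetical state: one established connection on this proxy.
local_flows = {("198.51.100.7", 51000, "192.0.2.10", 443)}
known = {"src": "198.51.100.7", "sport": 51000, "dst": "192.0.2.10", "dport": 443, "syn": False}
unknown = {"src": "198.51.100.8", "sport": 40000, "dst": "192.0.2.10", "dport": 443, "syn": False}
new_conn = {"src": "198.51.100.8", "sport": 40001, "dst": "192.0.2.10", "dport": 443, "syn": True}

decisions = [second_chance(p, local_flows) for p in (known, unknown, new_conn)]
```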

&lt;h2 id=&quot;available-today-as-open-source&quot;&gt;Available today as open source&lt;/h2&gt;

&lt;p&gt;When we started down the path of writing a better datacenter load balancer, we decided that we wanted to release it open source so that others could benefit from and share in our work. We’re excited to be releasing all the components discussed here as open source at &lt;a href=&quot;https://github.com/github/glb-director&quot;&gt;github/glb-director&lt;/a&gt;. We hope this will allow others to reuse our work and contribute to a common standard software load balancing solution that runs on commodity hardware in physical datacenter environments.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB component overview&quot; src=&quot;/images/glb-director-open-source-load-balancer/glb-component-overview.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;also-were-hiring&quot;&gt;Also, we’re hiring!&lt;/h2&gt;

&lt;p&gt;GLB and the GLB Director have been an ongoing project designed, authored, reviewed and supported by various members of GitHub’s Production Engineering organisation, including &lt;a href=&quot;https://github.com/joewilliams&quot;&gt;@joewilliams&lt;/a&gt;, &lt;a href=&quot;https://github.com/nautalice&quot;&gt;@nautalice&lt;/a&gt;, &lt;a href=&quot;https://github.com/ross&quot;&gt;@ross&lt;/a&gt;, &lt;a href=&quot;https://github.com/theojulienne&quot;&gt;@theojulienne&lt;/a&gt; and many others. If you’re interested in joining us in building great infrastructure projects like GLB, our Data Center team is hiring production engineers specialising in &lt;a href=&quot;https://boards.greenhouse.io/github/jobs/1138733&quot;&gt;Traffic Systems&lt;/a&gt;, &lt;a href=&quot;https://boards.greenhouse.io/github/jobs/1141785&quot;&gt;Network&lt;/a&gt; and &lt;a href=&quot;https://boards.greenhouse.io/github/jobs/1134999&quot;&gt;Facilities&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Wed, 08 Aug 2018 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2018/08/08/glb-director-open-source-load-balancer.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2018/08/08/glb-director-open-source-load-balancer.html</guid>
        
        
      </item>
    
      <item>
        <title>GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder</title>
        <description>&lt;p&gt;Recently we &lt;a href=&quot;http://githubengineering.com/introducing-glb/&quot;&gt;introduced GLB&lt;/a&gt;, the GitHub Load Balancer that powers GitHub.com. The GLB proxy tier, which handles TCP connection and TLS termination, is powered by &lt;a href=&quot;http://www.haproxy.org/&quot;&gt;HAProxy&lt;/a&gt;, a reliable and high performance TCP and HTTP proxy daemon. As part of the design of GLB, we set out to solve a few of the common issues found when using HAProxy at scale.&lt;/p&gt;

&lt;p&gt;Prior to GLB, each host ran a single monolithic instance of HAProxy for all our public services, with frontends for each external IP set, and backends for each backing service. With the number of services we run, this became unwieldy: our configuration was over one thousand lines long, with many interdependent ACLs and no modularization. When migrating to GLB, we decided to split the configuration per service and support running multiple isolated load balancer instances on a single machine. Additionally, we wanted to be able to update a single HAProxy configuration easily without any downtime, without additional latency on connections and without disrupting any other HAProxy instance on the host. Today we are releasing our solution to this problem, &lt;a href=&quot;https://github.com/github/multibinder&quot;&gt;multibinder&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;haproxy-almost-safe-reloads&quot;&gt;HAProxy almost-safe reloads&lt;/h2&gt;

&lt;p&gt;HAProxy uses the SO_REUSEPORT socket option, which allows multiple processes to create LISTEN sockets on the same IP/port combination. The Linux kernel then balances new connections between all available LISTEN sockets. In this diagram, we see the initial stage of an HAProxy reload starting with a single process (left) and then causing a second process to start (right) which binds to the same IP and port, but with a different socket:&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Forking a second HAProxy by default&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/1-fork.png&quot; /&gt;
&lt;/div&gt;
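
&lt;p&gt;The same situation can be reproduced with two sockets in a short Python sketch. It assumes a Linux host, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;socket.SO_REUSEPORT&lt;/code&gt; is available, and the helper name is invented for this example. Both sockets end up listening on the same address, but each has its own queue of pending connections.&lt;/p&gt;

```python
import socket


def listen_reuseport(port):
    # SO_REUSEPORT must be set before bind() on every socket sharing the port.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(16)
    return sock


# The "original process" socket picks a free port; the "new process"
# socket then binds to exactly the same IP and port.
old = listen_reuseport(0)
port = old.getsockname()[1]
new = listen_reuseport(port)

old_addr, new_addr = old.getsockname(), new.getsockname()
old_fd, new_fd = old.fileno(), new.fileno()

old.close()
new.close()
```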

&lt;p&gt;This works great so far, until the original process terminates. HAProxy sends a signal to the original process stating that the new process is now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;ing and handling connections (left), which causes it to stop accepting new connections and close its own socket before eventually exiting once all connections complete (right):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Lost connections on termination&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/2-lost-conns.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Unfortunately there’s a small period between when this process last calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; and when it calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;close()&lt;/code&gt; where the kernel will still route some new connections to the original socket. The code then blindly continues to close the socket, and all connections that were queued up in that LISTEN socket get discarded (because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; is never called for them):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Dropped connections between accept() and close()&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/3-accept-close.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;For small scale sites, the chance of a new connection arriving in the few microseconds between these calls is very low. Unfortunately at the scale we run HAProxy, a customer-impacting number of connections would hit this issue each and every time we reload HAProxy. Previously we used the official solution offered by HAProxy, dropping SYN packets during this small window, causing the client to retry the SYN packet shortly afterwards. Other &lt;a href=&quot;https://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html&quot;&gt;potential solutions&lt;/a&gt; to the same problem include using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tc qdisc&lt;/code&gt; to stall the SYN packets as they come in, and then un-stall the queue once the reload is complete. During development of GLB, we weren’t satisfied with either solution and sought out one that had no queue delays and shared the same LISTEN socket.&lt;/p&gt;

&lt;h2 id=&quot;supporting-zero-downtime-zero-delay-reloads&quot;&gt;Supporting zero-downtime, zero-delay reloads&lt;/h2&gt;

&lt;p&gt;The way other services typically support zero-downtime reloads is to share a LISTEN socket, usually by having a parent process that holds the socket open and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork()&lt;/code&gt;s the service when it needs to reload, leaving the socket open for the new process to consume. This creates a slightly different situation, where the kernel has a single LISTEN socket and clients are queued for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; by either process. The file descriptors in each process may be different, but they will point to the same in-kernel socket structure.&lt;/p&gt;

&lt;p&gt;In this scenario, a new process would be started that inherits the same LISTEN socket (left), and when the original pid stops calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;, connections remain queued for the new process to process because the kernel LISTEN socket and queue are shared (right):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Ideal socket sharing method&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/4-share-socket.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Unfortunately, HAProxy doesn’t support this method directly. We considered patching HAProxy to add built-in support but found that the architecture of HAProxy favours process isolation and non-dynamic configuration, making it a non-trivial architectural change. Instead, we created &lt;a href=&quot;https://github.com/github/multibinder&quot;&gt;multibinder&lt;/a&gt; to solve this problem generically for any daemon that needs zero-downtime reload capabilities, and integrated it with HAProxy by using a few tricks with existing HAProxy configuration directives to get the same result.&lt;/p&gt;

&lt;p&gt;Multibinder is similar to other file-descriptor sharing services such as &lt;a href=&quot;https://github.com/stripe/einhorn&quot;&gt;einhorn&lt;/a&gt;, except that it runs as an isolated service and process tree on the system, managed by your usual process manager. The actual service, in this case HAProxy, runs separately as another service, rather than as a child process. When HAProxy is started, a small wrapper script calls out to multibinder and requests the existing LISTEN socket to be sent using &lt;a href=&quot;http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch14lev1sec6.html&quot;&gt;Ancillary Data&lt;/a&gt; over a UNIX Domain Socket. The flow looks something like the following:&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Multibinder reload flow&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/5-ancillary.png&quot; /&gt;
&lt;/div&gt;
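
&lt;p&gt;The file descriptor hand-off itself can be sketched with the Python standard library (Python 3.9 or later on a UNIX-like system). This is not multibinder’s code or protocol: a socket pair in a single process stands in for the UNIX domain socket between multibinder and the wrapper, and the variable names and message payload are assumptions.&lt;/p&gt;

```python
import socket

# The long-lived LISTEN socket held by the "binder" side.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)

# Stand-in for the UNIX domain socket between binder and wrapper.
binder_end, wrapper_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Send the descriptor as ancillary data (SCM_RIGHTS under the hood).
socket.send_fds(binder_end, [b"listen"], [listener.fileno()])
message, fds, flags, address = socket.recv_fds(wrapper_end, 1024, 1)

# The received descriptor is a new number for the same in-kernel socket.
inherited = socket.socket(fileno=fds[0])
same_address = inherited.getsockname() == listener.getsockname()

# A client connecting to the address can be accepted via the received fd.
client = socket.create_connection(listener.getsockname())
conn, peer = inherited.accept()
accepted_from_client = peer == client.getsockname()

for sock in (conn, client, inherited, listener, binder_end, wrapper_end):
    sock.close()
```

&lt;p&gt;Since both descriptors refer to one kernel socket, connections queue in a single place regardless of which side calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt;.&lt;/p&gt;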

&lt;p&gt;Once the socket is provided to the HAProxy wrapper, it leaves the LISTEN socket in the file descriptor table and writes out the HAProxy configuration file from an ERB template, injecting the file descriptors using &lt;a href=&quot;http://cbonte.github.io/haproxy-dconv/1.6/configuration.html#bind&quot;&gt;file descriptor binds&lt;/a&gt; like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd@N&lt;/code&gt; (where N is the file descriptor received from multibinder), then calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exec()&lt;/code&gt; to launch HAProxy which uses the provided file descriptor rather than creating a new socket, thus inheriting the same LISTEN socket. From here, we get the ideal setup where the original HAProxy process can stop calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accept()&lt;/code&gt; and connections simply queue up for the new process to handle.&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Multibinder LISTEN socket sharing diagram&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/6-success.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;example--multiple-instances&quot;&gt;Example &amp;amp; multiple instances&lt;/h2&gt;

&lt;p&gt;Along with the release of multibinder, we’re also providing examples of &lt;a href=&quot;https://github.com/github/multibinder/tree/master/haproxy&quot;&gt;running multiple HAProxy instances with multibinder&lt;/a&gt; leveraging systemd service templates. Following these instructions you can launch a set of HAProxy servers using separate configuration files, each using the same system-wide multibinder instance to request their binds and having true zero-downtime, zero-delay reloads.&lt;/p&gt;
</description>
        <pubDate>Thu, 01 Dec 2016 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2016/12/01/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2016/12/01/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder.html</guid>
        
        
      </item>
    
      <item>
        <title>Introducing the GitHub Load Balancer</title>
        <description>&lt;p&gt;At GitHub we serve billions of HTTP, Git and SSH connections each day. To get the best performance we run on &lt;a href=&quot;http://githubengineering.com/githubs-metal-cloud/&quot;&gt;bare metal hardware&lt;/a&gt;. Historically one of the more complex components has been our load balancing tier. Traditionally we scaled this vertically, running a small set of very large machines running &lt;a href=&quot;http://www.haproxy.org/&quot;&gt;haproxy&lt;/a&gt;, and using a very specific hardware configuration allowing dedicated 10G link failover. Eventually we needed a solution that was scalable and we set out to create a load balancer solution that would run on commodity hardware in our typical data center configuration.&lt;/p&gt;

&lt;p&gt;Over the last year we’ve developed our new load balancer, called GLB (GitHub Load Balancer). Today, and over the next few weeks, we will be sharing the design and releasing its components as open source software.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB&quot; src=&quot;/images/introducing-glb/glb-logo-dark.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;out-with-the-old-in-with-the-new&quot;&gt;Out with the old, in with the new&lt;/h2&gt;

&lt;p&gt;GitHub is growing and our monolithic, vertically scaled load balancer tier had met its match and a new approach was required. Our original design was based around a small number of large machines each with dedicated links to our network spine. This design tied networking gear, the load balancing hosts and load balancer configuration together in such a way that scaling horizontally was deemed too difficult. We set out to find a better way.&lt;/p&gt;

&lt;p&gt;We first identified the goals of the new system, design pitfalls of the existing system and prior art that we could draw &lt;a href=&quot;http://www.linuxvirtualserver.org/&quot;&gt;experience&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=dKsOvc73gQk&quot;&gt;inspiration&lt;/a&gt; from. After some time we determined that the following would produce a successful load balancing tier that we could maintain into the future:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Runs on commodity hardware&lt;/li&gt;
  &lt;li&gt;Scales horizontally&lt;/li&gt;
  &lt;li&gt;Supports high availability, avoids breaking TCP connections during normal operation and failover&lt;/li&gt;
  &lt;li&gt;Supports connection draining&lt;/li&gt;
  &lt;li&gt;Per service load balancing, with support for multiple services per load balancer host&lt;/li&gt;
  &lt;li&gt;Can be iterated on and deployed like normal software&lt;/li&gt;
  &lt;li&gt;Testable at each layer, not just integration tests&lt;/li&gt;
  &lt;li&gt;Built for multiple POPs and data centers&lt;/li&gt;
  &lt;li&gt;Resilient to typical DDoS attacks, and tools to help mitigate new attacks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;design&quot;&gt;Design&lt;/h2&gt;

&lt;p&gt;To achieve these goals we needed to rethink the relationship between IP addresses and hosts, the constituent layers of our load balancing tier and how connections are routed, controlled and terminated.&lt;/p&gt;

&lt;h3 id=&quot;stretching-an-ip&quot;&gt;Stretching an IP&lt;/h3&gt;

&lt;p&gt;In a typical setup, you assign a single public facing IP address to a single physical machine. DNS can then be used to split traffic over multiple IPs, letting you shard traffic across multiple servers. Unfortunately, DNS entries are cached fairly aggressively (often ignoring the TTL), and some of our users may specifically whitelist or hardcode IP addresses. Additionally, we offer a certain set of IPs for our Pages service which customers can use directly for their apex domain. Rather than relying on adding additional IPs to increase capacity, and having an IP address fail when the single server failed, we wanted a solution that would allow a single IP address to be served by multiple physical machines.&lt;/p&gt;

&lt;p&gt;Routers have a feature called Equal-Cost Multi-Path (ECMP) routing, which is designed to split traffic destined for a single IP across multiple links of equal cost. ECMP works by hashing certain components of an incoming packet such as the source and destination IP addresses and ports. By using a consistent hash for this, subsequent packets that are part of the same TCP flow will hash to the same path, avoiding out of order packets and maintaining session affinity.&lt;/p&gt;

&lt;p&gt;This works great for routing packets across multiple paths to the same physical destination server. Where it gets interesting is when you use ECMP to split traffic destined for a single IP across multiple physical servers, each of which terminates TCP connections but shares no state, like in a load balancer. When one of these servers fails or is taken out of rotation and is removed from the ECMP server set, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;rehash event occurs&lt;/a&gt;. 1/N connections will get reassigned to the remaining servers. Since these servers don’t share connection state, these connections get terminated. Unfortunately, these connections may not be the same 1/N connections that were mapped to the failing server. Additionally, there is no way to gracefully remove a server for maintenance without also disrupting 1/N active connections.&lt;/p&gt;
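
&lt;p&gt;The collateral damage of a rehash can be seen in a small Python sketch. It makes a strong simplifying assumption: the router is modelled as a hash of the flow tuple taken modulo the number of servers. Real routers use their own hash functions and typically move fewer flows than this naive scheme, so the numbers are illustrative only. The server names and addresses are hypothetical.&lt;/p&gt;

```python
import hashlib


def ecmp_pick(flow, servers):
    # Hash the flow tuple so every packet of a flow picks the same server.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["lb1", "lb2", "lb3", "lb4"]
flows = [
    ("198.51.100.%d" % (i % 250), 10000 + i, "192.0.2.10", 443)
    for i in range(2000)
]

before = {flow: ecmp_pick(flow, servers) for flow in flows}

# Take lb4 out of rotation and hash every flow again.
remaining = [s for s in servers if s != "lb4"]
after = {flow: ecmp_pick(flow, remaining) for flow in flows}

moved = [f for f in flows if before[f] != after[f]]
# Flows that were on a healthy server but still landed somewhere new.
collateral = [f for f in moved if before[f] != "lb4"]
```

&lt;p&gt;Every flow in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;collateral&lt;/code&gt; belonged to a server that never failed, yet it now arrives at a server that has no state for it.&lt;/p&gt;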

&lt;h3 id=&quot;l4l7-split-design&quot;&gt;L4/L7 split design&lt;/h3&gt;

&lt;p&gt;A pattern that has been used by other projects is to split the load balancers into a L4 and L7 tier. At the L4 tier, the routers use ECMP to shard traffic using consistent hashing to a set of L4 load balancers - typically using software like &lt;a href=&quot;http://www.linuxvirtualserver.org/software/ipvs.html&quot;&gt;ipvs/LVS&lt;/a&gt;. LVS keeps connection state, and optionally syncs connection state with multicast to other L4 nodes, and forwards traffic to the L7 tier which runs software such as haproxy. We call the L4 tier “director” hosts since they direct traffic flow, and the L7 tier “proxy” hosts, since they proxy connections to backend servers.&lt;/p&gt;

&lt;p&gt;This L4/L7 split has an interesting benefit: the proxy tier nodes can now be removed from rotation by gracefully draining existing connections, since the connection state on the director nodes will keep existing connections mapped to their existing proxy server, even after they are removed from rotation for new connections. Additionally, the proxy tier tends to be the one that requires more upkeep due to frequent configuration changes, upgrades and scaling so this works to our advantage.&lt;/p&gt;

&lt;p&gt;If the multicast connection syncing is used, then the L4 load balancer nodes handle failure slightly more gracefully, since once a connection has been synced to the other L4 nodes, the connection will no longer be disrupted. Without connection syncing, provided the director nodes hash connections the same way and have the same backend set, connections may successfully continue over a director node failure. In practice, most installations of this tiered design just accept connection disruption under node failure or node maintenance.&lt;/p&gt;

&lt;p&gt;Unfortunately, using LVS for the director tier has some significant drawbacks. Firstly, multicast was not something we wanted to support, so we would be relying on the nodes having the same view of the world, and having consistent hashing to the backend nodes. Without connection syncing, certain events, including planned maintenance of nodes, could cause connection disruption. Connection disruption is something we wanted to avoid because git cannot retry or resume if the connection is severed mid-flight. Finally, the fact that the director tier requires connection state at all adds extra complexity to DDoS mitigation such as &lt;a href=&quot;http://githubengineering.com/syn-flood-mitigation-with-synsanity/&quot;&gt;synsanity&lt;/a&gt; - to avoid resource exhaustion, syncookies would now need to be generated on the director nodes, despite the fact that the connections themselves are terminated on the proxy nodes.&lt;/p&gt;

&lt;h3 id=&quot;designing-a-better-director&quot;&gt;Designing a better director&lt;/h3&gt;

&lt;p&gt;We decided early on in the design of our load balancer that we wanted to improve on the common pattern for the director tier. We set out to design a new director tier that was stateless and allowed both director and proxy nodes to be gracefully removed from rotation without disruption to users wherever possible. Some of our users live in countries with less than ideal internet connectivity, and it was important to us that long running clones of reasonably sized repositories would not fail during planned maintenance within a reasonable time limit.&lt;/p&gt;

&lt;p&gt;The design we settled on, and now use in production, is a variant of &lt;a href=&quot;https://en.wikipedia.org/wiki/Rendezvous_hashing&quot;&gt;Rendezvous hashing&lt;/a&gt; that supports constant time lookups. We start by storing each proxy host and assigning it a state. These states handle the connection draining aspect of our design goals and will be discussed further in a future post. We then generate a single, fixed-size forwarding table and fill each row with a set of proxy servers using the ordering component of Rendezvous hashing. This table, along with the proxy states, is sent to all director servers and kept in sync as proxies come and go. When a TCP packet arrives on the director, we hash the source IP to generate a consistent index into the forwarding table. We then encapsulate the packet inside another IP packet (actually &lt;a href=&quot;https://lwn.net/Articles/614348/&quot;&gt;Foo-over-UDP&lt;/a&gt;) destined to the internal IP of the proxy server, and send it over the network. The proxy server receives the encapsulated packet, decapsulates it, and processes the original packet locally. Any outgoing packets use Direct Server Return, meaning packets destined to the client egress directly to the client, completely bypassing the director tier.&lt;/p&gt;
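
&lt;p&gt;A minimal Python sketch of such a table follows. Several details are assumptions made for illustration and are not taken from the production system: SHA-256 as the hash, 256 rows, two proxies kept per row, and all function and host names.&lt;/p&gt;

```python
import hashlib


def score(row, proxy):
    # Rendezvous hashing: every (row, proxy) pair gets an independent score.
    data = "{0}:{1}".format(row, proxy).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def build_table(proxies, rows=256):
    table = []
    for row in range(rows):
        # Order proxies by score for this row and keep the top two.
        ranked = sorted(proxies, key=lambda p: score(row, p), reverse=True)
        table.append((ranked[0], ranked[1]))
    return table


def lookup(table, source_ip):
    # Constant-time lookup: hash the source IP into a row index.
    digest = hashlib.sha256(source_ip.encode()).digest()
    return table[int.from_bytes(digest[:8], "big") % len(table)]


table_four = build_table(["proxy1", "proxy2", "proxy3", "proxy4"])
table_three = build_table(["proxy1", "proxy2", "proxy3"])
chosen = lookup(table_four, "198.51.100.7")
```

&lt;p&gt;Because each proxy is scored independently, removing one proxy leaves every row that did not contain it exactly as it was, which is the property that makes the ordering useful here.&lt;/p&gt;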

&lt;h2 id=&quot;stay-tuned&quot;&gt;Stay tuned&lt;/h2&gt;

&lt;p&gt;Now that you have a taste of the system that processed and routed the request to this blog post, we hope you stay tuned for future posts describing our director design in depth, improving haproxy hot configuration reloads and how we managed to migrate to the new system without anyone noticing.&lt;/p&gt;
</description>
        <pubDate>Thu, 22 Sep 2016 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2016/09/22/introducing-glb.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2016/09/22/introducing-glb.html</guid>
        
        
      </item>
    
      <item>
        <title>SYN Flood Mitigation with synsanity</title>
        <description>&lt;p&gt;GitHub hosts a wide range of user content, and like all large websites this often causes us to become a target of denial of service attacks. Around a year ago, GitHub was on the receiving end of a large, unusual and very well publicised attack involving both application level and volumetric attacks against our infrastructure.&lt;/p&gt;

&lt;p&gt;Our users rely on us to be highly available and we take this seriously. Although the attackers are doing the wrong thing, there’s no use blaming the attacker for their attacks being successful. Our commitment is to own our own availability, and we have a responsibility to mitigate these sorts of attacks to the maximum extent technically possible.&lt;/p&gt;

&lt;p&gt;In an effort to reduce the impact of these attacks, we began work on a series of additional mitigation strategies and systems to better prepare us for a future attack of a similar nature. Today we’re sharing our mitigation for one of the attacks we received: synsanity, a SYN flood DDoS mitigation module for Linux 3.x.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-syn-flood-anyway&quot;&gt;What is a SYN flood anyway?&lt;/h2&gt;

&lt;p&gt;SYN floods are one of the oldest and most common attacks, so common that the Linux kernel includes some built in support for mitigating them. When a client connects to a server using TCP, it uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_establishment&quot;&gt;three-way handshake&lt;/a&gt; to synchronise:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;TCP Three-way Handshake&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/tcp-3whs.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;A SYN packet is essentially the client telling the server “I’d like to connect”. During this handshake, both client and server generate random Initial Sequence Numbers (ISNs), which are used to synchronise the TCP connection between the two parties. These sequence numbers let TCP keep track of which messages have been sent and acknowledged by the other party.&lt;/p&gt;

&lt;p&gt;A SYN flood abuses this handshake by only going part way through the handshake. Rather than progressing through the normal sequence, an attacker floods the target server with as many SYN packets as they can muster, from as many different hosts as they can, and spoofing the origin IP as much as they can.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;SYN Flood&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/syn-flood.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The host receiving the SYN flood must respond to each and every packet with a SYN-ACK, but unfortunately the source IP was likely spoofed, so they go nowhere (or worse, come back as rejected). These packets are almost indistinguishable from real SYN packets from real clients, which means it’s hard or impossible to filter out the bad ones on the server. Even external DDoS scrubbing services can only guess whether a packet is legitimate or part of a flood, making it difficult to mitigate an attack without impacting legitimate traffic.&lt;/p&gt;

&lt;p&gt;To make matters worse, when the server is handling normal connections and receives the ACK from a real client, it still needs to know that it came from a SYN packet it sent, so it must also keep a list of connections (in state &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SYN_RECV&lt;/code&gt;) for which a SYN has been received and an ACK has not yet been received.&lt;/p&gt;

&lt;p&gt;During a SYN flood, this behaviour is undesirable. If the queue of connections in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SYN_RECV&lt;/code&gt; has no size limit, memory will get exhausted pretty quickly. If it does have a size limit, as is the case in Linux, then there’s no more space to store state and the connections will simply fail as the packets are dropped.&lt;/p&gt;

&lt;h2 id=&quot;syn-cookies&quot;&gt;SYN cookies&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/SYN_cookies&quot;&gt;SYN cookies&lt;/a&gt; are a clever way of avoiding the storage of TCP connection state during the initial handshake, deferring that storage until a valid ACK has been received. They work by crafting the Initial Sequence Number (ISN) in the SYN-ACK packet sent by the server in such a way that it cryptographically hashes details about the initial SYN packet and its TCP options, so that when the ACK is received (with an acknowledgement number 1 larger than the ISN), the server can validate that it generated the SYN-ACK packet for which an ACK is now being received. The server stores no state for the connection until the ACK (containing the validated SYN cookie) is received, and only at that point is state regenerated and stored.&lt;/p&gt;

&lt;p&gt;Since this hash is calculated with a secret that only the server knows, it doesn’t significantly weaken the sequence number selection and it’s still difficult for someone to forge an ACK (or other packet) for a different connection without having seen the SYN-ACK from the real server.&lt;/p&gt;
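
&lt;p&gt;The idea can be sketched in a few lines of Python. This is a simplified illustration, not the Linux cookie layout: the secret, the function names and the use of HMAC-SHA256 are all assumptions made for the example, and it leaves out the TCP option details (such as the MSS) that a real cookie also has to carry. The server derives its ISN from the connection details, stores nothing, and recomputes the same value when the ACK arrives.&lt;/p&gt;

```python
import hashlib
import hmac
import struct

# Hypothetical secret; a real server would use random bytes known only to itself.
SECRET = b"example-secret-known-only-to-the-server"


def make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, counter, secret=SECRET):
    """Derive a 32-bit server ISN from the connection details and a coarse time counter."""
    msg = "{}|{}|{}|{}|{}|{}".format(src_ip, src_port, dst_ip, dst_port, client_isn, counter)
    digest = hmac.new(secret, msg.encode(), hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]


def check_ack(src_ip, src_port, dst_ip, dst_port, seq, ack, counter, secret=SECRET):
    """Validate the final handshake ACK using only the packet itself and the secret."""
    client_isn = (seq - 1) % 2**32  # the ACK carries client ISN + 1 as its sequence number
    cookie = (ack - 1) % 2**32      # and server ISN + 1 as its acknowledgement number
    # Tolerate one counter tick between sending the SYN-ACK and receiving the ACK.
    for c in (counter, counter - 1):
        expected = make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, c, secret)
        if expected == cookie:
            return True
    return False


conn = ("203.0.113.9", 51515, "192.0.2.1", 443)
client_isn = 123456

# On SYN: compute the ISN for the SYN-ACK and keep no per-connection state.
server_isn = make_cookie(*conn, client_isn, counter=10)

# On ACK: the cookie is recomputed and compared.
valid = check_ack(*conn, seq=client_isn + 1, ack=server_isn + 1, counter=10)
forged = check_ack(*conn, seq=client_isn + 1, ack=(server_isn + 2) % 2**32, counter=10)
```

&lt;p&gt;Mixing a time counter into the hash is one way to make old cookies expire; the one-tick window used here is an arbitrary choice for the sketch.&lt;/p&gt;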

&lt;p&gt;SYN cookies have been around for a while, and they have fairly minimal impact on the reliability and spoof-protection of TCP. Rather than enabling them constantly, the Linux kernel by default automatically enables SYN cookies only when the SYN receive queue is full. This means that under normal circumstances when no SYN flood is occurring, you get no impact at all, but during a SYN flood, you accept the minimal impact of SYN cookies (in return for not dropping connections). The extra CPU cost of creating SYN cookies is offset by the fact that you no longer have a limited resource, and in practice this is an excellent trade-off.&lt;/p&gt;

&lt;p&gt;In Linux 3.x, SYN cookies are generated inside a machine-wide lock on the LISTEN socket that the packet was destined for. This implementation causes all SYN cookies to be generated serially across all cores, defeating the benefits of a multi-processor system. To make matters worse, all cores spin waiting for the lock to become available. This was fine back in the days when an average attacker could only send a few Mbit/s of SYN packets your way, mostly thanks to the networks being much slower. These days however, with servers attached to transit providers with multiple 10Gbit/s+ links the whole way down the line, it’s now possible to completely saturate CPU resources.&lt;/p&gt;

&lt;p&gt;While Linux 4.x has a patch to send SYN cookies under a per-CPU-core socket lock, which does fix the problem, we wanted a solution that allowed us to use an existing, maintained kernel with upstream security patches. We didn’t want to roll and maintain an entire custom kernel and all related future security patches just to mitigate this form of attack. Patching Linux 3.x to backport the socket lock change was also a similar maintenance burden we wanted to avoid.&lt;/p&gt;

&lt;h2 id=&quot;synproxy&quot;&gt;SYNPROXY&lt;/h2&gt;

&lt;p&gt;One solution to get the best of both worlds was the SYNPROXY iptables module. It sits in &lt;a href=&quot;http://www.netfilter.org/&quot;&gt;netfilter&lt;/a&gt; in the kernel, before the Linux TCP stack, and as the name suggests, proxies all connections while generating SYN cookies. When a SYN packet comes in, it responds with a SYN-ACK and throws away all state. On receipt of a valid ACK packet matching the SYN cookie, it then sends a SYN downstream and completes the usual TCP handshake. For every subsequent packet in each direction, it modifies the sequence numbers so that it is transparent to both sides.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;SYNPROXY packet flow&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/synproxy.png&quot; /&gt;
&lt;/div&gt;
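
&lt;p&gt;The sequence number rewriting follows from the fact that the client completed its handshake against the ISN chosen by the proxy (the cookie), while the real server later picks an ISN of its own. A minimal Python sketch of that translation, assuming a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SeqTranslator&lt;/code&gt; helper and ignoring everything else a real proxy may need to adjust in each packet:&lt;/p&gt;

```python
MOD = 2**32  # TCP sequence numbers wrap around at 32 bits


class SeqTranslator:
    """Maps between the ISN the client saw (from the proxy) and the ISN the server chose."""

    def __init__(self, proxy_isn, server_isn):
        self.offset = (server_isn - proxy_isn) % MOD

    def server_to_client(self, seq, ack):
        """Shift the server's sequence number into the space the client expects."""
        return (seq - self.offset) % MOD, ack

    def client_to_server(self, seq, ack):
        """Shift the client's acknowledgement number into the space the server expects."""
        return seq, (ack + self.offset) % MOD


# The proxy answered the client with ISN 4294967000; the server later chose ISN 500.
translator = SeqTranslator(proxy_isn=4294967000, server_isn=500)

# The first data byte from the server has sequence number server_isn + 1.
seq_seen_by_client, _ = translator.server_to_client(seq=501, ack=0)

# The client acknowledges 100 bytes relative to the ISN it knows about.
_, ack_seen_by_server = translator.client_to_server(seq=0, ack=(4294967000 + 1 + 100) % MOD)
```

&lt;p&gt;Because the offset has to be applied to every packet for the lifetime of the connection, the proxy must keep per-connection state and stay in the data path after the handshake.&lt;/p&gt;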

&lt;p&gt;This is quite an intrusive way of solving the problem since it touches every packet during the entire connection, but it does successfully mitigate SYN floods. Unfortunately we found that in practice under our load and with the number of malformed packets we receive, it quickly broke down and caused a kernel panic. Additionally, it had to be enabled all the time, since there was no simple way to activate it only when under attack. This meant that we would have to accept the minimal impact of SYN cookies constantly, and at our scale this still would likely cause issues for some of our users.&lt;/p&gt;

&lt;p&gt;We decided that it was more complicated than it needed to be for our use case, and we wanted a simpler solution that would only touch the packets that needed to be touched to mitigate a SYN flood. We also decided that a mitigation should only cause potential (even if minimal) impact during mitigation, and not under normal operation.&lt;/p&gt;

&lt;h2 id=&quot;synsanity&quot;&gt;synsanity&lt;/h2&gt;

&lt;p&gt;Enter synsanity, our solution to mitigate SYN floods on Linux 3.x. synsanity is inspired by SYNPROXY, in that it is an iptables module that sits between the Linux TCP stack and the network card. The major difference is that rather than touch all packets, synsanity simply generates a SYN cookie identically to the way the Linux kernel would generate one if the SYN queue were full, and once it validates the ACK packet, it allows it through to the standard Linux SYN cookie code, which creates and completes the connection. After this point, synsanity doesn’t touch any further packet in the TCP connection.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;synsanity packet flow&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/synsanity.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Similar to the way that Linux only enables SYN cookies when the SYN queue overflows, we only enable synsanity when the SYN queue overflows as well. We match the core Linux code exactly, except that we do it in an iptables module, outside the LISTEN lock. Since an iptables module can be compiled and maintained outside the Linux kernel source tree itself, we don’t need to use a custom Linux kernel, and can instead just maintain and deploy a single module to our servers.&lt;/p&gt;
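
&lt;p&gt;Put together, the per-packet decision described above might look roughly like the following Python sketch. It is not synsanity’s actual code: the function, its parameters and the verdict strings are hypothetical, and dropping an ACK that fails validation is an assumption made for the example.&lt;/p&gt;

```python
PASS = "pass to the Linux TCP stack"
COOKIE = "reply with a SYN cookie and store nothing"
DROP = "drop"


def verdict(is_syn, has_kernel_state, syn_queue_full, cookie_valid=False):
    """Hypothetical decision for one incoming TCP packet."""
    if has_kernel_state:
        # The kernel already knows this connection, so the packet is left alone.
        return PASS
    if is_syn:
        # Only take over SYN handling while the SYN queue is overflowing.
        return COOKIE if syn_queue_full else PASS
    # An ACK with no stored state only makes sense as the answer to a cookie.
    return PASS if cookie_valid else DROP


normal_syn = verdict(is_syn=True, has_kernel_state=False, syn_queue_full=False)
flood_syn = verdict(is_syn=True, has_kernel_state=False, syn_queue_full=True)
good_ack = verdict(is_syn=False, has_kernel_state=False, syn_queue_full=True, cookie_valid=True)
data_packet = verdict(is_syn=False, has_kernel_state=True, syn_queue_full=True)
```

&lt;p&gt;The point of the structure is that only the first two packets of a handshake are ever handled specially, and only while the queue is full.&lt;/p&gt;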

&lt;p&gt;synsanity has allowed us to mitigate multiple attacks that would have previously caused a partial or complete service outage, both long running attacks and large volume attacks.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;synsanity syncookie graph&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/graph-300kpps-syn-flood.png&quot; /&gt;
synsanity sending SYN cookies during a 300kpps SYN flood
&lt;/div&gt;

&lt;h2 id=&quot;open-source&quot;&gt;Open Source&lt;/h2&gt;

&lt;p&gt;We believe that if you need to hide your mitigation to keep it secure, it’s not designed well enough. The best and most secure tools are shared, open and subject to community scrutiny, so today we’re open sourcing &lt;a href=&quot;https://github.com/github/synsanity&quot;&gt;synsanity&lt;/a&gt; so that everyone can benefit from this work.&lt;/p&gt;
</description>
        <pubDate>Tue, 12 Jul 2016 00:00:00 +0000</pubDate>
        <link>http://theojulienne.io/2016/07/12/syn-flood-mitigation-with-synsanity.html</link>
        <guid isPermaLink="true">http://theojulienne.io/2016/07/12/syn-flood-mitigation-with-synsanity.html</guid>
        
        
      </item>
    
  </channel>
</rss>
