Theo Julienne
Software & Infrastructure Engineer
  • Ethernet MTU and TCP MSS: Why connections stall

    MTU and MSS are two terms that are easily confused, and their misconfiguration is often the cause of networking problems. Spend enough time working on production systems that interface with large networks of computers or the Internet and you are almost guaranteed to come across situations where an interface was configured with the wrong MTU, or a firewall was filtering ICMP. The result is a client that cannot transfer large amounts of data even though smaller transfers work fine. This post walks through MTU, MSS and packet size negotiation for TCP connections, and the common situations where it breaks down. It was inspired by multiple discussions while investigating errors on production systems as part of my role at GitHub.
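
    As a rough illustration of the relationship described above, the sketch below derives the TCP MSS from an interface MTU, assuming IPv4 and TCP headers of 20 bytes each with no options (40 bytes for the IPv6 header). The function names are hypothetical and chosen only for this example. It also shows why a mismatch matters: a host with a 9000-byte MTU advertises an MSS that a 1500-byte path cannot carry in one packet, and if the ICMP messages reporting this are filtered, the sender may never learn to send smaller segments.

```python
# Header sizes in bytes, assuming no IP or TCP options are present.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def tcp_mss(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def segment_fits(mss: int, path_mtu: int, ipv6: bool = False) -> bool:
    """True if a full-sized segment crosses the path without fragmentation."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mss + ip_header + TCP_HEADER <= path_mtu


if __name__ == "__main__":
    standard = tcp_mss(1500)   # typical Ethernet MTU
    jumbo = tcp_mss(9000)      # jumbo frames
    print("MSS for MTU 1500:", standard)
    print("MSS for MTU 9000:", jumbo)
    # A jumbo-sized segment does not fit a 1500-byte path; small packets do.
    print("Jumbo segment fits 1500-byte path:", segment_fits(jumbo, 1500))
    print("Small segment fits 1500-byte path:", segment_fits(512, 1500))
```

    Small requests stay under the path limit and succeed, while bulk transfers fill whole segments and stall, which matches the symptom of large transfers failing when smaller ones work.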

  • Scaling Linux Services: Before accepting connections

    When writing services that accept TCP connections, we tend to think of our work as starting from the point where our service accepts a new client connection and finishing when we complete the request and close the socket. For services at scale, operations can happen at such a high rate that some of the default resource limits of the Linux kernel break this abstraction and start affecting incoming connections outside of that connection lifecycle. This post focuses on some standard resource limits that apply before the client socket is handed to the application, all of which came up while investigating errors on production systems as part of my role at GitHub (in some cases, multiple times across different applications).
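
    One example of such a limit, chosen here as an assumption since the excerpt above does not name specific limits, is the listen backlog: on Linux the backlog passed to listen() is silently capped at the net.core.somaxconn sysctl. The minimal sketch below models that cap and reads the current value where available. The helper names are hypothetical.

```python
from typing import Optional


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Backlog Linux actually applies: listen() caps it at net.core.somaxconn."""
    return min(requested, somaxconn)


def read_somaxconn(path: str = "/proc/sys/net/core/somaxconn") -> Optional[int]:
    """Return the kernel's current cap, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    cap = read_somaxconn()
    if cap is None:
        print("somaxconn not readable on this system")
    else:
        # An application asking for 1024 may get far less than it expects.
        print("requested 1024, effective:", effective_backlog(1024, cap))
```

    Connections that queue up while the application is busy count against this cap before accept() ever returns, so the application itself sees nothing unusual when the limit is hit.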

  • Debugging network stalls on Kubernetes

    Originally posted to the GitHub Engineering Blog

    We’ve talked about Kubernetes before, and over the last couple of years it’s become the standard deployment pattern at GitHub. We now run a large portion of both internal and public-facing services on Kubernetes. As our Kubernetes clusters have grown, and our targets on the latency of our services have become more stringent, we began to notice that certain services running on Kubernetes in our environment were experiencing sporadic latency that couldn’t be attributed to the performance characteristics of the application itself.

  • GLB: GitHub's open source load balancer

    Originally posted to the GitHub Engineering Blog

    At GitHub, we serve tens of thousands of requests every second out of our network edge, operating on GitHub’s metal cloud. We’ve previously introduced GLB, our scalable load balancing solution for bare metal datacenters, which powers the majority of GitHub’s public web and git traffic, as well as fronting some of our most critical internal systems such as highly available MySQL clusters. Today we’re excited to share more details about our load balancer’s design, as well as release the GLB Director as open source.

  • GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder

    Originally posted to the GitHub Engineering Blog

    Recently we introduced GLB, the GitHub Load Balancer that powers GitHub.com. The GLB proxy tier, which handles TCP connection and TLS termination, is powered by HAProxy, a reliable and high-performance TCP and HTTP proxy daemon. As part of the design of GLB, we set out to solve a few of the common issues found when using HAProxy at scale.

  • Introducing the GitHub Load Balancer

    Originally posted to the GitHub Engineering Blog

    At GitHub we serve billions of HTTP, Git and SSH connections each day. To get the best performance we run on bare metal hardware. Historically one of the more complex components has been our load balancing tier. Traditionally we scaled this vertically, using a small set of very large machines running HAProxy and a very specific hardware configuration that allowed dedicated 10G link failover. Eventually we needed a solution that was scalable, and we set out to create a load balancer that would run on commodity hardware in our typical data center configuration.

  • SYN Flood Mitigation with synsanity

    Originally posted to the GitHub Engineering Blog

    GitHub hosts a wide range of user content, and as with all large websites this often makes us a target of denial of service attacks. Around a year ago, GitHub was on the receiving end of a large, unusual and very well publicised attack that combined application level and volumetric attacks against our infrastructure.
