This post was co-authored by Vinay Shivananda, Sr. Software Engineer.

When moving applications from monolithic to microservices architecture, you run into surprises. One such surprise? Intermittent failures in DNS infrastructure, which bubble up as breaks in application access that can affect user productivity.

Service discovery is a fundamental piece of any microservice architecture. Every service-to-service call begins with a service-discovery lookup to retrieve the target instance's IP address. This function is typically implemented with DNS, and Kubernetes is no exception. Each service lookup therefore amplifies the load on the DNS infrastructure.
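
To make the amplification concrete, here's a minimal Python sketch of the pattern; the service name and port are hypothetical. Every connection to a peer service is preceded by a resolver call, so request volume translates directly into DNS query volume.

```python
import socket

# Hypothetical in-cluster service name; every request to the service
# starts with a DNS lookup of this cluster-internal name.
SERVICE_HOST = "orders.default.svc.cluster.local"

# getaddrinfo drives the resolver configured in the pod's /etc/resolv.conf,
# which in Kubernetes normally points at the cluster DNS service.
addrinfo = socket.getaddrinfo(SERVICE_HOST, 8080, proto=socket.IPPROTO_TCP)
ip, port = addrinfo[0][4][:2]

# Only after the lookup succeeds does the actual service call begin.
with socket.create_connection((ip, port), timeout=2.0) as conn:
    conn.sendall(b"GET /healthz HTTP/1.1\r\nHost: orders\r\n\r\n")
```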

In online technical forums, you’ll find engineers discussing application failures due to DNS timeouts at scale. The tunables available in resolv.conf often aren’t sufficient to address these DNS failures, which tend to appear as Kubernetes service-to-service traffic increases. Here, an author on the Tinder engineering blog describes a DNS timeout issue where, despite CoreDNS scaling to 1,000 instances, application timeouts weren’t mitigated:

As we onboarded more and more services to Kubernetes, we found ourselves running a DNS service that was answering 250,000 requests per second. We were encountering intermittent and impactful DNS lookup timeouts within our applications. This occurred despite an exhaustive tuning effort and a DNS provider switch to a CoreDNS deployment that at one time peaked at 1,000 pods consuming 120 cores.

Because of overlay networking, any traffic between two pods on different Kubernetes nodes results in the insertion of source and destination NAT rules in the conntrack table to ensure the packets are routed correctly to the pods in both directions. Under certain race conditions in the Linux conntrack kernel module, DNS requests over UDP can get dropped, leading either to application timeouts or to an increase in application latency.
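
From the application's point of view, the failure mode is a resolver timeout. The sketch below uses the third-party dnspython library and a hypothetical name to show how a dropped UDP packet surfaces, with a TCP retry as one common workaround; this is an illustration, not something the solution below depends on.

```python
import dns.exception
import dns.resolver  # third-party: dnspython

resolver = dns.resolver.Resolver()  # reads the pod's /etc/resolv.conf
resolver.lifetime = 2.0             # total time budget for a query

def lookup(name: str) -> list[str]:
    try:
        # DNS queries go over UDP by default; a packet lost to the
        # conntrack race is recovered only by timing out and retrying.
        return [r.address for r in resolver.resolve(name, "A")]
    except dns.exception.Timeout:
        # Retrying over TCP sidesteps the UDP drop, at the cost of
        # connection setup latency.
        return [r.address for r in resolver.resolve(name, "A", tcp=True)]

print(lookup("kubernetes.default.svc.cluster.local"))
```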

How to Scale the DNS Layer

Authoritative DNS servers serve DNS records with a shelf life implied by the time-to-live (TTL) value associated with each record. In the context of service discovery, the TTL is often kept very low (10 seconds to a minute) because DNS records represent entities that are dynamic in nature. DNS is inherently hierarchical, and multiple layers of DNS caches often sit in front of the authoritative servers. These caches absorb a large proportion of the load by reusing a resolved record to serve other clients until its TTL expires.
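
The reuse-until-expiry behavior is simple to picture. Here's a minimal, illustrative Python sketch of a TTL-bound cache (not any particular resolver's implementation):

```python
import time

class DnsTtlCache:
    """Reuse a resolved record for all clients until its TTL expires,
    then fall through to the upstream resolver exactly once."""

    def __init__(self, upstream):
        self._upstream = upstream  # callable: name -> (ip list, ttl seconds)
        self._store = {}           # name -> (ip list, expiry timestamp)

    def resolve(self, name):
        entry = self._store.get(name)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # cache hit: the upstream never sees this query
        ips, ttl = self._upstream(name)  # cache miss: one upstream query
        self._store[name] = (ips, time.monotonic() + ttl)
        return ips
```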

Similarly, to help DNS scale in a microservice architecture, one design pattern that has evolved is to place a caching layer in front of the internal authoritative DNS servers.

Kubernetes is the de facto platform for deploying microservice-based applications, and CoreDNS is the DNS service the Kubernetes platform provides natively. As microservices scale within Kubernetes, applications can face frequent timeouts because of CoreDNS scaling limits and the conntrack race conditions described above.

To scale DNS and reduce application failures due to DNS timeouts in Kubernetes, you should have a DNS cache on each Kubernetes node, as shown in the image below. Interaction with the conntrack module is greatly reduced because the Citrix ADC CPX pods are local to the node, and most DNS requests are served from the local cache.
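
For illustration, here is one way to steer a pod's resolver at a node-local cache using the Kubernetes Python client. The link-local address 169.254.20.10 is an assumption for the sketch; the actual address depends on how the CPX cache is exposed on each node.

```python
from kubernetes import client

# Assumed address of the per-node DNS cache; adjust to your deployment.
NODE_LOCAL_DNS = "169.254.20.10"

dns_config = client.V1PodDNSConfig(
    nameservers=[NODE_LOCAL_DNS],
    searches=["default.svc.cluster.local", "svc.cluster.local", "cluster.local"],
    options=[client.V1PodDNSConfigOption(name="ndots", value="5")],
)

pod_spec = client.V1PodSpec(
    dns_policy="None",  # ignore the cluster default and use dns_config as-is
    dns_config=dns_config,
    containers=[client.V1Container(name="app", image="example/app:latest")],
)
```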

Why Citrix ADC CPX?

Citrix ADC CPX is a Citrix ADC offering in container form factor. Citrix ADC has rich support for the DNS protocol in both proxy and authoritative modes across all its platforms (hardware, virtual, and container). Below are some of the benefits of using CPX as a node-level cache for this use case.

  1. Citrix ADC DNS performance: Citrix ADC delivers a rich feature set at high scale and performance, with a scalable DNS implementation whose memory footprint is comparable to that of modern-day proxies.
  2. Proactive DNS cache update: For frequently accessed DNS records, Citrix ADC asynchronously queries (outside the context of the actual client request) the backend authoritative kubeDNS servers for updated records before the TTL expires. DNS clients are thus always served the latest record without incurring additional backend latency. (A sketch of this behavior follows the results table below.)
  3. DNS request switching: Pipelining of DNS requests over a single connection is common. On a cache miss, each pipelined request is load balanced individually to a different backend DNS pod.
  4. Reduced DNS latency: With the introduction of Citrix ADC CPX as a node-local DNS cache, we see steep improvements in DNS latency. Here are the performance readings for the 90th, 95th, and 99th percentiles:

Percentile | Latency with CPX as cache (microseconds) | Latency with CoreDNS as cache (microseconds)
99         | 793                                       | 9740
95         | 155                                       | 6075
90         | 89                                        | 4805
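
The proactive cache update described in point 2 can be pictured with a short sketch. This is illustrative Python, not CPX's actual implementation: a background task refreshes a hot record shortly before its TTL lapses, so client requests never block on an upstream query.

```python
import threading
import time

def refresh_loop(cache, upstream, name, margin=2.0):
    """Re-query upstream just before a hot record's TTL expires, so the
    cache entry never goes stale from a client's point of view."""
    while True:
        ips, ttl = upstream(name)           # asynchronous with respect to clients
        cache[name] = ips                   # clients always read from the cache
        time.sleep(max(ttl - margin, 0.5))  # wake up just before expiry

# Hypothetical upstream resolver: returns (ip list, ttl in seconds).
fake_upstream = lambda name: (["10.0.0.7"], 10)

cache = {}
threading.Thread(
    target=refresh_loop,
    args=(cache, fake_upstream, "orders.default.svc.cluster.local"),
    daemon=True,
).start()

time.sleep(0.1)
print(cache)  # the record is warm before any client ever asks for it
```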

To compare the proposed solution with the CoreDNS cache, we carried out performance tests on Google Kubernetes Engine (GKE) using the DNS performance tool. The details of the test environment follow:

Kubernetes version: 1.15.9-gke.2
Machine type: n1-standard-1 (1 vCPU, 3.75GB memory)
Number of nodes: 3
Queries per second: 15k
Test duration: 30 seconds

CPX version: 13.0-52.24
CPX pod CPU: 25m
CPX pod memory: 25Mi

CoreDNS cache version: 1.15.7
CoreDNS pod CPU: 25m
CoreDNS pod memory: 5Mi

DNS performance tool version: 1.1

The test was carried out by sending queries for three different domain names. The DNS performance tool runs as a pod on one of the nodes, and the latency measurements are as reported by the tool. In GCP, the performance numbers might vary marginally from run to run.

Why Not a Sidecar Deployment?

A DNS cache that’s local to a node helps consolidate DNS access and improves cache usage, which reduces load on the authoritative server. Sidecar deployments don’t help with cache consolidation because it’s unlikely that a single pod will query the same domain again within the record’s TTL. A node-local cache also eases maintenance: at higher scale, it’s easier to push a configuration update to a per-node cache than to a per-pod cache.
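
A rough back-of-envelope illustrates the consolidation argument; the TTL, per-pod query rate, and pod count below are assumptions chosen for illustration, not measurements.

```python
TTL = 10.0             # seconds a record stays fresh
RATE_PER_POD = 1 / 30  # queries per second, per pod, for one name
PODS_PER_NODE = 20

def hit_rate(qps: float, ttl: float) -> float:
    # Renewal approximation: each cycle is one miss followed by hits for
    # the rest of the TTL window, so hits = qps * ttl per miss.
    queries_per_window = qps * ttl
    return queries_per_window / (queries_per_window + 1)

print(f"sidecar cache hit rate:    {hit_rate(RATE_PER_POD, TTL):.0%}")                  # ~25%
print(f"node-local cache hit rate: {hit_rate(RATE_PER_POD * PODS_PER_NODE, TTL):.0%}")  # ~87%
```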

Conclusion

Deploying a node-local DNS cache reduces application latency and cuts down on application failures due to DNS timeouts. Citrix ADC CPX, with its rich DNS feature set and strong performance numbers, can be leveraged as a node-local cache to improve application performance in Kubernetes environments.

Learn more about Citrix ADC CPX as a node-local cache and the DNS features it supports here. Get more information about CPX for service mesh and Ingress Controller for Citrix ADC.