In a cloud-native ecosystem, containerized apps are usually distributed across clusters and nodes. Sometimes these microservices misbehave or even fail, so it’s important to incorporate resiliency techniques in the architecture of cloud-native apps.

In a Kubernetes environment, an app is deployed as a collection of services, which run inside pods and can autoscale based on load. But sometimes these services become slow or unresponsive because of factors such as disk pressure on the cluster node. In these situations, we should not send traffic to the faulty services, and we want to be able to identify such “outlier” service instances automatically.

Outlier Detection

Outlier detection is a process for identifying unusual or abnormal behavior of application pods and evicting them from the load-balancing pool of healthy service instances. Look at it as a passive health check.

You can implement this resiliency strategy by tracking errors generated by the application pod while it serves requests. For example, an HTTP service that frequently responds with 5xx errors would be considered unhealthy and evicted from the LB pool.

Eviction means that the client no longer sends traffic to the outlier instance: it is removed from the LB pool for a specific duration, and no packets are sent to it during that time. Once the eviction period expires, the instance is added back to the LB pool and undergoes the outlier detection process again.

Figure 1. Citrix ADC CPX proxy doesn’t send packets to outlier-pod of backend-v1.
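
To make the mechanism concrete, here is a toy sketch of that bookkeeping. It is purely illustrative; real proxies such as Envoy or Citrix ADC CPX implement this logic internally, and every name and threshold below is hypothetical.

package main

import (
    "fmt"
    "time"
)

// endpoint tracks the outlier-detection state of a single service instance.
type endpoint struct {
    consecutiveErrors int
    ejectedUntil      time.Time
}

// record updates the state after each response: maxErrors consecutive 5xx
// responses eject the instance from the LB pool for ejectionTime, while any
// successful response resets the counter.
func (e *endpoint) record(status, maxErrors int, ejectionTime time.Duration) {
    if status >= 500 {
        e.consecutiveErrors++
        if e.consecutiveErrors >= maxErrors {
            e.ejectedUntil = time.Now().Add(ejectionTime)
            e.consecutiveErrors = 0
        }
        return
    }
    e.consecutiveErrors = 0
}

// healthy reports whether the instance may receive traffic right now.
func (e *endpoint) healthy() bool {
    return time.Now().After(e.ejectedUntil)
}

func main() {
    ep := &endpoint{}
    for i := 0; i < 3; i++ {
        ep.record(503, 3, 5*time.Minute) // three consecutive gateway errors
    }
    fmt.Println("healthy:", ep.healthy()) // false: evicted for five minutes
}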

Outlier Detection in Istio Service Mesh

Istio provides a Destination Rule Custom Resource Definition (CRD), which defines policies that apply to traffic intended for a service after routing has occurred. In addition to load-balancing configuration, outlier detection settings can also be provided in this destination rule. Here’s a sample YAML file with outlier detection settings; let’s have a look at the outlierDetection fields.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpserver
spec:
  host: httpserver
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutiveErrors: 3
      interval: 30s
      baseEjectionTime: 5m

  • consecutiveErrors: Number of consecutive server errors (i.e., HTTP 502, 503, and 504) before a host is evicted from the LB pool.
  • interval: Time interval between eviction analyses.
  • baseEjectionTime: Duration for which the outlier host is evicted from the LB pool.

In the above example, the upstream httpserver host is scanned every 30 seconds. If it has returned three consecutive gateway errors, it is marked as an outlier and evicted from the pool of healthy instances for five minutes. This gives the upstream host a breather to recover.

Outlier Detection on Citrix ADC

Citrix-istio-adaptor enables deployment of Citrix ADC as a proxy (both ingress and sidecar) in the Istio service mesh. You can learn more here.

The implementation of outlier detection is completely transparent to the application deployed in the service mesh; the Citrix ADC proxy handles all of its intricacies. Under the hood, it uses HTTP inline monitors to achieve outlier detection. An inline monitor determines that the service to which it is bound is active by validating the responses to the requests sent to it. When no client requests are being sent to the service, the inline monitor probes the service using the configured URL.
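
For reference, an HTTP inline monitor looks roughly like the Citrix ADC CLI sketch below. You never configure this yourself in the mesh, since citrix-istio-adaptor programs it automatically; the names and option values here are only illustrative assumptions, not what the adaptor actually generates.

add lb monitor httpserver_inline_mon HTTP-INLINE -respCode 200 -action DOWN
bind service httpserver_svc -monitorName httpserver_inline_mon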

Let’s walk through a working example. First, ensure that automatic sidecar injection of Citrix ADC CPX is enabled.

I have developed a simple httpserver running on port 4040 for the outlier demonstration. When a request is sent to the httpserver, it returns a 200 OK response along with the pod IP address of the service instance. But when a request is sent to the ‘/misbehave’ path, it returns a 503 Service Unavailable error along with the message “MISBEHAVING! HTTP status code returned from <IP-address-of-httpserver>!”.
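
The post doesn’t show the server’s source, but a minimal sketch of a server with the same behavior might look like this (the language, the POD_IP environment variable, and the exact response text are assumptions):

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

// podIP returns the pod's IP address; this sketch assumes it is exposed via
// the Downward API as the POD_IP environment variable, and falls back to the
// hostname (the pod name) if the variable is not set.
func podIP() string {
    if ip := os.Getenv("POD_IP"); ip != "" {
        return ip
    }
    host, _ := os.Hostname()
    return host
}

func main() {
    // Healthy path: every other request gets 200 OK plus the serving pod's IP.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "200 OK from %s\n", podIP())
    })
    // Faulty path: /misbehave returns 503, which the proxy counts towards
    // the consecutive 5xx errors used for outlier detection.
    http.HandleFunc("/misbehave", func(w http.ResponseWriter, r *http.Request) {
        msg := fmt.Sprintf("MISBEHAVING! HTTP status code returned from %s!", podIP())
        http.Error(w, msg, http.StatusServiceUnavailable)
    })
    log.Fatal(http.ListenAndServe(":4040", nil)) // the demo server listens on 4040
}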

We’ll deploy two versions of httpserver. Version v1 will handle all requests except those with the ‘/misbehave’ path; those misbehaving requests will be handled by the second version, v2. Figure 2 shows the packet flow between client and server.

Figure 2. Packet flow

Steps

Here are the steps to deploy a working example of outlier detection using Citrix ADC CPX. All YAML files used in this blog are available in this GitHub gist.

1. Deploy two versions, v1 and v2, of httpserver. Deploy three replicas of httpserver:v2.

kubectl create ns httpserver
kubectl apply -f httpserver.yaml -n httpserver
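
The real manifest is in the linked gist; the trimmed sketch below shows roughly what it contains. The labels, port name, and image are assumptions based on the description above, not the gist’s exact contents.

apiVersion: v1
kind: Service
metadata:
  name: httpserver
spec:
  selector:
    app: httpserver
  ports:
  - name: http            # an "http" port name so the mesh treats the traffic as HTTP
    port: 4040
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpserver-v2
spec:
  replicas: 3             # three replicas of v2, as used in the demo
  selector:
    matchLabels:
      app: httpserver
      version: v2
  template:
    metadata:
      labels:
        app: httpserver   # matches the Service selector
        version: v2       # matches the v2 subset in the DestinationRule
    spec:
      containers:
      - name: httpserver
        image: "<your-httpserver-image>"   # placeholder; build from your own demo server
        ports:
        - containerPort: 4040

The v1 Deployment is analogous, with version: v1 labels.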

2. Deploy the HTTP-based client in another namespace. In this blog, we use the ‘sleep’ pod as the client, deployed in the outlier namespace. Label the namespace with ‘cpx-injection=enabled’. This ensures that Citrix ADC CPX is deployed as a sidecar in the client pod; this CPX proxy maintains the LB pool of httpserver instances.

kubectl create ns outlier
kubectl label ns outlier cpx-injection=enabled
kubectl apply -f sleep.yaml -n outlier
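
You can verify that the injection worked by listing the containers in the sleep pod; it should show the Citrix ADC CPX sidecar next to the application container (the app=sleep label is an assumption based on the standard Istio sleep sample):

kubectl get pods -n outlier
kubectl get pod -n outlier -l app=sleep -o jsonpath='{.items[0].spec.containers[*].name}'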

3. Apply the virtual service and destination rule for httpserver. Outlier detection settings are specified in the outlierDetection section of the destination rule. Here, we have configured httpserver:v2 to be identified as an outlier if it returns 5xx errors two consecutive times within a 10-second interval. If an instance is detected as an outlier, it is evicted for one minute from the LB pool configured in the Citrix ADC CPX sidecar of the sleep pod.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpserver
spec:
  host: httpserver
  subsets:
  - labels:
      version: v2
    name: v2
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 1
        http:
          http1MaxPendingRequests: 1
          maxRequestsPerConnection: 1
      outlierDetection:
        consecutiveErrors: 2
        interval: 10s
        baseEjectionTime: 1m

kubectl apply -f httpserver-vs.yaml -n httpserver
kubectl apply -f httpserver-dr.yaml -n httpserver

4. We have configured the VirtualService to route requests for the ‘/misbehave’ path to v2 of httpserver; all other requests are forwarded to v1.

  http:
  - match:
    - uri:
        prefix: "/misbehave"
    route:
    - destination:
        host: httpserver
        subset: v2
  - route:
    - destination:
        host: httpserver
        subset: v1

5. Initially, all three instances of httpserver:v2 are part of the LB pool in the Citrix ADC CPX sidecar proxy running in the sleep pod, and requests are load balanced among them in round-robin fashion. Now send misbehaving requests to the httpserver service using curl from the sleep pod.

curl http://httpserver.httpserver:4040/misbehave
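
The command above is run from inside the sleep pod. To reproduce the sequence described in the next step, you can fire several such requests in a row; the deployment and container names below assume the standard Istio sleep sample:

for i in $(seq 1 6); do
  kubectl exec deploy/sleep -n outlier -c sleep -- curl -s http://httpserver.httpserver:4040/misbehave
done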

6. As shown in the Figure 2 packet flow above, the first request goes to pod-1 of httpserver:v2 and returns 503 Service Unavailable. The second request goes to pod-2 and also returns a 503 error. The third request goes to pod-3, and the same story repeats. The fourth request again lands on pod-1 and, as expected, returns a 503 error. But this time the CPX proxy sees that pod-1 has returned two consecutive 5xx errors, so it categorizes pod-1 of httpserver:v2 as an “outlier” and marks it DOWN in the LB pool of httpserver.

This behavior can be seen in Figure 3.

Figure 3. Outlier Detection Demo with Citrix ADC CPX

7. After one minute (i.e., the baseEjectionTime), the inline monitor checks the health of pod-1. If it is found healthy, it is marked as active (UP) in the LB pool again.

Please note that Citrix ADC supports HTTP-based outlier detection but not TCP-based outlier detection.

Conclusion

Outlier detection is a resiliency strategy to improve the reliability and overall availability of services by ensuring that only healthy pods respond to client requests. Applications can self-heal with outlier detection because it suspends the sending of requests to unhealthy pods, giving them time to recover. In the meantime, requests are forwarded only to healthy pods. There is no service downtime with this approach, which ensures that users do not experience any negative impact.

To learn more about Citrix ADC in the Istio service mesh, check out this GitHub repository.