Advanced traffic management in Canary using Istio, Argo Rollouts, and HPA

As enterprises mature in their CI/CD journey, they tend to ship code faster, more safely, and more securely. One essential strategy DevOps teams apply is releasing code progressively to production, also known as canary deployment. Canary deployment is a reliable mechanism for safely releasing application changes, and it provides flexibility for business experiments. It can be implemented using software such as Argo Rollouts and Flagger. However, advanced DevOps teams want granular control over traffic and pod scaling while performing canary deployments, to reduce overall cost. We have helped a few of our customers (Istio users) achieve advanced traffic management of canary deployments at scale.

We want to share our knowledge with the DevOps community through this blog. 

If you want to jump straight into the demo video, our CTO, Ravi Verma, has created a walkthrough on advanced traffic management in Canary for enterprises at scale. 

If you are interested, you can check our previous blog: How to implement canary deployment for Kubernetes workloads using Argo Rollouts and Istio.

Before we get started, let us discuss the canary architecture implemented by Argo Rollouts and Istio. 

Recap of Canary implementation architecture with Argo Rollouts and Istio 

If you use Istio service mesh, all of your meshed workloads will have an Envoy proxy sidecar attached to the application container in the pod. You may also have an API gateway or Istio ingress gateway to receive incoming traffic from outside the cluster. In such a setup, you can use Argo Rollouts to handle canary deployments. To implement canary deployment, Argo Rollouts provides a CRD called Rollout, which is similar to the Deployment object and is responsible for creating, scaling, and deleting ReplicaSets in Kubernetes.

The canary deployment strategy starts by redirecting a small amount of traffic (say, 5%) to the newly deployed app. Based on specific criteria, such as acceptable resource utilization of the new canary pods, you gradually increase the traffic to 100%. Traffic handling for the baseline and canary versions is carried out by the Istio sidecar, per the rules defined in the VirtualService resource. Since Argo Rollouts provides native integration with Istio, it updates the VirtualService resource to increase the traffic to the canary pods.

Canary implementation architecture with Argo Rollouts and Istio

Canary can be implemented in two ways: deploying new changes as a service, or deploying new changes as a subset.

  1. Deploying new changes as a service 

In this method, we create a new service (called canary) and split the traffic from the Istio ingress gateway between the stable and canary services. Refer to the image below.

Deploying new changes as a service

You can refer to the YAML file for a sample implementation of deploying a canary with multiple services here. We have created two services: rollouts-demo-stable and rollouts-demo-canary. Each service listens for HTTP traffic for the Argo Rollout resource called rollouts-demo. In the rollouts-demo YAML, we have specified the Istio virtual service resource and the logic to gradually increase the traffic weight from 20% to 40%, 60%, 80%, and eventually 100%.
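To illustrate the idea, here is a minimal sketch of what such a virtual service could look like at the 20% step. The two service hosts match the names above; the gateway name and route structure are assumptions for illustration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: rollouts-demo-vs
spec:
  gateways:
    - rollouts-demo-gateway      # assumed Istio gateway name
  hosts:
    - "*"
  http:
    - name: primary
      route:
        - destination:
            host: rollouts-demo-stable   # stable Kubernetes service
          weight: 80
        - destination:
            host: rollouts-demo-canary   # canary Kubernetes service
          weight: 20                     # Argo Rollouts rewrites these weights at each step
```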

If you want to know more about this method, you can watch the detailed video.

  2. Deploying new changes as a subset

In this method, you keep a single service but create a new Deployment subset (the canary version) pointing to the same service. Traffic can be split between the stable and canary deployment sets using Istio VirtualService and DestinationRule resources.

In this blog, we will thoroughly discuss the second method. 

Implementing Canary using Istio and Argo Rollouts without changing the Deployment resource

There is a misunderstanding among DevOps professionals that Argo Rollouts is a replacement for the Deployment resource, and that services considered for canary deployment have to refer to Argo Rollouts with their Deployment configuration rewritten.

Well, that’s not true. 

The Argo Rollout resource provides a section called workloadRef, where existing Deployments can be referenced without making any significant changes to the Deployment or service YAMLs.

If you use the Deployment resource for a service in Kubernetes, you can provide a reference to it in the Rollout CRD, after which Argo Rollouts will manage the ReplicaSet for that service. Refer to the image below.

Deployments resource for a service in Kubernetes

We will use the same concept to deploy a canary version using the second method: deploying new changes as a subset.

Argo Rollouts configuration for deploying new changes using a subset

Let’s say you have a Kubernetes service called rollouts-demo-svc and a deployment resource called rollouts-demo-deployment (code below). You need to follow these steps to configure the canary deployment.

Code for Service.yaml
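Since the original post shows this as an image, here is a minimal sketch of what service.yaml might contain; the port numbers and labels are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo-svc
spec:
  selector:
    app: rollouts-demo      # assumed pod label, shared with the Deployment below
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```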

Code for deployment.yaml
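And a corresponding sketch of deployment.yaml, assuming the public argoproj/rollouts-demo demo image (the blog later switches its tag from blue to red):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rollouts-demo-deployment
spec:
  replicas: 5               # the Rollout inherits this count unless HPA overrides it
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue
          ports:
            - containerPort: 8080
```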

Step 1- Set up the virtual service and destination rule in Istio

Set up the virtual service by specifying the back-end destination for the HTTP traffic coming from the Istio gateway. In our virtual service, rollouts-demo-vs2, we have specified the back-end service as rollouts-demo-svc, with two subsets (stable and canary) for the respective deployment sets. We have set the traffic weight rules so that 100% of the traffic goes to the stable version and 0% goes to the canary version.
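A sketch of rollouts-demo-vs2 along those lines; the gateway name and route name are assumptions, while the host, subsets, and weights follow the description above:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: rollouts-demo-vs2
spec:
  gateways:
    - rollouts-demo-gateway    # assumed Istio gateway name
  hosts:
    - "*"
  http:
    - name: primary            # route name referenced from the Rollout spec
      route:
        - destination:
            host: rollouts-demo-svc
            subset: stable
          weight: 100
        - destination:
            host: rollouts-demo-svc
            subset: canary
          weight: 0
```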

As Istio is responsible for the traffic split, we will see later how Argo Rollouts updates this virtual service resource with the new traffic configuration specified in the canary specification.

Now, we have to define the subsets in the destination rule. In the rollout-destrule resource below, we have defined the canary and stable subsets and referred to the Argo Rollout resource called rollouts-demo.
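A sketch of rollout-destrule. With subset-level traffic splitting, Argo Rollouts injects its rollouts-pod-template-hash label into these subsets at runtime, so the manifest only needs the shared app label (assumed here):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: rollout-destrule
spec:
  host: rollouts-demo-svc
  subsets:
    - name: canary             # Argo Rollouts patches in the canary pod-template hash
      labels:
        app: rollouts-demo
    - name: stable             # Argo Rollouts patches in the stable pod-template hash
      labels:
        app: rollouts-demo
```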

In the next step, we will set up the Argo Rollout resource. 

Step 2- Set up the Argo Rollout resource

Two important items should be noted in the Rollout spec's canary strategy: the declaration of the Istio virtual service and destination rule, and the traffic increment strategy.

You can learn more about the Argo Rollout spec in the official documentation.

In our Argo Rollout resource, rollouts-demo, we have provided the deployment (rollouts-demo-deployment) in the workloadRef spec. In the canary spec, we have referred to the virtual service (rollouts-demo-vs2) and destination rule (rollout-destrule) created in the earlier step.

We have also specified the traffic rules to redirect 20% of the traffic to the canary pods and then pause for manual promotion.

We have added this manual pause so that, in a production environment, the Ops team can verify that all the vital metrics and KPIs, such as CPU, memory, latency, and throughput of the canary pods, are within an acceptable range.

Once we manually promote the release, the canary traffic will increase to 40%. We then wait 10 seconds before increasing the traffic to 60%. The process continues until the traffic to the canary pods reaches 100% and the stable pods are deleted.
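Putting steps 1 and 2 together, a sketch of the Rollout resource might look like the following; the field layout follows Argo Rollouts' Istio subset-splitting convention, and the pod label and route name are assumptions carried over from the earlier sketches:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: rollouts-demo        # assumed label, matching the Deployment's pod template
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rollouts-demo-deployment
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: rollouts-demo-vs2
            routes:
              - primary         # assumed route name from the virtual service sketch
          destinationRule:
            name: rollout-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 20
        - pause: {}             # indefinite pause; wait for manual promotion
        - setWeight: 40
        - pause: {duration: 10} # seconds
        - setWeight: 60
        - pause: {duration: 10}
        - setWeight: 80
        - pause: {duration: 10}
```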

Once you have deployed all the resources from steps 1 and 2 and accessed the application through the Istio ingress IP in a browser, you will see an output like the one below.

Istio ingress IP from the browser

You can run the command below to understand how the pods are handled by Argo Rollouts. 
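Assuming the Argo Rollouts kubectl plugin is installed, the command would be along these lines:

```shell
kubectl argo rollouts get rollout rollouts-demo --watch
```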

how the pods are handled by Argo Rollouts

Validating canary deployment

Let’s say the developers have made new changes and created a new image that needs to be tested. For our case, we will modify the Deployment manifest (rollouts-demo-deployment) by changing the image value from blue to red (refer to the image below).
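The change itself is a one-line edit in the pod template; assuming the public demo image, it would look roughly like this:

```yaml
      containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:red   # changed from argoproj/rollouts-demo:blue
```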

Once you deploy the updated rollouts-demo-deployment, Argo Rollouts will understand that new changes have been introduced into the environment. It will then start creating new ‘canary’ pods and allow 20% of the traffic to reach them. Refer to the image below:

Validating canary deployment

Now, if you analyze the virtual service spec by running the following command, you will see that Argo has updated the traffic percentage to the canary from 0% to 20% (as per the Rollout spec).
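For example, a plain kubectl query against the virtual service shows the rewritten weights:

```shell
kubectl get virtualservice rollouts-demo-vs2 -o yaml
```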

updated the traffic percentage to canary

Gradually, 100% of the traffic will be shifted to the new version, and the older/stable pods will be terminated.

Traffic shifted to the new version and the older/stable pods terminated

In advanced cases, the DevOps team may need to control the scaling of the canary pods. The idea is not to create the full replica count at each gradual traffic shift, but to create pods based on specific criteria. In those cases, we need HorizontalPodAutoscaler (HPA) to handle the scaling of the canary pods.

Scaling of pods during canary deployment using HPA

Kubernetes HPA is used to increase or decrease the number of pods based on load. HPA can also be used to control the scaling of pods during canary deployment: HorizontalPodAutoscaler overrides the Rollout's behaviour for scaling of pods.

We have created and deployed the following HPA resource: hpa-rollout-example. In this resource, we have referenced the Argo Rollout resource rollouts-demo as the scale target. HPA will create two replicas at the start. If CPU utilization rises above 10%, more pods will be created, up to a maximum of six replicas.
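A sketch of hpa-rollout-example matching that description. HPA can target a Rollout directly because Rollouts implement the Kubernetes scale subresource:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-rollout-example
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                  # HPA scales the Rollout, not the Deployment
    name: rollouts-demo
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 10   # scale out when average CPU exceeds 10%
```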

In our case, when we deployed a canary, only two replicas were created at first (instead of the five specified in the Rollout).

Scaling of pods during canary deployment using HPA

Validating scaling of pods by HPA by increasing synthetic load

We can run the following command to increase the load on the pods.
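One common way to generate synthetic load, borrowed from the Kubernetes HPA walkthrough, is a busybox pod that requests the service in a loop (the service name matches the one used above):

```shell
kubectl run load-generator --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://rollouts-demo-svc; done"
```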

You can use the following command to observe the CPU utilization of the pods created by HPA.
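For instance (assuming metrics-server is installed in the cluster):

```shell
# watch current vs. target CPU utilization as seen by the HPA
kubectl get hpa hpa-rollout-example --watch

# per-pod CPU and memory usage
kubectl top pods
```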

Once the load rises above the 10% target, in our case to 14% (refer to the image below), new pods will be created.

CPU utilization of the pods created by HPA

Many metrics, such as latency or throughput, can be used by HPA as criteria for scaling the pods up or down.

Conclusion

As the pace of releasing software increases with the maturity of the CI/CD process, new complications will emerge, and so will new requirements for the DevOps team to tackle them. Similarly, as the canary deployment strategy is adopted rapidly by DevOps teams, new challenges of scale and traffic management emerge in gaining granular control over the rapid release process and infrastructure cost.


Debasree Panda

Debasree is the CEO of IMESH. He understands customer pain points in cloud and microservice architecture. Previously, he led product marketing and market research teams at Digitate and OpsMx, where he created a multi-million dollar sales pipeline. He has helped open-source solution providers such as Tetrate, OtterTune, and Devtron design GTM from scratch and achieve product-led growth. He firmly believes serendipity happens to diligent and righteous people.
