As enterprises mature in their CI/CD journey, they tend to ship code faster, more safely, and more securely. One essential strategy DevOps teams apply is releasing code progressively to production, also known as canary deployment. Canary deployment is a dependable mechanism for safely releasing application changes, and it provides flexibility for business experiments. It can be implemented using software like Argo Rollouts and Flagger. However, advanced DevOps teams want granular control over traffic and pod scaling while performing canary deployments, in order to reduce overall cost. We have helped a few of our customers (Istio users) achieve advanced traffic management of canary deployments at scale.
We want to share our knowledge with the DevOps community through this blog.
If you want to jump straight into the demo video, our CTO, Ravi Verma, has created a walkthrough on advanced traffic management in Canary for enterprises at scale.
If you are interested, you can check our previous blog: how to implement canary deployment for Kubernetes workloads using Argo Rollouts and Istio.
Before we get started, let us discuss the canary architecture implemented by Argo Rollouts and Istio.
Recap of Canary implementation architecture with Argo Rollouts and Istio
If you use the Istio service mesh, all of your meshed workloads will have an Envoy proxy sidecar attached to the application container in the pod. You may also have an API gateway or Istio ingress gateway to receive incoming traffic from outside the cluster. In such a setup, you can use Argo Rollouts to handle canary deployments. To implement canary deployments, Argo Rollouts provides a CRD called Rollout, which is similar to the Deployment object and is responsible for creating, scaling, and deleting ReplicaSets in Kubernetes.
The canary deployment strategy starts by redirecting a small amount of traffic (say 5%) to the newly deployed app. Based on specific criteria, such as healthy resource utilization of the new canary pods, you can gradually increase the traffic to 100%. The traffic handling for the baseline and canary versions is carried out by the Istio sidecars as per the rules defined in the VirtualService resource. Since Argo Rollouts provides native integration with Istio, it updates the VirtualService resource to increase the traffic to the canary pods.
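Conceptually, the weighted routing that Argo Rollouts rewrites looks like this fragment of an Istio VirtualService (a minimal sketch matching the 5% example above; the full manifests appear later in this post):

http:
- route:
  - destination:
      host: rollouts-demo-svc
      subset: stable
    weight: 95
  - destination:
      host: rollouts-demo-svc
      subset: canary
    weight: 5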
Canary can be implemented in two ways: deploying new changes as a service, or deploying new changes as a subset.
- Deploying new changes as a service
In this method, we can create a new service (called canary) and split the traffic from the Istio ingress gateway between the stable and canary services. Refer to the image below.
You can refer to the yaml file for a sample implementation of deploying a canary with multiple services here. We have created two services: rollouts-demo-stable and rollouts-demo-canary. Each service listens to HTTP traffic for the Argo Rollout resource called rollouts-demo. In the rollouts-demo yaml, we have specified the Istio virtual service resource and the logic to gradually increase the traffic weight from 20% to 40%, 60%, 80%, and eventually 100%.
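To illustrate this method, below is a minimal sketch of what the canary strategy of such a Rollout can look like; the VirtualService name rollouts-demo-vs is our assumption, while canaryService and stableService point at the two services described above.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  strategy:
    canary:
      canaryService: rollouts-demo-canary # traffic is gradually shifted to this service
      stableService: rollouts-demo-stable # service serving the current stable version
      trafficRouting:
        istio:
          virtualService:
            name: rollouts-demo-vs # assumed VirtualService routing between the two services
      steps:
      - setWeight: 20
      - pause: {}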
If you want to know more about this method, you can watch the detailed video.
- Deploying new changes as a subset
In this method, you can have one service but create a new Deployment subset (canary version) pointing to the same service. Traffic can be split between the stable and canary deployment sets using Istio Virtual service and Destination rule resources.
In this blog, we will thoroughly discuss the second method.
Implementing Canary using Istio and Argo Rollouts without changing Deployment resource
There is a misunderstanding among DevOps professionals that Argo Rollouts is a replacement for the Deployment resource, and that services considered for canary deployment have to refer to Argo Rollouts with the Deployment configuration rewritten.
Well, that’s not true.
The Argo Rollout resource provides a section called workloadRef, where existing Deployments can be referenced without making any significant changes to the Deployment or service yamls.
If you use a Deployment resource for a service in Kubernetes, you can provide a reference to it in the Rollout CRD, after which Argo Rollouts will manage the ReplicaSet for that service. Refer to the image below.
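As a quick illustration, the relevant excerpt of a Rollout using workloadRef looks like this (the full manifest appears in Step 2 below):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  workloadRef: # reference an existing Deployment instead of an inline pod template
    apiVersion: apps/v1
    kind: Deployment
    name: rollouts-demo-deployment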
We will use the same concept to deploy a canary version using the second method: deploying new changes as a subset.
Argo Rollouts configuration for deploying new changes using a subset
Let’s say you have a Kubernetes service called rollouts-demo-svc and a deployment resource called rollouts-demo-deployment (code below); you need to follow the three steps below to configure the canary deployment.
Code for service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo-svc
  namespace: istio-argo-rollouts
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: rollouts-demo
Code for deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rollouts-demo-deployment
  namespace: istio-argo-rollouts
spec:
  replicas: 0 # this has to be made 0 once the Argo Rollout is active and functional
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
        ports:
        - name: http
          containerPort: 8080
        resources:
          requests:
            memory: 32Mi
            cpu: 5m
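Assuming the two manifests above are saved as service.yaml and deployment.yaml (the filenames are our choice), they can be applied to the cluster with:

kubectl apply -f service.yaml
kubectl apply -f deployment.yaml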
Step 1: Set up the virtual service and destination rule in Istio
Set up the virtual service by specifying the back-end destination for the HTTP traffic coming from the Istio gateway. In our virtual service, rollouts-demo-vs2, we have specified the back-end service as rollouts-demo-svc, but we have created two subsets (stable and canary) for the respective deployment sets. We have set the traffic weight rules such that 100% of the traffic goes to the stable version and 0% goes to the canary version.
As Istio is responsible for the traffic split, we will see how Argo Rollouts updates this virtual service resource with the new traffic configuration specified in the canary specification.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollouts-demo-vs2
  namespace: istio-argo-rollouts
spec:
  gateways:
  - istio-system/rollouts-demo-gateway
  hosts:
  - "*"
  http:
  - name: route-one
    route:
    - destination:
        host: rollouts-demo-svc
        port:
          number: 80
        subset: stable
      weight: 100
    - destination:
        host: rollouts-demo-svc
        port:
          number: 80
        subset: canary
      weight: 0
Now, we have to define the subsets in the DestinationRule. In the rollout-destrule resource below, we have defined the canary and stable subsets, whose labels select the pods managed by the Argo Rollout resource called rollouts-demo.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: rollout-destrule
  namespace: istio-argo-rollouts
spec:
  host: rollouts-demo-svc
  subsets:
  - name: canary # referenced in canary.trafficRouting.istio.destinationRule.canarySubsetName
    labels: # labels will be injected with the canary rollouts-pod-template-hash value
      app: rollouts-demo
  - name: stable # referenced in canary.trafficRouting.istio.destinationRule.stableSubsetName
    labels: # labels will be injected with the stable rollouts-pod-template-hash value
      app: rollouts-demo
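With both Istio resources applied, a quick sanity check confirms they exist in the namespace:

kubectl get virtualservice,destinationrule -n istio-argo-rollouts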
In the next step, we will set up the Argo Rollout resource.
Step 2: Set up the Argo Rollout resource
Two important items should be noted in the canary strategy of the Rollout spec: the declaration of the Istio virtual service and destination rule, and the traffic increment strategy.
You can learn more about the Argo Rollout spec.
In our Argo Rollout resource, rollouts-demo, we have referenced the deployment (rollouts-demo-deployment) in the workloadRef spec. In the canary spec, we have referred to the virtual service (rollouts-demo-vs2) and destination rule (rollout-destrule) created in the earlier step.
We have also specified the traffic rules to redirect 20% of the traffic to the canary pods and then pause for manual intervention.
We have added this manual pause so that, in a production environment, the Ops team can verify whether all the vital metrics and KPIs of the canary pods, such as CPU, memory, latency, and throughput, are in an acceptable range.
Once we manually promote the release, the canary traffic will increase to 40%. Argo Rollouts will wait 10 seconds before increasing the traffic to 60%. The process will continue until the traffic to the canary pods reaches 100% and the stable pods are deleted.
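The manual promotion itself can be performed with the Argo Rollouts kubectl plugin (assuming the plugin is installed in your environment):

kubectl argo rollouts promote rollouts-demo -n istio-argo-rollouts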
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
  namespace: istio-argo-rollouts
spec:
  replicas: 5
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: rollouts-demo-vs2 # required
            routes:
            - route-one # optional if there is a single route in VirtualService, required otherwise
          destinationRule:
            name: rollout-destrule # required
            canarySubsetName: canary # required
            stableSubsetName: stable # required
      steps:
      - setWeight: 20
      - pause: {}
      - setWeight: 40
      - pause: {duration: 10}
      - setWeight: 60
      - pause: {duration: 10}
      - setWeight: 80
      - pause: {duration: 10}
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: rollouts-demo
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rollouts-demo-deployment
Once you have deployed all the resources in steps 1 and 2 and accessed the app through the Istio ingress IP in a browser, you will see an output like the one below.
You can run the command below to understand how the pods are handled by Argo Rollouts.
kubectl get pods -n <<namespace>>
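If you have the Argo Rollouts kubectl plugin installed, you can also watch the rollout status, its revisions, and the current traffic weight live:

kubectl argo rollouts get rollout rollouts-demo -n <<namespace>> --watch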
Validating canary deployment
Let’s say developers have made new changes and created a new image that needs to be tested. For our case, we will update the Deployment manifest (rollouts-demo-deployment) by modifying the image value from blue to red (refer to the snippet below).
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:red
Once you deploy the updated rollouts-demo-deployment, Argo Rollouts will detect that new changes have been introduced to the environment. It will then start creating new ‘canary’ pods and route 20% of the traffic to them. Refer to the image below:
Now, if you analyze the virtual service spec by running the following command, you will see that Argo Rollouts has updated the traffic percentage to the canary from 0% to 20% (as per the Rollout spec).
kubectl get vs rollouts-demo-vs2 -n <<namespace>> -o yaml
Gradually, 100% of the traffic will be shifted to the new version, and older/stable pods will be terminated.
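Conversely, if the canary’s metrics degrade at any pause step, the release can be aborted and all traffic shifted back to the stable version (again assuming the Argo Rollouts kubectl plugin is installed):

kubectl argo rollouts abort rollouts-demo -n <<namespace>>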
In advanced cases, the DevOps team may need to control the scaling of canary pods. The idea is not to create all the pods per the replica count at each gradual traffic shift, but to create pods based on specific criteria. In those cases, we need a HorizontalPodAutoscaler (HPA) to handle the scaling of canary pods.
Scaling of pods during canary deployment using HPA
Kubernetes HPA is used to increase or decrease the number of pods based on load. HPA can also be used to control the scaling of pods during canary deployment: the HorizontalPodAutoscaler overrides the Rollout’s behaviour for scaling pods.
We have created and deployed the following HPA resource: hpa-rollout-example. In this resource, we have referenced the Argo Rollout resource rollouts-demo in scaleTargetRef. That means HPA will be responsible for creating two replicas at the start; if CPU utilization rises above 10%, more pods will be created, up to a maximum of six replicas.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-rollout-example
  namespace: istio-argo-rollouts
spec:
  maxReplicas: 6
  minReplicas: 2
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: rollouts-demo
  targetCPUUtilizationPercentage: 10
In our case, when we deployed a canary, only two replicas were created at first (instead of the five specified in the Rollout spec).
Validating scaling of pods by HPA using synthetic load
We can run the following command to generate synthetic load against the service.
kubectl run -i --tty load-generator-1 --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://<<service name>>.<<namespace>>; done;"
You can use the following command to observe the CPU utilization of the pods created by HPA.
kubectl get hpa hpa-rollout-example -n <<namespace>> --watch
Once the load rises above 10%, in our case to 14% (refer to the image below), new pods will be created.
Many metrics, such as latency or throughput, can be used by HPA as criteria for scaling pods up or down.
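For example, below is a sketch of an autoscaling/v2 HPA that scales the same Rollout on a custom requests-per-second metric. It assumes a custom metrics adapter (such as Prometheus Adapter) is installed and exposes a metric named http_requests_per_second, so treat the metric name and threshold as assumptions rather than defaults.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-rollout-rps # hypothetical name for this example
  namespace: istio-argo-rollouts
spec:
  minReplicas: 2
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: rollouts-demo
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second # assumed metric exposed by a custom metrics adapter
      target:
        type: AverageValue
        averageValue: "100" # scale out when average RPS per pod exceeds 100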
Conclusion
As the pace of releasing software increases with the maturity of the CI/CD process, new complications will emerge, and so will new requirements for the DevOps team to tackle them. Similarly, as the canary deployment strategy is rapidly adopted by DevOps teams, new challenges of scale and traffic management emerge in gaining granular control over the rapid release process and infrastructure cost.