Many DevOps and cloud engineering teams in enterprises (including our clients) want to consider Cilium to improve network performance. But will adding Cilium help if you are using Istio service mesh?
In our previous blog, we discussed Cilium’s offerings (CNI, Service Mesh, and Gateway API) and how its eBPF-based technology helps increase network speed, security, and observability.
In this blog, we explore whether using Cilium, especially its CNI, alongside Istio service mesh leads to any network performance gain.
First, let us look at what drives this debate in DevOps teams (“Let us try Cilium; I heard at an event that it is a great platform for networking”).
Istio degrades the network performance in Kubernetes. Does it?
Let’s face it: every DevOps and cloud team complains or is concerned about latency degradation due to Istio. A performance load test report for Istio 1.21, from the CNCF community, states that the P99 latency is 0.248 ms under a load of 70,000 mesh-wide requests per second on a mesh consisting of 1000 services and 2000 sidecars.
Source: https://istio.io/latest/docs/ops/deployment/performance-and-scalability/
But in reality, when the workload is huge and the mesh consists of services with intricate network paths, Ops teams see their latency shoot past the threshold limits. This happens because of the additional hops through the sidecar and because each sidecar stores global information about the meshed services for service discovery. Besides that, when mTLS is used, encrypting and decrypting the data in transit adds further load.
ERO Framework for improving Istio performance
We believe that DevOps and cloud teams should try to optimize the Istio service mesh to improve latency instead of adding more network software like Cilium to an already crowded tech stack.
Our CTO, Ravi Verma, proposes a new framework called ERO, which stands for Expectation (E), Reality (R), and Optimization (O). It guides DevOps and cloud teams to think straight and build efficient and reliable software networks in the Kubernetes space.
- Expectation (E):
- While using Istio or any other software in general, architects must keep the CAP theorem in mind. The CAP theorem, or Brewer’s theorem, states that a distributed system can guarantee at most two of three properties at the same time: Consistency, Availability, and Partition tolerance. In the same spirit, using Istio service mesh for your microservices comes with both utility and tradeoffs. If you want to manage the network and security without hassle, you should expect some decrease in performance.
- Yes, you can optimize Istio to improve your latency, but you cannot expect a latency gain of 40% or more in complex and large environments. (After all, we are governed by laws such as the CAP theorem.) The same applies to any lightweight service mesh such as Linkerd or Cilium. As their features grow, they will add pressure on network performance. If you want to optimize your Istio service mesh, talk to one of our Istio experts.
- Reality (R):
- With many companies using Cilium and talking widely about it, DevOps teams and SREs using Istio service mesh feel FOMO and are eager to try it out. But there is a reality that many organizations are missing. First, Cilium is famous for its CNI (which is more powerful than popular CNIs such as Kubernetes CNI or Calico CNI) rather than for its service mesh. Second, Cilium CNI in tandem with Istio DOES NOT improve the network performance. Yes, it does not, and we will talk about our experimentation in the second half of the blog.
- Optimization (O):
- The DevOps team using the Istio service mesh should optimize it to improve network latency. They can use strategies like vertical scaling, sidecar optimization, namespace isolation, and geo-selection to optimize their Istio performance.
- Vertical scaling involves adding more CPUs to increase performance, but resources are finite (kernel ticks and cycles), so adding CPUs may stop improving latency after a certain point.
- In sidecar optimization, you restrict each sidecar to the services it actually needs to reach, instead of letting it track every service in the mesh.
- In namespace isolation, you keep all the related services in one namespace and restrict sidecars to looking up services only in that namespace.
- Similarly, geo-selection involves placing microservices in a different cluster that is “geographically” closer to the services they connect to, which improves response time and network latency.
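As an illustration of sidecar optimization and namespace isolation, the sketch below builds an Istio Sidecar resource that limits a namespace's sidecars to services in their own namespace plus the control plane. It is a minimal sketch, not the configuration used in our tests: the namespace name `payments` and the helper `build_sidecar` are hypothetical, and the API version may differ on your Istio release.

```python
import json


def build_sidecar(namespace: str, extra_hosts: tuple[str, ...] = ("istio-system/*",)) -> dict:
    """Return an Istio Sidecar manifest that narrows egress service discovery.

    "./*" means "all services in the Sidecar's own namespace".
    A Sidecar named "default" without a workloadSelector applies to
    every workload in that namespace.
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Sidecar",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {
            "egress": [
                # Only these hosts are pushed to the sidecars as reachable services.
                {"hosts": ["./*", *extra_hosts]}
            ]
        },
    }


if __name__ == "__main__":
    # "payments" is a hypothetical namespace used only for illustration.
    print(json.dumps(build_sidecar("payments"), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.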
Load Testing with Cilium CNI with Istio service mesh
We performed a rigorous load-test experiment to understand the impact of Cilium CNI on the network performance.
For testing, we set the following parameters:
- Services: We set up two services, A and B, with one pod each. External requests are received by service A and forwarded to B.
- Cluster Configuration: We used Azure clusters with three nodes.
- Node Configuration: Each node had 2 vCPUs and 7 GB of memory.
- Load testing software: Fortio
- Load fired: 1000 queries per second over 10 connections for 10 seconds.
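The load parameters above map onto Fortio's `load` command roughly as sketched below. The exact command line we ran is not reproduced here, so treat this as an assumption-labeled reconstruction: the target URL and the helper names `fortio_args` and `expected_requests` are hypothetical, while `-qps`, `-c`, and `-t` are standard Fortio flags.

```python
import shlex


def fortio_args(url: str, qps: int = 1000, connections: int = 10, duration_s: int = 10) -> list[str]:
    """Build the argument list for a Fortio load run with the given parameters."""
    return [
        "fortio", "load",
        "-qps", str(qps),          # target queries per second
        "-c", str(connections),    # number of parallel connections
        "-t", f"{duration_s}s",    # duration of the run
        url,
    ]


def expected_requests(qps: int = 1000, duration_s: int = 10) -> int:
    """Approximate number of requests in one run if the target rate is sustained."""
    return qps * duration_s


if __name__ == "__main__":
    # Hypothetical in-cluster address for service A; replace with your own.
    target = "http://service-a.default.svc.cluster.local:8080/"
    print(shlex.join(fortio_args(target)))
    print("expected requests per run:", expected_requests())
```

At 1000 queries per second for 10 seconds, each run fires about 10,000 requests, so a p99 figure is based on roughly the slowest 100 of them.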
Test Scenarios for load testing of Cilium and Istio
Our intent was to test and compare network latency across scenarios that include or exclude Istio service mesh, Cilium CNI, and Istio Ambient mesh. We also considered specific scenarios that combine Istio service mesh with Cilium CNI, with mTLS both enabled and disabled.
We tested the load on services A and B with the service mesh enabled and disabled, and then with Cilium CNI with and without the service mesh. We also used Ambient Mesh (both eBPF enabled and disabled) for our testing. Please note that eBPF was supported until Ambient Mesh 1.20, but the latest versions of Ambient Mesh don’t support eBPF.
In total, we tested 17 scenarios:
- Kube CNI
- Istio + Kube CNI
- Istio + Kube CNI (mTLS enabled)
- Istio + Cilium CNI
- Istio + Cilium CNI (mTLS enabled by Istio)
- Istio + Cilium CNI (mTLS enabled by Cilium CNI)
- Cilium CNI
- Cilium CNI (mTLS enabled)
- Ambient (without eBPF) + Kube CNI
- Ambient (without eBPF) + Kube CNI (mTLS enabled by Ambient)
- Ambient (without eBPF) + Cilium CNI
- Ambient (without eBPF) + Cilium CNI (mTLS enabled by Ambient)
- Ambient (eBPF-enabled) + Cilium CNI (mTLS enabled by Cilium CNI)
- Ambient (eBPF-enabled) + Cilium CNI (mTLS enabled by Istio Ambient Mesh)
- Ambient (eBPF-enabled) + Azure Kube CNI
- Ambient (eBPF-enabled) + Azure Kube CNI (mTLS enabled by Ambient Mesh)
- Ambient (eBPF-enabled) + Cilium CNI
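For the scenarios marked "mTLS enabled by Istio", one common way to enforce mTLS in a sidecar-based mesh is a mesh-wide PeerAuthentication policy in STRICT mode. The sketch below generates such a manifest. It is an assumption about a typical setup rather than a record of our exact test configuration, and the helper name `build_strict_mtls` is hypothetical.

```python
import json


def build_strict_mtls(root_namespace: str = "istio-system") -> dict:
    """Return a PeerAuthentication manifest that requires mTLS.

    Placed in Istio's root namespace (istio-system by default) with the
    name "default" and no selector, the policy applies mesh-wide.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": root_namespace},
        "spec": {
            # STRICT rejects plain-text traffic between meshed workloads.
            "mtls": {"mode": "STRICT"}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_strict_mtls(), indent=2))
```

Comparing a run with this policy applied against a run without it isolates the cost of encryption from the cost of the extra proxy hops.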
Load Test results of Istio and Cilium CNI
We generated response time histograms for each scenario several times (4-5 runs) using Fortio. For Cilium CNI without mTLS, for example, the histograms across runs suggested that the p99 latency ranges between 3 and 6 ms.
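To make the idea of a p99 range concrete, the sketch below computes the 99th percentile for each run and then reports the lowest and highest value across runs. It uses a simple nearest-rank percentile on raw samples, which is a simplification: Fortio derives percentiles from histogram buckets. The sample latencies are synthetic and are not our measurements.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p99_range(runs: list[list[float]]) -> tuple[float, float]:
    """Return (lowest, highest) p99 latency observed across repeated runs."""
    p99s = [percentile(run, 99) for run in runs]
    return min(p99s), max(p99s)


if __name__ == "__main__":
    # Synthetic latencies in milliseconds, for illustration only.
    run_a = [1.0] * 98 + [3.2, 8.0]
    run_b = [1.1] * 98 + [5.7, 9.5]
    low, high = p99_range([run_a, run_b])
    print(f"p99 range across runs: {low} to {high} ms")
```

Reporting a range rather than a single number is useful here because tail latency varies noticeably between short runs on a shared cluster.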
After running all the test scenarios, we compiled the results in the table below. The last column shows the range of P99 latency. A quick look at these results shows that Kube CNI alone provides good latency, and that latency improves by almost 50% when Kube CNI is replaced with Cilium CNI.
Observations of our experimentation
- The best performer (by p99 latency) is Cilium CNI, followed by Kube CNI, and then eBPF-enabled Ambient mesh with Cilium CNI. However, the latest version of the Istio Ambient mesh doesn’t support eBPF.
- When mTLS is DISABLED, there is no latency improvement when Istio is used with either Kube CNI or Cilium CNI.
- When mTLS is ENABLED:
- Cilium CNI demonstrates the worst performance in terms of latency.
- Similarly, using Cilium CNI for mTLS along with Istio or Ambient degrades the performance.
- We have observed that if Istio is used for mTLS along with Cilium CNI, then there is a minor degradation in performance as compared to Kube CNI. Kube CNI is better than Cilium CNI for Istio.
Conclusion
If you are a small team with a small infrastructure that started its Kubernetes journey recently, using Cilium CNI will provide network performance gains. But if you have hundreds of microservices spread across clusters and the cloud with thousands of workloads, DevOps and cloud teams must evaluate their short-term and long-term goals. Using Cilium CNI may give better results, but if security is a top concern, then enabling mTLS with Cilium will degrade the performance.
If your goals revolve around security compliance, zero trust, and network resiliency or reliability, you can consider Istio service mesh over Cilium CNI. Note: With mTLS enabled, Istio is not only faster than Cilium in our tests, but it also helps you implement many security, network, and observability use cases out of the box. Istio service mesh can also be tuned to optimize the performance. Besides, the Istio community is also releasing a “production-ready” stable Istio Ambient mesh soon, which will be lighter, faster, and more feature-rich than any other service mesh in the market.
If you want support for Istio implementation for network, security, and observability use cases or want to optimize Istio service mesh for best network performance, please contact us.