Open-source Istio service mesh is a powerful tool for improving the resiliency and reliability of your multicloud and multicluster infrastructure. Still, some issues can send SREs and DevOps teams on a long search for solutions. That’s why we at the IMESH team are starting to post about such issues and their resolutions to help you adopt the Istio service mesh.
Issue Summary wrt Istio
Recently, one of our clients reported that their SREs were observing sporadic 503 UC errors on the Istio sidecar. The issue affected inbound requests across environments such as production, integration, and staging.
The team encountered the following symptoms:
- 503 UC Errors: The upstream connection terminated before the response started.
- High-load Errors: The errors occurred intermittently across multiple services, primarily under high-load conditions.
Once the IMESH team was notified, we quickly acknowledged and started diagnosing the issue.
Root Cause Analysis of 503 UC Errors
We analyzed application logs, proxy logs, and tcpdump of the loopback interface to identify the root cause accurately.
The issue stems from the Istio proxy reusing expired or inactive HTTP connections when communicating with upstream services.
Below is a quick technical breakdown of our observations.
- Istio Proxy Behaviour:
- Istio proxies establish multiple TCP connections to upstream services.
- These connections are reused for various requests to enhance performance and reduce connection overhead.
- By default, Istio proxies maintain these connections for up to 1 hour.
- Upstream HTTP Keep-Alive Timeout:
- Upstream services enforce their own HTTP keep-alive timeout, often configured to a much shorter duration (e.g., 5 seconds) to close idle connections and free up resources.
- After the keep-alive timeout elapses, the upstream service sends a FIN packet to close the connection. We can observe this in the tcpdump below.
Note: HTTP Keep-alive timeout: 10s
- Reuse of Expired Connections:
- The Istio sidecar may attempt to reuse these expired idle connections for new requests, unaware that the connection is no longer valid.
- Race Condition Between FIN and New Requests:
- A race condition arises when:
- The FIN packet from the upstream service (indicating connection closure) is delayed, still in flight, or not yet processed by the proxy.
- The Istio proxy sends a new request on the same expired connection.
- In this scenario, the upstream service rejects the new request with an RST packet, terminating the connection and leading to a 503 UC error.
- We can observe the race condition in the tcpdump below.
Note: HTTP Keep-alive timeout: 5s
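To make the breakdown above concrete, here is a minimal Python sketch of the stale-connection reuse, using only the standard library. It is an illustration under stated assumptions, not the client's setup: the toy server, the `KEEPALIVE_S` value, and the helper names are all hypothetical, and the naive client stands in for a proxy that has not yet processed the upstream's FIN. The real race window is much narrower; here the client simply waits past the keep-alive timeout so the failure is easy to observe.

```python
import socket
import threading
import time

# Hypothetical upstream keep-alive timeout, shortened so the demo runs quickly.
KEEPALIVE_S = 0.2
RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"


def serve_once(listener):
    """Toy upstream: answer requests, close the connection after KEEPALIVE_S idle."""
    conn, _ = listener.accept()
    conn.settimeout(KEEPALIVE_S)
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(RESPONSE)
    except socket.timeout:
        pass  # idle for too long: fall through and close, which sends a FIN
    finally:
        conn.close()


def request(sock):
    """Send one request on an existing connection; True if a 200 came back."""
    try:
        sock.sendall(b"GET / HTTP/1.1\r\nHost: demo\r\n\r\n")
        return sock.recv(4096).startswith(b"HTTP/1.1 200")
    except OSError:
        # Connection reset or broken pipe: the upstream had already closed.
        return False


def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    server.start()

    client = socket.create_connection(listener.getsockname())
    first = request(client)        # fresh connection
    time.sleep(KEEPALIVE_S * 3)    # stay idle past the upstream keep-alive
    server.join(timeout=2)         # upstream has closed its side by now
    second = request(client)       # reuse the now-stale connection
    client.close()
    listener.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

The first request succeeds on the fresh connection, while the second one, sent on the reused connection after the upstream has closed it, gets no response. This is the condition that the proxy reports as a 503 UC.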
The impact of such configurations is:
- Increased error rates (503 UC errors in Istio), particularly during high load or traffic spikes.
- Reduced reliability of services, with potential degradation in the overall user experience.
- Operational overhead due to debugging and incident handling.
Proposed Solutions by IMESH
- Connection Timeout Synchronization:
- Reduce the Istio proxy’s default connection idle timeout (currently 1 hour) so that it aligns with, and stays slightly below, the upstream services’ HTTP keep-alive timeout (e.g., 60 seconds).
- This ensures the proxy proactively closes connections before the upstream server does.
OR
- Increase the application’s HTTP keep-alive timeout (e.g., to 60 seconds or higher). The higher the keep-alive value, the lower the chance that the upstream closes an idle connection just as the proxy reuses it.
- Note: The above two can be combined to eliminate race conditions.
- Disable Keep-Alive
- Disable HTTP keep-alive altogether (tested with Sanic), so that the application closes the connection after each request. This may degrade performance, as a new connection is established for every request.
- Error Handling and Retries:
- Enhance Istio retry policies, with the help of an EnvoyFilter, to gracefully retry requests that fail because of connection reuse issues. However, this carries a risk of idempotency issues, depending on the particular server framework.
- Implement circuit breakers to reduce the load on problematic upstream services during traffic spikes.
- Monitoring Enhancements:
- Set up alerts to monitor for spikes in 503 UC errors and respond promptly.
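The timeout synchronization proposed above can be sketched in Python as follows. The block builds a `DestinationRule` manifest that sets `connectionPool.http.idleTimeout`, and adds a small sanity check that the proxy's idle timeout stays below the upstream keep-alive. The helper names, the rule name, the host, and the 30-second and 60-second values are assumptions for illustration only. Whether this setting also covers the sidecar-to-application hop in your environment should be verified against your Istio version.

```python
import json


def destination_rule(name, host, idle_timeout_s):
    """Build a DestinationRule manifest that caps the HTTP connection idle timeout.

    `name` and `host` are placeholders; substitute your own service values.
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "connectionPool": {
                    "http": {"idleTimeout": f"{idle_timeout_s}s"},
                },
            },
        },
    }


def proxy_closes_first(proxy_idle_s, upstream_keepalive_s):
    """True if the proxy drops idle connections before the upstream does."""
    return proxy_idle_s < upstream_keepalive_s


if __name__ == "__main__":
    # Hypothetical values: upstream keep-alive of 60s, proxy idle timeout of 30s.
    assert proxy_closes_first(30, 60)
    manifest = destination_rule(
        "demo-idle-timeout", "demo.default.svc.cluster.local", 30
    )
    # kubectl accepts JSON manifests, e.g. `kubectl apply -f -` on this output.
    print(json.dumps(manifest, indent=2))
```

Keeping the proxy's value strictly below the application's keep-alive means the proxy retires an idle connection first, so it never picks up a connection that the upstream is about to close.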
Hassle-free Istio support and maintenance by IMESH
As you adopt open-source Istio more widely in an enterprise setup, you may encounter some issues or bugs. The Istio community is already doing a fantastic job creating new features and fixing bugs and issues. However, some problems are critical and can disrupt services. In such cases, feel free to contact us for dedicated and quick enterprise Istio support.