Limit Istio Sidecar Scope to reduce memory and make cluster more scalable #3052
Pull Request Template for Kubeflow Manifests
✏️ Summary of Changes
This PR adds a new Istio Sidecar resource that limits each sidecar's egress visibility so that it no longer includes unnecessary services.
We (Roblox) have been running Kubeflow in production for a long time, and we are noticing that Istio sidecar memory is now almost 1 GB per sidecar because of the number of services in the cluster that each sidecar has to cache. This adds up to over 2 TB of memory in total. This change limits how many cluster services each sidecar caches, which helps the cluster scale.
This change can save TBs of memory and spare our DNS services. However, I want to ask the community whether there is any Istio-enabled egress communication from Kubeflow pods that we haven't considered. As far as we know:
Communication with the Notebook and Pipeline backends goes through the ingress gateway rather than directly inside the cluster, so it is unaffected
Communication with KServe models goes through the cluster ingress gateway
All other CRD-based workloads don't need any egress communication
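For readers unfamiliar with the Sidecar resource, the kind of scoping described above can be sketched roughly as follows. This is an illustrative sketch, not necessarily the exact manifest in this PR. It assumes `istio-system` is the mesh root namespace (so a Sidecar named `default` there acts as the mesh-wide default), and the host list is an assumption that should be adjusted to whatever egress your workloads really need.

```yaml
# Illustrative sketch only; the namespace and host list are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  # Assumed to be the mesh root namespace, which makes this the
  # default Sidecar configuration for every namespace in the mesh.
  namespace: istio-system
spec:
  egress:
  - hosts:
    # Services in the same namespace as the workload.
    - "./*"
    # Services in istio-system, e.g. the ingress gateway.
    - "istio-system/*"
```

With a scope like this, each sidecar only receives configuration for the listed hosts instead of every service in the cluster, which is where the memory savings come from.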
🐛 Related Issues
knative/serving#12917: We are facing this issue, where each sidecar queries DNS to resolve the cluster ingress gateway IP, essentially DDoSing our DNS. Removing the ExternalName service for the cluster ingress gateway from the sidecars' scope would resolve this problem.
✅ Contributor Checklist
Slack message link: https://cloud-native.slack.com/archives/C073W572LA2/p1741893411623659