Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit Istio Sidecar Scope to reduce memory and make cluster more scalable #3052

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

han-steve
Copy link

Pull Request Template for Kubeflow Manifests

✏️ Summary of Changes

Describe the changes you have made, including any refactoring or feature additions.

Adding a new istio sidecar resource to limit the sidecar's egress visibility to unnecessary services.

We (Roblox) have been running Kubeflow in production for a long time, and we are noticing that the istio sidecar memory is almost 1GB now due to the amount of services in the cluster that has to be cached in each sidecar. This adds up to over 2 TB of memory in total. This change limits the caching of cluster services in each sidecar, thus helping the scalability of the cluster.

This change can save TBs of memory and spare our DNS services. But I want to ask the community to see if there are any istio-enabled egress communication from kubeflow pods that we haven't considered. As far as we know
Communications to Notebook and Pipeline backends go through the ingress gateway instead of directly inside the cluster, so that won't matter
Communications to kserve models go through cluster ingress gateway
All other CRD-based workloads don't need any egress communication

🐛 Related Issues

Link any issues that are resolved or affected by this PR.

knative/serving#12917 We are facing this issue where each sidecar is pinging DNS to resolve the cluster ingressgateway ip, essentially DDOSing our DNS. Removing the ExternalName service for cluster ingress gateway from the sidecars would resolve this problem.

✅ Contributor Checklist


You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Slack message link: https://cloud-native.slack.com/archives/C073W572LA2/p1741893411623659

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

juliusvonkohout and others added 2 commits March 14, 2025 11:23
* istio proxy version 1.24.3

Signed-off-by: Julius von Kohout <[email protected]>

* Update install.yaml

Signed-off-by: Julius von Kohout <[email protected]>

---------

Signed-off-by: Julius von Kohout <[email protected]>
Signed-off-by: Steve Han <[email protected]>
Signed-off-by: Steve Han <[email protected]>
Signed-off-by: Steve Han <[email protected]>
@han-steve han-steve force-pushed the steve-patch/limit-istio-scope branch from 58bbc19 to c021c15 Compare March 14, 2025 18:23
@juliusvonkohout
Copy link
Member

Thank you for the PR.

/ok-to-test

@juliusvonkohout juliusvonkohout added this to the 1.10.1 milestone Mar 16, 2025
spec:
egress:
- hosts:
- "./*" # use mTLS within the namespace
Copy link
Member

@juliusvonkohout juliusvonkohout Mar 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that block stuff that is not on the servicemesh by default ? E.g. talking from your jupyterlab to a deployment in the same namespace with istio-injection disabled. Or egress to a non-istio native kubernetes service in another namespaces or just the internet in general. Because that must be still allowed. Maybe we could just block inter kubeflow-profile namespace communication by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants