Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube/controller: Prevent leaking stale endpoints #55171

Merged
merged 2 commits into from
Feb 20, 2025

Conversation

dwj300
Copy link
Member

@dwj300 dwj300 commented Feb 19, 2025

Please provide a description of this PR:
When selecting pods via a ServiceEntry, there are times when the pod may not have an IP address:

  1. When it is first created
  2. When it is Evicted/Failed.

The code today handles (1), however there exists a case when the IP is removed from the Pod in the same event that marks it with a terminal (Failed) status. Due to the cross-controller interaction (between the kube registry and serviceentry registry), the WorkloadInstance needs to have a valid IP, as the podCache and endpointShards are both keyed by address. This PR adds logic to fetch the oldIP from the ipByPods cache, and uses that when building the WorkloadInstance.

Fixes #54997

@istio-testing
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Feb 19, 2025
@istio-policy-bot
Copy link

😊 Welcome @dwj300! This is either your first contribution to the Istio istio repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 19, 2025
@dwj300
Copy link
Member Author

dwj300 commented Feb 19, 2025

/test all

@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch from f508451 to e3b0ba5 Compare February 19, 2025 00:31
@@ -148,8 +148,18 @@ func (pc *PodCache) onEvent(old, pod *v1.Pod, ev model.Event) error {
ip := pod.Status.PodIP
// PodIP will be empty when pod is just created, but before the IP is assigned
// via UpdateStatus.
// In the case of a pod being created, we should not add it to the cache until it is ready.
// However, if the pod *used to* have an IP, we do need to actually delete it.
if len(ip) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the ip is missing when being evicted. Then you can allow ip unset if this is a deletion event

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the problem is that all the caches (podCache, and endpointShardz) are indexed by podIP, not pod name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a map ipByPods map[types.NamespacedName]string

Copy link
Member

@hzxuzhonghu hzxuzhonghu Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever this effect is same, but checking delete may reduce map quering

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this! But yes, this PR checks in ipByPods, but even for "non deletions" (i.e. Failed), we need to check in the map. So we only do so if len(ip) == 0 already.

Change-Id: Ibe1264bfc48b4dee7e52964ad19cee9659631f1c
@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch 3 times, most recently from b92c37b to adf530a Compare February 19, 2025 17:38
@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch from adf530a to 71de47f Compare February 19, 2025 17:54
@dwj300 dwj300 marked this pull request as ready for review February 19, 2025 20:04
@dwj300 dwj300 requested a review from a team as a code owner February 19, 2025 20:04
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Feb 19, 2025
@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch from 71de47f to 14e3aa6 Compare February 19, 2025 20:08
Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I think a test covering this would go a long way. We have a lot of similar tests+infra for this exact type of test in pilot/pkg/serviceregistry/serviceregistry_test.go already that would be a good fit

@@ -149,7 +149,14 @@ func (pc *PodCache) onEvent(old, pod *v1.Pod, ev model.Event) error {
// PodIP will be empty when pod is just created, but before the IP is assigned
// via UpdateStatus.
if len(ip) == 0 {
return nil
// However, in the case of an Eviction, the event that marks the pod as Failed may *also* have removed the IP.
// If the pod *used to* have an IP, then we need to actually delete it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking to myself...

So the event may be "update" but then we hit deleteIP anyways due to shouldPodBeInEndpoints and translate to a delete.

deleteIP already handles the case of "IP was removed" since its a map of ip->[]pods.

We then fallthrough to notifyWorkloadHandlers, build an EP with no IP but WorkloadInstance is identified by name not IP. This looks good ✔️

@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch from 14e3aa6 to 250ea78 Compare February 19, 2025 21:32
Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍

Fixes istio#54997

Change-Id: I59d90f775d86821060588f446e4d50d77ab97921
@dwj300 dwj300 force-pushed the dwj--stale-endpoints branch from 250ea78 to c3bfbc0 Compare February 20, 2025 00:36
@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 20, 2025
@istio-testing istio-testing merged commit 064cba1 into istio:master Feb 20, 2025
28 checks passed
dwj300 added a commit to airbnb/istio that referenced this pull request Feb 20, 2025
Backported from istio#55171

* pod.go: unexport ipByPods
* kube/controller: Prevent leaking stale endpoints

Fixes istio#54997

Change-Id: If9d45d7563e9e94ec0bbd477dfb0f57e7608e9a9
@dwj300 dwj300 added cherrypick/release-1.23 Set this label on a PR to auto-merge it to the release-1.23 branch cherrypick/release-1.24 Set this label on a PR to auto-merge it to the release-1.24 branch labels Feb 20, 2025
@istio-testing
Copy link
Collaborator

In response to a cherrypick label: #55171 failed to apply on top of branch "release-1.23":

Applying: pod.go: unexport ipByPods
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
Applying: kube/controller: Prevent leaking stale endpoints
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
M	pilot/pkg/serviceregistry/serviceregistry_test.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/serviceregistry_test.go
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
CONFLICT (content): Merge conflict in pilot/pkg/serviceregistry/kube/controller/pod.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 kube/controller: Prevent leaking stale endpoints

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: new issue created for failed cherrypick: #55195

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: #55171 failed to apply on top of branch "release-1.23":

Applying: pod.go: unexport ipByPods
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
Applying: kube/controller: Prevent leaking stale endpoints
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
M	pilot/pkg/serviceregistry/serviceregistry_test.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/serviceregistry_test.go
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
CONFLICT (content): Merge conflict in pilot/pkg/serviceregistry/kube/controller/pod.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 kube/controller: Prevent leaking stale endpoints

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: new issue created for failed cherrypick: #55196

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: #55171 failed to apply on top of branch "release-1.24":

Applying: pod.go: unexport ipByPods
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
Applying: kube/controller: Prevent leaking stale endpoints
Using index info to reconstruct a base tree...
M	pilot/pkg/serviceregistry/kube/controller/pod.go
M	pilot/pkg/serviceregistry/serviceregistry_test.go
Falling back to patching base and 3-way merge...
Auto-merging pilot/pkg/serviceregistry/serviceregistry_test.go
Auto-merging pilot/pkg/serviceregistry/kube/controller/pod.go
CONFLICT (content): Merge conflict in pilot/pkg/serviceregistry/kube/controller/pod.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 kube/controller: Prevent leaking stale endpoints

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: new issue created for failed cherrypick: #55197

dwj300 added a commit to airbnb/istio that referenced this pull request Feb 20, 2025
* pod.go: unexport ipByPods

Change-Id: Ibe1264bfc48b4dee7e52964ad19cee9659631f1c

* kube/controller: Prevent leaking stale endpoints

Fixes istio#54997

Change-Id: I59d90f775d86821060588f446e4d50d77ab97921
dwj300 added a commit to airbnb/istio that referenced this pull request Feb 20, 2025
Backported from istio#55171

* pod.go: unexport ipByPods
* kube/controller: Prevent leaking stale endpoints

Fixes istio#54997

Change-Id: If9d45d7563e9e94ec0bbd477dfb0f57e7608e9a9
dwj300 added a commit to airbnb/istio that referenced this pull request Feb 21, 2025
* pod.go: unexport ipByPods

Change-Id: Ibe1264bfc48b4dee7e52964ad19cee9659631f1c

* kube/controller: Prevent leaking stale endpoints

Fixes istio#54997

Change-Id: I59d90f775d86821060588f446e4d50d77ab97921
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking area/perf and scalability cherrypick/release-1.23 Set this label on a PR to auto-merge it to the release-1.23 branch cherrypick/release-1.24 Set this label on a PR to auto-merge it to the release-1.24 branch feature/Multi-cluster issues related with multi-cluster support size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stale endpoint persisting in EDS for 24+hours
6 participants