Race condition in route update status with different gateway controllers #1629

Open
ivanmatmati opened this issue Jan 9, 2023 · 9 comments
Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ivanmatmati

What happened:
I was in the process of implementing TCPRoute status management in our gateway controller when I realized that, because any route can refer to multiple different gateways, there can be a race condition when updating its status.

What you expected to happen:
A merge operation should happen to deal with multiple different sources of status updates (i.e. different controllers).

How to reproduce it (as minimally and precisely as possible):
For example, suppose we deal with this route:

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TCPRoute
metadata:
  name: route1
  namespace: ns1
spec:
  parentRefs:
    - name: mygateway
    - name: othergateway
  rules:
    - backendRefs:
        - kind: Service
          name: tcp-echo-server
          port: 9000

mygateway is managed by my controller, but othergateway is not. When this route is created, my controller and the other vendor's controller will both try to update the status of this route, with a likely race condition: each controller reads the whole status, modifies its part, and writes the whole status back, so whichever write lands second can clobber the first.

Anything else we need to know?:

@ivanmatmati ivanmatmati added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2023
@robscott
Member

robscott commented Jan 9, 2023

cc @apelisse

@shaneutt shaneutt added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 9, 2023
@apelisse

Pretty sure the docs have a page about this but I couldn't find it today, so I'll quickly reiterate some of the points.

Updating status is not really different from updating spec (except that you run the operations against the status subresource); the operations you can use to deal with concurrent updates are the same:

  1. You can do an update: GET the object (which gives you its resourceVersion), make your changes, and UPDATE (the request should still carry the resourceVersion you got). If the object has changed in the meantime (and its resourceVersion has been bumped), the request is automatically rejected, forcing you to try again. That's the very basics of optimistic locking (a sketch follows this list).
  2. You can send a patch and hope for the best; the patch will be merged (I won't go into the depths of the merge semantics here, but ...). Lists are tricky, they usually don't merge well, so you're still very likely to cause problems.
  3. You can use server-side apply, which also merges on the server but usually works better. Lists are still problematic, to be honest; in general, try using the merge-strategy markers for your lists (https://kubernetes.io/docs/reference/using-api/server-side-apply/#merge-strategy).
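
A minimal sketch of option 1, assuming a controller-runtime client and the Gateway API v1alpha2 Go types; setMyParentStatuses is a hypothetical placeholder for whatever status mutation your controller actually performs:

import (
    "context"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/retry"
    "sigs.k8s.io/controller-runtime/pkg/client"
    gwv1a2 "sigs.k8s.io/gateway-api/apis/v1alpha2"
)

// updateRouteStatus runs the GET / mutate / UPDATE cycle with optimistic
// locking: if another controller writes the status between our GET and our
// UPDATE, the UPDATE fails with a Conflict error and the closure is retried
// against a fresh copy instead of clobbering the other controller's entries.
func updateRouteStatus(ctx context.Context, c client.Client) error {
    key := types.NamespacedName{Namespace: "ns1", Name: "route1"}
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        route := &gwv1a2.TCPRoute{}
        // GET also fetches the object's current resourceVersion.
        if err := c.Get(ctx, key, route); err != nil {
            return err
        }
        // Mutate only the RouteParentStatus entries this controller owns.
        setMyParentStatuses(route)
        // UPDATE goes against the status subresource and carries the
        // resourceVersion from the GET; a stale version is rejected.
        return c.Status().Update(ctx, route)
    })
}

// setMyParentStatuses is a hypothetical stand-in for the controller's real
// status mutation.
func setMyParentStatuses(route *gwv1a2.TCPRoute) { /* ... */ }

Option 3 would instead go through c.Status().Patch with client.Apply and a per-controller field owner, letting the server merge entries from different owners, but as noted it only merges lists cleanly when usable merge-strategy markers exist.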

@youngnick
Contributor

As @robscott noted in the community meeting today, the Route status is currently designed so that each list item can be updated independently by the owning controller (because each entry is keyed by both controllerName and parentRef). However, we can't easily use the merge-strategy markers, because one of those keys is not a scalar (parentRef is a struct).
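
For reference, with the example route from the issue, the status list ends up with one entry per (parentRef, controllerName) pair, roughly like this sketch using the v1alpha2 Go types (the controller names are made up):

// Two entries in route1's status.parents, each owned by a different
// controller; no single scalar field distinguishes them, which is why a
// plain listMapKey marker doesn't fit today.
status := gwv1a2.RouteStatus{
    Parents: []gwv1a2.RouteParentStatus{
        {
            ParentRef:      gwv1a2.ParentReference{Name: "mygateway"},
            ControllerName: "example.com/my-controller", // hypothetical name
        },
        {
            ParentRef:      gwv1a2.ParentReference{Name: "othergateway"},
            ControllerName: "othervendor.io/their-controller", // hypothetical name
        },
    },
}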

So it seems like we have two options:

  • Add some field that disambiguates the RouteParentStatus struct with a single key (maybe some auto-generated concatenation of the ParentRef fields?), then use that as the merge key (see the sketch after this list). This would let the api-machinery mechanisms solve the problem correctly, but it's a fair amount of API churn.
  • Document that changes to status must always be done with a full update cycle (that is, GET the object, mutate the status to add your changes, then UPDATE the status, retrying if a resourceVersion conflict causes the update to be rejected). This puts more work onto maintainers, but requires no API changes (and so could be done more quickly).

I'm personally in favor of option 2 (since that's what we've done in all the controllers I've worked on anyway), but I'm interested in others' thoughts here.
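
To make option 1 concrete, here is a hypothetical sketch of the API change using the server-side-apply list markers; the parentKey field and its derivation are invented for illustration, not a proposed design:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical: RouteParentStatus gains a single scalar key so the list can
// carry +listType=map / +listMapKey markers, letting server-side apply merge
// entries from different controllers without clobbering each other.
type RouteStatus struct {
    // +listType=map
    // +listMapKey=parentKey
    Parents []RouteParentStatus `json:"parents"`
}

type RouteParentStatus struct {
    // parentKey is a hypothetical auto-generated scalar, e.g. a hash or
    // deterministic concatenation of the parentRef fields.
    ParentKey string `json:"parentKey"`

    // Existing fields, with the same types as in the current v1alpha2 API.
    ParentRef      ParentReference    `json:"parentRef"`
    ControllerName GatewayController  `json:"controllerName"`
    Conditions     []metav1.Condition `json:"conditions,omitempty"`
}

With such a key in place, each controller could apply just its own entry under its own field owner and let the apiserver perform the merge.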

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2023
@robscott robscott modified the milestones: v0.8.0, v1.0.0 Apr 10, 2023
@shaneutt shaneutt removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2023
@shaneutt shaneutt moved this from Triage to Backlog in Gateway API: The Road to GA Apr 10, 2023
@shaneutt
Member

We would like to have this improvement and would definitely welcome a contributor taking this on:

/help

We don't expect this needs to hold up our GA Release, but we've placed it in v1.0.0 as it would be nice to have for that release.

@k8s-ci-robot
Contributor

@shaneutt:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

We would like to have this improvement and would definitely welcome a contributor taking this on:

/help

We don't expect this needs to hold up our GA Release, but we've placed it in v1.0.0 as it would be nice to have for that release.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Apr 10, 2023
@shaneutt shaneutt removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 10, 2023
@shaneutt shaneutt removed this from the v1.0.0 milestone Jul 10, 2023
@shaneutt
Member

We're getting closer to the GA release and need to reduce what we say is needed for it so we can focus on release blockers, so we're pulling this from the milestone (but we still want to get it done post-GA).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2024