Race condition in route update status with different gateway controllers #1629

Open
ivanmatmati opened this issue Jan 9, 2023 · 9 comments
Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ivanmatmati

What happened:
I was in the process of implementing TCPRoute status management in our gateway controller when I realized that, because any route can refer to multiple different gateways, there can be a race condition when updating its status.

What you expected to happen:
A merge operation should happen to deal with multiple different sources of status updates (i.e. different controllers).

How to reproduce it (as minimally and precisely as possible):
For example, suppose we deal with this route:

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TCPRoute
metadata:
  name: route1
  namespace: ns1
spec:
  parentRefs:
    - name: mygateway
    - name: othergateway
  rules:
    - backendRefs:
        - kind: Service
          name: tcp-echo-server
          port: 9000

mygateway is managed by my controller, but othergateway is not. When this route is created, my controller and the other vendor's controller will both try to update the status of this route, with a likely race condition: each controller reads the whole status, modifies its part, and writes the whole status back, so whichever write lands second can clobber the first.

Anything else we need to know?:

@ivanmatmati ivanmatmati added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2023
@robscott
Member

robscott commented Jan 9, 2023

cc @apelisse

@shaneutt shaneutt added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 9, 2023
@apelisse

Pretty sure the docs have a page about this but I couldn't find it today, so I'll quickly reiterate some of the points.

Updating status is not really different from updating spec (except that you run the operations against the status subresource); the operations you can use to deal with concurrent updates are the same:

  1. You can do an update: GET the object (which gives you its resourceVersion), make your changes, and UPDATE (the request should still carry the resourceVersion you got). If the object has changed in the meantime (and its resourceVersion has been bumped), the request is automatically rejected, forcing you to try again. That's the very basics of optimistic locking (a sketch follows this list).
  2. You can send a patch and hope for the best; the patch will be merged (I won't go into the depths of the merge semantics here, but ...). Lists are tricky, they usually don't merge well, so you're still very likely to cause problems.
  3. You can use server-side apply, which also merges on the server but usually works better. Lists are still problematic, to be honest; in general, try using the merge-strategy markers for your lists (https://kubernetes.io/docs/reference/using-api/server-side-apply/#merge-strategy).
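
A minimal sketch of option 1, assuming a controller-runtime client and the Gateway API v1alpha2 Go types; setMyParentStatuses is a hypothetical placeholder for whatever status mutation your controller actually performs:

import (
    "context"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/retry"
    "sigs.k8s.io/controller-runtime/pkg/client"
    gwv1a2 "sigs.k8s.io/gateway-api/apis/v1alpha2"
)

// updateRouteStatus runs the GET / mutate / UPDATE cycle with optimistic
// locking: if another controller writes the status between our GET and our
// UPDATE, the UPDATE fails with a Conflict error and the closure is retried
// against a fresh copy instead of clobbering the other controller's entries.
func updateRouteStatus(ctx context.Context, c client.Client) error {
    key := types.NamespacedName{Namespace: "ns1", Name: "route1"}
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        route := &gwv1a2.TCPRoute{}
        // GET also fetches the object's current resourceVersion.
        if err := c.Get(ctx, key, route); err != nil {
            return err
        }
        // Mutate only the RouteParentStatus entries this controller owns.
        setMyParentStatuses(route)
        // UPDATE goes against the status subresource and carries the
        // resourceVersion from the GET; a stale version is rejected.
        return c.Status().Update(ctx, route)
    })
}

// setMyParentStatuses is a hypothetical stand-in for the controller's real
// status mutation.
func setMyParentStatuses(route *gwv1a2.TCPRoute) { /* ... */ }

Option 3 would instead go through c.Status().Patch with client.Apply and a per-controller field owner, letting the server merge entries from different owners, but as noted it only merges lists cleanly when usable merge-strategy markers exist.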

@youngnick
Contributor

As @robscott noted in the community meeting today, the Route status is currently designed so that each list item can be updated independently by the owning controller (because each entry is keyed by both controllerName and parentRef). However, we can't easily use the merge-strategy markers, because one of those keys is not a scalar (parentRef is a struct).
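
For reference, with the example route from the issue, the status list ends up with one entry per (parentRef, controllerName) pair, roughly like this sketch using the v1alpha2 Go types (the controller names are made up):

// Two entries in route1's status.parents, each owned by a different
// controller; no single scalar field distinguishes them, which is why a
// plain listMapKey marker doesn't fit today.
status := gwv1a2.RouteStatus{
    Parents: []gwv1a2.RouteParentStatus{
        {
            ParentRef:      gwv1a2.ParentReference{Name: "mygateway"},
            ControllerName: "example.com/my-controller", // hypothetical name
        },
        {
            ParentRef:      gwv1a2.ParentReference{Name: "othergateway"},
            ControllerName: "othervendor.io/their-controller", // hypothetical name
        },
    },
}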

So it seems like we have two options:

  • Add some field that disambiguates the RouteParentStatus struct with a single key (maybe some auto-generated concatenation of the ParentRef fields?), then use that as the merge key (see the sketch after this list). This would let the api-machinery mechanisms solve the problem correctly, but it's a fair amount of API churn.
  • Document that changes to status must always be done with a full update cycle (that is, GET the object, mutate the status to add your changes, then UPDATE the status, retrying if a resourceVersion conflict causes the update to be rejected). This puts more work onto maintainers, but requires no API changes (and so could be done more quickly).

I'm personally in favor of option 2 (since that's what we've done in all the controllers I've worked on anyway), but I'm interested in others' thoughts here.
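
To make option 1 concrete, here is a hypothetical sketch of the API change using the server-side-apply list markers; the parentKey field and its derivation are invented for illustration, not a proposed design:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Hypothetical: RouteParentStatus gains a single scalar key so the list can
// carry +listType=map / +listMapKey markers, letting server-side apply merge
// entries from different controllers without clobbering each other.
type RouteStatus struct {
    // +listType=map
    // +listMapKey=parentKey
    Parents []RouteParentStatus `json:"parents"`
}

type RouteParentStatus struct {
    // parentKey is a hypothetical auto-generated scalar, e.g. a hash or
    // deterministic concatenation of the parentRef fields.
    ParentKey string `json:"parentKey"`

    // Existing fields, with the same types as in the current v1alpha2 API.
    ParentRef      ParentReference    `json:"parentRef"`
    ControllerName GatewayController  `json:"controllerName"`
    Conditions     []metav1.Condition `json:"conditions,omitempty"`
}

With such a key in place, each controller could apply just its own entry under its own field owner and let the apiserver perform the merge.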

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2023
@robscott robscott modified the milestones: v0.8.0, v1.0.0 Apr 10, 2023
@shaneutt shaneutt removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2023
@shaneutt shaneutt moved this from Triage to Backlog in Gateway API: The Road to GA Apr 10, 2023
@shaneutt
Member

We would like to have this improvement and would definitely welcome a contributor taking this on:

/help

We don't expect this needs to hold up our GA Release, but we've placed it in v1.0.0 as it would be nice to have for that release.

@k8s-ci-robot
Contributor

@shaneutt:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

We would like to have this improvement and would definitely welcome a contributor taking this on:

/help

We don't expect this needs to hold up our GA Release, but we've placed it in v1.0.0 as it would be nice to have for that release.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Apr 10, 2023
@shaneutt shaneutt removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 10, 2023
@shaneutt shaneutt removed this from the v1.0.0 milestone Jul 10, 2023
@shaneutt
Member

We're getting closer to the GA release and need to reduce what we say is needed for it so we can focus on release blockers, so we're pulling this from the milestone (but we still want to get it done post-GA).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2024