Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP: Unknown Field Management (aka Dead Fields) #3624

Open
shaneutt opened this issue Feb 17, 2025 · 7 comments
Open

GEP: Unknown Field Management (aka Dead Fields) #3624

shaneutt opened this issue Feb 17, 2025 · 7 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@shaneutt
Copy link
Member

shaneutt commented Feb 17, 2025

Definition

The "Unknown Fields" situation (colloquially we've often referred to this as the "Dead Fields" situation as well) is the situation where the API provides one or more fields that an implementation doesn't actually support, but that implementation is otherwise schema-compatible with the rest of the fields.

Note: In this context we're referring very specifically to the Gateway API CRDs.

In the best case the user doesn't need or use the dead fields, and the problem is non-impacting (at least for that moment).

In the bad case the user utilizes the fields and the API accepts them, but the implementation doesn't support them and so they are "dropped silently". This is not axiomatic for the user. The implementation problematically provisions something that is only a partial reflection of the users' intent.

The worst case is the same as the bad case, except those dropped fields are critical and it is broken and/or unsafe to provision the rest without them. For instance, if some security related fields like authentication were populated and the implementation drops them and exposes a route without its intended authentication (and the user has no way to tell).

What?

I'm advocating for an upstream solution to the "dead fields" problem to help Gateway API implementations and Kubernetes platforms create a safer and more consistent experience for users. This should include both documentation and tooling.

Note: This problem is not unique to Gateway API, and in fact I'm aware of it existing for at least Cluster API (CAPI) and potentially some other projects. However, we were early adopters of CRDs for an official API so the impact is particularly strong here. See more context below.

Why?

Note: Originally discussed in #3576 (comment)

In Gateway API, we have several dimensions where the "dead fields" problem comes into play:

  • Clusters housing multiple implementations
  • Extended feature support
  • Experimental channel

We'll explain each of these situations in more detail.

Clusters housing multiple implementations

The Gateway API CRDs have become ubiquitous, and as such the situation where multiple implementations on the same cluster is a thing that needs to be supported is becoming more and more common. Consequentially, platforms are inclined to provide the Gateway API CRDs as "core-like" APIs on these clusters.

Core-Like: meaning that in effect the API is like a "core API" - It's provided by default and can't be managed by the cluster admin or any user, instead the full life-cycle of the API is managed by the platform. The biggest differences is that you can't necessarily expect a specific version of Gateway API to match with a specific version of Kubernetes (today).

This gives implementations at least two obvious choices. They can either:

a) crash until the schema that matches is present
b) accept the CRDs as long as they are newer than the minimum required, and forward compatible (e.g. same major release)

A is nobody's favorite. The implication is that implementations might need to (across a matrix of platform versions and the corresponding Gateway API version) create specific releases that target a combination of platforms and their versions, and while doing so match the Gateway API version there exactly. This pushes implementations to B which is where the dead fields dragons lie.

Extended feature support

We have a concept in Gateway API called "Support Levels". This explicitly adds fields that implementations are not required to support. These are not exactly the same as "unknown fields", but relate.

A result of this enhancement should be to define the standard for what implementations do when they have a field populated which they can't use, whether that comes from an "unknown field" or an "known field" (i.e. extended field).

Experimental channel

This is probably one of the least problematic situations, since we anticipate if you're using something labeled Experimental then what you're doing is testing or something adjacent to that, so the impact of dropped fields is limited to non-production cases here. This is mainly a problem because we originally "layered" experimental features on top of the same GVK as standard and had users choose which one to deploy. A solution to that problem is being discussed.

How?

I'm open to suggestion on the "how".

Note: Though the "how" is very open for discussion, I do have some thoughts on what I think might not work or might work that seem worth sharing now as the basis for that discussion.

Telling implementations that they should schema check the CRDs and crash on additive schema, and target releases against specific combinations of platform versions and their Gateway API versions is... well... maybe on the table but it's sure going to be unpopular, people are going to prefer a more flexible solution since this isn't actually a core API.

I currently don't think Validating Webhooks or Validating Admission Policies (VAP) or anything like that is going to be particularly viable. We could in theory add something to the specification (e.g. something like supported features) to inform validation, but then we still have to contend with race conditions and the "best effort nature" of cross-resource validation since we'll have to follow the GatewayClass <-> Gateway <-> *Route resource chain to do that validation.

A solution I have not tested but I think might be viable is if we provide tooling for implementations to inject schema checking into object deserialization in the API. That is to say, we might be able to provide a custom decoder which would be a "dead fields detector": implementations could employ this in their controllers, and it could tell them whether the actual API schema is different from the one they expect, and furthermore if the user specified values in those "dead fields". Again this still needs to be tested out, but if it can work then implementations can just identify and bark at dead fields themselves and we can provide conformance tests to ensure they bark. One downside of this is that it's not a universal solution right out of the box. We can provide Golang and Rust tooling (via gateway-api-rs), but we simply can't feasibly provide full solutions for all languages and instead would have to provide guidance for them to add it themselves.

@shaneutt shaneutt added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. labels Feb 17, 2025
@shaneutt shaneutt self-assigned this Feb 17, 2025
@shaneutt shaneutt moved this to In Progress in Gateway API Pipeline Feb 17, 2025
@shaneutt shaneutt changed the title GEP: Dead Field Management GEP: Unknown Field Management (aka Dead Fields) Feb 17, 2025
@youngnick
Copy link
Contributor

#1804 may be relevant here as well, linking.

@mlavacca
Copy link
Member

I definitely agree this is a problem we need to address.
I'd like to highlight the double-folded nature of this problem though, as it can happen under two different circumstances:

  • The implementation acme is using Gateway API v1.x, which corresponds to the version installed in the cluster. This necessarily means that all the CRD fields are correctly unmarshaled into the go structs, and no field gets lost during the unmarshaling process.
  • The implementation acme uses Gateway API v1.x, but the Gateway API version installed in the cluster is v1.(x+1) (standard). The feature foo in the Gateway CRD just graduated to the standard channel. The implementation, when unmarshaling the Gateway resource, loses the foo field, and cannot be aware of the dead field in the object.

While the former scenario can be solved by adding a dedicated condition to the resource, the latter is the complex one and definitely requires some work and definition.

@howardjohn
Copy link
Contributor

The implementation, when unmarshaling the Gateway resource, loses the foo field, and cannot be aware of the dead field in the object.

One thing to consider is status writing as well. If an implementation uses Update instead of Patch for status (Istio does, fwiw), there is a risk with dead fields. The update call will not include them, thus overwriting them. This would impact multi-ownership of an object like a route.

The same has been seen with mutating webhooks before; they would accidentally remove fields due to using an older client-go version

@shaneutt shaneutt moved this from In Progress to Triage in Gateway API Pipeline Feb 18, 2025
@youngnick
Copy link
Contributor

The tip about using Patch rather than Update for status is one I had in mind for the implementer's guide, although I never got around to doing it.

@danwinship
Copy link

Note: This problem is note unique to Gateway API, and in fact I'm aware of it existing for

"is NOT unique"

at least Cluster API (CAPI) and potentially some other projects. However, we were early adopters of CRDs for an official API so the impact is particularly strong here.

This isn't even just a CRD thing: NetworkPolicy has had to worry about this too. If your NP implementation is compiled against an older version of k8s.io/api than the apiserver is, then users can create NetworkPolicies that use fields that your NP impl can't see.

A solution I have not tested but I think might be viable is if we provide tooling for implementations to inject schema checking into object deserialization in the API.

I can't find the discussion now (it's not in kubernetes/enhancements#2137 which is where I would have expected it) but I had thought about something very similar for NetworkPolicy.

("Schema checking" makes it sound more difficult than it really is; you just need to notice whether the JSON had any fields in it that couldn't be deserialized into the generated struct type.)

One downside of this is that it's not a universal solution right out of the box. We can provide Golang and Rust tooling (via gateway-api-rs), but we simply can't feasibly provide full solutions for all languages and instead would have to provide guidance for them to add it themselves.

A simple (if not efficient) way to do this without having a real API for it is to use the dynamic client rather than a generated type-specific client, and then do the comparison post-hoc; you get an unstructured.Unstructured from the client, and then convert it to your generated type, and then at that point you can check whether reserializing the object of the generated type back to JSON gives the same result as reserializing the unstructured object back to JSON. If not, then you lost something...

@danwinship
Copy link

danwinship commented Mar 6, 2025

This is also being discussed in sig-arch, specifically in the context of some DRA fields (https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit). (More-generic discussion later on in the doc.)

@shaneutt
Copy link
Member Author

Sounds like we have significant support to at least see what can be done here, so we can move forward:

/triage accepted

I'll try and drive the first iteration of this (but welcome collaborators):

/assign @shaneutt

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Mar 11, 2025
@shaneutt shaneutt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Mar 11, 2025
@shaneutt shaneutt moved this from Triage to In Progress in Gateway API Pipeline Mar 11, 2025
@shaneutt shaneutt added this to the v1.4.0 milestone Mar 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: In Progress
Development

No branches or pull requests

6 participants