Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP 3388 Retry Budget API Design #3573

Merged

Conversation

ericdbishop
Copy link
Contributor

@ericdbishop ericdbishop commented Jan 28, 2025

What type of PR is this?

/kind gep

What this PR does / why we need it:

Following up on #3488, where GEP 3388 was moved to provisional and the general goals and some potential designs for retry budgets within Gateway API were agreed upon. This PR will present multiple API implementations based off of previous discussion.

Which issue(s) this PR fixes:

Fixes #3388

Does this PR introduce a user-facing change?:

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/gep PRs related to Gateway Enhancement Proposal(GEP) do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 28, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @ericdbishop. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jan 28, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2025
@ericdbishop ericdbishop force-pushed the gep-3388-retry-budget-api-design branch from 75e5d0b to 78667d6 Compare January 29, 2025 16:38
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 29, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 30, 2025
@ericdbishop ericdbishop marked this pull request as ready for review January 30, 2025 17:17
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 30, 2025
@mikemorris
Copy link
Contributor

mikemorris commented Jan 30, 2025

Planning to expand on the doc comments further during implementation phase, but put together an initial draft of possible Go struct and YAML representations.

The tricky part i tried to illustrate in YAML examples is the potential complexity of applicability to both N/S (ingress) and E/W (mesh) traffic, indicating the need for some sort of source discriminator (the from clause - I'm expecting we'll likely want a similar pattern eventually for AuthZ policies @youngnick @LiorLieberman), and how bundling everything into BackendTrafficPolicy might make various combinations too unwieldy to configure or status too difficult to parse.

Open questions/comments:

  • Is there merit to including the HTTP retry codes/attempts/backoff fields in this policy? (Nested under an http stanza to allow an optional parallel grpc stanza in the future is desired)
  • Need to add some language on allowed/default values:
    • I think explicitly setting budgetInterval and minRetryRate.interval to 0s should be the way to configure Envoy's existing behavior for tracking concurrent in-flight requests only rather than measuring over a period of time. If we choose an over-time default value for each of these (Linkerd's defaults seem sensible and well-tested), I think there's a clear path for implementing that functionality in Envoy.

@robscott @kflynn @howardjohn @rajatsharma94

@mikemorris
Copy link
Contributor

/ok-to-test

@robscott
Copy link
Member

robscott commented Feb 3, 2025

@robscott I expect it should take at most a few days to get this sufficiently cleaned up - would like to request an extension until end of next week, Feburary 7th, 2025.

Thanks @mikemorris! I've talked with @mlavacca, @shaneutt, and @youngnick and we're approving this exception. Of course we would strongly prefer that this gets in sooner to leave time for API types, docs, conformance tests, etc. Let us know if there's anything we can help with!

@shaneutt
Copy link
Member

shaneutt commented Feb 4, 2025

Yes, just stay in touch! Feel free to reach out on Slack if you're feeling stuck 🫡

@tonya11en
Copy link

@mikemorris notes:

To clarify, I'm suggesting the unset/default behavior likely be a standard "count over time interval" (maybe 10s is the good default, which is not the stock Envoy behavior and will require what I expect will be a minor enhancement directly in Envoy upstream and configuration to set that in Envoy-based Gateway API implementations), and 0s is basically a "divide by zero" shorthand to express the existing Envoy behavior of "don't track requests over time, only active connections".

I kinda feel like only tracking active connections isn't actually implementing budgeted retries (it's honestly a little tough for me to come up with a "retry" semantic based on connections that even makes sense). I'd vote for disallowing "0s" and setting a minimum for the budget calculation window; if we want some funky connection-only mode, let's maybe call it something totally different.

@kflynn the Envoy retry budget doesn't care about connections. You're right to be confused by that. It only tracks concurrent requests and the ratio of those that are retries.

FWIW, I agree that modifying Envoy in the way folks here are suggesting (adding a time interval parameter) will work and be a fairly innocuous change. I said something similar in envoyproxy/envoy#30205 (comment). I can help with making that change in Envoy whenever you folks flesh out the specifics here.

@robscott
Copy link
Member

robscott commented Feb 6, 2025

@mikemorris @ericdbishop any updates on this one? Are you still trying to get this in to v1.3?

@mikemorris
Copy link
Contributor

mikemorris commented Feb 7, 2025

@mikemorris @ericdbishop any updates on this one? Are you still trying to get this in to v1.3?

Yes, I've just been a bit swamped with other tasks this week and I think @ericdbishop has been busy on-call. Opened ericdbishop#2 against Eric's branch with revisions in response to discussion in comments, after merging that I think @ericdbishop would just need to accept @kflynn's minor phrasing tweaks then we would be in good shape!

@ericdbishop
Copy link
Contributor Author

@mikemorris @robscott Done, it's been a low-bandwidth week for me, apologies.

@mikemorris
Copy link
Contributor

@robscott @kflynn @howardjohn I think this should be ready for final review now!

/cc @shaneutt @mlavacca @youngnick

@LiorLieberman
Copy link
Member

Looks good to me, thank you both!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2025
Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ericdbishop and @mikemorris! A few follow up questions we can cover in the next phase, but otherwise LGTM.

/approve

// Support: Extended
//
// +optional
BudgetPercent *Int `json:"budgetPercent,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can save this for the follow up, but we'll need some more comprehensive validation here, at least a min and max


// CommonRetryPolicy defines the configuration for when to retry a request.
//
type CommonRetryPolicy struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the minimum viable set of fields here for an implementation to say that they support retry budgets?

Comment on lines +122 to +123
// Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
// TCP failure) if a retry stanza is configured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll likely need to add some guidance in the spec for how this overlaps with the other retry config already in the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this might've just been an inadvertent copy-paste, was talking with @kflynn earlier about how we should add guidance on interaction with HTTPRoute retry clause - my initial thought is a retry budget is a constraint, but wouldn't on its own imply requests should be retried - thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, as a retry budget implies an overall constraint, and additionally application developers will decide which routes should require retries based on idempotency of requests, etc.


// RequestRate expresses a rate of requests over a given period of time.
//
type RequestRate struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are both of the fields in here meant to be optional? Surely at least one of them needs to be set? Can they both be set at the same time?

Copy link
Contributor

@mikemorris mikemorris Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both would likely be set at the same time. I'm not sure if there's an obvious enough default for either field on its own, so maybe both should be required, and what should actually be optional and could have a reasonable default is just the MinRetryRate field of this type in CommonRetryPolicy?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ericdbishop, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2025
@robscott
Copy link
Member

robscott commented Feb 8, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2025
@k8s-ci-robot k8s-ci-robot merged commit b86c8ea into kubernetes-sigs:main Feb 8, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry Budgets in HTTPRouteRetry
9 participants