Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rule management to avoid sporadic 503 errors #4039

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

shraddhabang
Copy link
Collaborator

@shraddhabang shraddhabang commented Feb 4, 2025

Issue

#3816

Description

The root cause for the above issue was identified as a race condition in the ALB's rule management process. The controller uses ModifyRule API sequentially to add any new rule at the existing priority which results in loss of the existing older rule at that priority. When the first API call to modify rules is received, the ALB waits 10 seconds to batch all subsequent changes before deploying them to the data plane. If some rule modification APIs arrive after the initial 10-second window, they are implemented in the next batch, leading to a brief period where a random rule may not exist, causing 404 errors.

The proposed improvement in the rule management logic make use of SetRulePriorities API to first push the unwanted rules down in the listener instead of deleting them. This creates the gaps in the rules order to add any new/updated rule. Once the new rule is successfully added, the controller will then delete all the unwanted rules. This way we ensure that no rules will be lost before the new/updated rules are present in the controller. This will avoid any sporadic 404 errors caused due to race conditions.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 4, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shraddhabang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2025
"resourceID", resLR.ID(),
"arn", awssdk.ToString(sdkLR.ListenerRule.RuleArn))
return nil
func (m *defaultListenerRuleManager) SetRulePriorities(ctx context.Context, unmatchedSDKLRs []ListenerRuleWithTags, lastAvailablePriority int32) (int32, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unmatchedSDKLRs seems to be the wrong name. Aren't these listener rules that are matched but just at the wrong priority?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read further on in the implementation seems we use this function for matched and unmatched rules.

Copy link
Collaborator

@wweiwei-li wweiwei-li Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move "elasticloadbalancing:SetRulePriorities" to where we have condition, like other modify calls

"Condition": {
    "Null": {
        "aws:ResourceTag/elbv2.k8s.aws/cluster": "false"
    }
}


matchedResAndSDKLRs, unmatchedResLRs, unmatchedSDKLRs := matchResAndSDKListenerRules(resLRs, sdkLRs)
// matchedResAndSDKLRsBySettings : A slice of matched resLR and SDKLR rule pairs that have matching settings like actions and conditions
// unmatchedResLRs : A slice of resLR) that do not have a corresponding match in the sdkLRs. These rules need to be created on the load balancer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resLR) typo

if err != nil {
return err
}
resLR.SetStatus(lrStatus)
}
for _, resAndSDKLR := range matchedResAndSDKLRs {
// Update existing listener rules on the load balancer for their tags
for _, resAndSDKLR := range matchedResAndSDKLRsBySettings {
lsStatus, err := s.lrManager.Update(ctx, resAndSDKLR.resLR, resAndSDKLR.sdkLR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this lrManager.Update is used only for update tags with your change ? should we change the method name to make it clear

sdkLR: sdkLR,
})
}
unmatchedSDKLRs = append(unmatchedSDKLRs[:i], unmatchedSDKLRs[i+1:]...)
Copy link
Collaborator

@zac-nixon zac-nixon Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am not a fan of modifying the input of a function like this. I see that we do this is in other parts of the code but overall I think it's an anti-pattern. Using this pattern also means you have to do multiple modification to i to get the index correct.

I would prefer to just allocate a new array to use.

break
}
}
if !found {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will still result in 503s (please correct me if I'm wrong)

Basically, from my read of the code, when a user changes their conditions or actions we will delete the ListenerRule and re-create the rule. If we're unlucky, it's possible for the delete to get propagated and the create to be delayed a bit which causes the 503 errors.

Ideally, we can use modify-rule https://docs.aws.amazon.com/cli/latest/reference/elbv2/modify-rule.html to detect these rules with the changed actions / conditions and modify the rule in place, rather than delete and create the rule.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 25, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants