
Failover to replicas on 401&403 HTTP status codes from the SDK #568

Closed
amerjusupovic opened this issue Jul 2, 2024 · 10 comments · Fixed by #595
@amerjusupovic (Member)

Currently, the provider only fails over to available replicas on three specific HTTP status codes when a request returns a RequestFailedException. However, we could fail over on all status codes for better resiliency, preventing an exception from being thrown when an unexpected status code could be resolved by requesting a replica instead.

@zhenlan (Contributor) commented Jul 3, 2024

Can you give one example of what else would be useful?

@amerjusupovic (Member, Author)

@zhenlan The idea was to protect against an unexpected 401/403 error causing the provider to throw an exception. If a bug caused failures for only a specific endpoint, it would be unfortunate if the provider couldn't attempt to fail over to a replica. I'm not sure exactly how such a 401/403 bug would manifest, but that's how it was described to me.

@zhenlan (Contributor) commented Jul 9, 2024

All replicas share the same auth model. Switching from one replica to another won't help with 401/403 errors.

@jimmyca15 (Member) commented Jul 9, 2024

That's the happy path. Replicas are meant to kick in during disaster scenarios, and in such cases it's not out of the question for a service issue to cause a given region/replica to return 401/403. Yes, some kind of bug or regional issue would have to cause it, but in that case we should still fail over to replicas. I think we missed on this one: there should never be a case where the server can do something that stops the client from attempting a replica. Otherwise, that specific scenario occurring during a disaster could shut down a client that has replication enabled.

@zhenlan (Contributor) commented Jul 9, 2024

Do we have a real scenario, or is this more of an "in case we have a bug" scenario?

@jimmyca15 (Member)

Bug/disaster, yes.

@zhiyuanliang-ms zhiyuanliang-ms self-assigned this Jul 16, 2024
@jimmyca15 (Member)

@zhenlan do you still have concerns on this one?

@zhenlan (Contributor) commented Jul 17, 2024

I'm still not convinced we should fail over for potential bugs. Since bugs can be anything, that's like failing over for everything (all 4xx status codes... assuming we don't have bugs in the status codes themselves).

@zhiyuanliang-ms zhiyuanliang-ms changed the title Failover to replicas on all HTTP status codes from the SDK Failover to replicas on 401&403 HTTP status codes from the SDK Jul 25, 2024
@amerjusupovic (Member, Author)

Discussed offline: the plan is to fail over to any available replicas on 401 and 403 status codes, due to potential regional failures of other services that App Configuration depends on.

@zhenlan (Contributor) commented Sep 16, 2024

> Discussed offline, the plan is to failover to any available replicas on 401 and 403 status codes due to potential regional failures of other services App Configuration depends on.

Specifically, regional failures of the Microsoft Entra ID and RBAC services that App Configuration depends on. This is why we only add failover for 401 and 403 (instead of other arbitrary client errors) at this time.
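The resolution above can be sketched as a failover-eligibility check: 401/403 become failover-eligible alongside the statuses that already triggered failover. This is a minimal Python sketch, not the provider's actual .NET implementation; the pre-existing status set and the helper names are illustrative assumptions — only the addition of 401 and 403 comes from this issue.

```python
# Hypothetical set of statuses that already triggered failover; the issue
# mentions three without naming them here, so these are placeholders.
EXISTING_FAILOVER_STATUSES = {408, 429, 500}  # illustrative only

# Added per this issue: a regional outage of the auth services
# (Microsoft Entra ID / RBAC) shouldn't pin the client to one replica.
AUTH_FAILURE_STATUSES = {401, 403}


def should_failover(status_code: int) -> bool:
    """Return True if the client should try the next replica."""
    return (
        status_code in EXISTING_FAILOVER_STATUSES
        or status_code in AUTH_FAILURE_STATUSES
        or status_code >= 500  # server errors are generally retriable
    )


def get_settings(replicas, fetch):
    """Try each replica in order, failing over on eligible status codes.

    `fetch(endpoint)` is a stand-in for the real request and returns
    a (status_code, body) tuple.
    """
    last_status = None
    for endpoint in replicas:
        status, body = fetch(endpoint)
        if status == 200:
            return body
        last_status = status
        if not should_failover(status):
            break  # non-retriable client error: surface it immediately
    raise RuntimeError(f"All replicas failed; last status: {last_status}")
```

With this change, a replica returning 401/403 (e.g. due to a regional auth outage) no longer stops the loop; the client moves on to the next replica instead of throwing on the first auth failure.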
