-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add metric for NNC init failures #3453
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Evan Baker <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
cns/service/metrics.go:9
- Typo in comment: 'monotic' should be 'monotonic'.
// managerStartFailures is a monotic counter which tracks the number of times the controller-runtime
/azp run Azure Container Networking PR |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Overview
This pull request updates the CNS initialization process to retry until successful, tracking failures and success via new metrics. Key changes include:
- Replacing a finite retry count with an infinite (until succeeded) exponential backoff retrier in the main service.
- Incrementing a failure metric (nncInitFailure) on each NNC init failure and setting a success gauge (hasNNCInitialized) after successful reconciliation.
- Adding two new Prometheus metrics in metrics.go for NNC initialization tracking.
Reviewed Changes
File | Description |
---|---|
cns/service/main.go | Updated retry logic and added metric instrumentation for CNS init state |
cns/service/metrics.go | Added two new metrics (nncInitFailure and hasNNCInitialized) with registration |
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
cns/service/main.go:1462
- The variable 'initCNSInitalDelay' appears to have a typo; consider renaming it to 'initCNSInitialDelay' for clarity.
}, retry.Context(ctx), retry.Delay(initCNSInitalDelay), retry.MaxDelay(time.Minute), retry.UntilSucceeded())
cns/service/metrics.go:29
- The word 'monotic' in the comment appears to be a typo; consider changing it to 'monotonic'.
// nncInitFailure is a monotic counter which tracks the number of times the initial NNC reconcile has failed.
Instead of crashing after 10 retries to initialize the CNS state, this change retries until it succeeds and increments a metric if it doesn't to count NNC init failures. Also adds a positive-signal metric "hasNNCInitialized" to signal that this process has completed succesfully.