Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add metric for NNC init failures #3453

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rbtr
Copy link
Contributor

@rbtr rbtr commented Feb 25, 2025

Instead of crashing after 10 retries to initialize the CNS state, this change retries until it succeeds and increments a metric if it doesn't to count NNC init failures. Also adds a positive-signal metric "hasNNCInitialized" to signal that this process has completed succesfully.

@rbtr rbtr added cns Related to CNS. release/latest Change affects latest release train needs-backport Change needs to be backported to previous release trains labels Feb 25, 2025
@rbtr rbtr self-assigned this Feb 25, 2025
@Copilot Copilot bot review requested due to automatic review settings February 25, 2025 23:32
@rbtr rbtr requested a review from a team as a code owner February 25, 2025 23:32

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

cns/service/metrics.go:9

  • Typo in comment: 'monotic' should be 'monotonic'.
    // managerStartFailures is a monotic counter which tracks the number of times the controller-runtime
@rbtr
Copy link
Contributor Author

rbtr commented Feb 26, 2025

/azp run Azure Container Networking PR

@rbtr rbtr enabled auto-merge February 26, 2025 17:38
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr requested a review from Copilot February 28, 2025 06:14

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This pull request updates the CNS initialization process to retry until successful, tracking failures and success via new metrics. Key changes include:

  • Replacing a finite retry count with an infinite (until succeeded) exponential backoff retrier in the main service.
  • Incrementing a failure metric (nncInitFailure) on each NNC init failure and setting a success gauge (hasNNCInitialized) after successful reconciliation.
  • Adding two new Prometheus metrics in metrics.go for NNC initialization tracking.

Reviewed Changes

File Description
cns/service/main.go Updated retry logic and added metric instrumentation for CNS init state
cns/service/metrics.go Added two new metrics (nncInitFailure and hasNNCInitialized) with registration

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

cns/service/main.go:1462

  • The variable 'initCNSInitalDelay' appears to have a typo; consider renaming it to 'initCNSInitialDelay' for clarity.
}, retry.Context(ctx), retry.Delay(initCNSInitalDelay), retry.MaxDelay(time.Minute), retry.UntilSucceeded())

cns/service/metrics.go:29

  • The word 'monotic' in the comment appears to be a typo; consider changing it to 'monotonic'.
// nncInitFailure is a monotic counter which tracks the number of times the initial NNC reconcile has failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cns Related to CNS. needs-backport Change needs to be backported to previous release trains release/latest Change affects latest release train
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant