add prometheus metrics #4056
base: main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wweiwei-li. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
	defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
I don't think this works correctly. If the idea is to get the latency of the reconcile request, we need to take a timestamp at the start of the function. Looking at the current implementation of ObserveControllerReconcileLatency, it will always report 0.
Yeah, it is not working correctly. My original plan was something like this. Would this make more sense?
reconcileAddFinalizerStartTime := time.Now()
....
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers", time.Since(reconcileAddFinalizerStartTime))
reconcileUpdateStatusStartTime := time.Now()
.....
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update status", time.Since(reconcileUpdateStatusStartTime))
Your suggestion works but we can cut down on code duplication. I would suggest defining something like this:
func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	defer mc.ObserveControllerReconcileLatency(resource, step, time.Now().Sub(start))
	fn()
}
Then you can call it like:
var err error
finalizerFn := func() {
	err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
}
r.metricsCollector.latencyHelper("targetGroupBinding", "add finalizers", finalizerFn)
... handle error here ...
Let me know what you think.
I actually forgot about the quirk with using time.Now and deferred statements.
https://stackoverflow.com/questions/72965657/measure-elapsed-time-with-a-defer-statement-in-go
it's a small change to the example.
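For reference, the corrected helper would look something like this (a sketch; the fix is to wrap the observation in a deferred closure so that time.Since runs when the function returns, not when the defer statement is evaluated):
func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	// Deferring a closure delays the time.Since call until latencyHelper
	// returns; deferring the Observe call directly would evaluate its
	// duration argument immediately and always record ~0.
	defer func() {
		mc.ObserveControllerReconcileLatency(resource, step, time.Since(start))
	}()
	fn()
}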
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
	defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Just so I'm clear, the idea is to provide more granular reconcile latency metrics than what controller-runtime already provides? Have you looked into what constructs controller-runtime offers so we don't need to roll our own?
Yeah, the idea is to provide more granular reconcile latency metrics. I checked controller-runtime; it only provides the end-to-end reconcile time (the controller_runtime_reconcile_time_seconds histogram).
main.go
Outdated
@@ -107,6 +107,12 @@ func main() {
	os.Exit(1)
}

var lbcMetricsCollector *lbcmetrics.Collector
Why did you change this? The original goal of this implementation was to avoid having to nil-check when no metrics were requested.
I think that's because I saw awsMetricsCollector had the nil check, and I didn't see lbcmetrics.NewCollector() handle a nil metrics.Registry internally. So I have a feeling we need to apply the same nil check. Let me know if I am wrong.
We should continue setting lbcMetricsCollector. The factory knows whether or not the metric registry is nil. If it's nil, we return a no-op collector. If the metric registry is not nil, we return an actual collector. The users of the metrics collector don't need to know either way.
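A minimal sketch of that factory pattern (the concrete names noopCollector and newPrometheusCollector are illustrative, not necessarily what the repo uses):
type MetricCollector interface {
	ObserveControllerCacheSize(resource string, count int)
}

// noopCollector satisfies the interface but records nothing.
type noopCollector struct{}

func (noopCollector) ObserveControllerCacheSize(resource string, count int) {}

func NewCollector(registry *prometheus.Registry) MetricCollector {
	if registry == nil {
		// No registry requested: hand back a no-op so callers never nil-check.
		return noopCollector{}
	}
	return newPrometheusCollector(registry)
}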
main.go
Outdated
@@ -206,6 +212,14 @@ func main() {
	setupLog.Error(err, "problem wait for podInfo repo sync")
	os.Exit(1)
}

go func() {
	if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
This will currently cause a nil-pointer panic if a registry is not defined.
Good catch, I will add a check on metrics.Registry:
if metrics.Registry != nil {
	go func() {
		if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
			setupLog.Error(err, "problem periodically collect cache size")
			os.Exit(1)
		}
	}()
}
}

func (s *loadBalancerSynthesizer) Synthesize(ctx context.Context) error {
	defer s.metricsCollector.ObserveControllerReconcileLatency("service/ingress", "synthesize load balancer")
It might be useful to distinguish between ALB / NLB here.
yeah, agree, will update it to distinguish between ALB and NLB
	return elbv2model.LoadBalancerSpec{}, err
}
coIPv4Pool, err := t.buildLoadBalancerCOIPv4Pool(ctx)
if err != nil {
	t.metricsCollector.ObserveControllerReconcileError("ingress", "build model error", "build customer owned IPv4 pool")
Instead of returning an error here, it might be good to define a custom error struct that embeds the error message and the various metric fields. The plus side here is that:
1/ It's impossible to forget to add new metrics for new fields.
2/ It's hopefully a little more readable.
LMK what you think.
Good point, I agree. Will implement that.
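A sketch of what such an error type could look like (the PR later lands something similar as errmetrics.NewErrorWithMetrics; the field names and the two-argument observer interface here are assumptions):
type reconcileErrorObserver interface {
	ObserveControllerReconcileError(controller string, category string)
}

type ErrorWithMetrics struct {
	ResourceType  string
	ErrorCategory string
	Err           error
}

func (e *ErrorWithMetrics) Error() string {
	return e.Err.Error()
}

// Recording the metric at construction time makes it impossible to return
// the error while forgetting to bump the counter.
func NewErrorWithMetrics(resourceType string, errorCategory string, err error, mc reconcileErrorObserver) *ErrorWithMetrics {
	mc.ObserveControllerReconcileError(resourceType, errorCategory)
	return &ErrorWithMetrics{ResourceType: resourceType, ErrorCategory: errorCategory, Err: err}
}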
pkg/metrics/aws/collector.go
Outdated
// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
if statusCode == "401" || statusCode == "403" {
	c.instruments.apiCallAuthErrorsTotal.With(map[string]string{
		labelService: service,
nit: you define this same string map for every case.
will fix it
		labelStatusCode: statusCode,
		labelErrorCode:  errorCode,
	}).Inc()
} else if errorCode == "ServiceLimitExceeded" {
Can you double check that the current metrics don't cover these cases? If they don't, it would be good to make these cases a little more generic. Right now, all the cases are tailored to ELB specific error codes.
Same here. I know we are able to derive this from the current API metrics. However, I was wondering whether we want some aggregated metrics for the important errors we care about?
Agreed. Sounds good.
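One way the more generic, aggregated classification could look (a sketch; the apiCallSpecialErrorsTotal instrument and labelErrorKind label are assumptions, and the error codes are examples taken from the AWS common-error documentation linked above):
var errorKind string
switch {
case statusCode == "401" || statusCode == "403":
	errorKind = "auth"
case errorCode == "ServiceLimitExceeded" || errorCode == "LimitExceededException":
	errorKind = "limit_exceeded"
case errorCode == "Throttling" || errorCode == "ThrottlingException":
	errorKind = "throttled"
case errorCode == "ValidationError":
	errorKind = "validation"
}
if errorKind != "" {
	// A single counter vector with an error-kind label also removes the
	// per-case string maps flagged in the nit above.
	c.instruments.apiCallSpecialErrorsTotal.With(map[string]string{
		labelService:   service,
		labelErrorKind: errorKind,
	}).Inc()
}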
pkg/metrics/aws/instruments.go
Outdated
apiCallAuthErrorsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
	Subsystem: metricSubSystem,
	Name:      metricAPIAuthErrorsTotal,
	Help:      "Number of failed AWS API calls that due to auth or authrorization failures",
typo: remove "that"
same for all these descriptions it looks like.
Good catch, will remove them
pkg/metrics/lbc/collector.go
Outdated
c.ObserveControllerCacheSize("service", len(svcList.Items))

// Collect TargetGroupBlinding cache size
tgbList := &elbv2api.TargetGroupBindingList{}
can you collect the deferred target queue cache size too?
Sure, will add.
	}
}

func (c *Collector) CollectCacheSize(ctx context.Context) error {
We should make this a bit more generic as ultimately it's the same code block repeated. Basically defining a "collectable" resource map and then allowing the type to be passed in might work. Let's talk offline about how to make this more extendable.
Agree, "collectable" resource map makes sense. will update it
pkg/metrics/lbc/instruments.go
Outdated
// MetricControllerCacheObjectCount tracks the total number of object in the controller runtime cache.
MetricControllerCacheObjectCount = "controller_cache_object_total"
// MetricPodReadinessGateFlipAboveX tracks readiness gate flips that are X seconds old
MetricPodReadinessGateFlipAbove60Seconds = "readiness_gate_above_60_seconds"
This should be derivable from the buckets we've defined.
Are you saying the MetricPodReadinessGateFlipAbove60Seconds metric is redundant? Yeah, I know we are able to derive how many exceed 60s and 90s using PromQL. However, I was thinking we might want some direct metrics?
In this case I think we're just duplicating the bucketing function that Prometheus supports out of the box. If we want these 2 specific buckets, adding them to this vector makes more sense. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/metrics/lbc/instruments.go#L32
Agree, I will remove the MetricPodReadinessGateFlipAbove60Seconds and MetricPodReadinessGateFlipAbove90Seconds metrics.
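For example, the thresholds could be folded into the existing histogram's buckets (a sketch; the metric name, help text, label, and bucket boundaries are illustrative):
podReadinessGateReadySeconds := prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: metricSubSystem,
	Name:      "readiness_gate_ready_seconds",
	Help:      "Time until the pod readiness gate flips to ready",
	// Explicit 60 and 90 second boundaries keep the old counters derivable
	// with a plain bucket query in PromQL.
	Buckets: []float64{10, 30, 60, 90, 120, 180, 300, 600},
}, []string{labelNamespace})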
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
	defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Having whitespace in such "label" values could cause issues when consuming those metrics in downstream services/code that cannot handle whitespace correctly.
Maybe switch things like "add finalizers" to "add_finalizers" (make it a machine-readable identifier).
Good point, will fix it
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc | |||
} | |||
|
|||
func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error { | |||
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers") |
This defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
won't work; it will always be 0. You need to wrap it in an anonymous function and call that in the defer.
Yeah, like Zac suggested above, I will have:
var err error
finalizerFn := func() {
	err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
}
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add_finalizers", finalizerFn)
if err != nil {
	return err
}

func (c *Collector) ObserveControllerReconcileLatency(controller string, stage string, fn func()) {
	start := time.Now()
	defer func() {
		c.instruments.controllerReconcileLatency.With(prometheus.Labels{
			labelController:     controller,
			labelReconcileStage: stage,
		}).Observe(time.Since(start).Seconds())
	}()
	fn()
}
if err := r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer); err != nil {
	r.eventRecorder.Event(tgb, corev1.EventTypeWarning, k8s.TargetGroupBindingEventReasonFailedAddFinalizer, fmt.Sprintf("Failed add finalizer due to %v", err))
	return err
}

defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "reconcile")
This seems not to work due to how defer works in Go.
Ideally we should collect the latency spent on each task, instead of "latency since stage X". There is a difference between:
- latency spent on adding finalizers
- latency from adding the finalizer to finishing reconcile
The first one is clearly the better option and allows us to monitor each task individually.
Agree, will collect latency spent on each task
pkg/metrics/aws/instruments.go
Outdated
metricAPIAuthErrorsTotal      = "api_call_auth_errors_total"
metricAPILimitExceededTotal   = "api_call_limit_exceeded_total"
metricAPIThrottledTotal       = "api_call_throttled_total"
metricAPIValidationErrorTotal = "api_call_validation_error_total"
nit: maybe be consistent on using either Error or Errors for the metric names.
Good point, updated it to use Errors
@@ -71,6 +72,38 @@ func WithSDKCallMetricCollector(c *Collector) func(stack *smithymiddleware.Stack
	labelStatusCode: statusCode,
	labelErrorCode:  errorCode,
}).Inc()

// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
Do we only consider ELBv2 errors in this PR? How about others, like EC2 API errors?
Yeah, this was only considered for ELB. I think EC2 API also matters. I can add some common errors based on https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
force-pushed from c5d86dd to 5a7af88
force-pushed from 5a7af88 to 8f12952
if err != nil {
	return err
	return errmetrics.NewErrorWithMetrics("targetGroupBinding", "reconcile_targetgroupblinding_error", err, r.metricsCollector)
typo: blinding -> binding
@@ -93,6 +101,7 @@ type targetGroupBindingReconciler struct {
// +kubebuilder:rbac:groups="discovery.k8s.io",resources=endpointslices,verbs=get;list;watch

func (r *targetGroupBindingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (ctrl.Result, error) {
	r.reconcileCounters.IncrementTGB(req.NamespacedName)
nice!
updateTargetGroupBindingStatusFn := func() {
	err = r.updateTargetGroupBindingStatus(ctx, tgb)
}
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update_status", updateTargetGroupBindingStatusFn)
Can you please define the resource as a const? It's re-used for every metric collection step. I think the step names are unique and don't need to be consts.
dnsResolveAndUpdateStatus := func() {
	lbDNS, err := lb.DNSName().Resolve(ctx)
	if err != nil {
		return
I believe you still need to propagate this error to the parent function. Right now, it will silently fail.
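A sketch of the propagation fix: declare the variables in the parent scope and assign with = inside the closure (the := in the current code creates closure-local variables, so the error never escapes; the stage name below is illustrative):
var lbDNS string
var err error
dnsResolveAndUpdateStatusFn := func() {
	lbDNS, err = lb.DNSName().Resolve(ctx)
	if err != nil {
		return
	}
	err = r.updateIngressGroupStatus(ctx, ingGroup, lbDNS)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "dns_resolve_and_update_status", dnsResolveAndUpdateStatusFn)
if err != nil {
	return err
}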
	if err != nil {
		return
	}
	if err := r.updateIngressGroupStatus(ctx, ingGroup, lbDNS); err != nil {
same comment, when using this nested function approach we need to propagate the error to the parent function.
removeGroupFinalizerFn := func() {
	err = r.groupFinalizerManager.RemoveGroupFinalizer(ctx, ingGroupID, ingGroup.InactiveMembers)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "remove_group_finalizer", removeGroupFinalizerFn)
please define ingress as a constant.
if err != nil {
	return err
	return errmetrics.NewErrorWithMetrics("service", "deploy_model_error", err, r.metricsCollector)
same, please define service as a const.
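e.g. something along these lines (names illustrative):
const (
	controllerIngress            = "ingress"
	controllerService            = "service"
	controllerTargetGroupBinding = "targetGroupBinding"
)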
@@ -206,6 +210,25 @@ func main() {
	setupLog.Error(err, "problem wait for podInfo repo sync")
	os.Exit(1)
}

if metrics.Registry != nil {
I'd prefer to just use the no-op collector instead of nil-checking the registry. It makes the code easier to reason about.
	Err: err,
}

if metricCollector != nil {
same here, let's just utilize the no-op implementation so we don't need nil checks.
setupLog.Info("starting collect cache size")
if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
	setupLog.Error(err, "problem periodically collect cache size")
	os.Exit(1)
What startup error(s) do we expect here?
Description
Checklist
- Updated documentation (README.md, or the docs directory)
BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯