add prometheus metrics #4056

Open · wants to merge 1 commit into base: main
Conversation

@wweiwei-li (Collaborator) commented Feb 14, 2025

Description

  • metricAPIPermissionErrorsTotal
  • metricAPILimitExceededTotal
  • metricAPIThrottledTotal
  • metricAPIValidationErrorTotal
  • MetricControllerReconcileErrors tracks the total number of controller errors by error type.
  • MetricControllerReconcileStageDuration tracks latencies of different reconcile stages.
  • MetricWebhookValidationFailure tracks the total number of validation errors by error type.
  • MetricWebhookMutationFailure tracks the total number of mutation errors by error type.
  • MetricControllerCacheObjectCount tracks the total number of objects in the controller runtime cache.
  • MetricControllerTopTalker

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2025
@wweiwei-li wweiwei-li changed the title from "add prometheusmetrics" to "add prometheus metrics" on Feb 14, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wweiwei-li

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 14, 2025
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

I don't think this works correctly. If the idea is to get the latency of the reconcile request, we need to get a timestamp at the start of the function. Currently, looking at the implementation of ObserveControllerReconcileLatency, it will always report 0.

Collaborator Author

Yeah, it is not working correctly. My original plan was something like this. Would this make more sense?

reconcileAddFinalizerStartTime := time.Now()
....
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers", time.Since(econcileAddFinalizerStartTime))

reconcileUpdateStatusStartTime := time.Now()
.....

r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update status", time.Since(reconcileUpdateStatusStartTime))

Collaborator

Your suggestion works but we can cut down on code duplication. I would suggest defining something like this:

func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	defer mc.ObserveControllerReconcileLatency(resource, step, time.Now().Sub(start))
	fn()
}

Then you can call it like:

var err error
finalizerFn := func () {
   err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
} 
r.metricsCollector.latencyHelper("targetGroupBinding", "add finalizers", finalizerFn)
... handle error here ...

Let me know what you think.

Collaborator

I actually forgot about the quirk with using time.Now and deferred statements.

https://stackoverflow.com/questions/72965657/measure-elapsed-time-with-a-defer-statement-in-go

It's a small change to the example.
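
For reference, a minimal sketch of that small change to the helper above (same hypothetical latencyHelper): deferring an anonymous function means time.Since(start) is only evaluated after fn returns, rather than when the defer statement is registered.

func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	// Deferring a closure delays evaluating time.Since(start) until fn has finished.
	// With `defer mc.ObserveControllerReconcileLatency(resource, step, time.Since(start))`,
	// the arguments are evaluated immediately and the helper would always record ~0.
	defer func() {
		mc.ObserveControllerReconcileLatency(resource, step, time.Since(start))
	}()
	fn()
}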

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

Just so I'm clear, the idea is to provide more granular reconcile latency metrics than what is already provided by controller-runtime? Have you looked into what constructs controller-runtime provides so we don't need to roll our own?

Collaborator Author

Yeah, the idea is to provide more granular reconcile latency metrics. I checked controller-runtime; it only provides end-to-end reconcile time.

main.go Outdated
@@ -107,6 +107,12 @@ func main() {
os.Exit(1)
}

var lbcMetricsCollector *lbcmetrics.Collector
Collaborator

Why did you change this? The original goal of this implementation was to avoid having to nil check when no metrics were requested.

Collaborator Author

I think that's because I saw awsMetricsCollector had the nil check, and I didn't see lbcmetrics.NewCollector() handle a nil metrics.Registry internally, so I had a feeling we needed to apply the same nil check. Let me know if I am wrong.

Collaborator

We should continue setting lbcMetricsCollector. The factory knows whether or not the metric registry is nil. If it's nil, we return a no-op collector. If the metric registry is not nil, we return an actual collector. The users of the metrics collector don't need to know either way.
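
To illustrate the shape being described, a rough sketch with hypothetical names (the PR's actual lbcmetrics factory, interface, and instrument types will differ):

// NewCollector hides the registry nil check behind the factory.
func NewCollector(registerer prometheus.Registerer) MetricCollector {
	if registerer == nil {
		// No registry configured: return a no-op implementation so callers
		// never have to nil-check the collector.
		return noopCollector{}
	}
	return &collector{instruments: newInstruments(registerer)}
}

// noopCollector satisfies the same interface but records nothing.
type noopCollector struct{}

func (noopCollector) ObserveControllerReconcileLatency(controller, stage string, fn func()) {
	fn() // still run the wrapped work, just skip the observation
}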

main.go Outdated
@@ -206,6 +212,14 @@ func main() {
setupLog.Error(err, "problem wait for podInfo repo sync")
os.Exit(1)
}

go func() {
if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
Collaborator

This will currently cause an NPE if a registry is not defined.

Collaborator Author

Good catch, I will add a check for metrics.Registry:

if metrics.Registry != nil {
		go func() {
			if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
				setupLog.Error(err, "problem periodically collect cache size")
				os.Exit(1)
			}
		}()
	}

}

func (s *loadBalancerSynthesizer) Synthesize(ctx context.Context) error {
defer s.metricsCollector.ObserveControllerReconcileLatency("service/ingress", "synthesize load balancer")
Collaborator

It might be useful to distinguish between ALB / NLB here.

Collaborator Author

Yeah, agree, I will update it to distinguish between ALB and NLB.
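
One possible shape for that (purely illustrative; it assumes the model exposes the load balancer type, e.g. "application" vs "network", and that synthesizeFn wraps the existing synthesize work in a closure as discussed above):

// Label by LB type so ALB and NLB latencies are reported separately,
// instead of the generic "service/ingress" label.
lbTypeLabel := string(resLB.Spec.Type) // assumption: "application" for ALB, "network" for NLB
s.metricsCollector.ObserveControllerReconcileLatency(lbTypeLabel, "synthesize_load_balancer", synthesizeFn)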

return elbv2model.LoadBalancerSpec{}, err
}
coIPv4Pool, err := t.buildLoadBalancerCOIPv4Pool(ctx)
if err != nil {
t.metricsCollector.ObserveControllerReconcileError("ingress", "build model error", "build customer owned IPv4 pool")
Collaborator

Instead of returning a plain error here, it might be good to define a custom error struct that embeds the error message and the various metric fields. The plus side here is that
1/ It's impossible to forget to add new metrics for new fields.
2/ It's hopefully a little more readable.

LMK what you think.

Collaborator Author

Good point, I agree. Will implement that.
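
A rough sketch of what such an error type could look like, based on the errmetrics.NewErrorWithMetrics calls that show up later in this PR (field names and the collector method signature here are assumptions, not the PR's actual code):

package errmetrics

// MetricCollector is only the slice of the collector interface this sketch needs.
type MetricCollector interface {
	ObserveControllerReconcileError(resourceType string, errorCategory string)
}

// ErrorWithMetrics wraps an error together with the fields needed to emit a
// reconcile error metric, so the metric cannot be forgotten at the call site.
type ErrorWithMetrics struct {
	ResourceType  string
	ErrorCategory string
	Err           error
}

func (e *ErrorWithMetrics) Error() string { return e.Err.Error() }

func (e *ErrorWithMetrics) Unwrap() error { return e.Err }

// NewErrorWithMetrics records the error metric and returns the wrapped error,
// matching calls like errmetrics.NewErrorWithMetrics("service", "deploy_model_error", err, collector).
func NewErrorWithMetrics(resourceType string, errorCategory string, err error, mc MetricCollector) *ErrorWithMetrics {
	if mc != nil {
		mc.ObserveControllerReconcileError(resourceType, errorCategory)
	}
	return &ErrorWithMetrics{ResourceType: resourceType, ErrorCategory: errorCategory, Err: err}
}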

// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
if statusCode == "401" || statusCode == "403" {
c.instruments.apiCallAuthErrorsTotal.With(map[string]string{
labelService: service,
Collaborator

nit: you define this same string map for every case.

Collaborator Author

will fix it

labelStatusCode: statusCode,
labelErrorCode: errorCode,
}).Inc()
} else if errorCode == "ServiceLimitExceeded" {
Collaborator

Can you double check that the current metrics don't cover these cases? If they don't, it would be good to make these cases a little more generic. Right now, all the cases are tailored to ELB specific error codes.

Collaborator Author

Same here. I know we are able to derive this from the current API metrics. However, I was thinking we might want some aggregated metrics for the important errors we care about?

Collaborator

Agreed. Sounds good.
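
Putting both points together (shared label map, slightly less ELB-specific classification), a possible sketch; the instrument field names other than apiCallAuthErrorsTotal, and the error codes beyond 401/403 and ServiceLimitExceeded, are assumptions here:

// Build the label set once instead of repeating it per case.
labels := map[string]string{
	labelService:    service,
	labelStatusCode: statusCode,
	labelErrorCode:  errorCode,
}
switch {
case statusCode == "401" || statusCode == "403":
	c.instruments.apiCallAuthErrorsTotal.With(labels).Inc()
case errorCode == "ServiceLimitExceeded" || errorCode == "LimitExceeded":
	c.instruments.apiCallLimitExceededTotal.With(labels).Inc()
case errorCode == "Throttling" || errorCode == "ThrottlingException":
	c.instruments.apiCallThrottledTotal.With(labels).Inc()
case errorCode == "ValidationError":
	c.instruments.apiCallValidationErrorsTotal.With(labels).Inc()
}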

apiCallAuthErrorsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Subsystem: metricSubSystem,
Name: metricAPIAuthErrorsTotal,
Help: "Number of failed AWS API calls that due to auth or authrorization failures",
Collaborator

typo: remove that

Collaborator

same for all these descriptions it looks like.

Collaborator Author

Good catch, will remove them

c.ObserveControllerCacheSize("service", len(svcList.Items))

// Collect TargetGroupBlinding cache size
tgbList := &elbv2api.TargetGroupBindingList{}
Collaborator

can you collect the deferred target queue cache size too?

Collaborator Author

Sure, will add.

}
}

func (c *Collector) CollectCacheSize(ctx context.Context) error {
Collaborator

We should make this a bit more generic as ultimately it's the same code block repeated. Basically defining a "collectable" resource map and then allowing the type to be passed in might work. Let's talk offline about how to make this more extendable.

Collaborator Author

Agree, "collectable" resource map makes sense. will update it

// MetricControllerCacheObjectCount tracks the total number of object in the controller runtime cache.
MetricControllerCacheObjectCount = "controller_cache_object_total"
// MetricPodReadinessGateFlipAboveX tracks readiness gate flips that are X seconds old
MetricPodReadinessGateFlipAbove60Seconds = "readiness_gate_above_60_seconds"
Collaborator

This should be derivable from the buckets we've defined.

Collaborator Author

Are you saying MetricPodReadinessGateFlipAbove60Seconds is redundant? Yeah, I know we are able to derive how many of them exceed 60s and 90s using PromQL. However, I was thinking we might want some direct metrics?

Collaborator

In this case I think we're just duplicating the bucketing function that Prometheus supports out of the box. If we want these 2 specific buckets, adding them to this vector makes more sense. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/metrics/lbc/instruments.go#L32

Collaborator Author

Agree, I will remove the MetricPodReadinessGateFlipAbove60Seconds and MetricPodReadinessGateFlipAbove90Seconds metrics.
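
For reference, adding the explicit boundaries to the histogram could look roughly like this (the instrument name, label set, and the bucket values other than 60/90 are illustrative assumptions):

podReadinessFlipSeconds := prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: metricSubSystem,
	Name:      "readiness_gate_ready_seconds",
	Help:      "Time until the pod readiness gate was flipped to ready",
	// Include 60 and 90 second boundaries so "above 60s/90s" counts can be read
	// straight from the bucket series instead of dedicated counters.
	Buckets: []float64{10, 30, 60, 90, 120, 180, 300},
}, []string{"namespace"})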

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
@M00nF1sh (Collaborator) commented Feb 19, 2025

Having whitespace in such "label" values could cause issues when consuming these metrics in downstream services/code that cannot handle whitespace correctly.

Maybe switch things like "add finalizers" to "add_finalizers" (make it a machine-readable identifier).

Collaborator Author

Good point, will fix it

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

This defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers") won't work; it will always report 0. You need to wrap the observation in an anonymous function and call that in the defer.

Collaborator Author

Yeah, like Zac suggested above,

I will have:

	var err error
	finalizerFn := func() {
		err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
	}
	r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add_finalizers", finalizerFn)
	if err != nil {
		return err
	}
func (c *Collector) ObserveControllerReconcileLatency(controller string, stage string, fn func()) {
	start := time.Now()
	defer func() {
		c.instruments.controllerReconcileLatency.With(prometheus.Labels{
			labelController:     controller,
			labelReconcileStage: stage,
		}).Observe(time.Since(start).Seconds())
	}()
	fn()
}

if err := r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer); err != nil {
r.eventRecorder.Event(tgb, corev1.EventTypeWarning, k8s.TargetGroupBindingEventReasonFailedAddFinalizer, fmt.Sprintf("Failed add finalizer due to %v", err))
return err
}

defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "reconcile")
@M00nF1sh (Collaborator) commented Feb 19, 2025

This doesn't seem to work due to how defer works in Go.
Ideally we should collect the latency spent on each task, instead of "latency spent since stage X".
There is a difference between

  • latency spent on adding finalizers
  • latency since adding the finalizer until the reconcile finishes.

The first one is clearly the better option and allows us to monitor each task.

Collaborator Author

Agree, will collect latency spent on each task

metricAPIAuthErrorsTotal = "api_call_auth_errors_total"
metricAPILimitExceededTotal = "api_call_limit_exceeded_total"
metricAPIThrottledTotal = "api_call_throttled_total"
metricAPIValidationErrorTotal = "api_call_validation_error_total"
Collaborator

nit: maybe be consistent in using either Error or Errors for the metric naming.

Collaborator Author

Good point, updated it to use Errors

@@ -71,6 +72,38 @@ func WithSDKCallMetricCollector(c *Collector) func(stack *smithymiddleware.Stack
labelStatusCode: statusCode,
labelErrorCode: errorCode,
}).Inc()

// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
Collaborator

Do we only consider ELBv2 errors in this PR? How about others, like EC2 API errors?

Collaborator Author

Yeah, this only considered ELB. I think the EC2 API also matters. I can add some common errors based on https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html.
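
For example, the classification switch sketched earlier could also recognize some of the EC2 common error codes from that page (the exact set to cover is an assumption here):

// EC2 common error codes, per
// https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
case errorCode == "RequestLimitExceeded":
	// EC2's request throttling error
	c.instruments.apiCallThrottledTotal.With(labels).Inc()
case errorCode == "AuthFailure" || errorCode == "UnauthorizedOperation":
	c.instruments.apiCallAuthErrorsTotal.With(labels).Inc()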

@wweiwei-li wweiwei-li force-pushed the ingress-metrics branch 2 times, most recently from c5d86dd to 5a7af88, on February 20, 2025 02:14
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 24, 2025
if err != nil {
return err
return errmetrics.NewErrorWithMetrics("targetGroupBinding", "reconcile_targetgroupblinding_error", err, r.metricsCollector)
Collaborator

typo: blinding -> binding

@@ -93,6 +101,7 @@ type targetGroupBindingReconciler struct {
// +kubebuilder:rbac:groups="discovery.k8s.io",resources=endpointslices,verbs=get;list;watch

func (r *targetGroupBindingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (ctrl.Result, error) {
r.reconcileCounters.IncrementTGB(req.NamespacedName)
Collaborator

nice!

updateTargetGroupBindingStatusFn := func() {
err = r.updateTargetGroupBindingStatus(ctx, tgb)
}
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update_status", updateTargetGroupBindingStatusFn)
Collaborator

Can you please define the resource as a const? It's re-used for every metric collection step. I think the step names are unique and don't need to be consts.

dnsResolveAndUpdateStatus := func() {
lbDNS, err := lb.DNSName().Resolve(ctx)
if err != nil {
return
Collaborator

I believe you still need to propagate this error to the parent function. Right now, it will silently fail.

if err != nil {
return
}
if err := r.updateIngressGroupStatus(ctx, ingGroup, lbDNS); err != nil {
Collaborator

Same comment: when using this nested function approach, we need to propagate the error to the parent function.
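
One way to do that (a sketch using the same closure-based latency collector; the stage label value here is illustrative): capture the error in the enclosing scope and check it after the helper returns.

var statusErr error
dnsResolveAndUpdateStatusFn := func() {
	lbDNS, err := lb.DNSName().Resolve(ctx)
	if err != nil {
		statusErr = err
		return
	}
	statusErr = r.updateIngressGroupStatus(ctx, ingGroup, lbDNS)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "resolve_dns_and_update_status", dnsResolveAndUpdateStatusFn)
if statusErr != nil {
	// Propagated to the parent so the reconcile is retried instead of failing silently.
	return statusErr
}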

removeGroupFinalizerFn := func() {
err = r.groupFinalizerManager.RemoveGroupFinalizer(ctx, ingGroupID, ingGroup.InactiveMembers)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "remove_group_finalizer", removeGroupFinalizerFn)
Collaborator

please define ingress as a constant.

if err != nil {
return err
return errmetrics.NewErrorWithMetrics("service", "deploy_model_error", err, r.metricsCollector)
Collaborator

same, please define service as a const.
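
Something like the following, perhaps (const names are illustrative, not the PR's actual identifiers):

// Controller/resource label values reused across metric collection steps.
const (
	controllerTargetGroupBinding = "targetGroupBinding"
	controllerIngress            = "ingress"
	controllerService            = "service"
)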

@@ -206,6 +210,25 @@ func main() {
setupLog.Error(err, "problem wait for podInfo repo sync")
os.Exit(1)
}

if metrics.Registry != nil {
Collaborator

I'd prefer to just use the no-op collector instead of nil-checking the registry. It makes the code easier to reason about.

Err: err,
}

if metricCollector != nil {
Collaborator

same here, let's just utilize the no-op implementation so we don't need nil checks.

setupLog.Info("starting collect cache size")
if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
setupLog.Error(err, "problem periodically collect cache size")
os.Exit(1)
Collaborator

What startup error(s) do we expect here?
