add prometheus metrics #4056

Open · wants to merge 1 commit into base: main
Conversation

@wweiwei-li (Collaborator) commented Feb 14, 2025

Description

  • metricAPIPermissionErrorsTotal
  • metricAPILimitExceededTotal
  • metricAPIThrottledTotal
  • metricAPIValidationErrorTotal
  • MetricControllerReconcileErrors tracks the total number of controller errors by error type.
  • MetricControllerReconcileStageDuration tracks latencies of different reconcile stages.
  • MetricWebhookValidationFailure tracks the total number of validation errors by error type.
  • MetricWebhookMutationFailure tracks the total number of mutation errors by error type.
  • MetricControllerCacheObjectCount tracks the total number of objects in the controller runtime cache.
  • MetricControllerTopTalker

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2025
@wweiwei-li wweiwei-li changed the title from "add prometheusmetrics" to "add prometheus metrics" on Feb 14, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wweiwei-li

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 14, 2025
@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

I don't think this works correctly. If the idea is to get the latency of the reconcile request, we need to get a timestamp at the start of the function. Currently, looking at the implementation of ObserveControllerReconcileLatency, it will always report 0.

Collaborator Author

Yeah, it is not working correctly. My original plan was something like this. Would this make more sense?

reconcileAddFinalizerStartTime := time.Now()
....
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers", time.Since(econcileAddFinalizerStartTime))

reconcileUpdateStatusStartTime := time.Now()
.....

r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update status", time.Since(reconcileUpdateStatusStartTime))

Collaborator

Your suggestion works but we can cut down on code duplication. I would suggest defining something like this:

func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	defer mc.ObserveControllerReconcileLatency(resource, step, time.Now().Sub(start))
	fn()
}

Then you can call it like:

var err error
finalizerFn := func () {
   err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
} 
r.metricsCollector.latencyHelper("targetGroupBinding", "add finalizers", finalizerFn)
... handle error here ...

Let me know what you think.

Collaborator

I actually forgot about the quirk with using time.Now and deferred statements.

https://stackoverflow.com/questions/72965657/measure-elapsed-time-with-a-defer-statement-in-go

It's a small change to the example.
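
For reference, a minimal sketch of that small change to the helper above (same hypothetical latencyHelper): deferring an anonymous function means time.Since(start) is only evaluated after fn returns, rather than when the defer statement is registered.

func (mc *MetricsCollector) latencyHelper(resource string, step string, fn func()) {
	start := time.Now()
	// Deferring a closure delays evaluating time.Since(start) until fn has finished.
	// With `defer mc.ObserveControllerReconcileLatency(resource, step, time.Since(start))`,
	// the arguments are evaluated immediately and the helper would always record ~0.
	defer func() {
		mc.ObserveControllerReconcileLatency(resource, step, time.Since(start))
	}()
	fn()
}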

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

Just so I'm clear, the idea is to provide more granular reconcile latency metrics than what is already provided by controller-runtime? Have you looked into what constructs controller-runtime provides so we don't need to roll our own?

Collaborator Author

Yeah, the idea is to provide more granular reconcile latency metrics. I checked controller-runtime; it only provides end-to-end reconcile time.

main.go Outdated
@@ -107,6 +107,12 @@ func main() {
os.Exit(1)
}

var lbcMetricsCollector *lbcmetrics.Collector
Collaborator

Why did you change this? The original goal of this implementation was to avoid having to nil check when no metrics were requested.

Collaborator Author

I think that's because I saw awsMetricsCollector had the nil check, and I didn't see lbcmetrics.NewCollector() handle a nil metrics.Registry internally, so I had a feeling we needed to apply the same nil check. Let me know if I am wrong.

Collaborator

We should continue setting lbcMetricsCollector. The factory knows whether or not the metric registry is nil. If it's nil, we return a no-op collector. If the metric registry is not nil, we return an actual collector. The users of the metrics collector don't need to know either way.
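
To illustrate the shape being described, a rough sketch with hypothetical names (the PR's actual lbcmetrics factory, interface, and instrument types will differ):

// NewCollector hides the registry nil check behind the factory.
func NewCollector(registerer prometheus.Registerer) MetricCollector {
	if registerer == nil {
		// No registry configured: return a no-op implementation so callers
		// never have to nil-check the collector.
		return noopCollector{}
	}
	return &collector{instruments: newInstruments(registerer)}
}

// noopCollector satisfies the same interface but records nothing.
type noopCollector struct{}

func (noopCollector) ObserveControllerReconcileLatency(controller, stage string, fn func()) {
	fn() // still run the wrapped work, just skip the observation
}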

main.go Outdated
@@ -206,6 +212,14 @@ func main() {
setupLog.Error(err, "problem wait for podInfo repo sync")
os.Exit(1)
}

go func() {
if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
Collaborator

This will currently cause an NPE if a registry is not defined.

Collaborator Author

Good catch, I will add a check for metrics.Registry:

if metrics.Registry != nil {
		go func() {
			if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
				setupLog.Error(err, "problem periodically collect cache size")
				os.Exit(1)
			}
		}()
	}

}

func (s *loadBalancerSynthesizer) Synthesize(ctx context.Context) error {
defer s.metricsCollector.ObserveControllerReconcileLatency("service/ingress", "synthesize load balancer")
Collaborator

It might be useful to distinguish between ALB / NLB here.

Collaborator Author

Yeah, agree, I will update it to distinguish between ALB and NLB.
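
One possible shape for that (purely illustrative; it assumes the model exposes the load balancer type, e.g. "application" vs "network", and that synthesizeFn wraps the existing synthesize work in a closure as discussed above):

// Label by LB type so ALB and NLB latencies are reported separately,
// instead of the generic "service/ingress" label.
lbTypeLabel := string(resLB.Spec.Type) // assumption: "application" for ALB, "network" for NLB
s.metricsCollector.ObserveControllerReconcileLatency(lbTypeLabel, "synthesize_load_balancer", synthesizeFn)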

return elbv2model.LoadBalancerSpec{}, err
}
coIPv4Pool, err := t.buildLoadBalancerCOIPv4Pool(ctx)
if err != nil {
t.metricsCollector.ObserveControllerReconcileError("ingress", "build model error", "build customer owned IPv4 pool")
Collaborator

Instead of returning a plain error here, it might be good to define a custom error struct that embeds the error message and the various metric fields. The plus side here is that
1/ It's impossible to forget to add new metrics for new fields.
2/ It's hopefully a little more readable.

LMK what you think.

Collaborator Author

Good point, I agree. Will implement that.
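
A rough sketch of what such an error type could look like, based on the errmetrics.NewErrorWithMetrics calls that show up later in this PR (field names and the collector method signature here are assumptions, not the PR's actual code):

package errmetrics

// MetricCollector is only the slice of the collector interface this sketch needs.
type MetricCollector interface {
	ObserveControllerReconcileError(resourceType string, errorCategory string)
}

// ErrorWithMetrics wraps an error together with the fields needed to emit a
// reconcile error metric, so the metric cannot be forgotten at the call site.
type ErrorWithMetrics struct {
	ResourceType  string
	ErrorCategory string
	Err           error
}

func (e *ErrorWithMetrics) Error() string { return e.Err.Error() }

func (e *ErrorWithMetrics) Unwrap() error { return e.Err }

// NewErrorWithMetrics records the error metric and returns the wrapped error,
// matching calls like errmetrics.NewErrorWithMetrics("service", "deploy_model_error", err, collector).
func NewErrorWithMetrics(resourceType string, errorCategory string, err error, mc MetricCollector) *ErrorWithMetrics {
	if mc != nil {
		mc.ObserveControllerReconcileError(resourceType, errorCategory)
	}
	return &ErrorWithMetrics{ResourceType: resourceType, ErrorCategory: errorCategory, Err: err}
}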

// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
if statusCode == "401" || statusCode == "403" {
c.instruments.apiCallAuthErrorsTotal.With(map[string]string{
labelService: service,
Collaborator

nit: you define this same string map for every case.

Collaborator Author

will fix it

labelStatusCode: statusCode,
labelErrorCode: errorCode,
}).Inc()
} else if errorCode == "ServiceLimitExceeded" {
Collaborator

Can you double check that the current metrics don't cover these cases? If they don't, it would be good to make these cases a little more generic. Right now, all the cases are tailored to ELB specific error codes.

Collaborator Author

Same here. I know we are able to derive this from the current API metrics. However, I was thinking we might want some aggregated metrics for the important errors we care about?

Collaborator

Agreed. Sounds good.
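
Putting both points together (shared label map, slightly less ELB-specific classification), a possible sketch; the instrument field names other than apiCallAuthErrorsTotal, and the error codes beyond 401/403 and ServiceLimitExceeded, are assumptions here:

// Build the label set once instead of repeating it per case.
labels := map[string]string{
	labelService:    service,
	labelStatusCode: statusCode,
	labelErrorCode:  errorCode,
}
switch {
case statusCode == "401" || statusCode == "403":
	c.instruments.apiCallAuthErrorsTotal.With(labels).Inc()
case errorCode == "ServiceLimitExceeded" || errorCode == "LimitExceeded":
	c.instruments.apiCallLimitExceededTotal.With(labels).Inc()
case errorCode == "Throttling" || errorCode == "ThrottlingException":
	c.instruments.apiCallThrottledTotal.With(labels).Inc()
case errorCode == "ValidationError":
	c.instruments.apiCallValidationErrorsTotal.With(labels).Inc()
}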

apiCallAuthErrorsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Subsystem: metricSubSystem,
Name: metricAPIAuthErrorsTotal,
Help: "Number of failed AWS API calls that due to auth or authrorization failures",
Collaborator

typo: remove that

Collaborator

same for all these descriptions it looks like.

Collaborator Author

Good catch, will remove them

c.ObserveControllerCacheSize("service", len(svcList.Items))

// Collect TargetGroupBlinding cache size
tgbList := &elbv2api.TargetGroupBindingList{}
Collaborator

can you collect the deferred target queue cache size too?

Collaborator Author

Sure, will add.

}
}

func (c *Collector) CollectCacheSize(ctx context.Context) error {
Collaborator

We should make this a bit more generic as ultimately it's the same code block repeated. Basically defining a "collectable" resource map and then allowing the type to be passed in might work. Let's talk offline about how to make this more extendable.

Collaborator Author

Agree, "collectable" resource map makes sense. will update it

// MetricControllerCacheObjectCount tracks the total number of object in the controller runtime cache.
MetricControllerCacheObjectCount = "controller_cache_object_total"
// MetricPodReadinessGateFlipAboveX tracks readiness gate flips that are X seconds old
MetricPodReadinessGateFlipAbove60Seconds = "readiness_gate_above_60_seconds"
Collaborator

This should be derivable from the buckets we've defined.

Collaborator Author

Are you saying MetricPodReadinessGateFlipAbove60Seconds is redundant? Yeah, I know we are able to derive how many of them exceed 60s and 90s using PromQL. However, I was thinking we might want some direct metrics?

Collaborator

In this case I think we're just duplicating the bucketing function that Prometheus supports out of the box. If we want these 2 specific buckets, adding them to this vector makes more sense. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/metrics/lbc/instruments.go#L32

Collaborator Author

Agree, I will remove the MetricPodReadinessGateFlipAbove60Seconds and MetricPodReadinessGateFlipAbove90Seconds metrics.
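
For reference, adding the explicit boundaries to the histogram could look roughly like this (the instrument name, label set, and the bucket values other than 60/90 are illustrative assumptions):

podReadinessFlipSeconds := prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: metricSubSystem,
	Name:      "readiness_gate_ready_seconds",
	Help:      "Time until the pod readiness gate was flipped to ready",
	// Include 60 and 90 second boundaries so "above 60s/90s" counts can be read
	// straight from the bucket series instead of dedicated counters.
	Buckets: []float64{10, 30, 60, 90, 120, 180, 300},
}, []string{"namespace"})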

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
@M00nF1sh (Collaborator) commented Feb 19, 2025

Having whitespace in such "label" values could cause issues when consuming these metrics in downstream services/code that cannot handle whitespace correctly.

Maybe switch things like "add finalizers" to "add_finalizers" (make it a machine-readable identifier).

Collaborator Author

Good point, will fix it

@@ -110,22 +114,26 @@ func (r *targetGroupBindingReconciler) reconcile(ctx context.Context, req reconc
}

func (r *targetGroupBindingReconciler) reconcileTargetGroupBinding(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers")
Collaborator

This defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add finalizers") won't work; it will always report 0. You need to wrap the observation in an anonymous function and call that in the defer.

Collaborator Author

Yeah, like Zac suggested above,

I will have:

	var err error
	finalizerFn := func() {
		err = r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer)
	}
	r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "add_finalizers", finalizerFn)
	if err != nil {
		return err
	}
func (c *Collector) ObserveControllerReconcileLatency(controller string, stage string, fn func()) {
	start := time.Now()
	defer func() {
		c.instruments.controllerReconcileLatency.With(prometheus.Labels{
			labelController:     controller,
			labelReconcileStage: stage,
		}).Observe(time.Since(start).Seconds())
	}()
	fn()
}

if err := r.finalizerManager.AddFinalizers(ctx, tgb, targetGroupBindingFinalizer); err != nil {
r.eventRecorder.Event(tgb, corev1.EventTypeWarning, k8s.TargetGroupBindingEventReasonFailedAddFinalizer, fmt.Sprintf("Failed add finalizer due to %v", err))
return err
}

defer r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "reconcile")
@M00nF1sh (Collaborator) commented Feb 19, 2025

This doesn't seem to work due to how defer works in Go.
Ideally we should collect the latency spent on each task, instead of "latency spent since stage X".
There is a difference between

  • latency spent on adding finalizers
  • latency since adding the finalizer until the reconcile finishes.

The first one is clearly the better option and allows us to monitor each task.

Collaborator Author

Agree, will collect latency spent on each task

metricAPIAuthErrorsTotal = "api_call_auth_errors_total"
metricAPILimitExceededTotal = "api_call_limit_exceeded_total"
metricAPIThrottledTotal = "api_call_throttled_total"
metricAPIValidationErrorTotal = "api_call_validation_error_total"
Collaborator

nit: maybe be consistent in using either Error or Errors for the metric naming.

Collaborator Author

Good point, updated it to use Errors

@@ -71,6 +72,38 @@ func WithSDKCallMetricCollector(c *Collector) func(stack *smithymiddleware.Stack
labelStatusCode: statusCode,
labelErrorCode: errorCode,
}).Inc()

// https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/CommonErrors.html
Collaborator

Do we only consider ELBv2 errors in this PR? How about others, like EC2 API errors?

Collaborator Author

Yeah, this only considered ELB. I think the EC2 API also matters. I can add some common errors based on https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html.
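
For example, the classification switch sketched earlier could also recognize some of the EC2 common error codes from that page (the exact set to cover is an assumption here):

// EC2 common error codes, per
// https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
case errorCode == "RequestLimitExceeded":
	// EC2's request throttling error
	c.instruments.apiCallThrottledTotal.With(labels).Inc()
case errorCode == "AuthFailure" || errorCode == "UnauthorizedOperation":
	c.instruments.apiCallAuthErrorsTotal.With(labels).Inc()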

@wweiwei-li wweiwei-li force-pushed the ingress-metrics branch 2 times, most recently from c5d86dd to 5a7af88, on February 20, 2025 02:14
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 24, 2025
if err != nil {
return err
return errmetrics.NewErrorWithMetrics("targetGroupBinding", "reconcile_targetgroupblinding_error", err, r.metricsCollector)
Collaborator

typo: blinding -> binding

@@ -93,6 +101,7 @@ type targetGroupBindingReconciler struct {
// +kubebuilder:rbac:groups="discovery.k8s.io",resources=endpointslices,verbs=get;list;watch

func (r *targetGroupBindingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (ctrl.Result, error) {
r.reconcileCounters.IncrementTGB(req.NamespacedName)
Collaborator

nice!

updateTargetGroupBindingStatusFn := func() {
err = r.updateTargetGroupBindingStatus(ctx, tgb)
}
r.metricsCollector.ObserveControllerReconcileLatency("targetGroupBinding", "update_status", updateTargetGroupBindingStatusFn)
Collaborator

Can you please define the resource as a const? It's re-used for every metric collection step. I think the step names are unique and don't need to be consts.

dnsResolveAndUpdateStatus := func() {
lbDNS, err := lb.DNSName().Resolve(ctx)
if err != nil {
return
Collaborator

I believe you still need to propagate this error to the parent function. Right now, it will silently fail.

if err != nil {
return
}
if err := r.updateIngressGroupStatus(ctx, ingGroup, lbDNS); err != nil {
Collaborator

Same comment: when using this nested function approach, we need to propagate the error to the parent function.
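
One way to do that (a sketch using the same closure-based latency collector; the stage label value here is illustrative): capture the error in the enclosing scope and check it after the helper returns.

var statusErr error
dnsResolveAndUpdateStatusFn := func() {
	lbDNS, err := lb.DNSName().Resolve(ctx)
	if err != nil {
		statusErr = err
		return
	}
	statusErr = r.updateIngressGroupStatus(ctx, ingGroup, lbDNS)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "resolve_dns_and_update_status", dnsResolveAndUpdateStatusFn)
if statusErr != nil {
	// Propagated to the parent so the reconcile is retried instead of failing silently.
	return statusErr
}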

removeGroupFinalizerFn := func() {
err = r.groupFinalizerManager.RemoveGroupFinalizer(ctx, ingGroupID, ingGroup.InactiveMembers)
}
r.metricsCollector.ObserveControllerReconcileLatency("ingress", "remove_group_finalizer", removeGroupFinalizerFn)
Collaborator

please define ingress as a constant.

if err != nil {
return err
return errmetrics.NewErrorWithMetrics("service", "deploy_model_error", err, r.metricsCollector)
Collaborator

same, please define service as a const.
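
Something like the following, perhaps (const names are illustrative, not the PR's actual identifiers):

// Controller/resource label values reused across metric collection steps.
const (
	controllerTargetGroupBinding = "targetGroupBinding"
	controllerIngress            = "ingress"
	controllerService            = "service"
)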

@@ -206,6 +210,25 @@ func main() {
setupLog.Error(err, "problem wait for podInfo repo sync")
os.Exit(1)
}

if metrics.Registry != nil {
Collaborator

I'd prefer to just use the no-op collector instead of nil-checking the registry. It makes the code easier to reason about.

Err: err,
}

if metricCollector != nil {
Collaborator

same here, let's just utilize the no-op implementation so we don't need nil checks.

setupLog.Info("starting collect cache size")
if err := lbcMetricsCollector.StartCollectCacheSize(ctx); err != nil {
setupLog.Error(err, "problem periodically collect cache size")
os.Exit(1)
Collaborator

What startup error(s) do we expect here?
