Add metrics for managed resources count #4031

oliviassss · 2025-01-22T18:55:31Z

Issue

Description

This PR adds metrics for the count of managed resources: ingresses, services of type LoadBalancer (NLB), and TargetGroupBindings. It also gets the count of AWS resources such as ALBs and NLBs via the Resource Groups Tagging API, and adds the tag:GetResources IAM permission to the policy template.

Test

Created 4 ingresses (2 of them in the same ingress group) and 2 services of type LoadBalancer (NLB). Verified that the Prometheus metrics are as expected.

dev-dsk-sonyingy-2c-61cf589a % k get ingress -A
NAMESPACE   NAME              CLASS   HOSTS   ADDRESS                                                                     PORTS   AGE
default     nginx-ingress-1   alb     *       internal-k8s-myinggroup-0af676b930-1495877518.us-west-2.elb.amazonaws.com   80      12m
default     nginx-ingress-2   alb     *       internal-k8s-myinggroup-0af676b930-1495877518.us-west-2.elb.amazonaws.com   80      12m
game-2048   ingress-2048      alb     *       k8s-game2048-ingress2-4d86c6a92e-1406978370.us-west-2.elb.amazonaws.com     80      13m
game-2048   ingress-2048-2    alb     *       k8s-game2048-ingress2-8d165b56df-1236487848.us-west-2.elb.amazonaws.com     80      16m

(25-01-22 1:35:31) <0> [~/EKS/LBC]
dev-dsk-sonyingy-2c-61cf589a % k get svc -A | grep LoadBalancer
default       ip-nlb-svc-01                       LoadBalancer   10.100.30.169    k8s-default-ipnlbsvc-c34fd4b56d-e30509379b365c24.elb.us-west-2.amazonaws.com   80:32320/TCP             9m4s
default       ip-nlb-svc-02                       LoadBalancer   10.100.210.187   k8s-default-ipnlbsvc-299bc353da-c2da5e41643e1f18.elb.us-west-2.amazonaws.com   80:32425/TCP             8m54s

(25-01-22 1:35:42) <0> [~/EKS/LBC]
dev-dsk-sonyingy-2c-61cf589a % k get targetgroupbindings.elbv2.k8s.aws -A
NAMESPACE   NAME                               SERVICE-NAME    SERVICE-PORT   TARGET-TYPE   AGE
default     k8s-default-ipnlbsvc-725fc8e5fe    ip-nlb-svc-02   80             ip            8m56s
default     k8s-default-ipnlbsvc-9b749841f7    ip-nlb-svc-01   80             ip            9m6s
default     k8s-default-nginxsvc-06b13125f0    nginx-svc03     80             ip            12m
default     k8s-default-nginxsvc-563deb1176    nginx-svc03     80             ip            12m
game-2048   k8s-game2048-service2-446c26be3b   service-2048    80             ip            16m
game-2048   k8s-game2048-service2-6abcfb7ee1   service-2048    80             ip            13m

In the metrics I have:

dev-dsk-sonyingy-2c-61cf589a % curl http://localhost:8080/metrics | grep lb_controller_managed
# HELP lb_controller_managed_albs_total Current number of ALBs managed by the controller
# TYPE lb_controller_managed_albs_total gauge
lb_controller_managed_albs_total 3
# HELP lb_controller_managed_ingress_count Number of ingresses managed by the AWS Load Balancer Controller.
# TYPE lb_controller_managed_ingress_count gauge
lb_controller_managed_ingress_count 4
# HELP lb_controller_managed_nlbs_total Current number of NLBs managed by the controller
# TYPE lb_controller_managed_nlbs_total gauge
lb_controller_managed_nlbs_total 2
# HELP lb_controller_managed_service_count Number of service type Load Balancers (NLBs) managed by the AWS Load Balancer Controller.
# TYPE lb_controller_managed_service_count gauge
lb_controller_managed_service_count 2
# HELP lb_controller_managed_targetgroupbinding_count Number of targetgroupbindings managed by the AWS Load Balancer Controller.
# TYPE lb_controller_managed_targetgroupbinding_count gauge
lb_controller_managed_targetgroupbinding_count 6
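
The test setup has 4 ingresses but the ALB gauge reads 3, because the two ingresses in the same ingress group share one load balancer. A minimal Go sketch of that arithmetic follows. It is not the controller's code: the `ingress` type, the `countALBs` helper, and the group name `my-ing-group` are hypothetical, and it assumes each ungrouped ingress gets its own ALB.

```go
package main

import "fmt"

// ingress is a hypothetical, simplified stand-in for a Kubernetes Ingress.
type ingress struct {
	Namespace string
	Name      string
	Group     string // explicit ingress group name; empty means ungrouped
}

// countALBs returns the number of distinct load balancers implied by the
// given ingresses: one per explicit group and one per ungrouped ingress.
func countALBs(ingresses []ingress) int {
	seen := make(map[string]struct{})
	for _, ing := range ingresses {
		key := "group/" + ing.Group
		if ing.Group == "" {
			// Assumption: an ungrouped ingress is keyed by namespace/name.
			key = "ingress/" + ing.Namespace + "/" + ing.Name
		}
		seen[key] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Mirrors the test setup above; the group name is made up.
	ingresses := []ingress{
		{Namespace: "default", Name: "nginx-ingress-1", Group: "my-ing-group"},
		{Namespace: "default", Name: "nginx-ingress-2", Group: "my-ing-group"},
		{Namespace: "game-2048", Name: "ingress-2048"},
		{Namespace: "game-2048", Name: "ingress-2048-2"},
	}
	fmt.Printf("ingresses: %d, ALBs: %d\n", len(ingresses), countALBs(ingresses))
	// prints "ingresses: 4, ALBs: 3"
}
```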

Checklist

Added tests that cover your change (if possible)
Added/modified documentation as required (such as the README.md, or the docs directory)
Manually tested
Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

Backfilled missing tests for code in same general area 🎉
Refactored something and made the world a better place 🌟

k8s-ci-robot · 2025-01-22T18:55:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oliviassss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [oliviassss]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

docs/install/iam_policy.json

main.go

zac-nixon · 2025-01-22T21:26:06Z

main.go

+			select {
+			case <-ticker.C:
+				// Update managed resource metrics
+				err := lbcMetricsCollector.UpdateManagedK8sResourceMetrics(context.Background())


collecting all these resources during the same tick might lead to sparse metrics. I would suggest a ticker per resource to improve performance and metric reliability.

makes sense, will do 3 tickers - 1 for k8s resources, 1 for ALB and 1 for NLB. Just in case the API call has latency, but it should be rare.

@zac-nixon Hi, I thought twice about it but decided to keep them in the same ticker, because I'd like all the metrics to be updated in one loop. I did increase the ticker interval to 2min to reduce unnecessary calls, since we don't expect super timely metrics. Also added a TODO to update the metrics per reconciliation.

M00nF1sh · 2025-01-30T18:46:47Z

pkg/metrics/lbc/collector.go

+			},
+		},
+	}
+	resources, err := c.rgt.GetResourcesAsList(ctx, req)


We should not call AWS APIs to get these counter metrics. This can have a significant performance impact when there is a large number of LBs.
It should be technically possible to get the number of LBs managed by the controller without using k8s apis.

Thanks! I used the RGT API, so it's just 1 API call every 2min. But I do agree it's better to update the metrics per CRUD event. Let me double check and get back.

M00nF1sh · 2025-01-30T18:53:55Z

main.go

@@ -202,6 +210,28 @@ func main() {
 		deferredTGBQueue.Run()
 	}()

+	// TODO: we can better improve this to update the metrics per reconcile
+	go func() {
+		ticker := time.NewTicker(2 * time.Minute)


This is not flexible, and we won't be able to get real-time metrics.
I think we should trigger this when service/ingress/ingressGroup events happen.
For example, trigger a function in metricsCollector from within ingressGroupController.

e.g. to track the number of ALBs:
(note: just a naive thought; there could be edge cases, like a manually removed finalizer, that need to be handled properly)

Inject an lbMetricsCollector into ingressGroupController. Then:

when it reconciles an ingressGroup, it calls lbMetricsCollector.trackIngressGroup(group's name, maybe some other params necessary like active members)

after it has successfully deleted an ingressGroup, it calls lbMetricsCollector.untrackIngressGroup(group's name)

Then lbMetricsCollector will have a correct, real-time view of the number of currently managed ingressGroups.
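
The track/untrack idea above can be sketched as a small, concurrency-safe set of group names. This is an illustration under assumptions, not the controller's implementation: the `groupTracker` type and its `count` method are hypothetical, only the names `trackIngressGroup` and `untrackIngressGroup` come from the comment, and edge cases such as a manually removed finalizer are not handled.

```go
package main

import (
	"fmt"
	"sync"
)

// groupTracker is a hypothetical collector that remembers which ingress
// groups are currently managed, so a gauge can report the set size.
type groupTracker struct {
	mu     sync.Mutex
	groups map[string]struct{}
}

func newGroupTracker() *groupTracker {
	return &groupTracker{groups: make(map[string]struct{})}
}

// trackIngressGroup records a group after a reconcile. It is idempotent,
// so repeated reconciles of the same group do not inflate the count.
func (t *groupTracker) trackIngressGroup(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.groups[name] = struct{}{}
}

// untrackIngressGroup forgets a group after it was deleted successfully.
// Untracking an unknown group is a no-op.
func (t *groupTracker) untrackIngressGroup(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.groups, name)
}

// count returns the number of currently tracked groups.
func (t *groupTracker) count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.groups)
}

func main() {
	tracker := newGroupTracker()
	tracker.trackIngressGroup("group-a")
	tracker.trackIngressGroup("group-a") // second reconcile of the same group
	tracker.trackIngressGroup("group-b")
	fmt.Println(tracker.count()) // prints 2

	tracker.untrackIngressGroup("group-a")
	fmt.Println(tracker.count()) // prints 1
}
```

A set keyed by group name, rather than a plain counter, keeps the value correct when the same group is reconciled many times. One trade-off is that the set starts empty after a controller restart and is only repopulated as groups are reconciled again.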

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Jan 22, 2025

k8s-ci-robot requested review from shraddhabang and zac-nixon January 22, 2025 18:55

k8s-ci-robot added the approved (indicates a PR has been approved by an approver from all required OWNERS files) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Jan 22, 2025

oliviassss force-pushed the metrics-improve branch from 83dec2e to a487343 Compare January 22, 2025 18:57

zac-nixon reviewed Jan 22, 2025

oliviassss force-pushed the metrics-improve branch from a487343 to 9700d0e Compare January 29, 2025 23:32

add metrics to track the managed resource count

4e1b9f3

oliviassss force-pushed the metrics-improve branch from 9700d0e to 4e1b9f3 Compare January 30, 2025 00:02

oliviassss added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jan 30, 2025

increase ticker to 2min

97fe20f

oliviassss removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jan 30, 2025

bump up go version

cdf7da9

M00nF1sh reviewed Jan 30, 2025
