
clustering traffic admission control #2970

Open · thampiotr wants to merge 31 commits into main from thampiotr/clustering-traffic-admission-control
Conversation

@thampiotr (Contributor) commented Mar 12, 2025

PR Description

Added the --cluster.wait-for-size and --cluster.wait-timeout flags, which allow specifying the minimum cluster size required before components that use clustering begin processing traffic, so that adequate cluster capacity is available.
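
For illustration, a hypothetical invocation (a sketch only: the config path and flag values are placeholders, and --cluster.enabled=true is assumed as the usual prerequisite for clustered deployments):

```shell
alloy run /etc/alloy/config.alloy \
  --cluster.enabled=true \
  --cluster.wait-for-size=3 \
  --cluster.wait-timeout=5m
```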

Extended the existing tests, including the e2e tests added previously.

Which issue(s) this PR fixes

Fixes #201

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr force-pushed the thampiotr/clustering-traffic-admission-control branch from 76ddbff to 2bb4e5b Compare March 26, 2025 15:13

@thampiotr (author) commented:

I want to rename this file to cluster.go and the existing cluster.go to service.go, as that makes more sense, but it messes up the diff a lot, so I will leave this for later.

@thampiotr thampiotr marked this pull request as ready for review March 27, 2025 15:15
@thampiotr thampiotr requested review from clayton-cornell and a team as code owners March 27, 2025 15:15
@thampiotr (author) left a comment:

It's ready for feedback; I have just a few small things to address in the meantime.

@dehaansa (Contributor) commented:

In cases with very large k8s clusters, for example, where discovery is immense and costly, I think we would want to wait to even do discovery until after the cluster has converged.

Do you agree? Should there be an additional option to wait for the cluster to converge before doing any work?

@thampiotr (author) commented Mar 31, 2025

In cases with very large k8s clusters, for example, where discovery is immense and costly, I think we would want to wait to even do discovery until after the cluster has converged.

Do you agree? Should there be an additional option to wait for the cluster to converge before doing any work?

Discovery is the same for each instance, regardless of the number of instances in the cluster, so stopping them from doing work will not improve things. Every instance needs to be able to handle the entire cluster's discovery in the current architecture. This PR follows the design that I shared here, where we explicitly said that:

we scope this behaviour only to components that support clustering. Other components will run as usual.

I think scaling discovery is a separate problem to address (some plans for what to do are here), and once we have it sharded in some way, we can definitely include the minimum cluster size requirement in the future.

@thampiotr (author) commented:

I will still need to add the following, as per the design:

  • Debug info should ideally explain the situation to the user in the UI.
  • The clustering overview dashboard should show clearly when the minimum size is not met. May need to add metrics for it.

I would like to do this in a follow-up PR.

@thampiotr thampiotr force-pushed the thampiotr/clustering-traffic-admission-control branch from fa2747c to 50df532 Compare March 31, 2025 12:54
@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Mar 31, 2025
@thampiotr thampiotr requested a review from a team as a code owner April 1, 2025 14:13
@@ -269,62 +299,16 @@ func (s *Service) Run(ctx context.Context, host service.Host) error {
ctx, cancel := context.WithCancel(ctx)
defer cancel()

limiter := rate.NewLimiter(rate.Every(stateUpdateMinInterval), 1)
s.node.Observe(ckit.FuncObserver(func(peers []peer.Peer) (reregister bool) {
tracer := s.tracer.Tracer("")
@thampiotr (author) commented:

This logic was moved below to a goroutine that handles dispatching cluster change notifications. This is because when the cluster minimum-size timer expires, we also need to dispatch cluster change notifications even though the peers in ckit didn't change.
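
A rough sketch of that pattern (hypothetical names, not the actual PR code): a single dispatcher goroutine owns the notification fan-out, so both a ckit peer update and the expiry of the minimum-cluster-size wait go through the same path.

```go
// Hypothetical sketch only; names, channel types, and package layout are illustrative.
package cluster

import (
	"context"

	"github.com/grafana/ckit/peer"
)

func runNotificationDispatcher(
	ctx context.Context,
	peerUpdates <-chan []peer.Peer, // fed from s.node.Observe(...)
	minSizeWaitExpired <-chan struct{}, // fires when the wait timeout elapses
	notifyAll func(), // dispatches NotifyClusterChange to all clustered components
) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-peerUpdates:
			// Peers changed; clustered components must re-evaluate ownership.
			notifyAll()
		case <-minSizeWaitExpired:
			// Peers did not change, but the minimum-size wait just ended,
			// so components still need a cluster-change notification.
			notifyAll()
		}
	}
}
```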

Comment on lines 208 to 209
// Set the gauge to the configured minimum cluster size
minClusterSizeGauge.Set(float64(opts.MinimumClusterSize))
@thampiotr (author) commented:

Adding this static metric so we can clearly show on dashboards when we're below the minimum. I will make the dashboard changes in a follow-up PR.
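
For reference, one way such a static gauge could be wired up with prometheus/client_golang (a sketch; the metric name, and the assumption that opts carries a Registerer alongside MinimumClusterSize, are illustrative):

```go
// Illustrative fragment, not the PR's actual code.
minClusterSizeGauge := prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "cluster_minimum_size",
	Help: "Minimum cluster size required before clustered components admit traffic.",
})
opts.Registerer.MustRegister(minClusterSizeGauge)

// Set once at startup; dashboards can compare this against the live peer
// count to highlight when the cluster is below the required size.
minClusterSizeGauge.Set(float64(opts.MinimumClusterSize))
```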

@thampiotr thampiotr force-pushed the thampiotr/clustering-traffic-admission-control branch from 643510b to dcc03cd Compare April 1, 2025 16:36
Comment on lines +76 to +77
// slow components can currently lead to timeouts and communication errors
// TODO: consider decoupling cluster operations from runtime/components performance
@thampiotr (author) commented:

I have actually done this by putting the dispatching of cluster updates on a separate goroutine. TODO: remove this comment.

_, subSpan := tracer.Start(spanCtx, "NotifyClusterChange", trace.WithSpanKind(trace.SpanKindInternal))
subSpan.SetAttributes(attribute.String("component_id", comp.ID.String()))

clusterComponent.NotifyClusterChange()
Reviewer (Contributor) commented:

Should this fire in a goroutine? To ensure all components get notified close to simultaneously?
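
A sketch of the concurrent variant being suggested (illustrative; clusterComponents stands for whatever slice the surrounding loop iterates over):

```go
// Illustrative fragment, not the PR's actual code.
var wg sync.WaitGroup
for _, cc := range clusterComponents {
	wg.Add(1)
	go func(c interface{ NotifyClusterChange() }) {
		defer wg.Done()
		// Each component is notified in its own goroutine, so one slow
		// component no longer delays notifications to the others.
		c.NotifyClusterChange()
	}(cc)
}
// Wait so the surrounding span still covers the whole notification round.
wg.Wait()
```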

"minimum_cluster_size", c.opts.MinimumClusterSize,
"peers_count", len(c.sharder.Peers()),
)
c.clusterChangeCallback()
Reviewer (Contributor) commented:

Feels off that the callback is called before releasing the lock, but maybe I'm thinking about it wrong.
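
One common way to address that, sketched under the assumption that the struct guards its state with a mutex field named mut:

```go
// Illustrative fragment, not the PR's actual code.
c.mut.Lock()
cb := c.clusterChangeCallback // copy the callback while holding the lock
c.mut.Unlock()

if cb != nil {
	cb() // invoke outside the critical section to avoid re-entrancy or deadlocks
}
```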

span.SetAttributes(attribute.Int("minimum_cluster_size", s.opts.MinimumClusterSize))

// Notify all components about the clustering change.
components := component.GetAllComponents(host, component.InfoOptions{})
Reviewer (Contributor) commented:

I'm trying to think through this change:

  1. We get a new callback on node.Observe() from ckit. This triggers a notification.
  2. The notification calls all cluster-aware components' NotifyClusterChange(). This triggers a Ready() call in most/all cluster-aware components.
  3. The first cluster-aware component (dependent on the limiter) triggers a relevant stateChange if there is one.
  4. The state change triggers a notification, which calls all cluster-aware components' NotifyClusterChange().
  5. All? Cluster-aware components hit the limiter.

Does this sound right? Something here feels off, if I'm understanding it correctly.
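
For context on step 5, a minimal sketch of how a rate.Limiter like the one in the diff above can gate dispatch (illustrative only; the actual code may use Wait rather than Allow):

```go
// Illustrative fragment, not the PR's actual code.
limiter := rate.NewLimiter(rate.Every(stateUpdateMinInterval), 1)

if limiter.Allow() {
	// Dispatch the cluster-change notification now.
} else {
	// Coalesce: skip this trigger; a later observation within the
	// interval will pick up the latest peer state anyway.
}
```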

Labels: type/docs (Docs Squad label across all Grafana Labs repos)
Linked issue that merging may close: Proposal: Only attempt to send metrics after joining a sufficiently large cluster
3 participants