CON-11819 spread sync of load balancers #802
Conversation
@@ -85,6 +86,14 @@ func (s *tickerSyncer) Sync(name string, period time.Duration, stopCh <-chan struct{}
 		klog.Errorf("%s failed: %s", name, err)
 	}

+	initialDelayTicker := time.NewTicker(initialDelay)
+	defer initialDelayTicker.Stop()
(nit) could use time.After() (https://pkg.go.dev/time#After) instead of a ticker
Using time.After() instead. Thanks
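A minimal sketch of what the loop could look like with `time.After()` in place of the initial-delay ticker. The `Sync` signature is inferred from the two diff hunks in this PR, and the `tickerSyncer` type and `k8s.io/klog/v2` import path are assumptions for illustration, not the exact upstream code:

```go
package syncer

import (
	"time"

	"k8s.io/klog/v2" // assumed import path for the klog calls shown in the diff
)

// tickerSyncer is a placeholder for the controller's syncer type.
type tickerSyncer struct{}

// Sync waits once for initialDelay, then runs fn every period until
// stopCh closes. time.After fits a one-shot wait better than a Ticker:
// there is nothing to Stop once the delay has elapsed.
func (s *tickerSyncer) Sync(name string, period, initialDelay time.Duration, stopCh <-chan struct{}, fn func() error) {
	select {
	case <-time.After(initialDelay):
	case <-stopCh:
		return
	}

	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		if err := fn(); err != nil {
			klog.Errorf("%s failed: %s", name, err)
		}
		select {
		case <-ticker.C:
		case <-stopCh:
			return
		}
	}
}
```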
@@ -126,7 +135,7 @@ func (r *ResourcesController) Run(stopCh <-chan struct{}) {
 		klog.Info("No cluster ID configured -- skipping cluster dependent syncers.")
 		return
 	}
-	go r.syncer.Sync("tags syncer", controllerSyncTagsPeriod, stopCh, r.syncTags)
+	go r.syncer.Sync("tags syncer", controllerSyncTagsPeriod, time.Second*time.Duration(rand.Int31n(300)), stopCh, r.syncTags)
Now that it's an initial delay, maybe let's increase the spread to ~10 min?
Bumped to 600 (10 minutes).
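Presumably the call site then becomes something like the following (inferred from the reply above; not quoted from the final diff):

```go
go r.syncer.Sync("tags syncer", controllerSyncTagsPeriod,
	time.Second*time.Duration(rand.Int31n(600)), stopCh, r.syncTags)
```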
Spread the sync period of the load balancers unevenly to prevent a spike of load balancer API calls at the same time. We saw an issue where multiple clusters hit their maintenance window at the same time and the CCM component is restarted simultaneously in each of them. The sync tag periods then line up closely across all of those CCM pods, so all of those clusters issue LIST API calls at nearly the same time, producing a spike every 15 minutes. This PR mitigates the issue by adding some randomness (up to 5 minutes) to the sync schedule as an initial delay. The sync period itself remains 15 minutes, but each pod's subsequent tag syncs are offset by 0 to 300 seconds.
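To make the effect concrete, here is a small, self-contained sketch of the schedule a single pod would follow under this scheme (the 15-minute period and 300-second jitter bound come from the description; the review thread later widened the bound to 600 seconds):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const period = 15 * time.Minute
	// Each CCM pod draws its own jitter once at startup, so pods that
	// restart at the same moment no longer issue LIST calls in lockstep.
	jitter := time.Second * time.Duration(rand.Int31n(300)) // 0–5 min
	for i := 0; i < 3; i++ {
		fmt.Printf("sync %d fires at t = %s\n", i+1, jitter+time.Duration(i)*period)
	}
}
```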