OTA-1427: Recompute node insights when MC/MCP changes #1144
base: main
Conversation
@hongkailiu: This pull request references OTA-1427 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 1a7430a to 6eeacd5 (Compare)
/test e2e-agnostic-usc-devpreview

1 similar comment

/test e2e-agnostic-usc-devpreview
Reconciling is working, as there are logs for it. The init seems not to be working ... 🤔

But ...
6eeacd5
to
45b0b70
Compare
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1885905947818201088/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-844f579bfd-mxg7z_update-status-controller.log | rg 'Syncing with key|which|Cached|Stored|Reconciling all nodes' | head -n 10
I0202 06:06:53.968786 1 nodeinformer.go:282] Stored 0 machineConfigPools in the cache
I0202 06:06:53.968954 1 nodeinformer.go:294] Stored 0 machineConfig versions in the cache
I0202 06:06:54.183789 1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/master
I0202 06:06:54.183939 1 nodeinformer.go:244] Cached MachineConfigPool master with selector node-role.kubernetes.io/master=
I0202 06:06:54.184005 1 nodeinformer.go:182] Reconciling all nodes as machine config pool "master" is refreshed
I0202 06:06:54.184104 1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/worker
I0202 06:06:54.184171 1 nodeinformer.go:244] Cached MachineConfigPool worker with selector node-role.kubernetes.io/worker=
I0202 06:06:54.184224 1 nodeinformer.go:182] Reconciling all nodes as machine config pool "worker" is refreshed
I0202 06:06:54.184279 1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-0
I0202 06:06:54.200901 1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-1

In the init phase, the lister was not synced yet. However, it caught up by the events on MCPs.
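For context, here is a minimal sketch of what that catch-up amounts to, assuming the PR's nodeInformerController type and its c.nodes lister (both visible in the review hunks below) plus an assumed queueKeyFor helper producing the "Kind/name" keys seen in the log; this is an illustration, not the PR's literal code:

```go
// Sketch: when an MCP/MC change is cached, re-queue every Node so its insight
// is recomputed against the refreshed pool/version data.
// Uses k8s.io/apimachinery/pkg/labels and k8s.io/client-go/util/workqueue.
func (c *nodeInformerController) reconcileAllNodes(queue workqueue.RateLimitingInterface) error {
	nodes, err := c.nodes.List(labels.Everything())
	if err != nil {
		// returning the error makes the controller retry the triggering key
		return err
	}
	for _, node := range nodes {
		queue.Add(queueKeyFor("Node", node.Name))
	}
	return nil
}
```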
Force-pushed from 07ebaa6 to 5a1a854 (Compare)

Force-pushed from 43096ac to 791b701 (Compare)
/retest-required

The log looks good.

/test e2e-hypershift

/cc @petr-muller
I could not warm up the cache with the listers from the informer because I do not know when the listers are synced, i.e., when they have the MCPs/MCs. My init function was useless because the listers were NOT synced when it was called. Moreover, from the testing, MC/MCP events always came earlier than node events, i.e., "requeuing a ton of nodes" did not happen.
Added it back in the 5th commit with the hook. Let's see if it helps in the test. If it still does not help, ...
My expectation is that the listers are synced before ...

Edit: From the comment at cluster-version-operator/pkg/updatestatus/start.go (lines 55 to 56 in 3af45bc) and "run asynchronously" in cluster-version-operator/vendor/github.com/openshift/library-go/pkg/controller/factory/factory.go (lines 182 to 183 in 8f0f8d6), I got the feeling that we will need ...
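To make the "hook" idea concrete, here is a rough sketch of warming the caches from a post-start hook; the hook signature and the registration below follow my reading of library-go's controller factory and are assumptions, not this PR's literal code:

```go
// warmCaches runs as a post-start hook, i.e. after the controller's informer
// caches have synced, so the MCP/MC listers should already be populated.
func (c *nodeInformerController) warmCaches(_ context.Context, _ factory.SyncContext) error {
	c.initializeCachesNoErrors()
	return nil
}

// assumed registration when building the controller:
//   factory.New().
//       WithInformers(nodeInformer, mcpInformer, mcInformer).
//       WithSync(c.sync).
//       WithPostStartHooks(c.warmCaches).
//       ToController("NodeInformer", eventRecorder)
```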
Force-pushed from 44c1f2c to 71e9c4c (Compare)
/test e2e-agnostic-usc-devpreview

Must-gather did not work in the last CI run.

/test e2e-agnostic-usc-devpreview
The ordering looks so nice even without explicitly calling ... Note that the first event arrived before our caches got warmed up (after the informers' caches synced), which verified my theory in #1144 (comment) ...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890400781975162880/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-7f56cd7468-ffg4q_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0214 15:31:07.406362 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0214 15:31:07.406389 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0214 15:31:07.406459 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0214 15:31:07.423666 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0214 15:31:07.423697 1 base_controller.go:82] Caches are synced for NodeInformer
I0214 15:31:07.423727 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0214 15:31:07.423772 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-0
I0214 15:31:07.423810 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0214 15:31:07.424105 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0214 15:31:07.423828 1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-0 belongs to machine config pool master
I0214 15:31:07.430259 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-1
I0214 15:31:07.430279 1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-1 belongs to machine config pool master
...

Let me try again. If it shows the same result, I intend to exclude the wait and let the reconciliation from MCP events do the job.

/test e2e-agnostic-usc-devpreview
Install failure in the last CI job.

/test e2e-agnostic-usc-devpreview
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890553751542960128/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-5dc5599f49-6ff4n_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 01:45:51.099734 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 01:45:51.100905 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 01:45:51.100990 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 01:45:51.117880 1 base_controller.go:82] Caches are synced for NodeInformer
I0215 01:45:51.118257 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 01:45:51.118563 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 01:45:51.118158 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 01:45:51.118719 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 01:45:51.118781 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-0
I0215 01:45:51.118859 1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-8g5b5-master-0 belongs to machine config pool master
I0215 01:45:51.118534 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 01:45:51.118548 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 01:45:51.130188 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-1

Still good. This time, the MCP events came first.

/test e2e-agnostic-usc-devpreview
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890602608901492736/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-76769d4bcc-gdfrq_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 04:57:44.796093 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 04:57:44.796128 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 04:57:44.796132 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 04:57:44.809388 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 04:57:44.809416 1 base_controller.go:82] Caches are synced for NodeInformer
I0215 04:57:44.809519 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 04:57:44.809521 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-0
I0215 04:57:44.809519 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 04:57:44.809539 1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-0 belongs to machine config pool master
I0215 04:57:44.809561 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 04:57:44.814586 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-1
I0215 04:57:44.814600 1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-1 belongs to machine config pool master

Still works: the first node got its matching MCP. Three consecutive tests worked. To summarize, these three things race with one another: the informers' cache sync, the warming of the controller's own caches, and the arrival of the first event.

In all three tests so far, the informers' caches synced quickly enough: when PostStartHook was called, the listers already had data. (That was not the case without setting up PostStartHook in my original implementation, so PostStartHook helps.) The initialization of the controller's caches was either complete (test 2) or close to complete (tests 1 and 3) when the first event came in. Ideally we would ensure completeness, but I do not think we have a way to achieve that at the moment. I feel this is good enough (the window of cold caches is small), and the MCP/MC events (which also arrive quickly) will reconcile all nodes if cold caches caused any errors. We do not have to call WaitForCacheSync explicitly. Just out of curiosity, let me see what difference it makes if we do it anyway.
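For reference, the "do it anyway" variant would look roughly like the sketch below, assuming we have the HasSynced functions of the MCP/MC informers at hand; the function and parameter names here are illustrative, not the PR's code:

```go
// Sketch: block until the MCP/MC informer caches are synced, then warm the
// controller's own caches from the listers.
// Uses context, fmt, and k8s.io/client-go/tools/cache.
func waitAndWarm(ctx context.Context, c *nodeInformerController, mcpSynced, mcSynced cache.InformerSynced) error {
	if !cache.WaitForCacheSync(ctx.Done(), mcpSynced, mcSynced) {
		return fmt.Errorf("MachineConfigPool/MachineConfig informer caches did not sync")
	}
	c.initializeCachesNoErrors()
	return nil
}
```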
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890735530849603584/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-6b58f4f84d-gwwjc_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 13:38:35.934527 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 13:38:35.934609 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 13:38:35.934635 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 13:38:35.954205 1 base_controller.go:82] Caches are synced for NodeInformer
I0215 13:38:35.954048 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 13:38:35.954334 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 13:38:35.954385 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 13:38:35.954427 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 13:38:35.954477 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 13:38:35.954493 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-0
I0215 13:38:35.954508 1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-6xlg8-master-0 belongs to machine config pool master
I0215 13:38:35.954058 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 13:38:35.965000 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-1
It looks like Test 2 above.

/test e2e-agnostic-usc-devpreview

Trying again ...
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890768740337848320/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-85bcd6d7c6-f4jp7_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 15:27:40.242112 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 15:27:40.242132 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 15:27:40.242171 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 15:27:40.257544 1 base_controller.go:82] Caches are synced for NodeInformer
I0215 15:27:40.257806 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 15:27:40.257851 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 15:27:40.257670 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 15:27:40.257675 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 15:27:40.257971 1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 15:27:40.257994 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-fhcw9-master-0
I0215 15:27:40.258007 1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-fhcw9-master-0 belongs to machine config pool master

Again, like Test 2 above.

/test e2e-agnostic-usc-devpreview
"gather-extra failed", rerunning ... /test e2e-agnostic-usc-devpreview |
Something wrong with the cluster. Retrying ... |
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890965887544987648/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-79f559cc9d-qjmlx_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0216 04:54:37.613087 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0216 04:54:37.613168 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0216 04:54:37.614167 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0216 04:54:37.625667 1 base_controller.go:82] Caches are synced for NodeInformer
I0216 04:54:37.625830 1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0216 04:54:37.626125 1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0216 04:54:37.626194 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-0
I0216 04:54:37.626232 1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-0 belongs to machine config pool master
I0216 04:54:37.627826 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0216 04:54:37.633539 1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-1
I0216 04:54:37.633582 1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-1 belongs to machine config pool master

Although the CI job failed, we got the logs from the test. We have 3 tests without 7c289e9 and 3 with it. Facts:
Thinking:

OK, I proposed ... That means ...
Force-pushed from 7c289e9 to c3483f4 (Compare)
curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1894124084053151744/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-785c89d6b-dmj26_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0224 21:53:16.289576 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0224 21:53:16.289608 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0224 21:53:16.289585 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0224 21:53:16.306588 1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0224 21:53:16.306599 1 base_controller.go:82] Caches are synced for NodeInformer
I0224 21:53:16.306626 1 base_controller.go:82] Caches are synced for UpdateStatusController
I0224 21:53:16.306836 1 nodeinformer.go:221] Stored 2 machineConfigPools in the cache
I0224 21:53:16.306874 1 nodeinformer.go:231] Stored 2 machineConfig versions in the cache
I0224 21:53:16.306880 1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/master
I0224 21:53:16.306900 1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/worker
I0224 21:53:16.306916 1 nodeinformer.go:101] NI :: Syncing with key Node/ci-op-yc00jphy-8c750-78krc-master-0
I0224 21:53:16.306930 1 nodeinformer.go:128] Node ci-op-yc00jphy-8c750-78krc-master-0 belongs to machine config pool master
...

The above log is from this job. It ( ...
/retest-required

1 similar comment

/retest-required
Many small bits of feedback inline. I think the only substantial change is the lack of robustness when reconcileAllNodes fails. The rest is mostly naming and code structure proposals - address the ones you agree with.
In line with our retro agreement, I am approving the PR so you can merge it and address the feedback in a followup.
/hold
/lgtm
/approve
/label no-qe
	default:
		panic(fmt.Sprintf("USC :: Unknown object type: %T", object))
	}
}

func kindAndNameToQueueKey(kind, name string) string {
nit: kindAndName is redundant with the parameter names, something like queueKeyFor would probably suffice
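A possible shape for the renamed helper, assuming the "Kind/name" key format already visible in the controller's log lines above:

```go
// queueKeyFor builds the "<Kind>/<name>" work-queue key,
// e.g. "Node/master-0" or "MachineConfigPool/worker".
func queueKeyFor(kind, name string) string {
	return kind + "/" + name
}
```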
	nodes, err := c.nodes.List(labels.Everything())
	if err != nil {
		return err
	}
OK so when this fails, it propagates through the sync error, which makes the controller retry 👍
@@ -39,6 +40,10 @@ type nodeInformerController struct {
	// sendInsight should be called to send produced insights to the update status controller
	sendInsight sendInsightFn

	// machineConfigVersionCache caches machine config versions
	// The cache stores the name of MC as the key and the release image version as its value which is retrieved from the annotation of the MC.
	machineConfigVersionCache machineConfigVersionCache
nit: I'd call the member machineConfigsToVersions (the type name is good though)
nit: I'd call the member machineConfigVersions (the type name is good though)
	machineConfig, err := c.machineConfigs.Get(name)
	if err != nil && !kerrors.IsNotFound(err) {
		return err
	}
	if kerrors.IsNotFound(err) {
		// The machine config was deleted
		if changed := c.machineConfigVersionCache.forget(name); changed {
			klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
			return c.reconcileAllNodes(syncCtx.Queue())
		}
		return nil
	}
	if changed := c.machineConfigVersionCache.ingest(machineConfig); changed {
		klog.V(2).Infof("Reconciling all nodes as machine config %q is refreshed", name)
		return c.reconcileAllNodes(syncCtx.Queue())
	}
	return nil
code structure: individual switch branches are now complex enough for me to think we should have separate methods like syncMachineConfig, syncNode, etc. and just call them from here
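One way to read this suggestion — a sketch, not the PR's code; parseQueueKey and the per-kind method names are assumptions:

```go
// sync stays a thin dispatcher; each kind gets its own dedicated method.
// Uses fmt and strings.
func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	kind, name := parseQueueKey(syncCtx.QueueKey())
	switch kind {
	case "Node":
		return c.syncNode(ctx, syncCtx, name)
	case "MachineConfig":
		return c.syncMachineConfig(ctx, syncCtx, name)
	case "MachineConfigPool":
		return c.syncMachineConfigPool(ctx, syncCtx, name)
	default:
		return fmt.Errorf("USC :: unknown kind %q in queue key", kind)
	}
}

// assumed inverse of the "Kind/name" key builder
func parseQueueKey(key string) (kind, name string) {
	if i := strings.Index(key, "/"); i >= 0 {
		return key[:i], key[i+1:]
	}
	return "", key
}
```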
func (c *machineConfigPoolSelectorCache) len() int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return len(c.cache)
}
Feels like overkill to have a locked method just to log "...N config pools" - I think a simple "does not belong to any MachineConfigPool" log is just as useful 🤷
func (c *nodeInformerController) initializeCaches() error {
	var errs []error

	if pools, err := c.machineConfigPools.List(labels.Everything()); err != nil {
		errs = append(errs, err)
	} else {
		for _, pool := range pools {
			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())

	machineConfigs, err := c.machineConfigs.List(labels.Everything())
	if err != nil {
		errs = append(errs, err)
	} else {
		for _, mc := range machineConfigs {
			c.machineConfigVersionCache.ingest(mc)
		}
	}
	klog.V(2).Infof("Stored %d machineConfig versions in the cache", c.machineConfigVersionCache.len())

	return kutilerrors.NewAggregate(errs)
}
I'd prefer to not couple the two into a single method: they are similar but are used in a different way at different places - consider having separate dedicated methods (which saves us from NewAggregate, so the two methods would even be simpler) and have two Once instances.
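A sketch of what one of the two dedicated warm-up methods could look like; only the MCP one is shown, the MC one would mirror it, and the mcpWarmOnce sync.Once field and the method name are assumptions layered on the PR's existing caches:

```go
// initializeMachineConfigPoolCache warms the selector cache at most once.
// Uses sync.Once, k8s.io/apimachinery/pkg/labels, and k8s.io/klog/v2.
func (c *nodeInformerController) initializeMachineConfigPoolCache() {
	c.mcpWarmOnce.Do(func() {
		pools, err := c.machineConfigPools.List(labels.Everything())
		if err != nil {
			klog.Errorf("Failed to list MachineConfigPools to warm the cache: %v", err)
			return
		}
		for _, pool := range pools {
			c.machineConfigPoolSelectorCache.ingest(pool)
		}
		// counting what we just listed also avoids the locked len() call
		klog.V(2).Infof("Stored %d machineConfigPools in the cache", len(pools))
	})
}
```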
			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())
consider not having a len() (that needs to lock) and just count the ingestions?
func (c *nodeInformerController) initializeCachesNoErrors() {
	if err := c.initializeCaches(); err != nil {
		klog.Errorf("Failed to initialize caches: %v", err)
	}
}
I wondered whether we'd need to abort the controller 🤔 I think it is safe to continue the controller despite these failures because it just means we started cold and will eventually load all items from the cache?
	if changed := c.machineConfigVersionCache.forget(name); changed {
		klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
		return c.reconcileAllNodes(syncCtx.Queue())
I think we are losing some robustness by directly returning an error here. Consider the following:

- MC is updated and carries a new version - cache returns changed and we get into one of these branches where we do reconcileAll
- The List inside reconcileAll fails so we propagate the err outside, and the controller library will requeue the MC queueKey that triggered us
- New sync is called for this MC - but this time the cache already has the version, so changed==false and we finish the sync without calling reconcileAll -> and end up with potentially outdated insights 😨

I'm not sure if it is an ugly hack or not, but I wonder if reconcileAllNodes could be a special synthetic queue key that we'd just queue here, and it would get its own dedicated sync() call. That synthetic call would then be handled simply by what reconcileAllNodes is doing, and on a List error it would just return it, which would cause a requeue of the reconcileAllNodes key... WDYT?
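A sketch of the synthetic-key idea (the key constant and the enqueue site are assumptions for illustration): instead of calling reconcileAllNodes inline — where a List failure after the cache already recorded the change cannot be retried meaningfully — enqueue a dedicated key and let sync() handle it, so an error simply requeues that key.

```go
const reconcileAllNodesKey = "reconcileAllNodes" // assumed synthetic queue key

func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	if syncCtx.QueueKey() == reconcileAllNodesKey {
		// a List error returned here requeues reconcileAllNodesKey itself
		return c.reconcileAllNodes(syncCtx.Queue())
	}
	// ... existing Node / MachineConfig / MachineConfigPool handling ...
	return nil
}

// and where a cache reports a change, enqueue instead of reconciling inline:
//   if changed := c.machineConfigVersionCache.ingest(machineConfig); changed {
//       syncCtx.Queue().Add(reconcileAllNodesKey)
//   }
```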
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing ...
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This is to address the following comments from the review [1]:
[1]. #1136 (comment)