
OTA-1427: Recompute node insights when MC/MCP changes #1144

Open · wants to merge 5 commits into main from OTA-1427-watch-mcp

Conversation

@hongkailiu (Member) commented Jan 31, 2025

This is to address the following comments from the review [1]:

[1]. #1136 (comment)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2025
@openshift-ci-robot (Contributor) commented Jan 31, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This is to address the following comments from the review [1]:

[1]. #1136 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu changed the title OTA-1427: Recompute node insights when MC/MCP changes [wip]OTA-1427: Recompute node insights when MC/MCP changes Jan 31, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2025
@openshift-ci openshift-ci bot requested review from DavidHurta and wking January 31, 2025 19:52
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

1 similar comment
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 2, 2025

Reconciling is working, as there are logs Reconciling all nodes ...

The init does not seem to be working ... 🤔

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1885848843463954432/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-65c495d5c6-4ljd7_update-status-controller.log

I0202 01:52:01.950445       1 nodeinformer.go:278] Stored 0 machineConfigPools in the cache
I0202 01:52:01.950524       1 nodeinformer.go:290] Stored 0 machineConfig versions in the cache

But Failed to determine which machine config pool did not show. Does that mean the cache had been loaded (via events on MC/MCP) before the node events came in? Let me add more logs to understand this.

@hongkailiu (Member Author)

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1885905947818201088/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-844f579bfd-mxg7z_update-status-controller.log | rg 'Syncing with key|which|Cached|Stored|Reconciling all nodes' | head -n 10
I0202 06:06:53.968786       1 nodeinformer.go:282] Stored 0 machineConfigPools in the cache
I0202 06:06:53.968954       1 nodeinformer.go:294] Stored 0 machineConfig versions in the cache
I0202 06:06:54.183789       1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/master
I0202 06:06:54.183939       1 nodeinformer.go:244] Cached MachineConfigPool master with selector node-role.kubernetes.io/master=
I0202 06:06:54.184005       1 nodeinformer.go:182] Reconciling all nodes as machine config pool "master" is refreshed
I0202 06:06:54.184104       1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/worker
I0202 06:06:54.184171       1 nodeinformer.go:244] Cached MachineConfigPool worker with selector node-role.kubernetes.io/worker=
I0202 06:06:54.184224       1 nodeinformer.go:182] Reconciling all nodes as machine config pool "worker" is refreshed
I0202 06:06:54.184279       1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-0
I0202 06:06:54.200901       1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-1

In the init phase, the listers were not synced yet. However, the caches caught up via the events on MCPs.
It is safe to remove initializeCaches() as it is useless.

https://github.com/openshift/cluster-version-operator/pull/1144/files#diff-c5a41cb83200e13544e0509a5e77c180e16a1d70d7411a79cac2bc107e89950fR271-R298

@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 2 times, most recently from 07ebaa6 to 5a1a854 Compare February 2, 2025 15:20
@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 6 times, most recently from 43096ac to 791b701 Compare February 2, 2025 20:33
@hongkailiu (Member Author)

/retest-required

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1886150916344647680/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-56468d56d6-trf86_update-status-controller.log | rg 'Syncing with key|which|Cached|Stored|Reconciling all nodes' | head -n 10
I0202 22:07:59.245643       1 nodeinformer.go:103] NI :: Syncing with key MachineConfigPool/master
I0202 22:07:59.245746       1 nodeinformer.go:243] Cached MachineConfigPool master with selector node-role.kubernetes.io/master=
I0202 22:07:59.245776       1 nodeinformer.go:178] Reconciling all nodes as machine config pool "master" is refreshed
I0202 22:07:59.245842       1 nodeinformer.go:103] NI :: Syncing with key MachineConfigPool/worker
I0202 22:07:59.245925       1 nodeinformer.go:243] Cached MachineConfigPool worker with selector node-role.kubernetes.io/worker=
I0202 22:07:59.245963       1 nodeinformer.go:178] Reconciling all nodes as machine config pool "worker" is refreshed
I0202 22:07:59.246009       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-0
I0202 22:07:59.254500       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-1
I0202 22:07:59.268333       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-2
I0202 22:07:59.280301       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-worker-westus-rx8ls

The log looks good.

@hongkailiu hongkailiu changed the title [wip]OTA-1427: Recompute node insights when MC/MCP changes OTA-1427: Recompute node insights when MC/MCP changes Feb 3, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@hongkailiu (Member Author)

/test e2e-hypershift

@hongkailiu (Member Author)

/cc @petr-muller

@openshift-ci openshift-ci bot requested a review from petr-muller February 3, 2025 13:27
@hongkailiu (Member Author) commented Feb 14, 2025

I see you already had something like this and removed it (#1144 (comment) / #1144 (comment) ) but I do not entirely understand why you found it useless?

I could not warm up the cache with the listers from the informer as I do not know when the listers are synced, i.e., when they have the MCPs/MCs. My init function was useless because the listers were NOT synced when it was called. Moreover, from the testing, MC/MCP events always came earlier than node events, i.e., "requeuing a ton of nodes" did not happen.

Maybe we could use a post-start hook to warmup the caches?

Added it back in the 5th commit with the hook. Let us see if it helps in the test.

If still not helping, machineConfigInformers.WaitForCacheSync(ctx.Done()) is the next thing to try before this line:

go nodeInformerController.Run(ctx, 1)

My expectation is that the listers are synced before nodeInformerController runs.

Edit:

From the comment

// start the informers, but we do not need to wait for them to sync because each controller waits
// for synced informers it uses in its Run() method

and "run asynchronously" from

// WithPostStartHooks allows to register functions that will run asynchronously after the controller is started via Run command.
func (f *Factory) WithPostStartHooks(hooks ...PostStartHook) *Factory {

I got the feeling that we will need machineConfigInformers.WaitForCacheSync(ctx.Done()) to ensure that the informers' sync has finished before the hook is called. Still waiting for the CI job to confirm this feeling.
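For reference, a minimal sketch of that wiring, assuming the start-up code looks roughly like the line quoted above (machineConfigInformers and nodeInformerController are the names used in this PR; the rest of the setup is elided):

machineConfigInformers.Start(ctx.Done())
// Block until the MC/MCP listers are populated, so the post-start hook sees warm
// listers instead of racing with the informers (this is the option being discussed).
machineConfigInformers.WaitForCacheSync(ctx.Done())
go nodeInformerController.Run(ctx, 1)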

@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 3 times, most recently from 44c1f2c to 71e9c4c Compare February 14, 2025 05:36
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Must gather did not work in the last CI run.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 14, 2025

The ordering looks good even without explicitly calling machineConfigInformers.WaitForCacheSync(ctx.Done()).
The first node ci-op-6jd7864k-8c750-fc76f-master-0 got MCP/master.

Note that the first event arrived before our caches got warmed up (after the informers' caches synced), which verifies my theory in #1144 (comment) ...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890400781975162880/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-7f56cd7468-ffg4q_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0214 15:31:07.406362       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0214 15:31:07.406389       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0214 15:31:07.406459       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0214 15:31:07.423666       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0214 15:31:07.423697       1 base_controller.go:82] Caches are synced for NodeInformer
I0214 15:31:07.423727       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0214 15:31:07.423772       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-0
I0214 15:31:07.423810       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0214 15:31:07.424105       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0214 15:31:07.423828       1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-0 belongs to machine config pool master
I0214 15:31:07.430259       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-1
I0214 15:31:07.430279       1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-1 belongs to machine config pool master
...

Let me try again.

If it shows the same result, I intend to exclude the wait and let the reconciliation triggered by MCP events do the job.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Install failure in the last CI job.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890553751542960128/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-5dc5599f49-6ff4n_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 01:45:51.099734       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 01:45:51.100905       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 01:45:51.100990       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 01:45:51.117880       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 01:45:51.118257       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 01:45:51.118563       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 01:45:51.118158       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 01:45:51.118719       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 01:45:51.118781       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-0
I0215 01:45:51.118859       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-8g5b5-master-0 belongs to machine config pool master
I0215 01:45:51.118534       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 01:45:51.118548       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 01:45:51.130188       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-1

Still good. This time, the MCP events came first.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890602608901492736/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-76769d4bcc-gdfrq_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 04:57:44.796093       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 04:57:44.796128       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 04:57:44.796132       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 04:57:44.809388       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 04:57:44.809416       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 04:57:44.809519       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 04:57:44.809521       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-0
I0215 04:57:44.809519       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 04:57:44.809539       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-0 belongs to machine config pool master
I0215 04:57:44.809561       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 04:57:44.814586       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-1
I0215 04:57:44.814600       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-1 belongs to machine config pool master

Still works: the first node got its matching MCP.

Three consecutive tests worked.

To summarize: these three things race with one another.

  • informer.Start(): not blocking. The informers' caches start to sync. The events feed the controller.
  • controller.Run() calls WaitForCacheSync().
  • the controller's PostStartHook: initializes our controller's caches, which requires the synced listers.

In all three tests so far, the informers' caches were synced quickly enough: when PostStartHook was called, the listers already had data. (That was not the case without setting up PostStartHook in my original implementation, so PostStartHook helps.)

The initialization of the controller's caches was either complete (test 2) or close to complete (tests 1 and 3) when the first event came in. Ideally, we want to ensure completeness (but I do not think we have a way to achieve that at the moment).

I feel this is good enough (the window of cold caches is small), and the MCP/MC events (which also come quickly) will reconcile all nodes if any errors were caused by cold caches. We do not have to call WaitForCacheSync explicitly.

Just for my curiosity, let me see the difference if we do it anyway.

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890735530849603584/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-6b58f4f84d-gwwjc_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 13:38:35.934527       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 13:38:35.934609       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 13:38:35.934635       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 13:38:35.954205       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 13:38:35.954048       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 13:38:35.954334       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 13:38:35.954385       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 13:38:35.954427       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 13:38:35.954477       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 13:38:35.954493       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-0
I0215 13:38:35.954508       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-6xlg8-master-0 belongs to machine config pool master
I0215 13:38:35.954058       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 13:38:35.965000       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-1

It does what I expect it to do: the first event came AFTER the caches were initialized.

It looks like Test 2 above.

/test e2e-agnostic-usc-devpreview

Trying again...

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890768740337848320/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-85bcd6d7c6-f4jp7_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 15:27:40.242112       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 15:27:40.242132       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 15:27:40.242171       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 15:27:40.257544       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 15:27:40.257806       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 15:27:40.257851       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 15:27:40.257670       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 15:27:40.257675       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 15:27:40.257971       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 15:27:40.257994       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-fhcw9-master-0
I0215 15:27:40.258007       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-fhcw9-master-0 belongs to machine config pool master

Again, like Test 2 above.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

"gather-extra failed", rerunning ...

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Something wrong with the cluster. Retrying ...
/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890965887544987648/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-79f559cc9d-qjmlx_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0216 04:54:37.613087       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0216 04:54:37.613168       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0216 04:54:37.614167       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0216 04:54:37.625667       1 base_controller.go:82] Caches are synced for NodeInformer
I0216 04:54:37.625830       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0216 04:54:37.626125       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0216 04:54:37.626194       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-0
I0216 04:54:37.626232       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-0 belongs to machine config pool master
I0216 04:54:37.627826       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0216 04:54:37.633539       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-1
I0216 04:54:37.633582       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-1 belongs to machine config pool master

Although the CI job failed, we got the logs from the test.
The first node event came in AFTER our caches had been initialized.

We have 3 tests without 7c289e9 and 3 with it.

Facts:

  • all 6: the informers' caches had been synced before PostStartHook was called.
  • all 6: all nodes got a matching MCP.
  • 1/3 without it vs 3/3 with it: PostStartHook had finished before the first event came.

Thinking:

  • Even without WaitForCacheSync, informers' caches were synced.
  • Even with it, it won't ensure that PostStartHook finishes before the first event comes - that is my understanding of the code. The bad ordering did not happen in the testing, and WaitForCacheSync seems to have helped achieve the goal (if the samples are not too few): 1/3 ---> 3/3. So maybe keep the WaitForCacheSync?

@petr-muller (Member)

OK, I proposed PostStartHook because I thought they would run before the reconciliation loop but they don't (check out the Run() implementation, it's actually very nice and readable...).

That means PostStartHook will really race with our sync, which makes it an inappropriate solution for the problem. I think we could use something like sync.Once (https://pkg.go.dev/sync#Once) for the warmup in sync() - the first sync would call the warmup code at the beginning, and maybe that's enough?
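A rough sketch of that idea, assuming the controller keeps a sync.Once field (the field name and the sync() shape below are placeholders, not the PR's final code):

type nodeInformerController struct {
	// ... existing fields elided
	initCachesOnce sync.Once // hypothetical field guarding the one-time warmup
}

func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	// The first sync() call warms up the caches from the (already synced) listers;
	// later calls skip it. Errors are only logged, so a cold start simply waits
	// for MC/MCP events to refill the caches.
	c.initCachesOnce.Do(c.initializeCachesNoErrors)

	// ... handle syncCtx.QueueKey() as before
	return nil
}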

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1894124084053151744/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-785c89d6b-dmj26_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0224 21:53:16.289576       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0224 21:53:16.289608       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0224 21:53:16.289585       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0224 21:53:16.306588       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0224 21:53:16.306599       1 base_controller.go:82] Caches are synced for NodeInformer
I0224 21:53:16.306626       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0224 21:53:16.306836       1 nodeinformer.go:221] Stored 2 machineConfigPools in the cache
I0224 21:53:16.306874       1 nodeinformer.go:231] Stored 2 machineConfig versions in the cache
I0224 21:53:16.306880       1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/master
I0224 21:53:16.306900       1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/worker
I0224 21:53:16.306916       1 nodeinformer.go:101] NI :: Syncing with key Node/ci-op-yc00jphy-8c750-78krc-master-0
I0224 21:53:16.306930       1 nodeinformer.go:128] Node ci-op-yc00jphy-8c750-78krc-master-0 belongs to machine config pool master
...

The above log is from this job. It (once.Do(c.initializeCachesNoErrors)) seems to be working nicely: called ONCE and ONLY once.

@hongkailiu (Member Author)

/retest-required

1 similar comment
@hongkailiu (Member Author)

/retest-required

@petr-muller (Member) left a comment

Many small bits of feedback inline. I think the only substantial one is about the lack of robustness when reconcileAllNodes fails. The rest is mostly naming and code structure proposals - address the ones you agree with.

In line with our retro agreement, I am approving the PR so you can merge it and address the feedback in a follow-up.

/hold
/lgtm
/approve
/label no-qe

	default:
		panic(fmt.Sprintf("USC :: Unknown object type: %T", object))
	}
}

func kindAndNameToQueueKey(kind, name string) string {

nit: kindAndName is redundant with parameter names, something like queueKeyFor would probably suffice
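For illustration, the renamed helper might look like this (assuming the existing implementation simply joins kind and name, as the "MachineConfigPool/master" and "Node/..." keys in the logs above suggest):

// queueKeyFor builds the "<kind>/<name>" queue keys seen in the controller logs.
func queueKeyFor(kind, name string) string {
	return kind + "/" + name
}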

Comment on lines +159 to +162
	nodes, err := c.nodes.List(labels.Everything())
	if err != nil {
		return err
	}

OK so when this fails, it propagates through sync error, which makes the controller retry 👍

@@ -39,6 +40,10 @@ type nodeInformerController struct {
	// sendInsight should be called to send produced insights to the update status controller
	sendInsight sendInsightFn

	// machineConfigVersionCache caches machine config versions
	// The cache stores the name of MC as the key and the release image version as its value which is retrieved from the annotation of the MC.
	machineConfigVersionCache machineConfigVersionCache

nit: I'd call the member machineConfigsToVersions (type name is good though)

@@ -39,6 +40,10 @@ type nodeInformerController struct {
	// sendInsight should be called to send produced insights to the update status controller
	sendInsight sendInsightFn

	// machineConfigVersionCache caches machine config versions
	// The cache stores the name of MC as the key and the release image version as its value which is retrieved from the annotation of the MC.
	machineConfigVersionCache machineConfigVersionCache

nit: I'd call the member machineConfigVersions (type name is good though)

Comment on lines +136 to +153
	machineConfig, err := c.machineConfigs.Get(name)
	if err != nil && !kerrors.IsNotFound(err) {
		return err
	}
	if kerrors.IsNotFound(err) {
		// The machine config was deleted
		if changed := c.machineConfigVersionCache.forget(name); changed {
			klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
			return c.reconcileAllNodes(syncCtx.Queue())
		}
		return nil
	}
	if changed := c.machineConfigVersionCache.ingest(machineConfig); changed {
		klog.V(2).Infof("Reconciling all nodes as machine config %q is refreshed", name)
		return c.reconcileAllNodes(syncCtx.Queue())
	}
	return nil

code structure: individual switch branches are now complex enough for me to think we should have separate methods like syncMachineConfig, syncNode etc and just call them from here
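A sketch of that structure (the helper names follow the suggestion above; the dispatch shape and signatures are assumptions, not the PR's actual code):

// inside sync(), once kind and name have been extracted from the queue key
switch kind {
case "Node":
	return c.syncNode(ctx, syncCtx, name)
case "MachineConfig":
	return c.syncMachineConfig(ctx, syncCtx, name)
case "MachineConfigPool":
	return c.syncMachineConfigPool(ctx, syncCtx, name)
default:
	return fmt.Errorf("unknown kind %q in queue key", kind)
}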

Comment on lines +310 to +314
func (c *machineConfigPoolSelectorCache) len() int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return len(c.cache)
}

Feels like overkill to have a locked method just to log ...N config pools - I think a simple "does not belong to any MachineConfigPool" log is just as useful 🤷

Comment on lines 211 to 234
func (c *nodeInformerController) initializeCaches() error {
	var errs []error

	if pools, err := c.machineConfigPools.List(labels.Everything()); err != nil {
		errs = append(errs, err)
	} else {
		for _, pool := range pools {
			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())

	machineConfigs, err := c.machineConfigs.List(labels.Everything())
	if err != nil {
		errs = append(errs, err)
	} else {
		for _, mc := range machineConfigs {
			c.machineConfigVersionCache.ingest(mc)
		}
	}
	klog.V(2).Infof("Stored %d machineConfig versions in the cache", c.machineConfigVersionCache.len())

	return kutilerrors.NewAggregate(errs)
}

I'd prefer not to couple the two into a single method: they are similar but are used in different ways at different places - consider having separate dedicated methods (which saves us from NewAggregate, so the two methods would be even simpler) and two Once instances.
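A possible shape for that split, based on the initializeCaches body quoted above (the method names are assumptions; each would be guarded by its own Once):

func (c *nodeInformerController) initializeMachineConfigPoolCache() {
	pools, err := c.machineConfigPools.List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to warm up the MachineConfigPool selector cache: %v", err)
		return
	}
	for _, pool := range pools {
		c.machineConfigPoolSelectorCache.ingest(pool)
	}
}

func (c *nodeInformerController) initializeMachineConfigVersionCache() {
	machineConfigs, err := c.machineConfigs.List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to warm up the MachineConfig version cache: %v", err)
		return
	}
	for _, mc := range machineConfigs {
		c.machineConfigVersionCache.ingest(mc)
	}
}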

			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())

consider not having a len() (that needs to lock) and just count the ingestions?

Comment on lines 205 to 209
func (c *nodeInformerController) initializeCachesNoErrors() {
	if err := c.initializeCaches(); err != nil {
		klog.Errorf("Failed to initialize caches: %v", err)
	}
}

I wondered whether we'd need to abort the controller 🤔

I think it is safe to continue the controller despite these failures because it just means we started cold and will eventually load all items from the cache?

		if changed := c.machineConfigVersionCache.forget(name); changed {
			klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
			return c.reconcileAllNodes(syncCtx.Queue())

I think we are losing some robustness by directly returning an error here. Consider the following;

  1. MC is updated and carries a new version - cache returns changed and we get into one of these branches where we do reconcileAll
  2. The List inside reconcileAll fails so we propagate the err outside, and the controller library will requeue the MC queueKey that triggered us
  3. New sync is called for this MC - but this time the cache already has the version, so changed==false and we finish the sync without calling reconcileAll -> and end up with potentially outdated insights 😨

I'm not sure if it is an ugly hack or not, but I wonder if reconcileAllNodes could be a special synthetic queue key that we'd just queue here, and it would get its own dedicated sync() call. That synthetic call would then be handled simply by what reconcileAllNodes is doing, and on a List error it would just return it, which would cause a requeue of the reconcileAllNodes key... WDYT?
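One way the synthetic-key idea could look (the key constant and the enqueue-instead-of-call shape are assumptions, sketched against the MachineConfig branch quoted above):

const reconcileAllNodesQueueKey = "ReconcileAllNodes" // hypothetical synthetic key

// In the MachineConfig branch, enqueue the synthetic key instead of calling
// reconcileAllNodes directly. A later List failure then requeues the synthetic key
// itself rather than an MC key whose cache state has already been updated, which
// avoids the changed==false trap described in step 3.
if changed := c.machineConfigVersionCache.forget(name); changed {
	syncCtx.Queue().Add(reconcileAllNodesQueueKey)
}
return nil

A dedicated branch in sync() for reconcileAllNodesQueueKey would then just return c.reconcileAllNodes(syncCtx.Queue()), so any List error there causes the controller library to requeue the synthetic key.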

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. no-qe Allows PRs to merge without qe-approved label labels Feb 25, 2025
@petr-muller (Member)

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2025
openshift-ci bot (Contributor) commented Feb 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2025
openshift-ci bot (Contributor) commented Feb 25, 2025

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
