
OTA-1427: Recompute node insights when MC/MCP changes #1144

Open · wants to merge 5 commits into main from OTA-1427-watch-mcp

Conversation

@hongkailiu (Member) commented Jan 31, 2025

This is to address the following comments from the review [1]:

[1]. #1136 (comment)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2025
@openshift-ci-robot (Contributor) commented Jan 31, 2025

@hongkailiu: This pull request references OTA-1427 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This is to address the following comments from the review [1]:

[1]. #1136 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu changed the title OTA-1427: Recompute node insights when MC/MCP changes [wip]OTA-1427: Recompute node insights when MC/MCP changes Jan 31, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2025
@openshift-ci openshift-ci bot requested review from DavidHurta and wking January 31, 2025 19:52
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

1 similar comment
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 2, 2025

Reconciling is working, as there are logs Reconciling all nodes ...

The init does not seem to be working ... 🤔

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1885848843463954432/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-65c495d5c6-4ljd7_update-status-controller.log

I0202 01:52:01.950445       1 nodeinformer.go:278] Stored 0 machineConfigPools in the cache
I0202 01:52:01.950524       1 nodeinformer.go:290] Stored 0 machineConfig versions in the cache

But Failed to determine which machine config pool did not show. Does that mean the cache had been loaded (via events on MC/MCP) before the node events came in? Let me add more logs to understand this.

@hongkailiu (Member Author)

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1885905947818201088/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-844f579bfd-mxg7z_update-status-controller.log | rg 'Syncing with key|which|Cached|Stored|Reconciling all nodes' | head -n 10
I0202 06:06:53.968786       1 nodeinformer.go:282] Stored 0 machineConfigPools in the cache
I0202 06:06:53.968954       1 nodeinformer.go:294] Stored 0 machineConfig versions in the cache
I0202 06:06:54.183789       1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/master
I0202 06:06:54.183939       1 nodeinformer.go:244] Cached MachineConfigPool master with selector node-role.kubernetes.io/master=
I0202 06:06:54.184005       1 nodeinformer.go:182] Reconciling all nodes as machine config pool "master" is refreshed
I0202 06:06:54.184104       1 nodeinformer.go:107] NI :: Syncing with key MachineConfigPool/worker
I0202 06:06:54.184171       1 nodeinformer.go:244] Cached MachineConfigPool worker with selector node-role.kubernetes.io/worker=
I0202 06:06:54.184224       1 nodeinformer.go:182] Reconciling all nodes as machine config pool "worker" is refreshed
I0202 06:06:54.184279       1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-0
I0202 06:06:54.200901       1 nodeinformer.go:107] NI :: Syncing with key Node/ci-op-n2q0th1r-8c750-5r658-master-1

In the init phase, the listers were not synced yet. However, the caches caught up via the events on MCPs.
It is safe to remove initializeCaches() as it is useless.

https://github.com/openshift/cluster-version-operator/pull/1144/files#diff-c5a41cb83200e13544e0509a5e77c180e16a1d70d7411a79cac2bc107e89950fR271-R298

@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 2 times, most recently from 07ebaa6 to 5a1a854 Compare February 2, 2025 15:20
@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 6 times, most recently from 43096ac to 791b701 Compare February 2, 2025 20:33
@hongkailiu (Member Author)

/retest-required

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1886150916344647680/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-56468d56d6-trf86_update-status-controller.log | rg 'Syncing with key|which|Cached|Stored|Reconciling all nodes' | head -n 10
I0202 22:07:59.245643       1 nodeinformer.go:103] NI :: Syncing with key MachineConfigPool/master
I0202 22:07:59.245746       1 nodeinformer.go:243] Cached MachineConfigPool master with selector node-role.kubernetes.io/master=
I0202 22:07:59.245776       1 nodeinformer.go:178] Reconciling all nodes as machine config pool "master" is refreshed
I0202 22:07:59.245842       1 nodeinformer.go:103] NI :: Syncing with key MachineConfigPool/worker
I0202 22:07:59.245925       1 nodeinformer.go:243] Cached MachineConfigPool worker with selector node-role.kubernetes.io/worker=
I0202 22:07:59.245963       1 nodeinformer.go:178] Reconciling all nodes as machine config pool "worker" is refreshed
I0202 22:07:59.246009       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-0
I0202 22:07:59.254500       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-1
I0202 22:07:59.268333       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-master-2
I0202 22:07:59.280301       1 nodeinformer.go:103] NI :: Syncing with key Node/ci-op-89pfjfg5-8c750-p4r2f-worker-westus-rx8ls

The log looks good.

@hongkailiu hongkailiu changed the title [wip]OTA-1427: Recompute node insights when MC/MCP changes OTA-1427: Recompute node insights when MC/MCP changes Feb 3, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@hongkailiu (Member Author)

/test e2e-hypershift

@hongkailiu (Member Author)

/cc @petr-muller

@openshift-ci openshift-ci bot requested a review from petr-muller February 3, 2025 13:27
@hongkailiu (Member Author) commented Feb 14, 2025

I see you already had something like this and removed it (#1144 (comment) / #1144 (comment) ) but I do not entirely understand why you found it useless?

I could not warm up the cache with the listers from the informer as I do not know when the listers are synced, i.e., when they have the MCPs/MCs. My init function was useless because the listers were NOT synced when it was called. Moreover, from the testing, MC/MCP events always came earlier than node events, i.e., "requeuing a ton of nodes" did not happen.

Maybe we could use a post-start hook to warmup the caches?

Added it back in the 5th commit with the hook. Let us see if it helps in the test.

If still not helping, machineConfigInformers.WaitForCacheSync(ctx.Done()) is the next thing to try before this line:

go nodeInformerController.Run(ctx, 1)

My expectation is that the listers are synced before nodeInformerController runs.

Edit:

From the comment

// start the informers, but we do not need to wait for them to sync because each controller waits
// for synced informers it uses in its Run() method

and "run asynchronously" from

// WithPostStartHooks allows to register functions that will run asynchronously after the controller is started via Run command.
func (f *Factory) WithPostStartHooks(hooks ...PostStartHook) *Factory {

I got the feeling that we will need machineConfigInformers.WaitForCacheSync(ctx.Done()) to ensure that the informers' sync has finished before the hook is called. Still waiting for the CI job to confirm this feeling.
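For reference, a minimal sketch of that wiring, assuming the start-up code looks roughly like the line quoted above (machineConfigInformers and nodeInformerController are the names used in this PR; the rest of the setup is elided):

machineConfigInformers.Start(ctx.Done())
// Block until the MC/MCP listers are populated, so the post-start hook sees warm
// listers instead of racing with the informers (this is the option being discussed).
machineConfigInformers.WaitForCacheSync(ctx.Done())
go nodeInformerController.Run(ctx, 1)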

@hongkailiu hongkailiu force-pushed the OTA-1427-watch-mcp branch 3 times, most recently from 44c1f2c to 71e9c4c Compare February 14, 2025 05:36
@hongkailiu (Member Author)

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Must gather did not work in the last CI run.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 14, 2025

The ordering looks good even without explicitly calling machineConfigInformers.WaitForCacheSync(ctx.Done()).
The first node ci-op-6jd7864k-8c750-fc76f-master-0 got MCP/master.

Note that the first event arrived before our caches got warmed up (after the informers' caches synced), which verifies my theory in #1144 (comment) ...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890400781975162880/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-7f56cd7468-ffg4q_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0214 15:31:07.406362       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0214 15:31:07.406389       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0214 15:31:07.406459       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0214 15:31:07.423666       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0214 15:31:07.423697       1 base_controller.go:82] Caches are synced for NodeInformer
I0214 15:31:07.423727       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0214 15:31:07.423772       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-0
I0214 15:31:07.423810       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0214 15:31:07.424105       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0214 15:31:07.423828       1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-0 belongs to machine config pool master
I0214 15:31:07.430259       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-6jd7864k-8c750-fc76f-master-1
I0214 15:31:07.430279       1 nodeinformer.go:122] Node ci-op-6jd7864k-8c750-fc76f-master-1 belongs to machine config pool master
...

Let me try again.

If it shows the same result, I intend to exclude the wait and let the reconciliation triggered by MCP events do the job.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Install failure in the last CI job.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890553751542960128/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-5dc5599f49-6ff4n_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 01:45:51.099734       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 01:45:51.100905       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 01:45:51.100990       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 01:45:51.117880       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 01:45:51.118257       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 01:45:51.118563       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 01:45:51.118158       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 01:45:51.118719       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 01:45:51.118781       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-0
I0215 01:45:51.118859       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-8g5b5-master-0 belongs to machine config pool master
I0215 01:45:51.118534       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 01:45:51.118548       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 01:45:51.130188       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-8g5b5-master-1

Still good. This time, the MCP events came first.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890602608901492736/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-76769d4bcc-gdfrq_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 04:57:44.796093       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 04:57:44.796128       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 04:57:44.796132       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 04:57:44.809388       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 04:57:44.809416       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 04:57:44.809519       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 04:57:44.809521       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-0
I0215 04:57:44.809519       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 04:57:44.809539       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-0 belongs to machine config pool master
I0215 04:57:44.809561       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 04:57:44.814586       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-4gsljddq-8c750-2qcn6-master-1
I0215 04:57:44.814600       1 nodeinformer.go:122] Node ci-op-4gsljddq-8c750-2qcn6-master-1 belongs to machine config pool master

Still works: the first node got its matching MCP.

Three consecutive tests worked.

To summarize: these three things race with one another.

  • informer.Start(): not blocking. The informers' caches start to sync. The events feed the controller.
  • controller.Run() calls WaitForCacheSync().
  • the controller's PostStartHook: initializes our controller's caches, which requires the synced listers.

In all three tests so far, the informers' caches were synced quickly enough: when PostStartHook was called, the listers already had data. (That was not the case without setting up PostStartHook in my original implementation, so PostStartHook helps.)

The initialization of the controller's caches was either complete (test 2) or close to complete (tests 1 and 3) when the first event came in. Ideally, we want to ensure completeness (but I do not think we have a way to achieve that at the moment).

I feel this is good enough (the window of cold caches is small), and the MCP/MC events (which also come quickly) will reconcile all nodes if any errors were caused by cold caches. We do not have to call WaitForCacheSync explicitly.

Just for my curiosity, let me see the difference if we do it anyway.

@hongkailiu (Member Author) commented Feb 15, 2025

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890735530849603584/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-6b58f4f84d-gwwjc_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 13:38:35.934527       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 13:38:35.934609       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 13:38:35.934635       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 13:38:35.954205       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 13:38:35.954048       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 13:38:35.954334       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 13:38:35.954385       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 13:38:35.954427       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 13:38:35.954477       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 13:38:35.954493       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-0
I0215 13:38:35.954508       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-6xlg8-master-0 belongs to machine config pool master
I0215 13:38:35.954058       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0215 13:38:35.965000       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-6xlg8-master-1

It does what I expect it to do: the first event came AFTER the caches were initialized.

It looks like Test 2 above.

/test e2e-agnostic-usc-devpreview

Trying again...

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890768740337848320/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-85bcd6d7c6-f4jp7_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0215 15:27:40.242112       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0215 15:27:40.242132       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0215 15:27:40.242171       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0215 15:27:40.257544       1 base_controller.go:82] Caches are synced for NodeInformer
I0215 15:27:40.257806       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0215 15:27:40.257851       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0215 15:27:40.257670       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0215 15:27:40.257675       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/master
I0215 15:27:40.257971       1 nodeinformer.go:95] NI :: Syncing with key MachineConfigPool/worker
I0215 15:27:40.257994       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-fhcw9-master-0
I0215 15:27:40.258007       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-fhcw9-master-0 belongs to machine config pool master

Again, like Test 2 above.

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

"gather-extra failed", rerunning ...

/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

Something wrong with the cluster. Retrying ...
/test e2e-agnostic-usc-devpreview

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1890965887544987648/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-79f559cc9d-qjmlx_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0216 04:54:37.613087       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0216 04:54:37.613168       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0216 04:54:37.614167       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0216 04:54:37.625667       1 base_controller.go:82] Caches are synced for NodeInformer
I0216 04:54:37.625830       1 nodeinformer.go:209] Stored 2 machineConfigPools in the cache
I0216 04:54:37.626125       1 nodeinformer.go:219] Stored 4 machineConfig versions in the cache
I0216 04:54:37.626194       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-0
I0216 04:54:37.626232       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-0 belongs to machine config pool master
I0216 04:54:37.627826       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0216 04:54:37.633539       1 nodeinformer.go:95] NI :: Syncing with key Node/ci-op-19xpkgks-8c750-8b42p-master-1
I0216 04:54:37.633582       1 nodeinformer.go:122] Node ci-op-19xpkgks-8c750-8b42p-master-1 belongs to machine config pool master

Although the CI job failed, we got the logs from the test.
The first node event came in AFTER our caches had been initialized.

We have 3 tests without 7c289e9 and 3 with it.

Facts:

  • all 6: the informers' caches had been synced before PostStartHook was called.
  • all 6: all nodes got a matching MCP.
  • 1/3 without it vs 3/3 with it: PostStartHook had finished before the first event came.

Thinking:

  • Even without WaitForCacheSync, informers' caches were synced.
  • Even with it, it won't ensure that PostStartHook finishes before the first event comes - that is my understanding of the code. The bad ordering did not happen in the testing, and WaitForCacheSync seems to have helped achieve the goal (if the samples are not too few): 1/3 ---> 3/3. So maybe keep the WaitForCacheSync?

@petr-muller (Member)

OK, I proposed PostStartHook because I thought they would run before the reconciliation loop but they don't (check out the Run() implementation, it's actually very nice and readable...).

That means PostStartHook will really race with our sync, which makes it an inappropriate solution for the problem. I think we could use something like sync.Once (https://pkg.go.dev/sync#Once) for the warmup in sync() - the first sync would call the warmup code at the beginning, and maybe that's enough?
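A rough sketch of that idea, assuming the controller keeps a sync.Once field (the field name and the sync() shape below are placeholders, not the PR's final code):

type nodeInformerController struct {
	// ... existing fields elided
	initCachesOnce sync.Once // hypothetical field guarding the one-time warmup
}

func (c *nodeInformerController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
	// The first sync() call warms up the caches from the (already synced) listers;
	// later calls skip it. Errors are only logged, so a cold start simply waits
	// for MC/MCP events to refill the caches.
	c.initCachesOnce.Do(c.initializeCachesNoErrors)

	// ... handle syncCtx.QueueKey() as before
	return nil
}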

@hongkailiu (Member Author)

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1144/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-usc-devpreview/1894124084053151744/artifacts/e2e-agnostic-usc-devpreview/gather-extra/artifacts/pods/openshift-update-status-controller_update-status-controller-785c89d6b-dmj26_update-status-controller.log | rg 'Caches are synced|Stored|Syncing with key|belong'
I0224 21:53:16.289576       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0224 21:53:16.289608       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0224 21:53:16.289585       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController
I0224 21:53:16.306588       1 base_controller.go:82] Caches are synced for ControlPlaneInformer
I0224 21:53:16.306599       1 base_controller.go:82] Caches are synced for NodeInformer
I0224 21:53:16.306626       1 base_controller.go:82] Caches are synced for UpdateStatusController
I0224 21:53:16.306836       1 nodeinformer.go:221] Stored 2 machineConfigPools in the cache
I0224 21:53:16.306874       1 nodeinformer.go:231] Stored 2 machineConfig versions in the cache
I0224 21:53:16.306880       1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/master
I0224 21:53:16.306900       1 nodeinformer.go:101] NI :: Syncing with key MachineConfigPool/worker
I0224 21:53:16.306916       1 nodeinformer.go:101] NI :: Syncing with key Node/ci-op-yc00jphy-8c750-78krc-master-0
I0224 21:53:16.306930       1 nodeinformer.go:128] Node ci-op-yc00jphy-8c750-78krc-master-0 belongs to machine config pool master
...

The above log is from this job. It (once.Do(c.initializeCachesNoErrors)) seems to be working nicely: called ONCE and ONLY once.

@hongkailiu (Member Author)

/retest-required

1 similar comment
@hongkailiu (Member Author)

/retest-required

@petr-muller (Member) left a comment

Many small bits of feedback inline. I think the only substantial one is about the lack of robustness when reconcileAllNodes fails. The rest is mostly naming and code structure proposals - address the ones you agree with.

In line with our retro agreement, I am approving the PR so you can merge it and address the feedback in a follow-up.

/hold
/lgtm
/approve
/label no-qe

	default:
		panic(fmt.Sprintf("USC :: Unknown object type: %T", object))
	}
}

func kindAndNameToQueueKey(kind, name string) string {

nit: kindAndName is redundant with parameter names, something like queueKeyFor would probably suffice
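For illustration, the renamed helper might look like this (assuming the existing implementation simply joins kind and name, as the "MachineConfigPool/master" and "Node/..." keys in the logs above suggest):

// queueKeyFor builds the "<kind>/<name>" queue keys seen in the controller logs.
func queueKeyFor(kind, name string) string {
	return kind + "/" + name
}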

Comment on lines +159 to +162
	nodes, err := c.nodes.List(labels.Everything())
	if err != nil {
		return err
	}

OK so when this fails, it propagates through sync error, which makes the controller retry 👍

@@ -39,6 +40,10 @@ type nodeInformerController struct {
	// sendInsight should be called to send produced insights to the update status controller
	sendInsight sendInsightFn

	// machineConfigVersionCache caches machine config versions
	// The cache stores the name of MC as the key and the release image version as its value which is retrieved from the annotation of the MC.
	machineConfigVersionCache machineConfigVersionCache

nit: I'd call the member machineConfigsToVersions (type name is good though)

@@ -39,6 +40,10 @@ type nodeInformerController struct {
	// sendInsight should be called to send produced insights to the update status controller
	sendInsight sendInsightFn

	// machineConfigVersionCache caches machine config versions
	// The cache stores the name of MC as the key and the release image version as its value which is retrieved from the annotation of the MC.
	machineConfigVersionCache machineConfigVersionCache

nit: I'd call the member machineConfigVersions (type name is good though)

Comment on lines +136 to +153
	machineConfig, err := c.machineConfigs.Get(name)
	if err != nil && !kerrors.IsNotFound(err) {
		return err
	}
	if kerrors.IsNotFound(err) {
		// The machine config was deleted
		if changed := c.machineConfigVersionCache.forget(name); changed {
			klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
			return c.reconcileAllNodes(syncCtx.Queue())
		}
		return nil
	}
	if changed := c.machineConfigVersionCache.ingest(machineConfig); changed {
		klog.V(2).Infof("Reconciling all nodes as machine config %q is refreshed", name)
		return c.reconcileAllNodes(syncCtx.Queue())
	}
	return nil

code structure: individual switch branches are now complex enough for me to think we should have separate methods like syncMachineConfig, syncNode etc and just call them from here
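A sketch of that structure (the helper names follow the suggestion above; the dispatch shape and signatures are assumptions, not the PR's actual code):

// inside sync(), once kind and name have been extracted from the queue key
switch kind {
case "Node":
	return c.syncNode(ctx, syncCtx, name)
case "MachineConfig":
	return c.syncMachineConfig(ctx, syncCtx, name)
case "MachineConfigPool":
	return c.syncMachineConfigPool(ctx, syncCtx, name)
default:
	return fmt.Errorf("unknown kind %q in queue key", kind)
}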

Comment on lines +310 to +314
func (c *machineConfigPoolSelectorCache) len() int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return len(c.cache)
}

Feels like overkill to have a locked method just to log ...N config pools - I think a simple "does not belong to any MachineConfigPool" log is just as useful 🤷

Comment on lines 211 to 234
func (c *nodeInformerController) initializeCaches() error {
	var errs []error

	if pools, err := c.machineConfigPools.List(labels.Everything()); err != nil {
		errs = append(errs, err)
	} else {
		for _, pool := range pools {
			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())

	machineConfigs, err := c.machineConfigs.List(labels.Everything())
	if err != nil {
		errs = append(errs, err)
	} else {
		for _, mc := range machineConfigs {
			c.machineConfigVersionCache.ingest(mc)
		}
	}
	klog.V(2).Infof("Stored %d machineConfig versions in the cache", c.machineConfigVersionCache.len())

	return kutilerrors.NewAggregate(errs)
}

I'd prefer not to couple the two into a single method: they are similar but are used in different ways at different places - consider having separate dedicated methods (which saves us from NewAggregate, so the two methods would be even simpler) and two Once instances.
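A possible shape for that split, based on the initializeCaches body quoted above (the method names are assumptions; each would be guarded by its own Once):

func (c *nodeInformerController) initializeMachineConfigPoolCache() {
	pools, err := c.machineConfigPools.List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to warm up the MachineConfigPool selector cache: %v", err)
		return
	}
	for _, pool := range pools {
		c.machineConfigPoolSelectorCache.ingest(pool)
	}
}

func (c *nodeInformerController) initializeMachineConfigVersionCache() {
	machineConfigs, err := c.machineConfigs.List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to warm up the MachineConfig version cache: %v", err)
		return
	}
	for _, mc := range machineConfigs {
		c.machineConfigVersionCache.ingest(mc)
	}
}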

			c.machineConfigPoolSelectorCache.ingest(pool)
		}
	}
	klog.V(2).Infof("Stored %d machineConfigPools in the cache", c.machineConfigPoolSelectorCache.len())

consider not having a len() (that needs to lock) and just count the ingestions?

Comment on lines 205 to 209
func (c *nodeInformerController) initializeCachesNoErrors() {
	if err := c.initializeCaches(); err != nil {
		klog.Errorf("Failed to initialize caches: %v", err)
	}
}

I wondered whether we'd need to abort the controller 🤔

I think it is safe to continue the controller despite these failures because it just means we started cold and will eventually load all items from the cache?

		if changed := c.machineConfigVersionCache.forget(name); changed {
			klog.V(2).Infof("Reconciling all nodes as machine config %q is deleted", name)
			return c.reconcileAllNodes(syncCtx.Queue())

I think we are losing some robustness by directly returning an error here. Consider the following;

  1. MC is updated and carries a new version - cache returns changed and we get into one of these branches where we do reconcileAll
  2. The List inside reconcileAll fails so we propagate the err outside, and the controller library will requeue the MC queueKey that triggered us
  3. New sync is called for this MC - but this time the cache already has the version, so changed==false and we finish the sync without calling reconcileAll -> and end up with potentially outdated insights 😨

I'm not sure if it is an ugly hack or not, but I wonder if reconcileAllNodes could be a special synthetic queue key that we'd just queue here, and it would get its own dedicated sync() call. That synthetic call would then be handled simply by what reconcileAllNodes is doing, and on a List error it would just return it, which would cause a requeue of the reconcileAllNodes key... WDYT?
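One way the synthetic-key idea could look (the key constant and the enqueue-instead-of-call shape are assumptions, sketched against the MachineConfig branch quoted above):

const reconcileAllNodesQueueKey = "ReconcileAllNodes" // hypothetical synthetic key

// In the MachineConfig branch, enqueue the synthetic key instead of calling
// reconcileAllNodes directly. A later List failure then requeues the synthetic key
// itself rather than an MC key whose cache state has already been updated, which
// avoids the changed==false trap described in step 3.
if changed := c.machineConfigVersionCache.forget(name); changed {
	syncCtx.Queue().Add(reconcileAllNodesQueueKey)
}
return nil

A dedicated branch in sync() for reconcileAllNodesQueueKey would then just return c.reconcileAllNodes(syncCtx.Queue()), so any List error there causes the controller library to requeue the synthetic key.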

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. no-qe Allows PRs to merge without qe-approved label labels Feb 25, 2025
@petr-muller (Member)

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2025
openshift-ci bot (Contributor) commented Feb 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2025
openshift-ci bot (Contributor) commented Feb 25, 2025

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
