
Conversation

@gmarciani (Contributor) commented Dec 11, 2025

Description of changes

Ensure clustermgtd runs after a cluster update (unconditionally on success, safely on failure) and fix a race condition where compute nodes could deploy the wrong cluster config version after an update failure.

User Experience

Update success

clustermgtd is restarted unconditionally at the end of the update recipe, regardless of whether the update includes queue changes.
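
A minimal Chef sketch of what this looks like in the recipe, assuming a plain execute resource with no only_if guard on queue changes; the resource name is illustrative and the supervisorctl path is copied from the handler logs below, so treat this as a sketch rather than the exact recipe code:

# Illustrative sketch: restart clustermgtd unconditionally at the end of the update recipe.
execute 'restart clustermgtd' do
  command '/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl restart clustermgtd'
  retries 3
  retry_delay 5
end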

Update or rollback failure

  • Restart clustermgtd only if scontrol reconfigure succeeded, ensuring cluster management resumes safely
  • Clean up DNA files shared with compute nodes to prevent them from deploying a config version that is about to be rolled back (a sketch of this flow follows the list)
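
A rough Ruby sketch of that failure-path recovery, assuming a Chef report handler; the helper methods (run_recovery_command, resource_updated?), the attribute path, and the virtualenv path are assumptions for illustration, not the exact code in this PR:

require 'chef/handler'
require 'mixlib/shellout'

class UpdateFailureHandler < Chef::Handler
  VENV = '/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin'.freeze

  def report
    return unless failed?

    # Always remove the DNA files shared with compute nodes, so they cannot
    # deploy the config version that is about to be rolled back.
    region = node['cluster']['region'] # assumed attribute path
    run_recovery_command('cleanup DNA files',
                         "#{VENV}/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region #{region} --cleanup")

    # Restart clustermgtd only if 'reload config for running nodes' (scontrol
    # reconfigure) was executed; otherwise leave it stopped so it does not
    # manage the fleet against an inconsistent Slurm configuration.
    if resource_updated?('reload config for running nodes')
      run_recovery_command('start clustermgtd', "#{VENV}/supervisorctl start clustermgtd")
    end
  end

  private

  # Hypothetical helper mirroring the "Executing: ..." lines in the logs below.
  def run_recovery_command(name, cmd)
    Chef::Log.info("UpdateFailureHandler: Executing: #{name}")
    Mixlib::ShellOut.new(cmd).run_command.error!
  end

  # Hypothetical helper: true if the named resource was updated during the run.
  def resource_updated?(name)
    run_status.updated_resources.any? { |r| r.name == name }
  end
end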

Logs emitted by the handler

[2025-12-11T22:17:49+00:00] ERROR: Running exception handlers
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Started
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Update failed on HeadNode due to: execute[Check cluster readiness] (aws-parallelcluster-slurm::update_head_node line 169) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'

... omitted details about the error ...

[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - fetch_dna_files[Fetch ComputeFleet's Dna files]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - fetch_config[Fetch and load cluster configs]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[stop clustermgtd]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - ruby_block[replace slurm queue nodes]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[generate_pcluster_slurm_configs]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[generate_pcluster_fleet_config]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - service[slurmctld]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - chef_sleep[5]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[check slurmctld status]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[reload config for running nodes]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - chef_sleep[15]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Resource 'reload config for running nodes' has execution status: updated
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Running recovery commands
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Executing: cleanup DNA files
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region us-east-1 --cleanup
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Command stdout:
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Command stderr: INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/extra.json
INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/LaunchTemplateB1b67670b4d707d5-dna.json
INFO:__main__:All dna.json files have been shared!

[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Successfully executed: cleanup DNA files
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: scontrol reconfigure succeeded, starting clustermgtd
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Executing: start clustermgtd
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Command stdout: clustermgtd: started

[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Command stderr: /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/supervisor/options.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Successfully executed: start clustermgtd
[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Completed successfully
  - UpdateChefError::UpdateFailureHandler
Running handlers complete

Tests

  • Manual test where I injected failures in the update and rollback and verified that clustermgtd was running and no DNA JSON files were left over.
  • Manual test where I manually killed clustermgtd to simulate a corner case where it is not restarted, then verified that even an update with no changes to queues is able to restart it.
  • Integ test: PENDING

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title [UpdateWorkflow] Ensure clustermgtd is always running after an update [UpdateWorkflow] Ensure clustermgtd runs after cluster update (unconditionally on update success, on safe scenarios on update failure) and fix race condition making compute node deploy wrong cluster config version on update failure. Dec 11, 2025
@gmarciani gmarciani changed the title [UpdateWorkflow] Ensure clustermgtd runs after cluster update (unconditionally on update success, on safe scenarios on update failure) and fix race condition making compute node deploy wrong cluster config version on update failure. [UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure. Dec 11, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 3 times, most recently from c78f77b to 76165d3 Compare December 11, 2025 20:45
@gmarciani gmarciani closed this Dec 11, 2025
@gmarciani gmarciani reopened this Dec 11, 2025
hanwen-cluster previously approved these changes Dec 12, 2025

Chef::Log.info("#{LOG_PREFIX} Started")

unless node_type == 'HeadNode'
Contributor

Minor: if the logic is only about the head node, the file name/class name could mention the head node for better clarity

Contributor Author

@gmarciani gmarciani Dec 12, 2025

DESIGN PATTERNS DISCUSSION

Good point. Your comment is much more than a minor note about renaming a class; it actually opens an interesting discussion about design and how to enforce the single-responsibility and open/closed principles. :-)

The solution that I have in mind is based on the strategy pattern, where entrypoints::update calls an UpdateFailureHandler, which is in charge of applying the right recovery strategy according to the node type. We do not want to bubble up this responsibility to upstream levels.

Regarding the name of the class, we should rename it only if:

  1. the class cannot be executed on node types other than HeadNode: this is not true, it can safely be executed on compute/login nodes, but for those nodes the recovery strategy is currently a no-op. In the future we may need to introduce recovery strategies for the other node types as well.
  2. the class has a single responsibility for the head node. I don't think the handler should be focused on the head node only. It is the responsibility of the handler to decide the right strategy according to the node type (strategy pattern). Ideally, we should not even modify the handler body, but simply add a strategy for the specific node type and encapsulate that logic in the strategy. If we renamed the class to something node-type specific, we would transfer the responsibility of choosing the strategy to the entrypoint::update recipe, which is not correct (it would be equivalent to letting that entrypoint decide which node-specific update recipe to apply, which is not the case).

On the single-responsibility front, your comment makes me realize that the command-execution logic should be encapsulated in a dedicated class, for better separation of concerns. I'll do that. A rough sketch of the overall shape I have in mind follows below.
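
For reference, a hypothetical sketch of that shape; the strategy class names and the node-type attribute path are illustrative, not the code in this PR:

require 'chef/handler'

# Recovery strategy for the head node: clean up shared DNA files, then restart
# clustermgtd if it is safe to do so.
class HeadNodeRecoveryStrategy
  def initialize(run_status)
    @run_status = run_status
  end

  def recover
    # head-node-specific recovery actions go here
  end
end

# Compute and login nodes currently need no recovery actions.
class NoOpRecoveryStrategy
  def initialize(_run_status); end

  def recover; end
end

class UpdateFailureHandler < Chef::Handler
  STRATEGIES = { 'HeadNode' => HeadNodeRecoveryStrategy }.freeze

  def report
    return unless failed?

    # Pick the strategy from the node type; supporting a new node type means
    # adding a strategy class, not editing the handler body (open/closed).
    node_type = node['cluster']['node_type'] # assumed attribute path
    STRATEGIES.fetch(node_type, NoOpRecoveryStrategy).new(run_status).recover
  end
end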

what do you think?

Contributor

I agree! Thank you!

@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch from 76165d3 to 587484d Compare December 12, 2025 17:57
hanwen-cluster previously approved these changes Dec 12, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 4 times, most recently from 8ee12e6 to fddfd67 Compare December 12, 2025 20:04
@gmarciani gmarciani marked this pull request as ready for review December 12, 2025 20:05
@gmarciani gmarciani requested review from a team as code owners December 12, 2025 20:05
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 2 times, most recently from fddfd67 to 1daa67f Compare December 15, 2025 14:31
[UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure.

Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.

On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes.

On failure on the head node, execute recovery actions:
  - Clean up DNA files shared with compute nodes to prevent them from
    deploying a config version that is about to be rolled back
  - Restart clustermgtd if scontrol reconfigure succeeded, ensuring
    cluster management resumes after update/rollback failures
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch from 1daa67f to 0343048 Compare December 15, 2025 15:39
codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.20%. Comparing base (a2651f4) to head (0343048).
⚠️ Report is 49 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3063   +/-   ##
========================================
  Coverage    75.20%   75.20%           
========================================
  Files           24       24           
  Lines         2444     2444           
========================================
  Hits          1838     1838           
  Misses         606      606           
Flag        Coverage Δ
unittests   75.20% <ø> (ø)

Flags with carried forward coverage won't be shown.

@gmarciani gmarciani merged commit 8167f39 into aws:develop Dec 15, 2025
30 of 32 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch December 15, 2025 16:21