
Conversation

@gmarciani (Contributor) commented Dec 11, 2025

Description of changes

Ensure clustermgtd runs after a cluster update (unconditionally on success, safely on failure) and fix a race condition where compute nodes could deploy the wrong cluster config version after an update failure.

User Experience

Update success

clustermgtd is restarted unconditionally at the end of the update recipe, regardless of whether the update includes queue changes.
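
A minimal Chef sketch of what this looks like in the recipe, assuming a plain execute resource with no only_if guard on queue changes; the resource name is illustrative and the supervisorctl path is copied from the handler logs below, so treat this as a sketch rather than the exact recipe code:

# Illustrative sketch: restart clustermgtd unconditionally at the end of the update recipe.
execute 'restart clustermgtd' do
  command '/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl restart clustermgtd'
  retries 3
  retry_delay 5
end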

Update or rollback failure

  • Restart clustermgtd only if scontrol reconfigure succeeded, ensuring cluster management resumes safely
  • Clean up DNA files shared with compute nodes to prevent them from deploying a config version that is about to be rolled back (a sketch of this flow follows the list)
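
A rough Ruby sketch of that failure-path recovery, assuming a Chef report handler; the helper methods (run_recovery_command, resource_updated?), the attribute path, and the virtualenv path are assumptions for illustration, not the exact code in this PR:

require 'chef/handler'
require 'mixlib/shellout'

class UpdateFailureHandler < Chef::Handler
  VENV = '/opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin'.freeze

  def report
    return unless failed?

    # Always remove the DNA files shared with compute nodes, so they cannot
    # deploy the config version that is about to be rolled back.
    region = node['cluster']['region'] # assumed attribute path
    run_recovery_command('cleanup DNA files',
                         "#{VENV}/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region #{region} --cleanup")

    # Restart clustermgtd only if 'reload config for running nodes' (scontrol
    # reconfigure) was executed; otherwise leave it stopped so it does not
    # manage the fleet against an inconsistent Slurm configuration.
    if resource_updated?('reload config for running nodes')
      run_recovery_command('start clustermgtd', "#{VENV}/supervisorctl start clustermgtd")
    end
  end

  private

  # Hypothetical helper mirroring the "Executing: ..." lines in the logs below.
  def run_recovery_command(name, cmd)
    Chef::Log.info("UpdateFailureHandler: Executing: #{name}")
    Mixlib::ShellOut.new(cmd).run_command.error!
  end

  # Hypothetical helper: true if the named resource was updated during the run.
  def resource_updated?(name)
    run_status.updated_resources.any? { |r| r.name == name }
  end
end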

Logs emitted by the handler

[2025-12-11T22:17:49+00:00] ERROR: Running exception handlers
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Started
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Update failed on HeadNode due to: execute[Check cluster readiness] (aws-parallelcluster-slurm::update_head_node line 169) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'

... omitted details about the error ...

[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - fetch_dna_files[Fetch ComputeFleet's Dna files]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - fetch_config[Fetch and load cluster configs]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[stop clustermgtd]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - ruby_block[replace slurm queue nodes]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[generate_pcluster_slurm_configs]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[generate_pcluster_fleet_config]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - service[slurmctld]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - chef_sleep[5]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[check slurmctld status]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - execute[reload config for running nodes]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler:   - chef_sleep[15]
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Resource 'reload config for running nodes' has execution status: updated
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Running recovery commands
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Executing: cleanup DNA files
[2025-12-11T22:17:49+00:00] INFO: UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region us-east-1 --cleanup
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Command stdout:
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Command stderr: INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/extra.json
INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/LaunchTemplateB1b67670b4d707d5-dna.json
INFO:__main__:All dna.json files have been shared!

[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Successfully executed: cleanup DNA files
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: scontrol reconfigure succeeded, starting clustermgtd
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Executing: start clustermgtd
[2025-12-11T22:17:50+00:00] INFO: UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Command stdout: clustermgtd: started

[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Command stderr: /opt/parallelcluster/pyenv/versions/3.12.11/envs/cookbook_virtualenv/lib/python3.12/site-packages/supervisor/options.py:13: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Successfully executed: start clustermgtd
[2025-12-11T22:17:52+00:00] INFO: UpdateFailureHandler: Completed successfully
  - UpdateChefError::UpdateFailureHandler
Running handlers complete

Tests

  • Manual test where I injected failures in the update and rollback and verified that clustermgtd was running and no DNA JSON files were left over.
  • Manual test where I manually killed clustermgtd to simulate a corner case where it is not restarted, then verified that even an update with no changes to queues is able to restart it.
  • Integ test: PENDING

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title [UpdateWorkflow] Ensure clustermgtd is always running after an update [UpdateWorkflow] Ensure clustermgtd runs after cluster update (unconditionally on update success, on safe scenarios on update failure) and fix race condition making compute node deploy wrong cluster config version on update failure. Dec 11, 2025
@gmarciani gmarciani changed the title [UpdateWorkflow] Ensure clustermgtd runs after cluster update (unconditionally on update success, on safe scenarios on update failure) and fix race condition making compute node deploy wrong cluster config version on update failure. [UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure. Dec 11, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 3 times, most recently from c78f77b to 76165d3 Compare December 11, 2025 20:45
@gmarciani gmarciani closed this Dec 11, 2025
@gmarciani gmarciani reopened this Dec 11, 2025
hanwen-cluster previously approved these changes Dec 12, 2025

Chef::Log.info("#{LOG_PREFIX} Started")

unless node_type == 'HeadNode'
Contributor

Minor: if the logic is only about the head node, the file name/class name could mention the head node for better clarity

Contributor Author

@gmarciani gmarciani Dec 12, 2025

DESIGN PATTERNS DISCUSSION

Good point. Your comment is much more than a minor note about renaming a class; it actually opens an interesting discussion about design and how to enforce the single-responsibility and open/closed principles. :-)

The solution that I have in mind is based on the strategy pattern, where entrypoints::update calls an UpdateFailureHandler, which is in charge of applying the right recovery strategy according to the node type. We do not want to bubble up this responsibility to upstream levels.

Regarding the name of the class, we should rename it only if:

  1. the class cannot be executed on node types other than HeadNode: this is not true, it can safely be executed on compute/login nodes, but for those nodes the recovery strategy is currently a no-op. In the future we may need to introduce recovery strategies for the other node types as well.
  2. the class has a single responsibility for the head node. I don't think the handler should be focused on the head node only. It is the responsibility of the handler to decide the right strategy according to the node type (strategy pattern). Ideally, we should not even modify the handler body, but simply add a strategy for the specific node type and encapsulate that logic in the strategy. If we renamed the class to something node-type specific, we would transfer the responsibility of choosing the strategy to the entrypoint::update recipe, which is not correct (it would be equivalent to letting that entrypoint decide which node-specific update recipe to apply, which is not the case).

On the single-responsibility front, your comment makes me realize that the command-execution logic should be encapsulated in a dedicated class, for better separation of concerns. I'll do that. A rough sketch of the overall shape I have in mind follows below.
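
For reference, a hypothetical sketch of that shape; the strategy class names and the node-type attribute path are illustrative, not the code in this PR:

require 'chef/handler'

# Recovery strategy for the head node: clean up shared DNA files, then restart
# clustermgtd if it is safe to do so.
class HeadNodeRecoveryStrategy
  def initialize(run_status)
    @run_status = run_status
  end

  def recover
    # head-node-specific recovery actions go here
  end
end

# Compute and login nodes currently need no recovery actions.
class NoOpRecoveryStrategy
  def initialize(_run_status); end

  def recover; end
end

class UpdateFailureHandler < Chef::Handler
  STRATEGIES = { 'HeadNode' => HeadNodeRecoveryStrategy }.freeze

  def report
    return unless failed?

    # Pick the strategy from the node type; supporting a new node type means
    # adding a strategy class, not editing the handler body (open/closed).
    node_type = node['cluster']['node_type'] # assumed attribute path
    STRATEGIES.fetch(node_type, NoOpRecoveryStrategy).new(run_status).recover
  end
end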

what do you think?

Contributor

I agree! Thank you!

@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch from 76165d3 to 587484d Compare December 12, 2025 17:57
hanwen-cluster previously approved these changes Dec 12, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 4 times, most recently from 8ee12e6 to fddfd67 Compare December 12, 2025 20:04
@gmarciani gmarciani marked this pull request as ready for review December 12, 2025 20:05
@gmarciani gmarciani requested review from a team as code owners December 12, 2025 20:05
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch 2 times, most recently from fddfd67 to 1daa67f Compare December 15, 2025 14:31
[UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure.

Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.

On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes.

On failure on the head node, execute recovery actions:
  - Clean up DNA files shared with compute nodes to prevent them from
    deploying a config version that is about to be rolled back
  - Restart clustermgtd if scontrol reconfigure succeeded, ensuring
    cluster management resumes after update/rollback failures
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch from 1daa67f to 0343048 Compare December 15, 2025 15:39
codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.20%. Comparing base (a2651f4) to head (0343048).
⚠️ Report is 49 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3063   +/-   ##
========================================
  Coverage    75.20%   75.20%           
========================================
  Files           24       24           
  Lines         2444     2444           
========================================
  Hits          1838     1838           
  Misses         606      606           
Flag        Coverage Δ
unittests   75.20% <ø> (ø)

Flags with carried forward coverage won't be shown.

@gmarciani gmarciani merged commit 8167f39 into aws:develop Dec 15, 2025
30 of 32 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1 branch December 15, 2025 16:21