@gmarciani
Contributor

Cherry-picked from #3063
See description of changes and tests in the original PR.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani requested review from a team as code owners December 12, 2025 20:23
@gmarciani gmarciani force-pushed the wip/mgiacomo/3141/fix-clustermgtd-restart-1211-1 branch 4 times, most recently from 6d8a15f to 84bb465 Compare December 15, 2025 14:30
and fix race condition making compute node deploy wrong cluster config version on update failure.

Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.

On success, restart clustermgtd unconditionally at the end of the update recipe,
whether or not the update includes queue changes.

On failure on the head node, execute recovery actions:
  - Clean up DNA files shared with compute nodes to prevent them from
    deploying a config version that is about to be rolled back
  - Restart clustermgtd if scontrol reconfigure succeeded, ensuring
    cluster management resumes after update/rollback failures
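
The decision logic described above can be sketched roughly as follows. This is an illustrative Python model only, not the cookbook's actual recipe code: the function name, the action names, and the ordering of cleanup before restart are assumptions made for the example.

```python
from typing import List

# Hypothetical action names, used only for illustration.
CLEANUP_SHARED_DNA = "cleanup_shared_dna_files"
RESTART_CLUSTERMGTD = "restart_clustermgtd"


def post_update_actions(update_succeeded: bool,
                        reconfigure_succeeded: bool = False) -> List[str]:
    """Return the follow-up actions for a head-node update, in order."""
    if update_succeeded:
        # Success: restart unconditionally, with or without queue changes.
        return [RESTART_CLUSTERMGTD]

    # Failure: remove the DNA files shared with compute nodes so they do not
    # deploy a config version that is about to be rolled back.
    actions = [CLEANUP_SHARED_DNA]

    # Restart clustermgtd only if scontrol reconfigure succeeded.
    if reconfigure_succeeded:
        actions.append(RESTART_CLUSTERMGTD)
    return actions
```

Under these assumptions, a failed update whose `scontrol reconfigure` also failed triggers only the shared DNA cleanup, while every other path ends with a clustermgtd restart.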
@gmarciani gmarciani force-pushed the wip/mgiacomo/3141/fix-clustermgtd-restart-1211-1 branch from 84bb465 to 2f02db9 Compare December 15, 2025 15:40
@codecov

codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.22%. Comparing base (f4ed435) to head (2f02db9).
⚠️ Report is 2 commits behind head on release-3.14.

Additional details and impacted files
@@              Coverage Diff              @@
##           release-3.14    #3064   +/-   ##
=============================================
  Coverage         75.22%   75.22%           
=============================================
  Files                24       24           
  Lines              2446     2446           
=============================================
  Hits               1840     1840           
  Misses              606      606           
Flag Coverage Δ
unittests 75.22% <ø> (ø)

@gmarciani gmarciani merged commit da4e9d1 into aws:release-3.14 Dec 15, 2025
30 of 32 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3141/fix-clustermgtd-restart-1211-1 branch December 15, 2025 16:21