[Develop][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure #7154

hehe7318 · 2025-12-16T21:31:55Z

Description of changes

This is a cherry-pick PR of #7150

Add integration test to verify the following fixes work correctly:

- [F1] clustermgtd remains running after both update and rollback fail (expected when the failure occurs after the Slurm reconfiguration, which is the safe section)
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update and rollback failure

Test scenario:

1. Create a cluster with 3 static compute nodes (CN1, CN2, CN3)
2. Inject a cfn-signal failure on the head node (simulating an expired wait condition)
3. Disable cfn-hup on CN1 before the update (causes the update to fail)
4. Trigger a cluster update (add a new queue)
5. Wait for CN2 to apply the update, then disable its cfn-hup
6. The update fails (CN1 didn't update) and the rollback fails (CN2 won't roll back)
7. Verify: clustermgtd is running, dna.json files are cleaned up, CN3 has the correct config version, metadata_db.json is updated, and cfn-hup is not in an endless loop
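
To make it easier to see why both the update and the rollback fail in this scenario, here is a small toy model in Python. It is not the integration test from this PR and does not touch a real cluster. The `Node` class, the `propagate` helper and the version labels `v1`/`v2` are all hypothetical names introduced for illustration. It also assumes that the "correct config version" for CN3 means the pre-update version once the rollback has been attempted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a compute node; not a ParallelCluster type."""
    name: str
    config_version: str
    cfn_hup_enabled: bool = True


def propagate(nodes: List[Node], version: str) -> bool:
    """Push `version` to every node whose cfn-hup is enabled.

    Returns True only if all nodes end up on `version`, which is how this
    toy model decides whether an update (or rollback) succeeded.
    """
    for node in nodes:
        if node.cfn_hup_enabled:
            node.config_version = version
    return all(node.config_version == version for node in nodes)


def simulate_scenario() -> Tuple[bool, bool, Dict[str, str]]:
    # Step 1: three static compute nodes, all on the original config (v1).
    nodes = [Node(name, "v1") for name in ("CN1", "CN2", "CN3")]
    by_name = {node.name: node for node in nodes}

    # Step 3: CN1 stops listening for changes before the update.
    by_name["CN1"].cfn_hup_enabled = False

    # Step 4: the update (v2) reaches CN2 and CN3 only, so it fails overall.
    update_ok = propagate(nodes, "v2")

    # Step 5: CN2 has applied v2; now it stops listening as well.
    by_name["CN2"].cfn_hup_enabled = False

    # Step 6: the rollback (v1) reaches CN3 only, so it fails overall.
    rollback_ok = propagate(nodes, "v1")

    versions = {node.name: node.config_version for node in nodes}
    return update_ok, rollback_ok, versions


if __name__ == "__main__":
    print(simulate_scenario())
```

In this model both `update_ok` and `rollback_ok` come out `False`: CN1 never leaves `v1`, CN2 is stuck on `v2`, and CN3 is the only node that follows both transitions. That is why CN3 is the node whose config version is checked in step 7.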

Tests

Running and debugging

References

Link to impacted open issues.
Link to related PRs in other packages (i.e. cookbook, node).
Link to documentation useful to understand the changes.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


hehe7318 requested review from a team as code owners December 16, 2025 21:31

hehe7318 added the skip-changelog-update (Disables the check that enforces changelog updates in PRs) and 3.x labels Dec 16, 2025

gmarciani approved these changes Dec 16, 2025


hehe7318 enabled auto-merge (squash) December 16, 2025 21:32

hehe7318 merged commit 820e0d9 into aws:develop Dec 16, 2025
29 of 30 checks passed
