@hehe7318
Contributor

Description of changes

This is a cherry-pick PR of #7150

Add integration test to verify the following fixes work correctly:

  • [F1] clustermgtd remains running after both the update and the rollback fail
    (expected when the failure occurs after Slurm reconfiguration, which is the safe section)
  • [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
  • [F3] dna.json files are cleaned up after update and rollback failure
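
The PR does not say how the test detects the endless loop in [F2]. As a minimal sketch, assuming the test can collect the timestamps at which the cfn-hup update hook ran, a loop could be flagged when too many runs fall inside a short window. The function name, the 10-minute window, and the run threshold are all hypothetical and not taken from the PR.

```python
from datetime import datetime, timedelta


def looks_like_endless_loop(run_times, window=timedelta(minutes=10), max_runs=3):
    """Return True if more than `max_runs` hook runs fall within any `window`.

    `run_times` is a list of datetime objects; how they are collected
    (for example by parsing a log) is left out of this sketch on purpose.
    The default window and threshold are illustrative assumptions.
    """
    times = sorted(run_times)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 > max_runs:
            return True
    return False
```

A few runs spread over hours would pass this check, while a hook re-firing every minute would be reported as a loop.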

Test scenario:

  1. Create a cluster with 3 static compute nodes (CN1, CN2, CN3)
  2. Inject a cfn-signal failure on the head node (simulating an expired wait condition)
  3. Disable cfn-hup on CN1 before the update (this causes the update to fail)
  4. Trigger a cluster update (add a new queue)
  5. Wait for CN2 to apply the update, then disable its cfn-hup
  6. The update fails (CN1 did not update) and the rollback fails (CN2 cannot roll back)
  7. Verify: clustermgtd is running, dna.json files are cleaned up, CN3 has the correct config version, metadata_db.json is updated, and cfn-hup is not in an endless loop
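
The final verification step can be pictured as a set of checks over state collected from the nodes. The sketch below is not the test added by this PR. `NodeSnapshot`, its fields, and `check_final_state` are hypothetical names, and how the state is gathered (remote commands, file reads) is left out. The metadata_db.json and cfn-hup loop checks are omitted because the PR does not describe their expected values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeSnapshot:
    """Hypothetical summary of what a test collected from one node."""
    name: str
    clustermgtd_running: Optional[bool] = None  # only meaningful on the head node
    leftover_dna_files: List[str] = field(default_factory=list)
    config_version: Optional[str] = None


def check_final_state(head, nodes, cn3, expected_version):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # [F1] clustermgtd must still be running on the head node.
    if not head.clustermgtd_running:
        problems.append("[F1] clustermgtd is not running on the head node")
    # [F3] no dna.json files should be left behind on any inspected node.
    for node in nodes:
        if node.leftover_dna_files:
            problems.append(f"[F3] {node.name} still has dna.json files: {node.leftover_dna_files}")
    # CN3 must report the config version the test expects (passed in by the caller).
    if cn3.config_version != expected_version:
        problems.append(
            f"{cn3.name} has config version {cn3.config_version!r}, expected {expected_version!r}"
        )
    return problems
```

Returning a list of problems rather than asserting on the first failure lets a single run report every violated condition, which helps when each run requires creating a cluster.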

Tests

  • Running and debugging

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…aused by cluster update and rollback failure (aws#7150)

@hehe7318 requested review from a team as code owners on December 16, 2025 21:31
@hehe7318 added the skip-changelog-update and 3.x labels on Dec 16, 2025
@hehe7318 enabled auto-merge (squash) on December 16, 2025 21:32
@hehe7318 merged commit 820e0d9 into aws:develop on Dec 16, 2025
29 of 30 checks passed