-
Notifications
You must be signed in to change notification settings - Fork 109
[UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure. #3063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c78f77b to
76165d3
Compare
|
|
||
| Chef::Log.info("#{LOG_PREFIX} Started") | ||
|
|
||
| unless node_type == 'HeadNode' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: if the logic is only about head node, the file name/class name could mention head node for better clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DESIGN PATTERNS DISCUSSION
Good point. Your comment is much more than a minor about renaming a class; it actually opens an interesting discussion about design and how to enforce single responsibilities+open/close principle. :-)
The solution that I have in mind is based on the strategy pattern, where entrypoints::update calls an UpdateFailureHandler, which is in charge of applying the right recovery strategy according to the node type. We do not want to bubble up this responsibility to upstream levels.
Regarding the name of the class, we should rename it only if :
- the class cannot be executed on node types != HeadNode: this is not true, it can safely be executed on compute/login nodes, but for those nodes the recovery strategy at the moment is a no-op. In future we may need to introduce recovery strategies also for the other types of nodes.
- the class has a single responsibility for the head node. I don't think the handler should be focused on the head node only. It's responsibility of the handler to decide the right strategy according to the node type (strategy pattern). Ideally, we should not even modify the handler body, but simply add a strategy for the specific node type to it and encapsulate that logic in the strategy. If we rename the class to something node type specific, then you would transfer the responsibility to decide the strategy to the entrypoint::update recipe, which is not correct (it would be equivalent to let that entrypoint decide which node-specific update recipe to apply, which is not the case).
In terms of single responsibilities, your comment makes me realize that the logic of command executions should be encapsulated into a dedicated class, for better separation of responsibilities. I'll do that.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! Thank you!
cookbooks/aws-parallelcluster-entrypoints/libraries/update_failure_handler.rb
Outdated
Show resolved
Hide resolved
76165d3 to
587484d
Compare
8ee12e6 to
fddfd67
Compare
fddfd67 to
1daa67f
Compare
and fix race condition making compute node deploy wrong cluster config version on update failure.
Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.
On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes
On failure on the head node, execute recovery actions:
- Clean up DNA files shared with compute nodes to prevent them from
deploying a config version that is about to be rolled back
- Restart clustermgtd if scontrol reconfigure succeeded, ensuring
cluster management resumes after update/rollback failures
1daa67f to
0343048
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #3063 +/- ##
========================================
Coverage 75.20% 75.20%
========================================
Files 24 24
Lines 2444 2444
========================================
Hits 1838 1838
Misses 606 606
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description of changes
Ensure clustermgtd runs after cluster update (unconditionally on success, safely on failure) and fix race condition where compute nodes could deploy the wrong cluster config version after an update failure.
User Experience
Update success
clustermgtd is restarted unconditionally at the end of the update recipe, regardless of whether the update includes queue changes.
Update or rollback failure
Logs emitted by the handler
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.