-
Notifications
You must be signed in to change notification settings - Fork 110
[UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute node deploy wrong cluster config version on update failure. #3063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
gmarciani
merged 1 commit into
aws:develop
from
gmarciani:wip/mgiacomo/3150/fix-clustermgtd-restart-1211-1
Dec 15, 2025
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
56 changes: 56 additions & 0 deletions
56
cookbooks/aws-parallelcluster-entrypoints/libraries/command_runner.rb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| # | ||
| # Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the | ||
| # License. A copy of the License is located at | ||
| # | ||
| # http://aws.amazon.com/apache2.0/ | ||
| # | ||
| # or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
| # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| module ErrorHandlers | ||
| # Executes shell commands with retry logic and logging. | ||
| class CommandRunner | ||
| include Chef::Mixin::ShellOut | ||
|
|
||
| DEFAULT_RETRIES = 10 | ||
| DEFAULT_RETRY_DELAY = 90 | ||
| DEFAULT_TIMEOUT = 30 | ||
|
|
||
| def initialize(log_prefix:) | ||
| @log_prefix = log_prefix | ||
| end | ||
|
|
||
| def run_with_retries(command, description:, retries: DEFAULT_RETRIES, retry_delay: DEFAULT_RETRY_DELAY, timeout: DEFAULT_TIMEOUT) | ||
| Chef::Log.info("#{@log_prefix} Executing: #{description}") | ||
| max_attempts = retries + 1 | ||
|
|
||
| max_attempts.times do |attempt| | ||
| attempt_num = attempt + 1 | ||
| Chef::Log.info("#{@log_prefix} Running command (attempt #{attempt_num}/#{max_attempts}): #{command}") | ||
| result = shell_out(command, timeout: timeout) | ||
| Chef::Log.info("#{@log_prefix} Command stdout: #{result.stdout}") | ||
| Chef::Log.info("#{@log_prefix} Command stderr: #{result.stderr}") | ||
|
|
||
| if result.exitstatus == 0 | ||
| Chef::Log.info("#{@log_prefix} Successfully executed: #{description}") | ||
| return true | ||
| end | ||
|
|
||
| Chef::Log.warn("#{@log_prefix} Failed to #{description} (attempt #{attempt_num}/#{max_attempts})") | ||
|
|
||
| if attempt_num < max_attempts | ||
| Chef::Log.info("#{@log_prefix} Retrying in #{retry_delay} seconds...") | ||
| sleep(retry_delay) | ||
| end | ||
| end | ||
|
|
||
| Chef::Log.error("#{@log_prefix} Failed to #{description} after #{max_attempts} attempts") | ||
| false | ||
| end | ||
| end | ||
| end |
114 changes: 114 additions & 0 deletions
114
cookbooks/aws-parallelcluster-entrypoints/libraries/update_failure_handler.rb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| # | ||
| # Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the | ||
| # License. A copy of the License is located at | ||
| # | ||
| # http://aws.amazon.com/apache2.0/ | ||
| # | ||
| # or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
| # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| require 'chef/handler' | ||
| require_relative 'command_runner' | ||
|
|
||
| module ErrorHandlers | ||
| # Chef exception handler for cluster update failures. | ||
| # | ||
| # This handler is triggered when the update recipe fails. It performs recovery actions | ||
| # to restore the cluster to a consistent state: | ||
| # 1. Logs information about the update failure including which resources succeeded before failure | ||
| # 2. Cleans up DNA files shared with compute nodes | ||
| # 3. Starts clustermgtd if scontrol reconfigure succeeded | ||
| # | ||
| # Only runs on HeadNode - compute and login nodes skip this handler. | ||
| class UpdateFailureHandler < Chef::Handler | ||
| def report | ||
| Chef::Log.info("#{log_prefix} Started") | ||
|
|
||
| unless node_type == 'HeadNode' | ||
| Chef::Log.info("#{log_prefix} Node type is #{node_type}, recovery from update failure only executes on the HeadNode") | ||
| return | ||
| end | ||
|
|
||
| begin | ||
| write_error_report | ||
| run_recovery | ||
| Chef::Log.info("#{log_prefix} Completed successfully") | ||
| rescue => e | ||
| Chef::Log.error("#{log_prefix} Failed with error: #{e.message}") | ||
| Chef::Log.error("#{log_prefix} Backtrace: #{e.backtrace.join("\n")}") | ||
| end | ||
| end | ||
|
|
||
| def write_error_report | ||
| Chef::Log.info("#{log_prefix} Update failed on #{node_type} due to: #{run_status.exception}") | ||
| Chef::Log.info("#{log_prefix} Resources that have been successfully executed before the failure:") | ||
| run_status.updated_resources.each do |resource| | ||
| Chef::Log.info("#{log_prefix} - #{resource}") | ||
| end | ||
| end | ||
|
|
||
| def run_recovery | ||
| Chef::Log.info("#{log_prefix} Running recovery commands") | ||
|
|
||
| # Cleanup DNA files | ||
| cleanup_dna_files | ||
|
|
||
| # Start clustermgtd if scontrol reconfigure succeeded | ||
| # Must match SCONTROL_RECONFIGURE_RESOURCE_NAME in aws-parallelcluster-slurm/libraries/update.rb | ||
| scontrol_reconfigure_resource_name = 'reload config for running nodes' | ||
| Chef::Log.info("#{log_prefix} Resource '#{scontrol_reconfigure_resource_name}' has execution status: #{resource_status(scontrol_reconfigure_resource_name)}") | ||
| if resource_succeeded?(scontrol_reconfigure_resource_name) | ||
| Chef::Log.info("#{log_prefix} scontrol reconfigure succeeded, starting clustermgtd") | ||
| start_clustermgtd | ||
| else | ||
| Chef::Log.info("#{log_prefix} scontrol reconfigure did not succeed, skipping clustermgtd start") | ||
| end | ||
| end | ||
|
|
||
| def cleanup_dna_files | ||
| command = "#{cookbook_virtualenv_path}/bin/python #{cluster_attributes['scripts_dir']}/share_compute_fleet_dna.py --region #{cluster_attributes['region']} --cleanup" | ||
| command_runner.run_with_retries(command, description: "cleanup DNA files") | ||
| end | ||
|
|
||
| def start_clustermgtd | ||
| command = "#{cookbook_virtualenv_path}/bin/supervisorctl start clustermgtd" | ||
| command_runner.run_with_retries(command, description: "start clustermgtd") | ||
| end | ||
|
|
||
| def cluster_attributes | ||
| run_status.node['cluster'] | ||
| end | ||
|
|
||
| def node_type | ||
| cluster_attributes['node_type'] | ||
| end | ||
|
|
||
| def cookbook_virtualenv_path | ||
| "#{cluster_attributes['system_pyenv_root']}/versions/#{cluster_attributes['python-version']}/envs/cookbook_virtualenv" | ||
| end | ||
|
|
||
| def resource_succeeded?(resource_name) | ||
| %i(updated up_to_date).include?(resource_status(resource_name)) | ||
| end | ||
|
|
||
| def resource_status(resource_name) | ||
| # Use action_collection directly (inherited from Chef::Handler) | ||
| action_records = action_collection.filtered_collection | ||
| record = action_records.find { |r| r.new_resource.resource_name == :execute && r.new_resource.name == resource_name } | ||
| record ? record.status : :not_executed | ||
| end | ||
|
|
||
| def command_runner | ||
| @command_runner ||= CommandRunner.new(log_prefix: log_prefix) | ||
| end | ||
|
|
||
| def log_prefix | ||
| @log_prefix ||= "#{self.class.name}:" | ||
| end | ||
| end | ||
| end | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
128 changes: 128 additions & 0 deletions
128
cookbooks/aws-parallelcluster-entrypoints/spec/unit/libraries/command_runner_spec.rb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| # Copyright:: 2025 Amazon.com, Inc. and its affiliates. All Rights Reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the | ||
| # License. A copy of the License is located at | ||
| # | ||
| # http://aws.amazon.com/apache2.0/ | ||
| # | ||
| # or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
| # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| require_relative '../../spec_helper' | ||
| require_relative '../../../libraries/command_runner' | ||
|
|
||
| describe ErrorHandlers::CommandRunner do | ||
| let(:log_prefix) { 'TestPrefix:' } | ||
| let(:runner) { described_class.new(log_prefix: log_prefix) } | ||
| let(:command) { 'test command' } | ||
| let(:description) { 'test operation' } | ||
| let(:shell_out_result) { double('shell_out_result', exitstatus: 0, stdout: 'success', stderr: '') } | ||
|
|
||
| before do | ||
| allow(runner).to receive(:shell_out).and_return(shell_out_result) | ||
| allow(runner).to receive(:sleep) | ||
| end | ||
|
|
||
| describe '#run_with_retries' do | ||
| context 'when command succeeds on first attempt' do | ||
| it 'returns true and does not retry' do | ||
| expect(runner).to receive(:shell_out).once.and_return(shell_out_result) | ||
| expect(runner).not_to receive(:sleep) | ||
| expect(runner.run_with_retries(command, description: description)).to be true | ||
| end | ||
|
|
||
| it 'logs stdout and stderr' do | ||
| allow(Chef::Log).to receive(:info) | ||
| expect(Chef::Log).to receive(:info).with(/Command stdout: success/) | ||
| expect(Chef::Log).to receive(:info).with(/Command stderr:/) | ||
| runner.run_with_retries(command, description: description) | ||
| end | ||
|
|
||
| it 'logs success message' do | ||
| allow(Chef::Log).to receive(:info) | ||
| expect(Chef::Log).to receive(:info).with(/Successfully executed: test operation/) | ||
| runner.run_with_retries(command, description: description) | ||
| end | ||
| end | ||
|
|
||
| context 'when command fails then succeeds' do | ||
| let(:failed_result) { double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') } | ||
|
|
||
| it 'retries and returns true on success' do | ||
| expect(runner).to receive(:shell_out).and_return(failed_result, shell_out_result) | ||
| expect(runner).to receive(:sleep).with(90).once | ||
| expect(runner.run_with_retries(command, description: description, retries: 1)).to be true | ||
| end | ||
|
|
||
| it 'logs retry message' do | ||
| allow(runner).to receive(:shell_out).and_return(failed_result, shell_out_result) | ||
| allow(Chef::Log).to receive(:info) | ||
| allow(Chef::Log).to receive(:warn) | ||
| expect(Chef::Log).to receive(:info).with(/Retrying in 90 seconds/) | ||
| runner.run_with_retries(command, description: description, retries: 1) | ||
| end | ||
| end | ||
|
|
||
| context 'when command fails all attempts' do | ||
| let(:failed_result) { double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') } | ||
|
|
||
| it 'returns false after exhausting retries' do | ||
| allow(runner).to receive(:shell_out).and_return(failed_result) | ||
| expect(runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0)).to be false | ||
| end | ||
|
|
||
| it 'logs error after all attempts fail' do | ||
| allow(runner).to receive(:shell_out).and_return(failed_result) | ||
| expect(Chef::Log).to receive(:error).with(/Failed to test operation after 2 attempts/) | ||
| runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0) | ||
| end | ||
|
|
||
| it 'logs warning for each failed attempt' do | ||
| allow(runner).to receive(:shell_out).and_return(failed_result) | ||
| allow(Chef::Log).to receive(:info) | ||
| allow(Chef::Log).to receive(:error) | ||
| expect(Chef::Log).to receive(:warn).with(%r{Failed to test operation \(attempt 1/2\)}) | ||
| expect(Chef::Log).to receive(:warn).with(%r{Failed to test operation \(attempt 2/2\)}) | ||
| runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0) | ||
| end | ||
| end | ||
|
|
||
| context 'with custom retry parameters' do | ||
| it 'respects custom retries count' do | ||
| failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') | ||
| allow(runner).to receive(:shell_out).and_return(failed_result) | ||
| expect(runner).to receive(:shell_out).exactly(3).times | ||
| runner.run_with_retries(command, description: description, retries: 2, retry_delay: 0) | ||
| end | ||
|
|
||
| it 'respects custom retry delay' do | ||
| failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') | ||
| allow(runner).to receive(:shell_out).and_return(failed_result, shell_out_result) | ||
| expect(runner).to receive(:sleep).with(30).once | ||
| runner.run_with_retries(command, description: description, retries: 1, retry_delay: 30) | ||
| end | ||
|
|
||
| it 'respects custom timeout' do | ||
| expect(runner).to receive(:shell_out).with(command, timeout: 60).and_return(shell_out_result) | ||
| runner.run_with_retries(command, description: description, timeout: 60) | ||
| end | ||
| end | ||
|
|
||
| context 'with default parameters' do | ||
| it 'uses DEFAULT_RETRIES' do | ||
| failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') | ||
| allow(runner).to receive(:shell_out).and_return(failed_result) | ||
| expect(runner).to receive(:shell_out).exactly(11).times # 10 retries + 1 initial = 11 attempts | ||
| runner.run_with_retries(command, description: description, retry_delay: 0) | ||
| end | ||
|
|
||
| it 'uses DEFAULT_TIMEOUT' do | ||
| expect(runner).to receive(:shell_out).with(command, timeout: 30).and_return(shell_out_result) | ||
| runner.run_with_retries(command, description: description) | ||
| end | ||
| end | ||
| end | ||
| end |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: if the logic is only about head node, the file name/class name could mention head node for better clarity
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DESIGN PATTERNS DISCUSSION
Good point. Your comment is much more than a minor about renaming a class; it actually opens an interesting discussion about design and how to enforce single responsibilities+open/close principle. :-)
The solution that I have in mind is based on the strategy pattern, where entrypoints::update calls an UpdateFailureHandler, which is in charge of applying the right recovery strategy according to the node type. We do not want to bubble up this responsibility to upstream levels.
Regarding the name of the class, we should rename it only if :
In terms of single responsibilities, your comment makes me realize that the logic of command executions should be encapsulated into a dedicated class, for better separation of responsibilities. I'll do that.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! Thank you!