Skip to content

Commit 1f2523e

Browse files
committed
[UpdateWorkflow] Ensure clustermgtd runs after cluster update
and fix race condition making compute node deploy wrong cluster config version on update failure. Ensure clustermgtd is running after an update completes, regardless of whether the update succeeded or failed. On success, restart clustermgtd unconditionally at the end of the update recipe, regardless of whether the update includes queue changes On failure on the head node, execute recovery actions: - Clean up DNA files shared with compute nodes to prevent them from deploying a config version that is about to be rolled back - Restart clustermgtd if scontrol reconfigure succeeded, ensuring cluster management resumes after update/rollback failures
1 parent 48044c2 commit 1f2523e

File tree

10 files changed

+542
-2
lines changed

10 files changed

+542
-2
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,9 @@ This file is used to list changes made in each version of the AWS ParallelCluste
99
3.14.1
1010
------
1111

12+
**ENHANCEMENTS**
13+
- Ensure clustermgtd runs after cluster update. On success, restart unconditionally. On failure, restart if the queue reconfiguration succeeded.
14+
1215
**CHANGES**
1316
- Add chef attribute `cluster/in_place_update_on_fleet_enabled` to disable in-place updates on compute and login nodes
1417
and achieve better performance at scale.
@@ -27,6 +30,9 @@ This file is used to list changes made in each version of the AWS ParallelCluste
2730
- Rdma-core: rdma-core-59.0-1
2831
- Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.8-11
2932

33+
**BUG FIXES**
34+
- Fix race condition where compute nodes could deploy the wrong cluster config version after an update failure.
35+
3036
3.14.0
3137
------
3238

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
# frozen_string_literal: true
2+
3+
#
4+
# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
5+
#
6+
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
7+
# License. A copy of the License is located at
8+
#
9+
# http://aws.amazon.com/apache2.0/
10+
#
11+
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
12+
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
module ErrorHandlers
16+
# Executes shell commands with retry logic and logging.
17+
class CommandRunner
18+
include Chef::Mixin::ShellOut
19+
20+
DEFAULT_RETRIES = 10
21+
DEFAULT_RETRY_DELAY = 90
22+
DEFAULT_TIMEOUT = 30
23+
24+
def initialize(log_prefix:)
25+
@log_prefix = log_prefix
26+
end
27+
28+
def run_with_retries(command, description:, retries: DEFAULT_RETRIES, retry_delay: DEFAULT_RETRY_DELAY, timeout: DEFAULT_TIMEOUT)
29+
Chef::Log.info("#{@log_prefix} Executing: #{description}")
30+
max_attempts = retries + 1
31+
32+
max_attempts.times do |attempt|
33+
attempt_num = attempt + 1
34+
Chef::Log.info("#{@log_prefix} Running command (attempt #{attempt_num}/#{max_attempts}): #{command}")
35+
result = shell_out(command, timeout: timeout)
36+
Chef::Log.info("#{@log_prefix} Command stdout: #{result.stdout}")
37+
Chef::Log.info("#{@log_prefix} Command stderr: #{result.stderr}")
38+
39+
if result.exitstatus == 0
40+
Chef::Log.info("#{@log_prefix} Successfully executed: #{description}")
41+
return true
42+
end
43+
44+
Chef::Log.warn("#{@log_prefix} Failed to #{description} (attempt #{attempt_num}/#{max_attempts})")
45+
46+
if attempt_num < max_attempts
47+
Chef::Log.info("#{@log_prefix} Retrying in #{retry_delay} seconds...")
48+
sleep(retry_delay)
49+
end
50+
end
51+
52+
Chef::Log.error("#{@log_prefix} Failed to #{description} after #{max_attempts} attempts")
53+
false
54+
end
55+
end
56+
end
Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# frozen_string_literal: true
2+
3+
#
4+
# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
5+
#
6+
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
7+
# License. A copy of the License is located at
8+
#
9+
# http://aws.amazon.com/apache2.0/
10+
#
11+
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
12+
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
require 'chef/handler'
16+
require_relative 'command_runner'
17+
18+
module ErrorHandlers
19+
# Chef exception handler for cluster update failures.
20+
#
21+
# This handler is triggered when the update recipe fails. It performs recovery actions
22+
# to restore the cluster to a consistent state:
23+
# 1. Logs information about the update failure including which resources succeeded before failure
24+
# 2. Cleans up DNA files shared with compute nodes
25+
# 3. Starts clustermgtd if scontrol reconfigure succeeded
26+
#
27+
# Only runs on HeadNode - compute and login nodes skip this handler.
28+
class UpdateFailureHandler < Chef::Handler
29+
LOG_PREFIX = "#{name}:"
30+
# Must match SCONTROL_RECONFIGURE_RESOURCE_NAME in aws-parallelcluster-slurm/libraries/update.rb
31+
SCONTROL_RECONFIGURE_RESOURCE = 'reload config for running nodes'
32+
33+
def report
34+
Chef::Log.info("#{LOG_PREFIX} Started")
35+
36+
unless node_type == 'HeadNode'
37+
Chef::Log.info("#{LOG_PREFIX} Node type is #{node_type}, recovery from update failure only executes on the HeadNode")
38+
return
39+
end
40+
41+
begin
42+
write_error_report
43+
run_recovery_commands
44+
Chef::Log.info("#{LOG_PREFIX} Completed successfully")
45+
rescue => e
46+
Chef::Log.error("#{LOG_PREFIX} Failed with error: #{e.message}")
47+
Chef::Log.error("#{LOG_PREFIX} Backtrace: #{e.backtrace.join("\n")}")
48+
end
49+
end
50+
51+
def write_error_report
52+
Chef::Log.info("#{LOG_PREFIX} Update failed on #{node_type} due to: #{run_status.exception}")
53+
Chef::Log.info("#{LOG_PREFIX} Resources that have been successfully executed before the failure:")
54+
run_status.updated_resources.each do |resource|
55+
Chef::Log.info("#{LOG_PREFIX} - #{resource}")
56+
end
57+
end
58+
59+
def run_recovery_commands
60+
Chef::Log.info("#{LOG_PREFIX} Running recovery commands")
61+
62+
# Cleanup DNA files
63+
cleanup_dna_files
64+
65+
# Start clustermgtd if scontrol reconfigure succeeded
66+
Chef::Log.info("#{LOG_PREFIX} Resource '#{SCONTROL_RECONFIGURE_RESOURCE}' has execution status: #{resource_status(SCONTROL_RECONFIGURE_RESOURCE)}")
67+
if resource_succeeded?(SCONTROL_RECONFIGURE_RESOURCE)
68+
Chef::Log.info("#{LOG_PREFIX} scontrol reconfigure succeeded, starting clustermgtd")
69+
start_clustermgtd
70+
else
71+
Chef::Log.info("#{LOG_PREFIX} scontrol reconfigure did not succeed, skipping clustermgtd start")
72+
end
73+
end
74+
75+
def cleanup_dna_files
76+
command = "#{cookbook_virtualenv_path}/bin/python #{cluster_attributes['scripts_dir']}/share_compute_fleet_dna.py --region #{cluster_attributes['region']} --cleanup"
77+
command_runner.run_with_retries(command, description: "cleanup DNA files")
78+
end
79+
80+
def start_clustermgtd
81+
command = "#{cookbook_virtualenv_path}/bin/supervisorctl start clustermgtd"
82+
command_runner.run_with_retries(command, description: "start clustermgtd")
83+
end
84+
85+
def command_runner
86+
@command_runner ||= CommandRunner.new(log_prefix: LOG_PREFIX)
87+
end
88+
89+
def cluster_attributes
90+
run_status.node['cluster']
91+
end
92+
93+
def node_type
94+
cluster_attributes['node_type']
95+
end
96+
97+
def cookbook_virtualenv_path
98+
"#{cluster_attributes['system_pyenv_root']}/versions/#{cluster_attributes['python-version']}/envs/cookbook_virtualenv"
99+
end
100+
101+
def resource_succeeded?(resource_name)
102+
%i(updated up_to_date).include?(resource_status(resource_name))
103+
end
104+
105+
def resource_status(resource_name)
106+
# Use action_collection directly (inherited from Chef::Handler)
107+
action_records = action_collection.filtered_collection
108+
record = action_records.find { |r| r.new_resource.resource_name == :execute && r.new_resource.name == resource_name }
109+
record ? record.status : :not_executed
110+
end
111+
end
112+
end

cookbooks/aws-parallelcluster-entrypoints/recipes/update.rb

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,11 @@
1111
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
1212
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
1313
# limitations under the License.
14+
15+
chef_handler 'ErrorHandlers::UpdateFailureHandler' do
16+
type exception: true
17+
end
18+
1419
include_recipe "aws-parallelcluster-shared::setup_envars"
1520

1621
# Fetch and load cluster configs
Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright:: 2025 Amazon.com, Inc. and its affiliates. All Rights Reserved.
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
6+
# License. A copy of the License is located at
7+
#
8+
# http://aws.amazon.com/apache2.0/
9+
#
10+
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
11+
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
12+
# limitations under the License.
13+
14+
require_relative '../../spec_helper'
15+
require_relative '../../../libraries/command_runner'
16+
17+
describe ErrorHandlers::CommandRunner do
18+
let(:log_prefix) { 'TestPrefix:' }
19+
let(:runner) { described_class.new(log_prefix: log_prefix) }
20+
let(:command) { 'test command' }
21+
let(:description) { 'test operation' }
22+
let(:shell_out_result) { double('shell_out_result', exitstatus: 0, stdout: 'success', stderr: '') }
23+
24+
before do
25+
allow(runner).to receive(:shell_out).and_return(shell_out_result)
26+
allow(runner).to receive(:sleep)
27+
end
28+
29+
describe '#run_with_retries' do
30+
context 'when command succeeds on first attempt' do
31+
it 'returns true and does not retry' do
32+
expect(runner).to receive(:shell_out).once.and_return(shell_out_result)
33+
expect(runner).not_to receive(:sleep)
34+
expect(runner.run_with_retries(command, description: description)).to be true
35+
end
36+
37+
it 'logs stdout and stderr' do
38+
allow(Chef::Log).to receive(:info)
39+
expect(Chef::Log).to receive(:info).with(/Command stdout: success/)
40+
expect(Chef::Log).to receive(:info).with(/Command stderr:/)
41+
runner.run_with_retries(command, description: description)
42+
end
43+
44+
it 'logs success message' do
45+
allow(Chef::Log).to receive(:info)
46+
expect(Chef::Log).to receive(:info).with(/Successfully executed: test operation/)
47+
runner.run_with_retries(command, description: description)
48+
end
49+
end
50+
51+
context 'when command fails then succeeds' do
52+
let(:failed_result) { double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') }
53+
54+
it 'retries and returns true on success' do
55+
expect(runner).to receive(:shell_out).and_return(failed_result, shell_out_result)
56+
expect(runner).to receive(:sleep).with(90).once
57+
expect(runner.run_with_retries(command, description: description, retries: 1)).to be true
58+
end
59+
60+
it 'logs retry message' do
61+
allow(runner).to receive(:shell_out).and_return(failed_result, shell_out_result)
62+
allow(Chef::Log).to receive(:info)
63+
allow(Chef::Log).to receive(:warn)
64+
expect(Chef::Log).to receive(:info).with(/Retrying in 90 seconds/)
65+
runner.run_with_retries(command, description: description, retries: 1)
66+
end
67+
end
68+
69+
context 'when command fails all attempts' do
70+
let(:failed_result) { double('failed_result', exitstatus: 1, stdout: '', stderr: 'error') }
71+
72+
it 'returns false after exhausting retries' do
73+
allow(runner).to receive(:shell_out).and_return(failed_result)
74+
expect(runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0)).to be false
75+
end
76+
77+
it 'logs error after all attempts fail' do
78+
allow(runner).to receive(:shell_out).and_return(failed_result)
79+
expect(Chef::Log).to receive(:error).with(/Failed to test operation after 2 attempts/)
80+
runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0)
81+
end
82+
83+
it 'logs warning for each failed attempt' do
84+
allow(runner).to receive(:shell_out).and_return(failed_result)
85+
allow(Chef::Log).to receive(:info)
86+
allow(Chef::Log).to receive(:error)
87+
expect(Chef::Log).to receive(:warn).with(%r{Failed to test operation \(attempt 1/2\)})
88+
expect(Chef::Log).to receive(:warn).with(%r{Failed to test operation \(attempt 2/2\)})
89+
runner.run_with_retries(command, description: description, retries: 1, retry_delay: 0)
90+
end
91+
end
92+
93+
context 'with custom retry parameters' do
94+
it 'respects custom retries count' do
95+
failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error')
96+
allow(runner).to receive(:shell_out).and_return(failed_result)
97+
expect(runner).to receive(:shell_out).exactly(3).times
98+
runner.run_with_retries(command, description: description, retries: 2, retry_delay: 0)
99+
end
100+
101+
it 'respects custom retry delay' do
102+
failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error')
103+
allow(runner).to receive(:shell_out).and_return(failed_result, shell_out_result)
104+
expect(runner).to receive(:sleep).with(30).once
105+
runner.run_with_retries(command, description: description, retries: 1, retry_delay: 30)
106+
end
107+
108+
it 'respects custom timeout' do
109+
expect(runner).to receive(:shell_out).with(command, timeout: 60).and_return(shell_out_result)
110+
runner.run_with_retries(command, description: description, timeout: 60)
111+
end
112+
end
113+
114+
context 'with default parameters' do
115+
it 'uses DEFAULT_RETRIES' do
116+
failed_result = double('failed_result', exitstatus: 1, stdout: '', stderr: 'error')
117+
allow(runner).to receive(:shell_out).and_return(failed_result)
118+
expect(runner).to receive(:shell_out).exactly(11).times # 10 retries + 1 initial = 11 attempts
119+
runner.run_with_retries(command, description: description, retry_delay: 0)
120+
end
121+
122+
it 'uses DEFAULT_TIMEOUT' do
123+
expect(runner).to receive(:shell_out).with(command, timeout: 30).and_return(shell_out_result)
124+
runner.run_with_retries(command, description: description)
125+
end
126+
end
127+
end
128+
end

0 commit comments

Comments
 (0)