
Commit 76165d3

[UpdateWorkflow] Ensure clustermgtd runs after cluster update and fix race condition making compute nodes deploy the wrong cluster config version on update failure.

Ensure clustermgtd is running after an update completes, regardless of whether the update succeeded or failed.

On success, restart clustermgtd unconditionally at the end of the update recipe, regardless of whether the update includes queue changes.

On failure on the head node, execute recovery actions:
- Clean up DNA files shared with compute nodes to prevent them from deploying a config version that is about to be rolled back
- Restart clustermgtd if scontrol reconfigure succeeded, ensuring cluster management resumes after update/rollback failures
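The behaviour described in the commit message can be summarised as a small decision function. The following is a minimal sketch in plain Ruby, not the cookbook's code: the method name `head_node_post_update_actions` and the action symbols are hypothetical, and the `:updated` / `:not_executed` status values are borrowed from the handler added in this commit.

```ruby
# Hypothetical helper (not part of the commit): returns the ordered list of
# actions the head node is expected to take once an update run has finished.
def head_node_post_update_actions(update_succeeded:, reconfigure_status:)
  # On success clustermgtd is restarted unconditionally,
  # whether or not the update touched the queues.
  return [:restart_clustermgtd] if update_succeeded

  # On failure the shared DNA files are always cleaned up, so that compute
  # nodes do not deploy a config version that is about to be rolled back.
  actions = [:cleanup_dna_files]

  # clustermgtd is only started again if scontrol reconfigure went through.
  actions << :start_clustermgtd if reconfigure_status == :updated
  actions
end
```

For example, a failed update in which scontrol reconfigure never ran yields only `[:cleanup_dna_files]`, while a failure after a successful reconfigure also brings clustermgtd back.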
1 parent 48044c2 commit 76165d3

8 files changed: +443 -2 lines changed


CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -9,6 +9,9 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 3.14.1
 ------
 
+**ENHANCEMENTS**
+- Ensure clustermgtd runs after cluster update. On success, restart unconditionally. On failure, restart if the queue reconfiguration succeeded.
+
 **CHANGES**
 - Add chef attribute `cluster/in_place_update_on_fleet_enabled` to disable in-place updates on compute and login nodes
 and achieve better performance at scale.
@@ -27,6 +30,9 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Rdma-core: rdma-core-59.0-1
 - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.8-11
 
+**BUG FIXES**
+- Fix race condition where compute nodes could deploy the wrong cluster config version after an update failure.
+
 3.14.0
 ------
 

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+# frozen_string_literal: true
+
+#
+# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
+# License. A copy of the License is located at
+#
+# http://aws.amazon.com/apache2.0/
+#
+# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
+# limitations under the License.
+
+require 'chef/handler'
+
+module UpdateChefError
+  # Chef exception handler for cluster update failures.
+  #
+  # This handler is triggered when the update recipe fails. It performs recovery actions
+  # to restore the cluster to a consistent state:
+  # 1. Logs information about the update failure including which resources succeeded before failure
+  # 2. Cleans up DNA files shared with compute nodes
+  # 3. Starts clustermgtd if scontrol reconfigure succeeded
+  #
+  # Only runs on HeadNode - compute and login nodes skip this handler.
+  class UpdateFailureHandler < Chef::Handler
+    LOG_PREFIX = 'UpdateFailureHandler:'
+    # Must match SCONTROL_RECONFIGURE_RESOURCE_NAME in aws-parallelcluster-slurm/libraries/update.rb
+    SCONTROL_RECONFIGURE_RESOURCE = 'reload config for running nodes'
+
+    # Retry configuration for recovery commands
+    DEFAULT_RETRIES = 10
+    DEFAULT_RETRY_DELAY = 90
+    DEFAULT_TIMEOUT = 30
+
+    def report
+      extend Chef::Mixin::ShellOut
+
+      Chef::Log.info("#{LOG_PREFIX} Started")
+
+      unless node_type == 'HeadNode'
+        Chef::Log.info("#{LOG_PREFIX} Node type is #{node_type}, recovery from update failure only executes on the HeadNode")
+        return
+      end
+
+      begin
+        write_error_report
+        run_recovery_commands
+        Chef::Log.info("#{LOG_PREFIX} Completed successfully")
+      rescue => e
+        Chef::Log.error("#{LOG_PREFIX} Failed with error: #{e.message}")
+        Chef::Log.error("#{LOG_PREFIX} Backtrace: #{e.backtrace.join("\n")}")
+      end
+    end
+
+    def write_error_report
+      Chef::Log.info("#{LOG_PREFIX} Update failed on #{node_type} due to: #{run_status.exception}")
+      Chef::Log.info("#{LOG_PREFIX} Resources that have been successfully executed before the failure:")
+      run_status.updated_resources.each do |resource|
+        Chef::Log.info("#{LOG_PREFIX} - #{resource}")
+      end
+      Chef::Log.info("#{LOG_PREFIX} Resource '#{SCONTROL_RECONFIGURE_RESOURCE}' has execution status: #{slurm_reconfigure_status}")
+    end
+
+    def run_recovery_commands
+      Chef::Log.info("#{LOG_PREFIX} Running recovery commands")
+
+      # Cleanup DNA files
+      cleanup_dna_files
+
+      # Start clustermgtd if scontrol reconfigure succeeded
+      if scontrol_reconfigure_succeeded?
+        Chef::Log.info("#{LOG_PREFIX} scontrol reconfigure succeeded, starting clustermgtd")
+        start_clustermgtd
+      else
+        Chef::Log.info("#{LOG_PREFIX} scontrol reconfigure did not succeed, skipping clustermgtd start")
+      end
+    end
+
+    def cleanup_dna_files
+      command = "#{cookbook_virtualenv_path}/bin/python #{cluster_attributes['scripts_dir']}/share_compute_fleet_dna.py --region #{cluster_attributes['region']} --cleanup"
+      run_command_with_retries(command, description: "cleanup DNA files")
+    end
+
+    def start_clustermgtd
+      command = "#{cookbook_virtualenv_path}/bin/supervisorctl start clustermgtd"
+      run_command_with_retries(command, description: "start clustermgtd")
+    end
+
+    def cluster_attributes
+      run_status.node['cluster']
+    end
+
+    def node_type
+      cluster_attributes['node_type']
+    end
+
+    def cookbook_virtualenv_path
+      "#{cluster_attributes['system_pyenv_root']}/versions/#{cluster_attributes['python-version']}/envs/cookbook_virtualenv"
+    end
+
+    def scontrol_reconfigure_succeeded?
+      slurm_reconfigure_status == :updated
+    end
+
+    def slurm_reconfigure_status
+      reload_record = find_scontrol_reconfigure_record
+      if reload_record
+        reload_record.status
+      else
+        :not_executed
+      end
+    end
+
+    def find_scontrol_reconfigure_record
+      # Use action_collection directly (inherited from Chef::Handler)
+      action_records = action_collection.filtered_collection
+      action_records.find { |r| r.new_resource.resource_name == :execute && r.new_resource.name == SCONTROL_RECONFIGURE_RESOURCE }
+    end
+
+    def run_command_with_retries(command, description:, retries: DEFAULT_RETRIES, retry_delay: DEFAULT_RETRY_DELAY, timeout: DEFAULT_TIMEOUT)
+      Chef::Log.info("#{LOG_PREFIX} Executing: #{description}")
+      max_attempts = retries + 1
+
+      max_attempts.times do |attempt|
+        attempt_num = attempt + 1
+        Chef::Log.info("#{LOG_PREFIX} Running command (attempt #{attempt_num}/#{max_attempts}): #{command}")
+        result = shell_out(command, timeout: timeout)
+        Chef::Log.info("#{LOG_PREFIX} Command stdout: #{result.stdout}")
+        Chef::Log.info("#{LOG_PREFIX} Command stderr: #{result.stderr}")
+
+        if result.exitstatus == 0
+          Chef::Log.info("#{LOG_PREFIX} Successfully executed: #{description}")
+          return true
+        end
+
+        Chef::Log.warn("#{LOG_PREFIX} Failed to #{description} (attempt #{attempt_num}/#{max_attempts})")
+
+        if attempt_num < max_attempts
+          Chef::Log.info("#{LOG_PREFIX} Retrying in #{retry_delay} seconds...")
+          sleep(retry_delay)
+        end
+      end
+
+      Chef::Log.error("#{LOG_PREFIX} Failed to #{description} after #{max_attempts} attempts")
+      false
+    end
+  end
+end
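The retry loop in `run_command_with_retries` can be isolated from Chef to make its timing easier to reason about. The sketch below is an illustration under stated assumptions, not the handler's code: `run_with_retries` and `worst_case_seconds` are hypothetical names, the command runner and the sleep are injected as lambdas so no real command or delay is needed, and the time bound assumes every attempt takes at most the configured timeout.

```ruby
# Hypothetical stand-in for the handler's retry loop. `runner` is any callable
# returning true on success; `sleeper` receives the delay in seconds.
def run_with_retries(retries:, retry_delay:, runner:, sleeper: ->(seconds) { sleep(seconds) })
  max_attempts = retries + 1 # the first attempt plus `retries` retries

  max_attempts.times do |attempt|
    return true if runner.call

    # Sleep only between attempts, never after the last one.
    sleeper.call(retry_delay) if attempt + 1 < max_attempts
  end

  false
end

# Upper bound on the time spent on one recovery command, assuming each
# attempt is capped by `timeout` and every attempt fails.
def worst_case_seconds(retries:, retry_delay:, timeout:)
  attempts = retries + 1
  attempts * timeout + retries * retry_delay
end

# With the defaults from the diff (10 retries, 90 s delay, 30 s timeout):
# 11 * 30 + 10 * 90 = 1230 seconds, about 20.5 minutes per command.
puts worst_case_seconds(retries: 10, retry_delay: 90, timeout: 30)
```

Because the handler runs up to two such commands in sequence (the DNA cleanup, then the clustermgtd start), a fully failing recovery could take roughly twice that bound under the same assumptions.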

cookbooks/aws-parallelcluster-entrypoints/recipes/update.rb

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,12 @@
 # or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
 # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
 # limitations under the License.
+
+chef_handler 'UpdateChefError::UpdateFailureHandler' do
+  type exception: true
+  action :enable
+end
+
 include_recipe "aws-parallelcluster-shared::setup_envars"
 
 # Fetch and load cluster configs
