cherry-pick: Set unhealthy static nodes to down and reset node address
Set unhealthy static nodes to down and reset their node address, so that static nodes that hit an ICE (insufficient capacity error) after a bootstrap failure are no longer treated as bootstrap failure nodes.
Here is the bootstrap failure check that a static node in replacement wrongly matches when its launch fails with an ICE:
    def is_bootstrap_failure(self):
        """Check if a slurm node has boostrap failure."""
        if self.is_static_nodes_in_replacement and not self.is_backing_instance_valid(log_warn_if_unhealthy=False):
            # Node is currently in replacement and no backing instance
            logger.warning(
                "Node bootstrap error: Node %s is currently in replacement and no backing instance, node state %s:",
                self,
                self.state_string,
            )
Behavior before the change:
When an unhealthy static node is detected, it is set to down. In the same iteration, a run_instance call is made to launch a new instance for the node; if the call succeeds, the node address is changed to the new instance's address.
If the run_instance call fails, the node address is left unchanged and the node is treated as a bootstrap failure node.
Behavior after the change:
When an unhealthy static node is detected, it is set to down and its node address is reset. If the run_instance call succeeds, the node is set to the new address. If the run_instance call fails, the node address stays as node_name, and the node is not treated as a bootstrap failure.
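The difference between the two behaviors can be sketched with a small, self-contained model. This is only an illustration: `StaticNode`, `replace_unhealthy`, `looks_like_bootstrap_failure`, and the `launch` callable are hypothetical names, not the project's classes, and the sketch assumes that a node whose address equals its name counts as having no address set, so the backing-instance check has nothing to fail on.

```python
from dataclasses import dataclass


@dataclass
class StaticNode:
    """Hypothetical stand-in for a Slurm static node (not the real class)."""
    name: str
    nodeaddr: str
    state: str = "IDLE"
    in_replacement: bool = False


def replace_unhealthy(node, launch, reset_address):
    """Set the node down, optionally reset its address, then try to launch.

    `launch` is a hypothetical stand-in for the run_instance call: it
    returns the new instance address, or None when the launch fails.
    """
    node.state = "DOWN"
    node.in_replacement = True
    if reset_address:
        # Behavior after the change: address falls back to the node name.
        node.nodeaddr = node.name
    new_addr = launch(node.name)
    if new_addr is not None:
        node.nodeaddr = new_addr
    return node


def looks_like_bootstrap_failure(node, live_addresses):
    """Simplified model of the check quoted above.

    Assumption: the backing instance is considered valid when no address
    is set (address == name) or when the address belongs to a live instance.
    """
    addr_is_set = node.nodeaddr != node.name
    backing_valid = (not addr_is_set) or node.nodeaddr in live_addresses
    return node.in_replacement and not backing_valid


def failed_launch(_name):
    return None  # models a run_instance call that fails, e.g. with an ICE


# Before the change: the stale address survives the failed launch.
old = replace_unhealthy(StaticNode("static-1", "10.0.0.5"), failed_launch, False)
# After the change: the address was reset to the node name first.
new = replace_unhealthy(StaticNode("static-1", "10.0.0.5"), failed_launch, True)

print(looks_like_bootstrap_failure(old, live_addresses=set()))
print(looks_like_bootstrap_failure(new, live_addresses=set()))
```

In this model the only thing that changes between the two runs is whether the address is reset before the launch attempt, which is what keeps a failed launch from being mistaken for a bootstrap failure.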
Signed-off-by: chenwany <[email protected]>