Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAGNUM: Cluster autoscaler support for manually removed node from node group #7776

Open
mkleku81 opened this issue Jan 28, 2025 · 1 comment
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mkleku81
Copy link

mkleku81 commented Jan 28, 2025

Description:
Magnum cluster-autoscaler in case of manually removed node from node group, can remove all nodes from the node group before the autoscaler removes the correct node.

Version:
Cluster Autoscaler: v1.27.5,v1.29.5
Cloud Provider: Magnum

Current Behavior:
Magnum cluster autoscaler fails to retrieve ID of manually removed node during resize on cleanup

I0128 10:03:19.798126       1 magnum_manager_impl.go:397] resizeOpts: node_count=4, remove=[]

Expected Behavior:
The Magnum autoscaler will be able to retrieve the ID of a manually deleted node, which in this case is a fake node with the prefix openstack:/// not fake:///.
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.27/cluster-autoscaler/cloudprovider/magnum/magnum_manager_impl.go#L379-L386

Reproduction Steps:
Create a magnum cluster (k8s ver 1.27 ,1.29 ) with cluster autoscaler (v1.27.5,v1.29.5) running with log level 5 .
Add to it node group.
Add a workload that scales a node group to, say, 4 nodes.
Remove manualy the first node in the node group(by openstack server delete ...).
The autoscaler will start cleaning up random nodes, by resizing without ID nodes to remove

kubectl logs cluster-autoscaler-9bf47d569-bsdsn -f  | grep resizeOpts
I1221 15:37:37.541377       1 magnum_manager_impl.go:397] resizeOpts: node_count=4, remove=[]
I1221 15:37:53.991168       1 magnum_manager_impl.go:397] resizeOpts: node_count=3, remove=[]
I1221 15:38:10.812047       1 magnum_manager_impl.go:397] resizeOpts: node_count=2, remove=[]
I1221 15:38:27.103804       1 magnum_manager_impl.go:397] resizeOpts: node_count=1, remove=[]

which, in a pessimistic case like the one above, may result in the clearing of all nodes in node group.

Describe the solution you'd like:
Add here
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.27/cluster-autoscaler/cloudprovider/magnum/magnum_manager_impl.go#L379-L386

support for fake nodes with the prefix openstack:///
e.g. like this

		if nodeRef.IsFake {
			if strings.HasPrefix(nodeRef.Name, "fake:///") {
			    _, index, err := parseFakeProviderID(nodeRef.Name)
			    if err != nil {
			    	return fmt.Errorf("error handling fake node: %v", err)
			    }
			    nodesToRemove = append(nodesToRemove, index)
			    continue
			} else if strings.HasPrefix(nodeRef.Name, "openstack:///") { 
			    nodeID , err := parseFakeProviderIDDeletedNode(nodeRef.Name)
			    if err != nil {
			    	return fmt.Errorf("error handling fake node: %v", err)
			    }
			    nodesToRemove = append(nodesToRemove, nodeID)
			    continue
			}
		}

where parseFakeProviderIDDeletedNode is an additional util function like this

func parseFakeProviderIDDeletedNode(id string) (string, error) {
	id2 := strings.TrimPrefix(id, "openstack:///")
	return id2, nil
}

Additional context:
After looking in the code in the main branch, I see that there this case will also occur.

@mkleku81 mkleku81 added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 28, 2025
@adrianmoisey
Copy link
Member

/area cluster-autoscaler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants