
Operator scale-up loop with recreate strategy #1105

Open
ddellarocca opened this issue Jan 30, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@ddellarocca

Describe the bug
During a deployment with the update strategy set to recreate in a 28-node cluster, the operator began increasing the number of replicas on the old StatefulSet instead of deleting it.
The update had a dual purpose: scaling up and changing a few settings.
We first scaled up (from 9:00:58 to 9:01:43), then changed the configuration around 9:05, which triggered the creation of a new StatefulSet (the old one being emqx-core-7c6fbb448d and the new one emqx-core-778c694444).
However, once the new StatefulSet was ready and its nodes had joined the cluster, the operator started increasing the replicas of the old StatefulSet (API server audit log events are attached in logs-insights-results.csv): kubectl get statefulset emqx-core-7c6fbb448d showed 0 ready replicas while the requested replicas kept increasing.

As a workaround, we scaled the operator down to 0 replicas and then scaled the old StatefulSet to 0.

To Reproduce
We did not manage to reproduce it in non-production environments.

Expected behavior
Instead of increasing the number of instances, the operator should delete the old StatefulSet.

Anything else we need to know?:
Looking at the code, we think the way the operator calculates the number of replicas during scale-down is wrong. Here is the part where it calculates the number of replicas:

currentSts.Spec.Replicas = ptr.To(int32(instance.Status.CoreNodesStatus.CurrentReplicas - 1))
That number comes from the status instead of the replicas field:
// check emqx node status
coreNodes, replNodes, err := u.getEMQXNodes(ctx, instance, r)
if err != nil {
    u.EventRecorder.Event(instance, corev1.EventTypeWarning, "FailedToGetNodeStatuses", err.Error())
}
instance.Status.CoreNodes = coreNodes
instance.Status.CoreNodesStatus.ReadyReplicas = 0
instance.Status.CoreNodesStatus.CurrentReplicas = 0
instance.Status.CoreNodesStatus.UpdateReplicas = 0
for _, node := range coreNodes {
    if node.NodeStatus == "running" {
        instance.Status.CoreNodesStatus.ReadyReplicas++
    }
    if currentSts != nil && node.ControllerUID == currentSts.UID {
        instance.Status.CoreNodesStatus.CurrentReplicas++
    }
    if updateSts != nil && node.ControllerUID == updateSts.UID {
        instance.Status.CoreNodesStatus.UpdateReplicas++
    }
}

The number of running replicas is the sum across the two StatefulSets, which causes a scale-up instead of a scale-down.
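To make the theory above concrete, here is a minimal, self-contained Go sketch (with simplified names, not the operator's actual types) of how deriving the scale-down target from a node count that spans both StatefulSets would push the old StatefulSet's replicas up instead of down:

```go
package main

import "fmt"

// scaleDownTarget mirrors the quoted operator line:
//   currentSts.Spec.Replicas = ptr.To(int32(CurrentReplicas - 1))
// The target for the old StatefulSet is always "observed count minus one".
func scaleDownTarget(currentReplicas int32) int32 {
	return currentReplicas - 1
}

func main() {
	oldSpecReplicas := int32(5) // hypothetical old StatefulSet spec

	// Correct accounting: only the old StatefulSet's 5 nodes are counted,
	// so the new target (4) is below the current spec and we scale down.
	fmt.Println(scaleDownTarget(5) < oldSpecReplicas)

	// Buggy accounting (the hypothesis): nodes of both StatefulSets are
	// counted (5 old + 5 new), so the new target (9) exceeds the current
	// spec and each reconcile scales the old StatefulSet further up.
	fmt.Println(scaleDownTarget(5+5) > oldSpecReplicas)
}
```

Under the buggy accounting, every reconcile observes more running nodes and writes an even larger replica count, which would explain the gradual climb seen in the audit log.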

Environment details:

  • Kubernetes version: 1.25
  • Cloud-provider/provisioner: AWS
  • emqx-operator version: 2.2.23
  • Install method: e.g. helm
@ddellarocca ddellarocca added the bug Something isn't working label Jan 30, 2025
@Rory-Z
Member

Rory-Z commented Jan 31, 2025

Hi @ddellarocca, thanks for the feedback. I'm on Chinese New Year holidays; I plan to check this after 5th Feb.

@ddellarocca
Author

ddellarocca commented Feb 1, 2025

Hi @Rory-Z, no worries. We encountered it in production, but we still managed to do what we needed, so it's OK to wait.

@Rory-Z
Member

Rory-Z commented Feb 6, 2025

Hi @ddellarocca, I'm sorry, I couldn't reproduce this issue in my local cluster. In

// check emqx node status
coreNodes, replNodes, err := u.getEMQXNodes(ctx, instance, r)
if err != nil {
    u.EventRecorder.Event(instance, corev1.EventTypeWarning, "FailedToGetNodeStatuses", err.Error())
}
instance.Status.CoreNodes = coreNodes
instance.Status.CoreNodesStatus.ReadyReplicas = 0
instance.Status.CoreNodesStatus.CurrentReplicas = 0
instance.Status.CoreNodesStatus.UpdateReplicas = 0
for _, node := range coreNodes {
    if node.NodeStatus == "running" {
        instance.Status.CoreNodesStatus.ReadyReplicas++
    }
    if currentSts != nil && node.ControllerUID == currentSts.UID {
        instance.Status.CoreNodesStatus.CurrentReplicas++
    }
    if updateSts != nil && node.ControllerUID == updateSts.UID {
        instance.Status.CoreNodesStatus.UpdateReplicas++
    }
}
the number of running replicas is not the sum of the two StatefulSets: the loop checks whether each node's controller UID matches the StatefulSet's UID, so I think CurrentReplicas is right. Also, checking your log file, the maximum number of emqx-core-7c6fbb448d replicas is 93, which is more than the sum of the two StatefulSets.
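The UID filtering can be illustrated with a small, self-contained Go model of the counting loop above (simplified types, hypothetical UIDs): only CurrentReplicas and UpdateReplicas are partitioned by controller UID, while ReadyReplicas counts running nodes across both StatefulSets.

```go
package main

import "fmt"

// node is a simplified stand-in for the operator's EMQX node status.
type node struct {
	status        string
	controllerUID string
}

// countReplicas mirrors the quoted loop: ready counts all running nodes,
// current and update only count nodes whose controller UID matches.
func countReplicas(nodes []node, currentUID, updateUID string) (ready, current, update int) {
	for _, n := range nodes {
		if n.status == "running" {
			ready++
		}
		if n.controllerUID == currentUID {
			current++
		}
		if n.controllerUID == updateUID {
			update++
		}
	}
	return
}

func main() {
	nodes := []node{
		{"running", "old-sts-uid"},
		{"running", "old-sts-uid"},
		{"running", "new-sts-uid"},
	}
	ready, current, update := countReplicas(nodes, "old-sts-uid", "new-sts-uid")
	fmt.Println(ready, current, update) // 3 2 1
}
```

Here CurrentReplicas (2) is not the total running count (3), because the node owned by the new StatefulSet fails the UID match.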

Could you please run kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields and check managedFields to make sure no other controller changed the replicas of this StatefulSet?

@ddellarocca
Author

Hi @Rory-Z, thanks for the check. On my side, I am not able to find the managedFields with kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields. I can only find the apps.emqx.io/managed-by: emqx-operator attribute.

Regarding the 93 replicas: yes, it is more than the sum, but if you look at the log file, it got there gradually. I forgot to mention that those API server calls were filtered by the service account used by the operator, so it was in fact the operator that scaled up the cluster.

Is there anything that I could check?

@Rory-Z
Member

Rory-Z commented Feb 7, 2025

I have no idea. I checked the code again, and there is no other code that changes the replicas of the StatefulSet except

// check emqx node status
coreNodes, replNodes, err := u.getEMQXNodes(ctx, instance, r)
if err != nil {
    u.EventRecorder.Event(instance, corev1.EventTypeWarning, "FailedToGetNodeStatuses", err.Error())
}
instance.Status.CoreNodes = coreNodes
instance.Status.CoreNodesStatus.ReadyReplicas = 0
instance.Status.CoreNodesStatus.CurrentReplicas = 0
instance.Status.CoreNodesStatus.UpdateReplicas = 0
for _, node := range coreNodes {
    if node.NodeStatus == "running" {
        instance.Status.CoreNodesStatus.ReadyReplicas++
    }
    if currentSts != nil && node.ControllerUID == currentSts.UID {
        instance.Status.CoreNodesStatus.CurrentReplicas++
    }
    if updateSts != nil && node.ControllerUID == updateSts.UID {
        instance.Status.CoreNodesStatus.UpdateReplicas++
    }
}
so I created PR #1108. Would you mind helping me test this? I can put out a beta release, 2.2.28-beta.1.

@ddellarocca
Author

I could try to test it next week, but not in the cluster where we first faced the issue. Would that work?

@Rory-Z
Copy link
Member

Rory-Z commented Feb 11, 2025

I could try to test it next week, but not in the cluster where we first faced the issue. Would that work?

Thanks a lot. Please try https://github.com/emqx/emqx-operator/releases/tag/2.2.29-beta.1
