
[Bug] RayCluster fails to transit Status.State to Ready when numOfHosts > 1 #3274

Closed
jyakaranda opened this issue Apr 7, 2025 · 1 comment · Fixed by #3353
Labels: 1.4.0, bug, raycluster

Comments

@jyakaranda

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When I set workerGroupSpecs.numOfHosts to a value greater than 1, such as 2, the RayCluster never transitions Status.State to Ready. As a side effect, this blocks RayJob from submitting its job, because the RayJob controller waits for the cluster to become ready:

if rayClusterInstance.Status.State != rayv1.Ready { //nolint:staticcheck // https://github.com/ray-project/kuberay/pull/2288
    logger.Info("Wait for the RayCluster.Status.State to be ready before submitting the job.", "RayCluster", rayClusterInstance.Name, "State", rayClusterInstance.Status.State) //nolint:staticcheck // https://github.com/ray-project/kuberay/pull/2288
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

This seems to be related to how DesiredWorkerReplicas is used in the readiness check, which only accounts for the single-host case:

if reconcileErr == nil && len(runtimePods.Items) == int(newInstance.Status.DesiredWorkerReplicas)+1 { // workers + 1 head
    if utils.CheckAllPodsRunning(ctx, runtimePods) {
        newInstance.Status.State = rayv1.Ready //nolint:staticcheck // https://github.com/ray-project/kuberay/pull/2288
        newInstance.Status.Reason = ""
    }
}

Related logs from the KubeRay operator:
{"level":"info","ts":"2025-04-07T05:28:27.310Z","logger":"controllers.RayCluster","msg":"inconsistentRayClusterStatus","RayCluster":{"name":"henry-ray-plan-off-raycluster-d9ht7","namespace":"ray"},"reconcileID":"65a566fd-4f94-4492-b8f6-eacfcdf3b2cd","oldReadyWorkerReplicas":4,"newReadyWorkerReplicas":5,"oldAvailableWorkerReplicas":5,"newAvailableWorkerReplicas":6,"oldDesiredWorkerReplicas":3,"newDesiredWorkerReplicas":3,"oldMinWorkerReplicas":1,"newMinWorkerReplicas":1,"oldMaxWorkerReplicas":6,"newMaxWorkerReplicas":6}

Reproduction script

Relevant part of workerGroupSpecs:

  workerGroupSpecs:
  - rayStartParams:
      {}
    replicas: 3
    minReplicas: 1
    maxReplicas: 6
    numOfHosts: 2
    groupName: workergroup

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jyakaranda added the bug and triage labels on Apr 7, 2025
@CheyuWu (Contributor) commented Apr 10, 2025

Hi @kevin85421, may I help with this?
