
Bug when using multiple GPUs on Ray #7084

Open
1 task done
oasis-0927 opened this issue Feb 26, 2025 · 1 comment
Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@oasis-0927

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I'm trying to fine-tune an LLM using Ray, and since the SFT dataset is large, I tried to allocate 4 GPUs to one worker.

```yaml
### ray
ray_run_name: ray_run_name
ray_num_workers: 1  # number of GPUs to use
resources_per_worker:
  GPU: 4
placement_strategy: PACK
```
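As a possible workaround (an assumption on my part, not something confirmed by the maintainers), Ray data-parallel training is commonly configured with one GPU per worker and the worker count matching the number of GPUs, rather than packing several GPUs into a single worker:

```yaml
### ray (hypothetical alternative: 4 workers x 1 GPU instead of 1 worker x 4 GPUs)
ray_run_name: ray_run_name
ray_num_workers: 4          # one worker per GPU
resources_per_worker:
  GPU: 1
placement_strategy: PACK
```

With this layout each Ray worker owns exactly one device, which avoids a single process having to juggle tensors across cuda:0–cuda:3.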

Reproduction

Command:

```shell
USE_RAY=1 CUDA_VISIBLE_DEVICES=4,5,6,7 PYTHONPATH=./ llamafactory-cli train conf/safety/sft.yaml
```

This command raises the following error:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

Others

No response

@oasis-0927 oasis-0927 added bug Something isn't working pending This problem is yet to be addressed labels Feb 26, 2025
@oasis-0927 (Author)

Everything works fine after I removed USE_RAY=1 and switched to DeepSpeed ZeRO-3.
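For reference, the working non-Ray setup can be sketched as follows (the `deepspeed` path is an assumption based on the example configs shipped with LLaMA-Factory; the actual path may differ):

```yaml
### deepspeed (run without USE_RAY=1)
deepspeed: examples/deepspeed/ds_z3_config.json  # hypothetical path to a ZeRO-3 config
```

Launched as plain `llamafactory-cli train conf/safety/sft.yaml`, DeepSpeed handles device placement per rank, which is why the device-mismatch error does not occur here.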
