Add Open-R1 Example #2818
Conversation
```yaml
commands:
  - uv pip install vllm==0.8.5.post1
  - uv pip install setuptools
  - uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```
(Minor) I guess a simple pip install will try to build the wheel? Isn't there a way to prevent that without hardcoding the wheel URL?
@peterschmidt85 The flash_attn URL has a specific ABI flag (cxx11abiFALSE). The ABI flag can be TRUE or FALSE, and the torch package installed by vllm==0.8.5.post1 needs FALSE. To check which one we need, run

```shell
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"
```

which prints either True or False. As far as I remember, when I did uv pip install flash_attn==2.7.4 I got an undefined symbol error:

```
ImportError: /root/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```
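To make the ABI dependency concrete, here is a small sketch that builds the expected wheel filename from the ABI check result. The naming pattern is inferred from the release URL in the diff above; treat it as an assumption about flash-attention's release convention, not a guaranteed API.

```python
# Sketch: construct the flash-attn wheel filename matching the local torch ABI.
# Pattern inferred from the wheel URL above (an assumption, not a documented API).

def flash_attn_wheel_name(version: str, cuda: str, torch_ver: str,
                          py_tag: str, cxx11_abi: bool) -> str:
    # torch._C._GLIBCXX_USE_CXX11_ABI maps directly onto the TRUE/FALSE marker
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (f"flash_attn-{version}+cu{cuda}torch{torch_ver}"
            f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl")

# With torch reporting _GLIBCXX_USE_CXX11_ABI == False (as under vllm==0.8.5.post1):
print(flash_attn_wheel_name("2.7.4.post1", "12", "2.6", "cp312", False))
# → flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```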
```shell
  trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0
else
  # Training node - adjust world size and nodes count for training
  GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
```
We already have the built-in DSTACK_GPUS_PER_NODE, which is calculated the same way.
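To illustrate the suggestion, a quick sketch with made-up example values (the variable values are hypothetical; dstack exports the real ones at runtime): the manual division should always agree with the built-in variable.

```shell
# Hypothetical cluster: 8 GPUs across 2 nodes.
DSTACK_GPUS_NUM=8
DSTACK_NODES_NUM=2
DSTACK_GPUS_PER_NODE=4   # what dstack would export for this cluster

# The manual computation from the diff above...
GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))

# ...matches the built-in, so the division can be dropped.
[ "$GPUS_PER_NODE" -eq "$DSTACK_GPUS_PER_NODE" ] && echo "match"
```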
```yaml
  - uv pip install .
  - |
    # Get the last IP from DSTACK_NODES_IPS for vLLM node
    VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
```
Shouldn't we move this under `if [ "$USE_VLLM" = "true" ]; then`?
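For clarity, a sketch of the suggested restructuring, with hypothetical example values for the dstack-provided variables: VLLM_HOST is only computed when the vLLM server is actually used.

```shell
# Example values; dstack provides the real ones at runtime.
DSTACK_NODES_IPS="10.0.0.1 10.0.0.2 10.0.0.3"
USE_VLLM=true

if [ "$USE_VLLM" = "true" ]; then
  # The last IP in the list is reserved for the vLLM node
  VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
fi
echo "$VLLM_HOST"   # prints 10.0.0.3
```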
```shell
  ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
  ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
  # Other nodes run training
  echo "Starting training with VLLM on $VLLM_HOST"
```
Do we need this echo? Just thinking of simplifying the configuration. Same for the echo above...
The echo is not necessary. We can remove it to simplify the configuration.
```yaml
shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```
(Minor) Just curious: given that vLLM may run on a random node, would checkpoint recovery just work?
@peterschmidt85 This is a very interesting question. I think it theoretically should not work if nodes are shuffled, because the node on which vLLM is running is not recognized by accelerate; the vLLM node is not within accelerate's world size.
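To make the topology concrete, a small sketch of the arrangement the example assumes (node names and GPU count are hypothetical): the last node serves vLLM and sits outside accelerate's world, so the training world size excludes its GPUs.

```python
# Hypothetical 3-node cluster with 4 GPUs per node.
nodes = ["node-0", "node-1", "node-2"]
gpus_per_node = 4

vllm_node = nodes[-1]        # last IP runs `trl vllm-serve`
training_nodes = nodes[:-1]  # only these join accelerate's world

# accelerate's world size counts training GPUs only; the vLLM node's
# GPUs are invisible to it, which is why shuffling nodes across a
# restart could break checkpoint recovery.
world_size = len(training_nodes) * gpus_per_node

print(vllm_node, world_size)  # → node-2 8
```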
This PR is stale because it has been open for 14 days with no activity.