Add Open-R1 Example #2818


Open
wants to merge 2 commits into master

Conversation

Bihan (Collaborator) commented Jun 18, 2025

No description provided.

@Bihan requested a review from peterschmidt85 on June 18, 2025, 12:20
commands:
- uv pip install vllm==0.8.5.post1
- uv pip install setuptools
- uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
peterschmidt85 (Contributor) commented:

(Minor) I guess a simple pip install will try to build the wheel? Isn't there a way to prevent that without hardcoding the wheel URL?

Bihan (Collaborator, Author) replied:

@peterschmidt85
The flash_attn URL has a specific ABI flag (cxx11abiFALSE). The ABI flag can be TRUE or FALSE, and the torch package installed by vllm==0.8.5.post1 needs FALSE.

To check whether we need TRUE or FALSE, we can run
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"
which prints either True or False.

As far as I remember, when I did uv pip install flash_attn==2.7.4 I got an undefined symbol error:
ImportError: /root/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
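
A minimal sketch of how that check could drive the wheel choice (the wheel filename pattern is copied from the URL in the config above; the selection logic itself is an illustration, not a tested recipe):

```shell
# Ask torch which CXX11 ABI it was built with; this prints True or False
ABI=$(python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)")

# Map the Python boolean onto the ABI tag used in the flash-attn wheel filenames
if [ "$ABI" = "True" ]; then ABI_TAG="TRUE"; else ABI_TAG="FALSE"; fi

# Install the matching prebuilt wheel (the torch pulled in by vllm==0.8.5.post1 needs FALSE here)
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_TAG}-cp312-cp312-linux_x86_64.whl"
```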

trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0
else
# Training node - adjust world size and nodes count for training
GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
peterschmidt85 (Contributor) commented:

We already have the built-in DSTACK_GPUS_PER_NODE, which is calculated the same way.
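
A sketch of the simplification being suggested, assuming the rest of the script stays as in the diff above:

```shell
# Rely on dstack's built-in per-node GPU count instead of recomputing it
GPUS_PER_NODE=$DSTACK_GPUS_PER_NODE   # instead of $(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
```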

- uv pip install .
- |
# Get the last IP from DSTACK_NODES_IPS for vLLM node
VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
peterschmidt85 (Contributor) commented:

Shouldn't we move this under if [ "$USE_VLLM" = "true" ]; then?
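
For illustration, a hedged sketch of that restructuring; only the guard itself is quoted from the comment, and the surrounding script shape is assumed:

```shell
if [ "$USE_VLLM" = "true" ]; then
  # Resolve the vLLM node's address only when vLLM is actually used:
  # take the last IP from the dstack-provided node list
  VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
fi
```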

ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
# Other nodes run training
echo "Starting training with VLLM on $VLLM_HOST"
@peterschmidt85 (Contributor) commented Jun 20, 2025:

Do we need this echo? Just thinking of simplifying the configuration. Same for the echo above...

Bihan (Collaborator, Author) replied:

The echo is not necessary. We can remove it to simplify the configuration.

shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
peterschmidt85 (Contributor) commented:

(Minor) Just curious: given that vLLM may run on a random node, would checkpoint recovery just work?

Bihan (Collaborator, Author) replied:

@peterschmidt85 This is a very interesting question. I think, theoretically, it should not work if the nodes are shuffled, because the node on which vLLM is running is not recognized by accelerate: the vLLM node is not within accelerate's world size.
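
For context, a rough sketch of why the vLLM node falls outside the training world; the adjusted counts are taken from the diff above, while the launch command and its script arguments are assumptions, not the PR's exact command:

```shell
# The launcher is told about one node fewer than the cluster actually has,
# so accelerate's world covers only the training nodes; the node running
# `trl vllm-serve` holds no rank and no shard of the training state.
ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
accelerate launch \
  --num_machines $ADJUSTED_NODES_NUM \
  --num_processes $ADJUSTED_GPUS_TOTAL \
  $TRAINING_SCRIPT $TRAINING_ARGS   # hypothetical placeholders for the GRPO training entrypoint
```

If the nodes come back in a different order after a restart, a node that previously held a training rank (and its checkpoint state) may now be the vLLM node and vice versa, which is why recovery is not guaranteed to just work.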


This PR is stale because it has been open for 14 days with no activity.

@github-actions bot added the stale label on Jul 15, 2025.