Add Open-R1 Example #2818


Open
wants to merge 2 commits into master

Conversation

Bihan (Collaborator) commented Jun 18, 2025

No description provided.

@Bihan requested a review from peterschmidt85 on June 18, 2025, 12:20
commands:
- uv pip install vllm==0.8.5.post1
- uv pip install setuptools
- uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
peterschmidt85 (Contributor) commented:

(Minor) I guess a simple pip install will try to build the wheel? Isn't there a way to prevent that without hardcoding the wheel URL?

Bihan (Collaborator, Author) replied:

@peterschmidt85
The flash_attn URL has a specific ABI flag (cxx11abiFALSE). The ABI flag can be TRUE or FALSE, and the torch package installed by vllm==0.8.5.post1 needs FALSE.

To check whether we need TRUE or FALSE, we can run
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"
which prints either True or False.

As far as I remember, when I did uv pip install flash_attn==2.7.4 I got an undefined symbol error:
ImportError: /root/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
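
A minimal sketch of how that check could drive the wheel choice (the wheel filename pattern is copied from the URL in the config above; the selection logic itself is an illustration, not a tested recipe):

```shell
# Ask torch which CXX11 ABI it was built with; this prints True or False
ABI=$(python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)")

# Map the Python boolean onto the ABI tag used in the flash-attn wheel filenames
if [ "$ABI" = "True" ]; then ABI_TAG="TRUE"; else ABI_TAG="FALSE"; fi

# Install the matching prebuilt wheel (the torch pulled in by vllm==0.8.5.post1 needs FALSE here)
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_TAG}-cp312-cp312-linux_x86_64.whl"
```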

trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0
else
# Training node - adjust world size and nodes count for training
GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
peterschmidt85 (Contributor) commented:

We already have the built-in DSTACK_GPUS_PER_NODE, which is calculated the same way.
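
A sketch of the simplification being suggested, assuming the rest of the script stays as in the diff above:

```shell
# Rely on dstack's built-in per-node GPU count instead of recomputing it
GPUS_PER_NODE=$DSTACK_GPUS_PER_NODE   # instead of $(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
```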

- uv pip install .
- |
# Get the last IP from DSTACK_NODES_IPS for vLLM node
VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
peterschmidt85 (Contributor) commented:

Shouldn't we move this under if [ "$USE_VLLM" = "true" ]; then?
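
For illustration, a hedged sketch of that restructuring; only the guard itself is quoted from the comment, and the surrounding script shape is assumed:

```shell
if [ "$USE_VLLM" = "true" ]; then
  # Resolve the vLLM node's address only when vLLM is actually used:
  # take the last IP from the dstack-provided node list
  VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
fi
```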

ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
# Other nodes run training
echo "Starting training with VLLM on $VLLM_HOST"
@peterschmidt85 (Contributor) commented Jun 20, 2025:

Do we need this echo? Just thinking of simplifying the configuration. Same for the echo above...

Bihan (Collaborator, Author) replied:

The echo is not necessary. We can remove it to simplify the configuration.

shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
peterschmidt85 (Contributor) commented:

(Minor) Just curious: given that vLLM may run on a random node, would checkpoint recovery just work?

Bihan (Collaborator, Author) replied:

@peterschmidt85 This is a very interesting question. I think, theoretically, it should not work if the nodes are shuffled, because the node on which vLLM is running is not recognized by accelerate: the vLLM node is not within accelerate's world size.
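
For context, a rough sketch of why the vLLM node falls outside the training world; the adjusted counts are taken from the diff above, while the launch command and its script arguments are assumptions, not the PR's exact command:

```shell
# The launcher is told about one node fewer than the cluster actually has,
# so accelerate's world covers only the training nodes; the node running
# `trl vllm-serve` holds no rank and no shard of the training state.
ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
accelerate launch \
  --num_machines $ADJUSTED_NODES_NUM \
  --num_processes $ADJUSTED_GPUS_TOTAL \
  $TRAINING_SCRIPT $TRAINING_ARGS   # hypothetical placeholders for the GRPO training entrypoint
```

If the nodes come back in a different order after a restart, a node that previously held a training rank (and its checkpoint state) may now be the vLLM node and vice versa, which is why recovery is not guaranteed to just work.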


This PR is stale because it has been open for 14 days with no activity.

@github-actions bot added the stale label on Jul 15, 2025.