This is a fork of SkyRL for the OpenThoughts-Agent project.
We will soon merge the changes to the main SkyRL branch.
Until then, the steps below run SkyRL+Harbor to reproduce the RL training of our first release:
- Start from open-thoughts/OpenThinker-Agent-v1-SFT as the base model
- Train with GRPO on the open-thoughts/OpenThoughts-Agent-v1-RL dataset
- Evaluate on open-thoughts/OpenThoughts-TB-dev
- Obtain the final model, open-thoughts/OpenThinker-Agent-v1
Install SkyRL
conda create -n otagent python=3.12
conda activate otagent
pip install --index-url https://download.pytorch.org/whl/cu128 torch==2.7.1 torchvision
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
git clone https://github.com/mlfoundations/SkyRL
cd SkyRL/skyrl-train/
pip install -e .
pip install "vllm==0.10.1.1"
cd ../..
Install Harbor
git clone https://github.com/CharlieFRuan/harbor
cd harbor
git checkout 112425-terminus2-messages
pip install -e .
Remaining dependencies
pip install fastapi uvicorn
We will soon make the setup installable with uv sync.
conda activate otagent
# Download the eval dataset (OTTB-dev)
hf download open-thoughts/OpenThoughts-TB-dev --repo-type=dataset
# Download the train dataset
hf download open-thoughts/OpenThoughts-Agent-v1-RL --repo-type=dataset
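The snapshot folder is named after a commit hash, so it can be looked up instead of typed by hand. A minimal sketch, assuming the standard Hugging Face cache layout (`hub/datasets--ORG--NAME/snapshots/HASH` under `$HF_HOME`, which defaults to `~/.cache/huggingface`). `latest_snapshot_dir` is a hypothetical helper for illustration, not part of SkyRL, Harbor, or the `hf` CLI.

```shell
# Hypothetical helper: print the most recently modified snapshot directory
# of a dataset repo in the Hugging Face cache.
#   $1: repo id, e.g. open-thoughts/OpenThoughts-Agent-v1-RL
#   $2: cache root (optional; falls back to $HF_HOME, then ~/.cache/huggingface)
# The body runs in a subshell "( ... )" so its variables do not leak.
latest_snapshot_dir() (
    repo_id="$1"
    cache_root="${2:-${HF_HOME:-$HOME/.cache/huggingface}}"
    # The cache stores "org/name" as "datasets--org--name".
    repo_dir="$cache_root/hub/datasets--$(printf '%s' "$repo_id" | sed 's|/|--|g')"
    # ls -td lists directories newest first; keep only the first one.
    ls -td "$repo_dir"/snapshots/*/ 2>/dev/null | head -n 1
)
```

With this defined, `cd "$(latest_snapshot_dir open-thoughts/OpenThoughts-Agent-v1-RL)"` would stand in for the manual `cd` into the snapshot folder. If the cache lives elsewhere (for example when `HF_HUB_CACHE` is set), pass the cache root explicitly as the second argument.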
# cd into the downloaded snapshot folder (replace /path/to and hash_code with your own values)
cd /path/to/.cache/huggingface/hub/datasets--open-thoughts--OpenThoughts-Agent-v1-RL/snapshots/hash_code
python extract_parquet_tasks.py tasks_new.parquet ./extracted_tasks
Then configure the paths and API keys at the top of run_otagent.sh, and run:
cd SkyRL/skyrl-train
bash run_otagent.sh
The script is designed to run on a single node with 8 GPUs. If your setup differs, modify these settings accordingly:
trainer.placement.policy_num_nodes=1 \
trainer.placement.ref_num_nodes=1 \
trainer.placement.policy_num_gpus_per_node=8 \
trainer.placement.ref_num_gpus_per_node=8 \
generator.num_inference_engines=8 \
generator.inference_engine_tensor_parallel_size=1 \
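To adapt the values above to a different machine, the arithmetic can be sketched as a small shell function. This is an illustration under one assumption: that the number of inference engines should equal the total GPU count divided by the tensor parallel size, which matches the defaults above (1 node × 8 GPUs / 1 = 8 engines). `placement_overrides` is a hypothetical helper, not part of SkyRL; check the result against the SkyRL configuration documentation before training.

```shell
# Hypothetical helper: print the six placement settings for a cluster shape.
#   $1: number of nodes
#   $2: GPUs per node
#   $3: tensor parallel size per inference engine (optional, default 1)
# Assumption: num_inference_engines = (nodes * gpus_per_node) / tensor_parallel_size.
# The body runs in a subshell "( ... )", so "exit 1" only fails this function.
placement_overrides() (
    nodes="$1"
    gpus_per_node="$2"
    tp_size="${3:-1}"
    total_gpus=$((nodes * gpus_per_node))
    if [ $((total_gpus % tp_size)) -ne 0 ]; then
        echo "total GPUs ($total_gpus) is not divisible by tensor parallel size ($tp_size)" >&2
        exit 1
    fi
    engines=$((total_gpus / tp_size))
    printf '%s\n' \
        "trainer.placement.policy_num_nodes=$nodes" \
        "trainer.placement.ref_num_nodes=$nodes" \
        "trainer.placement.policy_num_gpus_per_node=$gpus_per_node" \
        "trainer.placement.ref_num_gpus_per_node=$gpus_per_node" \
        "generator.num_inference_engines=$engines" \
        "generator.inference_engine_tensor_parallel_size=$tp_size"
)
```

For example, `placement_overrides 1 4 2` prints the settings for a single node with 4 GPUs and two-way tensor parallelism; the printed values can then be copied into the corresponding lines of run_otagent.sh.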