My minimal-code implementations of RL algorithms such as GRPO and GSPO.
- Supported models
  - Qwen2/Qwen2.5/Qwen3 language models
  - Qwen2.5 vision language models
- Supported algorithms
  - GRPO
  - Dr-GRPO
  - GSPO
  - KL-Conv
  - StableReinforce
- Supported tricks
  - clip higher from DAPO
  - token-level policy loss
  - dual clip
  - KL term removal
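
To make the algorithms and tricks listed above more concrete, here is a minimal sketch in plain Python. It is not the code from this repo: the function names, the default values (`eps_low=0.2`, `eps_high=0.28`, `dual_clip=3.0`), and the use of the population standard deviation are all assumptions chosen for illustration. It shows group-relative advantages (GRPO, and the Dr-GRPO variant without std scaling), a clipped objective with an asymmetric range (clip higher) and a dual-clip bound, a GSPO-style sequence-level ratio, and token-level versus per-sequence loss averaging.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Advantages for the responses sampled from one prompt (one group)."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        # Dr-GRPO style: keep the mean baseline, skip the std division.
        return centered
    std = pstdev(rewards)  # assumption: population std over the group
    return [c / (std + eps) for c in centered]


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28, dual_clip=3.0):
    """Per-token surrogate objective (to be maximized).

    eps_high > eps_low is the "clip higher" idea; dual_clip bounds the
    objective from below when the advantage is negative.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    objective = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0 and dual_clip is not None:
        objective = max(objective, dual_clip * advantage)
    return objective


def sequence_ratio(new_logprobs, old_logprobs):
    """GSPO-style importance ratio: length-normalized, one value per sequence."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def aggregate(token_losses_per_seq, token_level=True):
    """Average losses over all tokens, or per sequence and then over sequences."""
    if token_level:
        flat = [loss for seq in token_losses_per_seq for loss in seq]
        return sum(flat) / len(flat)
    return mean(sum(seq) / len(seq) for seq in token_losses_per_seq)
```

For example, `group_advantages([1.0, 0.0, 0.0, 1.0], scale_by_std=False)` returns `[0.5, -0.5, -0.5, 0.5]`. With token-level averaging, long responses contribute more tokens to the mean than short ones, which is the main difference from per-sequence averaging.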

Install the dependencies:

`pip install -r requirements`

Train on the logic puzzle task:

`bash scripts/run_logic.sh`

Dataset: https://github.com/Unakar/Logic-RL/tree/main/data/kk/instruct

| Size | Algorithm | Bits | LR | KL | Group Size | Steps | Test Score |
|---|---|---|---|---|---|---|---|
| 3B | GRPO | AMP | 1e-6 | 0 | 8 | 1600 | 0.12->0.54 |
| 7B | GRPO | AMP | 1e-6 | 0 | 8 | 1350 | 0.23->0.89 |

| Model | 2ppl | 3ppl | 4ppl | 5ppl | 6ppl | 7ppl | 8ppl |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 0.37 | 0.13 | 0.17 | 0.12 | 0.04 | 0.02 | 0.02 |
| Qwen2.5-3B-Instruct-GRPO | 0.76 | 0.70 | 0.68 | 0.50 | 0.47 | 0.33 | 0.33 |
| Qwen2.5-7B-Instruct | 0.56 | 0.35 | 0.23 | 0.25 | 0.14 | 0.09 | 0.02 |
| Qwen2.5-7B-Instruct-GRPO | 0.97 | 0.96 | 0.96 | 0.94 | 0.88 | 0.79 | 0.72 |

Train on the geometry task:

`bash scripts/run_geometry3k.sh`

Dataset: https://huggingface.co/datasets/hiyouga/geometry3k

| Size | Algorithm | Bits | LR | KL | Group Size | Steps | Test Score |
|---|---|---|---|---|---|---|---|
| 3B | GRPO | AMP | 1e-6 | 0 | 8 | 700 | 0.24->0.43 |
| 3B | GSPO | AMP | 1e-6 | 0 | 8 | 750 | 0.24->0.43 |
| 3B | StableReinforce | AMP | 1e-6 | 0 | 12 | 1200 | 0.25->0.44 |
| 3B | KL-Conv | AMP | 1e-6 | 0 | 12 | 900 | 0.23->0.45 |
| 7B | GRPO | AMP | 1e-6 | 0 | 8 | 800 | 0.38->0.50 |
- train on math
- train with Dr-GRPO/GSPO/KL-Conv/StableReinforce algos
- support dynamic sampling from DAPO
- support PPO/REINFORCE++/RLOO
- support vision language models
- support Retrieval-Augmented Reasoning
- support agent training
- support code eval




