Sub-Scale Collaboration On Unseen Task (SCOUT)
Decoupling Exploration from Exploitation for Efficient LLM Agent Training
SCOUT is a novel framework that addresses the inefficiency of Large Language Models (LLMs) in exploring unseen, non-linguistic environments (e.g., symbolic or spatial tasks).
While LLMs excel at exploitation (reasoning based on knowledge), they are computationally expensive and inefficient at exploration (trial-and-error). SCOUT decouples these two processes:
- Lightweight Scouts: Use small networks (MLPs/CNNs) to rapidly master environmental dynamics via standard RL.
- Sub-Scale Collaboration: Distill the scout's expert trajectories into the LLM via SFT.
- Evolution: Activate the LLM's latent world knowledge through multi-turn RL (PPO).
Empirically, SCOUT enables a Qwen2.5-3B model to achieve an average score of 0.86 on complex tasks (including Rubik's Cube and 2048), significantly outperforming proprietary models such as Gemini-2.5-Pro (0.60), while reducing GPU hours by ~60%.
This repository is built upon the RAGEN framework.
- 2026.01.30: We release our paper, Language-based Trial and Error Falls Behind in the Era of Experience, on arXiv, together with the code.
- 2026.01.31: We release the multi-task models on Hugging Face.
The training pipeline consists of three distinct stages:
- Exploration Stage (Scout Training):
  - Agents: Small MLPs or CNNs ($\sim 10^{-5}$B parameters, i.e., on the order of $10^4$ weights).
  - Algorithm: DQN or PPO.
  - Goal: Efficiently map transition dynamics and generate expert trajectories ($\tau_{scout}$).
- Distillation Stage (SFT):
  - Process: Transform $\tau_{scout}$ into text-based dialogue formats using a deterministic Textualizer (a minimal sketch follows this list).
  - Goal: "Warm up" the LLM to understand the physics of the unseen task.
- Evolving Stage (Multi-turn RL):
  - Algorithm: Multi-turn PPO (via RAGEN).
  - Goal: Refine reasoning and enable the LLM to self-evolve beyond the scout's capabilities.
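To make the Distillation stage concrete, here is a minimal sketch of what a deterministic Textualizer could look like for a FrozenLake-style task: it maps a single scout transition from $\tau_{scout}$ to a prompt/response pair. The function name, action wording, and output fields are illustrative assumptions, not the repository's actual Textualizer.

```python
# Illustrative sketch (not the repo's actual Textualizer): turn one scout transition
# (state, action, reward, next_state) from tau_scout into a dialogue-style SFT example.

ACTIONS = {0: "move left", 1: "move down", 2: "move right", 3: "move up"}

def textualize_transition(state: int, action: int, reward: float,
                          next_state: int, grid_size: int = 4) -> dict:
    """Deterministically map a FrozenLake (s, a, r, s') tuple to a prompt/response pair."""
    row, col = divmod(state, grid_size)
    prompt = (
        f"You are on a {grid_size}x{grid_size} frozen lake, standing at row {row}, "
        f"column {col}. Reach the goal without falling into a hole. "
        "Choose one action: move left, move down, move right, or move up."
    )
    response = ACTIONS[action]  # the scout's expert action becomes the SFT target
    return {"instruction": prompt, "output": response,
            "reward": reward, "next_state": int(next_state)}

if __name__ == "__main__":
    # e.g. the scout moved right from the start cell and reached state 1
    print(textualize_transition(state=0, action=2, reward=0.0, next_state=1))
```

Concatenating such pairs over an episode yields the multi-turn dialogues used to warm up the LLM.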
```bash
# Clone the repository
git clone https://github.com/Harry-mic/SCOUT.git
cd SCOUT

# Setup the environment (based on RAGEN)
bash scripts/setup_ragen.sh
```

We introduce several OOD (Out-of-Distribution) symbolic and spatial tasks:
- Rubik's Cube: Restore a 2x2 scrambled cube (spatial reasoning).
- 2048: Long-horizon planning (>800 turns).
- Sudoku: Logic-based constraint satisfaction.
- Sokoban: Box-pushing planning task.
- FrozenLake: Stochastic navigation (Static & Slippery variants).
- Bandit: Fundamental RL benchmark.
Train lightweight scouts (MLP/CNN) to collect expert trajectories.
```bash
# Example: Train a DQN scout for FrozenLake and collect the trajectories into runs_scouts
python scout_dqn/dqn_frozenlake.py --track
```
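For context, here is a minimal sketch of the kind of lightweight scout this step refers to: a tiny MLP Q-network trained with DQN on Gymnasium's FrozenLake-v1. The architecture, hyperparameters, and training loop are illustrative assumptions, not the repository's scout_dqn implementation (which additionally records the trajectories into runs_scouts for the Textualizer).

```python
# Minimal DQN scout sketch (illustrative; not the repo's scout_dqn code).
import random
from collections import deque

import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

def one_hot(s: int) -> torch.Tensor:
    """FrozenLake states are integers; encode them as one-hot vectors."""
    v = torch.zeros(n_states)
    v[s] = 1.0
    return v

# A tiny MLP Q-network: a few thousand parameters, orders of magnitude below LLM scale.
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)
gamma, eps = 0.99, 1.0

for episode in range(2000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration: cheap trial and error handled by the scout, not the LLM.
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = q_net(one_hot(state)).argmax().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((state, action, reward, next_state, done))
        state = next_state

        if len(replay) >= 128:
            # One-step TD update on a random minibatch (no target network, for brevity).
            batch = random.sample(replay, 64)
            s, a, r, s2, d = zip(*batch)
            s = torch.stack([one_hot(i) for i in s])
            s2 = torch.stack([one_hot(i) for i in s2])
            q = q_net(s).gather(1, torch.tensor(a).unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = (torch.tensor(r, dtype=torch.float32)
                          + gamma * (1 - torch.tensor(d, dtype=torch.float32))
                          * q_net(s2).max(dim=1).values)
            loss = F.mse_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    eps = max(0.05, eps * 0.995)  # decay exploration over episodes
```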
Textualize the collected datasets.

```bash
# Textualize the collected one-hot state vectors into language dialogues.
python scripts/Textualizer_frozenlake.py runs_scouts/Frozenlake_dqn_*** --step step_***
```
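Before running SFT, the textualized dialogues have to be in a format LLaMA-Factory can read. Below is a minimal sketch assuming an Alpaca-style JSON dataset and a hypothetical runs_scouts file layout; the repository's actual Textualizer output and data pipeline may differ.

```python
# Illustrative sketch: collect textualized (prompt, response) pairs into an
# Alpaca-style dataset file for LLaMA-Factory SFT. The file layout under
# runs_scouts/ is a hypothetical example, not the repo's actual structure.
import json
from pathlib import Path

records = []
for path in Path("runs_scouts").glob("Frozenlake_dqn_*/dialogues_*.json"):
    for turn in json.loads(path.read_text()):
        records.append({
            "instruction": turn["instruction"],  # environment description + question
            "input": "",
            "output": turn["output"],            # the scout's expert action in words
        })

Path("data").mkdir(exist_ok=True)
Path("data/frozenlake_sft.json").write_text(json.dumps(records, indent=2))
# Remember to register the file in LLaMA-Factory's data/dataset_info.json and
# reference it via the `dataset` field of the training YAML.
```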
Fine-tune the base LLM on the collected trajectories. We use LLaMA-Factory for this stage.

```bash
# Run SFT on the previously collected dialogues.
llamafactory-cli train xxx.yaml
```
Run multi-turn PPO on the SFT model using the RAGEN infrastructure.

Start Training:

```bash
bash scripts/example_bash.sh
```

SCOUT achieves state-of-the-art performance on unseen tasks while saving ~60% of computational cost compared to direct RL training.
SCOUT/
├── ragen/ # Core RAGEN framework (Env Manager, Context Manager)
├── scout_dqn/ # Lightweight scout training (DQN) & Textualizers
├── config/ # Hydra configurations for PPO/GRPO
├── scripts/ # Setup and utility scripts
└── train.py # Main entry point for Evolving Stage
If you find SCOUT useful for your research, please cite our paper:
@article{wang2026language,
title={Language-based Trial and Error Falls Behind in the Era of Experience},
author={Wang, Haoyu and Ma, Guozheng and Cui, Shugang and Kong, Yilun and Luo, Haotian and Shen, Li and Gao, Mengya and Wu, Yichao and Wang, Xiaogang and Tao, Dacheng},
journal={arXiv preprint arXiv:2601.21754},
year={2026}
}
This codebase is built upon RAGEN. We thank the RAGEN team for their infrastructure support.

