Sub-Scale Collaboration On Unseen Task (SCOUT)
Decoupling Exploration from Exploitation for Efficient LLM Agent Training
SCOUT is a novel framework that addresses the inefficiency of Large Language Models (LLMs) in exploring unseen, non-linguistic environments (e.g., symbolic or spatial tasks).
While LLMs excel at exploitation (reasoning based on knowledge), they are computationally expensive and inefficient at exploration (trial-and-error). SCOUT decouples these two processes:
- Lightweight Scouts: Use small networks (MLPs/CNNs) to rapidly master environmental dynamics via standard RL.
- Sub-Scale Collaboration: Distill the scout's expert trajectories into the LLM via SFT.
- Evolution: Activate the LLM's latent world knowledge through multi-turn RL (PPO).
Empirically, SCOUT enables a Qwen2.5-3B model to achieve an average score of 0.86 on complex tasks (including Rubik's Cube and 2048), significantly outperforming proprietary models such as Gemini-2.5-Pro (0.60), while reducing GPU hours by ~60%.
This repository is built upon the RAGEN framework.
- 2026.01.30: We release our paper, Language-based Trial and Error Falls Behind in the Era of Experience, on arXiv, together with the code.
- 2026.01.31: We release the multi-task models on Hugging Face.
The training pipeline consists of three distinct stages:
- Exploration Stage (Scout Training):
  - Agents: Small MLPs or CNNs ($\sim 10^{-5}$B parameters, i.e., on the order of $10^4$ weights).
  - Algorithm: DQN or PPO.
  - Goal: Efficiently map transition dynamics and generate expert trajectories ($\tau_{scout}$).
- Distillation Stage (SFT):
  - Process: Transform $\tau_{scout}$ into text-based dialogue formats using a deterministic Textualizer (a minimal sketch follows this list).
  - Goal: "Warm up" the LLM to understand the physics of the unseen task.
- Evolving Stage (Multi-turn RL):
  - Algorithm: Multi-turn PPO (via RAGEN).
  - Goal: Refine reasoning and enable the LLM to self-evolve beyond the scout's capabilities.
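To make the Distillation stage concrete, here is a minimal sketch of what a deterministic Textualizer could look like for a FrozenLake-style task: it maps a single scout transition from $\tau_{scout}$ to a prompt/response pair. The function name, action wording, and output fields are illustrative assumptions, not the repository's actual Textualizer.

```python
# Illustrative sketch (not the repo's actual Textualizer): turn one scout transition
# (state, action, reward, next_state) from tau_scout into a dialogue-style SFT example.

ACTIONS = {0: "move left", 1: "move down", 2: "move right", 3: "move up"}

def textualize_transition(state: int, action: int, reward: float,
                          next_state: int, grid_size: int = 4) -> dict:
    """Deterministically map a FrozenLake (s, a, r, s') tuple to a prompt/response pair."""
    row, col = divmod(state, grid_size)
    prompt = (
        f"You are on a {grid_size}x{grid_size} frozen lake, standing at row {row}, "
        f"column {col}. Reach the goal without falling into a hole. "
        "Choose one action: move left, move down, move right, or move up."
    )
    response = ACTIONS[action]  # the scout's expert action becomes the SFT target
    return {"instruction": prompt, "output": response,
            "reward": reward, "next_state": int(next_state)}

if __name__ == "__main__":
    # e.g. the scout moved right from the start cell and reached state 1
    print(textualize_transition(state=0, action=2, reward=0.0, next_state=1))
```

Concatenating such pairs over an episode yields the multi-turn dialogues used to warm up the LLM.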
```bash
# Clone the repository
git clone https://github.com/Harry-mic/SCOUT.git
cd SCOUT

# Setup the environment (based on RAGEN)
bash scripts/setup_ragen.sh
```

We introduce several OOD (Out-of-Distribution) symbolic and spatial tasks:
- Rubik's Cube: Restore a 2x2 scrambled cube (spatial reasoning).
- 2048: Long-horizon planning (>800 turns).
- Sudoku: Logic-based constraint satisfaction.
- Sokoban: Box-pushing planning task.
- FrozenLake: Stochastic navigation (Static & Slippery variants).
- Bandit: Fundamental RL benchmark.
Train lightweight scouts (MLP/CNN) to collect expert trajectories.
```bash
# Example: Train a DQN scout for FrozenLake and collect the trajectories into runs_scouts
python scout_dqn/dqn_frozenlake.py --track
```
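For context, here is a minimal sketch of the kind of lightweight scout this step refers to: a tiny MLP Q-network trained with DQN on Gymnasium's FrozenLake-v1. The architecture, hyperparameters, and training loop are illustrative assumptions, not the repository's scout_dqn implementation (which additionally records the trajectories into runs_scouts for the Textualizer).

```python
# Minimal DQN scout sketch (illustrative; not the repo's scout_dqn code).
import random
from collections import deque

import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

def one_hot(s: int) -> torch.Tensor:
    """FrozenLake states are integers; encode them as one-hot vectors."""
    v = torch.zeros(n_states)
    v[s] = 1.0
    return v

# A tiny MLP Q-network: a few thousand parameters, orders of magnitude below LLM scale.
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)
gamma, eps = 0.99, 1.0

for episode in range(2000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration: cheap trial and error handled by the scout, not the LLM.
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = q_net(one_hot(state)).argmax().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((state, action, reward, next_state, done))
        state = next_state

        if len(replay) >= 128:
            # One-step TD update on a random minibatch (no target network, for brevity).
            batch = random.sample(replay, 64)
            s, a, r, s2, d = zip(*batch)
            s = torch.stack([one_hot(i) for i in s])
            s2 = torch.stack([one_hot(i) for i in s2])
            q = q_net(s).gather(1, torch.tensor(a).unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = (torch.tensor(r, dtype=torch.float32)
                          + gamma * (1 - torch.tensor(d, dtype=torch.float32))
                          * q_net(s2).max(dim=1).values)
            loss = F.mse_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    eps = max(0.05, eps * 0.995)  # decay exploration over episodes
```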
Textualize the collected datasets.

```bash
# Textualize the collected one-hot state vectors into language dialogues.
python scripts/Textualizer_frozenlake.py runs_scouts/Frozenlake_dqn_*** --step step_***
```
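Before running SFT, the textualized dialogues have to be in a format LLaMA-Factory can read. Below is a minimal sketch assuming an Alpaca-style JSON dataset and a hypothetical runs_scouts file layout; the repository's actual Textualizer output and data pipeline may differ.

```python
# Illustrative sketch: collect textualized (prompt, response) pairs into an
# Alpaca-style dataset file for LLaMA-Factory SFT. The file layout under
# runs_scouts/ is a hypothetical example, not the repo's actual structure.
import json
from pathlib import Path

records = []
for path in Path("runs_scouts").glob("Frozenlake_dqn_*/dialogues_*.json"):
    for turn in json.loads(path.read_text()):
        records.append({
            "instruction": turn["instruction"],  # environment description + question
            "input": "",
            "output": turn["output"],            # the scout's expert action in words
        })

Path("data").mkdir(exist_ok=True)
Path("data/frozenlake_sft.json").write_text(json.dumps(records, indent=2))
# Remember to register the file in LLaMA-Factory's data/dataset_info.json and
# reference it via the `dataset` field of the training YAML.
```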
Fine-tune the base LLM on the collected trajectories. We use LLaMA-Factory for this stage.

```bash
# Run SFT on the previously collected dialogues.
llamafactory-cli train xxx.yaml
```
Run multi-turn PPO on the SFT model using the RAGEN infrastructure.

Start Training:

```bash
bash scripts/example_bash.sh
```

SCOUT achieves state-of-the-art performance on unseen tasks while saving ~60% of computational cost compared to direct RL training.
SCOUT/
├── ragen/ # Core RAGEN framework (Env Manager, Context Manager)
├── scout_dqn/ # Lightweight scout training (DQN) & Textualizers
├── config/ # Hydra configurations for PPO/GRPO
├── scripts/ # Setup and utility scripts
└── train.py # Main entry point for Evolving Stage
If you find SCOUT useful for your research, please cite our paper:
@article{wang2026language,
title={Language-based Trial and Error Falls Behind in the Era of Experience},
author={Wang, Haoyu and Ma, Guozheng and Cui, Shugang and Kong, Yilun and Luo, Haotian and Shen, Li and Gao, Mengya and Wu, Yichao and Wang, Xiaogang and Tao, Dacheng},
journal={arXiv preprint arXiv:2601.21754},
year={2026}
}
This codebase is built upon RAGEN. We thank the RAGEN team for their infrastructure support.

