While reproducing Search-R1 with a 3B model on the Slime framework, I ran into a policy collapse in the early stages of training, specifically when TIS (Trajectory Importance Sampling) is enabled. I followed the settings from the "Enabling TIS (Trajectory Importance Sampling)" tutorial.
The model starts generating chaotic, multilingual gibberish and completely ignores the system prompt and the formatting constraints (e.g., the required tags).
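For context, here is a minimal sketch of how I understand the per-token importance-sampling correction that TIS applies: each token's loss is weighted by the ratio of the training policy's probability to the rollout engine's probability, truncated at a cap. All names here are mine for illustration, not Slime's actual API.

```python
import numpy as np

def tis_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token truncated importance-sampling weights.

    Corrects for the mismatch between the rollout engine's policy and
    the training policy by weighting each token's loss contribution
    with min(pi_train / pi_rollout, cap).
    """
    log_ratio = np.asarray(train_logprobs) - np.asarray(rollout_logprobs)
    ratios = np.exp(log_ratio)
    # Truncation: large ratios (tokens the trainer likes far more than
    # the rollout engine did) are capped to bound the variance.
    return np.minimum(ratios, cap)
```

If this understanding is right, a collapse with TIS enabled would point at the ratios themselves (i.e., a large rollout/training probability mismatch) rather than at the truncation.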
Comparison: TIS ON vs. OFF
TIS Enabled: The model collapses within the first few rollouts. Outputs become high-entropy "word salad" (as shown in the logs below).
TIS Disabled: Training is stable. The model follows the prompt instructions and converges normally.
In the early rollouts, the model fails to maintain the reasoning format and generates abnormal outputs:

