@millioniron

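I merged my changes with the new version. Specifically, for the part of the new version that obtains infer-log-prob, I followed the official implementation. For the actual train-inference mismatch fix, however, I use my own revision.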

✨ What's Changed

What does this PR do?


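1. Core component refactoring

  • Added the InferCorrectionHandler class (roll/utils/infer_correction.py): it handles only importance-sampling (IS) correction and sample rejection, replacing the logic previously mixed into loss_func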
    handler = InferCorrectionHandler(pipeline_config)
    weighted_loss, final_mask, metrics = handler(
        old_log_probs, infer_log_probs, response_mask, pg_loss
    )

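2. Three-level rejection strategy

| Strategy | Trigger | Protection goal | Key parameters |
| --- | --- | --- | --- |
| Token-level rejection | IS ratio outside the accepted range | Prevent gradient explosion from a single token | infer_token_mask_threshold_{min,max} |
| Sequence-level rejection | Sequence-wide IS ratio is abnormal | Ensure sequence-level consistency | enable_seq_reject, infer_seq_mask_threshold_{min,max} |
| Catastrophic rejection | IS ratio < 1e-3 (exponential-scale probability gap) | Prevent complete training collapse | infer_catastrophic_threshold |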
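As a rough illustration of how the three levels could interact, here is a minimal sketch in plain Python for a single sequence. It is not the code from roll/utils/infer_correction.py. The function name, the default thresholds, the ratio direction (train policy over inference policy), the use of a geometric mean as the sequence-level ratio, and the rule that one catastrophic token drops the whole sequence are all assumptions made for the example.

```python
import math


def rejection_mask(old_log_probs, infer_log_probs, response_mask,
                   token_min=0.5, token_max=2.0,
                   seq_min=0.8, seq_max=1.25,
                   catastrophic=1e-3, enable_seq_reject=True):
    """Return a 0/1 keep-mask for one sequence (hypothetical helper)."""
    # Per-token log ratio, assumed here to be log(pi_train / pi_infer).
    log_ratios = [o - i for o, i in zip(old_log_probs, infer_log_probs)]
    valid = [lr for lr, m in zip(log_ratios, response_mask) if m]
    rejected = [0] * len(response_mask)
    if not valid:
        return rejected

    # Catastrophic level: a single extreme token drops the whole sequence.
    if math.exp(min(valid)) < catastrophic:
        return rejected

    # Sequence level: geometric mean of token ratios (assumed aggregate).
    if enable_seq_reject:
        seq_ratio = math.exp(sum(valid) / len(valid))
        if not (seq_min <= seq_ratio <= seq_max):
            return rejected

    # Token level: drop only the tokens whose own ratio is out of range.
    return [int(bool(m) and token_min <= math.exp(lr) <= token_max)
            for lr, m in zip(log_ratios, response_mask)]
```

Under these assumptions the checks run from coarsest to finest, so a sequence that survives the catastrophic and sequence-level checks can still lose individual tokens.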

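3. Smart importance sampling

  • Dynamic mode switching
    infer_is_mode: Literal["token", "sequence", "geometric", "none"]
    • token: conventional token-level IS (default)
    • sequence: total log-ratio over the sequence (stabilizes long-sequence training)
    • geometric: geometric-mean ratio (dampens extreme values)
    • none: IS disabled (for baseline runs)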
  • Adaptive clipping
    is_weight = raw_is_weight.clamp(
        min=infer_is_threshold_min,
        max=infer_is_threshold_max,
    )
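
The four infer_is_mode options and the clipping step can be sketched in plain Python for one sequence as follows. This only illustrates the arithmetic and is not the PR's tensor implementation. The function name, the default clip bounds, and the ratio direction (train policy over inference policy) are assumptions.

```python
import math


def is_weights(old_log_probs, infer_log_probs, response_mask,
               mode="token", clip_min=0.1, clip_max=10.0):
    """Per-token IS weights for one sequence (hypothetical helper)."""
    # Masked-out positions contribute a log ratio of 0 to the sums below.
    log_ratios = [(o - i) if m else 0.0
                  for o, i, m in zip(old_log_probs, infer_log_probs, response_mask)]
    n = len(response_mask)
    n_valid = sum(1 for m in response_mask if m)

    if mode == "none" or n_valid == 0:
        raw = [1.0] * n                                      # IS disabled
    elif mode == "token":
        raw = [math.exp(lr) for lr in log_ratios]            # one ratio per token
    elif mode == "sequence":
        raw = [math.exp(sum(log_ratios))] * n                # product of token ratios
    elif mode == "geometric":
        raw = [math.exp(sum(log_ratios) / n_valid)] * n      # geometric mean
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Clip, then zero out masked positions.
    return [min(max(w, clip_min), clip_max) if m else 0.0
            for w, m in zip(raw, response_mask)]
```

For two tokens with a log ratio of 0.5 each, the token and geometric modes both give about 1.65 per token, while the sequence mode gives about 2.72 before clipping, which shows why the sequence-level weight depends more heavily on the clip bounds as sequences grow.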

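4. Production-grade diagnostics

  • StatsCollector manages metrics centrally, in three groups:
    • Basic distribution: token_ratio_mean/std/min/max
    • Rejection analysis: token_reject_frac, seq_reject_frac, catastrophic_seq_frac
    • Training health: inferkl (raw KL), inferkl_reject (KL after rejection)
  • Deferred computation: avoids frequent GPU-CPU synchronization
    self.stats.add_tensor_stat("token_ratio", ratio, mask)  # registered, not computed yet
    self.stats.compute_tensor_stats()  # computed in one batch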
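
The register-then-compute pattern can be sketched as below. DeferredStats is a hypothetical stand-in written in plain Python, not the PR's StatsCollector. It only shows the control flow: in the PR the values are tensors, and the point of batching is presumably to perform the reductions and the device-to-host transfer once per step rather than once per metric.

```python
import math


class DeferredStats:
    """Hypothetical stand-in for the deferred-computation pattern."""

    def __init__(self):
        self._pending = {}   # name -> (values, mask), not reduced yet
        self.metrics = {}

    def add_tensor_stat(self, name, values, mask):
        # Only keep references; no reduction happens at registration time.
        self._pending[name] = (values, mask)

    def compute_tensor_stats(self):
        # Reduce every registered stat in one pass.
        for name, (values, mask) in self._pending.items():
            kept = [v for v, m in zip(values, mask) if m]
            if not kept:
                continue
            mean = sum(kept) / len(kept)
            var = sum((v - mean) ** 2 for v in kept) / len(kept)
            self.metrics[f"{name}_mean"] = mean
            self.metrics[f"{name}_std"] = math.sqrt(var)   # population std
            self.metrics[f"{name}_min"] = min(kept)
            self.metrics[f"{name}_max"] = max(kept)
        self._pending.clear()
        return self.metrics
```

Calling add_tensor_stat leaves metrics empty until compute_tensor_stats runs, which mirrors the two-step usage shown in the snippet above.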

	new file:   examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
	new file:   examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
	new file:   examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
	new file:   examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
	modified:   roll/configs/base_config.py
	modified:   roll/configs/generating_args.py
	modified:   roll/distributed/scheduler/generate_scheduler.py
	modified:   roll/pipeline/agentic/env_manager/step_env_manager.py
	modified:   roll/pipeline/agentic/env_manager/traj_env_manager.py
	modified:   roll/pipeline/base_worker.py
	modified:   roll/pipeline/rlvr/actor_pg_worker.py
	modified:   roll/pipeline/rlvr/actor_worker.py
@millioniron
Author

done
