RL-LLM-NLP

A curated, opinionated index of post-R1 LLM × Reinforcement Learning. Every paper is read, classified, cross-linked, and connected back to a Chinese deep-dive blog post by hscspring.

I consider RL a pivotal technology for AI, and NLP (particularly LLMs) a direction well worth exploring. This repo focuses specifically on post-R1 LLM RL.

Why this repo (and not another Awesome list)

Awesome lists are retrievers. This repo is a curator.

  • Has a verdict, not a vote. Each of the 5 tracks ends with the author's bottom-line judgment ("data sets the ceiling, algorithms approach it"; "most RL is sampling polish, not extrapolation"; "all internal-feedback methods compress entropy → exploration crisis"), not a paper summary.
  • Cross-links across papers. RLOO ≡ GRPO with Fnorm=1 (derived in GiGPO; sketched just below this list); REINFORCE + IS → CISPO loss (extended in CISPO); Activation Steering & Context Engineering are the upstream signals of Training-Free RL. These connections only surface when one person reads everything.
  • A narrative, not a database. The Chronological Blog Timeline reads as a year-and-a-half editorial arc through post-R1 RL × LLM: Feb-2025 R1 → GRPO family → Reward modeling → MoE stability → Training-Free RL.
  • Personal historical anchors. Pre-R1 works (RL4LMs, FTHP, Quark, DT) live in their own corner with a one-line why I personally cared, not as bibliography filler.
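
As one concrete example of the cross-linking above, the RLOO ≡ GRPO relation can be written out in two lines. This is a minimal sketch in generic notation (group size $K$, group rewards $r_1,\dots,r_K$); see the GiGPO note for the exact derivation:

$$
A_i^{\mathrm{GRPO}} = \frac{r_i - \bar r}{F_{\mathrm{norm}}},\qquad
\bar r = \frac{1}{K}\sum_{j=1}^{K} r_j,\qquad
F_{\mathrm{norm}} = \mathrm{std}(r_1,\dots,r_K).
$$

Setting $F_{\mathrm{norm}} = 1$ gives

$$
A_i = r_i - \frac{1}{K}\sum_{j=1}^{K} r_j
    = \frac{K-1}{K}\left(r_i - \frac{1}{K-1}\sum_{j\neq i} r_j\right),
$$

i.e. the RLOO leave-one-out advantage up to the constant factor $(K-1)/K$.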

Topics covered: DeepSeek-R1 reproduction · GRPO family (DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++) · PPO · RLHF · DPO · Reward Modeling · Verifier-Free RL · MoE RL Stability · Training-Free RL · Activation Steering · Agentic RL.


5 Tracks · The Author's Verdicts

Blog posts on yam.gift are grouped into 5 tracks. Each verdict below is the author's own bottom-line judgment, not a paper summary.

Track 01 — R1 Full-Chain (2025)

Parsing the original report, then digging into the data / paradigm / experiment side.

  • Core thesis: data sets the ceiling, algorithms only approach it. Pure rule-based RL is finally validated as a viable path.
  • The frame that survived: "Base+SFT / Base+RL / SFT+RL" can absorb almost all subsequent variations.
  • Loose ends still being chased: R1-Zero behavior differs sharply across base models (SimpleRL-Zoo, Yarz-Logic); LIMO/s1 confirm that "less is more" works by activation, not by teaching.

Track 02 — GRPO Family & Engineering Refinements (2025–2026)

Every GRPO variant — DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++ / industry showcases.

  • Core thesis: every variant is paying off the same engineering debt — token vs. sequence-level, clip tighter/wider, length normalization, KL choice (k2 vs k3), advantage global-normalization.
  • Convergence: the GRPO objective increasingly looks like a "people's edition" of PPO with global advantage and no critic.
  • Sub-thread: clip is not just a stability knob — it directly shapes the explore/exploit boundary. Spurious Rewards, Clip-Higher (DAPO), Clip-Wider (GMPO) are all moves on the same axis.
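
The knobs listed above fit in a few lines of code. A minimal sketch, assuming a toy setup with made-up per-token log-probabilities; names such as `clipped_token_loss`, `eps_low`, `eps_high` are illustrative, not any framework's API. It shows asymmetric clipping in the Clip-Higher spirit and the k1/k2/k3 KL estimator family that the "k2 vs k3" choice refers to:

```python
import torch

def kl_estimators(logp_new: torch.Tensor, logp_ref: torch.Tensor):
    """Per-token estimators of KL(pi_new || pi_ref), tokens sampled from pi_new."""
    log_ratio = logp_ref - logp_new          # log(pi_ref / pi_new)
    k1 = -log_ratio                          # unbiased, high variance, can go negative
    k2 = 0.5 * log_ratio ** 2                # biased, low variance
    k3 = log_ratio.exp() - 1.0 - log_ratio   # unbiased and always >= 0 (the common GRPO choice)
    return k1, k2, k3

def clipped_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style token-level surrogate with asymmetric clipping (Clip-Higher spirit):
    eps_high > eps_low widens the upward clip, giving low-probability tokens room to grow."""
    ratio = (logp_new - logp_old).exp()                    # pi_new / pi_old per token
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - eps_low, 1 + eps_high) * advantage
    return -torch.minimum(unclipped, clipped).mean()       # mean over tokens = token-level loss

# Toy usage with made-up numbers.
logp_old = torch.log(torch.tensor([0.30, 0.05, 0.60]))
logp_new = torch.log(torch.tensor([0.35, 0.10, 0.50]))
adv = torch.tensor([1.0, 1.0, -0.5])
print(clipped_token_loss(logp_new, logp_old, adv))
print(kl_estimators(logp_new, torch.log(torch.tensor([0.32, 0.06, 0.55]))))
```

Switching `.mean()` from per-token to per-sequence aggregation, or replacing the per-token ratio with a sequence-level one, is exactly the token-level vs sequence-level axis the track keeps returning to.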

Track 03 — Reward Modeling, Data & Verifiers (2025)

RM / RM-Data / Verifier-Free / Self-Verified / Verify-Free RL.

  • Drift 1 (modeling): from single scalar → "principles + critique + self-verification" (DeepSeek-GRM → DeepSeekMath-V2).
  • Drift 2 (data): good reward data is more like unlocking the base model's existing capabilities than teaching new ones (Skywork-Reward-V2, Spurious Rewards).
  • Drift 3 (Verify-Free): when no external verifier exists, all internal-feedback methods (TTRL / EM / RENT / EMPO / Intuitor) end up compressing entropy. Long-term, an exploration crisis is inevitable — ETTRL / Darling / EVOL-RL / RESTRAIN are all band-aids on the same wound.
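
To make Drift 3 concrete, here is a deliberately simplified sketch of what "internal feedback" tends to reduce to: rewarding the model's own confidence (negative token entropy, in the RENT/EM spirit) or agreement with its own majority vote (in the TTRL spirit). Function and variable names are illustrative only; this is not any paper's exact objective, just the mechanism that makes entropy compression unavoidable:

```python
import torch
from collections import Counter

def confidence_reward(token_logits: torch.Tensor) -> torch.Tensor:
    """Negative mean token entropy: higher when the model is more certain.
    Optimizing this as a reward directly compresses the output distribution."""
    probs = token_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-token entropy
    return -entropy.mean()

def majority_vote_rewards(sampled_answers: list[str]) -> list[float]:
    """Pseudo-labels from self-consistency: answers matching the majority get reward 1.
    Training on this signal also concentrates probability on the majority mode."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

print(confidence_reward(torch.randn(5, 32)))          # toy logits: 5 tokens, 32-way vocab
print(majority_vote_rewards(["42", "42", "41", "42"]))
```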

Track 04 — MoE RL Stability (2026)

R3 / IcePop / TIS / KAT-Coder.

  • Surface diagnosis: train-infer router mismatch is what everyone first noticed.
  • Deeper cause: logprob estimation noise on MoE is not neutral; even recomputing logprobs drifts. The importance ratio (π_new/π_old) — the heart of GRPO — is silently diluted on MoE.
  • Open bet: GSPO/GMPO's sequence-level + geometric-mean might be MoE-RL-friendly; not yet validated at production scale.
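
A minimal illustration of the two bullets above, with made-up numbers. The truncated importance weight follows the general idea behind TIS-style corrections, and the sequence-level ratio shows the GSPO/GMPO geometric-mean flavor; neither is any paper's exact recipe:

```python
import torch

def tis_weights(logp_train: torch.Tensor, logp_infer: torch.Tensor, cap: float = 2.0):
    """Per-token correction for train-infer mismatch: importance weight pi_train / pi_infer,
    truncated at `cap` so a few noisy tokens cannot dominate the gradient."""
    ratio = (logp_train - logp_infer).exp()
    return ratio.clamp(max=cap)

def seq_geo_mean_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Sequence-level ratio as the geometric mean of token ratios (GSPO/GMPO flavor):
    averaging in log space damps the influence of any single noisy token."""
    return (logp_new - logp_old).mean().exp()

# Same tokens scored by the trainer and by a slightly drifted inference engine.
logp_infer = torch.log(torch.tensor([0.40, 0.02, 0.55]))
logp_train = torch.log(torch.tensor([0.38, 0.06, 0.50]))   # MoE routing / precision drift
print(tis_weights(logp_train, logp_infer))                  # the 0.06/0.02 token gets capped at 2.0
print(seq_geo_mean_ratio(logp_train, logp_infer))
```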

Track 05 — Paradigm Frontier (2025–2026)

Training-Free / Experiential / Real-time / Planning / RL Boundary.

  • Boundary realization: most RL is just sampling polish (Yue), not true pass@k extrapolation.
  • Counter-evidence: ProRL / DELTA show extrapolation is possible — but only with edge data + process reward + avoiding "all-zero pass@k" cold start.
  • Upstream signals (already in 2025 H2): "Activation Steering" and "Context Engineering" both pointed in this direction before Training-Free RL had a name — behavior can be shaped without touching weights.
  • New paradigm A — Training-Free RL: advantage lives in text/context, not in weight space (TRT, Training-Free GRPO, MemAPO, Update-Free Steering).
  • New paradigm B — Experience-as-RL: the loop becomes "trajectory → information gain → re-supervision". Reflection, meta-search and open-ended learning are all data-construction tricks in disguise.
  • Higher-level question: "reasoning" should be studied as a data format, not as an RL task (Think-Strategy / LEPA).

Library

| GitHub | From | Year | Description |
|---|---|---|---|
| prime-rl | PrimeIntellect-ai | 2025 | Decentralized large-scale RL training framework |
| PRIME | PRIME-RL | 2025 | Scalable RL recipe for reasoning |
| rStar | Microsoft | 2025 | Self-evolved deep reasoning for SLMs |
| veRL | ByteDance | 2024 | LLM RL training framework (Volcano Engine) |
| trl | HuggingFace | 2024 | Train language models with RL |
| RL4LMs | Allen | 2023 | Aligning LMs to human preference via RL |
| alignment-handbook | HuggingFace | 2023 | Recipes for aligning to human/AI preference |

Papers

Notation for the My Notes column:

  • [short title](url) — full Chinese deep-dive available (yam.gift blog or book chapter)
  • (omnibus → ...) — covered as a main thread in a survey/overview blog
  • (<verb: derived/extended/contrasted/described/framed/criticized/...> in [blog]: ...) — touched on as a sub-topic inside another deep-dive, with a one-line pointer; multiple pointers can be chained with ;
  • to-write — not yet written

RL Reasoning Reproduction (R1 and Beyond)

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R1 | DeepSeek-R1: Incentivizing Reasoning Capability via RL | DeepSeek | 2025 | paper | DeepSeek R1 深度技术解析及其影响 |
| LIMO | LIMO: Less Is More for Reasoning | SJTU | 2025 | paper | 少量高质量数据 SFT 激活推理 |
| s1 | s1: Simple test-time scaling | Stanford | 2025 | paper | (omnibus → SFT-Data) |
| R1 Survey | The R1-era LLM new paradigm | | 2025 | | DeepSeek R1 后 LLM 新范式 |
| R1-Zero+ | Further understanding of R1-Zero | | 2025 | | R1-Zero 的进一步理解和探索 |
| SimpleRL-Zoo | SimpleRL-Zoo: R1-Zero RL across diverse base models | HKUST | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| FastCuRL | FastCuRL: Curriculum RL with Stage-wise Context Scaling | | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| Logic-RL | Logic-RL: Unleashing LLM Reasoning with Rule-Based RL | | 2025 | paper | Yarz-Logic:R1-Zero 相关实验报告 |
| Seed-Thinking | Seed-Thinking-v1.5: Advancing Superb Reasoning | ByteDance | 2025 | paper | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 |
| Qwen3 | Qwen3 Technical Report | Qwen | 2025 | paper | (omnibus → Seed-Thinking-Qwen3) |

RL Data Selection & Scaling

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| LIMR | LIMR: Less is More for RL Scaling | GAIR-NLP | 2025 | paper, GitHub | R1 相关:RL 数据选择与 Scaling |
| ORZ | Open-Reasoner-Zero | StepFun | 2025 | paper, GitHub | (omnibus → PPO-Data) |
| Online-DPO-R1 | Online-DPO-R1: Effective Reasoning Without the PPO Overhead | Salesforce | 2025 | paper, GitHub | R1 相关:DPO 数据选择与 DPO 等 RL 算法 |
| LIMD | LIMD: Less is More on DPO Data | | 2025 | | (omnibus → DPO-Data) |
| OREAL | Exploring the Limit of Outcome Reward for Math Reasoning | InternLM | 2025 | paper, GitHub | to-write |
| DeepScaleR | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| L1 / LCPO | Controlling How Long A Reasoning Model Thinks With RL | CMU | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| MRT | Optimizing Test-Time Compute via Meta RL Fine-Tuning | CMU | 2025 | paper, GitHub | to-write |
| ScalingLaw | Value-Based Deep RL Scales Predictably | Berkeley | 2025 | paper | to-write |

SLM Reasoning

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| PRIME | Process Reinforcement through Implicit Rewards | PRIME-RL | 2025 | paper, GitHub | (described in R1-New-Paradigm: implicit PRM — trained as an ORM, used as a PRM) |
| rStar-Math | rStar-Math: Small LLMs Can Master Math Reasoning | Microsoft | 2025 | paper, GitHub | (described in R1-New-Paradigm: rule-based verification on intermediate results at key steps, via Python code execution) |
| rStar | rStar: Mutual Reasoning Makes Smaller LLMs Stronger | Microsoft | 2024 | paper, GitHub | to-write |

Reward Model (modeling / data / verifier)

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| GRM | Inference-Time Scaling for Generalist Reward Modeling | DeepSeek | 2025 | paper | Reward Model 建模 |
| Skywork-Reward-V2 | Skywork-Reward-V2 | Skywork | 2025 | paper | Reward 数据如何塑造与激发推理策略 |
| Spurious Rewards | Spurious Rewards: Rethinking Training Signals in RLVR | Allen | 2025 | paper | (omnibus → RM-Data / GRPO-Clip) |
| ICM | Anthropic Internal Coherence Maximization | Anthropic | 2025 | blog | (omnibus → RM-Data) |
| DeepSeekMath-V2 | Towards Self-Verifiable Mathematical Reasoning | DeepSeek | 2025 | paper, GitHub | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM |

Verifier-Free RL (internal-feedback RL)

One blog (Verify-Free RL) covers the algorithms below — listed individually for searchability.

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| NOVER | NOVER: Incentive Training without External Verifiers | | 2025 | paper | 无验证器 RL 与 Reference 的妙用 |
| TTRL | TTRL: Test-Time Reinforcement Learning | | 2025 | paper | 无验证 RL——当模型只能相信自己 |
| SRT | Can Large Reasoning Models Self-Train? | | 2025 | paper | (omnibus → Verify-Free RL) |
| EM | The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning | | 2025 | paper | (omnibus → Verify-Free RL; covers EM-FT / EM-RL / EM-INF) |
| RENT | Maximizing Confidence Alone Improves Reasoning | | 2025 | paper | (omnibus → Verify-Free RL) |
| EMPO | Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | | 2025 | paper | (omnibus → Verify-Free RL) |
| Intuitor | Learning to Reason without External Rewards | | 2025 | paper | (omnibus → Verify-Free RL) |
| ETTRL | ETTRL: Balancing Exploration and Exploitation via Entropy Mechanism | | 2025 | paper | (omnibus → Verify-Free RL) |
| Darling | Jointly Reinforcing Diversity and Quality in Language Model Generations | | 2025 | paper | (omnibus → Verify-Free RL) |
| EVOL-RL | Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation | | 2025 | paper | (omnibus → Verify-Free RL) |
| RESTRAIN | RESTRAIN: From Spurious Votes to Signals — Self-Driven RL with Self-Penalization | | 2025 | paper | (omnibus → Verify-Free RL) |
| No Free Lunch | No Free Lunch: Rethinking Internal Feedback for LLM Reasoning | | 2025 | paper | (theoretical critique of Verify-Free; omnibus → Verify-Free RL) |

Alignment Classics

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| RLHF | Training language models to follow instructions with human feedback | OpenAI | 2022 | paper | HuggingLLM 1.3.3:RLHF 流程与思想 |
| RLOO | Back to Basics: Revisiting REINFORCE Style Optimization for RLHF | Cohere | 2024 | paper | (derived in GiGPO: GRPO with Fnorm=1 ≡ RLOO) |
| ReMax | ReMax: A Simple, Effective, and Efficient RL Method for LLM | CUHK | 2024 | paper | (contrasted in Reinforce++: greedy-baseline variant — inefficient because greedy response is unused for training) |

MoE RL Stability

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R3 | Stabilizing MoE RL by Aligning Training and Inference Routers | Xiaomi | 2025 | paper | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 |
| IcePop | Small Leak Can Sink a Great Ship — Boost RL Training on MoE | Ant | 2025 | paper | (omnibus → RL-MoE-Stable) |
| TIS | Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | UCSD | 2025 | paper | (omnibus → RL-MoE-Stable) |
| KAT | KAT-Coder Tech Report | Kuaishou | 2026 | blog | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? |

Optimization Algorithms (GRPO Family + Classics)

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| COPO | Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents | Tencent | 2026 | paper, GitHub | COPO:基于认知模式的 Step-Level RL 优化 |
| GiGPO | Group-in-Group Policy Optimization for LLM Agent Training | NTU, Skywork AI | 2025 | paper, GitHub | GiGPO:双层级优势函数驱动的 Agent RL 新范式 |
| GRPO | DeepSeekMath: Pushing the Limits of Mathematical Reasoning | DeepSeek | 2024 | paper | (covered in DAPO and R1) |
| DAPO | DAPO: An Open-Source LLM RL System at Scale | ByteDance Seed | 2025 | paper, GitHub | DAPO:为 GRPO 锦上添四点花 |
| Dr.GRPO | Understanding R1-Zero-Like Training: A Critical Perspective | Sea AI Lab | 2025 | paper, GitHub | 异曲同工的 Dr.GRPO |
| VAPO | VAPO: Efficient and Reliable RL for Advanced Reasoning | ByteDance Seed | 2025 | paper | VAPO:基于价值方法的新突破 |
| CISPO | MiniMax-M1: Scaling Test-Time Compute Efficiently | MiniMax | 2025 | paper, GitHub | GRPO 优化在继续:CISPO 和熵 |
| GSPO | Group Sequence Policy Optimization | Qwen | 2025 | paper | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 |
| GMPO | Geometric-Mean Policy Optimization | UCAS, Microsoft | 2025 | paper, GitHub | (omnibus → Token-Level-GSPO-GMPO) |
| GTPO | GTPO: Trajectory-Based Policy Optimization in LLMs | | 2025 | paper | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 |
| Reinforce++ | REINFORCE++: Stabilizing Critic-Free Policy Optimization | OpenRLHF | 2025 | paper, GitHub | Reinforce++ 和它的 KL Loss 选择 |
| KimiRL | Kimi k1.5: Scaling RL with LLMs | Kimi | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| AGAPO | EXAONE 4.0: Unified LLM Integrating Non-reasoning and Reasoning Modes | LG AI | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| K-EXAONE | EXAONE-2 Tech Report | LG AI | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| MOPD | MiMo-V2-Flash Technical Report | Xiaomi | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| SAPO | Soft Adaptive Policy Optimization | Qwen | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| DCPO | DCPO: Dynamic Clipping Policy Optimization | Baichuan | 2025 | paper, GitHub | (described in GRPO-Clip: adaptive clip bounds based on token prior probability — expanding exploration room for low-probability tokens) |
| OPO | On-Policy RL with Optimal Reward Baseline | Microsoft | 2025 | paper | (described in Token-Level-GSPO-GMPO: optimal reward baseline minimizes gradient variance; contrasted in GTPO: focuses on advantage/reward level rather than token level) |
| SRPO | SRPO: Cross-Domain Implementation of Large-Scale RL on LLM | Kuaishou | 2025 | paper, HF | (contrasted in Token-Level-GSPO-GMPO: historical resampling retains key samples to improve sample efficiency) |
| DPO | Direct Preference Optimization | Stanford | 2024 | paper | (compared inside DPO-Data) |
| PPO | Proximal Policy Optimization Algorithms | OpenAI | 2017 | paper | (extended in VAPO: Value-based Augmented PPO with GAE refinements / value-pretraining; contrasted in Reinforce++: PPO with critic vs critic-free Reinforce-style) |
| REINFORCE | Simple Statistical Gradient-Following Algorithms | Northeastern | 1992 | paper | (extended in CISPO: REINFORCE + IS → CISPO loss; framed in Open-LLM-RL-ShowCase: REINFORCE-with-baseline as analytic frame for all GRPO variants) |

Pre-R1 Foundations (Historical Anchors)

Pre-R1 RL × NLP works the author considers personally important. Kept here for sentimental and historical reasons rather than as active reading. The author's full reflection lives in 《通向 AGI 的技术路径:多模态、强化学习与新架构的交汇点》: "When RL4LMs came out in 2022, I was so excited I couldn't sleep that night; I read their code right away."

| Abbr | Title | From | Year | Link | Note |
|---|---|---|---|---|---|
| RL4LMs / NLPO | RL (Not) for NLP: Benchmarks, Baselines, Building Blocks | Allen | 2022 | paper, GitHub | Personally cited milestone in AI-Future-Framework — first felt RL × NLP could really land |
| FTHP | Fine-Tuning Language Models from Human Preferences | OpenAI | 2020 | paper, GitHub | OpenAI's earliest RLHF experiment; the seed of InstructGPT/ChatGPT |
| Quark | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen | 2022 | paper, GitHub | Early attempt at RL for controllable generation × unlearning — niche but conceptually clean |
| DT | Decision Transformer: RL via Sequence Modeling | Berkeley | 2021 | paper, GitHub | The "RL = sequence modeling" reframing — a parallel branch that diverged from the LLM-RL trunk |

Frontier RL — Boundary, Process Reward & Experience

Pure RL frontier: where the training loop itself is being pushed (boundary, process reward, experience-as-data, planning-as-data).

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| DeepSeek-V3.2 Post-train | DeepSeek-V3.2 Tech Report | DeepSeek | 2025 | paper | DeepSeek V3.2 后训练:稳定压倒一切 |
| RL Boundary (Yue) | Does RL Really Incentivize Reasoning Capacity Beyond the Base Model? | | 2025 | paper | RL 究竟能不能突破 Base 边界 |
| Invisible Leash | Invisible Leash | Wu et al. | 2025 | paper | (omnibus → RL-Are-You-OK) |
| ProRL | ProRL: Prolonged RL Expands Reasoning Boundaries | NVIDIA | 2025 | paper | (omnibus → RL-Are-You-OK) |
| DELTA | DELTA: Dense Process Reward for RL Boundary Extrapolation | | 2025 | | (omnibus → RL-Are-You-OK) |
| ERL | Experience-as-RL | | 2026 | paper | RL 新范式:从经验到更高质量数据 |
| MR-Search | MR-Search: Meta-Reasoning Search | | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| OEL | Open-Ended Learning | | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| LEPA | LEPA: Learn to Plan before Answering | | 2025 | paper | 从「会答」到「会想」:Planning as Data 与思考范式重构 |
| Self-Steering | Self-Steering | | 2025 | paper | (omnibus → Think-Strategy) |

Beyond RL — Training-Free / Behavior Shaping / Real-time PEFT

Post-RL directions covered in the same blogs. The paradigm is moving away from classical RL training, into context, behavior, and parameter-efficient adaptation. They are not RL by the textbook definition, but they share the same goal — shape model behavior — and several of them (Activation Steering, Context Engineering) are the upstream signals that Training-Free RL only later named explicitly.
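
Since this section keeps returning to "shape behavior without touching weights", here is a minimal PyTorch sketch of the activation-steering idea: add a fixed steering vector to one layer's hidden states via a forward hook. The toy module, the steering vector, and the `alpha` scale below are placeholders for illustration, not any specific paper's setup:

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, steering_vec: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook that shifts the layer's output along a steering direction.
    No weights change; behavior is shaped purely at inference time."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vec.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Toy stand-in for "one transformer block"; the same pattern applies to a real model's layers.
block = nn.Linear(16, 16)
vec = torch.randn(16)   # e.g. mean(activations | behavior present) - mean(activations | absent)
handle = add_steering_hook(block, vec, alpha=2.0)
print(block(torch.randn(1, 16)).shape)   # steered output
handle.remove()                          # detach the hook to restore original behavior
```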

| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| TRT | Test-time Recursive Thinking: Self-Improvement without External Feedback | Microsoft | 2026 | paper | Training-Free RL:当训练不再更新参数,而是更新上下文 |
| Training-Free GRPO | Training-Free Group Relative Policy Optimization | | 2025 | paper | (omnibus → Training-Free RL) |
| MemAPO | MemAPO: Memory-Augmented Policy Optimization | | 2026 | paper | (omnibus → Training-Free RL) |
| Update-Free Steering | Update-Free On-Policy Steering via Verifiers | | 2026 | paper | (omnibus → Training-Free RL) |
| Activation Engineering | Steering Language Models With Activation Engineering | | 2023 | paper | 激活诱导 LLM 指令跟随 |
| Activation Steering (IF) | Improving Instruction-Following in Language Models through Activation Steering | Microsoft | 2024 | paper | (omnibus → 激活诱导 LLM 指令跟随) |
| Context Engineering | Context Engineering for AI Agents: Lessons from Building Manus | Manus | 2025 | blog | 重识 LLM 法则:上下文工程与数据进化 |
| MiCA | MiCA: Minor-Component Adaptation | | 2026 | paper | 实时学习:极致高效的子空间微调 |
| TinyLoRA | TinyLoRA | | 2026 | | (omnibus → Real-time-Learning-from-PEFT) |

Chronological Blog Timeline

All blog posts in publishing order, with the author's one-sentence takeaway. Use this when you want to follow the narrative arc rather than search by paper.

| Date | Track | Blog (Chinese) | One-sentence takeaway |
|---|---|---|---|
| 2025-02-17 | 01 | DeepSeek R1 深度技术解析及其影响 | Data sets the ceiling, algorithms approach it; pure rule-based RL works. |
| 2025-02-18 | 01 | 少量高质量数据 SFT 激活推理 | LIMO/s1: small high-quality SFT activates reasoning, doesn't teach it. |
| 2025-02-27 | 01 | R1 相关:RL 数据选择与 Scaling | LIMR/ORZ: less-is-more applies to RL data, not just SFT. |
| 2025-03-02 | 01 | R1 相关:DPO 数据选择与 DPO 等 RL 算法 | Online-DPO can rival PPO when paired with the right data pipeline. |
| 2025-03-15 | 01 | DeepSeek R1 后 LLM 新范式 | The post-R1 path forks into multiple parallel lines (length, scaling, MRT, …). |
| 2025-03-19 | 02 | DAPO:为 GRPO 锦上添四点花 | DAPO = Clip-Higher + Dynamic Sampling + Token-Level Loss + Overlong Reward Shaping. |
| 2025-03-28 | 02 | 异曲同工的 Dr.GRPO | Dr.GRPO removes the length & std normalization biases hidden in vanilla GRPO. |
| 2025-04-10 | 01 | R1-Zero 的进一步理解和探索 | R1-Zero behavior depends heavily on base model; "Aha moment" is partly base-pretrain artifact. |
| 2025-04-19 | 02 | VAPO:基于价值方法的新突破 | Value-based methods come back to compete with critic-free GRPO. |
| 2025-04-26 | 01 | Yarz-Logic:R1-Zero 相关实验报告 | Hands-on Logic-RL replication: where R1-Zero's edges are in practice. |
| 2025-05-01 | 01 | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 | Seed-Thinking + Qwen3 are the two most complete industrial post-R1 recipes. |
| 2025-06-09 | 03 | Reward Model 建模 | General-domain RM needs principles+critique, not a single scalar (DeepSeek-GRM). |
| 2025-06-19 | 02 | GRPO 优化在继续:CISPO 和熵 | CISPO shows clip is not just stability — it shapes the explore/exploit edge. |
| 2025-07-01 | 05 | 激活诱导 LLM 指令跟随 | Activation Steering: behavior shaping without weight updates — the prequel to Update-Free Steering. |
| 2025-07-13 | 03 | Reward 数据如何塑造与激发推理策略 | Good reward data unlocks pre-existing strategies; even spurious rewards can do this. |
| 2025-07-25 | 02 | GiGPO:双层级优势函数驱动的 Agent RL 新范式 | Agent RL needs hierarchical (group-in-group) advantages for proper credit assignment. |
| 2025-07-27 | 05 | 重识 LLM 法则:上下文工程与数据进化 | "Everything is context" — the early manifesto behind Training-Free RL. |
| 2025-08-14 | 02 | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 | Token-level vs sequence-level is THE axis of the GRPO family. |
| 2025-08-30 | 02 | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 | GTPO: trajectory-level view exposes more of GRPO's hidden assumptions. |
| 2025-09-12 | 02 | GRPO-Clip:DAPO/GMPO/Spurious Rewards 等 clip 变体对照 | Side-by-side: Clip-Higher vs Clip-Wider vs Spurious Rewards on the same axis. |
| 2025-10-24 | 02 | Reinforce++ 和它的 KL Loss 选择 | KL Loss choice (k2 vs k3) matters more than usually credited. |
| 2025-11-11 | 03 | 无验证器 RL 与 Reference 的妙用 | Without verifiers, use PPL / reference-likelihood / reverse-self-eval as proxies. |
| 2025-11-29 | 03 | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM | Reward should model "where the answer is wrong"; generation ↔ verification co-evolve. |
| 2025-12-03 | 02 | DeepSeek V3.2 后训练:稳定压倒一切 | Industry's MoE post-train recipe: stability above all else. |
| 2025-12-21 | 03 | 无验证 RL——当模型只能相信自己 | All internal-feedback methods compress entropy → exploration crisis sooner or later. |
| 2025-12-31 | 05 | RL 究竟能不能突破 Base 边界 | Most RL is sampling polish; true extrapolation needs edge data + process reward. |
| 2026-01-14 | 02 | 开源大模型 RL Showcase:Kimi/EXAONE/MiMo/MiniMax/Qwen | 5 industrial GRPO variants compared side-by-side — every team is patching the same holes. |
| 2026-01-17 | 04 | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 | Train-infer router mismatch is the surface; R3 / IcePop / TIS each take a different angle. |
| 2026-01-22 | 04 | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? | Even recomputing logprobs on MoE drifts; the deeper cause is sampling noise, not routing. |
| 2026-03-24 | 05 | Training-Free RL:当训练不再更新参数,而是更新上下文 | Advantage in text/context, not in weight space — fixed model can still "RL". |
| 2026-03-29 | 05 | RL 新范式:从经验到更高质量数据 | The loop becomes "trajectory → information gain → re-supervision". |
| 2026-04-11 | 05 | 实时学习:极致高效的子空间微调 | MiCA/TinyLoRA: pluggable real-time learning by occupying the minor singular directions. |
| 2026-04-17 | 05 | 从「会答」到「会想」:Planning as Data 与思考范式重构 | Reasoning becomes a data format; the next battle is how to construct planning data. |

Appendix

① DeepSeek-R1 Reproduction Resources

② Citation / Feedback

If this index or the linked blog posts are useful, leave a note on GitHub Issues, or cite yam.gift when referencing.

"Don't chase the wind — chase the long view; measure one year by the yardstick of ten." — 长琴
