A curated, opinionated index of post-R1 LLM × Reinforcement Learning. Every paper is read, classified, cross-linked, and connected back to a Chinese deep-dive blog post by hscspring.
I consider RL to be a pivotal technology in the field of AI, and NLP (particularly LLM) to be a direction well worth exploring. This repo focuses on post-R1 LLM RL specifically.
Awesome lists are retrievers. This repo is a curator.
- Has a verdict, not a vote. Each of the 5 tracks ends with the author's bottom-line judgment ("data sets the ceiling, algorithms approach it"; "most RL is sampling polish, not extrapolation"; "all internal-feedback methods compress entropy → exploration crisis"), not a paper summary.
- Cross-links across papers. RLOO ≡ GRPO with Fnorm=1 (derived in GiGPO); REINFORCE + IS → CISPO loss (extended in CISPO); Activation Steering & Context Engineering are the upstream signals of Training-Free RL. These connections only surface when one person reads everything (the first of them is sketched right after this list).
- A narrative, not a database. The Chronological Blog Timeline reads as a year-and-a-half editorial arc through post-R1 RL × LLM: Feb-2025 R1 → GRPO family → Reward modeling → MoE stability → Training-Free RL.
- Personal historical anchors. Pre-R1 works (RL4LMs, FTHP, Quark, DT) live in their own corner with a one-line note on why I personally cared, not as bibliography filler.
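To make the first of those cross-links concrete, here is a short derivation sketch, assuming Fnorm denotes GRPO's group normalizer (the notation below is mine, not GiGPO's exact symbols):

```latex
\begin{align*}
A_i^{\mathrm{GRPO}} &= \frac{r_i - \bar r}{F_{\mathrm{norm}}},
  \qquad \bar r = \frac{1}{G}\sum_{j=1}^{G} r_j, \\
A_i^{\mathrm{RLOO}} &= r_i - \frac{1}{G-1}\sum_{j \neq i} r_j
  = \frac{G}{G-1}\,(r_i - \bar r).
\end{align*}
% With F_norm = 1, A_i^GRPO reduces to r_i - \bar r, which matches A_i^RLOO up
% to the constant factor G/(G-1), a scale the learning rate absorbs.
```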
Topics covered: DeepSeek-R1 reproduction · GRPO family (DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++) · PPO · RLHF · DPO · Reward Modeling · Verifier-Free RL · MoE RL Stability · Training-Free RL · Activation Steering · Agentic RL.
Blog posts on yam.gift are grouped into 5 tracks. Each verdict below is the author's own bottom-line judgment, not a paper summary.
Parsing the original report, then digging into the data / paradigm / experiment side.
- Core thesis: data sets the ceiling, algorithms only approach it. Pure rule-based RL is finally validated as a viable path.
- The frame that survived: "Base+SFT / Base+RL / SFT+RL" can absorb almost all subsequent variations.
- Loose ends still being chased: R1-Zero behavior differs sharply across base models (SimpleRL-Zoo, Yarz-Logic); LIMO/s1 confirm that "less is more" works by activating existing capability, not by teaching new skills.
Every GRPO variant — DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++ / industry showcases.
- Core thesis: every variant is paying off the same engineering debt — token- vs. sequence-level aggregation, tighter or wider clipping, length normalization, KL estimator choice (k2 vs k3), global advantage normalization. A minimal sketch of these knobs follows this list.
- Convergence: the GRPO objective increasingly looks like a "people's edition" of PPO with global advantage and no critic.
- Sub-thread: clip is not just a stability knob — it directly shapes the explore/exploit boundary. Spurious Rewards, Clip-Higher (DAPO), Clip-Wider (GMPO) are all moves on the same axis.
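A hedged, minimal sketch of a GRPO-style loss with those knobs exposed. Shapes, defaults, and names are my own assumptions for illustration, not any paper's reference implementation:

```python
import torch

def grpo_family_loss(
    logp_new,         # (G, T) per-token log-probs under the current policy
    logp_old,         # (G, T) per-token log-probs under the sampling policy
    logp_ref,         # (G, T) per-token log-probs under the reference policy
    rewards,          # (G,)   one scalar outcome reward per sampled response
    mask,             # (G, T) 1 for response tokens, 0 for padding
    eps_low=0.2,      # lower clip bound
    eps_high=0.28,    # upper clip bound (DAPO-style Clip-Higher when > eps_low)
    seq_level=False,  # token-level ratio (GRPO/DAPO) vs sequence-level (GSPO-style)
    norm_std=True,    # divide the advantage by the group std (Dr.GRPO drops this)
    kl_type="k3",     # "k2" or "k3" KL estimator against the reference policy
    kl_coef=0.0,
):
    # Group-relative advantage: one scalar per response, broadcast over tokens.
    adv = rewards - rewards.mean()
    if norm_std:
        adv = adv / (rewards.std() + 1e-6)
    adv = adv[:, None]

    # Importance ratio: per token, or one length-averaged ratio per sequence.
    if seq_level:
        seq_logr = ((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1)
        ratio = seq_logr.exp()[:, None].expand_as(mask)
    else:
        ratio = (logp_new - logp_old).exp()

    # PPO-style clipped surrogate; eps_low/eps_high are the "tighter/wider" axis.
    pg = -torch.minimum(ratio * adv,
                        torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv)

    # KL penalty: k2 = (log r)^2 / 2, k3 = r - 1 - log r, with r = pi_ref / pi_new.
    logr_ref = logp_ref - logp_new
    kl = 0.5 * logr_ref.pow(2) if kl_type == "k2" else logr_ref.exp() - 1 - logr_ref

    per_token = (pg + kl_coef * kl) * mask
    # Length normalization: per-response token mean here (vanilla GRPO); DAPO and
    # Dr.GRPO instead average over all tokens in the group.
    return (per_token.sum(-1) / mask.sum(-1)).mean()
```

Flipping `seq_level`, `norm_std`, `eps_high`, and `kl_type` roughly walks from vanilla GRPO toward Dr.GRPO-, DAPO-, and GSPO-flavored variants, which is the sense in which they all pay the same debt.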
RM / RM-Data / Verifier-Free / Self-Verified / Verify-Free RL.
- Drift 1 (modeling): from single scalar → "principles + critique + self-verification" (DeepSeek-GRM → DeepSeekMath-V2).
- Drift 2 (data): good reward data is more like unlocking the base model's existing capabilities than teaching new ones (Skywork-Reward-V2, Spurious Rewards).
- Drift 3 (Verify-Free): when no external verifier exists, all internal-feedback methods (TTRL / EM / RENT / EMPO / Intuitor) end up compressing entropy. Long-term, an exploration crisis is inevitable — ETTRL / Darling / EVOL-RL / RESTRAIN are all band-aids on the same wound (two representative internal-feedback signals are sketched after this list).
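As a hedged illustration of why these internal-feedback methods converge on entropy compression, here are two representative signals in sketch form (the function and its names are mine; neither is any paper's exact formulation):

```python
import torch
from collections import Counter

def internal_rewards(answers, token_logprobs):
    """answers:        list of G final answers extracted from sampled responses
    token_logprobs:    list of G tensors, per-token log-probs of each response
    """
    # TTRL-style: pseudo-label = majority vote over the group; reward 1 if a
    # response agrees with its own group's consensus, else 0.
    majority, _ = Counter(answers).most_common(1)[0]
    vote_reward = torch.tensor([float(a == majority) for a in answers])

    # EM/RENT-style: reward = model confidence, i.e. negative mean token
    # "surprise". Maximizing it is entropy minimization on the model's own output.
    confidence_reward = torch.tensor([lp.mean().item() for lp in token_logprobs])

    # Both signals are largest exactly where the current policy already puts
    # probability mass, so optimizing them sharpens (compresses) the output
    # distribution rather than exploring beyond it.
    return vote_reward, confidence_reward
```

Both signals reward the model for agreeing with itself, so repeated optimization sharpens the output distribution; the ETTRL / Darling / EVOL-RL / RESTRAIN line is about re-injecting the diversity this removes.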
R3 / IcePop / TIS / KAT-Coder.
- Surface diagnosis: train-infer router mismatch is what everyone first noticed.
- Deeper cause: logprob estimation noise on MoE is not neutral; even recomputed logprobs drift. The importance ratio (π_new/π_old), the heart of GRPO, is silently diluted on MoE; a truncated-importance-style reweighting is sketched after this list.
- Open bet: GSPO/GMPO's sequence-level + geometric-mean might be MoE-RL-friendly; not yet validated at production scale.
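A hedged sketch of the dilution point and of a TIS-style truncated-importance reweighting. The names, shapes, and cap value are my assumptions, not the paper's exact recipe:

```python
import torch

def moe_corrected_ratio(logp_new, logp_old_train, logp_old_infer,
                        cap=2.0, eps_low=0.2, eps_high=0.2):
    """logp_new:        (T,) per-token log-probs recomputed now by the training engine
    logp_old_train:     (T,) per-token log-probs recomputed by the training engine at
                        sampling time (what GRPO treats as pi_old)
    logp_old_infer:     (T,) per-token log-probs reported by the inference engine that
                        actually generated the tokens
    """
    # The ratio GRPO optimizes: both numerator and denominator come from the
    # training engine, so train-infer mismatch never shows up inside it.
    ratio = torch.clamp((logp_new - logp_old_train).exp(),
                        1 - eps_low, 1 + eps_high)

    # On MoE the two engines can route differently, so pi_old_train != pi_old_infer
    # even at identical parameters. A TIS-style fix reweights each token by a
    # truncated importance weight toward the distribution the data really came from.
    correction = (logp_old_train - logp_old_infer).exp().clamp(max=cap)
    return correction * ratio
```

Whether GSPO/GMPO's sequence-level geometric averaging would make such a correction unnecessary on MoE is exactly the open bet above.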
Training-Free / Experiential / Real-time / Planning / RL Boundary.
- Boundary realization: most RL is just sampling polish (Yue), not true pass@k extrapolation.
- Counter-evidence: ProRL / DELTA show extrapolation is possible — but only with edge data + process reward + avoiding "all-zero pass@k" cold start.
- Upstream signals (already in 2025 H2): "Activation Steering" and "Context Engineering" both pointed in this direction before Training-Free RL had a name — behavior can be shaped without touching weights.
- New paradigm A — Training-Free RL: the advantage lives in text/context, not in weight space (TRT, Training-Free GRPO, MemAPO, Update-Free Steering); a sketch of the loop follows this list.
- New paradigm B — Experience-as-RL: the loop becomes "trajectory → information gain → re-supervision". Reflection, meta-search and open-ended learning are all data-construction tricks in disguise.
- Higher-level question: "reasoning" should be studied as a data format, not as an RL task (Think-Strategy / LEPA).
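A minimal sketch of paradigm A's loop, assuming a hypothetical `llm.generate` interface and a caller-supplied `score_fn`; it illustrates the idea, not any specific paper's implementation:

```python
def training_free_rl_step(llm, prompt, score_fn, memory, group_size=8):
    """One "update" with frozen weights: the learning lands in a textual
    experience bank that is prepended to future prompts."""
    context = "\n".join(memory)

    # 1. Roll out a group of responses, exactly as GRPO would.
    group = [llm.generate(context + "\n" + prompt) for _ in range(group_size)]

    # 2. Group-relative signal: score each rollout against its own group.
    scored = sorted(((score_fn(prompt, g), g) for g in group), key=lambda p: p[0])
    baseline = sum(s for s, _ in scored) / len(scored)
    worst, best = scored[0][1], scored[-1][1]

    # 3. The "gradient step": no parameters change; the frozen model distills
    #    what separated above-baseline rollouts from below-baseline ones into a
    #    reusable lesson stored in context, i.e. the advantage lives in text.
    lesson = llm.generate(
        "Compare the better and worse attempts below and state one reusable lesson.\n"
        f"Better:\n{best}\n\nWorse:\n{worst}"
    )
    memory.append(lesson)
    return memory, baseline
```

Paradigm B differs mainly in where the distilled lesson goes: back into a supervision set rather than into the prompt.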
| GitHub | From | Year | Description |
|---|---|---|---|
| prime-rl | PrimeIntellect-ai | 2025 | Decentralized large-scale RL training framework |
| PRIME | PRIME-RL | 2025 | Scalable RL recipe for reasoning |
| rStar | Microsoft | 2025 | Self-evolved deep reasoning for SLMs |
| veRL | ByteDance | 2024 | LLM RL training framework (Volcano Engine) |
| trl | HuggingFace | 2024 | Train language models with RL |
| RL4LMs | Allen | 2022 | Aligning LMs to human preference via RL |
| alignment-handbook | HuggingFace | 2023 | Recipes for aligning to human/AI preference |
Notation for the My Notes column:
- `[short title](url)` — full Chinese deep-dive available (yam.gift blog or book chapter)
- `(omnibus → ...)` — covered as a main thread in a survey/overview blog
- `(<verb: derived/extended/contrasted/described/framed/criticized/...> in [blog]: ...)` — touched on as a sub-topic inside another deep-dive, with a one-line pointer; multiple pointers can be chained with `;`
- `to-write` — not yet written
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R1 | DeepSeek-R1: Incentivizing Reasoning Capability via RL | DeepSeek | 2025 | paper | DeepSeek R1 深度技术解析及其影响 |
| LIMO | LIMO: Less Is More for Reasoning | SJTU | 2025 | paper | 少量高质量数据 SFT 激活推理 |
| s1 | s1: Simple test-time scaling | Stanford | 2025 | paper | (omnibus → SFT-Data) |
| R1 Survey | The R1-era LLM new paradigm | — | 2025 | — | DeepSeek R1 后 LLM 新范式 |
| R1-Zero+ | Further understanding of R1-Zero | — | 2025 | — | R1-Zero 的进一步理解和探索 |
| SimpleRL-Zoo | SimpleRL-Zoo: R1-Zero RL across diverse base models | HKUST | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| FastCuRL | FastCuRL: Curriculum RL with Stage-wise Context Scaling | — | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| Logic-RL | Logic-RL: Unleashing LLM Reasoning with Rule-Based RL | — | 2025 | paper | Yarz-Logic:R1-Zero 相关实验报告 |
| Seed-Thinking | Seed-Thinking-v1.5: Advancing Superb Reasoning | ByteDance | 2025 | paper | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 |
| Qwen3 | Qwen3 Technical Report | Qwen | 2025 | paper | (omnibus → Seed-Thinking-Qwen3) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| LIMR | LIMR: Less is More for RL Scaling | GAIR-NLP | 2025 | paper, GitHub | R1 相关:RL 数据选择与 Scaling |
| ORZ | Open-Reasoner-Zero | StepFun | 2025 | paper, GitHub | (omnibus → PPO-Data) |
| Online-DPO-R1 | Online-DPO-R1: Effective Reasoning Without the PPO Overhead | Salesforce | 2025 | paper, GitHub | R1 相关:DPO 数据选择与 DPO 等 RL 算法 |
| LIMD | LIMD: Less is More on DPO Data | — | 2025 | — | (omnibus → DPO-Data) |
| OREAL | Exploring the Limit of Outcome Reward for Math Reasoning | InternLM | 2025 | paper, GitHub | to-write |
| DeepScaleR | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| L1 / LCPO | Controlling How Long A Reasoning Model Thinks With RL | CMU | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| MRT | Optimizing Test-Time Compute via Meta RL Fine-Tuning | CMU | 2025 | paper, GitHub | to-write |
| ScalingLaw | Value-Based Deep RL Scales Predictably | Berkeley | 2025 | paper | to-write |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| PRIME | Process Reinforcement through Implicit Rewards | PRIME-RL | 2025 | paper, GitHub | (described in R1-New-Paradigm: implicit PRM — trained as an ORM, used as a PRM) |
| rStar-Math | rStar-Math: Small LLMs Can Master Math Reasoning | Microsoft | 2025 | paper, GitHub | (described in R1-New-Paradigm: rule-based verification on intermediate results at key steps, via Python code execution) |
| rStar | rStar: Mutual Reasoning Makes Smaller LLMs Stronger | Microsoft | 2024 | paper, GitHub | to-write |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| GRM | Inference-Time Scaling for Generalist Reward Modeling | DeepSeek | 2025 | paper | Reward Model 建模 |
| Skywork-Reward-V2 | Skywork-Reward-V2 | Skywork | 2025 | paper | Reward 数据如何塑造与激发推理策略 |
| Spurious Rewards | Spurious Rewards: Rethinking Training Signals in RLVR | Allen | 2025 | paper | (omnibus → RM-Data / GRPO-Clip) |
| ICM | Anthropic Internal Coherence Maximization | Anthropic | 2025 | blog | (omnibus → RM-Data) |
| DeepSeekMath-V2 | Towards Self-Verifiable Mathematical Reasoning | DeepSeek | 2025 | paper, GitHub | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM |
One blog (Verify-Free RL) covers the algorithms below — listed individually for searchability.
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| NOVER | NOVER: Incentive Training without External Verifiers | — | 2025 | paper | 无验证器 RL 与 Reference 的妙用 |
| TTRL | TTRL: Test-Time Reinforcement Learning | — | 2025 | paper | 无验证 RL——当模型只能相信自己 |
| SRT | Can Large Reasoning Models Self-Train? | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EM | The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning | — | 2025 | paper | (omnibus → Verify-Free RL; covers EM-FT / EM-RL / EM-INF) |
| RENT | Maximizing Confidence Alone Improves Reasoning | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EMPO | Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | — | 2025 | paper | (omnibus → Verify-Free RL) |
| Intuitor | Learning to Reason without External Rewards | — | 2025 | paper | (omnibus → Verify-Free RL) |
| ETTRL | ETTRL: Balancing Exploration and Exploitation via Entropy Mechanism | — | 2025 | paper | (omnibus → Verify-Free RL) |
| Darling | Jointly Reinforcing Diversity and Quality in Language Model Generations | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EVOL-RL | Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation | — | 2025 | paper | (omnibus → Verify-Free RL) |
| RESTRAIN | RESTRAIN: From Spurious Votes to Signals — Self-Driven RL with Self-Penalization | — | 2025 | paper | (omnibus → Verify-Free RL) |
| No Free Lunch | No Free Lunch: Rethinking Internal Feedback for LLM Reasoning | — | 2025 | paper | (theoretical critique of Verify-Free; omnibus → Verify-Free RL) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| RLHF | Training language models to follow instructions with human feedback | OpenAI | 2022 | paper | HuggingLLM 1.3.3:RLHF 流程与思想 |
| RLOO | Back to Basics: Revisiting REINFORCE Style Optimization for RLHF | Cohere | 2024 | paper | (derived in GiGPO: GRPO with Fnorm=1 ≡ RLOO) |
| ReMax | ReMax: A Simple, Effective, and Efficient RL Method for LLM | CUHK | 2024 | paper | (contrasted in Reinforce++: greedy-baseline variant — inefficient because the greedy response is unused for training) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R3 | Stabilizing MoE RL by Aligning Training and Inference Routers | Xiaomi | 2025 | paper | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 |
| IcePop | Small Leak Can Sink a Great Ship — Boost RL Training on MoE | Ant | 2025 | paper | (omnibus → RL-MoE-Stable) |
| TIS | Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | UCSD | 2025 | paper | (omnibus → RL-MoE-Stable) |
| KAT | KAT-Coder Tech Report | Kuaishou | 2026 | blog | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| COPO | Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents | Tencent | 2026 | paper, GitHub | COPO:基于认知模式的 Step-Level RL 优化 |
| GiGPO | Group-in-Group Policy Optimization for LLM Agent Training | NTU, Skywork AI | 2025 | paper, GitHub | GiGPO:双层级优势函数驱动的 Agent RL 新范式 |
| GRPO | DeepSeekMath: Pushing the Limits of Mathematical Reasoning | DeepSeek | 2024 | paper | (covered in DAPO and R1) |
| DAPO | DAPO: An Open-Source LLM RL System at Scale | ByteDance Seed | 2025 | paper, GitHub | DAPO:为 GRPO 锦上添四点花 |
| Dr.GRPO | Understanding R1-Zero-Like Training: A Critical Perspective | Sea AI Lab | 2025 | paper, GitHub | 异曲同工的 Dr.GRPO |
| VAPO | VAPO: Efficient and Reliable RL for Advanced Reasoning | ByteDance Seed | 2025 | paper | VAPO:基于价值方法的新突破 |
| CISPO | MiniMax-M1: Scaling Test-Time Compute Efficiently | MiniMax | 2025 | paper, GitHub | GRPO 优化在继续:CISPO 和熵 |
| GSPO | Group Sequence Policy Optimization | Qwen | 2025 | paper | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 |
| GMPO | Geometric-Mean Policy Optimization | UCAS, Microsoft | 2025 | paper, GitHub | (omnibus → Token-Level-GSPO-GMPO) |
| GTPO | GTPO: Trajectory-Based Policy Optimization in LLMs | — | 2025 | paper | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 |
| Reinforce++ | REINFORCE++: Stabilizing Critic-Free Policy Optimization | OpenRLHF | 2025 | paper, GitHub | Reinforce++ 和它的 KL Loss 选择 |
| KimiRL | Kimi k1.5: Scaling RL with LLMs | Kimi | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| AGAPO | EXAONE 4.0: Unified LLM Integrating Non-reasoning and Reasoning Modes | LG AI | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| K-EXAONE | EXAONE-2 Tech Report | LG AI | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| MOPD | MiMo-V2-Flash Technical Report | Xiaomi | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| SAPO | Soft Adaptive Policy Optimization | Qwen | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| DCPO | DCPO: Dynamic Clipping Policy Optimization | Baichuan | 2025 | paper, GitHub | (described in GRPO-Clip: adaptive clip bounds based on token prior probability — expanding exploration room for low-probability tokens) |
| OPO | On-Policy RL with Optimal Reward Baseline | Microsoft | 2025 | paper | (described in Token-Level-GSPO-GMPO: optimal reward baseline minimizes gradient variance; contrasted in GTPO: focuses on advantage/reward level rather than token level) |
| SRPO | SRPO: Cross-Domain Implementation of Large-Scale RL on LLM | Kuaishou | 2025 | paper, HF | (contrasted in Token-Level-GSPO-GMPO: historical resampling retains key samples to improve sample efficiency) |
| DPO | Direct Preference Optimization | Stanford | 2024 | paper | (compared inside DPO-Data) |
| PPO | Proximal Policy Optimization Algorithms | OpenAI | 2017 | paper | (extended in VAPO: Value-based Augmented PPO with GAE refinements / value-pretraining; contrasted in Reinforce++: PPO with critic vs critic-free Reinforce-style) |
| REINFORCE | Simple Statistical Gradient-Following Algorithms | Northeastern | 1992 | paper | (extended in CISPO: REINFORCE + IS → CISPO loss; framed in Open-LLM-RL-ShowCase: REINFORCE-with-baseline as analytic frame for all GRPO variants) |
Pre-R1 RL × NLP works the author considers personally important. Kept here for sentimental and historical reasons rather than as active reading. The author's full reflection: 《通向 AGI 的技术路径:多模态、强化学习与新架构的交汇点》 — "When RL4LMs came out in 2022 I was too excited to sleep that night, and read their code the first chance I got."
| Abbr | Title | From | Year | Link | Note |
|---|---|---|---|---|---|
| RL4LMs / NLPO | RL (Not) for NLP: Benchmarks, Baselines, Building Blocks | Allen | 2022 | paper, GitHub | Milestone personally cited in AI-Future-Framework — the first time RL × NLP felt like it could really land |
| FTHP | Fine-Tuning Language Models from Human Preferences | OpenAI | 2020 | paper, GitHub | OpenAI's earliest RLHF experiment; the seed of InstructGPT/ChatGPT |
| Quark | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen | 2022 | paper, GitHub | Early attempt at RL for controllable generation × unlearning — niche but conceptually clean |
| DT | Decision Transformer: RL via Sequence Modeling | Berkeley | 2021 | paper, GitHub | The "RL = sequence modeling" reframing — a parallel branch that diverged from the LLM-RL trunk |
Pure RL frontier: where the training loop itself is being pushed (boundary, process reward, experience-as-data, planning-as-data).
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| DeepSeek-V3.2 Post-train | DeepSeek-V3.2 Tech Report | DeepSeek | 2025 | paper | DeepSeek V3.2 后训练:稳定压倒一切 |
| RL Boundary (Yue) | Does RL Really Incentivize Reasoning Capacity Beyond the Base Model? | — | 2025 | paper | RL 究竟能不能突破 Base 边界 |
| Invisible Leash | The Invisible Leash: Why RLVR May Not Escape Its Origin | Wu et al. | 2025 | paper | (omnibus → RL-Are-You-OK) |
| ProRL | ProRL: Prolonged RL Expands Reasoning Boundaries | NVIDIA | 2025 | paper | (omnibus → RL-Are-You-OK) |
| DELTA | DELTA: Dense Process Reward for RL Boundary Extrapolation | — | 2025 | — | (omnibus → RL-Are-You-OK) |
| ERL | Experience-as-RL | — | 2026 | paper | RL 新范式:从经验到更高质量数据 |
| MR-Search | MR-Search: Meta-Reasoning Search | — | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| OEL | Open-Ended Learning | — | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| LEPA | LEPA: Learn to Plan before Answering | — | 2025 | paper | 从「会答」到「会想」:Planning as Data 与思考范式重构 |
| Self-Steering | Self-Steering | — | 2025 | paper | (omnibus → Think-Strategy) |
Post-RL directions covered in the same blogs. The paradigm is moving away from classical RL training, into context, behavior, and parameter-efficient adaptation. They are not RL by the textbook definition, but they share the same goal — shaping model behavior — and several of them (Activation Steering, Context Engineering) are the upstream signals that Training-Free RL only later named explicitly.
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| TRT | Test-time Recursive Thinking: Self-Improvement without External Feedback | Microsoft | 2026 | paper | Training-Free RL:当训练不再更新参数,而是更新上下文 |
| Training-Free GRPO | Training-Free Group Relative Policy Optimization | — | 2025 | paper | (omnibus → Training-Free RL) |
| MemAPO | MemAPO: Memory-Augmented Policy Optimization | — | 2026 | paper | (omnibus → Training-Free RL) |
| Update-Free Steering | Update-Free On-Policy Steering via Verifiers | — | 2026 | paper | (omnibus → Training-Free RL) |
| Activation Engineering | Steering Language Models With Activation Engineering | — | 2023 | paper | 激活诱导 LLM 指令跟随 |
| Activation Steering (IF) | Improving Instruction-Following in Language Models through Activation Steering | Microsoft | 2024 | paper | (omnibus → 激活诱导 LLM 指令跟随) |
| Context Engineering | Context Engineering for AI Agents: Lessons from Building Manus | Manus | 2025 | blog | 重识 LLM 法则:上下文工程与数据进化 |
| MiCA | MiCA: Minor-Component Adaptation | — | 2026 | paper | 实时学习:极致高效的子空间微调 |
| TinyLoRA | TinyLoRA | — | 2026 | — | (omnibus → Real-time-Learning-from-PEFT) |
All blog posts in publishing order, with the author's one-sentence takeaway. Use this when you want to follow the narrative arc rather than search by paper.
| Date | Track | Blog (Chinese) | One-sentence takeaway |
|---|---|---|---|
| 2025-02-17 | 01 | DeepSeek R1 深度技术解析及其影响 | Data sets the ceiling, algorithms approach it; pure rule-based RL works. |
| 2025-02-18 | 01 | 少量高质量数据 SFT 激活推理 | LIMO/s1: small high-quality SFT activates reasoning, doesn't teach it. |
| 2025-02-27 | 01 | R1 相关:RL 数据选择与 Scaling | LIMR/ORZ: less-is-more applies to RL data, not just SFT. |
| 2025-03-02 | 01 | R1 相关:DPO 数据选择与 DPO 等 RL 算法 | Online-DPO can rival PPO when paired with the right data pipeline. |
| 2025-03-15 | 01 | DeepSeek R1 后 LLM 新范式 | The post-R1 path forks into multiple parallel lines (length, scaling, MRT, …). |
| 2025-03-19 | 02 | DAPO:为 GRPO 锦上添四点花 | DAPO = Clip-Higher + Dynamic Sampling + Token-Level Loss + Overlong Reward Shaping. |
| 2025-03-28 | 02 | 异曲同工的 Dr.GRPO | Dr.GRPO removes the length & std normalization biases hidden in vanilla GRPO. |
| 2025-04-10 | 01 | R1-Zero 的进一步理解和探索 | R1-Zero behavior depends heavily on base model; "Aha moment" is partly base-pretrain artifact. |
| 2025-04-19 | 02 | VAPO:基于价值方法的新突破 | Value-based methods come back to compete with critic-free GRPO. |
| 2025-04-26 | 01 | Yarz-Logic:R1-Zero 相关实验报告 | Hands-on Logic-RL replication: where R1-Zero's edges are in practice. |
| 2025-05-01 | 01 | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 | Seed-Thinking + Qwen3 are the two most complete industrial post-R1 recipes. |
| 2025-06-09 | 03 | Reward Model 建模 | General-domain RM needs principles+critique, not a single scalar (DeepSeek-GRM). |
| 2025-06-19 | 02 | GRPO 优化在继续:CISPO 和熵 | CISPO shows clip is not just stability — it shapes the explore/exploit edge. |
| 2025-07-01 | 05 | 激活诱导 LLM 指令跟随 | Activation Steering: behavior shaping without weight updates — the prequel to Update-Free Steering. |
| 2025-07-13 | 03 | Reward 数据如何塑造与激发推理策略 | Good reward data unlocks pre-existing strategies; even spurious rewards can do this. |
| 2025-07-25 | 02 | GiGPO:双层级优势函数驱动的 Agent RL 新范式 | Agent RL needs hierarchical (group-in-group) advantages for proper credit assignment. |
| 2025-07-27 | 05 | 重识 LLM 法则:上下文工程与数据进化 | "Everything is context" — the early manifesto behind Training-Free RL. |
| 2025-08-14 | 02 | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 | Token-level vs sequence-level is THE axis of the GRPO family. |
| 2025-08-30 | 02 | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 | GTPO: trajectory-level view exposes more of GRPO's hidden assumptions. |
| 2025-09-12 | 02 | GRPO-Clip:DAPO/GMPO/Spurious Rewards 等 clip 变体对照 | Side-by-side: Clip-Higher vs Clip-Wider vs Spurious Rewards on the same axis. |
| 2025-10-24 | 02 | Reinforce++ 和它的 KL Loss 选择 | KL Loss choice (k2 vs k3) matters more than usually credited. |
| 2025-11-11 | 03 | 无验证器 RL 与 Reference 的妙用 | Without verifiers, use PPL / reference-likelihood / reverse-self-eval as proxies. |
| 2025-11-29 | 03 | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM | Reward should model "where the answer is wrong"; generation ↔ verification co-evolve. |
| 2025-12-03 | 02 | DeepSeek V3.2 后训练:稳定压倒一切 | Industry's MoE post-train recipe: stability above all else. |
| 2025-12-21 | 03 | 无验证 RL——当模型只能相信自己 | All internal-feedback methods compress entropy → exploration crisis sooner or later. |
| 2025-12-31 | 05 | RL 究竟能不能突破 Base 边界 | Most RL is sampling polish; true extrapolation needs edge data + process reward. |
| 2026-01-14 | 02 | 开源大模型 RL Showcase:Kimi/EXAONE/MiMo/MiniMax/Qwen | 5 industrial GRPO variants compared side-by-side — every team is patching the same holes. |
| 2026-01-17 | 04 | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 | Train-infer router mismatch is the surface; R3 / IcePop / TIS each take a different angle. |
| 2026-01-22 | 04 | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? | Recomputed logprobs still drift on MoE; the deeper cause is sampling noise, not routing. |
| 2026-03-24 | 05 | Training-Free RL:当训练不再更新参数,而是更新上下文 | Advantage in text/context, not in weight space — fixed model can still "RL". |
| 2026-03-29 | 05 | RL 新范式:从经验到更高质量数据 | The loop becomes "trajectory → information gain → re-supervision". |
| 2026-04-11 | 05 | 实时学习:极致高效的子空间微调 | MiCA/TinyLoRA: pluggable real-time learning by occupying the minor singular directions. |
| 2026-04-17 | 05 | 从「会答」到「会想」:Planning as Data 与思考范式重构 | Reasoning becomes a data format; the next battle is how to construct planning data. |
- Jiayi-Pan/TinyZero — clean, minimal reproduction of DeepSeek R1-Zero
- huggingface/open-r1 — fully open reproduction of DeepSeek-R1
- hkust-nlp/simpleRL-reason — R1-Zero-style RL on small base models with limited data
- ZihanWang314/RAGEN — first open-source reproduction of DeepSeek-R1 on agent training
If this index or the linked blog posts are useful, leave a note on GitHub Issues, or cite yam.gift when referencing.
"Don't chase the wind — chase the long view; measure one year by the yardstick of ten." — 长琴