A curated, opinionated index of post-R1 LLM × Reinforcement Learning. Every paper is read, classified, cross-linked, and connected back to a Chinese deep-dive blog post by hscspring.
I consider RL to be a pivotal technology in the field of AI, and NLP (particularly LLM) to be a direction well worth exploring. This repo focuses on post-R1 LLM RL specifically.
Awesome lists are retrievers. This repo is a curator.
- Has a verdict, not a vote. Each of the 5 tracks ends with the author's bottom-line judgment ("data sets the ceiling, algorithms approach it"; "most RL is sampling polish, not extrapolation"; "all internal-feedback methods compress entropy → exploration crisis"), not a paper summary.
- Cross-links across papers. RLOO ≡ GRPO with Fnorm=1 (derived in GiGPO); REINFORCE + IS → CISPO loss (extended in CISPO); Activation Steering & Context Engineering are the upstream signals of Training-Free RL. These connections only surface when one person reads everything (the first of them is sketched right after this list).
- A narrative, not a database. The Chronological Blog Timeline reads as a year-and-a-half editorial arc through post-R1 RL × LLM: Feb-2025 R1 → GRPO family → Reward modeling → MoE stability → Training-Free RL.
- Personal historical anchors. Pre-R1 works (RL4LMs, FTHP, Quark, DT) live in their own corner with a one-line note on why I personally cared, not as bibliography filler.
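To make the first of those cross-links concrete, here is a short derivation sketch, assuming Fnorm denotes GRPO's group normalizer (the notation below is mine, not GiGPO's exact symbols):

```latex
\begin{align*}
A_i^{\mathrm{GRPO}} &= \frac{r_i - \bar r}{F_{\mathrm{norm}}},
  \qquad \bar r = \frac{1}{G}\sum_{j=1}^{G} r_j, \\
A_i^{\mathrm{RLOO}} &= r_i - \frac{1}{G-1}\sum_{j \neq i} r_j
  = \frac{G}{G-1}\,(r_i - \bar r).
\end{align*}
% With F_norm = 1, A_i^GRPO reduces to r_i - \bar r, which matches A_i^RLOO up
% to the constant factor G/(G-1), a scale the learning rate absorbs.
```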
Topics covered: DeepSeek-R1 reproduction · GRPO family (DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++) · PPO · RLHF · DPO · Reward Modeling · Verifier-Free RL · MoE RL Stability · Training-Free RL · Activation Steering · Agentic RL.
Blog posts on yam.gift are grouped into 5 tracks. Each verdict below is the author's own bottom-line judgment, not a paper summary.
Parsing the original report, then digging into the data / paradigm / experiment side.
- Core thesis: data sets the ceiling, algorithms only approach it. Pure rule-based RL is finally validated as a viable path.
- The frame that survived: "Base+SFT / Base+RL / SFT+RL" can absorb almost all subsequent variations.
- Loose ends still being chased: R1-Zero behavior differs sharply across base models (SimpleRL-Zoo, Yarz-Logic); LIMO/s1 confirm that "less is more" works by activating existing capability, not by teaching new skills.
Every GRPO variant — DAPO / Dr.GRPO / VAPO / CISPO / GiGPO / GSPO / GMPO / GTPO / Reinforce++ / industry showcases.
- Core thesis: every variant is paying off the same engineering debt — token- vs. sequence-level aggregation, tighter or wider clipping, length normalization, KL estimator choice (k2 vs k3), global advantage normalization. A minimal sketch of these knobs follows this list.
- Convergence: the GRPO objective increasingly looks like a "people's edition" of PPO with global advantage and no critic.
- Sub-thread: clip is not just a stability knob — it directly shapes the explore/exploit boundary. Spurious Rewards, Clip-Higher (DAPO), Clip-Wider (GMPO) are all moves on the same axis.
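A hedged, minimal sketch of a GRPO-style loss with those knobs exposed. Shapes, defaults, and names are my own assumptions for illustration, not any paper's reference implementation:

```python
import torch

def grpo_family_loss(
    logp_new,         # (G, T) per-token log-probs under the current policy
    logp_old,         # (G, T) per-token log-probs under the sampling policy
    logp_ref,         # (G, T) per-token log-probs under the reference policy
    rewards,          # (G,)   one scalar outcome reward per sampled response
    mask,             # (G, T) 1 for response tokens, 0 for padding
    eps_low=0.2,      # lower clip bound
    eps_high=0.28,    # upper clip bound (DAPO-style Clip-Higher when > eps_low)
    seq_level=False,  # token-level ratio (GRPO/DAPO) vs sequence-level (GSPO-style)
    norm_std=True,    # divide the advantage by the group std (Dr.GRPO drops this)
    kl_type="k3",     # "k2" or "k3" KL estimator against the reference policy
    kl_coef=0.0,
):
    # Group-relative advantage: one scalar per response, broadcast over tokens.
    adv = rewards - rewards.mean()
    if norm_std:
        adv = adv / (rewards.std() + 1e-6)
    adv = adv[:, None]

    # Importance ratio: per token, or one length-averaged ratio per sequence.
    if seq_level:
        seq_logr = ((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1)
        ratio = seq_logr.exp()[:, None].expand_as(mask)
    else:
        ratio = (logp_new - logp_old).exp()

    # PPO-style clipped surrogate; eps_low/eps_high are the "tighter/wider" axis.
    pg = -torch.minimum(ratio * adv,
                        torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv)

    # KL penalty: k2 = (log r)^2 / 2, k3 = r - 1 - log r, with r = pi_ref / pi_new.
    logr_ref = logp_ref - logp_new
    kl = 0.5 * logr_ref.pow(2) if kl_type == "k2" else logr_ref.exp() - 1 - logr_ref

    per_token = (pg + kl_coef * kl) * mask
    # Length normalization: per-response token mean here (vanilla GRPO); DAPO and
    # Dr.GRPO instead average over all tokens in the group.
    return (per_token.sum(-1) / mask.sum(-1)).mean()
```

Flipping `seq_level`, `norm_std`, `eps_high`, and `kl_type` roughly walks from vanilla GRPO toward Dr.GRPO-, DAPO-, and GSPO-flavored variants, which is the sense in which they all pay the same debt.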
RM / RM-Data / Verifier-Free / Self-Verified / Verify-Free RL.
- Drift 1 (modeling): from single scalar → "principles + critique + self-verification" (DeepSeek-GRM → DeepSeekMath-V2).
- Drift 2 (data): good reward data is more like unlocking the base model's existing capabilities than teaching new ones (Skywork-Reward-V2, Spurious Rewards).
- Drift 3 (Verify-Free): when no external verifier exists, all internal-feedback methods (TTRL / EM / RENT / EMPO / Intuitor) end up compressing entropy. Long-term, an exploration crisis is inevitable — ETTRL / Darling / EVOL-RL / RESTRAIN are all band-aids on the same wound (two representative internal-feedback signals are sketched after this list).
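As a hedged illustration of why these internal-feedback methods converge on entropy compression, here are two representative signals in sketch form (the function and its names are mine; neither is any paper's exact formulation):

```python
import torch
from collections import Counter

def internal_rewards(answers, token_logprobs):
    """answers:        list of G final answers extracted from sampled responses
    token_logprobs:    list of G tensors, per-token log-probs of each response
    """
    # TTRL-style: pseudo-label = majority vote over the group; reward 1 if a
    # response agrees with its own group's consensus, else 0.
    majority, _ = Counter(answers).most_common(1)[0]
    vote_reward = torch.tensor([float(a == majority) for a in answers])

    # EM/RENT-style: reward = model confidence, i.e. negative mean token
    # "surprise". Maximizing it is entropy minimization on the model's own output.
    confidence_reward = torch.tensor([lp.mean().item() for lp in token_logprobs])

    # Both signals are largest exactly where the current policy already puts
    # probability mass, so optimizing them sharpens (compresses) the output
    # distribution rather than exploring beyond it.
    return vote_reward, confidence_reward
```

Both signals reward the model for agreeing with itself, so repeated optimization sharpens the output distribution; the ETTRL / Darling / EVOL-RL / RESTRAIN line is about re-injecting the diversity this removes.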
R3 / IcePop / TIS / KAT-Coder.
- Surface diagnosis: train-infer router mismatch is what everyone first noticed.
- Deeper cause: logprob estimation noise on MoE is not neutral; even recomputed logprobs drift. The importance ratio (π_new/π_old), the heart of GRPO, is silently diluted on MoE; a truncated-importance-style reweighting is sketched after this list.
- Open bet: GSPO/GMPO's sequence-level + geometric-mean might be MoE-RL-friendly; not yet validated at production scale.
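A hedged sketch of the dilution point and of a TIS-style truncated-importance reweighting. The names, shapes, and cap value are my assumptions, not the paper's exact recipe:

```python
import torch

def moe_corrected_ratio(logp_new, logp_old_train, logp_old_infer,
                        cap=2.0, eps_low=0.2, eps_high=0.2):
    """logp_new:        (T,) per-token log-probs recomputed now by the training engine
    logp_old_train:     (T,) per-token log-probs recomputed by the training engine at
                        sampling time (what GRPO treats as pi_old)
    logp_old_infer:     (T,) per-token log-probs reported by the inference engine that
                        actually generated the tokens
    """
    # The ratio GRPO optimizes: both numerator and denominator come from the
    # training engine, so train-infer mismatch never shows up inside it.
    ratio = torch.clamp((logp_new - logp_old_train).exp(),
                        1 - eps_low, 1 + eps_high)

    # On MoE the two engines can route differently, so pi_old_train != pi_old_infer
    # even at identical parameters. A TIS-style fix reweights each token by a
    # truncated importance weight toward the distribution the data really came from.
    correction = (logp_old_train - logp_old_infer).exp().clamp(max=cap)
    return correction * ratio
```

Whether GSPO/GMPO's sequence-level geometric averaging would make such a correction unnecessary on MoE is exactly the open bet above.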
Training-Free / Experiential / Real-time / Planning / RL Boundary.
- Boundary realization: most RL is just sampling polish (Yue), not true pass@k extrapolation.
- Counter-evidence: ProRL / DELTA show extrapolation is possible — but only with edge data + process reward + avoiding "all-zero pass@k" cold start.
- Upstream signals (already in 2025 H2): "Activation Steering" and "Context Engineering" both pointed in this direction before Training-Free RL had a name — behavior can be shaped without touching weights.
- New paradigm A — Training-Free RL: the advantage lives in text/context, not in weight space (TRT, Training-Free GRPO, MemAPO, Update-Free Steering); a sketch of the loop follows this list.
- New paradigm B — Experience-as-RL: the loop becomes "trajectory → information gain → re-supervision". Reflection, meta-search and open-ended learning are all data-construction tricks in disguise.
- Higher-level question: "reasoning" should be studied as a data format, not as an RL task (Think-Strategy / LEPA).
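A minimal sketch of paradigm A's loop, assuming a hypothetical `llm.generate` interface and a caller-supplied `score_fn`; it illustrates the idea, not any specific paper's implementation:

```python
def training_free_rl_step(llm, prompt, score_fn, memory, group_size=8):
    """One "update" with frozen weights: the learning lands in a textual
    experience bank that is prepended to future prompts."""
    context = "\n".join(memory)

    # 1. Roll out a group of responses, exactly as GRPO would.
    group = [llm.generate(context + "\n" + prompt) for _ in range(group_size)]

    # 2. Group-relative signal: score each rollout against its own group.
    scored = sorted(((score_fn(prompt, g), g) for g in group), key=lambda p: p[0])
    baseline = sum(s for s, _ in scored) / len(scored)
    worst, best = scored[0][1], scored[-1][1]

    # 3. The "gradient step": no parameters change; the frozen model distills
    #    what separated above-baseline rollouts from below-baseline ones into a
    #    reusable lesson stored in context, i.e. the advantage lives in text.
    lesson = llm.generate(
        "Compare the better and worse attempts below and state one reusable lesson.\n"
        f"Better:\n{best}\n\nWorse:\n{worst}"
    )
    memory.append(lesson)
    return memory, baseline
```

Paradigm B differs mainly in where the distilled lesson goes: back into a supervision set rather than into the prompt.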
| GitHub | From | Year | Description |
|---|---|---|---|
| prime-rl | PrimeIntellect-ai | 2025 | Decentralized large-scale RL training framework |
| PRIME | PRIME-RL | 2025 | Scalable RL recipe for reasoning |
| rStar | Microsoft | 2025 | Self-evolved deep reasoning for SLMs |
| veRL | ByteDance | 2024 | LLM RL training framework (Volcano Engine) |
| trl | HuggingFace | 2024 | Train language models with RL |
| RL4LMs | Allen | 2022 | Aligning LMs to human preference via RL |
| alignment-handbook | HuggingFace | 2023 | Recipes for aligning to human/AI preference |
Notation for the My Notes column:
- `[short title](url)` — full Chinese deep-dive available (yam.gift blog or book chapter)
- `(omnibus → ...)` — covered as a main thread in a survey/overview blog
- `(<verb: derived/extended/contrasted/described/framed/criticized/...> in [blog]: ...)` — touched on as a sub-topic inside another deep-dive, with a one-line pointer; multiple pointers can be chained with `;`
- `to-write` — not yet written
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R1 | DeepSeek-R1: Incentivizing Reasoning Capability via RL | DeepSeek | 2025 | paper | DeepSeek R1 深度技术解析及其影响 |
| LIMO | LIMO: Less Is More for Reasoning | SJTU | 2025 | paper | 少量高质量数据 SFT 激活推理 |
| s1 | s1: Simple test-time scaling | Stanford | 2025 | paper | (omnibus → SFT-Data) |
| R1 Survey | The R1-era LLM new paradigm | — | 2025 | — | DeepSeek R1 后 LLM 新范式 |
| R1-Zero+ | Further understanding of R1-Zero | — | 2025 | — | R1-Zero 的进一步理解和探索 |
| SimpleRL-Zoo | SimpleRL-Zoo: R1-Zero RL across diverse base models | HKUST | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| FastCuRL | FastCuRL: Curriculum RL with Stage-wise Context Scaling | — | 2025 | paper | (omnibus → Think-More-about-R1-Zero) |
| Logic-RL | Logic-RL: Unleashing LLM Reasoning with Rule-Based RL | — | 2025 | paper | Yarz-Logic:R1-Zero 相关实验报告 |
| Seed-Thinking | Seed-Thinking-v1.5: Advancing Superb Reasoning | ByteDance | 2025 | paper | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 |
| Qwen3 | Qwen3 Technical Report | Qwen | 2025 | paper | (omnibus → Seed-Thinking-Qwen3) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| LIMR | LIMR: Less is More for RL Scaling | GAIR-NLP | 2025 | paper, GitHub | R1 相关:RL 数据选择与 Scaling |
| ORZ | Open-Reasoner-Zero | StepFun | 2025 | paper, GitHub | (omnibus → PPO-Data) |
| Online-DPO-R1 | Online-DPO-R1: Effective Reasoning Without the PPO Overhead | Salesforce | 2025 | paper, GitHub | R1 相关:DPO 数据选择与 DPO 等 RL 算法 |
| LIMD | LIMD: Less is More on DPO Data | — | 2025 | — | (omnibus → DPO-Data) |
| OREAL | Exploring the Limit of Outcome Reward for Math Reasoning | InternLM | 2025 | paper, GitHub | to-write |
| DeepScaleR | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| L1 / LCPO | Controlling How Long A Reasoning Model Thinks With RL | CMU | 2025 | paper, GitHub | (omnibus → R1-New-Paradigm) |
| MRT | Optimizing Test-Time Compute via Meta RL Fine-Tuning | CMU | 2025 | paper, GitHub | to-write |
| ScalingLaw | Value-Based Deep RL Scales Predictably | Berkeley | 2025 | paper | to-write |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| PRIME | Process Reinforcement through Implicit Rewards | PRIME-RL | 2025 | paper, GitHub | (described in R1-New-Paradigm: implicit PRM — trained as an ORM, used as a PRM) |
| rStar-Math | rStar-Math: Small LLMs Can Master Math Reasoning | Microsoft | 2025 | paper, GitHub | (described in R1-New-Paradigm: rule-based verification on intermediate results at key steps, via Python code execution) |
| rStar | rStar: Mutual Reasoning Makes Smaller LLMs Stronger | Microsoft | 2024 | paper, GitHub | to-write |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| GRM | Inference-Time Scaling for Generalist Reward Modeling | DeepSeek | 2025 | paper | Reward Model 建模 |
| Skywork-Reward-V2 | Skywork-Reward-V2 | Skywork | 2025 | paper | Reward 数据如何塑造与激发推理策略 |
| Spurious Rewards | Spurious Rewards: Rethinking Training Signals in RLVR | Allen | 2025 | paper | (omnibus → RM-Data / GRPO-Clip) |
| ICM | Anthropic Internal Coherence Maximization | Anthropic | 2025 | blog | (omnibus → RM-Data) |
| DeepSeekMath-V2 | Towards Self-Verifiable Mathematical Reasoning | DeepSeek | 2025 | paper, GitHub | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM |
One blog (Verify-Free RL) covers the algorithms below — listed individually for searchability.
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| NOVER | NOVER: Incentive Training without External Verifiers | — | 2025 | paper | 无验证器 RL 与 Reference 的妙用 |
| TTRL | TTRL: Test-Time Reinforcement Learning | — | 2025 | paper | 无验证 RL——当模型只能相信自己 |
| SRT | Can Large Reasoning Models Self-Train? | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EM | The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning | — | 2025 | paper | (omnibus → Verify-Free RL; covers EM-FT / EM-RL / EM-INF) |
| RENT | Maximizing Confidence Alone Improves Reasoning | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EMPO | Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | — | 2025 | paper | (omnibus → Verify-Free RL) |
| Intuitor | Learning to Reason without External Rewards | — | 2025 | paper | (omnibus → Verify-Free RL) |
| ETTRL | ETTRL: Balancing Exploration and Exploitation via Entropy Mechanism | — | 2025 | paper | (omnibus → Verify-Free RL) |
| Darling | Jointly Reinforcing Diversity and Quality in Language Model Generations | — | 2025 | paper | (omnibus → Verify-Free RL) |
| EVOL-RL | Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation | — | 2025 | paper | (omnibus → Verify-Free RL) |
| RESTRAIN | RESTRAIN: From Spurious Votes to Signals — Self-Driven RL with Self-Penalization | — | 2025 | paper | (omnibus → Verify-Free RL) |
| No Free Lunch | No Free Lunch: Rethinking Internal Feedback for LLM Reasoning | — | 2025 | paper | (theoretical critique of Verify-Free; omnibus → Verify-Free RL) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| RLHF | Training language models to follow instructions with human feedback | OpenAI | 2022 | paper | HuggingLLM 1.3.3:RLHF 流程与思想 |
| RLOO | Back to Basics: Revisiting REINFORCE Style Optimization for RLHF | Cohere | 2024 | paper | (derived in GiGPO: GRPO with Fnorm=1 ≡ RLOO) |
| ReMax | ReMax: A Simple, Effective, and Efficient RL Method for LLM | CUHK | 2024 | paper | (contrasted in Reinforce++: greedy-baseline variant — inefficient because the greedy response is unused for training) |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| R3 | Stabilizing MoE RL by Aligning Training and Inference Routers | Xiaomi | 2025 | paper | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 |
| IcePop | Small Leak Can Sink a Great Ship — Boost RL Training on MoE | Ant | 2025 | paper | (omnibus → RL-MoE-Stable) |
| TIS | Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | UCSD | 2025 | paper | (omnibus → RL-MoE-Stable) |
| KAT | KAT-Coder Tech Report | Kuaishou | 2026 | blog | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? |
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| COPO | Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents | Tencent | 2026 | paper, GitHub | COPO:基于认知模式的 Step-Level RL 优化 |
| GiGPO | Group-in-Group Policy Optimization for LLM Agent Training | NTU, Skywork AI | 2025 | paper, GitHub | GiGPO:双层级优势函数驱动的 Agent RL 新范式 |
| GRPO | DeepSeekMath: Pushing the Limits of Mathematical Reasoning | DeepSeek | 2024 | paper | (covered in DAPO and R1) |
| DAPO | DAPO: An Open-Source LLM RL System at Scale | ByteDance Seed | 2025 | paper, GitHub | DAPO:为 GRPO 锦上添四点花 |
| Dr.GRPO | Understanding R1-Zero-Like Training: A Critical Perspective | Sea AI Lab | 2025 | paper, GitHub | 异曲同工的 Dr.GRPO |
| VAPO | VAPO: Efficient and Reliable RL for Advanced Reasoning | ByteDance Seed | 2025 | paper | VAPO:基于价值方法的新突破 |
| CISPO | MiniMax-M1: Scaling Test-Time Compute Efficiently | MiniMax | 2025 | paper, GitHub | GRPO 优化在继续:CISPO 和熵 |
| GSPO | Group Sequence Policy Optimization | Qwen | 2025 | paper | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 |
| GMPO | Geometric-Mean Policy Optimization | UCAS, Microsoft | 2025 | paper, GitHub | (omnibus → Token-Level-GSPO-GMPO) |
| GTPO | GTPO: Trajectory-Based Policy Optimization in LLMs | — | 2025 | paper | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 |
| Reinforce++ | REINFORCE++: Stabilizing Critic-Free Policy Optimization | OpenRLHF | 2025 | paper, GitHub | Reinforce++ 和它的 KL Loss 选择 |
| KimiRL | Kimi k1.5: Scaling RL with LLMs | Kimi | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| AGAPO | EXAONE 4.0: Unified LLM Integrating Non-reasoning and Reasoning Modes | LG AI | 2025 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| K-EXAONE | EXAONE-2 Tech Report | LG AI | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| MOPD | MiMo-V2-Flash Technical Report | Xiaomi | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| SAPO | Soft Adaptive Policy Optimization | Qwen | 2026 | paper | (omnibus → Open-LLM-RL-ShowCase) |
| DCPO | DCPO: Dynamic Clipping Policy Optimization | Baichuan | 2025 | paper, GitHub | (described in GRPO-Clip: adaptive clip bounds based on token prior probability — expanding exploration room for low-probability tokens) |
| OPO | On-Policy RL with Optimal Reward Baseline | Microsoft | 2025 | paper | (described in Token-Level-GSPO-GMPO: optimal reward baseline minimizes gradient variance; contrasted in GTPO: focuses on advantage/reward level rather than token level) |
| SRPO | SRPO: Cross-Domain Implementation of Large-Scale RL on LLM | Kuaishou | 2025 | paper, HF | (contrasted in Token-Level-GSPO-GMPO: historical resampling retains key samples to improve sample efficiency) |
| DPO | Direct Preference Optimization | Stanford | 2024 | paper | (compared inside DPO-Data) |
| PPO | Proximal Policy Optimization Algorithms | OpenAI | 2017 | paper | (extended in VAPO: Value-based Augmented PPO with GAE refinements / value-pretraining; contrasted in Reinforce++: PPO with critic vs critic-free Reinforce-style) |
| REINFORCE | Simple Statistical Gradient-Following Algorithms | Northeastern | 1992 | paper | (extended in CISPO: REINFORCE + IS → CISPO loss; framed in Open-LLM-RL-ShowCase: REINFORCE-with-baseline as analytic frame for all GRPO variants) |
Pre-R1 RL × NLP works the author considers personally important. Kept here for sentimental and historical reasons rather than as active reading. The author's full reflection: 《通向 AGI 的技术路径:多模态、强化学习与新架构的交汇点》 — "When RL4LMs came out in 2022 I was too excited to sleep that night, and read their code the first chance I got."
| Abbr | Title | From | Year | Link | Note |
|---|---|---|---|---|---|
| RL4LMs / NLPO | RL (Not) for NLP: Benchmarks, Baselines, Building Blocks | Allen | 2022 | paper, GitHub | Milestone personally cited in AI-Future-Framework — the first time RL × NLP felt like it could really land |
| FTHP | Fine-Tuning Language Models from Human Preferences | OpenAI | 2020 | paper, GitHub | OpenAI's earliest RLHF experiment; the seed of InstructGPT/ChatGPT |
| Quark | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen | 2022 | paper, GitHub | Early attempt at RL for controllable generation × unlearning — niche but conceptually clean |
| DT | Decision Transformer: RL via Sequence Modeling | Berkeley | 2021 | paper, GitHub | The "RL = sequence modeling" reframing — a parallel branch that diverged from the LLM-RL trunk |
Pure RL frontier: where the training loop itself is being pushed (boundary, process reward, experience-as-data, planning-as-data).
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| DeepSeek-V3.2 Post-train | DeepSeek-V3.2 Tech Report | DeepSeek | 2025 | paper | DeepSeek V3.2 后训练:稳定压倒一切 |
| RL Boundary (Yue) | Does RL Really Incentivize Reasoning Capacity Beyond the Base Model? | — | 2025 | paper | RL 究竟能不能突破 Base 边界 |
| Invisible Leash | The Invisible Leash: Why RLVR May Not Escape Its Origin | Wu et al. | 2025 | paper | (omnibus → RL-Are-You-OK) |
| ProRL | ProRL: Prolonged RL Expands Reasoning Boundaries | NVIDIA | 2025 | paper | (omnibus → RL-Are-You-OK) |
| DELTA | DELTA: Dense Process Reward for RL Boundary Extrapolation | — | 2025 | — | (omnibus → RL-Are-You-OK) |
| ERL | Experience-as-RL | — | 2026 | paper | RL 新范式:从经验到更高质量数据 |
| MR-Search | MR-Search: Meta-Reasoning Search | — | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| OEL | Open-Ended Learning | — | 2026 | paper | (omnibus → RL-New-Paradigm-Data) |
| LEPA | LEPA: Learn to Plan before Answering | — | 2025 | paper | 从「会答」到「会想」:Planning as Data 与思考范式重构 |
| Self-Steering | Self-Steering | — | 2025 | paper | (omnibus → Think-Strategy) |
Post-RL directions covered in the same blogs. The paradigm is moving away from classical RL training, into context, behavior, and parameter-efficient adaptation. They are not RL by the textbook definition, but they share the same goal — shaping model behavior — and several of them (Activation Steering, Context Engineering) are the upstream signals that Training-Free RL only later named explicitly.
| Abbr | Title | From | Year | Link | My Notes |
|---|---|---|---|---|---|
| TRT | Test-time Recursive Thinking: Self-Improvement without External Feedback | Microsoft | 2026 | paper | Training-Free RL:当训练不再更新参数,而是更新上下文 |
| Training-Free GRPO | Training-Free Group Relative Policy Optimization | — | 2025 | paper | (omnibus → Training-Free RL) |
| MemAPO | MemAPO: Memory-Augmented Policy Optimization | — | 2026 | paper | (omnibus → Training-Free RL) |
| Update-Free Steering | Update-Free On-Policy Steering via Verifiers | — | 2026 | paper | (omnibus → Training-Free RL) |
| Activation Engineering | Steering Language Models With Activation Engineering | — | 2023 | paper | 激活诱导 LLM 指令跟随 |
| Activation Steering (IF) | Improving Instruction-Following in Language Models through Activation Steering | Microsoft | 2024 | paper | (omnibus → 激活诱导 LLM 指令跟随) |
| Context Engineering | Context Engineering for AI Agents: Lessons from Building Manus | Manus | 2025 | blog | 重识 LLM 法则:上下文工程与数据进化 |
| MiCA | MiCA: Minor-Component Adaptation | — | 2026 | paper | 实时学习:极致高效的子空间微调 |
| TinyLoRA | TinyLoRA | — | 2026 | — | (omnibus → Real-time-Learning-from-PEFT) |
All blog posts in publishing order, with the author's one-sentence takeaway. Use this when you want to follow the narrative arc rather than search by paper.
| Date | Track | Blog (Chinese) | One-sentence takeaway |
|---|---|---|---|
| 2025-02-17 | 01 | DeepSeek R1 深度技术解析及其影响 | Data sets the ceiling, algorithms approach it; pure rule-based RL works. |
| 2025-02-18 | 01 | 少量高质量数据 SFT 激活推理 | LIMO/s1: small high-quality SFT activates reasoning, doesn't teach it. |
| 2025-02-27 | 01 | R1 相关:RL 数据选择与 Scaling | LIMR/ORZ: less-is-more applies to RL data, not just SFT. |
| 2025-03-02 | 01 | R1 相关:DPO 数据选择与 DPO 等 RL 算法 | Online-DPO can rival PPO when paired with the right data pipeline. |
| 2025-03-15 | 01 | DeepSeek R1 后 LLM 新范式 | The post-R1 path forks into multiple parallel lines (length, scaling, MRT, …). |
| 2025-03-19 | 02 | DAPO:为 GRPO 锦上添四点花 | DAPO = Clip-Higher + Dynamic Sampling + Token-Level Loss + Overlong Reward Shaping. |
| 2025-03-28 | 02 | 异曲同工的 Dr.GRPO | Dr.GRPO removes the length & std normalization biases hidden in vanilla GRPO. |
| 2025-04-10 | 01 | R1-Zero 的进一步理解和探索 | R1-Zero behavior depends heavily on base model; "Aha moment" is partly base-pretrain artifact. |
| 2025-04-19 | 02 | VAPO:基于价值方法的新突破 | Value-based methods come back to compete with critic-free GRPO. |
| 2025-04-26 | 01 | Yarz-Logic:R1-Zero 相关实验报告 | Hands-on Logic-RL replication: where R1-Zero's edges are in practice. |
| 2025-05-01 | 01 | R1 后范式最佳实践:Seed-Thinking 和 Qwen3 | Seed-Thinking + Qwen3 are the two most complete industrial post-R1 recipes. |
| 2025-06-09 | 03 | Reward Model 建模 | General-domain RM needs principles+critique, not a single scalar (DeepSeek-GRM). |
| 2025-06-19 | 02 | GRPO 优化在继续:CISPO 和熵 | CISPO shows clip is not just stability — it shapes the explore/exploit edge. |
| 2025-07-01 | 05 | 激活诱导 LLM 指令跟随 | Activation Steering: behavior shaping without weight updates — the prequel to Update-Free Steering. |
| 2025-07-13 | 03 | Reward 数据如何塑造与激发推理策略 | Good reward data unlocks pre-existing strategies; even spurious rewards can do this. |
| 2025-07-25 | 02 | GiGPO:双层级优势函数驱动的 Agent RL 新范式 | Agent RL needs hierarchical (group-in-group) advantages for proper credit assignment. |
| 2025-07-27 | 05 | 重识 LLM 法则:上下文工程与数据进化 | "Everything is context" — the early manifesto behind Training-Free RL. |
| 2025-08-14 | 02 | Token Level X:DAPO/DrGRPO 与 GSPO/GMPO 的殊途同归 | Token-level vs sequence-level is THE axis of the GRPO family. |
| 2025-08-30 | 02 | GRPO「第一背锅侠」X2:GTPO 双 T 傍地走 | GTPO: trajectory-level view exposes more of GRPO's hidden assumptions. |
| 2025-09-12 | 02 | GRPO-Clip:DAPO/GMPO/Spurious Rewards 等 clip 变体对照 | Side-by-side: Clip-Higher vs Clip-Wider vs Spurious Rewards on the same axis. |
| 2025-10-24 | 02 | Reinforce++ 和它的 KL Loss 选择 | KL Loss choice (k2 vs k3) matters more than usually credited. |
| 2025-11-11 | 03 | 无验证器 RL 与 Reference 的妙用 | Without verifiers, use PPL / reference-likelihood / reverse-self-eval as proxies. |
| 2025-11-29 | 03 | DeepSeekMath-V2 自我验证:搞数据的风吹到了 RM | Reward should model "where the answer is wrong"; generation ↔ verification co-evolve. |
| 2025-12-03 | 02 | DeepSeek V3.2 后训练:稳定压倒一切 | Industry's MoE post-train recipe: stability above all else. |
| 2025-12-21 | 03 | 无验证 RL——当模型只能相信自己 | All internal-feedback methods compress entropy → exploration crisis sooner or later. |
| 2025-12-31 | 05 | RL 究竟能不能突破 Base 边界 | Most RL is sampling polish; true extrapolation needs edge data + process reward. |
| 2026-01-14 | 02 | 开源大模型 RL Showcase:Kimi/EXAONE/MiMo/MiniMax/Qwen | 5 industrial GRPO variants compared side-by-side — every team is patching the same holes. |
| 2026-01-17 | 04 | 稳定压倒一切:MoE RL 训推不一致问题及解决策略 | Train-infer router mismatch is the surface; R3 / IcePop / TIS each take a different angle. |
| 2026-01-22 | 04 | MoE RL 训练不稳定性再思考:训推不一致,还是采样噪声? | Recomputed logprobs still drift on MoE; the deeper cause is sampling noise, not routing. |
| 2026-03-24 | 05 | Training-Free RL:当训练不再更新参数,而是更新上下文 | Advantage in text/context, not in weight space — fixed model can still "RL". |
| 2026-03-29 | 05 | RL 新范式:从经验到更高质量数据 | The loop becomes "trajectory → information gain → re-supervision". |
| 2026-04-11 | 05 | 实时学习:极致高效的子空间微调 | MiCA/TinyLoRA: pluggable real-time learning by occupying the minor singular directions. |
| 2026-04-17 | 05 | 从「会答」到「会想」:Planning as Data 与思考范式重构 | Reasoning becomes a data format; the next battle is how to construct planning data. |
- Jiayi-Pan/TinyZero — clean, minimal reproduction of DeepSeek R1-Zero
- huggingface/open-r1 — fully open reproduction of DeepSeek-R1
- hkust-nlp/simpleRL-reason — R1-Zero-style RL on small base models with limited data
- ZihanWang314/RAGEN — first open-source reproduction of DeepSeek-R1 on agent training
If this index or the linked blog posts are useful, leave a note on GitHub Issues, or cite yam.gift when referencing.
"Don't chase the wind — chase the long view; measure one year by the yardstick of ten." — 长琴