387 changes: 387 additions & 0 deletions app/blog/qerl-quantization-reinforcement-learning/page.tsx

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions app/page.tsx
@@ -293,6 +293,37 @@ export default function Home() {
</div>
</div>
</Link>

{/* QeRL Project */}
<Link
href="/blog/qerl-quantization-reinforcement-learning"
className="group relative bg-gradient-to-br from-slate-800/50 to-slate-700/50 backdrop-blur-sm border border-slate-600/50 rounded-xl p-6 hover:border-orange-500/50 hover:shadow-2xl hover:shadow-orange-500/10 transition-all duration-300"
>
<div className="absolute top-4 left-4">
<span className="bg-slate-600/50 text-slate-300 text-xs px-2 py-1 rounded-md">{getText('Research', '研究')}</span>
</div>
<div className="absolute top-4 right-4">
<span className="bg-orange-500/20 text-orange-400 text-xs px-2 py-1 rounded-md">{getText('Latest', '最新')}</span>
</div>

<div className="mt-8">
<h4 className="text-xl font-bold mb-3 group-hover:text-orange-400 transition-colors">
{getText('QeRL: Beyond Efficiency', 'QeRL:超越效率')}
</h4>
<p className="text-gray-400 text-sm mb-4 leading-relaxed">
{getText(
'Quantization-enhanced Reinforcement Learning for LLMs achieves 1.5x speedup and enables RL training of 32B models on a single H100 80GB GPU',
'量化增强强化学习在单个H100 80GB GPU上实现1.5倍加速并支持32B模型的强化学习训练'
)}
</p>
<div className="flex items-center justify-between">
<span className="text-xs text-gray-500">{getText('MIT-Han Lab', 'MIT韩松实验室')}</span>
<span className="text-orange-400 text-sm group-hover:text-orange-300 transition-colors">
{getText('Learn More', '了解更多')} →
</span>
</div>
</div>
</Link>
</div>
</div>
</main>
1,457 changes: 1,457 additions & 0 deletions public/content/qerl-quantization-reinforcement-learning/paper.md

Large diffs are not rendered by default.

@@ -0,0 +1,139 @@
---
hero:
title: "Train 32B LLM Reasoning On 1 GPU - H100 80GB - QeRL"
subtitle: "LLM Reinforcement Learning With 4Bit Quantization"
tags:
- "⏱️ Technical Deep Dive"
- "📄 Research Article"
---

### 1. High-Level Summary (TL;DR)

The paper introduces **QeRL**, a new framework that makes Reinforcement Learning (RL) for Large Language Models (LLMs) significantly faster and more memory-efficient. The key insight is counter-intuitive: quantizing (compressing) the model to 4 bits not only saves memory and increases speed (achieving 1.5× rollout speedup and 1.8× end-to-end training speedup), but the noise introduced by quantization can also be leveraged for exploration: it makes the next-token distribution slightly more random, encouraging the model to discover better reasoning strategies and leading to superior final performance. QeRL combines a high-speed 4-bit format (NVFP4) with a dynamic noise-injection technique, achieving drastic memory savings and accuracy that matches or even exceeds traditional, more resource-heavy training methods.

![QeRL Performance Overview](/content/qerl-quantization-reinforcement-learning/images/performance.png)
*Figure 1: QeRL achieves superior rollout speed and end-to-end training efficiency while delivering better accuracy than vanilla LoRA and QLoRA, matching full-parameter RL training.*

---

### 2. The Problem Being Solved

Reinforcement Learning (RL) is a powerful technique for teaching LLMs complex reasoning skills (like solving math problems). Unlike Supervised Fine-Tuning (SFT), which just mimics examples, RL allows the model to try different solutions and learn from a reward signal (e.g., "was the final answer correct?"). However, RL for LLMs is extremely demanding:

1. **High GPU Memory:** RL training often requires multiple copies of the model to be in memory simultaneously (e.g., the policy model being trained, a reference model, etc.). The policy model generates responses and is updated based on rewards, while the reference model (a frozen copy of the policy model) is kept to measure how much the policy has changed: by comparing their output probabilities, training adds a penalty (KL divergence) if the policy deviates too much, preventing instability (see the sketch after this list).
2. **Slow Training Time:** The training process has a major bottleneck called the **rollout phase**. During rollouts, the model must generate multiple candidate solutions for each problem (e.g., 8-16 different reasoning paths per question), creating long sequences of tokens (up to 4,096 tokens per solution). All these candidates are then evaluated for correctness to compute rewards. This generation and evaluation process is computationally expensive and time-consuming, dominating the training time.
3. **Ineffective Existing Solutions:**
* **LoRA (Low-Rank Adaptation):** Reduces the number of *trainable* parameters but doesn't shrink the main model. This saves some memory but does nothing to speed up the slow rollout phase, as the full-size model still has to run.
* **QLoRA (Quantized LoRA):** Shrinks the main model to 4 bits (NF4 format) to save memory, making it possible to train models that wouldn't otherwise fit on available GPUs. However, the NF4 format used by QLoRA is computationally slow: it requires "unpacking" the 4-bit values into a higher-precision format before calculations, which actually makes the rollout phase **1.5x to 2x slower** than using a standard 16-bit model. Users accept this speed penalty for the ability to fit the model onto the GPU.

In short, existing methods force a trade-off: save memory but slow down training, or keep training fast but require massive amounts of GPU memory.
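
To make the memory picture concrete, here is a minimal sketch of the KL penalty described above, assuming a PyTorch-style setup; the function name, the `beta` coefficient, and the tensor shapes are illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def kl_penalty(policy_logits: torch.Tensor,
               ref_logits: torch.Tensor,
               beta: float = 0.05) -> torch.Tensor:
    """Per-token KL(policy || reference), scaled by an illustrative beta.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    The reference model is frozen, so its logits carry no gradients.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits.detach(), dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x))
    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)
    return beta * kl  # added to the RL loss to discourage drifting too far

# Toy shapes only; in real training these logits come from two model copies.
policy_logits = torch.randn(2, 8, 32000, requires_grad=True)
ref_logits = torch.randn(2, 8, 32000)
print(kl_penalty(policy_logits, ref_logits).shape)  # torch.Size([2, 8])
```

Because both copies must produce logits for every generated token, two full models sit in GPU memory at once, which is exactly the pressure QeRL's 4-bit weights relieve.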

![QeRL Framework Comparison](/content/qerl-quantization-reinforcement-learning/images/qerl-framework.png)
*Figure 2: Comparison of RL training approaches. (a) LoRA reduces trainable parameters but doesn't accelerate rollouts. (b) QLoRA uses slow NF4 quantization. (c) QeRL uses fast NVFP4 quantization with Adaptive Quantization Noise for enhanced exploration.*

---

### 3. The Core Idea of QeRL: Quantization is a Feature, Not a Bug

The authors of QeRL discovered something surprising. The small errors, or "noise," introduced by quantization can be beneficial for RL.

* **How it Works:** When a model is quantized, its weights are slightly altered due to compression - representing precise 16-bit floating-point numbers with only 4 bits means the original values must be rounded to the nearest representable value, introducing small errors. This adds a small amount of randomness to the model's output logits (the scores it gives to each possible next word).
* **Increased Policy Entropy:** This randomness makes the probability distribution over the next word "flatter." Instead of being overconfident in one single "best" word, the model assigns smoother probabilities to a wider range of plausible words. This is known as increasing the **policy entropy**.
* **Enhanced Exploration:** In RL, higher entropy is a good thing, especially early in training. It encourages the model to **explore** different paths and strategies instead of getting stuck on a single, potentially suboptimal one. This is similar to how humans brainstorm different ways to solve a problem before settling on the best one.

This turns the conventional wisdom on its head. In SFT, quantization noise is usually seen as a negative side effect to be minimized. In RL, QeRL shows it can be harnessed as a **computationally free exploration mechanism**.
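
As a toy illustration of the entropy effect (not the paper's measurement), the snippet below computes the policy entropy H = -Σ p log p of a sharply peaked next-token distribution, then perturbs the logits with small Gaussian noise standing in for quantization error; averaged over draws, the peaked distribution flattens and entropy rises. The noise scale of 0.5 is an arbitrary choice for the demo:

```python
import torch
import torch.nn.functional as F

def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean next-token entropy H = -sum_x p(x) * log p(x)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

# An overconfident next-token distribution: one logit towers over the rest.
sharp_logits = torch.zeros(1, 1000)
sharp_logits[0, 0] = 10.0

# Small random perturbations stand in for the effect of weight quantization.
perturbed = torch.stack([
    policy_entropy(sharp_logits + 0.5 * torch.randn_like(sharp_logits))
    for _ in range(100)
])

print(f"entropy without noise:   {policy_entropy(sharp_logits):.3f} nats")
print(f"mean entropy with noise: {perturbed.mean():.3f} nats")  # higher on average
```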

![Entropy and Exploration](/content/qerl-quantization-reinforcement-learning/images/entropy-exploration.png)
*Figure 3: Quantization noise increases policy entropy early in training, leading to better exploration and faster reward growth. The higher initial entropy helps the model discover superior reasoning strategies. LoRA (Low-Rank Adaptation) works by decomposing weight updates into two smaller matrices with rank r (e.g., r=16, r=32, r=64). The rank controls the expressiveness of the adapter: higher ranks can capture more complex updates but require more memory and computation. In the experiments, "r32" means the LoRA rank is set to 32.*

![Reward Growth Comparison](/content/qerl-quantization-reinforcement-learning/images/reward-growth.png)
*Figure 4: Training curves showing QeRL achieves faster reward growth than 16-bit LoRA and QLoRA across multiple model sizes, demonstrating the benefit of quantization-enhanced exploration.*

---

### 4. How QeRL Works: The Key Components

QeRL is built on three main pillars to be both efficient and effective.

#### a) High-Performance Quantization (NVFP4 + Marlin Kernel)

Instead of the slow NF4 format from QLoRA, QeRL uses **NVFP4**, a modern 4-bit floating-point format that runs efficiently on NVIDIA Hopper (H100) and Blackwell (B200) GPUs via optimized kernels. All experiments in the paper were conducted on H100 GPUs.

* **Speed:** Combined with optimized kernels like **Marlin**, NVFP4 allows matrix multiplication to be performed directly on the 4-bit weights without slow de-quantization steps. This kernel and hardware efficiency is what makes the rollout phase **faster** than standard 16-bit training.
* **Memory:** It still provides the massive memory savings of 4-bit quantization, reducing the model's memory footprint by about 75%.

This combination solves the efficiency problem: you get both memory savings *and* a speedup.
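
For intuition, here is a simulation-only sketch of 4-bit block quantization in the spirit of NVFP4: weights are scaled per 16-element block and rounded to the nearest representable FP4 (E2M1) value. Real NVFP4 additionally encodes the block scales in FP8 and keeps the weights packed for the Marlin kernel, neither of which is modeled here:

```python
import torch

# Representable magnitudes of the 4-bit E2M1 format underlying NVFP4.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(w: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Round-trip (quantize + dequantize) simulation with per-block scales.

    Illustration only: real NVFP4 stores FP8 block scales and packed 4-bit
    weights that the Marlin kernel consumes directly, none of which is modeled.
    """
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()
    scaled = blocks / scale.clamp(min=1e-12)
    # Nearest representable FP4 magnitude, sign handled separately.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    dequant = FP4_GRID[idx] * scaled.sign() * scale
    return dequant.reshape(w.shape)

w = torch.randn(1024, 1024)
w_q = fake_quantize_fp4(w)
print("mean |error|:", (w - w_q).abs().mean().item())   # the quantization "noise"
print("weight bytes vs FP16: ~%d%%" % (100 * 4 // 16))  # ~25% (plus scale overhead)
```

The packed 4-bit weights take roughly a quarter of the bytes of their FP16 counterparts (plus a small overhead for the scales), which is where the roughly 75% memory reduction above comes from.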

#### b) Adaptive Quantization Noise (AQN)

The inherent noise from quantization is **static**—it doesn't change during training. However, the ideal exploration strategy in RL is **dynamic**: explore a lot at the beginning, then exploit the best-found strategies later on.

To solve this, QeRL introduces **Adaptive Quantization Noise (AQN)**:

1. **Dynamic Noise Injection:** QeRL periodically injects a small amount of additional, random noise into the model's parameters during training. This is added **on top of** the inherent quantization noise that is always present.
2. **Noise Scheduler:** This extra noise is not constant. It follows a decay schedule (e.g., exponential decay). It starts high to encourage broad exploration and gradually decreases over training steps, approaching zero by the end.

**Important:** At the end of training, the additional injected noise becomes negligible, but the **base quantization noise remains** (since the model is still quantized). This means QeRL always has more noise than standard 16-bit training, but less than the high-noise exploration phase at the start. This transforms the static quantization noise into a controllable, dynamic exploration tool perfectly suited for RL.
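
A minimal sketch of what such a decaying schedule might look like, assuming exponential interpolation between a start and end noise scale; the constants and the interval at which noise is re-sampled are illustrative, not the paper's exact settings:

```python
def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    """Exploration-noise scale at a given step, decayed exponentially.

    Illustrative constants: high early (explore), near zero late (exploit).
    """
    progress = step / max(total_steps, 1)
    return sigma_start * (sigma_end / sigma_start) ** progress

# The injected noise would be re-sampled periodically with the current sigma.
for step in range(0, 1001, 250):
    print(f"step {step:4d}: sigma = {aqn_sigma(step, 1000):.2e}")
```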

**Doesn't the noise hurt final performance?** Surprisingly, no! The paper shows that QeRL achieves **better** final accuracy than 16-bit LoRA (90.8% vs 88.1% on GSM8K) and even matches full-parameter fine-tuning (91.2%). The reason: the exploration benefits during training help the model discover superior reasoning strategies. Once these better strategies are learned through RL updates, they persist even with the remaining quantization noise. The small amount of noise at the end doesn't prevent the model from being confident in the good solutions it discovered—it just prevents overconfidence in suboptimal ones during training.

![Noise Schedule](/content/qerl-quantization-reinforcement-learning/images/noise-schedule.png)
*Figure 5: The Adaptive Quantization Noise (AQN) scheduler uses exponential decay to gradually reduce exploration noise during training, balancing exploration early on with exploitation later.*

#### c) Zero-Overhead Noise Merging

Adding noise vectors for every layer would consume extra memory and slow things down. QeRL uses a clever trick to avoid this. It **merges the noise vector into the scaling parameters of the RMSNorm (Root Mean Square Normalization) layers** that are already part of the LLM architecture.

**How it works:** RMSNorm has a learnable scaling parameter `w`. Instead of adding noise to weights directly, QeRL simply adds the noise to this parameter: `w_noise = w + Z_noise`. Since the normalized activations are multiplied by this scaling factor before being fed to the quantized weights, adding noise to `w` has the same exploration effect as adding noise to the weights themselves. This achieves the same effect but requires **zero extra parameters and minimal computational overhead**.
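
Below is a short sketch of this merging idea in PyTorch. The class and method names are made up for illustration; the point is that the exploration noise `Z_noise` lives in the existing RMSNorm scale vector (one value per channel), so no extra per-weight noise tensors are allocated:

```python
import torch

class RMSNormWithAQN(torch.nn.Module):
    """RMSNorm whose scale vector absorbs the exploration noise (w_noise = w + Z)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))   # the usual learnable w
        self.register_buffer("noise", torch.zeros(dim))     # Z_noise, never trained
        self.eps = eps

    def resample_noise(self, sigma: float) -> None:
        # Called periodically by the noise scheduler with a decaying sigma.
        self.noise = torch.randn_like(self.noise) * sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        # (w + Z) scales the normalized activations feeding the quantized layer,
        # which has the same exploration effect as perturbing its weights channel-wise.
        return x * rms * (self.weight + self.noise)

norm = RMSNormWithAQN(dim=4096)
norm.resample_noise(sigma=1e-2)
print(norm(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])
```

Calling `resample_noise` with the decaying sigma from the AQN schedule turns the otherwise static quantization noise into the dynamic exploration signal described above.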

![Noise Merge Diagram](/content/qerl-quantization-reinforcement-learning/images/noise-merge-diagram.png)
*Figure 6: Implementation detail showing how quantization noise is merged into layer normalization for zero-parameter overhead. This clever optimization maintains the benefits without additional memory cost.*

---

### 5. Key Experiments and Results

The paper demonstrates QeRL's superiority through extensive experiments on mathematical reasoning benchmarks (GSM8K, MATH).

* **Speed and Memory Efficiency:**
* QeRL provides over **1.5x end-to-end training speedup** compared to 16-bit LoRA and over 2x speedup compared to the slower QLoRA.
* It drastically reduces memory usage, enabling the **training of a 32B parameter model on a single 80GB H100 GPU**, a feat impossible with standard LoRA.

![Memory Comparison](/content/qerl-quantization-reinforcement-learning/images/memory-comparison.png)
*Figure 7: Training curves comparing QeRL (NVFP4-LoRA) with full-parameter training and 16-bit LoRA. QeRL achieves the same final reward as full-parameter training, which 16-bit LoRA fails to reach. While QeRL requires more training steps than full training to converge, each QeRL step is significantly faster, making it more efficient overall in wall-clock time.*

![Speed Comparison](/content/qerl-quantization-reinforcement-learning/images/speed-comparison.png)
*Figure 8: Adding AQN (Adaptive Quantization Noise) helps the model learn faster.*

* **Performance and Accuracy:**
* **Faster Reward Growth:** QeRL models achieve higher rewards much faster than 16-bit LoRA and QLoRA, thanks to the enhanced exploration.
* **Higher Final Accuracy:** On benchmarks like GSM8K, the QeRL-trained 7B model scored **90.8%**, outperforming both 16-bit LoRA (88.1%) and QLoRA (85.0%).
* **Matches Full Fine-Tuning:** Critically, QeRL's performance **matches that of full-parameter fine-tuning** (91.2%), which uses vastly more resources. This shows there is no accuracy trade-off for the massive efficiency gains.

![7B Model Results](/content/qerl-quantization-reinforcement-learning/images/7b-results.png)
*Figure 9: Performance comparison on Qwen2.5-7B across multiple mathematical reasoning benchmarks (GSM8K, MATH 500, AIME 24, AMC 23). QeRL consistently outperforms other parameter-efficient methods.*


### 6. Conclusion & Significance

**QeRL is a significant advancement for training LLMs with Reinforcement Learning.**

1. **It breaks the efficiency-performance trade-off.** It is the first framework that is simultaneously faster, more memory-efficient, *and* achieves better results than standard parameter-efficient methods like LoRA.
2. **It democratizes RL for LLMs.** By enabling the training of large models on single GPUs, it makes powerful RL techniques accessible to a much wider range of researchers and developers who lack access to massive supercomputers.
3. **It reframes quantization.** It shows that quantization is not just a compression tool but can be an integral part of the learning algorithm itself, providing a "free" and effective mechanism for exploration in RL.

---

## Key Takeaways

✅ **1.7× speedup** in RL rollout phase
✅ **3× memory reduction** (62GB → 20GB for 32B models)
✅ **Better accuracy** than 16-bit LoRA and QLoRA
✅ **Matches full fine-tuning** with fraction of resources
✅ **First single-GPU** solution for 32B model RL training
✅ **Quantization noise enhances** exploration (paradigm shift!)

**Resources:**
- 📄 [Read the Paper](https://arxiv.org/pdf/2510.11696)
- 💻 [GitHub Repository](https://github.com/NVlabs/QeRL)
- 🏢 Research by NVIDIA, MIT, The University of Hong Kong (HKU), and Tsinghua University (THU)
102 changes: 102 additions & 0 deletions scripts/QUICKSTART.md
@@ -0,0 +1,102 @@
# Quick Start Guide

## Convert Figures for a New Blog Post

### 1. Install Dependencies (One Time)

```bash
pip install PyMuPDF pillow
```

### 2. Prepare Your Paper Figures

Download paper source from arXiv (if available):

```bash
# Example for arXiv paper 2510.11696
cd public/content/your-paper-slug/
wget https://arxiv.org/e-print/2510.11696
tar -xzf 2510.11696
```

Or manually place PDF figures in:
```
public/content/your-paper-slug/figures/
```

### 3. Create Conversion Script

Copy the template:

```bash
cp scripts/examples/convert_template.py scripts/examples/convert_YOUR_PAPER.py
```

Edit the new file and update:

```python
# Change this
PAPER_SLUG = "your-paper-name"

# And map your figures
figure_mapping = {
"fig1.pdf": "architecture-diagram.png",
"fig2.pdf": "results-comparison.png",
# ... add all figures
}
```

### 4. Run Conversion

```bash
python scripts/examples/convert_YOUR_PAPER.py
```

### 5. Use in Blog Post

In your markdown file:

```markdown
![System Architecture](/content/your-paper-slug/images/architecture-diagram.png)
*Figure 1: Overview of the proposed system architecture.*
```

## Command Line Usage (Single Files)

Quick conversion of a single PDF:

```bash
# Basic conversion
python scripts/convert_pdf_figures.py input.pdf output.png

# High resolution
python scripts/convert_pdf_figures.py input.pdf output.png --dpi 600

# No cropping
python scripts/convert_pdf_figures.py input.pdf output.png --no-crop
```
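
For reference, a minimal sketch of what such a conversion script can look like, assuming the PyMuPDF (`fitz`) and Pillow APIs; the actual `scripts/convert_pdf_figures.py` may differ in structure and options:

```python
import sys

import fitz                      # PyMuPDF
from PIL import Image, ImageChops

def convert(pdf_path: str, png_path: str, dpi: int = 300, crop: bool = True) -> None:
    """Render the first page of a figure PDF to PNG, optionally trimming whitespace."""
    page = fitz.open(pdf_path)[0]
    zoom = dpi / 72              # PDF user space is 72 points per inch
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    pix.save(png_path)

    if crop:
        img = Image.open(png_path).convert("RGB")
        background = Image.new("RGB", img.size, (255, 255, 255))
        bbox = ImageChops.difference(img, background).getbbox()
        if bbox:                 # crop to the non-white bounding box
            img.crop(bbox).save(png_path)

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```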

## Common Issues

**Issue**: Images have too much whitespace
**Solution**: Check if `crop=True` in your script

**Issue**: Figures too small
**Solution**: Increase DPI (e.g., `dpi=600`)

**Issue**: Files too large
**Solution**: Decrease DPI (e.g., `dpi=150`) for simple charts

**Issue**: "Module not found"
**Solution**: `pip install PyMuPDF pillow`

## Examples

See working examples in `scripts/examples/`:
- `convert_qerl.py` - Full example with 17 figures
- `convert_template.py` - Template for new papers

## Need Help?

Check the full documentation: [scripts/README.md](README.md)
