Background
UniRL currently supports mainly diffusion models, but support for AR models and unified models is in high demand. We plan to quickly improve our infrastructure by adopting open-source AR performance-enhancement techniques.
Implementation Table
| Priority | # | Tech | Rating | Modalities | Benefit | Effort |
| --- | --- | --- | --- | --- | --- | --- |
| P0 | 1 | AR rollout prefix-cache hits (RadixAttention / prefix cache) | ⚠️ | AR/VLM | High (long prompts + multi-sampling) | Small to medium (expose config; real hits need fanout/routing changes) |
| | 2 | FSDP2 `forward_prefetch` communication/compute overlap | ✅ | All | Medium (needs benchmark) | Small (config switch + adjacent-block prefetch) |
| | 3 | Enable and benchmark `torch.compile` | ⚠️ | All | Medium (already supported, needs measurement) | Very small (config) |
| | 4 | Rollout engine tuning cleanup (SGLang arg correction + recommended defaults) | ⚠️ | AR | Low-medium | Small (allowlist/recipes/docs) |
| P1 | 5 | Sequence-length load balancing (Karmarkar-Karp bucketing) | ✅ | AR/VLM | High (reduces DP stragglers, especially variable-length GRPO) | Medium (dispatch layer) |
| | 6 | Dynamic batching by token budget (`max_token_len_per_gpu`) | ✅ | AR/VLM | High (memory + throughput) | Medium (train stack) |
| P2 | 7 | Sequence packing / remove-padding (FlashAttention varlen) | ⚠️ | AR/VLM | High (removes padding waste) | Large (per-model replay changes) |
| | 8 | Activation offload (FSDP) | ⚠️ | All | Medium (marginal gain after checkpointing) | Medium |
| | 9 | LigerKernel / fused kernels (RMSNorm/SwiGLU/RoPE) | ⚠️ | AR/VLM | Medium (10-20% AR model speedup) | Medium (model side) |
| P3 | 10 | Ulysses sequence parallelism (long context) | ⚠️ | AR | High (only when >32k context) | Large |
| | 11 | Rollout/train pipelining + one-step off-policy | ⚠️ | All | High (GPU utilization) | Large |
| | 12 | FP8 training (TransformerEngine) | ⚠️ | AR | Medium | Large |
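
To make items 5 and 6 concrete, here is a minimal sketch of how sequence-length balancing and token-budget batching could fit together in a dispatch layer. It is not UniRL's implementation. The function names `balance_across_ranks` and `split_by_token_budget` are hypothetical, and the balancing step uses a simple greedy longest-first heuristic as a stand-in for Karmarkar-Karp, which usually yields tighter partitions. Only `max_token_len_per_gpu` comes from the table above.

```python
import heapq
from typing import List


def balance_across_ranks(seq_lens: List[int], num_ranks: int) -> List[List[int]]:
    """Assign sequence indices to DP ranks so per-rank token totals stay close.

    Greedy longest-first heuristic (a stand-in for Karmarkar-Karp):
    visit sequences from longest to shortest and give each one to the
    rank that currently holds the fewest tokens.
    """
    # Min-heap of (total_tokens_on_rank, rank_id).
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_ranks)]

    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], rank))
    return assignment


def split_by_token_budget(
    seq_lens: List[int], indices: List[int], max_token_len_per_gpu: int
) -> List[List[int]]:
    """Split one rank's sequences into micro-batches capped by a token budget.

    Each micro-batch holds as many consecutive sequences as fit without the
    summed token count exceeding max_token_len_per_gpu.
    """
    micro_batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx in indices:
        n = seq_lens[idx]
        if n > max_token_len_per_gpu:
            raise ValueError(
                f"sequence {idx} has {n} tokens, above the budget "
                f"of {max_token_len_per_gpu}"
            )
        # Close the current micro-batch if this sequence would overflow it.
        if current and current_tokens + n > max_token_len_per_gpu:
            micro_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        micro_batches.append(current)
    return micro_batches


# Example: eight variable-length rollouts spread over two DP ranks.
seq_lens = [512, 64, 2048, 128, 1024, 256, 1536, 96]
per_rank = balance_across_ranks(seq_lens, num_ranks=2)
per_rank_batches = [
    split_by_token_budget(seq_lens, idxs, max_token_len_per_gpu=2048)
    for idxs in per_rank
]
```

Balancing first and batching second keeps the two concerns separate: the first step targets DP stragglers, the second bounds per-step memory. The number of micro-batches can differ between ranks, so a real training stack would also need to equalize step counts across ranks before collective operations.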