VLM/LLM Infra Updating #40

## Background
UniRL currently focuses on diffusion models, but support for autoregressive (AR) and unified models is in high demand. We plan to quickly improve our infrastructure by adopting open-source performance-optimization techniques for AR models.

## Implementation Table

| Priority | # | Technique | Rating | Modalities | Benefit | Effort |
|---|---:|---|---|---|---|---|
| **P0** | 1 | AR rollout prefix-cache hits (RadixAttention / prefix cache) | ⚠️ | AR/VLM | High (long prompts + multi-sampling) | Small to medium (expose config; real hits need fanout/routing changes) |
|  | 2 | FSDP2 `forward_prefetch` communication/compute overlap | ✅ | All | Medium (needs benchmark) | Small (config switch + adjacent-block prefetch) |
|  | 3 | Enable and benchmark `torch.compile` | ⚠️ | All | Medium (already supported, needs measurement) | Very small (config) |
|  | 4 | Rollout engine tuning cleanup (SGLang arg correction + recommended defaults) | ⚠️ | AR | Low-medium | Small (allowlist/recipes/docs) |
| **P1** | 5 | Sequence-length load balancing (Karmarkar-Karp bucketing) | ✅ | AR/VLM | High (reduces DP stragglers, especially variable-length GRPO) | Medium (dispatch layer) |
|  | 6 | Dynamic batching by token budget (`max_token_len_per_gpu`) | ✅ | AR/VLM | High (memory + throughput) | Medium (train stack) |
| **P2** | 7 | Sequence packing / remove-padding (FlashAttention varlen) | ⚠️ | AR/VLM | High (removes padding waste) | Large (per-model replay changes) |
|  | 8 | Activation offload (FSDP) | ⚠️ | All | Medium (marginal gain after checkpointing) | Medium |
|  | 9 | LigerKernel / fused kernels (RMSNorm/SwiGLU/RoPE) | ⚠️ | AR/VLM | Medium (10-20% AR model speedup) | Medium (model side) |
| **P3** | 10 | Ulysses sequence parallelism (long context) | ⚠️ | AR | High (only when >32k context) | Large |
|  | 11 | Rollout/train pipelining + one-step off-policy | ⚠️ | All | High (GPU utilization) | Large |
|  | 12 | FP8 training (TransformerEngine) | ⚠️ | AR | Medium | Large |
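
To make items 5 and 6 more concrete, here is a minimal sketch of how sequence-length balancing and token-budget batching could fit together. It is illustrative only and not UniRL's actual dispatch code: the function names `karmarkar_karp_partition` and `split_by_token_budget` are hypothetical, and the batching step assumes the token cost of a micro-batch is the plain sum of its sequence lengths (i.e. padding-free packing). Only the name `max_token_len_per_gpu` comes from the table above.

```python
import heapq


def karmarkar_karp_partition(seq_lens, k):
    """Split sample indices into k buckets with near-equal total tokens.

    Multiway largest-differencing heuristic (Karmarkar-Karp). Hypothetical
    helper: bucket sizes (sample counts) are not forced to be equal.
    """
    if not seq_lens:
        return [[] for _ in range(k)]
    # Heap entry: (-spread, tiebreak, buckets); buckets is a list of
    # (token_sum, indices) pairs sorted by token_sum in descending order.
    heap = []
    for i, n in enumerate(seq_lens):
        buckets = [(n, [i])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-n, i, buckets))
    tiebreak = len(seq_lens)
    while len(heap) > 1:
        # Take the two partial partitions with the largest spread and merge
        # them, pairing the heaviest bucket of one with the lightest of the other.
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda t: t[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, tiebreak, merged))
        tiebreak += 1
    return [indices for _, indices in heap[0][2]]


def split_by_token_budget(seq_lens, indices, max_token_len_per_gpu):
    """Greedily cut one rank's samples into micro-batches under a token budget.

    Assumes padding-free packing, so a micro-batch costs sum(seq_lens).
    """
    batches, current, used = [], [], 0
    for i in indices:
        n = seq_lens[i]
        if n > max_token_len_per_gpu:
            raise ValueError(f"sample {i} ({n} tokens) exceeds the budget")
        if current and used + n > max_token_len_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    lens = [8, 7, 6, 5, 4]
    per_rank = karmarkar_karp_partition(lens, k=2)
    for rank, idx in enumerate(per_rank):
        micro = split_by_token_budget(lens, idx, max_token_len_per_gpu=12)
        print(rank, sum(lens[i] for i in idx), micro)
```

In this sketch the partition step decides which samples go to which data-parallel rank, and the budget step then decides how each rank groups its samples into micro-batches. A real integration would also have to rescale the loss, since micro-batches no longer contain a fixed number of samples.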
