VLM/LLM Infra Updating #40

@CjhHa1

Background

UniRL currently supports mainly diffusion models, but support for autoregressive (AR) models and unified models is in high demand. We plan to quickly improve our infrastructure by adopting open-source AR performance-enhancement techniques.

Implementation Table

| Priority | # | Technique | Rating | Modalities | Benefit | Effort |
|---|---|---|---|---|---|---|
| P0 | 1 | AR rollout prefix-cache hits (RadixAttention / prefix cache) | ⚠️ | AR/VLM | High (long prompts + multi-sampling) | Small to medium (expose config; real hits need fanout/routing changes) |
| | 2 | FSDP2 `forward_prefetch` communication/compute overlap | | All | Medium (needs benchmark) | Small (config switch + adjacent-block prefetch) |
| | 3 | Enable and benchmark `torch.compile` | ⚠️ | All | Medium (already supported, needs measurement) | Very small (config) |
| | 4 | Rollout engine tuning cleanup (SGLang arg correction + recommended defaults) | ⚠️ | AR | Low-medium | Small (allowlist/recipes/docs) |
| P1 | 5 | Sequence-length load balancing (Karmarkar-Karp bucketing) | | AR/VLM | High (reduces DP stragglers, especially variable-length GRPO) | Medium (dispatch layer) |
| | 6 | Dynamic batching by token budget (`max_token_len_per_gpu`) | | AR/VLM | High (memory + throughput) | Medium (train stack) |
| P2 | 7 | Sequence packing / remove-padding (FlashAttention varlen) | ⚠️ | AR/VLM | High (removes padding waste) | Large (per-model replay changes) |
| | 8 | Activation offload (FSDP) | ⚠️ | All | Medium (marginal gain after checkpointing) | Medium |
| | 9 | LigerKernel / fused kernels (RMSNorm/SwiGLU/RoPE) | ⚠️ | AR/VLM | Medium (10-20% AR model speedup) | Medium (model side) |
| P3 | 10 | Ulysses sequence parallelism (long context) | ⚠️ | AR | High (only when >32k context) | Large |
| | 11 | Rollout/train pipelining + one-step off-policy | ⚠️ | All | High (GPU utilization) | Large |
| | 12 | FP8 training (TransformerEngine) | ⚠️ | AR | Medium | Large |
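
To make item 5 (sequence-length load balancing) more concrete, here is a rough sketch of Karmarkar-Karp style bucketing: it splits a batch of sequences into `k` groups (one per data-parallel rank) so that the total token count per group is roughly equal. This is only an illustration and not UniRL code; the function name `karmarkar_karp_partition` and its arguments are hypothetical, and the real change would live in the dispatch layer.

```python
import heapq
from itertools import count


def karmarkar_karp_partition(seq_lens, k):
    """Split sequence indices into k groups with near-equal total length.

    Hypothetical helper (not an existing UniRL API). Uses the k-way
    largest-differencing heuristic: repeatedly merge the two partial
    partitions with the largest spread, pairing the heaviest bin of one
    with the lightest bin of the other.
    """
    if k < 1:
        raise ValueError("k must be >= 1")

    tie = count()  # unique tiebreaker so the heap never compares bin lists
    heap = []
    for idx, n in enumerate(seq_lens):
        # Each state is k bins sorted by descending total; one sequence to start.
        bins = [(n, [idx])] + [(0, []) for _ in range(k - 1)]
        spread = bins[0][0] - bins[-1][0]
        heapq.heappush(heap, (-spread, next(tie), bins))

    if not heap:
        return [[] for _ in range(k)]

    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)  # largest spread
        _, _, b = heapq.heappop(heap)  # second-largest spread
        # Heaviest bin of `a` is combined with the lightest bin of `b`, and so on.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda t: t[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))

    return [ids for _, ids in heap[0][2]]
```

For example, with lengths `[8, 7, 6, 5, 4]` and `k=2` this produces groups whose totals are 14 and 16 tokens. It is a heuristic, so the split is not guaranteed to be optimal, but the gap between the heaviest and lightest group never exceeds the longest single sequence, which is the property that matters for reducing DP stragglers.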