Add multi-LoRA MoE inference support#3

Open
kiddyboots216 wants to merge 1 commit into `main` from `add-moe-lora-inference`
Conversation

@kiddyboots216
Contributor

Summary

  • Adds multi-adapter LoRA inference for Mixture-of-Experts models, enabling concurrent serving of multiple MoE LoRA adapters in a single batch
  • Implements full MoE LoRA forward pass in the unquantized Triton backend using CSGMV kernels for gate_up_proj and down_proj shrink/expand
  • Adds per-expert LoRA weight buffers (A/B matrices + presence masks) in the memory pool, per-token adapter index tracking in the chunked SGMV backend, and CUDA graph bypass when MoE LoRA adapters are active
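To make the summary concrete, here is a minimal, non-kernel reference sketch of the per-expert LoRA correction the CSGMV kernels compute in batch. All names, shapes, and the top-1 routing simplification are illustrative assumptions, not the PR's actual API; the real implementation fuses the shrink/expand steps into Triton kernels.

```python
import numpy as np

def moe_lora_delta(x, topk_ids, adapter_ids, lora_a, lora_b, present, scaling=1.0):
    """Reference sketch (hypothetical shapes) of per-expert LoRA deltas.

    x:           (num_tokens, hidden)                        token activations
    topk_ids:    (num_tokens,)                               expert per token (top-1 here)
    adapter_ids: (num_tokens,)                               LoRA adapter per token
    lora_a:      (num_adapters, num_experts, rank, hidden)   per-expert A matrices
    lora_b:      (num_adapters, num_experts, out_dim, rank)  per-expert B matrices
    present:     (num_adapters, num_experts) bool            presence mask: did this
                                                             adapter train this expert?
    """
    num_tokens = x.shape[0]
    out_dim = lora_b.shape[2]
    delta = np.zeros((num_tokens, out_dim), dtype=x.dtype)
    for t in range(num_tokens):
        a, e = adapter_ids[t], topk_ids[t]
        if a >= 0 and present[a, e]:
            # shrink (A), then expand (B) -- the two CSGMV passes, done per token here
            delta[t] = scaling * (lora_b[a, e] @ (lora_a[a, e] @ x[t]))
    return delta
```

Tokens whose adapter did not train the routed expert (mask is `False`) contribute no delta, which is why the memory pool carries presence masks alongside the A/B buffers.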

Changes

| File | Description |
| --- | --- |
| `lora/moe.py` | New module: MoE LoRA constants, weight normalization, chunked compound segment builder |
| `lora/lora.py` | Load and normalize MoE weights in LoRA adapters |
| `lora/mem_pool.py` | Allocate/load/evict per-expert LoRA weight buffers |
| `lora/lora_manager.py` | Wire up MoE modules, set MoE LoRA info per batch |
| `lora/utils.py` | Per-token adapter index expansion, MoE target module normalization |
| `lora/backend/chunked_backend.py` | Token-level weight indices for the CSGMV backend |
| `layers/moe/fused_moe_triton/layer.py` | MoE LoRA hooks on the FusedMoE layer |
| `layers/quantization/unquant.py` | Full MoE LoRA forward pass (gate_up + down, with validation) |
| `model_executor/cuda_graph_runner.py` | Disable CUDA graph for MoE LoRA batches |
| `model_executor/piecewise_cuda_graph_runner.py` | Same for the piecewise runner |
| `test/registered/lora/test_moe_lora_utils.py` | Unit tests for MoE LoRA utility functions |
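The `lora/utils.py` entry mentions per-token adapter index expansion. A hedged sketch of what that step likely looks like (the function name and signature are illustrative, not the file's actual API): each request in a batch carries one adapter index, which must be broadcast to every token of that request before the token-level CSGMV dispatch.

```python
def expand_adapter_indices(req_adapter_ids, seq_lens):
    """Hypothetical sketch: expand one adapter index per request into one
    index per token, given each request's token count."""
    per_token = []
    for adapter_id, num_tokens in zip(req_adapter_ids, seq_lens):
        # every token in this request uses the request's adapter
        per_token.extend([adapter_id] * num_tokens)
    return per_token
```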

Limitations

  • Only supports unquantized Triton MoE backend (not triton_kernels, FlashInfer CUTLASS, or TRT-LLM)
  • Requires --lora-backend csgmv
  • Supports only TP=1 and EP=1 (no tensor or expert parallelism)
  • CUDA graph is bypassed when MoE LoRA adapters are active
  • Does not support fused shared experts or no_combine=True
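The CUDA graph limitation amounts to a runtime predicate: if any active adapter in the batch carries per-expert (MoE) weights, graph replay is skipped and the batch runs eagerly. A hedged sketch of such a check, with an assumed dict-of-dicts layout for loaded adapter weights (not the runner's real data structure):

```python
def moe_lora_active(adapter_weights_by_uid):
    """Hypothetical check mirroring the CUDA-graph bypass condition:
    treat any weight whose module path contains 'experts' as MoE LoRA.

    adapter_weights_by_uid: {adapter_uid: {module_name: weight, ...}, ...}
    """
    return any(
        "experts" in module_name
        for weights in adapter_weights_by_uid.values()
        for module_name in weights
    )
```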

Implement multi-adapter LoRA inference for Mixture-of-Experts models,
enabling concurrent serving of multiple MoE LoRA adapters in a single
batch. Key changes:

- New MoE LoRA module with weight normalization, segmented dispatch,
  and chunked compound segment builder for expert-adapter grouping
- Full forward pass for MoE LoRA in the unquantized Triton backend
  (gate_up_proj shrink/expand + down_proj shrink/expand with CSGMV)
- Memory pool buffers for per-expert LoRA A/B weights and presence masks
- Per-token adapter index tracking in the chunked SGMV backend
- CUDA graph bypass when MoE LoRA adapters are active in a batch
- Unit tests for MoE LoRA utility functions
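The gate_up + down forward pass described above can be sketched for a single expert as follows. This is a plain-NumPy illustration under assumed shapes and a SiLU-gated MLP; the PR's actual path batches the LoRA shrink/expand across tokens with CSGMV kernels rather than computing them per expert like this.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_expert_forward_with_lora(x, w_gate_up, w_down,
                                 a_gu, b_gu, a_d, b_d, scaling=1.0):
    """Sketch of one expert's MLP with LoRA deltas on both projections.

    w_gate_up: (2 * inter, hidden)   fused gate/up base weight
    w_down:    (hidden, inter)       down base weight
    a_*, b_*:  LoRA A (rank, in) and B (out, rank) for each projection
    All names are illustrative, not the PR's API.
    """
    # gate_up_proj: base matmul plus LoRA shrink (A) then expand (B)
    gate_up = x @ w_gate_up.T + scaling * ((x @ a_gu.T) @ b_gu.T)
    gate, up = np.split(gate_up, 2, axis=-1)
    h = silu(gate) * up
    # down_proj: same base-plus-LoRA pattern on the activated hidden state
    return h @ w_down.T + scaling * ((h @ a_d.T) @ b_d.T)
```

With zero LoRA weights this reduces to the base expert MLP, which is a handy sanity check when validating the kernel path.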