Add multi-LoRA MoE inference support#3

Open
kiddyboots216 wants to merge 1 commit into `main` from `add-moe-lora-inference`
Conversation

@kiddyboots216
Contributor

Summary

  • Adds multi-adapter LoRA inference for Mixture-of-Experts models, enabling concurrent serving of multiple MoE LoRA adapters in a single batch
  • Implements full MoE LoRA forward pass in the unquantized Triton backend using CSGMV kernels for gate_up_proj and down_proj shrink/expand
  • Adds per-expert LoRA weight buffers (A/B matrices + presence masks) in the memory pool, per-token adapter index tracking in the chunked SGMV backend, and CUDA graph bypass when MoE LoRA adapters are active
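To make the summary concrete, here is a minimal, non-kernel reference sketch of the per-expert LoRA correction the CSGMV kernels compute in batch. All names, shapes, and the top-1 routing simplification are illustrative assumptions, not the PR's actual API; the real implementation fuses the shrink/expand steps into Triton kernels.

```python
import numpy as np

def moe_lora_delta(x, topk_ids, adapter_ids, lora_a, lora_b, present, scaling=1.0):
    """Reference sketch (hypothetical shapes) of per-expert LoRA deltas.

    x:           (num_tokens, hidden)                        token activations
    topk_ids:    (num_tokens,)                               expert per token (top-1 here)
    adapter_ids: (num_tokens,)                               LoRA adapter per token
    lora_a:      (num_adapters, num_experts, rank, hidden)   per-expert A matrices
    lora_b:      (num_adapters, num_experts, out_dim, rank)  per-expert B matrices
    present:     (num_adapters, num_experts) bool            presence mask: did this
                                                             adapter train this expert?
    """
    num_tokens = x.shape[0]
    out_dim = lora_b.shape[2]
    delta = np.zeros((num_tokens, out_dim), dtype=x.dtype)
    for t in range(num_tokens):
        a, e = adapter_ids[t], topk_ids[t]
        if a >= 0 and present[a, e]:
            # shrink (A), then expand (B) -- the two CSGMV passes, done per token here
            delta[t] = scaling * (lora_b[a, e] @ (lora_a[a, e] @ x[t]))
    return delta
```

Tokens whose adapter did not train the routed expert (mask is `False`) contribute no delta, which is why the memory pool carries presence masks alongside the A/B buffers.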

Changes

| File | Description |
| --- | --- |
| `lora/moe.py` | New module: MoE LoRA constants, weight normalization, chunked compound segment builder |
| `lora/lora.py` | Load and normalize MoE weights in LoRA adapters |
| `lora/mem_pool.py` | Allocate/load/evict per-expert LoRA weight buffers |
| `lora/lora_manager.py` | Wire up MoE modules, set MoE LoRA info per batch |
| `lora/utils.py` | Per-token adapter index expansion, MoE target module normalization |
| `lora/backend/chunked_backend.py` | Token-level weight indices for the CSGMV backend |
| `layers/moe/fused_moe_triton/layer.py` | MoE LoRA hooks on the FusedMoE layer |
| `layers/quantization/unquant.py` | Full MoE LoRA forward pass (gate_up + down, with validation) |
| `model_executor/cuda_graph_runner.py` | Disable CUDA graph for MoE LoRA batches |
| `model_executor/piecewise_cuda_graph_runner.py` | Same for the piecewise runner |
| `test/registered/lora/test_moe_lora_utils.py` | Unit tests for MoE LoRA utility functions |
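The `lora/utils.py` entry mentions per-token adapter index expansion. A hedged sketch of what that step likely looks like (the function name and signature are illustrative, not the file's actual API): each request in a batch carries one adapter index, which must be broadcast to every token of that request before the token-level CSGMV dispatch.

```python
def expand_adapter_indices(req_adapter_ids, seq_lens):
    """Hypothetical sketch: expand one adapter index per request into one
    index per token, given each request's token count."""
    per_token = []
    for adapter_id, num_tokens in zip(req_adapter_ids, seq_lens):
        # every token in this request uses the request's adapter
        per_token.extend([adapter_id] * num_tokens)
    return per_token
```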

Limitations

  • Only supports unquantized Triton MoE backend (not triton_kernels, FlashInfer CUTLASS, or TRT-LLM)
  • Requires --lora-backend csgmv
  • Supports only TP=1 and EP=1 (no tensor or expert parallelism)
  • CUDA graph is bypassed when MoE LoRA adapters are active
  • Does not support fused shared experts or no_combine=True
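The CUDA graph limitation amounts to a runtime predicate: if any active adapter in the batch carries per-expert (MoE) weights, graph replay is skipped and the batch runs eagerly. A hedged sketch of such a check, with an assumed dict-of-dicts layout for loaded adapter weights (not the runner's real data structure):

```python
def moe_lora_active(adapter_weights_by_uid):
    """Hypothetical check mirroring the CUDA-graph bypass condition:
    treat any weight whose module path contains 'experts' as MoE LoRA.

    adapter_weights_by_uid: {adapter_uid: {module_name: weight, ...}, ...}
    """
    return any(
        "experts" in module_name
        for weights in adapter_weights_by_uid.values()
        for module_name in weights
    )
```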

Implement multi-adapter LoRA inference for Mixture-of-Experts models,
enabling concurrent serving of multiple MoE LoRA adapters in a single
batch. Key changes:

- New MoE LoRA module with weight normalization, segmented dispatch,
  and chunked compound segment builder for expert-adapter grouping
- Full forward pass for MoE LoRA in the unquantized Triton backend
  (gate_up_proj shrink/expand + down_proj shrink/expand with CSGMV)
- Memory pool buffers for per-expert LoRA A/B weights and presence masks
- Per-token adapter index tracking in the chunked SGMV backend
- CUDA graph bypass when MoE LoRA adapters are active in a batch
- Unit tests for MoE LoRA utility functions
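The gate_up + down forward pass described above can be sketched for a single expert as follows. This is a plain-NumPy illustration under assumed shapes and a SiLU-gated MLP; the PR's actual path batches the LoRA shrink/expand across tokens with CSGMV kernels rather than computing them per expert like this.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_expert_forward_with_lora(x, w_gate_up, w_down,
                                 a_gu, b_gu, a_d, b_d, scaling=1.0):
    """Sketch of one expert's MLP with LoRA deltas on both projections.

    w_gate_up: (2 * inter, hidden)   fused gate/up base weight
    w_down:    (hidden, inter)       down base weight
    a_*, b_*:  LoRA A (rank, in) and B (out, rank) for each projection
    All names are illustrative, not the PR's API.
    """
    # gate_up_proj: base matmul plus LoRA shrink (A) then expand (B)
    gate_up = x @ w_gate_up.T + scaling * ((x @ a_gu.T) @ b_gu.T)
    gate, up = np.split(gate_up, 2, axis=-1)
    h = silu(gate) * up
    # down_proj: same base-plus-LoRA pattern on the activated hidden state
    return h @ w_down.T + scaling * ((h @ a_d.T) @ b_d.T)
```

With zero LoRA weights this reduces to the base expert MLP, which is a handy sanity check when validating the kernel path.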