Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Activated LoRA (aLoRA) is a new family of LoRA adapters that are invoked by including an invocation string in the prompt; the weights are adapted only for the tokens that appear after this invocation string (think of it as an "adapter generation prompt"). All prior context tokens are processed with the unadapted base weights, which allows the inference engine to reuse any existing KV cache from prior base-model inference calls. This means an aLoRA adapter can be applied deep in a multi-turn interaction without recomputing the entire KV cache: the adapter uses the base model's KV cache right up to the point where it is invoked, significantly reducing time to first token (TTFT).
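To make the per-token rule concrete, here is a minimal, self-contained sketch in plain C++ (illustrative only, not llama.cpp code; the function name and layout are assumptions) of a single linear projection under aLoRA: tokens before the invocation string receive only the base projection W x, while tokens from the invocation string onward also receive the LoRA delta s * B (A x).

```cpp
#include <vector>

// Illustrative only (not llama.cpp code). One linear projection under aLoRA:
//   y = W x                  for tokens before the invocation string
//   y = W x + s * B (A x)    for tokens from the invocation string onward
// W: d_out x d_in, A: r x d_in, B: d_out x r, all row-major; s is the LoRA scale.
std::vector<float> alora_project(const std::vector<float> & x,
                                 const std::vector<float> & W,
                                 const std::vector<float> & A,
                                 const std::vector<float> & B,
                                 int d_in, int d_out, int r,
                                 float s, bool adapted) {
    std::vector<float> y(d_out, 0.0f);
    for (int o = 0; o < d_out; ++o) {
        for (int i = 0; i < d_in; ++i) {
            y[o] += W[o * d_in + i] * x[i];
        }
    }
    if (!adapted) {
        // Prefix token: identical to the base model, so its KV-cache entries
        // can be reused from a previous base-model call.
        return y;
    }
    std::vector<float> ax(r, 0.0f);
    for (int k = 0; k < r; ++k) {
        for (int i = 0; i < d_in; ++i) {
            ax[k] += A[k * d_in + i] * x[i];
        }
    }
    for (int o = 0; o < d_out; ++o) {
        for (int k = 0; k < r; ++k) {
            y[o] += s * B[o * r + k] * ax[k];
        }
    }
    return y;
}
```

The `adapted` flag is decided per token from its position relative to the invocation string, which is exactly what allows prefix tokens to produce KV entries identical to a base-model run.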
This issue proposes supporting inference for aLoRA adapters trained with (or at least following the same format as) Huggingface PEFT (active PR linked below). It complements the support being built out in vLLM (also an active PR, linked below).
paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora
Huggingface PEFT PR: huggingface/peft#2609
vLLM PR: vllm-project/vllm#19710
Results from the paper: (figure not reproduced here; see the paper linked above.)
Motivation
Activated LoRA enables up to 10-30X faster inference than standard LoRA when adapters are used for specialized turns in a multi-turn LLM use case with prefix caching. This is particularly relevant for agentic and RAG applications (see the paper for extensive examples and further discussion of use cases), since key skills can be weight-tuned with (almost) no inference-latency downside relative to prompt optimization.
Inference support in llama.cpp is important to enable local inference for aLoRA adapters (typically trained using the HF PEFT PR), complementing the vLLM inference support in progress.
Possible Implementation
Modify the current support for hot-swapping LoRA adapters to recognize Activated LoRA adapters. Adjust the inference code so that the adapter weights are applied only to tokens at or after the invocation string. Ensure that prefix caching correctly tracks which substrings were computed with which weights (base model vs. adapter).
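A rough sketch of the token-level bookkeeping this would involve is below. It uses hypothetical helper names rather than the existing llama.cpp adapter API, and assumes the adapter provides its invocation token sequence in some form (how it is stored in the adapter file is a detail of the PEFT format).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not llama.cpp API): find the first position at which
// the adapter's invocation token sequence occurs in the prompt, or -1 if absent.
static int find_invocation_start(const std::vector<int> & tokens,
                                 const std::vector<int> & invocation_tokens) {
    if (invocation_tokens.empty() || tokens.size() < invocation_tokens.size()) {
        return -1;
    }
    for (size_t i = 0; i + invocation_tokens.size() <= tokens.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < invocation_tokens.size(); ++j) {
            if (tokens[i + j] != invocation_tokens[j]) { match = false; break; }
        }
        if (match) {
            return (int) i;
        }
    }
    return -1;
}

// Hypothetical per-token adapter scale: 0 before the invocation string, the
// full adapter scale from its first token onward. Tokens with scale 0 are
// processed exactly as the base model would process them, so an existing
// base-model prefix cache covering those positions remains valid.
// Note: whether the invocation tokens themselves are adapted is a convention
// of the adapter format and must match how the adapter was trained.
static std::vector<float> alora_token_scales(const std::vector<int> & tokens,
                                             const std::vector<int> & invocation_tokens,
                                             float adapter_scale) {
    std::vector<float> scales(tokens.size(), 0.0f);
    const int start = find_invocation_start(tokens, invocation_tokens);
    if (start < 0) {
        return scales; // no invocation string: behave like the plain base model
    }
    for (size_t i = (size_t) start; i < tokens.size(); ++i) {
        scales[i] = adapter_scale;
    }
    return scales;
}
```

With per-token scales like these, everything before the invocation start can be served from an existing base-model prefix cache, and the compute graph only needs to apply the LoRA delta to the remaining tokens in the batch.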