Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Activated LoRA (aLoRA) is a new family of LoRA adapters that are invoked by including an invocation string in the prompt; the weights are adapted only for the tokens that appear after this invocation string (think of it as an "adapter generation prompt"). All prior context tokens are processed with the unadapted base weights, which allows the inference engine to reuse any existing KV cache from prior base-model inference calls. This means an aLoRA adapter can be applied deep in a multi-turn interaction without recomputing the entire KV cache: the adapter uses the base model's KV cache right up to the point where it is invoked, significantly reducing time to first token (TTFT).
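To make the per-token rule concrete, here is a minimal, self-contained sketch in plain C++ (illustrative only, not llama.cpp code; the function name and layout are assumptions) of a single linear projection under aLoRA: tokens before the invocation string receive only the base projection W x, while tokens from the invocation string onward also receive the LoRA delta s * B (A x).

```cpp
#include <vector>

// Illustrative only (not llama.cpp code). One linear projection under aLoRA:
//   y = W x                  for tokens before the invocation string
//   y = W x + s * B (A x)    for tokens from the invocation string onward
// W: d_out x d_in, A: r x d_in, B: d_out x r, all row-major; s is the LoRA scale.
std::vector<float> alora_project(const std::vector<float> & x,
                                 const std::vector<float> & W,
                                 const std::vector<float> & A,
                                 const std::vector<float> & B,
                                 int d_in, int d_out, int r,
                                 float s, bool adapted) {
    std::vector<float> y(d_out, 0.0f);
    for (int o = 0; o < d_out; ++o) {
        for (int i = 0; i < d_in; ++i) {
            y[o] += W[o * d_in + i] * x[i];
        }
    }
    if (!adapted) {
        // Prefix token: identical to the base model, so its KV-cache entries
        // can be reused from a previous base-model call.
        return y;
    }
    std::vector<float> ax(r, 0.0f);
    for (int k = 0; k < r; ++k) {
        for (int i = 0; i < d_in; ++i) {
            ax[k] += A[k * d_in + i] * x[i];
        }
    }
    for (int o = 0; o < d_out; ++o) {
        for (int k = 0; k < r; ++k) {
            y[o] += s * B[o * r + k] * ax[k];
        }
    }
    return y;
}
```

The `adapted` flag is decided per token from its position relative to the invocation string, which is exactly what allows prefix tokens to produce KV entries identical to a base-model run.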
This issue proposes supporting inference for aLoRA adapters trained with (or at least following the same format as) Huggingface PEFT (active PR linked below). It complements the support being built out in vLLM (also an active PR, linked below).
paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora
Huggingface PEFT PR: huggingface/peft#2609
vLLM PR: vllm-project/vllm#19710
Results from the paper: (figure not reproduced here; see the paper linked above.)
Motivation
Activated LoRA enables up to 10-30X faster inference than standard LoRA when adapters are used for specialized turns in a multi-turn LLM use case with prefix caching. This is particularly relevant for agentic and RAG applications (see the paper for extensive examples and further discussion of use cases), since key skills can be weight-tuned with (almost) no inference-latency downside relative to prompt optimization.
Inference support in llama.cpp is important to enable local inference for aLoRA adapters (typically trained using the HF PEFT PR), complementing the vLLM inference support in progress.
Possible Implementation
Modify the current support for hot-swapping LoRA adapters to recognize Activated LoRA adapters. Adjust the inference code so that the adapter weights are applied only to tokens at or after the invocation string. Ensure that prefix caching correctly tracks which substrings were computed with which weights (base model vs. adapter).
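A rough sketch of the token-level bookkeeping this would involve is below. It uses hypothetical helper names rather than the existing llama.cpp adapter API, and assumes the adapter provides its invocation token sequence in some form (how it is stored in the adapter file is a detail of the PEFT format).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not llama.cpp API): find the first position at which
// the adapter's invocation token sequence occurs in the prompt, or -1 if absent.
static int find_invocation_start(const std::vector<int> & tokens,
                                 const std::vector<int> & invocation_tokens) {
    if (invocation_tokens.empty() || tokens.size() < invocation_tokens.size()) {
        return -1;
    }
    for (size_t i = 0; i + invocation_tokens.size() <= tokens.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < invocation_tokens.size(); ++j) {
            if (tokens[i + j] != invocation_tokens[j]) { match = false; break; }
        }
        if (match) {
            return (int) i;
        }
    }
    return -1;
}

// Hypothetical per-token adapter scale: 0 before the invocation string, the
// full adapter scale from its first token onward. Tokens with scale 0 are
// processed exactly as the base model would process them, so an existing
// base-model prefix cache covering those positions remains valid.
// Note: whether the invocation tokens themselves are adapted is a convention
// of the adapter format and must match how the adapter was trained.
static std::vector<float> alora_token_scales(const std::vector<int> & tokens,
                                             const std::vector<int> & invocation_tokens,
                                             float adapter_scale) {
    std::vector<float> scales(tokens.size(), 0.0f);
    const int start = find_invocation_start(tokens, invocation_tokens);
    if (start < 0) {
        return scales; // no invocation string: behave like the plain base model
    }
    for (size_t i = (size_t) start; i < tokens.size(); ++i) {
        scales[i] = adapter_scale;
    }
    return scales;
}
```

With per-token scales like these, everything before the invocation start can be served from an existing base-model prefix cache, and the compute graph only needs to apply the LoRA delta to the remaining tokens in the batch.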