Feature Request: Support for Activated LoRA #15212

@kgreenewald

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Activated LoRA (aLoRA) is a new family of LoRA adapters that are invoked by including an invocation string in the prompt; the weights are adapted only for the tokens that come after the invocation string appears (think of it as an "adapter generation prompt"). All prior context tokens pass through unadapted weights in the transformer, allowing the inference engine to reuse any existing KV cache from prior base-model inference calls. This means an aLoRA adapter can be applied deep in a multi-turn interaction with the model without recomputing the entire KV cache: the adapter can use the base model's KV cache right up until the point where it is invoked, significantly reducing time to first token (TTFT).
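To make the activation rule concrete, here is a minimal, self-contained sketch (not llama.cpp code; names like `alora_proj` and `activation_start` are hypothetical) of a single linear projection where the low-rank delta is added only for tokens at or after the activation offset, so earlier positions produce exactly the base-model activations and their KV cache entries stay valid:

```cpp
// Sketch only: per-token activated LoRA on one linear projection.
// Tokens before `activation_start` use the base weight W alone, so their
// outputs (and KV cache entries) are identical to a base-model run.
#include <cstddef>
#include <vector>

struct lora_weights {
    std::vector<float> A;   // r x d_in,  row-major
    std::vector<float> B;   // d_out x r, row-major
    size_t r = 0;
    float  scale = 1.0f;    // alpha / r
};

// y = W x, plus scale * B (A x) only when token_pos >= activation_start
static void alora_proj(const std::vector<float> & W,   // d_out x d_in, row-major
                       const lora_weights & lora,
                       const float * x, float * y,
                       size_t d_in, size_t d_out,
                       size_t token_pos, size_t activation_start) {
    // base projection: y = W x
    for (size_t o = 0; o < d_out; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < d_in; ++i) {
            acc += W[o*d_in + i] * x[i];
        }
        y[o] = acc;
    }

    if (token_pos < activation_start) {
        return; // base weights only: existing base-model KV cache for this prefix stays valid
    }

    // low-rank update: y += scale * B (A x)
    std::vector<float> ax(lora.r, 0.0f);
    for (size_t j = 0; j < lora.r; ++j) {
        for (size_t i = 0; i < d_in; ++i) {
            ax[j] += lora.A[j*d_in + i] * x[i];
        }
    }
    for (size_t o = 0; o < d_out; ++o) {
        float acc = 0.0f;
        for (size_t j = 0; j < lora.r; ++j) {
            acc += lora.B[o*lora.r + j] * ax[j];
        }
        y[o] += lora.scale * acc;
    }
}
```

By contrast, a standard LoRA applies the delta at every position, which is why its KV cache cannot be shared with base-model calls.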

This issue proposes supporting inference for aLoRA adapters trained with (or at least following the same format as) Hugging Face PEFT (active PR, linked below). It complements work in progress to add the same support in vLLM (active PR, linked below).

paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora
Hugging Face PEFT PR: huggingface/peft#2609
vLLM PR: vllm-project/vllm#19710

Results from the paper: figure omitted here; see the paper linked above.

Motivation

Activated LoRA enables up to 10-30x faster inference than standard LoRA when adapters are used for specialized turns in a multi-turn LLM use case with prefix caching. This is particularly relevant for agentic (including RAG) applications (see the paper for extensive examples and further discussion of use cases), since it enables weight-tuning of key skills with (almost) no inference-latency downside relative to prompt optimization.

Support in llama.cpp would enable local inference for aLoRA adapters (typically trained with the Hugging Face PEFT PR above), complementing the vLLM inference support in progress.

Possible Implementation

Modify the current support for hot-swapping LoRA adapters to recognize Activated LoRA adapters. Adjust the inference code so that the adapter weights are applied only to tokens at or after the position indicated by the invocation string. Ensure that prefix caching correctly tracks which parts of the cached sequence were computed with which weights (base model vs. adapter); a rough sketch of locating the invocation position is given below.
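As a rough starting point, the sketch below (a hypothetical helper, not existing llama.cpp API) locates the adapter's invocation token sequence in a tokenized prompt and returns the offset from which adapted weights would apply. Whether activation starts at the first invocation token or just after it should follow the PEFT reference implementation:

```cpp
// Hypothetical helper, sketch only: find where an aLoRA adapter becomes active.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using token_id = std::int32_t;

// Returns the index just past the last occurrence of `invocation` in `prompt`,
// i.e. a candidate first position for which adapted weights apply (the exact
// convention should match the PEFT/vLLM implementations).
// std::nullopt means the invocation string is absent and the adapter stays off.
static std::optional<size_t> alora_activation_start(const std::vector<token_id> & prompt,
                                                    const std::vector<token_id> & invocation) {
    if (invocation.empty() || prompt.size() < invocation.size()) {
        return std::nullopt;
    }
    // scan from the end so the most recent invocation wins in multi-turn prompts
    for (size_t start = prompt.size() - invocation.size() + 1; start-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < invocation.size(); ++j) {
            if (prompt[start + j] != invocation[j]) { match = false; break; }
        }
        if (match) {
            return start + invocation.size();
        }
    }
    return std::nullopt;
}
```

The returned offset would be per-sequence state, so batched inference would need to carry it alongside each sequence when building the graph and when deciding how much of an existing base-model prefix cache can be reused.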
