Weight LoRA #2406
base: main
Conversation
Thanks for this PR that proposes to add Weight LoRA to PEFT. Do you have a link to the full paper? I only skimmed the implementation, but from what I saw, this is basically LoRA with the only difference being that the scaling parameter is trainable? Just from the abstract you pasted, it appears that there should be additional constraints on the weights $w_i$.
Yes, you are right. However, the constraints on the parameters $w_i$ should not be enforced inside the WeightLoRA method itself, but in the implementation of the optimizer step (e.g. SGD with projection). In our paper, we provide the WeightAdam optimizer, which projects the weights $w$ after each update.
Unfortunately, we submitted this paper to the ACL 2025 conference and it is under double-blind review, so I cannot send the full text, but I can share, for example, the results of the experiments in the form of a table.
Note that this optimizer can be included in the PR, too.
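The WeightAdam optimizer itself is not included in this excerpt, so the snippet below is only a minimal sketch of what such a projected update could look like, assuming the per-adapter weights are collected in a single 1-D tensor; `project_top_k` and `project_simplex` are hypothetical helper names corresponding to the two constraint types mentioned in the description (at most $K$ non-zero coordinates, or membership in the simplex $\Delta_{n-1}$), not functions from the PR.

```python
import torch


def project_top_k(w: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries of w and zero out the rest
    (Euclidean projection onto the set of k-sparse vectors)."""
    out = torch.zeros_like(w)
    idx = torch.topk(w.abs(), k).indices
    out[idx] = w[idx]
    return out


def project_simplex(w: torch.Tensor) -> torch.Tensor:
    """Euclidean projection onto the probability simplex
    (sort-based algorithm, cf. Duchi et al., 2008)."""
    u, _ = torch.sort(w, descending=True)
    css = torch.cumsum(u, dim=0)
    ks = torch.arange(1, w.numel() + 1, device=w.device, dtype=w.dtype)
    rho = int(torch.nonzero(u * ks > css - 1).max())
    theta = (css[rho] - 1.0) / (rho + 1)
    return torch.clamp(w - theta, min=0.0)


# toy loop: ordinary Adam step on the adapter weights, followed by a projection
w = torch.nn.Parameter(torch.randn(8))
optimizer = torch.optim.Adam([w], lr=1e-2)

loss = (w ** 2).sum()                          # placeholder loss
loss.backward()
optimizer.step()
with torch.no_grad():
    w.copy_(project_top_k(w.detach(), k=2))    # or project_simplex(w.detach())
```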
I would suggest waiting with this PR until the paper is accepted; otherwise it's hard for us to review the PR. Moreover, there could be useful changes during the review process that should be reflected in the integration.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
not stale
This PR adds a new method: Weight LoRA.
WeightLoRA
Weight LoRA is a less complex, but important, PEFT method that adds a weight $w_i$ to each LoRA adapter (here $i$ is the adapter index). This is done in order to perform, in addition to the classical optimization over all LoRAs $A_1, B_1, ..., A_n, B_n$, an alternative optimization over a vector of weights $w := (w_1, ..., w_n)^T \in \mathbb{R}^n$ with a wide variety of constraints. In our research paper, we consider two approaches: 1) the vector $w$ must lie in the simplex $\Delta_{n-1}$, and 2) the vector $w$ has only $K$ non-zero coordinates. Both of these methods solve the problem of finding the most important LoRA adapters in the model and concentrating training on them while disabling the rest.
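For illustration, here is a minimal sketch of the core idea in PyTorch: a frozen linear layer plus a LoRA adapter whose output is multiplied by a trainable scalar weight. The class and attribute names (`WeightedLoRALinear`, `weight_w`) are illustrative assumptions and not taken from the actual PR code.

```python
import torch
import torch.nn as nn


class WeightedLoRALinear(nn.Module):
    """Illustrative sketch (not the PR's actual code): a frozen base linear
    layer plus a LoRA adapter whose contribution is scaled by a trainable
    scalar weight w_i."""

    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # base weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = lora_alpha / r
        # trainable adapter weight w_i; driving it to 0 disables this adapter
        self.weight_w = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T   # B_i A_i x
        return self.base(x) + self.weight_w * self.scaling * lora_out


# usage: wrap an existing linear layer
layer = WeightedLoRALinear(nn.Linear(128, 64), r=4)
y = layer(torch.randn(2, 128))
```

Optimizing over the vector $w = (w_1, ..., w_n)^T$ then amounts to training one such scalar per wrapped layer, subject to the constraints described above.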
The abstract from the paper is:
The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation (LoRA), which adds trainable adapters to selected layers. Although LoRA may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, WeightLoRA, which overcomes this issue by adaptive selection of the most critical LoRA heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. Finally, we conduct experiments for the series of competitive benchmarks and DeBERTa and BART models, comparing our approach with the most popular LoRA modifications. The experimental results demonstrate the efficacy of WeightLoRA and the superior performance of WeightLoRA+ in comparison to the baselines in nearly all cases.
Original code