I have been reading AlphaLoRA and I think it's a great idea. Why not look at PL_Alpha_Hill for each layer and, depending on its value, use a smaller or larger r value as well as more or fewer experts?
Check out in your next research paper whether a variable number of experts, distributed according to that per-layer metric, together with variable LoRA ranks, can have a positive influence on the model. If 16 experts is better than 8, then having some layers with 32-64 experts and others with 4-8 could give the model more flexibility. Like having CoT, coding and role playing all in one, without changing models and without it spewing things at the wrong time.
Why not try a Camelidae with LoRAs spread unevenly with different dimensions and compare it to the current one as a quick test? It should not be too hard to implement that way.
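If it helps, here is a minimal sketch of what that per-layer allocation could look like, assuming PL_Alpha_Hill is the Hill estimate of the power-law exponent of a layer weight matrix's eigenvalue spectrum (as in the AlphaLoRA / WeightWatcher line of work). The `k_frac` value, the thresholds and the mapping direction in `allocate` are made-up placeholders, not anything from the paper:

```python
import torch

def pl_alpha_hill(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill estimate of the power-law tail exponent of W's spectral density."""
    lam = torch.linalg.svdvals(weight.detach().float()) ** 2  # eigenvalues of W W^T
    lam, _ = torch.sort(lam, descending=True)
    k = max(2, int(k_frac * lam.numel()))                     # size of the tail used by the estimator
    return 1.0 + k / torch.log(lam[:k] / lam[k - 1]).sum().item()

def allocate(alpha: float) -> tuple[int, int]:
    """Toy mapping from the per-layer metric to (lora_rank, num_experts)."""
    if alpha > 6.0:
        return 64, 16
    if alpha > 4.0:
        return 32, 8
    return 16, 4

if __name__ == "__main__":
    layer = torch.nn.Linear(4096, 4096)   # stand-in for one transformer layer's weight
    alpha = pl_alpha_hill(layer.weight)
    print(alpha, allocate(alpha))
```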
Train once and experiment after
Big LoRAs
You could train all experts at r2048 and compress them to smaller ranks by merging them back into the layer and extracting a smaller LoRA. That would be quicker than training multiple versions. It would allow testing the performance of r2048 vs r64 for all layers or for specific ones, AND it could be an interesting research paper.
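For what it's worth, a minimal sketch of that compression step, assuming the usual LoRA parametrization W + B @ A (A: r×in, B: out×r) and ignoring the alpha/scaling factor: form the full delta, truncate its SVD to the target rank, and split the singular values between the two new factors.

```python
import torch

def compress_lora(A: torch.Tensor, B: torch.Tensor, new_rank: int):
    delta = B @ A                                    # full update matrix, shape (out, in)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :new_rank], S[:new_rank], Vh[:new_rank, :]
    B_small = U * S.sqrt()                           # (out, new_rank)
    A_small = S.sqrt().unsqueeze(1) * Vh             # (new_rank, in)
    return A_small, B_small

if __name__ == "__main__":
    out_dim, in_dim, r_big, r_small = 4096, 4096, 2048, 64
    A = torch.randn(r_big, in_dim) * 0.01
    B = torch.randn(out_dim, r_big) * 0.01
    A64, B64 = compress_lora(A, B, r_small)
    err = (B @ A - B64 @ A64).norm() / (B @ A).norm()
    print(f"relative error at r={r_small}: {err:.3f}")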
Many LoRAs
Train with lots of experts, and I mean A LOT, like 64-128. It would require more epochs or more iterations. Then you shrink it by merging experts! You can merge experts by merging two or more LoRAs back into the model and extracting a single LoRA. But then your expert-selection router will not work unless you pool its outputs over each merged group. Let's say you train 64 experts and then merge them down to 16: you would have to update your routing for the virtual experts that are no longer there. This could be interesting in embedded situations where the model needs to be even smaller. It could also help determine the point of diminishing returns on the number of experts (too many vs too few).
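A rough sketch of that shrinking step, under the assumption that each merged expert is just the mean of its group's LoRA deltas and that the router rows for the group are pooled the same way (mean vs. sum, group size, etc. are all open choices):

```python
import torch

def merge_experts(expert_deltas: list[torch.Tensor], router_w: torch.Tensor, group: int):
    """Collapse groups of expert deltas and the matching router rows."""
    assert len(expert_deltas) % group == 0
    merged_deltas, merged_rows = [], []
    for i in range(0, len(expert_deltas), group):
        merged_deltas.append(torch.stack(expert_deltas[i:i + group]).mean(0))
        merged_rows.append(router_w[i:i + group].mean(0))   # pooled gate score for the group
    return merged_deltas, torch.stack(merged_rows)

if __name__ == "__main__":
    hidden, n_experts, group = 512, 64, 4
    deltas = [torch.randn(hidden, hidden) * 0.01 for _ in range(n_experts)]
    router = torch.randn(n_experts, hidden)                  # one logit row per expert
    new_deltas, new_router = merge_experts(deltas, router, group)
    print(len(new_deltas), new_router.shape)                 # 16 experts, router (16, hidden)
```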
Naming method for uneven ranks and LoRA distribution
With 16x34b at 60 layers, that equates to 960 experts. There is surely a more optimal expert distribution than an equal number per layer, but then the naming convention should be different: E960x34b? And to account for LoRA rank, with r256 (the average rank across all experts) and e16 (total experts (960) / total layers (60)), that would give Camelidae-r256e16x34b-v0.1? It would be easier to understand that r512e32x34b is bigger and has more knowledge variety and capability than r64e8x34b.
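A tiny helper just to make the proposed scheme concrete; the averaging choices are mine, the format is the one suggested above:

```python
def model_name(per_layer, base="Camelidae", params="34b", version="v0.1"):
    # per_layer: list of (num_experts, lora_rank) tuples, one per layer
    total_experts = sum(n for n, _ in per_layer)
    avg_rank = round(sum(n * r for n, r in per_layer) / total_experts)
    avg_experts = round(total_experts / len(per_layer))
    return f"{base}-r{avg_rank}e{avg_experts}x{params}-{version}"

print(model_name([(16, 256)] * 60))   # -> Camelidae-r256e16x34b-v0.1
```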
More ideas
I also left an issue in mergekit if you want some more ideas, like having a planner (I am looking at ModernBERT) classify the input and activate or deactivate some experts, forcing them to be domain experts, something like selecting a few to force coding skills into.
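A minimal sketch of how such a planner could hook into routing, assuming the classifier (ModernBERT or anything else) just emits a domain label and that label maps to a hard mask over the expert logits before top-k selection; the label set and the mask table are invented for the example:

```python
import torch

DOMAIN_MASKS = {
    # 1 = expert allowed, 0 = expert forced off (8 experts in this toy setup)
    "coding":   torch.tensor([1, 1, 1, 0, 0, 0, 0, 0], dtype=torch.bool),
    "roleplay": torch.tensor([0, 0, 0, 1, 1, 1, 0, 0], dtype=torch.bool),
    "general":  torch.ones(8, dtype=torch.bool),
}

def masked_routing(router_logits: torch.Tensor, domain: str, top_k: int = 2):
    """Zero out forbidden experts before the router's softmax / top-k."""
    mask = DOMAIN_MASKS[domain]
    logits = router_logits.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(logits, dim=-1)
    return torch.topk(weights, top_k, dim=-1)

if __name__ == "__main__":
    logits = torch.randn(8)                  # one routing score per expert
    vals, idx = masked_routing(logits, "coding")
    print(idx.tolist(), vals.tolist())       # only experts 0-2 can be selected
```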
P.S. I didn't see what LoRA ranks you used for Camelidae; perhaps I haven't read enough?
Thanks for your work!
Hi, very interesting!
I will definitely play with this.