Variable LoRA rank and uneven expert distribution per layer #9

Open · johnr14 opened this issue Jan 24, 2025 · 0 comments
johnr14 commented Jan 24, 2025

Hi, very interesting!

I will definitely play with this.

I have been reading AlphaLoRA and I think it's a great idea. Why not look at PL_Alpha_Hill for each layer and, depending on the perplexity, use a smaller or larger r value as well as more or fewer experts?

For your next research paper, consider checking whether a variable number of experts per layer, distributed according to perplexity, combined with a variable LoRA rank, has a positive influence on the model. If 16 experts are better than 8, then giving some layers 32-64 experts and others 4-8 could provide more flexibility to the model, like having CoT, coding and role playing all in one without changing models and without it spewing things at the wrong time. A rough sketch of such a per-layer allocation is below.
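A minimal sketch of what that per-layer allocation could look like, assuming PyTorch weight matrices. The Hill-estimator formula follows the usual PL_Alpha_Hill definition (power-law exponent fitted to the top eigenvalues of W^T W); the bucketing of alphas into rank and expert counts is a made-up heuristic for illustration, not AlphaLoRA's actual recipe:

```python
import numpy as np
import torch

def pl_alpha_hill(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill estimator of the power-law exponent of the eigenvalue spectrum
    of W^T W (the PL_Alpha_Hill metric used in heavy-tailed analyses such
    as AlphaLoRA / weightwatcher)."""
    eigs = torch.linalg.svdvals(weight.float()) ** 2   # eigenvalues of W^T W
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * eigs.numel()))             # size of the tail to fit
    lam_k = eigs[k - 1]
    # alpha_hat = 1 + k / sum_{i<=k} ln(lambda_i / lambda_k)
    return 1.0 + k / torch.log(eigs[:k] / lam_k).sum().item()

def allocate_per_layer(alphas, rank_choices=(8, 16, 32, 64), expert_choices=(4, 8, 16, 32)):
    """Toy heuristic: layers with larger alpha (less heavy-tailed, presumed
    less well trained) get higher LoRA rank and more experts. The thresholds
    and choices here are placeholders, not a tuned allocation."""
    alphas = np.asarray(alphas, dtype=float)
    norm = (alphas - alphas.min()) / (np.ptp(alphas) + 1e-8)
    ranks = [rank_choices[int(round(x * (len(rank_choices) - 1)))] for x in norm]
    experts = [expert_choices[int(round(x * (len(expert_choices) - 1)))] for x in norm]
    return ranks, experts

# e.g. alphas = [pl_alpha_hill(layer.mlp.down_proj.weight) for layer in model.model.layers]
```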

Why not try a Camelidae variant with LoRAs spread unevenly and with different ranks, and compare it to the current model for a quick test? It should not be too hard to implement that way.

Train once and experiment after

Big LoRAs

You could train all experts at r=2048 and compress them to smaller ranks by merging them back into the layer and extracting a smaller LoRA. That would be quicker than training multiple versions. It would allow testing the performance of r=2048 vs r=64 for all layers or specific ones, and it could be an interesting research paper.
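A minimal sketch of that compression step, assuming the common LoRA convention delta_W = B @ A; the tensor shapes and rank numbers below are hypothetical:

```python
import torch

def compress_lora(A: torch.Tensor, B: torch.Tensor, new_rank: int):
    """Compress a trained LoRA pair (delta_W = B @ A, with A: r x in, B: out x r)
    to a lower rank by truncated SVD of the full update. Adjust the shape
    convention to match the actual implementation."""
    delta_w = B @ A                                      # out x in, full update
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U, S, Vh = U[:, :new_rank], S[:new_rank], Vh[:new_rank, :]
    # Split the singular values evenly across the two factors.
    B_small = U * S.sqrt()                               # out x new_rank
    A_small = S.sqrt().unsqueeze(1) * Vh                 # new_rank x in
    return A_small, B_small

# e.g. shrink a hypothetical r=2048 expert down to r=64 and measure the error:
A = torch.randn(2048, 4096) * 0.01
B = torch.randn(4096, 2048) * 0.01
A64, B64 = compress_lora(A, B, 64)
rel_err = torch.linalg.norm(B @ A - B64 @ A64) / torch.linalg.norm(B @ A)
```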

Many LoRAs

Train with lots of experts, and I mean a lot, like 64-128. It would require more epochs or more iterations. Then shrink the model by merging experts! You can merge experts by merging two or more LoRAs back into the model and extracting a single LoRA. But then your expert-selection router will not work unless you adjust its outputs to account for the merged groups, for example by dividing or averaging over the experts that were merged. Say you train 64 experts and then merge them down to 16: you would have to update your routing for the virtual experts that are no longer there. This could be interesting in embedded situations where the model needs to be even smaller. It could also help determine the point of diminishing returns on the number of experts (too many vs too few). A rough sketch of the merge step follows.
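A minimal sketch of that merge-and-shrink idea, assuming a linear router with one row of weights per expert and the delta_W = B @ A convention. Averaging the full updates within a group and averaging the corresponding router rows are naive placeholder choices, not a validated recipe:

```python
import torch

def merge_experts(expert_As, expert_Bs, router_weight, groups):
    """Collapse groups of LoRA experts into one expert each.
    expert_As[i]: r x in, expert_Bs[i]: out x r for expert i.
    router_weight: num_experts x hidden (the router's linear weight).
    groups: list of index lists, e.g. [[0, 1, 2, 3], [4, 5, 6, 7], ...]."""
    new_As, new_Bs, new_router_rows = [], [], []
    for group in groups:
        # Average the full-rank updates of the group's experts...
        delta = torch.stack([expert_Bs[i] @ expert_As[i] for i in group]).mean(0)
        # ...then re-extract a single LoRA pair at the original rank via SVD.
        r = expert_As[group[0]].shape[0]
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        new_Bs.append(U[:, :r] * S[:r].sqrt())
        new_As.append(S[:r].sqrt().unsqueeze(1) * Vh[:r, :])
        # The merged expert inherits the averaged routing weights of its members.
        new_router_rows.append(router_weight[group].mean(0))
    return new_As, new_Bs, torch.stack(new_router_rows)
```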

Naming convention for uneven LoRA ranks and expert distributions

With 16x34b at 60 layers, that equates to 960 experts. There is surely a more optimal expert distribution than an equal number per layer, but then the naming convention would have to change: E960x34b? To also account for the LoRA rank, you could report r256 (the average rank over all experts) and e16 (total experts, 960, divided by total layers, 60), giving something like Camelidae-r256e16x34b-v0.1. It would then be easy to understand that r512e32x34b is bigger and has more knowledge variety and capabilities than r64e8x34b.
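For illustration only, a toy helper that derives such a name from per-layer configs; the expert-weighted averaging is just one possible choice, not an established convention:

```python
def model_name(base, param_tag, ranks_per_layer, experts_per_layer, version="v0.1"):
    """Build a name like 'Camelidae-r256e16x34b-v0.1':
    r = average LoRA rank across all experts, e = total experts / total layers."""
    total_experts = sum(experts_per_layer)
    avg_e = total_experts // len(experts_per_layer)
    # Weight each layer's rank by its number of experts when averaging.
    avg_r = sum(r * e for r, e in zip(ranks_per_layer, experts_per_layer)) // total_experts
    return f"{base}-r{avg_r}e{avg_e}x{param_tag}-{version}"

# 60 layers, 16 experts of rank 256 each -> 'Camelidae-r256e16x34b-v0.1'
print(model_name("Camelidae", "34b", [256] * 60, [16] * 60))
```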

More ideas

I also left an issue in mergekit if you want some more ideas, like having a planner (I am looking at ModernBERT) that classifies the input and activates or deactivates some experts, forcing them to be domain experts, e.g. selecting a few to concentrate coding skills into. A rough sketch of that kind of router masking follows.
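A rough sketch of that idea, assuming a top-k router that produces one logit per expert; the classifier itself (e.g. a small ModernBERT fine-tune) is left out, and the domain-to-expert mapping is a hypothetical assignment chosen at training time:

```python
import torch

def mask_router_logits(router_logits, domain_id, domain_to_experts, shared_experts=()):
    """Send the logits of experts not assigned to the predicted domain to -inf
    before top-k routing, so only the allowed (plus shared) experts can fire."""
    allowed = set(domain_to_experts[domain_id]) | set(shared_experts)
    mask = torch.full_like(router_logits, float("-inf"))
    idx = torch.tensor(sorted(allowed), device=router_logits.device)
    mask[..., idx] = 0.0
    return router_logits + mask

# e.g. a "coding" input (domain 1) may only use experts 3-5 plus shared expert 0:
logits = torch.randn(1, 8)                     # batch x num_experts
masked = mask_router_logits(logits, 1, {0: [1, 2], 1: [3, 4, 5]}, shared_experts=[0])
topk = masked.topk(2, dim=-1).indices
```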

P.S. I didn't see which LoRA ranks you used for Camelidae; perhaps I haven't read enough?

Thanks for your work!
