I have been reading AlphaLoRA and I think it's a great idea. Why not look at PL_Alpha_Hill for each layer and, depending on its value, use a smaller or larger r value as well as more or fewer experts?
Check out in your next research paper whether a variable number of experts, distributed according to that per-layer metric, together with variable LoRA ranks, can have a positive influence on the model. If 16 experts is better than 8, then having some layers with 32-64 experts and others with 4-8 could give the model more flexibility. Like having CoT, coding and role playing all in one, without changing models and without it spewing things at the wrong time.
Why not try a Camelidae with LoRAs spread unevenly with different dimensions and compare it to the current one as a quick test? It should not be too hard to implement that way.
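If it helps, here is a minimal sketch of what that per-layer allocation could look like, assuming PL_Alpha_Hill is the Hill estimate of the power-law exponent of a layer weight matrix's eigenvalue spectrum (as in the AlphaLoRA / WeightWatcher line of work). The `k_frac` value, the thresholds and the mapping direction in `allocate` are made-up placeholders, not anything from the paper:

```python
import torch

def pl_alpha_hill(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill estimate of the power-law tail exponent of W's spectral density."""
    lam = torch.linalg.svdvals(weight.detach().float()) ** 2  # eigenvalues of W W^T
    lam, _ = torch.sort(lam, descending=True)
    k = max(2, int(k_frac * lam.numel()))                     # size of the tail used by the estimator
    return 1.0 + k / torch.log(lam[:k] / lam[k - 1]).sum().item()

def allocate(alpha: float) -> tuple[int, int]:
    """Toy mapping from the per-layer metric to (lora_rank, num_experts)."""
    if alpha > 6.0:
        return 64, 16
    if alpha > 4.0:
        return 32, 8
    return 16, 4

if __name__ == "__main__":
    layer = torch.nn.Linear(4096, 4096)   # stand-in for one transformer layer's weight
    alpha = pl_alpha_hill(layer.weight)
    print(alpha, allocate(alpha))
```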
Train once and experiment after
Big LoRAs
You could train all experts at r2048 and compress them to smaller ranks by merging them back into the layer and extracting a smaller LoRA. That would be quicker than training multiple versions. It would allow testing the performance of r2048 vs r64 for all layers or for specific ones, AND it could be an interesting research paper.
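For what it's worth, a minimal sketch of that compression step, assuming the usual LoRA parametrization W + B @ A (A: r×in, B: out×r) and ignoring the alpha/scaling factor: form the full delta, truncate its SVD to the target rank, and split the singular values between the two new factors.

```python
import torch

def compress_lora(A: torch.Tensor, B: torch.Tensor, new_rank: int):
    delta = B @ A                                    # full update matrix, shape (out, in)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :new_rank], S[:new_rank], Vh[:new_rank, :]
    B_small = U * S.sqrt()                           # (out, new_rank)
    A_small = S.sqrt().unsqueeze(1) * Vh             # (new_rank, in)
    return A_small, B_small

if __name__ == "__main__":
    out_dim, in_dim, r_big, r_small = 4096, 4096, 2048, 64
    A = torch.randn(r_big, in_dim) * 0.01
    B = torch.randn(out_dim, r_big) * 0.01
    A64, B64 = compress_lora(A, B, r_small)
    err = (B @ A - B64 @ A64).norm() / (B @ A).norm()
    print(f"relative error at r={r_small}: {err:.3f}")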
Many LoRAs
Train with lots of experts, and I mean A LOT, like 64-128. It would require more epochs or more iterations. Then you shrink it by merging experts! You can merge experts by merging two or more LoRAs back into the model and extracting a single LoRA. But then your expert-selection router will not work unless you pool its outputs over each merged group. Let's say you train 64 experts and then merge them down to 16: you would have to update your routing for the virtual experts that are no longer there. This could be interesting in embedded situations where the model needs to be even smaller. It could also help determine the point of diminishing returns on the number of experts (too many vs too few).
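A rough sketch of that shrinking step, under the assumption that each merged expert is just the mean of its group's LoRA deltas and that the router rows for the group are pooled the same way (mean vs. sum, group size, etc. are all open choices):

```python
import torch

def merge_experts(expert_deltas: list[torch.Tensor], router_w: torch.Tensor, group: int):
    """Collapse groups of expert deltas and the matching router rows."""
    assert len(expert_deltas) % group == 0
    merged_deltas, merged_rows = [], []
    for i in range(0, len(expert_deltas), group):
        merged_deltas.append(torch.stack(expert_deltas[i:i + group]).mean(0))
        merged_rows.append(router_w[i:i + group].mean(0))   # pooled gate score for the group
    return merged_deltas, torch.stack(merged_rows)

if __name__ == "__main__":
    hidden, n_experts, group = 512, 64, 4
    deltas = [torch.randn(hidden, hidden) * 0.01 for _ in range(n_experts)]
    router = torch.randn(n_experts, hidden)                  # one logit row per expert
    new_deltas, new_router = merge_experts(deltas, router, group)
    print(len(new_deltas), new_router.shape)                 # 16 experts, router (16, hidden)
```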
Naming method for uneven ranks and LoRA distribution
With 16x34b at 60 layers, that equates to 960 experts. There is surely a more optimal expert distribution than an equal number per layer, but then the naming convention should be different: E960x34b? And to account for LoRA rank, with r256 (the average rank across all experts) and e16 (total experts (960) / total layers (60)), that would give Camelidae-r256e16x34b-v0.1? It would be easier to understand that r512e32x34b is bigger and has more knowledge variety and capability than r64e8x34b.
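A tiny helper just to make the proposed scheme concrete; the averaging choices are mine, the format is the one suggested above:

```python
def model_name(per_layer, base="Camelidae", params="34b", version="v0.1"):
    # per_layer: list of (num_experts, lora_rank) tuples, one per layer
    total_experts = sum(n for n, _ in per_layer)
    avg_rank = round(sum(n * r for n, r in per_layer) / total_experts)
    avg_experts = round(total_experts / len(per_layer))
    return f"{base}-r{avg_rank}e{avg_experts}x{params}-{version}"

print(model_name([(16, 256)] * 60))   # -> Camelidae-r256e16x34b-v0.1
```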
More ideas
I also left an issue in mergekit if you want some more ideas, like having a planner (I am looking at ModernBERT) classify the input and activate or deactivate some experts, forcing them to be domain experts, something like selecting a few to force coding skills into.
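A minimal sketch of how such a planner could hook into routing, assuming the classifier (ModernBERT or anything else) just emits a domain label and that label maps to a hard mask over the expert logits before top-k selection; the label set and the mask table are invented for the example:

```python
import torch

DOMAIN_MASKS = {
    # 1 = expert allowed, 0 = expert forced off (8 experts in this toy setup)
    "coding":   torch.tensor([1, 1, 1, 0, 0, 0, 0, 0], dtype=torch.bool),
    "roleplay": torch.tensor([0, 0, 0, 1, 1, 1, 0, 0], dtype=torch.bool),
    "general":  torch.ones(8, dtype=torch.bool),
}

def masked_routing(router_logits: torch.Tensor, domain: str, top_k: int = 2):
    """Zero out forbidden experts before the router's softmax / top-k."""
    mask = DOMAIN_MASKS[domain]
    logits = router_logits.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(logits, dim=-1)
    return torch.topk(weights, top_k, dim=-1)

if __name__ == "__main__":
    logits = torch.randn(8)                  # one routing score per expert
    vals, idx = masked_routing(logits, "coding")
    print(idx.tolist(), vals.tolist())       # only experts 0-2 can be selected
```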
P.S. I didn't see what LoRA ranks you used for Camelidae; perhaps I haven't read enough?
Thanks for your work!
Hi, very interesting!
I will definitely play with this.