
Does replacing some layers with linear attention (GQA or KDA) from FLA improve model training? #87

@vukrosic

Description

  1. Measure the baseline for 8M tokens and 20M tokens (follow SETUP_INSTRUCTIONS).
  2. Replace some layers with linear attention and measure the new training run fairly (validation loss and wall-clock time) - you may ask an AI assistant to help set up a fair comparison; see the sketch after this list.
  • From previous experience, the last layer should probably stay classic (softmax) attention, and softmax attention should make up roughly 20-25% of the layers.
  • Confirm that your model actually trains better or faster and that the tradeoff is worth it.
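A minimal sketch of how the hybrid layer schedule suggested above could be generated - pure Python, not the repo's actual model code. It keeps the last layer as softmax attention and spreads the remaining softmax layers evenly so they make up roughly the target 20-25%; every other layer would be swapped for an FLA linear-attention layer (the concrete FLA class, e.g. for KDA, is omitted here since it depends on the installed FLA version).

```python
def build_layer_schedule(n_layers: int, softmax_ratio: float = 0.25) -> list[str]:
    """Return a per-layer list of 'softmax' or 'linear' labels.

    The final layer is always softmax, and the remaining softmax layers are
    spread evenly across the stack, following the heuristic in this issue.
    """
    n_softmax = max(1, round(n_layers * softmax_ratio))
    schedule = ["linear"] * n_layers
    schedule[-1] = "softmax"  # last layer stays classic softmax attention
    remaining = n_softmax - 1
    if remaining > 0:
        # Place the other softmax layers at evenly spaced positions.
        step = (n_layers - 1) / (remaining + 1)
        for i in range(1, remaining + 1):
            schedule[round(i * step) - 1] = "softmax"
    return schedule


if __name__ == "__main__":
    # Example: a 12-layer model with a 25% softmax ratio puts softmax attention
    # at layer indices 3, 6, and 11 (0-indexed), linear attention everywhere else.
    print(build_layer_schedule(12, 0.25))
```

Spreading the softmax layers through the stack (rather than bunching them at the start or end) is just one reasonable choice; the main constraint from the notes above is the final softmax layer and the overall 20-25% ratio.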
