
Does replacing some layers with linear attention (GQA or KDA) from FLA improve model training? #87

@vukrosic

Description

  1. Measure the baseline for 8M tokens and 20M tokens (follow SETUP_INSTRUCTIONS).
  2. Replace some layers with linear attention and measure the new training run fairly (validation loss and wall-clock time) - you may ask an AI assistant to help set up a fair comparison; see the sketch after this list.
  • From previous experience, the last layer should probably stay classic (softmax) attention, and softmax attention should make up roughly 20-25% of the layers.
  • Confirm that your model actually trains better or faster and that the tradeoff is worth it.
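A minimal sketch of how the hybrid layer schedule suggested above could be generated - pure Python, not the repo's actual model code. It keeps the last layer as softmax attention and spreads the remaining softmax layers evenly so they make up roughly the target 20-25%; every other layer would be swapped for an FLA linear-attention layer (the concrete FLA class, e.g. for KDA, is omitted here since it depends on the installed FLA version).

```python
def build_layer_schedule(n_layers: int, softmax_ratio: float = 0.25) -> list[str]:
    """Return a per-layer list of 'softmax' or 'linear' labels.

    The final layer is always softmax, and the remaining softmax layers are
    spread evenly across the stack, following the heuristic in this issue.
    """
    n_softmax = max(1, round(n_layers * softmax_ratio))
    schedule = ["linear"] * n_layers
    schedule[-1] = "softmax"  # last layer stays classic softmax attention
    remaining = n_softmax - 1
    if remaining > 0:
        # Place the other softmax layers at evenly spaced positions.
        step = (n_layers - 1) / (remaining + 1)
        for i in range(1, remaining + 1):
            schedule[round(i * step) - 1] = "softmax"
    return schedule


if __name__ == "__main__":
    # Example: a 12-layer model with a 25% softmax ratio puts softmax attention
    # at layer indices 3, 6, and 11 (0-indexed), linear attention everywhere else.
    print(build_layer_schedule(12, 0.25))
```

Spreading the softmax layers through the stack (rather than bunching them at the start or end) is just one reasonable choice; the main constraint from the notes above is the final softmax layer and the overall 20-25% ratio.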
