
Potentially incorrect attention flop calculation due to wrong head_dim? #1920

@gau-nernst


Bug description

The attention FLOPs calculation derives the head dimension from dim // n_heads:

l, h, q, t = (
model_args.n_layers,
model_args.n_heads,
model_args.dim // model_args.n_heads,
seq_len,
)

However, head_dim is not necessarily equal to dim / n_heads.

For example, Qwen3-4B has dim=2560, n_heads=32, and head_dim=128, whereas dim / n_heads = 80, so the attention FLOPs are under-counted for such models.
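
A minimal sketch of a possible fix, assuming the model args expose an explicit head_dim field (as Qwen3-style configs do); the attribute name `model_args.head_dim` is an assumption here and may differ in the actual args class:

# Sketch only: prefer an explicit head_dim when the model args define one,
# and fall back to dim // n_heads otherwise.
# NOTE: `model_args.head_dim` is an assumed attribute name for illustration.
head_dim = getattr(model_args, "head_dim", None) or model_args.dim // model_args.n_heads
l, h, q, t = (
    model_args.n_layers,
    model_args.n_heads,
    head_dim,
    seq_len,
)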

Versions

latest main
