Add a new tile scheduler for varlen prefill to avoid launching empty work groups #516

carsonwang · 2025-09-18T01:54:49Z

This PR adds a valid-only scheduler for varlen prefill and a Zipf varlen initializer, and corrects the varlen metrics. In our skewed test it removes 28,960 empty work groups (WGs), for a consistent gain of about 0.5 TFLOP/s.

Problem:
For varlen (skewed) prefill, the existing “Individual” scheduler sizes the grid from the maximum sequence length in the batch, so it launches many empty work groups that return immediately.

Example (our test case):

batch=16, max seq len=8096
Original scheduler grid → 1 × 64 × 512 = 32,768 WGs
Only 3,808 of those WGs have work; the remaining 28,960 (about 88%) are empty and only add scheduling overhead.
This wastes device-side scheduling bandwidth and inflates traces.
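
The counts above can be reproduced with a short calculation. This is a sketch only: it assumes one WG per (Q tile, query head, batch) and a Q tile of 128 rows. The tile size is not stated in this PR; it is inferred from 64 = ceil(8096 / 128), and the helper names are hypothetical.

```python
import math

# Per-batch Q lengths taken from the test log in this PR.
SEQ_LENS_Q = [672, 192, 32, 320, 32, 96, 32, 32,
              8096, 32, 4032, 32, 32, 32, 160, 32]
NUM_HEADS_Q = 32
Q_TILE = 128  # assumption: inferred from the 64-tile grid dimension, not stated in the PR


def dense_grid_wgs(seq_lens, num_heads, q_tile):
    """WGs launched when every batch entry is padded to the batch max length."""
    tiles_for_max = math.ceil(max(seq_lens) / q_tile)
    return tiles_for_max * num_heads * len(seq_lens)


def valid_wgs(seq_lens, num_heads, q_tile):
    """WGs that cover at least one real query row."""
    return sum(math.ceil(n / q_tile) for n in seq_lens) * num_heads


dense = dense_grid_wgs(SEQ_LENS_Q, NUM_HEADS_Q, Q_TILE)
valid = valid_wgs(SEQ_LENS_Q, NUM_HEADS_Q, Q_TILE)
print(dense, valid, dense - valid)  # 32768 3808 28960
```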

Solution
Following the idea used by FlashInfer (launch only work that exists), this PR adds a valid-only individual tile scheduler for the prefill kernel. It constructs a compact list of tiles that actually do work and launches exactly that many WGs.
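
As a rough illustration of the idea, the host can enumerate only the tiles that contain real query rows and launch one WG per entry. The names, the tile ordering, and the explicit list of tuples below are hypothetical; the scheduler in this PR may order or encode the tiles differently.

```python
import math
from typing import List, NamedTuple


class Tile(NamedTuple):
    batch: int    # index of the sequence in the batch
    head: int     # query head index
    q_block: int  # index of the Q tile within that sequence


def build_valid_tiles(seq_lens_q, num_heads, q_tile) -> List[Tile]:
    """Enumerate only tiles that contain at least one real query row."""
    tiles = []
    for b, n in enumerate(seq_lens_q):
        # A zero-length sequence yields no tiles at all.
        for q_block in range(math.ceil(n / q_tile)):
            for h in range(num_heads):
                tiles.append(Tile(b, h, q_block))
    return tiles


# Small hypothetical batch; q_tile=128 is an assumed tile size.
tiles = build_valid_tiles([672, 192, 32], num_heads=2, q_tile=128)

# The launch would use exactly len(tiles) WGs, and WG i would read tiles[i]
# instead of deriving its coordinates from a padded 3-D grid.
print(len(tiles))  # 18
```

With this layout no WG needs an early-exit check, because every entry in the list maps to real work.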

In addition, this PR:

Adds a Zipf-based varlen initializer to better emulate long-tailed sequence distributions.
Fixes varlen performance metrics so the reported TFLOP/s and GB/s are computed from each sample's actual sequence length rather than a global max.
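
A long-tailed initializer of this kind might look like the sketch below. It is not the PR's implementation: the function name, the seeding, and the round-up-to-alignment step are assumptions, and only the parameter values (Nmax=16384, s=1.098, alignment 32) come from the test log.

```python
import itertools
import random


def zipf_varlen_lengths(num_batches, n_max=16384, s=1.098, align=32, seed=0):
    """Draw sequence lengths with P(k) proportional to k**-s, rounded up to `align`."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable (assumption)
    ranks = range(1, n_max + 1)
    weights = [k ** -s for k in ranks]
    draws = rng.choices(ranks, weights=weights, k=num_batches)
    # Round each draw up to the next multiple of the alignment.
    return [((k + align - 1) // align) * align for k in draws]


lengths = zipf_varlen_lengths(16)
# Prefix sums in the same shape as the cumulative_seqlen arrays in the log.
cumulative = list(itertools.accumulate(lengths, initial=0))
print(lengths)
print(cumulative)
```

Most draws land on the smallest aligned length with an occasional very long sequence, which is the kind of skew that makes the padded grid wasteful.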

Test case:

./build/examples/sycl/06_bmg_flash_attention/06_bmg_prefill_attention_hdim128 --batch=16 --num_heads_q=32 --num_heads_kv=8 --seq_len_qo=1024 --seq_len_kv=1024 --head_size_vo=128  --head_size_qk=128 --varlen --iterations=10000 --scheduler="ValidOnlyIndividual" --varlen_dist="zipf"
[VarLen Init/Zipf] batches=16 Zipf(Nmax=16384, s=1.098) AlignQ(elems)=32 AlignKV(elems)=32
  batch 0 : Q=672, KV=672
  batch 1 : Q=192, KV=192
  batch 2 : Q=32, KV=32
  batch 3 : Q=320, KV=320
  batch 4 : Q=32, KV=32
  batch 5 : Q=96, KV=96
  batch 6 : Q=32, KV=32
  batch 7 : Q=32, KV=32
  batch 8 : Q=8096, KV=8096
  batch 9 : Q=32, KV=32
  batch 10 : Q=4032, KV=4032
  batch 11 : Q=32, KV=32
  batch 12 : Q=32, KV=32
  batch 13 : Q=32, KV=32
  batch 14 : Q=160, KV=160
  batch 15 : Q=32, KV=32
  cumulative_seqlen_q : 0 672 864 896 1216 1248 1344 1376 1408 9504 9536 13568 13600 13632 13664 13824 13856
  cumulative_seqlen_kv: 0 672 864 896 1216 1248 1344 1376 1408 9504 9536 13568 13600 13632 13664 13824 13856
  totals: Q=13856 (max 8096), KV=13856 (max 8096)
Total tile number: 3808
Disposition: Passed
Batch: 16       NumHeads_q: 32  NumHeads_kv: 8  Seq Length QO: 1024     Seq Length KV: 1024     Head Size QK: 128       Head Size VO: 128        Causal Mask: false      Variable Sequence Length: true   Scheduler: ValidOnlyIndividual
Performance:   18.909  GB/s,    64.285  TFlop/s,   21.0103  ms
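
The reported figures can be cross-checked against the per-batch lengths in the log. The sketch below assumes non-causal attention (as the log states), 2·Q·KV·d FLOPs for each of the two GEMMs per query head, 2-byte Q/K/V elements, and 4-byte output elements. The element sizes are inferred because they make the numbers line up; they are not stated in this PR, and the function names are hypothetical.

```python
SEQ_Q = [672, 192, 32, 320, 32, 96, 32, 32,
         8096, 32, 4032, 32, 32, 32, 160, 32]
SEQ_KV = list(SEQ_Q)  # Q and KV lengths are identical in this log
NUM_HEADS_Q, NUM_HEADS_KV = 32, 8
HEAD_QK, HEAD_VO = 128, 128
TIME_S = 21.0103e-3  # time reported in the log


def varlen_flops(seq_q, seq_kv, heads_q, head_qk, head_vo):
    """Sum Q @ K^T and P @ V FLOPs over each sample's actual lengths."""
    per_head = sum(2 * q * kv * (head_qk + head_vo)
                   for q, kv in zip(seq_q, seq_kv))
    return per_head * heads_q


def varlen_bytes(seq_q, seq_kv, heads_q, heads_kv, head_qk, head_vo,
                 in_bytes=2, out_bytes=4):  # element sizes are assumptions
    """Bytes moved for Q, K, V reads and the O write, using actual token totals."""
    total_q, total_kv = sum(seq_q), sum(seq_kv)
    q = total_q * heads_q * head_qk * in_bytes
    k = total_kv * heads_kv * head_qk * in_bytes
    v = total_kv * heads_kv * head_vo * in_bytes
    o = total_q * heads_q * head_vo * out_bytes
    return q + k + v + o


tflops = varlen_flops(SEQ_Q, SEQ_KV, NUM_HEADS_Q, HEAD_QK, HEAD_VO) / TIME_S / 1e12
gbps = varlen_bytes(SEQ_Q, SEQ_KV, NUM_HEADS_Q, NUM_HEADS_KV,
                    HEAD_QK, HEAD_VO) / TIME_S / 1e9
print(f"{tflops:.3f} TFLOP/s, {gbps:.3f} GB/s")
```

Under these assumptions the result agrees with the 64.285 TFLOP/s and 18.909 GB/s in the log to within rounding.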

Performance
Launched WGs: reduced from 32,768 to 3,808 (about 88% fewer)

Kernel throughput: about +0.5 TFLOP/s (+0.8%)

Before (Individual): 63.74 TFLOP/s

After (ValidOnlyIndividual): 64.28 TFLOP/s

This gain is modest but consistent for this workload; more importantly, the launch now reflects actual work, improving scheduling efficiency.

User-facing options
No default behavior change.
--scheduler="ValidOnlyIndividual": new valid-only scheduler for varlen prefill.
--varlen_dist="zipf": new varlen distribution option; the default remains the existing normal distribution.

carsonwang added 2 commits September 17, 2025 17:59

add valid only individual tile scheduler

334ce05

Merge branch 'main' into validonly

10a03ad

rolandschulz requested a review from tdeng5 September 19, 2025 05:13
