carsonwang

This PR adds a valid-only scheduler for varlen prefill and a Zipf varlen initializer, and corrects the varlen metrics. In our skewed test it removes 28,960 (about 29K) empty WGs, for a consistent gain of about 0.5 TFLOP/s.

Problem:
For varlen (skewed) prefill, the existing “Individual” scheduler builds the grid using the batch max sequence length, so it launches many empty work groups that immediately return.

Example (our test case):

  • batch=16, max seq len=8096
  • Original scheduler grid → 1 × 64 × 512 = 32,768 WGs
  • But only 3,808 WGs actually have work; the remaining 28,960 are empty and only add scheduling overhead.
    This wastes device-side scheduling bandwidth and inflates traces.
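
The two WG counts above can be reproduced with a few lines of arithmetic. This is only a sketch: the Q tile size of 128 rows is an assumption (it is not stated in this PR, but it is the value that yields both 64 tiles for the longest sequence and 3,808 valid WGs), and reading the grid as 64 Q tiles × (16 batches × 32 heads) is likewise an interpretation of the numbers. The sequence lengths are taken from the test log further down.

```python
# Per-batch Q lengths copied from the test-case log in this PR.
seq_lens_q = [672, 192, 32, 320, 32, 96, 32, 32,
              8096, 32, 4032, 32, 32, 32, 160, 32]
NUM_HEADS_Q = 32
TILE_Q = 128  # ASSUMPTION: Q tile size, inferred from the reported counts


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


# Grid sized by the batch max: every batch gets as many Q tiles as the longest one.
dense_wgs = ceil_div(max(seq_lens_q), TILE_Q) * NUM_HEADS_Q * len(seq_lens_q)

# Grid sized by actual work: each batch contributes only its own Q tiles.
valid_wgs = sum(ceil_div(n, TILE_Q) for n in seq_lens_q) * NUM_HEADS_Q

empty_wgs = dense_wgs - valid_wgs
print(dense_wgs, valid_wgs, empty_wgs)  # prints "32768 3808 28960"
```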

Solution:
Following the idea used by FlashInfer (launch only work that exists), this PR adds a valid-only individual tile scheduler for the prefill kernel. It constructs a compact list of tiles that actually do work and launches exactly that many WGs.
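
As a rough illustration of the compact-tile idea (not the kernel code in this PR), the sketch below builds a prefix sum of per-batch Q-tile counts on the host and decodes a linear WG id back into a (batch, head, Q tile) triple. The names `build_tile_offsets` and `decode_wg`, the head-major ordering, and the tile size of 128 are all hypothetical choices made for the example.

```python
import bisect
from itertools import accumulate


def build_tile_offsets(seq_lens_q, tile_q):
    """Prefix sum of per-batch Q-tile counts; offsets[-1] is the tile total."""
    tiles_per_batch = [-(-n // tile_q) for n in seq_lens_q]  # ceil division
    return [0] + list(accumulate(tiles_per_batch))


def decode_wg(wg_id, offsets):
    """Map a linear WG id to (batch, head, q_tile).

    ASSUMPTION: WGs are ordered head-major, i.e. all tiles of head 0 first.
    """
    head, tile = divmod(wg_id, offsets[-1])
    # Last batch whose first tile index is <= tile; skips zero-length batches.
    batch = bisect.bisect_right(offsets, tile) - 1
    return batch, head, tile - offsets[batch]


seq_lens_q = [672, 192, 32, 320, 32, 96, 32, 32,
              8096, 32, 4032, 32, 32, 32, 160, 32]
num_heads_q = 32
offsets = build_tile_offsets(seq_lens_q, tile_q=128)  # tile size is assumed
num_wgs = offsets[-1] * num_heads_q                   # launch exactly this many
work = [decode_wg(i, offsets) for i in range(num_wgs)]
```

Every launched WG decodes to a distinct tile that has work, so no WG needs an early-exit check against the batch max. A device kernel would more likely use a binary search over the same offsets array or a precomputed per-tile table; which of those this PR uses is not stated here.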

In addition, this PR:

  • Adds a Zipf-based varlen initializer to better emulate long-tailed sequence distributions.
  • Fixes varlen performance metrics so the reported FLOPs and GB/s reflect the actual per-sample effective lengths rather than a global max.
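
To give a feel for what a Zipf-based length initializer might look like, here is a minimal sketch. It is not the initializer from this PR and will not reproduce the lengths in the log below: the function name `zipf_lengths`, the seeding, and the choice to draw the length directly from a truncated Zipf law before rounding up to the alignment are all assumptions. Only the parameter values (Nmax=16384, s=1.098, alignment 32) are borrowed from the log.

```python
import bisect
import random
from itertools import accumulate


def zipf_lengths(batch, n_max, s, align, seed=0):
    """Draw `batch` lengths with P(n) proportional to n**-s on 1..n_max,
    then round each one up to a multiple of `align`.

    Hypothetical helper for illustration only.
    """
    # Cumulative unnormalized weights for inverse-CDF sampling.
    cum = list(accumulate(n ** -s for n in range(1, n_max + 1)))
    rng = random.Random(seed)
    lengths = []
    for _ in range(batch):
        u = rng.random() * cum[-1]
        n = min(bisect.bisect_left(cum, u) + 1, n_max)
        lengths.append(-(-n // align) * align)  # round up to the alignment
    return lengths


lengths = zipf_lengths(batch=16, n_max=16384, s=1.098, align=32)
print(lengths)
```

With an exponent close to 1, most draws are small and collapse to the minimum aligned length, while an occasional draw lands far out in the tail, which is the skewed shape the test case is meant to exercise.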

Test case:

./build/examples/sycl/06_bmg_flash_attention/06_bmg_prefill_attention_hdim128 --batch=16 --num_heads_q=32 --num_heads_kv=8 --seq_len_qo=1024 --seq_len_kv=1024 --head_size_vo=128  --head_size_qk=128 --varlen --iterations=10000 --scheduler="ValidOnlyIndividual" --varlen_dist="zipf"
[VarLen Init/Zipf] batches=16 Zipf(Nmax=16384, s=1.098) AlignQ(elems)=32 AlignKV(elems)=32
  batch 0 : Q=672, KV=672
  batch 1 : Q=192, KV=192
  batch 2 : Q=32, KV=32
  batch 3 : Q=320, KV=320
  batch 4 : Q=32, KV=32
  batch 5 : Q=96, KV=96
  batch 6 : Q=32, KV=32
  batch 7 : Q=32, KV=32
  batch 8 : Q=8096, KV=8096
  batch 9 : Q=32, KV=32
  batch 10 : Q=4032, KV=4032
  batch 11 : Q=32, KV=32
  batch 12 : Q=32, KV=32
  batch 13 : Q=32, KV=32
  batch 14 : Q=160, KV=160
  batch 15 : Q=32, KV=32
  cumulative_seqlen_q : 0 672 864 896 1216 1248 1344 1376 1408 9504 9536 13568 13600 13632 13664 13824 13856
  cumulative_seqlen_kv: 0 672 864 896 1216 1248 1344 1376 1408 9504 9536 13568 13600 13632 13664 13824 13856
  totals: Q=13856 (max 8096), KV=13856 (max 8096)
Total tile number: 3808
Disposition: Passed
Batch: 16       NumHeads_q: 32  NumHeads_kv: 8  Seq Length QO: 1024     Seq Length KV: 1024     Head Size QK: 128       Head Size VO: 128        Causal Mask: false      Variable Sequence Length: true   Scheduler: ValidOnlyIndividual
Performance:   18.909  GB/s,    64.285  TFlop/s,   21.0103  ms
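
The reported figures in the log above are consistent with summing work per sample. The sketch below recomputes them from the printed lengths and elapsed time. The FLOP formula (2·Q·KV·d for Q·Kᵀ plus 2·Q·KV·d for P·V, per query head) and the element sizes (2 bytes for Q/K/V, 4 bytes for the output) are assumptions chosen because they match the log; the PR text does not state them.

```python
# Per-batch lengths (Q == KV in this test) and timing copied from the log.
seq_lens = [672, 192, 32, 320, 32, 96, 32, 32,
            8096, 32, 4032, 32, 32, 32, 160, 32]
HEADS_Q, HEADS_KV = 32, 8
HEAD_QK, HEAD_VO = 128, 128
BYTES_IN, BYTES_OUT = 2, 4   # ASSUMPTION: 16-bit inputs, 32-bit output
elapsed_s = 21.0103e-3

# FLOPs: two GEMMs per (sample, query head), each counted as 2*M*N*K.
flops = HEADS_Q * sum(2 * n * n * HEAD_QK + 2 * n * n * HEAD_VO for n in seq_lens)

# Bytes moved: read Q, K, V once and write O once, using effective lengths.
total = sum(seq_lens)
bytes_moved = (total * HEADS_Q * HEAD_QK * BYTES_IN      # Q
               + total * HEADS_KV * HEAD_QK * BYTES_IN   # K
               + total * HEADS_KV * HEAD_VO * BYTES_IN   # V
               + total * HEADS_Q * HEAD_VO * BYTES_OUT)  # O

tflops = flops / elapsed_s / 1e12
gbps = bytes_moved / elapsed_s / 1e9

# For contrast: FLOPs if every sample were charged at the batch max length.
flops_at_max = HEADS_Q * len(seq_lens) * 2 * max(seq_lens) ** 2 * (HEAD_QK + HEAD_VO)
overcount = flops_at_max / flops

print(f"{gbps:.3f} GB/s, {tflops:.3f} TFlop/s, overcount x{overcount:.1f}")
```

Under these assumptions, charging every sample at the max length would overstate the FLOP count by more than an order of magnitude for this batch, which is why the per-sample accounting matters for skewed inputs.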

Performance:
Launched WGs: reduced from 32,768 to 3,808

Kernel throughput: +0.54 TFLOP/s (about 0.8%)

Before (Individual): 63.74 TFLOP/s

After (ValidOnlyIndividual): 64.28 TFLOP/s

This gain is modest but consistent for this workload; more importantly, the launch now reflects actual work, improving scheduling efficiency.

User-facing options:
No default behavior change.
--scheduler="ValidOnlyIndividual": new valid-only scheduler for varlen prefill.
--varlen_dist="zipf": new varlen distribution option; the default remains the existing normal distribution.

@rolandschulz rolandschulz requested a review from tdeng5 September 19, 2025 05:13