Add a new tile scheduler for varlen prefill to avoid launching empty work groups #516
This PR adds a valid-only scheduler for varlen prefill, adds a Zipf varlen initializer, and corrects the varlen metrics. In our skewed test it removes 28,960 empty work groups (WGs), from 32,768 launched down to 3,808, for a consistent gain of about +0.5 TFLOP/s.
Problem:
For varlen (skewed) prefill, the existing “Individual” scheduler builds the grid using the batch max sequence length, so it launches many empty work groups that immediately return.
Example (our test case):
This wastes device-side scheduling bandwidth and inflates traces.
Solution
Following the idea used by FlashInfer (launch only work that exists), this PR adds a valid-only individual tile scheduler for the prefill kernel. It constructs a compact list of tiles that actually do work and launches exactly that many WGs.
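The compaction idea can be sketched on the host side as follows. This is a minimal illustration, not the PR's implementation: the function names, the `(batch, head, q_tile)` tile layout, the tile size of 128, and the sample sequence lengths are all assumptions made for the example.

```python
def dense_grid_size(seq_lens, num_heads, tile_q):
    """WG count when the grid is sized by the batch max sequence length."""
    max_tiles = (max(seq_lens) + tile_q - 1) // tile_q
    return len(seq_lens) * num_heads * max_tiles


def build_valid_tiles(seq_lens, num_heads, tile_q):
    """Compact list of (batch, head, q_tile) entries that have real work.

    Hypothetical layout: one WG per query tile per head per sequence.
    """
    tiles = []
    for b, n in enumerate(seq_lens):
        num_tiles = (n + tile_q - 1) // tile_q  # 0 for an empty sequence
        for h in range(num_heads):
            for t in range(num_tiles):
                tiles.append((b, h, t))
    return tiles


# Hypothetical skewed batch: one long sequence and several short ones.
seq_lens = [1024, 128, 64, 0]
num_heads, tile_q = 2, 128

tiles = build_valid_tiles(seq_lens, num_heads, tile_q)
dense = dense_grid_size(seq_lens, num_heads, tile_q)
print(f"dense grid: {dense} WGs, valid-only grid: {len(tiles)} WGs")
```

Under these assumptions, each launched WG would look up its work as `tiles[wg_id]`, so the launch size equals `len(tiles)` and no WG returns without doing work.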
In addition, this PR:
Test case:
Performance
Launched WGs: reduced from 32,768 to 3,808 (28,960 fewer, about 88%)
Kernel throughput: +0.54 TFLOP/s (about +0.85%)
Before (Individual): 63.74 TFLOP/s
After (ValidOnlyIndividual): 64.28 TFLOP/s
This gain is modest but consistent for this workload; more importantly, the launch now reflects actual work, improving scheduling efficiency.
User-facing options
No default behavior change.
--scheduler="ValidOnlyIndividual": new valid-only scheduler for varlen prefill.
--varlen_dist="zipf": new varlen distribution option; the default remains the existing normal distribution.
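For readers unfamiliar with why a Zipf distribution gives a skewed batch, here is a rough sketch of Zipf-like sequence-length sampling. It is not the initializer added by this PR: the function name, the exponent, the seed handling, and the length range are all hypothetical and chosen only for illustration.

```python
import random


def zipf_seq_lens(batch, max_len, exponent=1.0, seed=0):
    """Sample `batch` lengths in [1, max_len] with P(k) proportional to 1 / k**exponent.

    Hypothetical helper: short lengths are far more likely than long ones,
    which yields the kind of skewed batch that leaves a max-length grid mostly empty.
    """
    rng = random.Random(seed)
    lengths = range(1, max_len + 1)
    weights = [1.0 / (k ** exponent) for k in lengths]
    return rng.choices(lengths, weights=weights, k=batch)


lens = zipf_seq_lens(batch=64, max_len=4096)
print(f"min={min(lens)} max={max(lens)} mean={sum(lens) / len(lens):.1f}")
```

A fixed seed keeps the generated batch reproducible across runs, which matters when comparing schedulers on the same workload.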