Unable to reproduce SpecPrefill results #651
apetersson started this conversation in General
Replies: 1 comment
- Note: #652 fixes this for Qwen 122B, but not for Gemma 4. For Gemma 4, https://github.com/apetersson/omlx/tree/feature/gemma4-specprefill contains some experiments, but those did not lead to a concrete improvement.
-
I cannot reliably reproduce the reported SpecPrefill speedups in oMLX under apples-to-apples local benchmarks. The strongest mismatch is on Qwen3.5-122B, where the paper reports large TTFT improvements using a 2B draft model at a 20% keep rate, but my local oMLX results so far show SpecPrefill often tying or underperforming baseline prefill.
I also tested Gemma 4 models as a control, and those results suggest SpecPrefill behavior is inconsistent across model families and prompt lengths:

- gemma-4-31b-it-4bit: no measurable prompt-processing (PP) benefit from SpecPrefill
- gemma-4-31b-it-8bit: an apparent win on a very long prompt (~9k tokens), but it disappeared under tighter apples-to-apples reruns at ~5k tokens

This makes me suspect the oMLX SpecPrefill path differs from the implementation used for the paper.

To Reproduce

- Run oMLX locally on Apple Silicon with the large models installed.
- Target model: Qwen3.5-122B-A10B-5bit; draft models: Qwen3.5-2B-6bit, Qwen3.5-0.8B-5bit, Qwen3.5-0.8B-8bit.
- Settings: specprefill_enabled=true, specprefill_draft_model=<draft path>, specprefill_keep_pct=0.20, specprefill_threshold=1024.
- Measure prompt_eval_duration, time_to_first_token, and prompt_tokens_per_second.

I used local benchmark scripts to automate the above with clean reloads and unique prompts.
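For context on what specprefill_keep_pct=0.20 is supposed to do, here is a minimal sketch of the token-selection idea behind SpecPrefill as I understand it. This is illustrative only: the function name and the scoring interface are my own, not oMLX's API.

```python
# Illustrative sketch only: my reading of the SpecPrefill idea, not
# oMLX's actual code. A draft model assigns an importance score to each
# prompt token; only the top keep_pct fraction of tokens is then
# prefilled by the large target model. `select_kept_tokens` is a
# hypothetical helper name.

def select_kept_tokens(scores, keep_pct=0.20):
    """Return indices of the highest-scoring tokens, in prompt order."""
    n_keep = max(1, int(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return sorted(top)  # restore original prompt order

# Toy example: 10 tokens at keep_pct=0.20 -> the 2 highest-scoring survive.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.05, 0.4, 0.6, 0.2]
print(select_kept_tokens(scores, 0.20))  # -> [1, 3]
```

If this model is right, the 122B target should only prefill ~20% of the prompt, which is where the expected TTFT win would come from.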
Expected behavior

I expected:
- TTFT/prefill speedups on Qwen3.5-122B with a 2B draft at keep_pct=0.20, in line with the paper.

Instead:
- SpecPrefill ties or underperforms baseline prefill in my runs.
Desktop:
- oMLX 0.3.5.dev1 (source checkout)

Additional context
Hardware:
oMLX version: 0.3.5.dev1

Qwen target model: Qwen3.5-122B-A10B-5bit
- model_type: qwen3_5_moe
- architecture: Qwen3_5MoeForConditionalGeneration

Qwen draft models tested: Qwen3.5-2B-6bit, Qwen3.5-0.8B-5bit, Qwen3.5-0.8B-8bit

All tested Qwen draft models:
- model_type: qwen3_5
- architecture: Qwen3_5ForConditionalGeneration
- tokenizer backend: TokenizersBackend
- model_max_length: 262144

Qwen 122B early results:
- 8k_base_t1: 8035 prompt tokens, 41.48 s TTFT, 193.7 prompt tok/s
- 8k_sp_Qwen3.5-2B-6bit_t1: 8046 prompt tokens, 47.98 s TTFT, 167.7 prompt tok/s
- 8k_sp_Qwen3.5-0.8B-5bit_t1: 8048 prompt tokens, 41.57 s TTFT, 193.61 prompt tok/s
- 8k_sp_Qwen3.5-0.8B-8bit_t1: 8048 prompt tokens, 42.14 s TTFT, 190.98 prompt tok/s
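As a sanity check, the throughput figures above are consistent with prompt_tokens / TTFT, and the relative TTFT regressions of the SpecPrefill runs can be computed directly. This is plain arithmetic on the numbers above, not oMLX code:

```python
# Plain arithmetic on the Qwen 122B runs above (no oMLX dependency):
# prefill throughput is prompt tokens / TTFT, and each SpecPrefill run's
# TTFT is compared against the 41.48 s baseline.

def prompt_tps(tokens: int, ttft_s: float) -> float:
    return tokens / ttft_s

def ttft_delta_pct(baseline_s: float, run_s: float) -> float:
    return (run_s - baseline_s) / baseline_s * 100

print(round(prompt_tps(8035, 41.48), 1))       # baseline  -> 193.7 tok/s
print(round(ttft_delta_pct(41.48, 47.98), 1))  # 2B-6bit   -> +15.7% TTFT
print(round(ttft_delta_pct(41.48, 41.57), 1))  # 0.8B-5bit -> +0.2% TTFT
print(round(ttft_delta_pct(41.48, 42.14), 1))  # 0.8B-8bit -> +1.6% TTFT
```

So the 2B draft, which the paper singles out, is the one that regresses TTFT the most here.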
Gemma 4 findings:

- gemma-4-31b-it-4bit (draft gemma-4-e2b-it-4bit, keep rates 0.10/0.20/0.30, 8954-token prompt): baseline 102.33 prompt tok/s; sp10: 100.20 (-2.1%); sp20: 101.14 (-1.2%); sp30: 101.17 (-1.1%)
- gemma-4-31b-it-8bit, initial long-prompt run (~8954 tokens, draft gemma-4-e2b-it-4bit): baseline 55.27 prompt tok/s; sp10: 76.73 (+38.8%); sp20: 100.50 (+81.8%); sp30: 101.00 (+82.7%)
- gemma-4-31b-it-8bit, apples-to-apples rerun at ~5060 tokens (drafts gemma-4-e2b-it-4bit and gemma-4-e4b-it-4bit, keep rates 0.20/0.30): baseline 105.53 prompt tok/s; e2b @ 0.20: avg 104.88; e2b @ 0.30: avg 105.50; e4b @ 0.20: avg 104.78; e4b @ 0.30: avg 105.26
- gemma-4-31b-it-8bit, threshold sweep with the e2b draft and keep_pct=0.30, at 5k prompt tokens: baseline 106.09; th512: 105.34; th1024: 105.78; th2048: 105.41; th4096: 105.70; th8192: 105.49. The 8k sweep was also effectively flat so far: baseline 104.71; th512: 104.83; th1024: 105.22; th2048: 105.12; th4096: 104.71

Questions:

- Is Qwen3.5-2B-6bit the correct draft model to reproduce the paper's "2B draft" result in current oMLX?
- Is the oMLX SpecPrefill path equivalent to the implementation used for the paper?
- … oMLX?
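For reference, the Gemma 4 percentages quoted in the findings above follow directly from the baseline throughputs; a quick cross-check of the 8-bit initial long-prompt run (plain arithmetic on the numbers above):

```python
# Cross-check of the quoted Gemma 4 speedups: relative prompt-throughput
# change of each SpecPrefill run vs. its baseline.

def speedup_pct(baseline_tps: float, run_tps: float) -> float:
    return (run_tps / baseline_tps - 1) * 100

# 8-bit initial long-prompt run, baseline 55.27 prompt tok/s:
for keep, tps in [(0.10, 76.73), (0.20, 100.50), (0.30, 101.00)]:
    print(f"keep {keep:.2f}: {speedup_pct(55.27, tps):+.1f}%")
# -> keep 0.10: +38.8%, keep 0.20: +81.8%, keep 0.30: +82.7%
```

Note that the "winning" 8-bit runs land at roughly the same absolute tok/s (~100-101) as the flat 4-bit and rerun baselines, which is why I suspect the initial 55.27 tok/s baseline, not the SpecPrefill runs, was the outlier.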