Unable to reproduce SpecPrefill results #651
apetersson started this conversation in General
Replies: 1 comment
- Note: #652 fixes this for Qwen 122B, but not for Gemma 4. For Gemma 4, https://github.com/apetersson/omlx/tree/feature/gemma4-specprefill contains some experiments, but those did not lead to a concrete improvement.
-
I cannot reliably reproduce the reported SpecPrefill speedups in oMLX under apples-to-apples local benchmarks. The strongest mismatch is on Qwen3.5-122B, where the paper reports large TTFT improvements using a 2B draft model at a 20% keep rate, but my local oMLX results so far show SpecPrefill often tying or underperforming baseline prefill.
I also tested Gemma 4 models as a control, and those results suggest SpecPrefill behavior is inconsistent across model families and prompt lengths:

- gemma-4-31b-it-4bit: no measurable prompt-processing (PP) benefit from SpecPrefill
- gemma-4-31b-it-8bit: an apparent win on a very long prompt (~9k tokens), but it disappeared under tighter apples-to-apples reruns at ~5k tokens

This makes me suspect the oMLX SpecPrefill path differs from the implementation used for the paper.

To Reproduce

- Run oMLX locally on Apple Silicon with the large models installed.
- Target model: Qwen3.5-122B-A10B-5bit; draft models: Qwen3.5-2B-6bit, Qwen3.5-0.8B-5bit, Qwen3.5-0.8B-8bit.
- Settings: specprefill_enabled=true, specprefill_draft_model=<draft path>, specprefill_keep_pct=0.20, specprefill_threshold=1024.
- Measure prompt_eval_duration, time_to_first_token, and prompt_tokens_per_second.

I used local benchmark scripts to automate the above with clean reloads and unique prompts.
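For context on what specprefill_keep_pct=0.20 is supposed to do, here is a minimal sketch of the token-selection idea behind SpecPrefill as I understand it. This is illustrative only: the function name and the scoring interface are my own, not oMLX's API.

```python
# Illustrative sketch only: my reading of the SpecPrefill idea, not
# oMLX's actual code. A draft model assigns an importance score to each
# prompt token; only the top keep_pct fraction of tokens is then
# prefilled by the large target model. `select_kept_tokens` is a
# hypothetical helper name.

def select_kept_tokens(scores, keep_pct=0.20):
    """Return indices of the highest-scoring tokens, in prompt order."""
    n_keep = max(1, int(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return sorted(top)  # restore original prompt order

# Toy example: 10 tokens at keep_pct=0.20 -> the 2 highest-scoring survive.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.05, 0.4, 0.6, 0.2]
print(select_kept_tokens(scores, 0.20))  # -> [1, 3]
```

If this model is right, the 122B target should only prefill ~20% of the prompt, which is where the expected TTFT win would come from.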
Expected behavior

I expected:
- TTFT/prefill speedups on Qwen3.5-122B with a 2B draft at keep_pct=0.20, in line with the paper.

Instead:
- SpecPrefill ties or underperforms baseline prefill in my runs.
Desktop:
- oMLX 0.3.5.dev1 (source checkout)

Additional context
Hardware:
oMLX version: 0.3.5.dev1

Qwen target model: Qwen3.5-122B-A10B-5bit
- model_type: qwen3_5_moe
- architecture: Qwen3_5MoeForConditionalGeneration

Qwen draft models tested: Qwen3.5-2B-6bit, Qwen3.5-0.8B-5bit, Qwen3.5-0.8B-8bit

All tested Qwen draft models:
- model_type: qwen3_5
- architecture: Qwen3_5ForConditionalGeneration
- tokenizer backend: TokenizersBackend
- model_max_length: 262144

Qwen 122B early results:
- 8k_base_t1: 8035 prompt tokens, 41.48 s TTFT, 193.7 prompt tok/s
- 8k_sp_Qwen3.5-2B-6bit_t1: 8046 prompt tokens, 47.98 s TTFT, 167.7 prompt tok/s
- 8k_sp_Qwen3.5-0.8B-5bit_t1: 8048 prompt tokens, 41.57 s TTFT, 193.61 prompt tok/s
- 8k_sp_Qwen3.5-0.8B-8bit_t1: 8048 prompt tokens, 42.14 s TTFT, 190.98 prompt tok/s
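As a sanity check, the throughput figures above are consistent with prompt_tokens / TTFT, and the relative TTFT regressions of the SpecPrefill runs can be computed directly. This is plain arithmetic on the numbers above, not oMLX code:

```python
# Plain arithmetic on the Qwen 122B runs above (no oMLX dependency):
# prefill throughput is prompt tokens / TTFT, and each SpecPrefill run's
# TTFT is compared against the 41.48 s baseline.

def prompt_tps(tokens: int, ttft_s: float) -> float:
    return tokens / ttft_s

def ttft_delta_pct(baseline_s: float, run_s: float) -> float:
    return (run_s - baseline_s) / baseline_s * 100

print(round(prompt_tps(8035, 41.48), 1))       # baseline  -> 193.7 tok/s
print(round(ttft_delta_pct(41.48, 47.98), 1))  # 2B-6bit   -> +15.7% TTFT
print(round(ttft_delta_pct(41.48, 41.57), 1))  # 0.8B-5bit -> +0.2% TTFT
print(round(ttft_delta_pct(41.48, 42.14), 1))  # 0.8B-8bit -> +1.6% TTFT
```

So the 2B draft, which the paper singles out, is the one that regresses TTFT the most here.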
Gemma 4 findings:

- gemma-4-31b-it-4bit (draft gemma-4-e2b-it-4bit, keep rates 0.10/0.20/0.30, 8954-token prompt): baseline 102.33 prompt tok/s; sp10: 100.20 (-2.1%); sp20: 101.14 (-1.2%); sp30: 101.17 (-1.1%)
- gemma-4-31b-it-8bit, initial long-prompt run (~8954 tokens, draft gemma-4-e2b-it-4bit): baseline 55.27 prompt tok/s; sp10: 76.73 (+38.8%); sp20: 100.50 (+81.8%); sp30: 101.00 (+82.7%)
- gemma-4-31b-it-8bit, apples-to-apples rerun at ~5060 tokens (drafts gemma-4-e2b-it-4bit and gemma-4-e4b-it-4bit, keep rates 0.20/0.30): baseline 105.53 prompt tok/s; e2b @ 0.20: avg 104.88; e2b @ 0.30: avg 105.50; e4b @ 0.20: avg 104.78; e4b @ 0.30: avg 105.26
- gemma-4-31b-it-8bit, threshold sweep with the e2b draft and keep_pct=0.30, at 5k prompt tokens: baseline 106.09; th512: 105.34; th1024: 105.78; th2048: 105.41; th4096: 105.70; th8192: 105.49. The 8k sweep was also effectively flat so far: baseline 104.71; th512: 104.83; th1024: 105.22; th2048: 105.12; th4096: 104.71

Questions:

- Is Qwen3.5-2B-6bit the correct draft model to reproduce the paper's "2B draft" result in current oMLX?
- Is the oMLX SpecPrefill path equivalent to the implementation used for the paper?
- … oMLX?
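For reference, the Gemma 4 percentages quoted in the findings above follow directly from the baseline throughputs; a quick cross-check of the 8-bit initial long-prompt run (plain arithmetic on the numbers above):

```python
# Cross-check of the quoted Gemma 4 speedups: relative prompt-throughput
# change of each SpecPrefill run vs. its baseline.

def speedup_pct(baseline_tps: float, run_tps: float) -> float:
    return (run_tps / baseline_tps - 1) * 100

# 8-bit initial long-prompt run, baseline 55.27 prompt tok/s:
for keep, tps in [(0.10, 76.73), (0.20, 100.50), (0.30, 101.00)]:
    print(f"keep {keep:.2f}: {speedup_pct(55.27, tps):+.1f}%")
# -> keep 0.10: +38.8%, keep 0.20: +81.8%, keep 0.30: +82.7%
```

Note that the "winning" 8-bit runs land at roughly the same absolute tok/s (~100-101) as the flat 4-bit and rerun baselines, which is why I suspect the initial 55.27 tok/s baseline, not the SpecPrefill runs, was the outlier.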