Skip to content

Cleanup follow-ups for Qwen3-ASR batched decode#236

Open
ivan-digital wants to merge 7 commits into
mainfrom
cleanup/qwen3asr-batch-decode-followups
Open

Cleanup follow-ups for Qwen3-ASR batched decode#236
ivan-digital wants to merge 7 commits into
mainfrom
cleanup/qwen3asr-batch-decode-followups

Conversation

@ivan-digital

Copy link
Copy Markdown
Member

Stacks on ml-explore/mlx-swift-examples#453 (from a fork). This PR's diff includes ml-explore/mlx-swift-examples#453's commits because GitHub can't use the fork branch as a base. Once ml-explore/mlx-swift-examples#453 lands in main, this branch will rebase cleanly and the diff will shrink to the cleanup-only changes (5 commits, ~−52 lines source + ~120 lines new tests/docs).

Summary

Five atomic commits, all stacked on top of ml-explore/mlx-swift-examples#453:

  • refactor(Qwen3ASR) — Five new constants in Qwen3ASRTokens (asrText/newline/system/user/assistant); the four prompt-construction sites (generateText, buildPromptTokenIds, transcribe, transcribeNoMLX) now read them from one place instead of redeclaring nine raw token IDs each. PR Suggestion: Add an example for TTS (Text-to-Speech) ml-explore/mlx-swift-examples#453 added a third copy; this reduces it to one. Drops buildBatchedCausalMask plus its unit test — the helper is unused (only its own test referenced it).
  • fix(transcribe-batch) — JSONL time is now an even split across rows in a group instead of an audio-duration proration. Default transcribeBatch runs decoder forwards serially per row and only batches host token sync, so each row contributes ≈ equally per decode step. batch_time still surfaces real wall-clock.
  • test(Qwen3ASR) — Regression test for the experimental [B,1,H] row-truncation bug at batch size 2. Skips when QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE is unset; when set, asserts transcribeBatch == serial inside XCTExpectFailure so the bug is locked in until fixed (then the expectation goes unsatisfied and the test flips red).
  • test(Qwen3ASR) — Locks in the non-greedy fall-back. Five unit tests for isGreedyFastPath covering default-true and each non-greedy variant. E2E testBatchFallsBackToSerialForNonGreedyOptions asserts batched output matches per-chunk serial output when repetitionPenalty=1.15, noRepeatNgramSize=3 are set. Without these, removing the isGreedyFastPath guard would silently bypass sampling/repetition logic for batch callers.
  • docs(qwen3-asr) — Define "non-greedy" precisely in qwen3-asr-inference.md; document JSONL time/batch_time semantics; reframe the experimental-path limitation from soft warning to regression-tested constraint and reference the new tests.

Validation (cleanup branch)

Check Result
Build clean
Unit tests (Qwen3ASR + CLI) 14/14 (was 10 in ml-explore/mlx-swift-examples#453, +5 gate, −1 dead-code mask test)
E2E batched-decode correctness (greedy + non-greedy fall-back) 2/2
E2E greedy determinism (snapshot) 2/2 — serial decoder bit-stable post-dedupe
Qwen3ASR regression suite 132/132
Experimental test (no env) skips
Experimental test (QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1) captures truncation diff via XCTExpectFailure
Benchmark digest 56444fd6d4b9 identical to ml-explore/mlx-swift-examples#453
Benchmark speedup within run variance

Test plan

  • swift test --filter 'Qwen3ASRBatchedDecodeTests|TranscribeBatchCommandTests'
  • swift test --filter E2EQwen3ASRBatchedDecodeCorrectnessTests
  • swift test --filter E2EQwen3ASRGreedyDeterminismTests
  • swift test --filter Qwen3ASR --skip E2E
  • swift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests (no env → skips)
  • QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 swift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests (passes via XCTExpectFailure)
  • python3 scripts/benchmark_qwen3_batch_decode.py --batch-sizes 1,2,4,6,8 --num-chunks 24 --require-identical-output

hhh2210 and others added 6 commits May 7, 2026 10:20
…elper

Five new constants land in Qwen3ASRTokens (asrText/newline/system/user/
assistant) and the four prompt-construction sites — generateText,
buildPromptTokenIds, transcribe, transcribeNoMLX — now read them from
the single source of truth instead of redeclaring nine raw token IDs
each. PR #234 added a third copy of these magic numbers; this reduces
that to one and prevents the next prompt-template change from silently
diverging across paths.

buildBatchedCausalMask was wired up for the experimental [B,1,H] path
but is not consumed by either generateTextBatched or
generateTextBatchedBulkSync — its only caller is its own unit test.
Drop the helper plus the test so the decode loop has no unreachable
helpers riding along.
The default transcribeBatch path runs decoder forwards serially per
row and only batches the host token sync, so each row in a batch
contributes ≈ equally per decode step. Prorating elapsed by audio
duration over-credited longer chunks; an even split is more honest
and matches what the BulkSync path actually does. The true wall-clock
group time still surfaces in batch_time, so existing dashboards that
care about real elapsed are unaffected.
PR #234 documents that QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 truncates
row 1 at batch size 2 on repeated identical chunks, but ships no test.
Add an E2E test that skips when the env var is unset (default CI is
unaffected) and runs transcribeBatch vs serial transcribe when set,
asserting equality inside XCTExpectFailure. As long as the bug exists
the expected failure makes the test pass; once the path is fixed the
expectation goes unsatisfied and the test flips red — a forcing
function to drop the gate and the experimental code, instead of
letting the broken path silently linger.
The doc claim that non-greedy options always fall back to per-chunk
serial transcribe had zero CI coverage — if a future refactor drops
the isGreedyFastPath guard, repetition penalty / n-gram mask /
temperature would be silently bypassed for batch callers and the
batched argMax would override sampling.

Add five unit tests covering isGreedyFastPath: true for default
options, false for repetitionPenalty != 1.0, noRepeatNgramSize > 0,
temperature > 0, and the combined non-greedy case. Add an E2E test
testBatchFallsBackToSerialForNonGreedyOptions that asserts batched
output matches per-chunk serial output when non-greedy options are
set, so behavioral equivalence is checked end to end.
Define non-greedy options precisely (any of repetitionPenalty != 1.0,
noRepeatNgramSize > 0, temperature > 0.0) instead of leaving readers
to derive it from the prose. Reference the new
isGreedyFastPath unit tests and testBatchFallsBackToSerialForNonGreedyOptions
E2E so the fall-back contract is auditable.

Document JSONL time / batch_time semantics — time is now an even
attribution across rows in the group rather than an audio-duration
proration; batch_time is the only true wall-clock.

Reframe the experimental [B,1,H] limitation from soft warning
('must be benchmarked before being treated as production-safe') to
a regression-tested constraint, citing
E2EQwen3ASRExperimentalBatchedDecodeTests.
Bisects the row-asymmetric-output bug in our experimental [B,1,H]
batched decoder down to MLXNN.RoPE itself. With B=2 row-symmetric
input at T=1, the op produces row-asymmetric output even at
offset=0; T>1 and B=1 are correct. Filed upstream as
ml-explore/mlx-swift#401.

The three broken shapes are wrapped in XCTExpectFailure citing the
upstream issue. When mlx-swift#401 is fixed, the equality assertions
succeed, the expected failures go unsatisfied, and these tests flip
red — that's the trigger to drop the workaround in transcribeBatch
and remove the env-var gate around generateTextBatched.

The probe is intentionally synthetic (no model download) so it runs
in the unit-test sweep and serves as a tight, attributable repro for
anyone reading the upstream issue.
@ivan-digital

Copy link
Copy Markdown
Member Author

Added a sixth commit: Tests/Qwen3ASRTests/QuantizedBatchInvarianceProbe.swift — synthetic op-level probe that pins down the bug to MLXNN.RoPE at [B>1, H, T=1, D] (no model download, no quantization needed to reproduce). Three broken-shape tests wrapped in XCTExpectFailure citing ml-explore/mlx#3496, so when upstream lands the fix the expectations go unsatisfied and the tests flip red — the signal to drop the experimental-path env-var gate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants