Cleanup follow-ups for Qwen3-ASR batched decode by ivan-digital · Pull Request #236 · soniqo/speech-swift

ivan-digital · 2026-05-07T05:47:35Z

Stacks on ml-explore/mlx-swift-examples#453 (from a fork). This PR's diff includes ml-explore/mlx-swift-examples#453's commits because GitHub can't use the fork branch as a base. Once ml-explore/mlx-swift-examples#453 lands in main, this branch will rebase cleanly and the diff will shrink to the cleanup-only changes (5 commits, ~−52 lines source + ~120 lines new tests/docs).

Summary

Five atomic commits, all stacked on top of ml-explore/mlx-swift-examples#453:

refactor(Qwen3ASR) — Five new constants in Qwen3ASRTokens (asrText/newline/system/user/assistant); the four prompt-construction sites (generateText, buildPromptTokenIds, transcribe, transcribeNoMLX) now read them from one place instead of redeclaring nine raw token IDs each. PR Suggestion: Add an example for TTS (Text-to-Speech) ml-explore/mlx-swift-examples#453 added a third copy; this reduces it to one. Drops buildBatchedCausalMask plus its unit test — the helper is unused (only its own test referenced it).
fix(transcribe-batch) — JSONL time is now an even split across rows in a group instead of an audio-duration proration. Default transcribeBatch runs decoder forwards serially per row and only batches host token sync, so each row contributes ≈ equally per decode step. batch_time still surfaces real wall-clock.
test(Qwen3ASR) — Regression test for the experimental [B,1,H] row-truncation bug at batch size 2. Skips when QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE is unset; when set, asserts transcribeBatch == serial inside XCTExpectFailure so the bug is locked in until fixed (then the expectation goes unsatisfied and the test flips red).
test(Qwen3ASR) — Locks in the non-greedy fall-back. Five unit tests for isGreedyFastPath covering default-true and each non-greedy variant. E2E testBatchFallsBackToSerialForNonGreedyOptions asserts batched output matches per-chunk serial output when repetitionPenalty=1.15, noRepeatNgramSize=3 are set. Without these, removing the isGreedyFastPath guard would silently bypass sampling/repetition logic for batch callers.
docs(qwen3-asr) — Define "non-greedy" precisely in qwen3-asr-inference.md; document JSONL time/batch_time semantics; reframe the experimental-path limitation from soft warning to regression-tested constraint and reference the new tests.

Validation (cleanup branch)

Check	Result
Build	clean
Unit tests (Qwen3ASR + CLI)	14/14 (was 10 in ml-explore/mlx-swift-examples#453, +5 gate, −1 dead-code mask test)
E2E batched-decode correctness (greedy + non-greedy fall-back)	2/2
E2E greedy determinism (snapshot)	2/2 — serial decoder bit-stable post-dedupe
Qwen3ASR regression suite	132/132
Experimental test (no env)	skips
Experimental test (`QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1`)	captures truncation diff via `XCTExpectFailure`
Benchmark digest	`56444fd6d4b9` identical to ml-explore/mlx-swift-examples#453
Benchmark speedup	within run variance

Test plan

swift test --filter 'Qwen3ASRBatchedDecodeTests|TranscribeBatchCommandTests'
swift test --filter E2EQwen3ASRBatchedDecodeCorrectnessTests
swift test --filter E2EQwen3ASRGreedyDeterminismTests
swift test --filter Qwen3ASR --skip E2E
swift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests (no env → skips)
QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 swift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests (passes via XCTExpectFailure)
python3 scripts/benchmark_qwen3_batch_decode.py --batch-sizes 1,2,4,6,8 --num-chunks 24 --require-identical-output

…elper Five new constants land in Qwen3ASRTokens (asrText/newline/system/user/ assistant) and the four prompt-construction sites — generateText, buildPromptTokenIds, transcribe, transcribeNoMLX — now read them from the single source of truth instead of redeclaring nine raw token IDs each. PR #234 added a third copy of these magic numbers; this reduces that to one and prevents the next prompt-template change from silently diverging across paths. buildBatchedCausalMask was wired up for the experimental [B,1,H] path but is not consumed by either generateTextBatched or generateTextBatchedBulkSync — its only caller is its own unit test. Drop the helper plus the test so the decode loop has no unreachable helpers riding along.

The default transcribeBatch path runs decoder forwards serially per row and only batches the host token sync, so each row in a batch contributes ≈ equally per decode step. Prorating elapsed by audio duration over-credited longer chunks; an even split is more honest and matches what the BulkSync path actually does. The true wall-clock group time still surfaces in batch_time, so existing dashboards that care about real elapsed are unaffected.

PR #234 documents that QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 truncates row 1 at batch size 2 on repeated identical chunks, but ships no test. Add an E2E test that skips when the env var is unset (default CI is unaffected) and runs transcribeBatch vs serial transcribe when set, asserting equality inside XCTExpectFailure. As long as the bug exists the expected failure makes the test pass; once the path is fixed the expectation goes unsatisfied and the test flips red — a forcing function to drop the gate and the experimental code, instead of letting the broken path silently linger.

The doc claim that non-greedy options always fall back to per-chunk serial transcribe had zero CI coverage — if a future refactor drops the isGreedyFastPath guard, repetition penalty / n-gram mask / temperature would be silently bypassed for batch callers and the batched argMax would override sampling. Add five unit tests covering isGreedyFastPath: true for default options, false for repetitionPenalty != 1.0, noRepeatNgramSize > 0, temperature > 0, and the combined non-greedy case. Add an E2E test testBatchFallsBackToSerialForNonGreedyOptions that asserts batched output matches per-chunk serial output when non-greedy options are set, so behavioral equivalence is checked end to end.

Define non-greedy options precisely (any of repetitionPenalty != 1.0, noRepeatNgramSize > 0, temperature > 0.0) instead of leaving readers to derive it from the prose. Reference the new isGreedyFastPath unit tests and testBatchFallsBackToSerialForNonGreedyOptions E2E so the fall-back contract is auditable. Document JSONL time / batch_time semantics — time is now an even attribution across rows in the group rather than an audio-duration proration; batch_time is the only true wall-clock. Reframe the experimental [B,1,H] limitation from soft warning ('must be benchmarked before being treated as production-safe') to a regression-tested constraint, citing E2EQwen3ASRExperimentalBatchedDecodeTests.

Bisects the row-asymmetric-output bug in our experimental [B,1,H] batched decoder down to MLXNN.RoPE itself. With B=2 row-symmetric input at T=1, the op produces row-asymmetric output even at offset=0; T>1 and B=1 are correct. Filed upstream as ml-explore/mlx-swift#401. The three broken shapes are wrapped in XCTExpectFailure citing the upstream issue. When mlx-swift#401 is fixed, the equality assertions succeed, the expected failures go unsatisfied, and these tests flip red — that's the trigger to drop the workaround in transcribeBatch and remove the env-var gate around generateTextBatched. The probe is intentionally synthetic (no model download) so it runs in the unit-test sweep and serves as a tight, attributable repro for anyone reading the upstream issue.

ivan-digital · 2026-05-07T06:49:00Z

Added a sixth commit: Tests/Qwen3ASRTests/QuantizedBatchInvarianceProbe.swift — synthetic op-level probe that pins down the bug to MLXNN.RoPE at [B>1, H, T=1, D] (no model download, no quantization needed to reproduce). Three broken-shape tests wrapped in XCTExpectFailure citing ml-explore/mlx#3496, so when upstream lands the fix the expectations go unsatisfied and the tests flip red — the signal to drop the experimental-path env-var gate.

hhh2210 and others added 6 commits May 7, 2026 10:20

perf(qwen3asr): optimize batch decode sync

92d50fe

ivan-digital mentioned this pull request May 7, 2026

Optimize Qwen3-ASR batch decode CPU sync #234

Open

ivan-digital mentioned this pull request Jun 14, 2026

MLXNN.RoPE produces row-asymmetric output for byte-identical row inputs at [B>1, H, 1, D] ml-explore/mlx#3496

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Cleanup follow-ups for Qwen3-ASR batched decode#236

Cleanup follow-ups for Qwen3-ASR batched decode#236
ivan-digital wants to merge 7 commits into
mainfrom
cleanup/qwen3asr-batch-decode-followups

ivan-digital commented May 7, 2026

Uh oh!

ivan-digital commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

ivan-digital commented May 7, 2026

Summary

Validation (cleanup branch)

Test plan

Uh oh!

ivan-digital commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants