Cleanup follow-ups for Qwen3-ASR batched decode#236
Open
ivan-digital wants to merge 7 commits into
Open
Conversation
…elper Five new constants land in Qwen3ASRTokens (asrText/newline/system/user/ assistant) and the four prompt-construction sites — generateText, buildPromptTokenIds, transcribe, transcribeNoMLX — now read them from the single source of truth instead of redeclaring nine raw token IDs each. PR #234 added a third copy of these magic numbers; this reduces that to one and prevents the next prompt-template change from silently diverging across paths. buildBatchedCausalMask was wired up for the experimental [B,1,H] path but is not consumed by either generateTextBatched or generateTextBatchedBulkSync — its only caller is its own unit test. Drop the helper plus the test so the decode loop has no unreachable helpers riding along.
The default transcribeBatch path runs decoder forwards serially per row and only batches the host token sync, so each row in a batch contributes ≈ equally per decode step. Prorating elapsed by audio duration over-credited longer chunks; an even split is more honest and matches what the BulkSync path actually does. The true wall-clock group time still surfaces in batch_time, so existing dashboards that care about real elapsed are unaffected.
PR #234 documents that QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 truncates row 1 at batch size 2 on repeated identical chunks, but ships no test. Add an E2E test that skips when the env var is unset (default CI is unaffected) and runs transcribeBatch vs serial transcribe when set, asserting equality inside XCTExpectFailure. As long as the bug exists the expected failure makes the test pass; once the path is fixed the expectation goes unsatisfied and the test flips red — a forcing function to drop the gate and the experimental code, instead of letting the broken path silently linger.
The doc claim that non-greedy options always fall back to per-chunk serial transcribe had zero CI coverage — if a future refactor drops the isGreedyFastPath guard, repetition penalty / n-gram mask / temperature would be silently bypassed for batch callers and the batched argMax would override sampling. Add five unit tests covering isGreedyFastPath: true for default options, false for repetitionPenalty != 1.0, noRepeatNgramSize > 0, temperature > 0, and the combined non-greedy case. Add an E2E test testBatchFallsBackToSerialForNonGreedyOptions that asserts batched output matches per-chunk serial output when non-greedy options are set, so behavioral equivalence is checked end to end.
Define non-greedy options precisely (any of repetitionPenalty != 1.0,
noRepeatNgramSize > 0, temperature > 0.0) instead of leaving readers
to derive it from the prose. Reference the new
isGreedyFastPath unit tests and testBatchFallsBackToSerialForNonGreedyOptions
E2E so the fall-back contract is auditable.
Document JSONL time / batch_time semantics — time is now an even
attribution across rows in the group rather than an audio-duration
proration; batch_time is the only true wall-clock.
Reframe the experimental [B,1,H] limitation from soft warning
('must be benchmarked before being treated as production-safe') to
a regression-tested constraint, citing
E2EQwen3ASRExperimentalBatchedDecodeTests.
Bisects the row-asymmetric-output bug in our experimental [B,1,H] batched decoder down to MLXNN.RoPE itself. With B=2 row-symmetric input at T=1, the op produces row-asymmetric output even at offset=0; T>1 and B=1 are correct. Filed upstream as ml-explore/mlx-swift#401. The three broken shapes are wrapped in XCTExpectFailure citing the upstream issue. When mlx-swift#401 is fixed, the equality assertions succeed, the expected failures go unsatisfied, and these tests flip red — that's the trigger to drop the workaround in transcribeBatch and remove the env-var gate around generateTextBatched. The probe is intentionally synthetic (no model download) so it runs in the unit-test sweep and serves as a tight, attributable repro for anyone reading the upstream issue.
Member
Author
|
Added a sixth commit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacks on ml-explore/mlx-swift-examples#453 (from a fork). This PR's diff includes ml-explore/mlx-swift-examples#453's commits because GitHub can't use the fork branch as a base. Once ml-explore/mlx-swift-examples#453 lands in main, this branch will rebase cleanly and the diff will shrink to the cleanup-only changes (5 commits, ~−52 lines source + ~120 lines new tests/docs).
Summary
Five atomic commits, all stacked on top of ml-explore/mlx-swift-examples#453:
refactor(Qwen3ASR)— Five new constants inQwen3ASRTokens(asrText/newline/system/user/assistant); the four prompt-construction sites (generateText,buildPromptTokenIds,transcribe,transcribeNoMLX) now read them from one place instead of redeclaring nine raw token IDs each. PR Suggestion: Add an example for TTS (Text-to-Speech) ml-explore/mlx-swift-examples#453 added a third copy; this reduces it to one. DropsbuildBatchedCausalMaskplus its unit test — the helper is unused (only its own test referenced it).fix(transcribe-batch)— JSONLtimeis now an even split across rows in a group instead of an audio-duration proration. DefaulttranscribeBatchruns decoder forwards serially per row and only batches host token sync, so each row contributes ≈ equally per decode step.batch_timestill surfaces real wall-clock.test(Qwen3ASR)— Regression test for the experimental[B,1,H]row-truncation bug at batch size 2. Skips whenQWEN3_ASR_EXPERIMENTAL_BATCH_DECODEis unset; when set, assertstranscribeBatch == serialinsideXCTExpectFailureso the bug is locked in until fixed (then the expectation goes unsatisfied and the test flips red).test(Qwen3ASR)— Locks in the non-greedy fall-back. Five unit tests forisGreedyFastPathcovering default-true and each non-greedy variant. E2EtestBatchFallsBackToSerialForNonGreedyOptionsasserts batched output matches per-chunk serial output whenrepetitionPenalty=1.15, noRepeatNgramSize=3are set. Without these, removing theisGreedyFastPathguard would silently bypass sampling/repetition logic for batch callers.docs(qwen3-asr)— Define "non-greedy" precisely inqwen3-asr-inference.md; document JSONLtime/batch_timesemantics; reframe the experimental-path limitation from soft warning to regression-tested constraint and reference the new tests.Validation (cleanup branch)
QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1)XCTExpectFailure56444fd6d4b9identical to ml-explore/mlx-swift-examples#453Test plan
swift test --filter 'Qwen3ASRBatchedDecodeTests|TranscribeBatchCommandTests'swift test --filter E2EQwen3ASRBatchedDecodeCorrectnessTestsswift test --filter E2EQwen3ASRGreedyDeterminismTestsswift test --filter Qwen3ASR --skip E2Eswift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests(no env → skips)QWEN3_ASR_EXPERIMENTAL_BATCH_DECODE=1 swift test --filter E2EQwen3ASRExperimentalBatchedDecodeTests(passes viaXCTExpectFailure)python3 scripts/benchmark_qwen3_batch_decode.py --batch-sizes 1,2,4,6,8 --num-chunks 24 --require-identical-output