Vmendelev/2512 s2s eval #1246

Draft
wprazuch wants to merge 68 commits into main from vmendelev/2512_s2s_eval

Conversation

@wprazuch
Collaborator

No description provided.

karpnv and others added 30 commits December 20, 2025 07:00
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
vmendelev and others added 28 commits February 3, 2026 14:53
Config file for running subtests that were missed in the initial run.

Co-authored-by: Cursor <cursoragent@cursor.com>
The S2S model outputs special tokens like <$0.72$> and <|3.6|> which
represent energy/confidence and timing markers. These were being
included in the text output, causing issues with VoiceBench scoring:
- Exact match tests (bbh, openbookqa, mmsu) failed due to prefix mismatch
- IFEval instruction following patterns didn't match

Added _clean_special_tokens() to strip these markers from the output.

Co-authored-by: Cursor <cursoragent@cursor.com>
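
A minimal sketch of the kind of cleanup the commit above describes. The two marker shapes are inferred only from the examples quoted in the commit message (`<$0.72$>` and `<|3.6|>`); the regex, the whitespace handling, and the function name `clean_special_tokens` are illustrative assumptions and may differ from the PR's actual `_clean_special_tokens()`.

```python
import re

# Assumed marker shapes, taken from the commit message examples:
#   <$0.72$>  and  <|3.6|>
# The real token grammar in the S2S model output may be broader.
_SPECIAL_TOKEN_RE = re.compile(r"<\$\d+(?:\.\d+)?\$>|<\|\d+(?:\.\d+)?\|>")


def clean_special_tokens(text: str) -> str:
    """Remove the assumed S2S markers and collapse leftover whitespace."""
    stripped = _SPECIAL_TOKEN_RE.sub("", text)
    # Collapse runs of whitespace so exact-match scorers see a clean prefix.
    return " ".join(stripped.split())
```

Stripping before scoring matters most for the exact-match subtests, where a leading marker would otherwise make the answer prefix mismatch.
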
Strip S2S timing tokens (<$X.XX$> and <|X.XX|>) from generation
field during conversion to VoiceBench format. This fixes scoring
issues where these markers caused exact match failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements a YAML-driven NemotronVoiceChat offline backend (script-like OmegaConf resolution), wires it into serve_unified, and adds VoiceBench configs including a 10-sample sd_qa smoke test.

Co-authored-by: Cursor <cursoragent@cursor.com>
Match Kevin's recipe by treating tts_ckpt_path as pretrained_model when it is a file, and only using pretrained_tts_model for exported directories. Add a 10-sample sd_qa smoke config with audio output enabled.

Co-authored-by: Cursor <cursoragent@cursor.com>
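
The file-versus-directory rule in the commit above could look roughly like this. The key names `pretrained_model` and `pretrained_tts_model` come from the commit message; the helper name `resolve_tts_overrides` and the idea of returning a plain dict of overrides are hypothetical, not the backend's actual interface.

```python
import os


def resolve_tts_overrides(tts_ckpt_path: str) -> dict:
    """Pick the config key for a TTS checkpoint path (hypothetical helper).

    A single checkpoint file is passed as `pretrained_model`; anything else
    is treated as an exported directory and passed as `pretrained_tts_model`.
    """
    if os.path.isfile(tts_ckpt_path):
        return {"pretrained_model": tts_ckpt_path}
    return {"pretrained_tts_model": tts_ckpt_path}
```
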
Expose per-request ASR hypothesis from offline_inference outputs (asr_hyps) in debug_info to aid debugging and evaluation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add an intermediate stage that transcribes generated agent audio, computes WER/CER vs generated text, writes output_asr.jsonl + agent_audio_metrics.json, and optionally scores VoiceBench on generated text or agent ASR via config.

Co-authored-by: Cursor <cursoragent@cursor.com>
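
For reference, WER and CER of an ASR transcript against the generated text can be computed with a plain Levenshtein distance. This is a dependency-free sketch, not the PR's implementation: the function names, the lack of text normalization, and the handling of an empty reference are all assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length."""
    if not ref:
        # Assumed convention: empty reference scores 0 only if hyp is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def wer(generated_text: str, asr_text: str) -> float:
    """Word error rate, with the generated text as the reference."""
    return error_rate(generated_text.split(), asr_text.split())


def cer(generated_text: str, asr_text: str) -> float:
    """Character error rate, with the generated text as the reference."""
    return error_rate(list(generated_text), list(asr_text))
```

In practice both strings would usually be normalized (case, punctuation) before comparison; that step is omitted here.
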
When output.jsonl and .done markers exist, skip generation and avoid setting dependencies on the generation expname so the ASR/WER and scoring stages can be rerun.

Co-authored-by: Cursor <cursoragent@cursor.com>
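
The skip logic in the commit above can be pictured as follows. The marker filename `output.jsonl.done`, the function names, and the returned dict shape are assumptions made for illustration; the commit message only says that `output.jsonl` and `.done` markers must exist.

```python
from pathlib import Path


def generation_is_complete(output_dir: str) -> bool:
    """True when both the output file and its done marker exist."""
    out = Path(output_dir) / "output.jsonl"
    done = Path(output_dir) / "output.jsonl.done"  # assumed marker name
    return out.is_file() and done.is_file()


def plan_stages(output_dir: str, generation_expname: str) -> dict:
    """Decide whether to run generation and what later stages depend on."""
    if generation_is_complete(output_dir):
        # Nothing to wait for: downstream stages can be rerun on their own.
        return {"run_generation": False, "run_after": []}
    return {"run_generation": True, "run_after": [generation_expname]}
```
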
Always run two VoiceBench scoring stages: one on output.jsonl (generated) and one on output_asr.jsonl (agent ASR). Store both results in metrics.json under greedy and greedy_asr.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep a single greedy metrics dict: generated-text scoring writes panda/gpt keys, ASR scoring writes panda_asr/gpt_asr keys, and both runs merge into one metrics.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
When writing ASR-scored results (panda_asr/gpt_asr), merge into the existing greedy dict instead of replacing it so metrics.json contains both generated and ASR metrics.

Co-authored-by: Cursor <cursoragent@cursor.com>
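
A sketch of the merge behaviour described in the last two commits, assuming `metrics.json` is a JSON object with a top-level `greedy` dict. The helper name `merge_greedy_metrics` and the score values in the usage note are hypothetical.

```python
import json
from pathlib import Path


def merge_greedy_metrics(metrics_path: str, new_scores: dict) -> dict:
    """Merge new scores into metrics['greedy'] without dropping existing keys."""
    path = Path(metrics_path)
    metrics = json.loads(path.read_text()) if path.is_file() else {}
    greedy = metrics.setdefault("greedy", {})
    greedy.update(new_scores)  # e.g. panda/gpt, then panda_asr/gpt_asr
    path.write_text(json.dumps(metrics, indent=2))
    return metrics
```

Calling it once with `{"panda": ..., "gpt": ...}` and again with `{"panda_asr": ..., "gpt_asr": ...}` leaves all four keys under `greedy`, which is the outcome the commits aim for.
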
If output.jsonl has no audio paths (or a sample lacks audio), write output_asr.jsonl as a passthrough of generated text to avoid empty transcripts and misleading ASR scoring.

Co-authored-by: Cursor <cursoragent@cursor.com>
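
The passthrough rule above might be sketched like this. The `generation` field is mentioned in an earlier commit; the `audio_path` field name, the `transcribe` callable, and the choice to overwrite `generation` with the transcript are assumptions for illustration only.

```python
from typing import Callable


def build_asr_record(sample: dict, transcribe: Callable[[str], str]) -> dict:
    """Build one output_asr.jsonl record from one output.jsonl record."""
    record = dict(sample)  # copy so the input sample is not mutated
    audio_path = sample.get("audio_path")  # assumed field name
    if audio_path:
        record["generation"] = transcribe(audio_path)
    # No audio: keep the generated text as-is instead of an empty transcript.
    return record
```
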
Add full VoiceBench audio-enabled config plus an sd_qa smoke config for ASR-scored evaluation, and document the s2s_voicechat backend usage and exact commands.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add dataset preparation with auto-download support
- Add four subtests: pause, backchannel, turn_taking, interruption
- Add scoring integration scripts and format converters
- Add evaluation scripts and configuration files
- Add result analysis and comparison utilities
- Add comprehensive documentation and evaluation guide
- Remove 727 audio files (712MB) from git tracking
- Add data/ directory to .gitignore
- Data will be prepared on cluster using prepare.py
- Significantly reduces git packaging time
- Update to use evaluation/evaluate.py (not root evaluate.py)
- Map subtest names to FDB task names (pause -> pause_handling, etc.)
- Add note about ASR transcript requirement
- Add --audio_output_dir to save audio to mmkrtchyan's directory
- Fixes Permission denied error when saving to vmendelev's directory
- Set TMPDIR to mmkrtchyan's directory to override hardcoded vmendelev path
- Update config to use custom inference yaml
- unified_server.py already supports AUDIO_SAVE_DIR environment variable
- Set to mmkrtchyan's directory instead of vmendelev's hardcoded path
- This will fix Permission denied errors when saving audio
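
As a rough illustration of the `AUDIO_SAVE_DIR` override mentioned above: an environment variable takes precedence over a built-in default path. The helper name, the `env` parameter, and the fallback behaviour are assumptions; the actual handling in unified_server.py is not shown in this PR page.

```python
import os
from typing import Mapping, Optional


def resolve_audio_save_dir(default_dir: str,
                           env: Optional[Mapping[str, str]] = None) -> str:
    """Return AUDIO_SAVE_DIR if set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("AUDIO_SAVE_DIR") or default_dir
```
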

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@melllinia force-pushed the vmendelev/2512_s2s_eval branch from 392e767 to a9cbefb on February 18, 2026 16:50

4 participants