Conversation
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Add a config file for rerunning subtests that were missed in the initial run. Co-authored-by: Cursor <cursoragent@cursor.com>
The S2S model outputs special tokens like <$0.72$> and <|3.6|>, which represent energy/confidence and timing markers. These were being included in the text output, causing issues with VoiceBench scoring:
- Exact-match tests (bbh, openbookqa, mmsu) failed due to prefix mismatch
- IFEval instruction-following patterns didn't match
Added _clean_special_tokens() to strip these markers from the output. Co-authored-by: Cursor <cursoragent@cursor.com>
Strip S2S timing tokens (<$X.XX$> and <|X.XX|>) from generation field during conversion to VoiceBench format. This fixes scoring issues where these markers caused exact match failures. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements a YAML-driven NemotronVoiceChat offline backend (script-like OmegaConf resolution), wires it into serve_unified, and adds VoiceBench configs including a 10-sample sd_qa smoke test. Co-authored-by: Cursor <cursoragent@cursor.com>
Match Kevin's recipe by treating tts_ckpt_path as pretrained_model when it is a file, and only using pretrained_tts_model for exported directories. Add a 10-sample sd_qa smoke config with audio output enabled. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose per-request ASR hypothesis from offline_inference outputs (asr_hyps) in debug_info to aid debugging and evaluation. Co-authored-by: Cursor <cursoragent@cursor.com>
Add an intermediate stage that transcribes generated agent audio, computes WER/CER vs generated text, writes output_asr.jsonl + agent_audio_metrics.json, and optionally scores VoiceBench on generated text or agent ASR via config. Co-authored-by: Cursor <cursoragent@cursor.com>
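The WER computed in that stage can be sketched as plain word-level edit distance over the reference (generated text) and hypothesis (agent ASR transcript). This is a generic single-row dynamic-programming implementation, not the project's actual code:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row: distance from empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (previous row, col j-1)
        for j, hw in enumerate(h, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)
```

CER follows the same recurrence with `list(ref)` and `list(hyp)` in place of `.split()`.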
When output.jsonl and .done markers exist, skip generation and avoid setting dependencies on the generation expname so the ASR/WER and scoring stages can be rerun. Co-authored-by: Cursor <cursoragent@cursor.com>
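The skip check described above can be sketched as a simple filesystem probe; the `.done` marker filename and directory layout are assumptions based on the commit message:

```python
from pathlib import Path

def generation_is_done(results_dir: str) -> bool:
    """Generation is considered complete when both output.jsonl and a
    .done marker already exist, so downstream ASR/WER and scoring
    stages can be rerun without redoing generation."""
    d = Path(results_dir)
    return (d / "output.jsonl").exists() and (d / ".done").exists()
```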
Always run two VoiceBench scoring stages: one on output.jsonl (generated) and one on output_asr.jsonl (agent ASR). Store both results in metrics.json under greedy and greedy_asr. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep a single greedy metrics dict: generated-text scoring writes panda/gpt keys, ASR scoring writes panda_asr/gpt_asr keys, and both runs merge into one metrics.json. Co-authored-by: Cursor <cursoragent@cursor.com>
When writing ASR-scored results (panda_asr/gpt_asr), merge into the existing greedy dict instead of replacing it so metrics.json contains both generated and ASR metrics. Co-authored-by: Cursor <cursoragent@cursor.com>
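The merge described in these two commits can be sketched as a read-update-write on metrics.json; the key names (`greedy`, `panda_asr`, `gpt_asr`) come from the commit messages, the helper name is an assumption:

```python
import json
from pathlib import Path

def merge_into_greedy(metrics_path: str, new_keys: dict) -> dict:
    """Update the existing 'greedy' dict with ASR-scored keys
    (e.g. panda_asr/gpt_asr) rather than replacing it, so metrics.json
    ends up holding both generated-text and ASR metrics."""
    path = Path(metrics_path)
    metrics = json.loads(path.read_text()) if path.exists() else {}
    metrics.setdefault("greedy", {}).update(new_keys)
    path.write_text(json.dumps(metrics, indent=2))
    return metrics
```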
If output.jsonl has no audio paths (or a sample lacks audio), write output_asr.jsonl as a passthrough of generated text to avoid empty transcripts and misleading ASR scoring. Co-authored-by: Cursor <cursoragent@cursor.com>
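The passthrough fallback can be sketched as a line-by-line jsonl rewrite. The field names (`audio`, `generation`, `asr_text`) are assumptions; the point is that a sample lacking audio carries its generated text forward so the transcript is never empty:

```python
import json

def write_output_asr(in_path: str, out_path: str) -> None:
    """For each sample in output.jsonl, pass the generated text through
    as the ASR transcript when no audio path is present, avoiding empty
    transcripts and misleading ASR scoring."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            if not sample.get("audio"):
                sample["asr_text"] = sample.get("generation", "")
            fout.write(json.dumps(sample) + "\n")
```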
Add full VoiceBench audio-enabled config plus an sd_qa smoke config for ASR-scored evaluation, and document the s2s_voicechat backend usage and exact commands. Co-authored-by: Cursor <cursoragent@cursor.com>
- Add dataset preparation with auto-download support
- Add four subtests: pause, backchannel, turn_taking, interruption
- Add scoring integration scripts and format converters
- Add evaluation scripts and configuration files
- Add result analysis and comparison utilities
- Add comprehensive documentation and evaluation guide
- Remove 727 audio files (712MB) from git tracking
- Add data/ directory to .gitignore
- Data will be prepared on cluster using prepare.py
- Significantly reduces git packaging time
- Update to use evaluation/evaluate.py (not root evaluate.py)
- Map subtest names to FDB task names (pause -> pause_handling, etc.)
- Add note about ASR transcript requirement
- Add --audio_output_dir to save audio to mmkrtchyan's directory
- Fixes "Permission denied" error when saving to vmendelev's directory
- Set TMPDIR to mmkrtchyan's directory to override the hardcoded vmendelev path
- Update config to use custom inference yaml
- unified_server.py already supports the AUDIO_SAVE_DIR environment variable
- Set it to mmkrtchyan's directory instead of vmendelev's hardcoded path
- This fixes "Permission denied" errors when saving audio

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
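Since unified_server.py reads AUDIO_SAVE_DIR (per the commit above), redirecting audio output only takes an environment variable. The directory below is a placeholder, not the actual cluster path:

```shell
# Point audio output at a directory the current user can write to,
# avoiding the "Permission denied" error from the hardcoded path.
export AUDIO_SAVE_DIR="$HOME/s2s_audio_out"
mkdir -p "$AUDIO_SAVE_DIR"
```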