Vmendelev/2512 s2s eval #1246
Draft: wprazuch wants to merge 79 commits into main from vmendelev/2512_s2s_eval
Commits
99ef50a
Added audio requests to vLLM models
karpnv 05dba7d
Introduced vLLM_multimodal model to save multimodal outputs
vmendelev 8621313
generation.py to respect separate server type for the client
vmendelev 32daf07
Unified server to work with NeMo models not supported by vLLM
vmendelev d4c7ece
s2s incremental backend and session based backend
vmendelev 9c66bf6
s2s_demo test set and evaluation script
vmendelev da6c481
No special handling of data_dir in vllm constructor
vmendelev bb67e07
Fixed VLLM_multimodal
vmendelev 991f215
Metrics calculation
vmendelev 4de7342
LLM judge
vmendelev 5635a21
Eval starter script, config, comparator, documentation
vmendelev 29a0b87
Example cluster config
vmendelev f984ed1
Voicebench
vmendelev 77ba6c2
Added session related fields to incremental config
vmendelev 90ce0a6
Lock for session backend in unified server
vmendelev ef5287c
Race in Mamba triton kernels fix in session backend
vmendelev 13b0b01
Removed the lock from unified server
vmendelev f93e262
Session backend with unified debug output and fixed audio output
vmendelev c081ff3
Voicebench sd_qa_usa scoring fix
vmendelev 56964d4
Voicebench related scripts
vmendelev ec99c5c
Documentation and minor changes in s2s_demo test set scripts
vmendelev 5630986
Support for the standard OpenAI Chat Completion API audio input
vmendelev c0b7f84
Generation detection parameter into serve_unified
vmendelev 0a82387
Fixed session mechanism to deprecate session_id
vmendelev 5c6c727
A switch to get rid of debug info
vmendelev a32779b
Return only 1 turn text in multi turn requests
vmendelev b6512f2
Documentation on how to run the server only
vmendelev 177a93c
Fixed documentation and return asr output with the debug_info
vmendelev 2c17ae3
Documentation updated
vmendelev a01e794
Voicebench config to run external models
vmendelev 2e98154
Fixed input format to comply with OpenAI API
vmendelev 4a4e88a
Fixed voicebench evaluation runner
vmendelev 0bd5f3b
Demo configs
vmendelev a6eee52
Add S2S offline backend using NemotronVoiceChat model
vmendelev 3635e63
Fix dtype: use float32 for s2s offline backend (TTS model requires fp32)
vmendelev 3c830eb
Fix: also strip input_audio from output (not just audio_url)
vmendelev 472dfb1
Fix BBH scoring: infer task type from prompt to generate id field
vmendelev 7676d2a
Fix BBH data prep: include 'id' field from HuggingFace dataset
vmendelev 24b9aa5
Add VoiceBench config with ignore_system_prompt flag
vmendelev f64413e
Add VoiceBench remaining subtests config
vmendelev a469936
Fix: Strip special timing tokens from S2S text output
vmendelev 7504686
Add special token stripping to VoiceBench format conversion
vmendelev a526a11
Add rescore configs for bbh, ifeval, sd_qa
vmendelev 9b9150c
Add s2s_voicechat backend for NemotronVoiceChat offline inference
vmendelev 8da28bc
Fix s2s_voicechat TTS checkpoint override
vmendelev e2884b8
Add user ASR to s2s_voicechat debug info
vmendelev 682e629
Add agent-audio ASR scoring stage
vmendelev b923fbf
Avoid run_after when generation already done
vmendelev b901504
Run VoiceBench scoring on generated and ASR text
vmendelev ea7f7b4
Write ASR VoiceBench metrics as *_asr keys
vmendelev 16474b9
Deep-merge VoiceBench metrics into greedy dict
vmendelev 75fcd76
Make agent-audio ASR stage robust when audio missing
vmendelev d30a4fb
Add s2s_voicechat sound eval configs and docs
vmendelev 4be7a37
Add Full-Duplex-Bench evaluation integration
melllinia 4507b44
Update cluster paths for fullduplexbench
melllinia bed02f6
Remove audio data from git, keep only on cluster
melllinia 7ce8c9f
Fix FDB scoring to use correct evaluation script path
melllinia ce0b41a
Fix audio output directory permission issue
melllinia 4f6a994
Add TMPDIR env var to fix audio save permission issue
melllinia f992786
Fix audio save path using AUDIO_SAVE_DIR env var
melllinia daf1715
cleanup
melllinia 37a3f97
cleanup
melllinia 58e25fd
Adding FDB v1.5 and restructuring dirs
melllinia 1975ae1
splitting pause subtask
melllinia be614d6
add stereo audio option
melllinia 0ecc58a
add stereo audio option
melllinia a9cbefb
Trimming to per-sample input duration to remove batch-padding
melllinia 023cdc6
Script to parallelize execution with s2s_voicechat backend
vmendelev ba66eb3
adding missing scores to FDB v1.5
melllinia a4b6435
adding system prompt for MCQ
melllinia 2b151ed
adding incremental backend
melllinia ce0e4d0
hf asr leaderboard and feb26 configs
melllinia b645376
adding asr text
melllinia cbb6777
adding corpus level wer calculation
melllinia 83a65c4
adding hf normalization
melllinia 0002426
trimming back the audio to match the input length
melllinia b1fe941
trimming back the audio to match the input length
melllinia ffce093
fixing incremental backend postprocessing
melllinia eb78128
add S/I/D breakdown to ASR evaluation metrics and fix earnings22 data…
melllinia
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| executor: slurm | ||
| | ||
| ssh_tunnel: | ||
| host: draco-oci-login-01.draco-oci-iad.nvidia.com | ||
| # ------------------------------- Fill this up! ------------------------------- | ||
| user: mmkrtchyan | ||
| job_dir: /lustre/fsw/portfolios/llmservice/users/mmkrtchyan/workspace/code/nemo-run | ||
| identity: "" | ||
| # ----------------------------------------------------------------------------- | ||
| | ||
| # if you're running directly from cluster, you only need to define job_dir and shouldn't use ssh_tunnel | ||
| # job_dir: <some location on slurm cluster to keep job metadata, uploaded code and generated sbatch files> | ||
| | ||
| account: llmservice_nemo_speechlm | ||
| # account: llmservice_nemo_reasoning | ||
| partition: batch_block1,batch_block3,batch_block4 | ||
| cpu_partition: cpu | ||
| job_name_prefix: "" | ||
| | ||
| containers: | ||
| trtllm: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-trtllm-0.7.0.sqsh | ||
| vllm: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-vllm-0.7.0.sqsh | ||
| sglang: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-sglang-0.7.0.sqsh | ||
| nemo: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-nemo-0.7.0.sqsh | ||
| nemo-rl: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-nemo-rl-0.7.0.sqsh | ||
| megatron: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-megatron-0.7.0.sqsh | ||
| sandbox: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-sandbox-0.7.1.sqsh | ||
| nemo-skills: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-0.7.0.sqsh | ||
| verl: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-verl-0.7.0.sqsh | ||
| | ||
| mounts: | ||
| - /lustre:/lustre | ||
| - /lustre/fs12/portfolios/llmservice/projects/llmservice_nemo_speechlm/data/nemo_skills/dataset:/dataset | ||
| | ||
| required_env_vars: | ||
| - NV_INFERENCE_KEY | ||
| | ||
| env_vars: | ||
| # ------------------------------- Fill this up! ------------------------------- | ||
| - HF_HOME=/lustre/fsw/portfolios/llmservice/users/mmkrtchyan/.cache/huggingface | ||
| - AUDIO_SAVE_DIR=/lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/fullduplexbench_audio_outputs | ||
| - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | ||
| # ----------------------------------------------------------------------------- | ||
| | ||
| timeouts: | ||
| batch_block1,batch_block3,batch_block4: 04:00:00 | ||
| interactive: 04:00:00 | ||
| interactive_singlenode: 04:00:00 | ||
| cpu: 04:00:00 | ||
| | ||
| mail_type: FAIL | ||
| mail_user: # <your email goes here> |
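
The `required_env_vars` and `env_vars` sections of the cluster config above lend themselves to a quick local check before submitting a job. The following is a minimal sketch, assuming that `required_env_vars` lists names that must already be set in the submitting environment and that `env_vars` entries are `KEY=VALUE` strings. The helper names `parse_env_vars` and `missing_required` are hypothetical and are not part of this PR or of nemo-skills.

```python
import os


def parse_env_vars(entries):
    """Split KEY=VALUE strings (as in the env_vars list) into a dict."""
    parsed = {}
    for entry in entries:
        # Split at the first "=" only, so values may themselves contain "=".
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {entry!r}")
        parsed[key] = value
    return parsed


def missing_required(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    print(parse_env_vars(["PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True"]))
    print(missing_required(["NV_INFERENCE_KEY"]))
```

Passing `environ` explicitly keeps the check testable without touching the real process environment.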
59 changes: 59 additions & 0 deletions
...ills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_02mar_config.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| # HF Open ASR Leaderboard evaluation - S2S incremental V2 backend, TEXT output only. | ||
| # Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli, | ||
| # tedlium, gigaspeech, spgispeech, earnings22, ami. | ||
| # Checkpoint: Mar 2 2026 (Feb 26 STT + Mar 3 TTS, Megan speaker, force_turn_taking OFF) | ||
| # | ||
| # Run: | ||
| # python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \ | ||
| # --config nemo_skills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_02mar_config.yaml | ||
| | ||
| # Cluster settings | ||
| cluster: s2s_eval_oci_iad | ||
| partition: batch_block1,batch_block3,batch_block4 | ||
| cpu_partition: cpu | ||
| | ||
| model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_02_March_exp_17_afg_long_FT_Megan_msr_34k_steps-stt-AS9.1_11002_new_branch_load_fixed | ||
| | ||
| server_type: vllm | ||
| server_gpus: 1 | ||
| num_chunks: 104 | ||
| | ||
| server_entrypoint: "-m nemo_skills.inference.server.serve_unified" | ||
| server_args: >- | ||
| --backend s2s_incremental_v2 | ||
| --no_decode_audio | ||
| --use_asr_as_response | ||
| --ignore_system_prompt | ||
| --speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav | ||
| --num_frames_per_inference 3 | ||
| --engine_type vllm_llm_vllm_eartts | ||
| --use_perception_cache | ||
| --use_perception_cudagraph | ||
| --buffer_size_frames 21 | ||
| --codec_token_history_size 60 | ||
| --repetition_penalty 1.0 | ||
| --matmul_precision medium | ||
| --vllm_gpu_memory_utilization 0.35 | ||
| --vllm_max_model_len 8192 | ||
| --output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_artifacts_full_no_system_prompt_3frames_per_inference_21buffer | ||
| --batch_size 1 | ||
| --code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo | ||
| --pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer sacrebleu" | ||
| | ||
| server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh | ||
| server_server_type: vllm_multimodal | ||
| | ||
| # Benchmark name used by nemo-skills eval pipeline | ||
| benchmark: asr-leaderboard | ||
| | ||
| # Paths -- data_dir must match the container mount in cluster config | ||
| # (cluster mounts /lustre/fs12/.../nemo_skills/dataset -> /dataset) | ||
| data_dir: /dataset | ||
| output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v5_02mar_artifacts_full_no_system_prompt_3frames_per_inference_21buffer | ||
| | ||
| installation_command: "pip install jiwer whisper-normalizer sacrebleu" | ||
| | ||
| expname: asr_leaderboard_s2s_incremental_v5_02mar | ||
| generation_only: false | ||
| scoring_only: false | ||
| dry_run: false |
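
The "data_dir must match the container mount" comment in the config above can be illustrated with a small path check. This is a minimal sketch, assuming mounts are written as `source:target` strings like those in the cluster config; the helper names and the `/host/datasets` source path are hypothetical, and nothing here is taken from the nemo-skills code base.

```python
from pathlib import PurePosixPath


def container_targets(mounts):
    """Return the container-side path of each 'source:target' mount entry."""
    targets = []
    for mount in mounts:
        source, sep, target = mount.partition(":")
        if not sep or not source or not target:
            raise ValueError(f"expected source:target, got {mount!r}")
        targets.append(PurePosixPath(target))
    return targets


def is_visible_in_container(path, mounts):
    """True if `path` equals, or lies under, some mount's container-side target."""
    path = PurePosixPath(path)
    return any(path == t or t in path.parents for t in container_targets(mounts))


if __name__ == "__main__":
    # Hypothetical mounts, shaped like the entries in the cluster config.
    mounts = ["/lustre:/lustre", "/host/datasets:/dataset"]
    print(is_visible_in_container("/dataset", mounts))
    print(is_visible_in_container("/data", mounts))
```

Comparing path components rather than string prefixes avoids treating `/datasets` as if it were inside `/dataset`.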
65 changes: 65 additions & 0 deletions
nemo_skills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_config.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| # HF Open ASR Leaderboard evaluation - S2S incremental V2 backend, TEXT output only. | ||
| # Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli, | ||
| # tedlium, gigaspeech, spgispeech, earnings22, ami. | ||
| # Checkpoint: Feb 26 2026 (legally friendly personaplex dataset) | ||
| # | ||
| # Prepare data first: | ||
| # ns prepare_data asr-leaderboard | ||
| # | ||
| # Run: | ||
| # python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \ | ||
| # --config nemo_skills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_config.yaml | ||
| | ||
| # Cluster settings | ||
| cluster: s2s_eval_oci_iad | ||
| partition: batch_block1,batch_block3,batch_block4 | ||
| cpu_partition: cpu | ||
| | ||
| model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_26_Feb_exp_13_afg_14k_steps-stt-AS9.1_11002_new_branch_load_fixed | ||
| | ||
| server_type: vllm | ||
| server_gpus: 1 | ||
| num_chunks: 1 | ||
| | ||
| server_entrypoint: "-m nemo_skills.inference.server.serve_unified" | ||
| server_args: >- | ||
| --backend s2s_incremental_v2 | ||
| --no_decode_audio | ||
| --use_asr_as_response | ||
| --speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav | ||
| --num_frames_per_inference 3 | ||
| --engine_type vllm_llm_vllm_eartts | ||
| --use_perception_cache | ||
| --use_perception_cudagraph | ||
| --buffer_size_frames 20 | ||
| --codec_token_history_size 60 | ||
| --repetition_penalty 1.0 | ||
| --force_turn_taking | ||
| --force_turn_taking_threshold 40 | ||
| --force_turn_taking_pad_window 25 | ||
| --matmul_precision medium | ||
| --vllm_gpu_memory_utilization 0.35 | ||
| --vllm_max_model_len 8192 | ||
| --system_prompt "You are a helpful assistant. /no_think" | ||
| --output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_full_artifacts | ||
| --batch_size 2 | ||
| --code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo | ||
| --pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer" | ||
| | ||
| server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh | ||
| server_server_type: vllm_multimodal | ||
| | ||
| # Benchmark name used by nemo-skills eval pipeline | ||
| benchmark: asr-leaderboard | ||
| | ||
| # Paths -- data_dir must match the container mount in cluster config | ||
| # (cluster mounts /lustre/fs12/.../nemo_skills/dataset -> /dataset) | ||
| data_dir: /dataset | ||
| output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_full | ||
| | ||
| installation_command: "pip install jiwer whisper-normalizer sacrebleu" | ||
| expname: asr_leaderboard_s2s_incremental_v2_02mar_full | ||
| | ||
| generation_only: false | ||
| scoring_only: false | ||
| dry_run: false |
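
The `server_args: >-` block above is a YAML folded scalar: equally indented lines are joined with single spaces and the trailing newline is stripped, so the server receives one long string. The sketch below shows how such a string could be tokenized while keeping quoted values such as `--system_prompt "..."` intact. It assumes shell-style splitting via `shlex`; whether `serve_unified` or the eval pipeline tokenizes the string this way is not stated in the PR, and `flag_value` is a hypothetical helper.

```python
import shlex

# A shortened stand-in for the folded server_args string (newlines already
# folded into spaces). Only flags that appear in the config are used.
server_args = (
    "--backend s2s_incremental_v2 --no_decode_audio "
    '--system_prompt "You are a helpful assistant. /no_think" '
    "--batch_size 2"
)

# shlex.split honors quotes, so the system prompt stays a single token.
argv = shlex.split(server_args)


def flag_value(tokens, flag):
    """Return the token following `flag`, or None if the flag is absent or last."""
    if flag in tokens:
        index = tokens.index(flag)
        if index + 1 < len(tokens):
            return tokens[index + 1]
    return None


if __name__ == "__main__":
    print(argv)
    print(flag_value(argv, "--system_prompt"))
```

A plain `str.split()` would break the prompt into five tokens, which is the main reason to prefer quote-aware splitting for flags like `--system_prompt` and `--pip_install`.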
60 changes: 60 additions & 0 deletions
nemo_skills/dataset/asr-leaderboard/scripts/hf_baseline_02mar_config.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| # HF Open ASR Leaderboard evaluation - (i) BASELINE setup | ||
| # S2S incremental V2 backend, TEXT output only. | ||
| # Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli, | ||
| # tedlium, gigaspeech, spgispeech, earnings22, ami. | ||
| # No inference boosting, no force turn taking. | ||
| # Checkpoint: Mar 2 2026 (Feb 26 STT + Mar 3 TTS, Megan speaker) | ||
| # | ||
| # Run: | ||
| # python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \ | ||
| # --config nemo_skills/dataset/asr-leaderboard/scripts/hf_baseline_02mar_config.yaml | ||
| | ||
| # Cluster settings | ||
| cluster: s2s_eval_oci_iad | ||
| partition: batch_block1,batch_block3,batch_block4 | ||
| cpu_partition: cpu | ||
| | ||
| model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_02_March_exp_17_afg_long_FT_Megan_msr_34k_steps-stt-AS9.1_11002_new_branch_load_fixed | ||
| | ||
| server_type: vllm | ||
| server_gpus: 1 | ||
| num_chunks: 104 | ||
| | ||
| server_entrypoint: "-m nemo_skills.inference.server.serve_unified" | ||
| server_args: >- | ||
| --backend s2s_incremental_v2 | ||
| --no_decode_audio | ||
| --use_asr_as_response | ||
| --ignore_system_prompt | ||
| --speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav | ||
| --num_frames_per_inference 3 | ||
| --engine_type vllm_llm_vllm_eartts | ||
| --use_perception_cache | ||
| --use_perception_cudagraph | ||
| --buffer_size_frames 21 | ||
| --codec_token_history_size 60 | ||
| --repetition_penalty 1.0 | ||
| --matmul_precision medium | ||
| --vllm_gpu_memory_utilization 0.35 | ||
| --vllm_max_model_len 8192 | ||
| --output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02_FIXED/baseline/hf_artifacts | ||
| --batch_size 1 | ||
| --code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo | ||
| --pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer sacrebleu" | ||
| | ||
| server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh | ||
| server_server_type: vllm_multimodal | ||
| | ||
| # Benchmark name used by nemo-skills eval pipeline | ||
| benchmark: asr-leaderboard | ||
| | ||
| # Paths | ||
| data_dir: /dataset | ||
| output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02_FIXED/baseline/hf | ||
| | ||
| installation_command: "pip install jiwer whisper-normalizer sacrebleu" | ||
| | ||
| expname: hf_baseline_02mar | ||
| generation_only: false | ||
| scoring_only: false | ||
| dry_run: false |
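
These configs evaluate WER, and the commit list mentions a corpus-level WER calculation with an S/I/D breakdown. The sketch below illustrates what those terms mean using a plain dynamic-programming word alignment. It is not the PR's implementation: the configs install `jiwer` and `whisper-normalizer`, which are deliberately not used here so the example stays dependency-free, and text normalization is skipped. The function names `edit_counts` and `corpus_wer` are hypothetical.

```python
def edit_counts(ref_words, hyp_words):
    """Return (substitutions, insertions, deletions) for a minimal word alignment."""
    n, m = len(ref_words), len(hyp_words)
    # cost[i][j] = (total_errors, S, I, D) for ref_words[:i] vs hyp_words[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)  # empty hypothesis: everything is deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)  # empty reference: everything is inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, ins, d = cost[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diagonal = (e, s, ins, d)  # match, no added cost
            else:
                diagonal = (e + 1, s + 1, ins, d)  # substitution
            e, s, ins, d = cost[i - 1][j]
            deletion = (e + 1, s, ins, d + 1)
            e, s, ins, d = cost[i][j - 1]
            insertion = (e + 1, s, ins + 1, d)
            # Ties are broken in favor of the diagonal move.
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    _, s, ins, d = cost[n][m]
    return s, ins, d


def corpus_wer(pairs):
    """Corpus-level WER in percent: total errors over total reference words."""
    totals = {"S": 0, "I": 0, "D": 0, "N": 0}
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        s, ins, d = edit_counts(ref_words, hypothesis.split())
        totals["S"] += s
        totals["I"] += ins
        totals["D"] += d
        totals["N"] += len(ref_words)
    errors = totals["S"] + totals["I"] + totals["D"]
    wer = 100.0 * errors / totals["N"] if totals["N"] else 0.0
    return {"wer": wer, **totals}


if __name__ == "__main__":
    print(corpus_wer([("the cat sat", "the cat sat"),
                      ("hello world", "hello there world")]))
```

Corpus-level WER pools errors and reference words across all utterances before dividing, so long utterances weigh more than they would in an average of per-utterance WERs. The total error count is always minimal, but the split between S, I, and D can depend on tie-breaking when several alignments have the same cost.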
please do not commit cluster configs to this repo