Draft
79 commits
99ef50a
Added audio requests to vLLM models
karpnv Nov 13, 2025
05dba7d
Introduced vLLM_multimodal model to save multimodal outputs
vmendelev Dec 18, 2025
8621313
generation.py to respect separate server type for the client
vmendelev Dec 18, 2025
32daf07
Unified server to work with NeMo models not supported by vLLM
vmendelev Dec 20, 2025
d4c7ece
s2s incremental backend and session based backend
vmendelev Dec 20, 2025
9c66bf6
s2s_demo test set and evaluation script
vmendelev Dec 19, 2025
da6c481
No special handling of data_dir in vllm constructor
vmendelev Dec 20, 2025
bb67e07
Fixed VLLM_multimodal
vmendelev Dec 21, 2025
991f215
Metrics calculation
vmendelev Dec 22, 2025
4de7342
LLM judge
vmendelev Dec 22, 2025
5635a21
Eval starter script, config, comparator, documentation
vmendelev Dec 23, 2025
29a0b87
Example cluster config
vmendelev Dec 23, 2025
f984ed1
Voicebench
vmendelev Dec 29, 2025
77ba6c2
Added session related fields to incremental config
vmendelev Dec 29, 2025
90ce0a6
Lock for session backend in unified server
vmendelev Dec 29, 2025
ef5287c
Fix race in Mamba Triton kernels in session backend
vmendelev Dec 29, 2025
13b0b01
Removed the lock from unified server
vmendelev Dec 29, 2025
f93e262
Session backend with unified debug output and fixed audio output
vmendelev Dec 30, 2025
c081ff3
Voicebench sd_qa_usa scoring fix
vmendelev Dec 30, 2025
56964d4
Voicebench related scripts
vmendelev Dec 31, 2025
ec99c5c
Documentation and minor changes in s2s_demo test set scripts
vmendelev Dec 31, 2025
5630986
Support for the standard OpenAI Chat Completion API audio input
vmendelev Jan 12, 2026
c0b7f84
Generation detection parameter into serve_unified
vmendelev Jan 12, 2026
0a82387
Fixed session mechanism to deprecate session_id
vmendelev Jan 12, 2026
5c6c727
A switch to get rid of debug info
vmendelev Jan 12, 2026
a32779b
Return only 1 turn text in multi turn requests
vmendelev Jan 12, 2026
b6512f2
Documentation on how to run the server only
vmendelev Jan 12, 2026
177a93c
Fixed documentation and return asr output with the debug_info
vmendelev Jan 13, 2026
2c17ae3
Documentation updated
vmendelev Jan 13, 2026
a01e794
Voicebench config to run external models
vmendelev Jan 15, 2026
2e98154
Fixed input format to comply with OpenAI API
vmendelev Jan 15, 2026
4a4e88a
Fixed voicebench evaluation runner
vmendelev Jan 16, 2026
0bd5f3b
Demo configs
vmendelev Jan 16, 2026
a6eee52
Add S2S offline backend using NemotronVoiceChat model
vmendelev Feb 3, 2026
3635e63
Fix dtype: use float32 for s2s offline backend (TTS model requires fp32)
vmendelev Feb 3, 2026
3c830eb
Fix: also strip input_audio from output (not just audio_url)
vmendelev Feb 3, 2026
472dfb1
Fix BBH scoring: infer task type from prompt to generate id field
vmendelev Feb 3, 2026
7676d2a
Fix BBH data prep: include 'id' field from HuggingFace dataset
vmendelev Feb 3, 2026
24b9aa5
Add VoiceBench config with ignore_system_prompt flag
vmendelev Feb 3, 2026
f64413e
Add VoiceBench remaining subtests config
vmendelev Feb 3, 2026
a469936
Fix: Strip special timing tokens from S2S text output
vmendelev Feb 3, 2026
7504686
Add special token stripping to VoiceBench format conversion
vmendelev Feb 3, 2026
a526a11
Add rescore configs for bbh, ifeval, sd_qa
vmendelev Feb 3, 2026
9b9150c
Add s2s_voicechat backend for NemotronVoiceChat offline inference
vmendelev Feb 4, 2026
8da28bc
Fix s2s_voicechat TTS checkpoint override
vmendelev Feb 4, 2026
e2884b8
Add user ASR to s2s_voicechat debug info
vmendelev Feb 5, 2026
682e629
Add agent-audio ASR scoring stage
vmendelev Feb 5, 2026
b923fbf
Avoid run_after when generation already done
vmendelev Feb 5, 2026
b901504
Run VoiceBench scoring on generated and ASR text
vmendelev Feb 5, 2026
ea7f7b4
Write ASR VoiceBench metrics as *_asr keys
vmendelev Feb 5, 2026
16474b9
Deep-merge VoiceBench metrics into greedy dict
vmendelev Feb 5, 2026
75fcd76
Make agent-audio ASR stage robust when audio missing
vmendelev Feb 5, 2026
d30a4fb
Add s2s_voicechat sound eval configs and docs
vmendelev Feb 6, 2026
4be7a37
Add Full-Duplex-Bench evaluation integration
melllinia Feb 5, 2026
4507b44
Update cluster paths for fullduplexbench
melllinia Feb 5, 2026
bed02f6
Remove audio data from git, keep only on cluster
melllinia Feb 5, 2026
7ce8c9f
Fix FDB scoring to use correct evaluation script path
melllinia Feb 5, 2026
ce0b41a
Fix audio output directory permission issue
melllinia Feb 5, 2026
4f6a994
Add TMPDIR env var to fix audio save permission issue
melllinia Feb 5, 2026
f992786
Fix audio save path using AUDIO_SAVE_DIR env var
melllinia Feb 5, 2026
daf1715
cleanup
melllinia Feb 9, 2026
37a3f97
cleanup
melllinia Feb 9, 2026
58e25fd
Adding FDB v1.5 and restructuring dirs
melllinia Feb 10, 2026
1975ae1
splitting pause subtask
melllinia Feb 11, 2026
be614d6
add stereo audio option
melllinia Feb 12, 2026
0ecc58a
add stereo audio option
melllinia Feb 12, 2026
a9cbefb
Trimming to per-sample input duration to remove batch-padding
melllinia Feb 16, 2026
023cdc6
Script to parallelize execution with s2s_voicechat backend
vmendelev Feb 21, 2026
ba66eb3
adding missing scores to FDB v1.5
melllinia Feb 23, 2026
a4b6435
adding system prompt for MCQ
melllinia Feb 25, 2026
2b151ed
adding incremental backend
melllinia Feb 26, 2026
ce0e4d0
hf asr leaderboard and feb26 configs
melllinia Mar 2, 2026
b645376
adding asr text
melllinia Mar 2, 2026
cbb6777
adding corpus level wer calculation
melllinia Mar 3, 2026
83a65c4
adding hf normalization
melllinia Mar 4, 2026
0002426
trimming back the audio to match the input length
melllinia Mar 5, 2026
b1fe941
trimming back the audio to match the input length
melllinia Mar 5, 2026
ffce093
fixing incremental backend postprocessing
melllinia Mar 7, 2026
eb78128
add S/I/D breakdown to ASR evaluation metrics and fix earnings22 data…
melllinia Mar 9, 2026
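The last two commits add corpus-level WER and an S/I/D breakdown to the ASR metrics. A self-contained sketch of that accounting is below; the PR itself relies on `jiwer` (pip-installed by the eval configs), so the helper names here are illustrative only.

```python
# Sketch: corpus-level WER with substitution/insertion/deletion breakdown.
# Levenshtein alignment per utterance; counts are summed over the corpus.

def sid_counts(ref_words, hyp_words):
    """Return (substitutions, insertions, deletions) for one alignment."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = (cost, S, I, D) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):  # only deletions down the first column
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)
    for j in range(1, H + 1):  # only insertions along the first row
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                cand = [(dp[i - 1][j - 1][0], dp[i - 1][j - 1], "ok")]
            else:
                cand = [(dp[i - 1][j - 1][0] + 1, dp[i - 1][j - 1], "S")]
            cand.append((dp[i][j - 1][0] + 1, dp[i][j - 1], "I"))
            cand.append((dp[i - 1][j][0] + 1, dp[i - 1][j], "D"))
            cost, prev, op = min(cand, key=lambda t: t[0])
            s, ins, d = prev[1], prev[2], prev[3]
            if op == "S":
                s += 1
            elif op == "I":
                ins += 1
            elif op == "D":
                d += 1
            dp[i][j] = (cost, s, ins, d)
    return dp[R][H][1:]

def corpus_wer(refs, hyps):
    """Corpus-level WER: error counts pooled over all utterances."""
    S = I = D = N = 0
    for ref, hyp in zip(refs, hyps):
        s, i, d = sid_counts(ref.split(), hyp.split())
        S, I, D, N = S + s, I + i, D + d, N + len(ref.split())
    return (S + I + D) / N, S, I, D

wer, S, I, D = corpus_wer(
    ["the cat sat on the mat", "hello world"],
    ["the cat sat on mat", "hello there world"],
)
print(f"WER={wer:.3f} S={S} I={I} D={D}")  # WER=0.250 S=0 I=1 D=1
```

Pooling counts before dividing (rather than averaging per-utterance WERs) is what makes the metric corpus-level, matching how the Open ASR Leaderboard reports WER.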
52 changes: 52 additions & 0 deletions cluster_configs/s2s_eval_oci_iad.yaml
Collaborator:
please do not commit cluster configs to this repo

@@ -0,0 +1,52 @@
executor: slurm

ssh_tunnel:
host: draco-oci-login-01.draco-oci-iad.nvidia.com
# ------------------------------- Fill this up! -------------------------------
user: mmkrtchyan
job_dir: /lustre/fsw/portfolios/llmservice/users/mmkrtchyan/workspace/code/nemo-run
identity: ""
# -----------------------------------------------------------------------------

# if you're running directly from cluster, you only need to define job_dir and shouldn't use ssh_tunnel
# job_dir: <some location on slurm cluster to keep job metadata, uploaded code and generated sbatch files>

account: llmservice_nemo_speechlm
# account: llmservice_nemo_reasoning
partition: batch_block1,batch_block3,batch_block4
cpu_partition: cpu
job_name_prefix: ""

containers:
trtllm: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-trtllm-0.7.0.sqsh
vllm: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-vllm-0.7.0.sqsh
sglang: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-sglang-0.7.0.sqsh
nemo: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-nemo-0.7.0.sqsh
nemo-rl: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-nemo-rl-0.7.0.sqsh
megatron: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-megatron-0.7.0.sqsh
sandbox: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-sandbox-0.7.1.sqsh
nemo-skills: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-0.7.0.sqsh
verl: /lustre/fsw/portfolios/llmservice/users/igitman/llm/images/nemo-skills-verl-0.7.0.sqsh

mounts:
- /lustre:/lustre
- /lustre/fs12/portfolios/llmservice/projects/llmservice_nemo_speechlm/data/nemo_skills/dataset:/dataset

required_env_vars:
- NV_INFERENCE_KEY

env_vars:
# ------------------------------- Fill this up! -------------------------------
- HF_HOME=/lustre/fsw/portfolios/llmservice/users/mmkrtchyan/.cache/huggingface
- AUDIO_SAVE_DIR=/lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/fullduplexbench_audio_outputs
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# -----------------------------------------------------------------------------

timeouts:
batch_block1,batch_block3,batch_block4: 04:00:00
interactive: 04:00:00
interactive_singlenode: 04:00:00
cpu: 04:00:00

mail_type: FAIL
mail_user: # <your email goes here>
65 changes: 29 additions & 36 deletions nemo_skills/dataset/asr-leaderboard/prepare.py
@@ -14,12 +14,17 @@

"""Prepare ASR Leaderboard datasets for evaluation.

Downloads and formats datasets from the HuggingFace Open ASR Leaderboard.
Downloads and formats datasets from the official HF Open ASR Leaderboard ESB
test-only sorted dataset (hf-audio/esb-datasets-test-only-sorted). This is the
same data source used by the official leaderboard and the offline NeMo eval
pipeline, ensuring apples-to-apples WER comparison.

Audio paths in JSONL: /dataset/asr-leaderboard/data/{dataset}/{sample_id}.flac

Usage:
ns prepare_data asr-leaderboard
ns prepare_data asr-leaderboard --datasets librispeech_clean ami
ns prepare_data asr-leaderboard --datasets earnings22
ns prepare_data asr-leaderboard --no-audio # skip saving audio files
"""

@@ -34,29 +39,22 @@
SYSTEM_MESSAGE = "You are a helpful assistant. /no_think"
MIN_AUDIO_DURATION = 0.1 # Skip audio shorter than this (causes mel spectrogram errors)

# (hf_dataset, hf_config, hf_split, streaming)
# (hf_repo, config, split, text_field, id_field)
DATASET_CONFIGS = {
"librispeech_clean": ("librispeech_asr", "clean", "test", False),
"librispeech_other": ("librispeech_asr", "other", "test", False),
"voxpopuli": ("facebook/voxpopuli", "en", "test", False),
"tedlium": ("LIUM/tedlium", "release3", "test", False),
"gigaspeech": ("speechcolab/gigaspeech", "xs", "test", False),
"spgispeech": ("kensho/spgispeech", "test", "test", True), # streaming to avoid timeout due to large metadata
"earnings22": ("distil-whisper/earnings22", "chunked", "test", False),
"ami": ("edinburghcstr/ami", "ihm", "test", False),
"librispeech_clean": ("hf-audio/esb-datasets-test-only-sorted", "librispeech", "test.clean", "text", "id"),
"librispeech_other": ("hf-audio/esb-datasets-test-only-sorted", "librispeech", "test.other", "text", "id"),
"voxpopuli": ("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", "test", "text", "id"),
"tedlium": ("hf-audio/esb-datasets-test-only-sorted", "tedlium", "test", "text", "id"),
"gigaspeech": ("hf-audio/esb-datasets-test-only-sorted", "gigaspeech", "test", "text", "id"),
"spgispeech": ("hf-audio/esb-datasets-test-only-sorted", "spgispeech", "test", "text", "id"),
"earnings22": ("hf-audio/esb-datasets-test-only-sorted", "earnings22", "test", "text", "id"),
"ami": ("hf-audio/esb-datasets-test-only-sorted", "ami", "test", "text", "id"),
}


def save_audio_and_format_entry(entry, dataset_name, audio_dir, sample_idx, with_audio=True):
def save_audio_and_format_entry(entry, dataset_name, audio_dir, sample_idx, text_field="text", id_field="id", with_audio=True):
"""Format a dataset entry and optionally save audio file."""
# Different datasets use different field names for transcription
text = (
entry.get("text", "") # ami, LS, gigaspeech, tedlium
or entry.get("normalized_text", "") # voxpopuli
or entry.get("transcript", "") # spgispeech
or entry.get("transcription", "") # earnings22
)
text = text.strip() if text else ""
text = entry.get(text_field, "").strip()

system_message = {"role": "system", "content": SYSTEM_MESSAGE}
user_message = {"role": "user", "content": "Transcribe the following audio."}
@@ -70,8 +68,8 @@ def save_audio_and_format_entry(entry, dataset_name, audio_dir, sample_idx, with
if duration < MIN_AUDIO_DURATION:
return None

sample_id = entry.get("id", str(sample_idx))
audio_filename = f"{sample_id}.flac"
sample_id = str(entry.get(id_field, sample_idx)).replace("/", "_")
audio_filename = f"{Path(sample_id).stem}.flac"

if with_audio:
sf.write(str(audio_dir / audio_filename), audio_array, sampling_rate)
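The new `sample_id` handling above guards against ESB ids that contain path separators (which would create subdirectories) or an existing file extension. A minimal sketch of that logic, with the hypothetical helper name `make_audio_filename`:

```python
# Sketch of the sample_id -> audio filename sanitization from the diff above:
# flatten "/" to "_" so ids cannot escape audio_dir, then strip any extension
# via Path.stem before re-appending ".flac".
from pathlib import Path

def make_audio_filename(raw_id, fallback_idx):
    sample_id = str(raw_id if raw_id is not None else fallback_idx).replace("/", "_")
    return f"{Path(sample_id).stem}.flac"

print(make_audio_filename("dev/1272-128104-0000.flac", 0))  # dev_1272-128104-0000.flac
```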
@@ -82,14 +80,14 @@ def save_audio_and_format_entry(entry, dataset_name, audio_dir, sample_idx, with
}

formatted_entry = {
"task_type": "ASR_LEADERBOARD",
"task_type": "ASR",
"expected_answer": text,
"messages": [system_message, user_message],
"subset_for_metrics": dataset_name,
}

if "id" in entry:
formatted_entry["id"] = entry["id"]
if id_field in entry:
formatted_entry["id"] = entry[id_field]
if "speaker_id" in entry:
formatted_entry["speaker_id"] = entry["speaker_id"]

@@ -101,14 +99,11 @@ def prepare_dataset(dataset_name, output_dir, with_audio=True):
if dataset_name not in DATASET_CONFIGS:
raise ValueError(f"Unknown dataset: {dataset_name}. Available: {list(DATASET_CONFIGS.keys())}")

hf_dataset, hf_config, hf_split, streaming = DATASET_CONFIGS[dataset_name]
hf_repo, hf_config, hf_split, text_field, id_field = DATASET_CONFIGS[dataset_name]

print(f"Loading {dataset_name} from {hf_dataset} (streaming={streaming})...")
print(f"Loading {dataset_name} from {hf_repo} (config={hf_config}, split={hf_split})...")
try:
if hf_config:
dataset = load_dataset(hf_dataset, hf_config, split=hf_split, trust_remote_code=True, streaming=streaming)
else:
dataset = load_dataset(hf_dataset, split=hf_split, trust_remote_code=True, streaming=streaming)
dataset = load_dataset(hf_repo, hf_config, split=hf_split, trust_remote_code=True)
except Exception as e:
print(f"Warning: Failed to load {dataset_name}: {e}")
return 0
@@ -120,16 +115,13 @@ def prepare_dataset(dataset_name, output_dir, with_audio=True):
audio_dir.mkdir(parents=True, exist_ok=True)
print(f"Saving audio files to {audio_dir}")

if streaming:
print(f"Processing {dataset_name} (streaming)...")
else:
print(f"Processing {len(dataset)} samples from {dataset_name}...")
print(f"Processing {len(dataset)} samples from {dataset_name}...")

count = 0
skipped = 0
with open(output_file, "w", encoding="utf-8") as fout:
for idx, entry in enumerate(tqdm(dataset, desc=dataset_name)):
formatted = save_audio_and_format_entry(entry, dataset_name, audio_dir, idx, with_audio=with_audio)
formatted = save_audio_and_format_entry(entry, dataset_name, audio_dir, idx, text_field=text_field, id_field=id_field, with_audio=with_audio)
if formatted is None:
skipped += 1
continue
@@ -160,7 +152,8 @@ def main():
)
args = parser.parse_args()

output_dir = Path(__file__).parent
data_dir = Path("/dataset/asr-leaderboard")
output_dir = data_dir if data_dir.exists() else Path(__file__).parent
output_dir.mkdir(parents=True, exist_ok=True)

with_audio = not args.no_audio
@@ -0,0 +1,59 @@
# HF Open ASR Leaderboard evaluation - S2S incremental V2 backend, TEXT output only.
# Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli,
# tedlium, gigaspeech, spgispeech, earnings22, ami.
# Checkpoint: Mar 2 2026 (Feb 26 STT + Mar 3 TTS, Megan speaker, force_turn_taking OFF)
#
# Run:
# python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \
# --config nemo_skills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_02mar_config.yaml

# Cluster settings
cluster: s2s_eval_oci_iad
partition: batch_block1,batch_block3,batch_block4
cpu_partition: cpu

model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_02_March_exp_17_afg_long_FT_Megan_msr_34k_steps-stt-AS9.1_11002_new_branch_load_fixed

server_type: vllm
server_gpus: 1
num_chunks: 104

server_entrypoint: "-m nemo_skills.inference.server.serve_unified"
server_args: >-
--backend s2s_incremental_v2
--no_decode_audio
--use_asr_as_response
--ignore_system_prompt
--speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav
--num_frames_per_inference 3
--engine_type vllm_llm_vllm_eartts
--use_perception_cache
--use_perception_cudagraph
--buffer_size_frames 21
--codec_token_history_size 60
--repetition_penalty 1.0
--matmul_precision medium
--vllm_gpu_memory_utilization 0.35
--vllm_max_model_len 8192
--output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_artifacts_full_no_system_prompt_3frames_per_inference_21buffer
--batch_size 1
--code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo
--pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer sacrebleu"

server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh
server_server_type: vllm_multimodal

# Benchmark name used by nemo-skills eval pipeline
benchmark: asr-leaderboard

# Paths -- data_dir must match the container mount in cluster config
# (cluster mounts /lustre/fs12/.../nemo_skills/dataset -> /dataset)
data_dir: /dataset
output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v5_02mar_artifacts_full_no_system_prompt_3frames_per_inference_21buffer

installation_command: "pip install jiwer whisper-normalizer sacrebleu"

expname: asr_leaderboard_s2s_incremental_v5_02mar
generation_only: false
scoring_only: false
dry_run: false
@@ -0,0 +1,65 @@
# HF Open ASR Leaderboard evaluation - S2S incremental V2 backend, TEXT output only.
# Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli,
# tedlium, gigaspeech, spgispeech, earnings22, ami.
# Checkpoint: Feb 26 2026 (legally friendly personaplex dataset)
#
# Prepare data first:
# ns prepare_data asr-leaderboard
#
# Run:
# python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \
# --config nemo_skills/dataset/asr-leaderboard/scripts/asr_leaderboard_s2s_incremental_v2_config.yaml

# Cluster settings
cluster: s2s_eval_oci_iad
partition: batch_block1,batch_block3,batch_block4
cpu_partition: cpu

model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_26_Feb_exp_13_afg_14k_steps-stt-AS9.1_11002_new_branch_load_fixed

server_type: vllm
server_gpus: 1
num_chunks: 1

server_entrypoint: "-m nemo_skills.inference.server.serve_unified"
server_args: >-
--backend s2s_incremental_v2
--no_decode_audio
--use_asr_as_response
--speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav
--num_frames_per_inference 3
--engine_type vllm_llm_vllm_eartts
--use_perception_cache
--use_perception_cudagraph
--buffer_size_frames 20
--codec_token_history_size 60
--repetition_penalty 1.0
--force_turn_taking
--force_turn_taking_threshold 40
--force_turn_taking_pad_window 25
--matmul_precision medium
--vllm_gpu_memory_utilization 0.35
--vllm_max_model_len 8192
--system_prompt "You are a helpful assistant. /no_think"
--output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_full_artifacts
--batch_size 2
--code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo
--pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer"

server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh
server_server_type: vllm_multimodal

# Benchmark name used by nemo-skills eval pipeline
benchmark: asr-leaderboard

# Paths -- data_dir must match the container mount in cluster config
# (cluster mounts /lustre/fs12/.../nemo_skills/dataset -> /dataset)
data_dir: /dataset
output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02/asr_leaderboard_incremental_v2_02mar_full

installation_command: "pip install jiwer whisper-normalizer sacrebleu"
expname: asr_leaderboard_s2s_incremental_v2_02mar_full

generation_only: false
scoring_only: false
dry_run: false
@@ -0,0 +1,60 @@
# HF Open ASR Leaderboard evaluation - (i) BASELINE setup
# S2S incremental V2 backend, TEXT output only.
# Evaluates WER on 8 ASR datasets: librispeech_clean, librispeech_other, voxpopuli,
# tedlium, gigaspeech, spgispeech, earnings22, ami.
# No inference boosting, no force turn taking.
# Checkpoint: Mar 2 2026 (Feb 26 STT + Mar 3 TTS, Megan speaker)
#
# Run:
# python nemo_skills/dataset/asr-leaderboard/scripts/run_eval.py \
# --config nemo_skills/dataset/asr-leaderboard/scripts/hf_baseline_02mar_config.yaml

# Cluster settings
cluster: s2s_eval_oci_iad
partition: batch_block1,batch_block3,batch_block4
cpu_partition: cpu

model: /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Nemotron-VoiceChat-november/duplex-eartts-10min_sw_et_eos_dp_eos_dup_fp32_1delay_02_March_exp_17_afg_long_FT_Megan_msr_34k_steps-stt-AS9.1_11002_new_branch_load_fixed

server_type: vllm
server_gpus: 1
num_chunks: 104

server_entrypoint: "-m nemo_skills.inference.server.serve_unified"
server_args: >-
--backend s2s_incremental_v2
--no_decode_audio
--use_asr_as_response
--ignore_system_prompt
--speaker_reference /lustre/fsw/portfolios/convai/users/ecasanova/Checkpoints/Mg_a_00759.wav
--num_frames_per_inference 3
--engine_type vllm_llm_vllm_eartts
--use_perception_cache
--use_perception_cudagraph
--buffer_size_frames 21
--codec_token_history_size 60
--repetition_penalty 1.0
--matmul_precision medium
--vllm_gpu_memory_utilization 0.35
--vllm_max_model_len 8192
--output_dir /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02_FIXED/baseline/hf_artifacts
--batch_size 1
--code_path /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/NeMo
--pip_install "hf-xet==1.1.9 huggingface-hub==0.34.4 nvidia-modelopt==0.33.1 nvidia-modelopt-core==0.33.1 tokenizers==0.22.0 transformers==4.56.0 lhotse==1.32.2 nv-one-logger-core==2.1.0 nv-one-logger-pytorch-lightning-integration==2.1.0 nv-one-logger-training-telemetry==2.1.0 kaldialign==0.9.1 jiwer whisper-normalizer sacrebleu"

server_container: /lustre/fsw/portfolios/llmservice/users/erastorgueva/code/containers/triton25.05_s2svllm26.02.12.sqsh
server_server_type: vllm_multimodal

# Benchmark name used by nemo-skills eval pipeline
benchmark: asr-leaderboard

# Paths
data_dir: /dataset
output_dir: /lustre/fsw/portfolios/convai/users/mmkrtchyan/projects/speechLM/s2s/MAR_02_FIXED/baseline/hf

installation_command: "pip install jiwer whisper-normalizer sacrebleu"

expname: hf_baseline_02mar
generation_only: false
scoring_only: false
dry_run: false