This tutorial demonstrates how to create training windows from audio segments for Audio Language Model (ALM) training using NeMo Curator.
The ALM pipeline processes audio manifests containing diarized segments and creates training windows with the following filters:
- Sample rate: Minimum 16kHz
- Bandwidth: Minimum 8kHz per segment
- Window duration: Target 120 seconds (±10% tolerance)
- Speaker count: 2-5 speakers per window
- Overlap filtering: Remove highly overlapping windows
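Taken together, these filters amount to a single validity predicate on a candidate window. Below is a minimal sketch of that predicate; the `window_is_valid` helper and its field names are illustrative, not the actual NeMo Curator API:

```python
def window_is_valid(window_segments, sample_rate, *,
                    min_sample_rate=16000, min_bandwidth=8000,
                    target_duration=120.0, tolerance=0.1,
                    min_speakers=2, max_speakers=5):
    """Hypothetical check mirroring the ALM pipeline's window filters."""
    if sample_rate < min_sample_rate:
        return False  # whole file rejected on sample rate
    if any(s["metrics"]["bandwidth"] < min_bandwidth for s in window_segments):
        return False  # every segment must meet the bandwidth floor
    duration = window_segments[-1]["end"] - window_segments[0]["start"]
    low = target_duration * (1 - tolerance)
    high = target_duration * (1 + tolerance)
    if not (low <= duration <= high):
        return False  # outside the ±10% duration band
    speakers = {s["speaker"] for s in window_segments}
    return min_speakers <= len(speakers) <= max_speakers
```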
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Diarized Audio │───▶│ ALM Pipeline │───▶│ Downstream │───▶│ Sharded Data │
│ Manifests │ │ (this stage) │ │ Processors │ │ for Training │
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
input.jsonl output.jsonl (future stages) ready for ALM
The output JSONL from this pipeline is consumed by downstream processors for additional processing (e.g., audio slicing, feature extraction, data augmentation). At the end of the full pipeline, the output will be sharded data ready for training Audio Language Models.
From the Curator repository root:
```bash
uv sync --extra audio_cpu
source .venv/bin/activate
```

This creates a `.venv` with all base, dev, test, and audio dependencies resolved from the lockfile. If you don't have `uv`, you can fall back to pip:

```bash
pip install -e ".[audio_cpu]"
```

Sample data is located in `tests/fixtures/audio/alm/` for use in both testing and tutorials:
```
tests/fixtures/audio/alm/
└── sample_input.jsonl          # 5 sample audio manifests with diarized segments

tutorials/audio/alm/
├── main.py                     # Pipeline runner (YAML-driven)
├── pipeline.yaml               # Pipeline configuration
└── README.md                   # This file
```
The sample input contains 5 audio manifest entries with:
- Various sample rates (16kHz, 22kHz, 44kHz, 48kHz)
- 30+ segments per entry with speaker diarization
- Bandwidth metrics for quality filtering
- Multiple speakers (2-4 per conversation)
Run the pipeline on the included sample data (from Curator repo root):
```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl
```

Expected output:
```
PIPELINE COMPLETE
==================================================
Output entries: 5
[alm_manifest_reader]
  process_time: mean=0.0030s, total=0.01s
  items_processed: 0
[alm_data_builder]
  process_time: mean=0.0015s, total=0.01s
  items_processed: 5
  windows_created: 181
[alm_data_overlap]
  process_time: mean=0.0004s, total=0.00s
  items_processed: 5
  output_windows (after overlap): 25
  filtered_audio_duration: 3035.5s
[alm_manifest_writer]
  process_time: mean=0.0001s, total=0.00s
  items_processed: 5
```
With a single manifest file:
```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=/path/to/your/data.jsonl \
  output_dir=./my_output
```

With a directory (recursively discovers all `.jsonl` and `.json` files in subdirectories):

```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=/data/manifests/ \
  output_dir=./my_output
```

The pipeline supports two execution backends. Override via `backend=` on the command line:
| Backend | Description | When to use |
|---|---|---|
| `xenna` | Default executor. Uses the Cosmos-Xenna streaming engine with automatic worker allocation. | Most workloads, CI/nightly benchmarks. |
| `ray_data` | Executor built on Ray Data `map_batches`. | Development, machines where Xenna cannot detect GPUs, or when Ray Data integration is preferred. |
```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl
```

```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
  backend=ray_data
```

All parameters are defined in `pipeline.yaml`. Override from the command line:
```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=/data/input.jsonl \
  output_dir=./custom_output \
  stages.1.min_speakers=3 \
  stages.1.max_speakers=6 \
  stages.2.overlap_percentage=30
```

| Parameter | Description | Default |
|---|---|---|
| `manifest_path` | Path to input JSONL manifest | Required |
| `output_dir` | Directory for output files | `./alm_output` |
| `backend` | Execution backend | `xenna` |
| `stages.1.target_window_duration` | Target window duration (seconds) | `120.0` |
| `stages.1.tolerance` | Duration tolerance (e.g., `0.1` = ±10%) | `0.1` |
| `stages.1.min_sample_rate` | Minimum sample rate (Hz) | `16000` |
| `stages.1.min_bandwidth` | Minimum bandwidth (Hz) | `8000` |
| `stages.1.min_speakers` | Minimum speakers per window | `2` |
| `stages.1.max_speakers` | Maximum speakers per window | `5` |
| `stages.1.truncation` | Truncate segments exceeding the window | `true` |
| `stages.1.drop_fields` | Comma-separated fields to drop from segments | `"words"` |
| `stages.1.drop_fields_top_level` | Comma-separated top-level fields to drop | `"words,segments"` |
| `stages.2.overlap_percentage` | Overlap threshold (0-100) | `50` |
| `stages.2.target_duration` | Target duration for overlap comparison (seconds) | `120.0` |
| `stages.3.output_path` | Output JSONL path | `${output_dir}/alm_output.jsonl` |
Stage indices match the `stages` list in `pipeline.yaml`:

- `stages.0.*`: ALMManifestReaderStage parameters
- `stages.1.*`: ALMDataBuilderStage parameters
- `stages.2.*`: ALMDataOverlapStage parameters
- `stages.3.*`: ALMManifestWriterStage parameters
The input manifest should be a JSONL file where each line contains:
```json
{
  "audio_filepath": "/path/to/audio.wav",
  "audio_sample_rate": 16000,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "speaker_0",
      "text": "Hello, how are you?",
      "words": [{"word": "Hello", "start": 0.0, "end": 0.5}, ...],
      "metrics": {"bandwidth": 8000}
    },
    ...
  ]
}
```

Required fields:

| Field | Type | Description |
|---|---|---|
| `audio_filepath` | string | Path to audio file |
| `audio_sample_rate` | int | Sample rate in Hz |
| `segments` | list | List of diarized segments |
| `segments[].start` | float | Segment start time (seconds) |
| `segments[].end` | float | Segment end time (seconds) |
| `segments[].speaker` | string | Speaker identifier |
| `segments[].metrics.bandwidth` | int | Segment bandwidth in Hz |
Optional fields:

| Field | Type | Description |
|---|---|---|
| `segments[].text` | string | Transcription text |
| `segments[].words` | list | Word-level timestamps |
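Before running the pipeline on your own data, it can help to lint each manifest line for these fields. A standalone sketch (this checker is illustrative and not part of the tutorial code):

```python
import json

REQUIRED_TOP = {"audio_filepath", "audio_sample_rate", "segments"}
REQUIRED_SEG = {"start", "end", "speaker"}

def check_manifest_line(line: str) -> list:
    """Return a list of problems found in one JSONL manifest line."""
    entry = json.loads(line)
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_TOP - entry.keys())]
    for i, seg in enumerate(entry.get("segments", [])):
        problems += [f"segment {i}: missing {k}" for k in sorted(REQUIRED_SEG - seg.keys())]
        if "bandwidth" not in seg.get("metrics", {}):
            problems.append(f"segment {i}: missing metrics.bandwidth")
    return problems
```

An empty returned list means the line carries everything the builder and overlap stages read.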
Results are written as JSONL to `${output_dir}/alm_output.jsonl`. Each line contains:
```json
{
  "audio_filepath": "/path/to/audio.wav",
  "windows": [...],
  "filtered_windows": [
    {
      "segments": [
        {"start": 10.0, "end": 15.2, "speaker": "speaker_0", "text": "..."},
        {"start": 15.5, "end": 22.1, "speaker": "speaker_1", "text": "..."},
        ...
      ],
      "speaker_durations": [45.2, 38.1, 22.5, 14.2, 0.0]
    }
  ],
  "filtered_dur": 120.5,
  "filtered_dur_list": [120.5],
  "stats": {
    "total_segments": 150,
    "total_dur": 3600.0,
    "lost_bw": 5,
    "lost_sr": 0,
    "lost_spk": 12,
    "lost_win": 8
  },
  "truncation_events": 3
}
```

| Field | Description |
|---|---|
| `windows` | All valid windows from the builder stage |
| `filtered_windows` | Windows after overlap filtering |
| `filtered_dur` | Total duration of filtered windows (seconds) |
| `filtered_dur_list` | Duration of each filtered window |
| `stats` | Processing statistics and loss reasons |
| `truncation_events` | Count of segment truncations |
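A quick way to inspect the output is to tally windows and filtered duration across all entries. An illustrative reader (not part of the tutorial code; it assumes the field names above):

```python
import json

def summarize_output(path: str) -> dict:
    """Tally entries, window counts, and filtered duration in an output JSONL file."""
    totals = {"entries": 0, "windows": 0, "filtered_windows": 0, "filtered_dur": 0.0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            totals["entries"] += 1
            totals["windows"] += len(entry.get("windows", []))
            totals["filtered_windows"] += len(entry.get("filtered_windows", []))
            totals["filtered_dur"] += entry.get("filtered_dur", 0.0)
    return totals
```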
Creates training windows from audio segments.
Processing Logic:
- Check audio sample rate (skip if < min_sample_rate)
- For each segment as potential window start:
- Check bandwidth requirement
- Build window by adding consecutive segments
- Apply truncation if window exceeds max duration
- Validate speaker count (2-5 speakers)
- Check window duration (target ± tolerance)
- Create window with segments and speaker durations
Statistics Tracked:
- `lost_bw`: Segments lost due to low bandwidth
- `lost_sr`: Segments lost due to low sample rate
- `lost_spk`: Segments lost due to speaker count
- `lost_win`: Segments lost due to window constraints
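Simplified, the build loop walks each segment as a potential window start and greedily extends it. The sketch below is an assumption-laden illustration only — it omits the per-segment bandwidth checks inside the window, truncation, `drop_fields`, and the speaker-duration bookkeeping that the real `ALMDataBuilderStage` performs:

```python
def build_windows(segments, *, target=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5, min_bandwidth=8000):
    """Greedy sketch: from each start segment, extend until the max duration."""
    windows, max_dur = [], target * (1 + tolerance)
    for i, start_seg in enumerate(segments):
        if start_seg["metrics"]["bandwidth"] < min_bandwidth:
            continue  # counted as lost_bw in the real stage
        window = []
        for seg in segments[i:]:
            if seg["end"] - start_seg["start"] > max_dur:
                break  # adding this segment would exceed the max duration
            window.append(seg)
        if not window:
            continue
        duration = window[-1]["end"] - start_seg["start"]
        speakers = {s["speaker"] for s in window}
        if duration >= target * (1 - tolerance) and min_speakers <= len(speakers) <= max_speakers:
            windows.append(window)
    return windows
```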
Filters windows based on overlap ratio.
Processing Logic:
- Calculate timestamps for all windows
- For each pair of overlapping windows:
- Calculate overlap ratio
- If overlap exceeds threshold, keep window closer to target duration
- Return filtered windows
Parameters:
- `overlap_percentage=0`: Aggressive filtering (remove any overlap)
- `overlap_percentage=50`: Moderate filtering
- `overlap_percentage=100`: Permissive (keep all windows)
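The overlap decision can be sketched with two small helpers over `(start, end)` window spans. The ratio definition here (intersection over the shorter window) is an assumption for illustration; the stage's exact formula may differ:

```python
def overlap_ratio(w1, w2):
    """Fraction of the shorter window covered by the intersection, in [0, 1]."""
    inter = max(0.0, min(w1[1], w2[1]) - max(w1[0], w2[0]))
    shorter = min(w1[1] - w1[0], w2[1] - w2[0])
    return inter / shorter if shorter > 0 else 0.0

def pick_windows(w1, w2, overlap_percentage=50, target=120.0):
    """If two windows overlap beyond the threshold, keep the one closer to target."""
    if overlap_ratio(w1, w2) * 100 <= overlap_percentage:
        return (w1, w2)  # acceptable overlap: keep both
    d1, d2 = w1[1] - w1[0], w2[1] - w2[0]
    keep = w1 if abs(d1 - target) <= abs(d2 - target) else w2
    return (keep,)
```

Note how the threshold behaves at the extremes: with `overlap_percentage=100` the check always passes (keep everything), while with `0` any nonzero overlap triggers a choice.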
For shorter windows (e.g., 60 seconds):
```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
  stages.1.target_window_duration=60 \
  stages.1.tolerance=0.15
```

For exactly 2-3 speakers:

```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
  stages.1.min_speakers=2 \
  stages.1.max_speakers=3
```

Remove all overlapping windows:

```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
  stages.2.overlap_percentage=0
```

Process all manifests in a directory tree:

```bash
python tutorials/audio/alm/main.py \
  --config-path . \
  --config-name pipeline \
  manifest_path=tests/fixtures/audio/alm/nested_manifests
```

This recursively discovers all `.jsonl` and `.json` files under `nested_manifests/` (including `subdir_a/` and `subdir_b/`), partitions them via `FilePartitioningStage`, and processes all entries through the full pipeline. Expected output with the included test fixtures (4 manifest files, 5 entries each = 20 entries):
```
PIPELINE COMPLETE
==================================================
[file_partitioning]
  items_processed: 0
[alm_manifest_reader]
  items_processed: 20
[alm_data_builder]
  windows_created: 724
[alm_data_overlap]
  output_windows (after overlap): 100
  filtered_audio_duration: 12142.0s
[alm_manifest_writer]
  items_processed: 20
```
See benchmarking/ALM_BENCHMARK.md for the full ALM benchmark documentation, including how to run benchmarks, configuration, CLI arguments, and reference results.
The ALM pipeline has comprehensive unit and integration tests in tests/stages/audio/alm/.
From the Curator repository root:
```bash
pytest tests/stages/audio/alm/ -v
```

```
tests/stages/audio/alm/
├── conftest.py                    # Shared fixtures
├── test_alm_manifest_reader.py    # 14 tests (3 classes)
├── test_alm_manifest_writer.py    # 11 tests (2 classes)
├── test_alm_data_builder.py       # 13 tests (2 classes)
└── test_alm_data_overlap.py       # 10 tests (2 classes)
```
| Fixture | Description |
|---|---|
| `sample_entries` | Loads all 5 entries from `tests/fixtures/audio/alm/sample_input.jsonl` |
| `sample_entry` | First entry from `sample_entries` |
| `entry_with_windows` | `sample_entry` processed through `ALMDataBuilderStage` (pre-built windows for overlap tests) |
`TestALMManifestReaderStage` (unit tests):

| Test | What it verifies |
|---|---|
| `test_reads_single_manifest` | Reads a 2-entry JSONL, returns an `AudioTask` per entry |
| `test_reads_multiple_manifests` | Accepts a list of manifest paths, concatenates entries |
| `test_one_audio_entry_per_line` | Each JSONL line becomes exactly one `AudioTask` |
| `test_skips_blank_lines` | Blank/whitespace-only lines in JSONL are ignored |
| `test_empty_manifest` | Empty file returns `[]` |
| `test_preserves_nested_data` | Nested `segments[].metrics.bandwidth` survives round-trip |
| `test_duplicate_manifests_for_repeat` | Same path repeated 3x produces 3 batches (repeat-factor pattern) |
`TestALMManifestReaderDirectory`:

| Test | What it verifies |
|---|---|
| `test_reads_all_jsonl_from_directory` | Recursively discovers and reads all JSONL files in a directory tree |
| `test_reads_from_subdirectory_a` | Reads manifests from a specific subdirectory |
| `test_reads_from_subdirectory_b` | Reads manifests from another subdirectory |
| `test_composite_discovers_nested_directory` | Composite stage discovers nested directories end-to-end |
| `test_ignores_non_jsonl_files` | Non-JSONL files in the directory are skipped |
`TestALMManifestReaderIntegration`:

| Test | What it verifies |
|---|---|
| `test_reads_sample_fixture` | Reads the real `sample_input.jsonl` fixture, verifies 5 entries with segments |
| `test_composite_end_to_end_with_directory` | Composite reader processes a directory of manifests end-to-end |
`TestALMManifestWriter` (unit tests):

| Test | What it verifies |
|---|---|
| `test_writes_entry_to_jsonl` | Entry written as a JSONL line with correct `audio_filepath` |
| `test_returns_file_group_task` | Returns `FileGroupTask` with output path, `task_id`, `dataset_name` |
| `test_propagates_metadata_and_stage_perf` | `_metadata` and `_stage_perf` pass through to the output task |
| `test_appends_across_multiple_process_calls` | 3 sequential `process()` calls produce 3 lines |
| `test_setup_on_node_truncates_existing_file` | `setup_on_node()` clears pre-existing file content |
| `test_setup_on_node_creates_parent_directories` | `setup_on_node()` creates nested directories for the output path |
| `test_handles_unicode_content` | Japanese and accented characters survive write/read |
| `test_preserves_nested_structures` | `windows[].segments[]` and the `stats` dict survive serialization |
| `test_num_workers_returns_one` | `num_workers()` returns 1 (single-writer constraint) |
| `test_xenna_stage_spec` | Returns `{"num_workers": 1}` |
`TestALMManifestWriterRoundTrip`:

| Test | What it verifies |
|---|---|
| `test_reader_writer_round_trip` | Write all fixture entries with the writer, read back with the reader, verify `audio_filepath` and segment counts match |
`TestALMDataBuilder` (unit tests):

| Test | What it verifies |
|---|---|
| `test_creates_windows_from_sample` | Sample entry produces a non-empty `windows` list and `stats` |
| `test_filters_low_sample_rate` | Entry with an 8kHz sample rate has `lost_sr > 0` or empty windows |
| `test_filters_low_bandwidth` | All segments set to 4kHz bandwidth triggers `lost_bw > 0` |
| `test_speaker_constraints` | Single-speaker entry with `min_speakers=2` produces zero windows |
| `test_empty_segments` | Entry with `segments=[]` returns empty windows |
| `test_drop_fields` | `words` removed from segments inside windows; `words` and `segments` removed from the top level |
| `test_different_sample_rates` | All 5 fixture entries (16-48kHz) process without error |
| `test_validate_input_valid` | `validate_input()` returns True when required keys are present |
| `test_validate_input_missing_segments` | `validate_input()` returns False when the `segments` key is missing |
| `test_validate_input_missing_sample_rate` | `validate_input()` returns False when the `audio_sample_rate` key is missing |
| `test_process_batch_raises_on_missing_segments` | `process_batch()` raises `ValueError` on missing `segments` |
| `test_process_batch_raises_on_missing_sample_rate` | `process_batch()` raises `ValueError` on missing `audio_sample_rate` |
`TestALMDataBuilderIntegration`:

| Test | What it verifies |
|---|---|
| `test_processes_all_sample_entries` | All 5 fixture entries produce exactly 181 total windows |
`TestALMDataOverlap` (unit tests):

| Test | What it verifies |
|---|---|
| `test_validate_input_valid` | `validate_input()` returns True when the `windows` key is present |
| `test_validate_input_missing_windows` | `validate_input()` returns False when the `windows` key is missing |
| `test_process_batch_raises_on_missing_windows` | `process_batch()` raises `ValueError` on missing `windows` |
| `test_filters_overlapping_windows` | `filtered_windows` count <= input windows count |
| `test_keeps_closer_to_target` | Aggressive filtering (`overlap_percentage=0`) produces valid output |
| `test_permissive_mode` | `overlap_percentage=100` keeps at least as many windows as `overlap_percentage=0` |
| `test_no_windows` | Entry with `windows=[]` passes through unchanged |
| `test_validation` | Invalid `overlap_percentage` (-1, 101) and `target_duration` (-1) raise `ValueError` |
| `test_calculates_duration` | Output includes `filtered_dur >= 0` and `filtered_dur_list` |
`TestALMDataOverlapIntegration`:

| Test | What it verifies |
|---|---|
| `test_full_pipeline` | Full Builder -> Overlap pipeline: 5 entries produce 181 windows -> 25 filtered windows, total filtered duration ~3035.5 seconds |
- Both stages use Ray-based parallelism via the selected backend (`xenna` or `ray_data`)
- Processing is CPU-bound (no GPU required)
- Memory usage scales with manifest size
- For large manifests, consider processing in batches or using `--repeat-factor` for scale testing
- Check that `audio_sample_rate >= min_sample_rate`
- Verify `segments[].metrics.bandwidth >= min_bandwidth`
- Ensure sufficient consecutive segments for the target duration
- Check speaker identifiers (avoid "no-speaker")
- Reduce the `min_speakers` requirement
- Increase `tolerance` for window duration
- Lower the `min_bandwidth` threshold
- Increase `overlap_percentage` (more permissive)
- Process manifest in smaller batches
- Reduce number of parallel workers