feat: add comprehensive audio model benchmarking infrastructure #110

Draft
maryamtahhan wants to merge 28 commits into redhat-et:main from maryamtahhan:feat/audio-tests
@maryamtahhan
Collaborator

Add complete test suite for audio models (ASR, translation, audio chat) on vLLM CPU deployments.

**Audio Benchmark Playbook** (audio-benchmark.yml):
- Three-play structure: Setup/Validation, Start vLLM Server, Execute Benchmarks
- Supports managed mode (Ansible starts vLLM) and external endpoint mode
- CPU auto-calculation with control plane reservation (2 cores for vLLM scheduler, KV cache, async ops)
- Audio-specific Tensor Parallel defaults: TP=1 for <64 cores, TP=2 for 64+ cores
- OMP thread calculation: (requested_cores - 2) / tensor_parallel
- Integrated vLLM metrics collection via common task files
- Generates test-metadata.json for Streamlit dashboard integration
- Containerized execution for both vLLM and GuideLLM

**GuideLLM Audio Benchmark Role** (benchmark_guidellm_audio):
- GuideLLM v0.6.0 container execution with audio support
- Builds command as argument list to avoid JSON escaping issues
- Audio preprocessing via --data-preprocessors encode_media
- HuggingFace cache volume mount for disk space management
- Supports audio_transcriptions, audio_translations, chat_completions request formats
- Comprehensive error handling and log capture

Created reusable task files used by both LLM and audio benchmarks:
- **tasks/start-vllm-metrics-collection.yml**: Unified vLLM /metrics endpoint collection
- **tasks/stop-vllm-metrics-collection.yml**: Unified metrics teardown
- **tasks/wait-for-vllm-ready.yml**: Health check logic for vLLM readiness
- **tasks/setup-results-directory.yml**: Results directory creation

Refactored llm-benchmark-auto.yml to use the common task files (about 55 lines removed).

Five test scenarios plus a quick validation test:

1. **transcription-throughput.yaml**: PRIMARY TEST - "How long to transcribe N files?"
   - Sequential baseline (100 files)
   - Concurrent processing (2, 4, 8 workers)
   - Maximum throughput test
   - Metrics: total time, files/sec, audio_seconds/sec, real-time factor

2. **transcription-latency.yaml**: Per-request latency under load
   - Light/medium/heavy load tests
   - P50/P95/P99 latency percentiles
   - Latency degradation analysis

3. **audio-duration-scaling.yaml**: Performance vs audio length
   - Short (1-5s), medium (5-15s), long (15-30s), full-length clips
   - Linear vs non-linear scaling analysis
   - Per-audio-second processing cost

4. **constant-rate-stress.yaml**: Sustained load stability
   - Sustained rates: 2, 5, 10 req/s (5 min each)
   - Extended duration test (15 min)
   - Memory leak detection

5. **format-comparison.yaml**: Audio format impact
   - MP3 (64kbps, 128kbps)
   - WAV (uncompressed, 8/16/48kHz)
   - FLAC (lossless)
   - Stereo vs mono comparison

6. **quick-test.yaml**: Fast validation (5 files)

**model-matrix.yaml**:
- Whisper models: tiny (39M), small (244M), medium (769M)
- Audio chat: ultravox-v0_5-llama-3_2-1b
- Audio preprocessing presets: mp3_64k, mp3_128k, wav_16k, flac_16k
- Recommended vLLM settings per model (dtype, max_model_len, kvcache)
- Dataset configurations (LibriSpeech, Common Voice)

**tests/audio-models/README.md** (435 lines):
- Quick start guide (managed and external modes)
- Detailed usage for all test scenarios
- CPU allocation breakdown with control plane explanation
- Container mode documentation (no host installation required)
- Advanced configuration (manual CPU tuning, custom datasets)
- Results analysis and interpretation
- Troubleshooting guide

**models/audio-models/audio-models.md**:
- Model selection rationale
- Supported endpoints (ASR, translation, chat)
- Dataset descriptions (LibriSpeech, Common Voice)
- Audio preprocessing configurations
- CPU optimization parameters
- Performance expectations
- Test scenario mappings

**Updated models/models.md**:
- Added audio models section
- Links to audio-specific documentation
- Integration with existing model matrix

**CPU Allocation Formula**:
```
Container Allocation:  cores 0-31 (32 total)
├── Control Plane:     2 cores (scheduler, KV cache, async ops)
└── Worker Threads:    30 cores (OMP_NUM_THREADS=30)
    └── Per TP rank:   30/TP cores

OMP_NUM_THREADS = (requested_cores - 2) / tensor_parallel
```
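
The allocation formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the playbook's implementation: the function names are hypothetical, and rounding down when the worker cores do not divide evenly across TP ranks is an assumption.

```python
CONTROL_PLANE_CORES = 2  # reserved for scheduler, KV cache, async ops


def default_tensor_parallel(requested_cores: int) -> int:
    """Audio default from the description: TP=1 below 64 cores, TP=2 otherwise."""
    return 1 if requested_cores < 64 else 2


def omp_threads_per_rank(requested_cores: int, tensor_parallel: int) -> int:
    """(requested_cores - 2) / tensor_parallel, rounded down (assumption)."""
    worker_cores = requested_cores - CONTROL_PLANE_CORES
    if tensor_parallel < 1 or worker_cores < tensor_parallel:
        raise ValueError("not enough cores left after the control plane reservation")
    return worker_cores // tensor_parallel


if __name__ == "__main__":
    cores = 32
    tp = default_tensor_parallel(cores)
    print(f"TP={tp}, OMP_NUM_THREADS={omp_threads_per_rank(cores, tp)}")
```

With the 32-core example from the diagram this yields TP=1 and 30 worker threads.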

**Container Execution**:
- vLLM container: ghcr.io/vllm-project/vllm-cpu-env:latest
- GuideLLM container: ghcr.io/vllm-project/guidellm:v0.6.0
- HuggingFace cache volume: host mount to avoid disk space issues
- SELinux context: :z flag for proper permissions
- Cache directory: 0777 mode for container write access

**Metrics Collection**:
- vLLM server metrics from /metrics endpoint
- GuideLLM client-side metrics (latency, throughput, audio-specific)
- Audio-specific metrics: audio_seconds, audio_samples, audio_bytes, audio_tokens
- Real-time factor: processing_time / audio_duration

Successfully validated with quick-test scenario:
- Container image pull and execution
- HuggingFace dataset download (LibriSpeech)
- Audio preprocessing (MP3 encoding)
- vLLM server startup and health checks
- Metrics collection
- Results JSON generation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


maryamtahhan and others added 20 commits April 21, 2026 14:07
LibriSpeech ASR dataset is ~350GB across all splits, causing timeouts
during dataset download for quick validation tests.

Switch quick-test to use hf-internal-testing/librispeech_asr_dummy which
is a tiny test dataset (<1MB) designed for rapid ASR testing.

This allows the test to complete in seconds instead of failing after
multi-minute downloads.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
When using delegate_to with delegate_facts, variables are stored in
hostvars['localhost'] and must be accessed through that path.

Fixed references to:
- benchmark_end_time
- test_duration_seconds (in format task)

These variables are now correctly accessed via hostvars['localhost']
to prevent 'undefined' errors during test finalization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add comprehensive documentation explaining the difference between:

1. **Offline Batch Processing** (e.g., "I have N audio files, transcribe
   them all ASAP")
   - Focus: Total completion time, maximum throughput
   - Use cases: Post-call transcription, media archive processing
   - Profiles: synchronous (sequential), throughput (max capacity)

2. **Online Serving** (e.g., "How many concurrent users can use my
   transcription API?")
   - Focus: Per-request latency (P50, P95, P99), concurrent user capacity
   - Use cases: Real-time transcription API, voice assistant backend
   - Profiles: concurrent (simulates N concurrent users), constant rate

Key clarifications:
- 'concurrent' profile with 'rate: N' simulates N concurrent USERS,
  NOT parallel batch processing of N files
- Sequential + max-throughput stages answer offline batch questions
- Concurrent-N stages answer online serving questions
- transcription-throughput covers BOTH patterns in one test

Added:
- README section "Understanding Test Profiles" with analogies
- README section "Which Test Should I Run?" with use case mapping
- Per-scenario serving pattern labels in test table
- Detailed explanation for each test scenario
- Inline comments in transcription-throughput.yaml clarifying stages

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…etup

Add `audio_num_files` parameter to allow dynamic override of the number
of audio files processed per benchmark stage. This enables quick testing
with fewer files (e.g., 10) or comprehensive benchmarks with more files
(e.g., 100+).

Changes:
- Add `audio_num_files` parameter validation (1-10000 range)
- Override `max_requests` in all stages when parameter is provided
- Change default `vllm_dtype` from "float16" to "auto" for better
  model compatibility (aligns with LLM benchmark defaults)
- Fix directory permissions from 0755 to 0777 to prevent container
  permission errors when downloading models
- Update documentation with parameter usage examples
- Add inventory file requirement to all command examples

Usage:
  ansible-playbook -i inventory/hosts.yml audio-benchmark.yml \
    -e "test_model=openai/whisper-tiny" \
    -e "test_scenario=transcription-throughput" \
    -e "requested_cores=32" \
    -e "audio_num_files=10"
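
As a rough sketch of the override described above, the logic amounts to a range check followed by replacing `max_requests` in every stage. The stage dictionaries and the function name are hypothetical; the real playbook does this in Ansible, and its scenario schema may differ.

```python
def apply_audio_num_files(stages, audio_num_files=None):
    """Return a copy of the stages with max_requests overridden when requested."""
    if audio_num_files is None:
        # No override given: keep each stage's own max_requests.
        return [dict(stage) for stage in stages]
    if isinstance(audio_num_files, bool) or not isinstance(audio_num_files, int):
        raise ValueError("audio_num_files must be an integer")
    if not 1 <= audio_num_files <= 10000:
        raise ValueError("audio_num_files must be in the range 1-10000")
    return [{**stage, "max_requests": audio_num_files} for stage in stages]


if __name__ == "__main__":
    scenario = [
        {"name": "sequential", "max_requests": 100},
        {"name": "concurrent-2", "max_requests": 100},
    ]
    print(apply_audio_num_files(scenario, 10))
```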

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add :Z flag to volume mounts for SELinux compatibility on RHEL/Amazon Linux
- Include HF_TOKEN environment variable for model downloads
- Setup HuggingFace token role before starting vLLM server

Fixes: PermissionError when downloading models to mounted volumes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The VLLM_CPU_KVCACHE_SPACE environment variable expects just the numeric
value (e.g., 2 for 2GiB), not the full string with units. Use the
extract_size_value filter to convert '2GiB' -> 2.

Fixes: pydantic ValidationError for invalid literal '2GiB'
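
For illustration, a filter with the behavior described above could look like the following. This is a hypothetical Python re-implementation, not the repository's `extract_size_value` filter; the accepted unit suffixes are an assumption.

```python
import re


def extract_size_value(size: str) -> int:
    """Strip an optional GiB/GB/G suffix and return the integer part."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:GiB|GB|G)?\s*", str(size))
    if match is None:
        raise ValueError(f"unsupported size value: {size!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # The numeric result is what VLLM_CPU_KVCACHE_SPACE expects.
    print(extract_size_value("2GiB"))
```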

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
For single tensor-parallel configurations (TP=1), vLLM's CPU backend
manages threads automatically and doesn't expect OMP_NUM_THREADS to be set.
Setting it causes 'RuntimeError: Expected positive number of threads'.

Only set OMP_NUM_THREADS and VLLM_CPU_OMP_THREADS_BIND when TP > 1,
matching the behavior of the LLM benchmark playbook.

Fixes: RuntimeError during engine core initialization for TP=1

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Reserve 2 CPU cores for vLLM's control plane (scheduler, KV cache manager,
async operations) by setting VLLM_CPU_NUM_OF_RESERVED_CPU=2. This prevents
worker threads from saturating all cores and starving the control plane.

Without this, performance degrades as the control plane competes with
worker threads for CPU time.

Also updated CPU configuration display to show:
- Reserved cores configuration
- Auto-managed threads for TP=1 (no OMP override)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…ction

Add detailed comments explaining the critical interaction between
VLLM_CPU_NUM_OF_RESERVED_CPU and VLLM_CPU_OMP_THREADS_BIND:

**For TP=1 (our default):**
- VLLM_CPU_OMP_THREADS_BIND left unset (defaults to 'auto')
- VLLM_CPU_NUM_OF_RESERVED_CPU=2 works properly
- vLLM auto-manages thread binding and respects reserved cores
- Default behavior: world_size==1 reserves 0 cores (bad for performance!)
- We explicitly set 2 to prevent worker saturation

**For TP>1:**
- VLLM_CPU_OMP_THREADS_BIND set explicitly (e.g., '0-31|32-63')
- OMP_NUM_THREADS set explicitly
- Setting explicit binding DISABLES VLLM_CPU_NUM_OF_RESERVED_CPU
- Reserved cores must be accounted for in the binding string

This ensures optimal CPU allocation without control plane starvation.
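
One way to account for reserved cores in an explicit binding string is sketched below. The helper name is hypothetical, and the layout (contiguous ranges per rank, with the reserved cores left unbound at the top of the range) is an assumption for illustration; a real deployment may also need to respect NUMA boundaries.

```python
def build_threads_bind(total_cores: int, tensor_parallel: int, reserved: int = 2) -> str:
    """Build a 'start-end|start-end' string with one contiguous range per TP rank."""
    per_rank = (total_cores - reserved) // tensor_parallel
    if per_rank < 1:
        raise ValueError("not enough cores for the requested tensor parallelism")
    ranges = []
    for rank in range(tensor_parallel):
        start = rank * per_rank
        end = start + per_rank - 1  # inclusive upper bound
        ranges.append(f"{start}-{end}")
    return "|".join(ranges)


if __name__ == "__main__":
    # With no reservation this reproduces the '0-31|32-63' shape from the commit message.
    print(build_threads_bind(64, 2, reserved=0))
    # With 2 reserved cores, each rank gets 31 cores and cores 62-63 stay unbound.
    print(build_threads_bind(64, 2, reserved=2))
```

In this sketch `OMP_NUM_THREADS` would be set to the per-rank width (31 in the second case).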

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The vllm_control_plane_cores variable was only defined in the localhost
play but needed in the vllm-server play for VLLM_CPU_NUM_OF_RESERVED_CPU.

Add it to the vllm-server play's vars section so it's available when
building environment variables.

Fixes: 'vllm_control_plane_cores' is undefined error

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update default vLLM container image from v0.18.0 to v0.19.0 for better
audio model support and bug fixes.

Users can still override with: export VLLM_CONTAINER_IMAGE=<custom-image>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…hput

GuideLLM's ThroughputProfile requires a 'rate' parameter. Set it to 'inf'
(infinite) to send requests as fast as possible, which is the intended
behavior for finding maximum throughput capacity.

Fixes: ValueError: ThroughputProfile requires a rate parameter

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…hmark

The Ansible role was only passing --rate to GuideLLM for concurrent and
constant profiles, but not for throughput profile. This caused the
max-throughput stage to fail with "ThroughputProfile requires a rate parameter".

Added rate parameter handling for throughput profile in both container mode
(command args list) and host mode (command string) sections.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…default

Changed max-throughput stage rate from "inf" to 50 concurrent streams:
- GuideLLM's throughput profile requires finite rate (max_concurrency)
- Default of 50 concurrent streams is reasonable for most deployments
- Can be increased in scenario YAML for load balancer/multi-instance setups
- Not exposed as playbook parameter since it's deployment-specific

This addresses the error:
  ValidationError: Input should be a finite number [type=finite_number, input_value=inf]

Updated documentation to explain:
- Rate parameter represents concurrent request streams for throughput profile
- How to adjust for multi-instance deployments behind load balancers
- Why it's not tied to core count (external mode doesn't know server config)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed stage_results_path to include stage suffix, creating:
  results/.../transcription-throughput/sequential/benchmarks.json
  results/.../transcription-throughput/concurrent-2/benchmarks.json
  etc.

Benefits:
- Each stage preserves GuideLLM's default output filenames
- No overwriting between stages
- Cleaner organization for multi-stage benchmarks
- Easier to add stage-specific artifacts later

This fixes the issue where all stages were writing to the same directory
and overwriting each other's results.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added fetch tasks to copy benchmarks.json and benchmarks.csv from the
load generator host back to the controller (localhost) after each stage
completes.

Changes:
- Check for benchmark files on load generator before fetching
- Display found files for debugging
- Fetch benchmarks.json and benchmarks.csv to controller
- Handle both container mode and host mode execution
- Use failed_when: false to handle cases where files don't exist

Without this fix, benchmark results remained on the load generator host
and were not accessible from the controller where the playbook was run.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed results path from:
  results/audio-models/{model}/{scenario}/
to:
  results/audio-models/{model}/{scenario}-{timestamp}/

This matches the LLM benchmark directory structure and prevents results
from being overwritten on subsequent test runs.

Example:
  results/audio-models/openai__whisper-tiny/transcription-throughput-20260422-110932/
  ├── sequential/
  ├── concurrent-2/
  ├── concurrent-4/
  ├── concurrent-8/
  ├── max-throughput/
  ├── vllm-metrics.json
  └── test-metadata.json

Updated documentation to reflect the new structure and explain the
timestamp-based organization.
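
The directory layout above can be reproduced with a small path helper. This is a sketch under assumptions: the function name is hypothetical, the `/` to `__` substitution is inferred from the `openai__whisper-tiny` example, and the timestamp format is inferred from `20260422-110932`.

```python
from datetime import datetime
from pathlib import PurePosixPath


def stage_results_path(model, scenario, stage, started_at, root="results/audio-models"):
    """Return the benchmarks.json path for one stage of one timestamped run."""
    model_dir = model.replace("/", "__")            # openai/whisper-tiny -> openai__whisper-tiny
    stamp = started_at.strftime("%Y%m%d-%H%M%S")    # e.g. 20260422-110932
    return PurePosixPath(root) / model_dir / f"{scenario}-{stamp}" / stage / "benchmarks.json"


if __name__ == "__main__":
    run_start = datetime(2026, 4, 22, 11, 9, 32)
    for stage in ("sequential", "concurrent-2", "max-throughput"):
        print(stage_results_path("openai/whisper-tiny", "transcription-throughput", stage, run_start))
```

Because the timestamp is fixed once per run and the stage name is appended last, stages of the same run share a parent directory and never overwrite each other.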

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
GuideLLM requires different encode kwargs structure depending on request type:
- audio_transcriptions/translations: flat structure {"format": "mp3", ...}
- chat_completions: nested structure {"audio": {"format": "mp3", ...}}

The previous implementation always used nested structure, which caused
GuideLLM to fail silently when generating requests for transcription/translation
tasks - it would initialize but process 0 requests.

Now conditionally wraps kwargs in "audio" only for chat_completions.

This fixes the issue where all benchmark stages completed in <1 second
with 0 requests processed.
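
The conditional wrapping can be sketched as follows. The function name and the option keys other than `format` are hypothetical; only the flat versus nested shapes come from the description above.

```python
def build_encode_kwargs(request_type: str, audio_options: dict) -> dict:
    """Shape the encode kwargs according to the request type."""
    if request_type == "chat_completions":
        # Chat requests expect the options nested under "audio".
        return {"audio": dict(audio_options)}
    if request_type in ("audio_transcriptions", "audio_translations"):
        # Transcription and translation requests expect a flat structure.
        return dict(audio_options)
    raise ValueError(f"unsupported request type: {request_type}")


if __name__ == "__main__":
    options = {"format": "mp3"}
    print(build_encode_kwargs("audio_transcriptions", options))
    print(build_encode_kwargs("chat_completions", options))
```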

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…taset

GuideLLM was trying to download the entire LibriSpeech test split (2620 samples)
even when only benchmarking 5 requests. This caused:
- Very long dataset download times (downloading all 48 tar files)
- Benchmarks completing in 0 seconds with 0 requests (timeout/failure)

Solution: Add --data-samples parameter set to 2x max_requests:
- max_requests=5 → load 10 samples
- max_requests=100 → load 200 samples

This provides enough samples for the benchmark while avoiding unnecessary
downloads. The 2x buffer ensures we have extra samples in case of any
preprocessing failures.

Applied to both container mode (command args) and host mode (command string).
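
A minimal sketch of the 2x sizing rule, written as the argument-list fragment the container mode would append. The helper name is hypothetical; `--data-samples` is the parameter named above.

```python
def data_samples_args(max_requests: int, buffer_factor: int = 2) -> list:
    """Return the --data-samples arguments for a stage, sized at 2x max_requests."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    return ["--data-samples", str(buffer_factor * max_requests)]


if __name__ == "__main__":
    print(data_samples_args(5))
    print(data_samples_args(100))
```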

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Build custom vLLM image with audio deps (librosa, soundfile, ffmpeg-python) for v0.19.0
- Add load generator connectivity check before running benchmarks
- Add audio endpoint pre-flight check to verify vLLM supports the request type
- Stop all vLLM containers (audio/LLM/embedding) before starting new ones

Resolves audio benchmark failures caused by:
1. Missing audio dependencies in v0.19.0 image (predates vllm[audio] extras)
2. Network/firewall issues blocking load generator → vLLM server connections
3. Multiple vLLM containers running simultaneously causing port conflicts

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan and others added 7 commits April 23, 2026 12:44
- Add docs/audio-benchmarking.md with complete guide:
  - Quick start and prerequisites
  - All test scenarios explained
  - Understanding results and metrics
  - Advanced configuration (CPU, models, non-container mode)
  - Troubleshooting section covering all common issues:
    - Network connectivity (AWS security groups, private IPs)
    - Missing audio dependencies
    - Multiple containers
    - Dataset download issues
    - GuideLLM backend validation
    - Low throughput diagnosis
  - Best practices for production use

- Update main README.md:
  - Add Audio Models test suite to quick start
  - Include audio-models in repository structure
  - Add to models and documentation sections
  - Link to comprehensive audio benchmarking guide

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add new Streamlit dashboard page (🎧 Audio Metrics) with audio-specific
performance analysis:

Audio-Specific Metrics:
- Audio throughput (audio_seconds/wall_clock_second)
- Real-Time Factor (RTF = processing_time / audio_duration)
  - RTF < 1.0 = faster than real-time
  - RTF = 1.0 = real-time processing
  - RTF > 1.0 = slower than real-time
- Request throughput (files/second)
- Per-core efficiency

Features:
- Extracts audio metadata from GuideLLM benchmarks.json:
  - audio_seconds (duration)
  - audio_tokens
  - audio_samples
  - audio_bytes
- Calculates RTF percentiles (mean, p50, p95, p99) per request
- Aggregates audio metrics across stages
- Visualizations:
  - Audio throughput by stage and model
  - RTF trends with percentile overlay
  - Latency vs audio duration scaling
  - Request throughput comparison
  - Per-core efficiency
- CSV export for external analysis
- Supports stage-based filtering (sequential, concurrent-N, max-throughput)
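
The per-request RTF summary can be sketched as below. This is not the dashboard's code: the input is assumed to be a list of `(processing_time_s, audio_duration_s)` pairs, and the nearest-rank percentile method is an assumption (the dashboard may interpolate instead).

```python
import math
from statistics import mean


def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = processing_time / audio_duration; below 1.0 is faster than real-time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_time_s / audio_duration_s


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rtf_summary(requests) -> dict:
    """Summarize per-request RTF as mean, p50, p95, and p99."""
    rtfs = [real_time_factor(proc, dur) for proc, dur in requests]
    return {
        "mean": mean(rtfs),
        "p50": percentile(rtfs, 50),
        "p95": percentile(rtfs, 95),
        "p99": percentile(rtfs, 99),
    }


if __name__ == "__main__":
    sample = [(1.0, 4.0), (2.0, 4.0), (3.0, 4.0), (4.0, 4.0)]
    print(rtf_summary(sample))
```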

Updates:
- Home.py: Add audio metrics to dashboard overview
- README.md: Document audio metrics page with RTF explanation
- New page: 4_🎧_Audio_Metrics.py

Works with existing audio benchmark results structure:
results/audio-models/model/scenario-timestamp/stage/benchmarks.json

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fix AttributeError by using config.get_results_directory() method
instead of non-existent config.results_dir attribute.

Also set default path to audio-models instead of llm for the
audio metrics dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Key improvements to audio metrics dashboard:

1. Filter out LLM models automatically:
   - Only show results with audio_seconds > 0
   - Prevents LLM text generation results from appearing in audio dashboard

2. Fix scenario extraction:
   - Use metadata.scenario_name (audio format) instead of metadata.scenario
   - Correctly shows "transcription-throughput" instead of "unknown"

3. Add comprehensive metrics explanations:
   - Expandable "Understanding Audio Metrics" section
   - Explains Audio Throughput, RTF, Request Throughput, Efficiency
   - Defines percentiles (P50, P95, P99) in audio context
   - Clarifies test stages (sequential, concurrent-N, max-throughput)

4. Replace request throughput chart with total time chart:
   - NEW: "Total Time to Process N Files" chart
   - Shows wall-clock duration (lower = faster)
   - Directly answers: "How long to transcribe N audio files?"
   - Includes summary table with files/second and files/hour

5. Add detailed descriptions to all charts:
   - Audio Throughput: Explains audio_sec/wall_sec ratio
   - RTF: Explains < 1.0 = faster than real-time
   - Latency vs Duration: Explains scaling behavior

6. Reorder charts by user priority:
   - Total Time first (most requested metric)
   - Then Audio Throughput, RTF, etc.

Fixes AttributeError and improves dashboard usability.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Bug: Audio throughput was showing 0.00x because request_end_time is
an absolute Unix timestamp, not a duration.

Fix: Use benchmark-level 'duration' field (wall-clock time) instead
of request timestamps for audio throughput calculation.

Calculation:
- Before: total_audio_seconds / max(request_end_time) = tiny value
- After: total_audio_seconds / benchmark_duration = correct ratio

Example:
- 5 files × 3.5s each = 17.5s total audio
- Benchmark duration = 1.17s wall-clock
- Audio throughput = 17.5 / 1.17 = 14.96x (processing 15x faster than real-time)
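
The corrected calculation, using the numbers from the example, can be sketched as follows. The function name is hypothetical, and the Unix timestamp used to reproduce the old behavior is an arbitrary illustrative value.

```python
def audio_throughput(audio_seconds_per_request, benchmark_duration_s: float) -> float:
    """Seconds of audio processed per wall-clock second of the benchmark."""
    if benchmark_duration_s <= 0:
        raise ValueError("benchmark duration must be positive")
    return sum(audio_seconds_per_request) / benchmark_duration_s


if __name__ == "__main__":
    clips = [3.5] * 5  # 5 files x 3.5 s each = 17.5 s of audio

    # Old behavior: dividing by an absolute Unix timestamp gives a value near zero.
    illustrative_end_timestamp = 1_776_800_000.0
    print(f"before: {sum(clips) / illustrative_end_timestamp:.2f}x")

    # Fixed behavior: divide by the benchmark-level wall-clock duration.
    print(f"after:  {audio_throughput(clips, 1.17):.2f}x")
```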

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…alysis

Major dashboard improvements based on user feedback:

1. Test Dataset Overview (new section):
   - Audio Files: Total files processed
   - Avg Duration: Average file length (e.g., "3.5s per file")
   - Total Audio: Total audio content (auto-formats as s/min/h)
   - Total Data: Audio payload size in MB

2. Performance Overview (redesigned):
   - Audio Hours/Hour: Clearer than audio_sec/wall_sec (e.g., "10.0 h/h" = 10x real-time)
   - Max Files/Hour: Prominent capacity metric (e.g., "15,840 files/hour")
   - Avg RTF: Real-time factor
   - Success Rate: Request success percentage

3. NEW: Speedup vs Sequential chart:
   - Bar chart showing speedup relative to sequential baseline
   - Answers: "Is concurrent-8 worth it vs sequential?"
   - Includes summary table with speedup values
   - Reference line at 1.0x (sequential baseline)

4. NEW: Files/Hour chart:
   - Dedicated chart for files processed per hour
   - Better for capacity planning than requests/second
   - Answers: "How many files can we process overnight?"

5. Audio Hours/Hour chart (renamed from Audio Throughput):
   - Changed from "audio_sec/wall_sec" to "hours/hour"
   - More intuitive: "10.0 h/h" = process 10 hours in 1 hour
   - Added reference line at 1.0 h/h (real-time)

6. Chart reordering by user priority:
   - Total Time (most requested)
   - Speedup (shows concurrency benefit)
   - Files/Hour (capacity planning)
   - Audio Hours/Hour (clearer throughput)
   - RTF (real-time factor)
   - Latency vs Duration
   - Efficiency

7. Updated metrics explanations:
   - Added Audio Hours/Hour definition
   - Added Files/Hour definition
   - Added Speedup definition
   - All with examples and use cases
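
The capacity metrics listed above reduce to simple ratios. The sketch below is an assumption about how they might be computed, not the dashboard's code: function names are hypothetical, and speedup is taken to be the ratio of wall-clock durations for the same number of files.

```python
def files_per_hour(num_files: int, duration_s: float) -> float:
    """Files processed per hour at the measured rate."""
    return num_files / duration_s * 3600


def audio_hours_per_hour(total_audio_s: float, duration_s: float) -> float:
    """Hours of audio processed per wall-clock hour (1.0 h/h = real-time)."""
    return total_audio_s / duration_s


def speedup_vs_sequential(sequential_duration_s: float, stage_duration_s: float) -> float:
    """Speedup of a stage relative to the sequential baseline (1.0x = no gain)."""
    return sequential_duration_s / stage_duration_s


if __name__ == "__main__":
    # Illustrative numbers only: 100 files, 350 s of audio in total.
    print(f"{files_per_hour(100, 50.0):,.0f} files/hour")
    print(f"{audio_hours_per_hour(350.0, 50.0):.1f} h/h")
    print(f"{speedup_vs_sequential(120.0, 30.0):.1f}x vs sequential")
```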

Answers key user questions:
- "How long to transcribe N files?" → Total Time chart
- "Is concurrent processing faster?" → Speedup chart
- "What's our capacity?" → Files/Hour + Audio Hours/Hour
- "How much faster than real-time?" → Audio Hours/Hour (10.0 h/h = 10x)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…oard

Match LLM dashboard UX and add detailed audio file information:

1. RTF Percentile Selection (matching LLM dashboard):
   - Changed from radio buttons to checkboxes
   - Follows same pattern as LLM Client Metrics dashboard
   - Shows: Mean, P50, P95, P99 checkboxes side-by-side
   - Default: P95 + P99 selected (like LLM dashboard)
   - Warning if no percentiles selected
   - Consistent line styles: mean=solid, p50=dash, p95=dot, p99=dashdot

2. Audio File Format Information (new metrics):
   - Second row in Test Dataset Overview
   - Audio Format: MP3/WAV/FLAC (from metadata)
   - Sample Rate: 16kHz (from audio_sample_rate)
   - Bitrate: 64k (from audio_bitrate)
   - Dataset: librispeech_asr (from dataset_name)
   - All extracted from test-metadata.json

Benefits:
- Consistent UX across all dashboards
- Users can see exact audio file specifications tested
- Important for comparing different audio formats
- Helps contextualize performance results

Example display:
Row 1: Files | Avg Duration | Total Audio | Total Data
Row 2: MP3 | 16kHz | 64k | librispeech_asr

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>