feat: add comprehensive audio model benchmarking infrastructure by maryamtahhan · Pull Request #110 · redhat-et/vllm-cpu-perf-eval

maryamtahhan · 2026-04-21T13:00:33Z

Add complete test suite for audio models (ASR, translation, audio chat) on vLLM CPU deployments.

Audio Benchmark Playbook (audio-benchmark.yml):

Three-play structure: Setup/Validation, Start vLLM Server, Execute Benchmarks
Supports managed mode (Ansible starts vLLM) and external endpoint mode
CPU auto-calculation with control plane reservation (2 cores for vLLM scheduler, KV cache, async ops)
Audio-specific Tensor Parallel defaults: TP=1 for <64 cores, TP=2 for 64+ cores
OMP thread calculation: (requested_cores - 2) / tensor_parallel
Integrated vLLM metrics collection via common task files
Generates test-metadata.json for Streamlit dashboard integration
Containerized execution for both vLLM and GuideLLM

GuideLLM Audio Benchmark Role (benchmark_guidellm_audio):

GuideLLM v0.6.0 container execution with audio support
Builds command as argument list to avoid JSON escaping issues
Audio preprocessing via --data-preprocessors encode_media
HuggingFace cache volume mount for disk space management
Supports audio_transcriptions, audio_translations, chat_completions request formats
Comprehensive error handling and log capture

Created reusable task files used by both LLM and audio benchmarks:

tasks/start-vllm-metrics-collection.yml: Unified vLLM /metrics endpoint collection
tasks/stop-vllm-metrics-collection.yml: Unified metrics teardown
tasks/wait-for-vllm-ready.yml: Health check logic for vLLM readiness
tasks/setup-results-directory.yml: Results directory creation

Refactored llm-benchmark-auto.yml to use the common task files (removing about 55 lines).

Five comprehensive test scenarios plus a quick validation test:

transcription-throughput.yaml: PRIMARY TEST - "How long to transcribe N files?"
- Sequential baseline (100 files)
- Concurrent processing (2, 4, 8 workers)
- Maximum throughput test
- Metrics: total time, files/sec, audio_seconds/sec, real-time factor
transcription-latency.yaml: Per-request latency under load
- Light/medium/heavy load tests
- P50/P95/P99 latency percentiles
- Latency degradation analysis
audio-duration-scaling.yaml: Performance vs audio length
- Short (1-5s), medium (5-15s), long (15-30s), full-length clips
- Linear vs non-linear scaling analysis
- Per-audio-second processing cost
constant-rate-stress.yaml: Sustained load stability
- Sustained rates: 2, 5, 10 req/s (5 min each)
- Extended duration test (15 min)
- Memory leak detection
format-comparison.yaml: Audio format impact
- MP3 (64kbps, 128kbps)
- WAV (uncompressed, 8/16/48kHz)
- FLAC (lossless)
- Stereo vs mono comparison
quick-test.yaml: Fast validation (5 files)

model-matrix.yaml:

Whisper models: tiny (39M), small (244M), medium (769M)
Audio chat: ultravox-v0_5-llama-3_2-1b
Audio preprocessing presets: mp3_64k, mp3_128k, wav_16k, flac_16k
Recommended vLLM settings per model (dtype, max_model_len, kvcache)
Dataset configurations (LibriSpeech, Common Voice)

tests/audio-models/README.md (435 lines):

Quick start guide (managed and external modes)
Detailed usage for all test scenarios
CPU allocation breakdown with control plane explanation
Container mode documentation (no host installation required)
Advanced configuration (manual CPU tuning, custom datasets)
Results analysis and interpretation
Troubleshooting guide

models/audio-models/audio-models.md:

Model selection rationale
Supported endpoints (ASR, translation, chat)
Dataset descriptions (LibriSpeech, Common Voice)
Audio preprocessing configurations
CPU optimization parameters
Performance expectations
Test scenario mappings

Updated models/models.md:

Added audio models section
Links to audio-specific documentation
Integration with existing model matrix

CPU Allocation Formula:

Container Allocation:  cores 0-31 (32 total)
├── Control Plane:     2 cores (scheduler, KV cache, async ops)
└── Worker Threads:    30 cores (OMP_NUM_THREADS=30)
    └── Per TP rank:   30/TP cores

OMP_NUM_THREADS = (requested_cores - 2) / tensor_parallel
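The allocation rule above can be sketched in a few lines of Python. This is a minimal illustration of the formula as stated in the description, not the playbook's actual logic: the function name `plan_cpu_allocation` and the returned keys are hypothetical, and floor division is an assumption for core counts that do not split evenly across ranks.

```python
from typing import Optional

CONTROL_PLANE_CORES = 2  # reserved for scheduler, KV cache, async ops (per the description)


def plan_cpu_allocation(requested_cores: int, tensor_parallel: Optional[int] = None) -> dict:
    """Hypothetical helper mirroring the allocation formula in the description."""
    if requested_cores <= CONTROL_PLANE_CORES:
        raise ValueError("requested_cores must exceed the control plane reservation")
    if tensor_parallel is None:
        # Audio-specific default: TP=1 below 64 cores, TP=2 at 64 cores or more.
        tensor_parallel = 1 if requested_cores < 64 else 2
    worker_cores = requested_cores - CONTROL_PLANE_CORES
    # Assumption: floor division when worker cores do not split evenly across ranks.
    omp_num_threads = worker_cores // tensor_parallel
    return {
        "tensor_parallel": tensor_parallel,
        "control_plane_cores": CONTROL_PLANE_CORES,
        "worker_cores": worker_cores,
        "omp_num_threads": omp_num_threads,
    }


if __name__ == "__main__":
    # 32 requested cores: 2 reserved, 30 worker threads, TP=1.
    print(plan_cpu_allocation(32))
```

With 32 requested cores this reproduces the 2 + 30 split shown in the diagram.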

Container Execution:

vLLM container: ghcr.io/vllm-project/vllm-cpu-env:latest
GuideLLM container: ghcr.io/vllm-project/guidellm:v0.6.0
HuggingFace cache volume: host mount to avoid disk space issues
SELinux context: :z flag for proper permissions
Cache directory: 0777 mode for container write access

Metrics Collection:

vLLM server metrics from /metrics endpoint
GuideLLM client-side metrics (latency, throughput, audio-specific)
Audio-specific metrics: audio_seconds, audio_samples, audio_bytes, audio_tokens
Real-time factor: processing_time / audio_duration
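As a small illustration of the real-time factor defined above, the following sketch computes it for a single request. The function name and argument names are hypothetical; only the ratio itself comes from the description.

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = processing_time / audio_duration; values below 1.0 mean faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    return processing_time_s / audio_duration_s


if __name__ == "__main__":
    # An 8-second clip transcribed in 2 seconds.
    print(real_time_factor(2.0, 8.0))
```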

Successfully validated with quick-test scenario:

Container image pull and execution
HuggingFace dataset download (LibriSpeech)
Audio preprocessing (MP3 encoding)
vLLM server startup and health checks
Metrics collection
Results JSON generation


coderabbitai · 2026-04-21T13:00:42Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

LibriSpeech ASR dataset is ~350GB across all splits, causing timeouts during dataset download for quick validation tests. Switch quick-test to use hf-internal-testing/librispeech_asr_dummy which is a tiny test dataset (<1MB) designed for rapid ASR testing. This allows the test to complete in seconds instead of failing after multi-minute downloads. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

When using delegate_to with delegate_facts, variables are stored in hostvars['localhost'] and must be accessed through that path. Fixed references to:
- benchmark_end_time
- test_duration_seconds (in format task)

These variables are now correctly accessed via hostvars['localhost'] to prevent 'undefined' errors during test finalization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Add comprehensive documentation explaining the difference between:

1. **Offline Batch Processing** (e.g., "I have N audio files, transcribe them all ASAP")
   - Focus: Total completion time, maximum throughput
   - Use cases: Post-call transcription, media archive processing
   - Profiles: synchronous (sequential), throughput (max capacity)
2. **Online Serving** (e.g., "How many concurrent users can use my transcription API?")
   - Focus: Per-request latency (P50, P95, P99), concurrent user capacity
   - Use cases: Real-time transcription API, voice assistant backend
   - Profiles: concurrent (simulates N concurrent users), constant rate

Key clarifications:
- 'concurrent' profile with 'rate: N' simulates N concurrent USERS, NOT parallel batch processing of N files
- Sequential + max-throughput stages answer offline batch questions
- Concurrent-N stages answer online serving questions
- transcription-throughput covers BOTH patterns in one test

Added:
- README section "Understanding Test Profiles" with analogies
- README section "Which Test Should I Run?" with use case mapping
- Per-scenario serving pattern labels in test table
- Detailed explanation for each test scenario
- Inline comments in transcription-throughput.yaml clarifying stages

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

…etup Add `audio_num_files` parameter to allow dynamic override of the number of audio files processed per benchmark stage. This enables quick testing with fewer files (e.g., 10) or comprehensive benchmarks with more files (e.g., 100+).

Changes:
- Add `audio_num_files` parameter validation (1-10000 range)
- Override `max_requests` in all stages when parameter is provided
- Change default `vllm_dtype` from "float16" to "auto" for better model compatibility (aligns with LLM benchmark defaults)
- Fix directory permissions from 0755 to 0777 to prevent container permission errors when downloading models
- Update documentation with parameter usage examples
- Add inventory file requirement to all command examples

Usage:
    ansible-playbook -i inventory/hosts.yml audio-benchmark.yml \
      -e "test_model=openai/whisper-tiny" \
      -e "test_scenario=transcription-throughput" \
      -e "requested_cores=32" \
      -e "audio_num_files=10"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
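The validation and override described in this commit could look roughly like the sketch below. It is a Python illustration only (the playbook itself is Ansible); the function name `apply_audio_num_files` and the stage dictionary layout are assumptions, while the 1-10000 range and the `max_requests` override come from the commit message.

```python
import copy
from typing import Optional


def apply_audio_num_files(stages: list, audio_num_files: Optional[int] = None) -> list:
    """Return a copy of the stages with max_requests overridden when a value is given."""
    overridden = copy.deepcopy(stages)
    if audio_num_files is None:
        return overridden  # no override requested, keep scenario defaults
    if not 1 <= audio_num_files <= 10000:
        raise ValueError("audio_num_files must be between 1 and 10000")
    for stage in overridden:
        stage["max_requests"] = audio_num_files
    return overridden


if __name__ == "__main__":
    # Hypothetical stage definitions, loosely modelled on the scenario files.
    stages = [
        {"name": "sequential", "max_requests": 100},
        {"name": "concurrent-2", "max_requests": 100},
    ]
    print(apply_audio_num_files(stages, 10))
```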

- Add :Z flag to volume mounts for SELinux compatibility on RHEL/Amazon Linux
- Include HF_TOKEN environment variable for model downloads
- Set up HuggingFace token role before starting vLLM server

Fixes: PermissionError when downloading models to mounted volumes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

The VLLM_CPU_KVCACHE_SPACE environment variable expects just the numeric value (e.g., 2 for 2GiB), not the full string with units. Use the extract_size_value filter to convert '2GiB' -> 2. Fixes: pydantic ValidationError for invalid literal '2GiB' Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
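The conversion from '2GiB' to 2 can be illustrated with a short Python function. This is a stand-in written for illustration: the repository's real `extract_size_value` filter is not shown in this PR text and may behave differently (for example around fractional sizes or unit conversion, which this sketch does not attempt).

```python
import re


def extract_size_value(size) -> int:
    """Return the leading integer of a size string such as '2GiB' (illustrative stand-in)."""
    match = re.fullmatch(r"\s*(\d+)\s*[A-Za-z]*\s*", str(size))
    if match is None:
        raise ValueError(f"unrecognised size value: {size!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # The numeric part is what VLLM_CPU_KVCACHE_SPACE expects, per the commit message.
    print(extract_size_value("2GiB"))
```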

For single tensor-parallel configurations (TP=1), vLLM's CPU backend manages threads automatically and doesn't expect OMP_NUM_THREADS to be set. Setting it causes 'RuntimeError: Expected positive number of threads'. Only set OMP_NUM_THREADS and VLLM_CPU_OMP_THREADS_BIND when TP > 1, matching the behavior of the LLM benchmark playbook. Fixes: RuntimeError during engine core initialization for TP=1 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Reserve 2 CPU cores for vLLM's control plane (scheduler, KV cache manager, async operations) by setting VLLM_CPU_NUM_OF_RESERVED_CPU=2. This prevents worker threads from saturating all cores and starving the control plane. Without this, performance degrades as the control plane competes with worker threads for CPU time.

Also updated CPU configuration display to show:
- Reserved cores configuration
- Auto-managed threads for TP=1 (no OMP override)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

…ction Add detailed comments explaining the critical interaction between VLLM_CPU_NUM_OF_RESERVED_CPU and VLLM_CPU_OMP_THREADS_BIND:

**For TP=1 (our default):**
- VLLM_CPU_OMP_THREADS_BIND left unset (defaults to 'auto')
- VLLM_CPU_NUM_OF_RESERVED_CPU=2 works properly
- vLLM auto-manages thread binding and respects reserved cores
- Default behavior: world_size==1 reserves 0 cores (bad for performance!)
- We explicitly set 2 to prevent worker saturation

**For TP>1:**
- VLLM_CPU_OMP_THREADS_BIND set explicitly (e.g., '0-31|32-63')
- OMP_NUM_THREADS set explicitly
- Setting explicit binding DISABLES VLLM_CPU_NUM_OF_RESERVED_CPU
- Reserved cores must be accounted for in the binding string

This ensures optimal CPU allocation without control plane starvation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
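For the TP>1 case, one possible way to build a binding string of the '0-31|32-63' form while leaving the reserved cores out is sketched below. The layout is an assumption made for illustration: the commit does not say which cores the playbook leaves unbound, and here the reserved cores are simply the highest-numbered ones. The function name is hypothetical.

```python
def build_thread_binding(total_cores: int, tensor_parallel: int, reserved_cores: int = 2):
    """Return (binding_string, threads_per_rank) with contiguous core ranges per TP rank."""
    worker_cores = total_cores - reserved_cores
    per_rank = worker_cores // tensor_parallel  # assumption: any remainder stays unbound
    if per_rank < 1:
        raise ValueError("not enough cores for the requested tensor parallelism")
    ranges = []
    for rank in range(tensor_parallel):
        start = rank * per_rank
        end = start + per_rank - 1
        ranges.append(f"{start}-{end}")
    # Ranks are separated by '|', matching the example in the commit message.
    return "|".join(ranges), per_rank


if __name__ == "__main__":
    binding, threads = build_thread_binding(64, 2)
    print(binding, threads)
```

With `reserved_cores=0` the same function yields the '0-31|32-63' example from the commit message; with two reserved cores, cores 62 and 63 stay outside every rank's range.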

The vllm_control_plane_cores variable was only defined in the localhost play but needed in the vllm-server play for VLLM_CPU_NUM_OF_RESERVED_CPU. Add it to the vllm-server play's vars section so it's available when building environment variables. Fixes: 'vllm_control_plane_cores' is undefined error Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Update default vLLM container image from v0.18.0 to v0.19.0 for better audio model support and bug fixes. Users can still override with: export VLLM_CONTAINER_IMAGE=<custom-image> Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

…hput GuideLLM's ThroughputProfile requires a 'rate' parameter. Set it to 'inf' (infinite) to send requests as fast as possible, which is the intended behavior for finding maximum throughput capacity. Fixes: ValueError: ThroughputProfile requires a rate parameter Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

…hmark The Ansible role was only passing --rate to GuideLLM for concurrent and constant profiles, but not for throughput profile. This caused the max-throughput stage to fail with "ThroughputProfile requires a rate parameter". Added rate parameter handling for throughput profile in both container mode (command args list) and host mode (command string) sections. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

…default Changed max-throughput stage rate from "inf" to 50 concurrent streams:
- GuideLLM's throughput profile requires finite rate (max_concurrency)
- Default of 50 concurrent streams is reasonable for most deployments
- Can be increased in scenario YAML for load balancer/multi-instance setups
- Not exposed as playbook parameter since it's deployment-specific

This addresses the error:
ValidationError: Input should be a finite number [type=finite_number, input_value=inf]

Updated documentation to explain:
- Rate parameter represents concurrent request streams for throughput profile
- How to adjust for multi-instance deployments behind load balancers
- Why it's not tied to core count (external mode doesn't know server config)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Changed stage_results_path to include stage suffix, creating:
results/.../transcription-throughput/sequential/benchmarks.json
results/.../transcription-throughput/concurrent-2/benchmarks.json
etc.

Benefits:
- Each stage preserves GuideLLM's default output filenames
- No overwriting between stages
- Cleaner organization for multi-stage benchmarks
- Easier to add stage-specific artifacts later

This fixes the issue where all stages were writing to the same directory and overwriting each other's results.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Added fetch tasks to copy benchmarks.json and benchmarks.csv from the load generator host back to the controller (localhost) after each stage completes.

Changes:
- Check for benchmark files on load generator before fetching
- Display found files for debugging
- Fetch benchmarks.json and benchmarks.csv to controller
- Handle both container mode and host mode execution
- Use failed_when: false to handle cases where files don't exist

Without this fix, benchmark results remained on the load generator host and were not accessible from the controller where the playbook was run.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Changed results path from:
results/audio-models/{model}/{scenario}/
to:
results/audio-models/{model}/{scenario}-{timestamp}/

This matches the LLM benchmark directory structure and prevents results from being overwritten on subsequent test runs.

Example:
results/audio-models/openai__whisper-tiny/transcription-throughput-20260422-110932/
├── sequential/
├── concurrent-2/
├── concurrent-4/
├── concurrent-8/
├── max-throughput/
├── vllm-metrics.json
└── test-metadata.json

Updated documentation to reflect the new structure and explain the timestamp-based organization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
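The directory naming in the example can be reproduced with a short Python sketch. The function name is hypothetical, and two details are inferred from the example path rather than stated in the commit: the `/` in the model ID becomes `__`, and the timestamp uses the `YYYYMMDD-HHMMSS` layout.

```python
from datetime import datetime
from pathlib import PurePosixPath


def results_directory(model: str, scenario: str, started_at: datetime,
                      root: str = "results/audio-models") -> PurePosixPath:
    """Build results/audio-models/{model}/{scenario}-{timestamp} for one test run."""
    model_dir = model.replace("/", "__")  # inferred from 'openai__whisper-tiny'
    timestamp = started_at.strftime("%Y%m%d-%H%M%S")  # inferred from '20260422-110932'
    return PurePosixPath(root) / model_dir / f"{scenario}-{timestamp}"


if __name__ == "__main__":
    run_dir = results_directory("openai/whisper-tiny", "transcription-throughput",
                                datetime(2026, 4, 22, 11, 9, 32))
    # Each stage then writes into its own subdirectory of the run directory.
    print(run_dir / "sequential" / "benchmarks.json")
```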

GuideLLM requires different encode kwargs structure depending on request type:
- audio_transcriptions/translations: flat structure {"format": "mp3", ...}
- chat_completions: nested structure {"audio": {"format": "mp3", ...}}

The previous implementation always used nested structure, which caused GuideLLM to fail silently when generating requests for transcription/translation tasks - it would initialize but process 0 requests.

Now conditionally wraps kwargs in "audio" only for chat_completions.

This fixes the issue where all benchmark stages completed in <1 second with 0 requests processed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
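The conditional wrapping can be sketched in Python as follows. This only mirrors the flat versus nested shapes described in the commit message; the function name is hypothetical, and the `bitrate` key in the usage example is illustrative rather than a confirmed GuideLLM option.

```python
def build_encode_kwargs(request_type: str, audio_kwargs: dict) -> dict:
    """Wrap audio kwargs in an 'audio' key only for chat_completions requests."""
    if request_type == "chat_completions":
        return {"audio": dict(audio_kwargs)}  # nested structure
    if request_type in ("audio_transcriptions", "audio_translations"):
        return dict(audio_kwargs)  # flat structure
    raise ValueError(f"unsupported request type: {request_type}")


if __name__ == "__main__":
    kwargs = {"format": "mp3", "bitrate": "64k"}  # 'bitrate' is an illustrative key
    print(build_encode_kwargs("audio_transcriptions", kwargs))
    print(build_encode_kwargs("chat_completions", kwargs))
```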

…taset GuideLLM was trying to download the entire LibriSpeech test split (2620 samples) even when only benchmarking 5 requests. This caused:
- Very long dataset download times (downloading all 48 tar files)
- Benchmarks completing in 0 seconds with 0 requests (timeout/failure)

Solution: Add --data-samples parameter set to 2x max_requests:
- max_requests=5 → load 10 samples
- max_requests=100 → load 200 samples

This provides enough samples for the benchmark while avoiding unnecessary downloads. The 2x buffer ensures we have extra samples in case of any preprocessing failures.

Applied to both container mode (command args) and host mode (command string).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Build custom vLLM image with audio deps (librosa, soundfile, ffmpeg-python) for v0.19.0
- Add load generator connectivity check before running benchmarks
- Add audio endpoint pre-flight check to verify vLLM supports the request type
- Stop all vLLM containers (audio/LLM/embedding) before starting new ones

Resolves audio benchmark failures caused by:
1. Missing audio dependencies in v0.19.0 image (predates vllm[audio] extras)
2. Network/firewall issues blocking load generator → vLLM server connections
3. Multiple vLLM containers running simultaneously causing port conflicts

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

- Add docs/audio-benchmarking.md with complete guide:
  - Quick start and prerequisites
  - All test scenarios explained
  - Understanding results and metrics
  - Advanced configuration (CPU, models, non-container mode)
  - Troubleshooting section covering all common issues:
    - Network connectivity (AWS security groups, private IPs)
    - Missing audio dependencies
    - Multiple containers
    - Dataset download issues
    - GuideLLM backend validation
    - Low throughput diagnosis
  - Best practices for production use
- Update main README.md:
  - Add Audio Models test suite to quick start
  - Include audio-models in repository structure
  - Add to models and documentation sections
  - Link to comprehensive audio benchmarking guide

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Add new Streamlit dashboard page (🎧 Audio Metrics) with audio-specific performance analysis:

Audio-Specific Metrics:
- Audio throughput (audio_seconds/wall_clock_second)
- Real-Time Factor (RTF = processing_time / audio_duration)
  - RTF < 1.0 = faster than real-time
  - RTF = 1.0 = real-time processing
  - RTF > 1.0 = slower than real-time
- Request throughput (files/second)
- Per-core efficiency

Features:
- Extracts audio metadata from GuideLLM benchmarks.json:
  - audio_seconds (duration)
  - audio_tokens
  - audio_samples
  - audio_bytes
- Calculates RTF percentiles (mean, p50, p95, p99) per request
- Aggregates audio metrics across stages
- Visualizations:
  - Audio throughput by stage and model
  - RTF trends with percentile overlay
  - Latency vs audio duration scaling
  - Request throughput comparison
  - Per-core efficiency
- CSV export for external analysis
- Supports stage-based filtering (sequential, concurrent-N, max-throughput)

Updates:
- Home.py: Add audio metrics to dashboard overview
- README.md: Document audio metrics page with RTF explanation
- New page: 4_🎧_Audio_Metrics.py

Works with existing audio benchmark results structure:
results/audio-models/model/scenario-timestamp/stage/benchmarks.json

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
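A rough sketch of the per-request RTF summary (mean, p50, p95, p99) is shown below. It is not the dashboard's code: the record fields `processing_time` and `audio_seconds` are assumed names, and the nearest-rank percentile is one of several common definitions, chosen here because it needs no interpolation.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_rtf(requests):
    """Compute per-request RTF and summarize it (field names are assumptions)."""
    rtfs = [
        r["processing_time"] / r["audio_seconds"]
        for r in requests
        if r["audio_seconds"] > 0  # skip records that carry no audio
    ]
    if not rtfs:
        raise ValueError("no requests with audio_seconds > 0")
    return {
        "mean": mean(rtfs),
        "p50": nearest_rank_percentile(rtfs, 50),
        "p95": nearest_rank_percentile(rtfs, 95),
        "p99": nearest_rank_percentile(rtfs, 99),
    }


if __name__ == "__main__":
    sample = [{"processing_time": float(i), "audio_seconds": 10.0} for i in range(1, 11)]
    print(summarize_rtf(sample))
```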

Fix AttributeError by using config.get_results_directory() method instead of non-existent config.results_dir attribute. Also set default path to audio-models instead of llm for the audio metrics dashboard. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Key improvements to audio metrics dashboard:

1. Filter out LLM models automatically:
   - Only show results with audio_seconds > 0
   - Prevents LLM text generation results from appearing in audio dashboard
2. Fix scenario extraction:
   - Use metadata.scenario_name (audio format) instead of metadata.scenario
   - Correctly shows "transcription-throughput" instead of "unknown"
3. Add comprehensive metrics explanations:
   - Expandable "Understanding Audio Metrics" section
   - Explains Audio Throughput, RTF, Request Throughput, Efficiency
   - Defines percentiles (P50, P95, P99) in audio context
   - Clarifies test stages (sequential, concurrent-N, max-throughput)
4. Replace request throughput chart with total time chart:
   - NEW: "Total Time to Process N Files" chart
   - Shows wall-clock duration (lower = faster)
   - Directly answers: "How long to transcribe N audio files?"
   - Includes summary table with files/second and files/hour
5. Add detailed descriptions to all charts:
   - Audio Throughput: Explains audio_sec/wall_sec ratio
   - RTF: Explains < 1.0 = faster than real-time
   - Latency vs Duration: Explains scaling behavior
6. Reorder charts by user priority:
   - Total Time first (most requested metric)
   - Then Audio Throughput, RTF, etc.

Fixes AttributeError and improves dashboard usability.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Bug: Audio throughput was showing 0.00x because request_end_time is an absolute Unix timestamp, not a duration.

Fix: Use benchmark-level 'duration' field (wall-clock time) instead of request timestamps for audio throughput calculation.

Calculation:
- Before: total_audio_seconds / max(request_end_time) = tiny value
- After: total_audio_seconds / benchmark_duration = correct ratio

Example:
- 5 files × 3.5s each = 17.5s total audio
- Benchmark duration = 1.17s wall-clock
- Audio throughput = 17.5 / 1.17 = 14.96x (processing 15x faster than real-time)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
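The corrected calculation can be checked with a few lines of Python. The function and argument names are hypothetical; the inputs are the numbers from the example in this commit message.

```python
def audio_throughput(audio_seconds_per_request, benchmark_duration_s: float) -> float:
    """Total audio seconds processed per wall-clock second of the benchmark."""
    if benchmark_duration_s <= 0:
        raise ValueError("benchmark_duration_s must be positive")
    # Divide by the benchmark's wall-clock duration, not by an absolute timestamp.
    return sum(audio_seconds_per_request) / benchmark_duration_s


if __name__ == "__main__":
    # 5 files of 3.5 s each, processed in 1.17 s of wall-clock time.
    print(round(audio_throughput([3.5] * 5, 1.17), 2))
```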

…alysis Major dashboard improvements based on user feedback:

1. Test Dataset Overview (new section):
   - Audio Files: Total files processed
   - Avg Duration: Average file length (e.g., "3.5s per file")
   - Total Audio: Total audio content (auto-formats as s/min/h)
   - Total Data: Audio payload size in MB
2. Performance Overview (redesigned):
   - Audio Hours/Hour: Clearer than audio_sec/wall_sec (e.g., "10.0 h/h" = 10x real-time)
   - Max Files/Hour: Prominent capacity metric (e.g., "15,840 files/hour")
   - Avg RTF: Real-time factor
   - Success Rate: Request success percentage
3. NEW: Speedup vs Sequential chart:
   - Bar chart showing speedup relative to sequential baseline
   - Answers: "Is concurrent-8 worth it vs sequential?"
   - Includes summary table with speedup values
   - Reference line at 1.0x (sequential baseline)
4. NEW: Files/Hour chart:
   - Dedicated chart for files processed per hour
   - Better for capacity planning than requests/second
   - Answers: "How many files can we process overnight?"
5. Audio Hours/Hour chart (renamed from Audio Throughput):
   - Changed from "audio_sec/wall_sec" to "hours/hour"
   - More intuitive: "10.0 h/h" = process 10 hours in 1 hour
   - Added reference line at 1.0 h/h (real-time)
6. Chart reordering by user priority:
   - Total Time (most requested)
   - Speedup (shows concurrency benefit)
   - Files/Hour (capacity planning)
   - Audio Hours/Hour (clearer throughput)
   - RTF (real-time factor)
   - Latency vs Duration
   - Efficiency
7. Updated metrics explanations:
   - Added Audio Hours/Hour definition
   - Added Files/Hour definition
   - Added Speedup definition
   - All with examples and use cases

Answers key user questions:
- "How long to transcribe N files?" → Total Time chart
- "Is concurrent processing faster?" → Speedup chart
- "What's our capacity?" → Files/Hour + Audio Hours/Hour
- "How much faster than real-time?" → Audio Hours/Hour (10.0 h/h = 10x)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
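The speedup and files/hour figures reduce to simple ratios, sketched here in Python. Function names and the sample timings are hypothetical; the sketch assumes each stage processes the same number of files, which is what makes a time-based speedup meaningful.

```python
def files_per_hour(num_files: int, total_time_s: float) -> float:
    """Capacity figure: files processed per hour at the measured pace."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    return num_files / total_time_s * 3600


def speedup_vs_sequential(stage_times_s: dict) -> dict:
    """Speedup of each stage relative to the 'sequential' baseline (1.0 = no gain)."""
    baseline = stage_times_s["sequential"]
    return {stage: baseline / t for stage, t in stage_times_s.items()}


if __name__ == "__main__":
    # Made-up wall-clock times for 100 files per stage.
    times = {"sequential": 100.0, "concurrent-4": 25.0, "concurrent-8": 20.0}
    print(speedup_vs_sequential(times))
    print(files_per_hour(100, times["concurrent-8"]))
```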

…oard Match LLM dashboard UX and add detailed audio file information:

1. RTF Percentile Selection (matching LLM dashboard):
   - Changed from radio buttons to checkboxes
   - Follows same pattern as LLM Client Metrics dashboard
   - Shows: Mean, P50, P95, P99 checkboxes side-by-side
   - Default: P95 + P99 selected (like LLM dashboard)
   - Warning if no percentiles selected
   - Consistent line styles: mean=solid, p50=dash, p95=dot, p99=dashdot
2. Audio File Format Information (new metrics):
   - Second row in Test Dataset Overview
   - Audio Format: MP3/WAV/FLAC (from metadata)
   - Sample Rate: 16kHz (from audio_sample_rate)
   - Bitrate: 64k (from audio_bitrate)
   - Dataset: librispeech_asr (from dataset_name)
   - All extracted from test-metadata.json

Benefits:
- Consistent UX across all dashboards
- Users can see exact audio file specifications tested
- Important for comparing different audio formats
- Helps contextualize performance results

Example display:
Row 1: Files | Avg Duration | Total Audio | Total Data
Row 2: MP3 | 16kHz | 64k | librispeech_asr

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan and others added 20 commits April 21, 2026 14:07

maryamtahhan force-pushed the feat/audio-tests branch from 7959946 to a8863c6 Compare April 23, 2026 11:10

maryamtahhan and others added 7 commits April 23, 2026 12:44

maryamtahhan force-pushed the feat/audio-tests branch from 23811ee to 54822d5 Compare April 23, 2026 11:56

maryamtahhan wants to merge 28 commits into redhat-et:main from maryamtahhan:feat/audio-tests