## News

- [02/25/2025] Added support for the OWSM (Open Whisper-style Speech Models) ASR system
- [12/10/2024] Published the BIGOS benchmark paper in the NeurIPS 2024 Datasets and Benchmarks Track
- [12/01/2024] Released updated PolEval evaluation results on the Polish ASR leaderboard

## About

BIGOS (Benchmark Intended Grouping of Open Speech) is a framework for evaluating Automatic Speech Recognition (ASR) systems on Polish-language datasets. It provides tools for:
- Curating speech datasets in a standardized format
- Generating ASR transcriptions from various engines (commercial and open-source)
- Evaluating transcription quality with standard metrics
- Visualizing and analyzing results
Key Benefits: BIGOS standardizes evaluation across multiple ASR systems and datasets, enabling fair comparison and quantitative analysis of ASR performance on Polish speech.

## Prerequisites

- Python 3.10+
- Required system packages:

  ```sh
  sudo apt-get install sox ffmpeg   # Ubuntu/Debian
  brew install sox ffmpeg           # macOS
  ```
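A quick way to sanity-check these prerequisites (an illustrative helper, not part of the repository):

```python
# Minimal environment check: verifies the Python version and that
# sox and ffmpeg are available on PATH.
import shutil
import sys

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
for tool in ("sox", "ffmpeg"):
    assert shutil.which(tool), f"{tool} not found on PATH"
print("Environment looks OK")
```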

## Installation

- Clone the repository:

  ```sh
  git clone https://github.com/your-username/pl-asr-bigos-tools.git
  cd pl-asr-bigos-tools
  ```

- Install Python dependencies:

  ```sh
  pip install -r requirements.txt
  ```

- Configure your environment:
  - Copy `config/user-specific/template.ini` to `config/user-specific/config.ini`
  - Edit the file with your API keys and paths
  - Validate your configuration with:

    ```sh
    make test-force-hyp
    make test
    ```
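The user config is a standard INI file, so it can be inspected with Python's built-in `configparser`; the actual section and key names live in `template.ini`:

```python
# Read the user-specific config and list its sections and keys.
# Section/key names vary; see config/user-specific/template.ini.
import configparser

cfg = configparser.ConfigParser()
cfg.read("config/user-specific/config.ini")

for section in cfg.sections():
    print(f"[{section}]")
    for key, value in cfg[section].items():
        print(f"  {key} = {value}")
```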

## Usage

The main functionality is accessible through the Makefile:

```sh
# Run evaluation on the BIGOS dataset
make eval-e2e EVAL_CONFIG=bigos

# Run evaluation on the PELCRA dataset
make eval-e2e EVAL_CONFIG=pelcra

# Generate hypotheses for a specific configuration
make hyp-gen EVAL_CONFIG=bigos

# Calculate statistics for cached hypotheses
make hyps-stats EVAL_CONFIG=bigos

# Force regeneration of evaluation data
make eval-e2e-force EVAL_CONFIG=bigos
```

## Architecture

The BIGOS benchmark system follows a modular architecture:
- Dataset Management: Curated datasets in BIGOS format
- ASR Systems: Standardized interface for diverse ASR engines
- Hypothesis Generation: Processing audio through ASR systems
- Evaluation: Calculating metrics and generating reports
- Analysis: Tools for visualizing and interpreting results

## Evaluation Workflow

The evaluation workflow consists of the following stages:
- Preparation: Loading datasets and preparing processing pipelines
- Hypothesis Generation: Creating transcriptions using specified ASR systems
- Evaluation: Calculating metrics such as WER, CER, and MER (a metric computation sketch follows this list)
- Analysis: Reporting and visualization of results
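As an illustration of the metrics involved (not the repository's internal code), word, character, and match error rates can be computed with the `jiwer` library:

```python
# Illustrative metric computation with jiwer; the benchmark reports
# the same metrics per ASR system and dataset.
import jiwer

reference = "ala ma kota"          # ground-truth transcription
hypothesis = "ala ma kota i psa"   # ASR output

print("WER:", jiwer.wer(reference, hypothesis))   # word error rate
print("CER:", jiwer.cer(reference, hypothesis))   # character error rate
print("MER:", jiwer.mer(reference, hypothesis))   # match error rate
```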

## Adding a New ASR System

- Create a new class in `scripts/asr_eval_lib/asr_systems/` based on the template
- Register your system in `scripts/asr_eval_lib/asr_systems/__init__.py`
- Update configuration files in `config/eval-run-specific/`
Example of registering a new ASR system:

```python
# In scripts/asr_eval_lib/asr_systems/__init__.py
from .your_new_asr_system import YourNewASRSystem

def asr_system_factory(system, model, config):
    if system == 'existing_system':
        ...  # existing systems are handled here
    elif system == 'your_system':
        # Configuration for your new system
        return YourNewASRSystem(system, model, config)
    # More systems...
    else:
        raise ValueError(f"Unknown ASR system: {system}")
```
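A minimal sketch of what the new class itself might look like; the base class and method names here are assumptions, so follow the actual template in `scripts/asr_eval_lib/asr_systems/`:

```python
# scripts/asr_eval_lib/asr_systems/your_new_asr_system.py
# Hypothetical sketch: BaseASRSystem and generate_asr_hyp are assumed
# names, not necessarily the repository's real interface.
from .base_asr_system import BaseASRSystem  # assumed base class

class YourNewASRSystem(BaseASRSystem):
    def __init__(self, system: str, model: str, config: dict):
        super().__init__(system, model)
        self.config = config

    def generate_asr_hyp(self, speech_file: str) -> str:
        # Send the audio file to your engine and return the transcription.
        raise NotImplementedError
```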

## Adding a New Dataset

- Open an existing config file (e.g., `config/eval-run-specific/bigos.json`)
- Save a modified version as `config/eval-run-specific/<dataset_name>.json` (a scripted sketch follows this list)
- Ensure your dataset follows the BIGOS format and is publicly available
- Run the evaluation with:

  ```sh
  make eval-e2e EVAL_CONFIG=<dataset_name>
  ```
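For larger changes it can be convenient to derive the new config programmatically. This is purely illustrative; the `"dataset"` key below is an assumption, so check `bigos.json` for the actual schema:

```python
# Copy an existing eval-run config and adapt it for a new dataset.
import json

with open("config/eval-run-specific/bigos.json") as f:
    cfg = json.load(f)

cfg["dataset"] = "my-new-dataset"  # hypothetical key name

with open("config/eval-run-specific/my-new-dataset.json", "w") as f:
    json.dump(cfg, f, indent=2)
```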

## Generating Synthetic Test Sets

To generate a synthetic test set:

```sh
make tts-set-gen TTS_SET=<tts_set_name>
```

Replace `<tts_set_name>` with the appropriate configuration name (e.g., `amu-med-all`).

## Displaying Dataset Manifests

To display a manifest for a specific dataset and split:

```sh
make sde-manifest DATASET_SUBSET=<subset_name> SPLIT=<split_name>
```

## Repository Structure

- `config/` - Configuration files
  - `common/` - Shared configuration
  - `eval-run-specific/` - ASR evaluation configuration
  - `tts-set-specific/` - TTS generation configuration
  - `user-specific/` - User-specific settings (API keys, paths)
- `scripts/` - Main implementation code
  - `asr_eval_lib/` - ASR evaluation framework
    - `asr_systems/` - ASR system implementations
    - `eval_utils/` - Evaluation metrics and utilities
    - `prefect_flows/` - Prefect workflow definitions
  - `tts_gen_lib/` - Speech synthesis for test data
  - `utils/` - Common utilities
- `data/` - Working directory for datasets and results (gitignored)

## Supported ASR Systems

The benchmark currently supports the following ASR systems:
- Google Cloud Speech-to-Text (v1 and v2)
- Microsoft Azure Speech-to-Text
- OpenAI Whisper (Cloud and Local)
- AssemblyAI
- NVIDIA NeMo
- Facebook MMS
- Facebook Wav2Vec
- OWSM (Open Whisper-style Speech Models)

## Supported Datasets

The framework is designed to work with datasets in the BIGOS format:
- BIGOS V2 - Primarily read speech
- PELCRA for BIGOS - Primarily conversational speech
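The BIGOS-format datasets can be inspected with the Hugging Face `datasets` library. A minimal sketch, assuming the dataset ID below (and that a subset/config name may be required); check the Hugging Face Hub for the exact identifiers:

```python
# Illustrative only: the dataset ID is an assumption, not verified here.
from datasets import load_dataset

bigos = load_dataset("amu-cai/pl-asr-bigos-v2", split="test")
print(bigos)            # dataset summary (number of rows)
print(bigos.features)   # column names and types
```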

## Troubleshooting

- API Key Access: If you encounter authentication errors, verify your API keys in `config.ini`
- Missing Dependencies: If you experience import errors, run `pip install -r requirements.txt`
- Permission Issues: For file access errors, check directory permissions in your configuration
- Disk Space: ASR hypothesis caching requires substantial disk space; monitor usage in the `data/` directory

## Roadmap

The following TODO items represent ongoing development priorities:
- Add detailed docstrings to all classes and functions
- Create a comprehensive API reference
- Add examples for extending with new metrics
- Document the data format specification in detail
- Add type hints to improve code readability and IDE support
- Implement more robust error handling in ASR system implementations
- Add logging throughout the codebase (replace print statements)
- Standardize configuration approach (choose either JSON or INI consistently)
- Add support for new ASR systems (e.g., Meta Seamless, Amazon Transcribe)
- Implement additional evaluation metrics (e.g., semantic metrics)
- Create a web interface for results visualization
- Add support for languages beyond Polish
- Implement audio preprocessing options (e.g., noise reduction, normalization)
- Expand test coverage for core components
- Add integration tests for complete evaluation flows
- Create fixtures for testing without API access
- Containerize the application with Docker
- Create a CI/CD pipeline for automated testing
- Implement a proper Python package structure
- Add infrastructure for distributed processing

## Contributing

Contributions to BIGOS are welcome! Please see DEVELOPER.md for guidance.

## License

This project is licensed under the MIT License - see the LICENSE.md file for details.

## Citation

If you use this benchmark in your research, please cite:

```bibtex
@inproceedings{NEURIPS2024_69bddcea,
  author    = {Junczyk, Micha\l{}},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages     = {57439--57471},
  publisher = {Curran Associates, Inc.},
  title     = {BIGOS V2 Benchmark for Polish ASR: Curated Datasets and Tools for Reproducible Evaluation},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/69bddcea866e8210cf483769841282dd-Paper-Datasets_and_Benchmarks_Track.pdf},
  volume    = {37},
  year      = {2024}
}
```