PYA-1148: add transcription metrics #73
base: develop
Pull request overview
This PR adds transcription metrics functionality to pyannote-metrics, implementing two Word Error Rate (WER) metrics: Time-Constrained minimum-Permutation Word Error Rate (tcpWER) and standard Word Error Rate (WER). The implementation includes comprehensive text normalization capabilities with a focus on English language support.
Key changes:
- Added transcription metrics with tcpWER and WER implementations using meeteval library
- Introduced text normalization infrastructure with English-specific normalizers
- Added new optional dependencies: meeteval, more-itertools, regex, cython, and kaldialign
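For readers unfamiliar with the metric, the plain WER mentioned above can be sketched as a word-level edit distance divided by the number of reference words. This is only an illustration: the PR itself delegates to meeteval, whose API is not shown here, and the function name `word_error_rate` is hypothetical. tcpWER additionally applies time constraints and a speaker permutation search, which this sketch does not cover.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only; not the implementation used in the PR.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the hat sat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.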
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| uv.lock | Updated dependency lock file with new packages (meeteval, more-itertools, regex, cython, kaldialign) and version updates for pyannote-core and pyannote-database |
| src/pyannote/metrics/transcription.py | Implements two WER metric classes with normalization support and meeteval integration |
| src/pyannote/metrics/normalizers/\_\_init\_\_.py | Provides normalizer selection logic with support for per-file language-based normalization |
| src/pyannote/metrics/normalizers/basic.py | Basic text normalization utilities for symbol and diacritic removal |
| src/pyannote/metrics/normalizers/english/english.py | Comprehensive English text normalizer with number and contraction handling |
| src/pyannote/metrics/normalizers/english/\_\_init\_\_.py | Module exports for English normalizer |
| src/pyannote/metrics/normalizers/english/english.json | British-American spelling mappings (1739 entries) |
| src/pyannote/metrics/normalizers/english/pre_english.json | Pre-normalization contraction mappings |
| pyproject.toml | Added new "transcription" optional dependency group |
    {
      "shan't": "shall not",
      "han't": "has not",
      "ain't": "ain not"
Copilot AI, Nov 25, 2025
The normalization of "ain't" to "ain not" appears incorrect. "Ain't" is typically considered a contraction for "am not", "is not", or "are not" (and sometimes "has not" or "have not"), but "ain not" is not grammatically valid. Consider changing this to "am not" or another appropriate form.
Suggested change:

    - "ain't": "ain not"
    + "ain't": "am not"
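To see why the choice of expansion matters for scoring, here is a minimal sketch of a dictionary-based contraction expander. It assumes the JSON entries are applied as whole-word replacements, which the quoted fragment does not confirm, and the helper name `expand_contractions` is hypothetical.

```python
import re


def expand_contractions(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word contractions using a lookup table (illustrative)."""
    # Longest keys first, so a longer contraction is tried before a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Entries copied from the quoted fragment, plus the suggested replacement.
current = {"shan't": "shall not", "han't": "has not", "ain't": "ain not"}
suggested = {**current, "ain't": "am not"}

print(expand_contractions("i ain't going", current))    # i ain not going
print(expand_contractions("i ain't going", suggested))  # i am not going
```

With the current entry, a reference containing "ain't" and a hypothesis containing "am not" would still differ by one word after normalization; with the suggested entry they would match. Whether "am not" is the right target for every use of "ain't" is a separate question, as the comment above notes.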
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"\blet's\b": "let us",
    r"\bain't\b": "aint",
Copilot AI, Nov 25, 2025
The normalization of "ain't" to "aint" is inconsistent with the pre_english.json file which maps "ain't" to "ain not". This will cause the pre_english.json mapping to be overwritten by this replacement, resulting in "ain't" → "ain not" → "aint". Consider aligning these normalizations or clarifying the intended behavior.
Suggested change:

    - r"\bain't\b": "aint",
    + r"\bain't\b": "ain not",
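The interaction between the two rules depends on the order in which they run, which the quoted snippets do not show. The sketch below assumes each rule set is applied as a plain sequence of `re.sub` calls; the names `PRE_MAPPING`, `REGEX_RULES`, and both helper functions are hypothetical and only mirror the two quoted entries.

```python
import re

# Entry as quoted from pre_english.json.
PRE_MAPPING = {"ain't": "ain not"}
# Entry as quoted from the regex replacement table.
REGEX_RULES = {r"\bain't\b": "aint"}


def apply_pre_mapping(text: str) -> str:
    """Apply the JSON-style mapping as whole-word replacements (assumed)."""
    for source, target in PRE_MAPPING.items():
        text = re.sub(r"\b" + re.escape(source) + r"\b", target, text)
    return text


def apply_regex_rules(text: str) -> str:
    """Apply the regex replacement table in insertion order (assumed)."""
    for pattern, target in REGEX_RULES.items():
        text = re.sub(pattern, target, text)
    return text


sample = "it ain't so"
pre_then_regex = apply_regex_rules(apply_pre_mapping(sample))
regex_then_pre = apply_pre_mapping(apply_regex_rules(sample))

print(pre_then_regex)  # it ain not so
print(regex_then_pre)  # it aint so
```

Under these assumptions, whichever rule sees the text first consumes "ain't" and the other never fires, so the two tables give different outputs for the same input. Aligning both entries on one target, as the suggestion does, removes that dependence on ordering.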
Co-authored-by: Copilot <[email protected]>