PYA-1148: add transcription metrics #73

clement-pages · 2025-11-25T09:39:24Z

No description provided.

Copilot

Pull request overview

This PR adds transcription metrics functionality to pyannote-metrics, implementing two Word Error Rate (WER) metrics: Time-Constrained minimum-Permutation Word Error Rate (tcpWER) and standard Word Error Rate (WER). The implementation includes comprehensive text normalization capabilities with a focus on English language support.

Key changes:

Added transcription metrics with tcpWER and WER implementations using the meeteval library
Introduced text normalization infrastructure with English-specific normalizers
Added new optional dependencies: meeteval, more-itertools, regex, cython, and kaldialign
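
For readers unfamiliar with the metric, here is a minimal sketch of plain WER as a word-level edit distance divided by the reference length. It is an illustration only: the PR delegates the computation to meeteval, and the function name `word_error_rate` is hypothetical. tcpWER, which also constrains the alignment by time and speaker permutation, is not covered here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Hypothetical helper for illustration; not the meeteval implementation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.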

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.


File	Description
uv.lock	Updated dependency lock file with new packages (meeteval, more-itertools, regex, cython, kaldialign) and version updates for pyannote-core and pyannote-database
src/pyannote/metrics/transcription.py	Implements two WER metric classes with normalization support and meeteval integration
src/pyannote/metrics/normalizers/__init__.py	Provides normalizer selection logic with support for per-file language-based normalization
src/pyannote/metrics/normalizers/basic.py	Basic text normalization utilities for symbol and diacritic removal
src/pyannote/metrics/normalizers/english/english.py	Comprehensive English text normalizer with number and contraction handling
src/pyannote/metrics/normalizers/english/__init__.py	Module exports for English normalizer
src/pyannote/metrics/normalizers/english/english.json	British-American spelling mappings (1739 entries)
src/pyannote/metrics/normalizers/english/pre_english.json	Pre-normalization contraction mappings
pyproject.toml	Added new "transcription" optional dependency group


src/pyannote/metrics/transcription.py

Copilot · 2025-11-25T10:11:18Z

src/pyannote/metrics/normalizers/english/pre_english.json

+{
+  "shan't": "shall not",
+  "han't": "has not",
+  "ain't": "ain not"


The normalization of "ain't" to "ain not" appears incorrect. "Ain't" is typically considered a contraction for "am not", "is not", or "are not" (and sometimes "has not" or "have not"), but "ain not" is not grammatically valid. Consider changing this to "am not" or another appropriate form.
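
To make the effect of this mapping concrete, here is a minimal sketch of a dictionary-driven contraction expansion. How the PR actually loads and applies pre_english.json is not shown on this page, so the function `expand_contractions` and the lowercasing step are assumptions; the table is a short excerpt using the "am not" target proposed in this comment.

```python
import re

# Hypothetical excerpt of a contraction table, with "ain't" mapped to "am not".
PRE_ENGLISH = {
    "shan't": "shall not",
    "han't": "has not",
    "ain't": "am not",
}


def expand_contractions(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word contractions with their expanded form."""
    alternatives = "|".join(re.escape(key) for key in mapping)
    # Word boundaries keep "han't" from matching inside "shan't".
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text.lower())
```

With this table, `expand_contractions("I ain't going", PRE_ENGLISH)` yields `"i am not going"`, whereas the original "ain not" target would leave the non-word "ain" in the normalized text.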

Suggested change

-  "ain't": "ain not"
+  "ain't": "am not"

Copilot · 2025-11-25T10:11:18Z

src/pyannote/metrics/normalizers/english/english.py

+            r"\bwon't\b": "will not",
+            r"\bcan't\b": "can not",
+            r"\blet's\b": "let us",
+            r"\bain't\b": "aint",


The normalization of "ain't" to "aint" is inconsistent with the pre_english.json file which maps "ain't" to "ain not". This will cause the pre_english.json mapping to be overwritten by this replacement, resulting in "ain't" → "ain not" → "aint". Consider aligning these normalizations or clarifying the intended behavior.
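
One way to catch this kind of mismatch early is a small consistency check between the two tables. The sketch below is an assumption about how such a check could look, not part of the PR: `find_conflicts` is a hypothetical helper, and both tables are short excerpts taken from the snippets quoted in this review.

```python
import re

# Excerpts from the two tables quoted in this review (not the full files).
PRE_MAPPING = {"shan't": "shall not", "ain't": "ain not"}
REGEX_RULES = {r"\bwon't\b": "will not", r"\bain't\b": "aint"}


def find_conflicts(
    pre_mapping: dict[str, str], regex_rules: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return words that both tables rewrite, but to different targets."""
    conflicts = {}
    for word, pre_target in pre_mapping.items():
        for pattern, regex_target in regex_rules.items():
            # A rule conflicts when it matches the whole word yet maps it elsewhere.
            if re.fullmatch(pattern, word) and regex_target != pre_target:
                conflicts[word] = (pre_target, regex_target)
    return conflicts
```

On these excerpts the check reports only "ain't", with the pair `("ain not", "aint")`. Running something like it in a unit test would flag future divergences between the JSON file and the regex rules.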

Suggested change

-            r"\bain't\b": "aint",
+            r"\bain't\b": "ain not",

src/pyannote/metrics/normalizers/english/english.py

Co-authored-by: Copilot <[email protected]>

add transcription metrics and normalizers

2ce87b3

hbredin requested a review from Copilot November 25, 2025 10:08

Copilot started reviewing on behalf of hbredin November 25, 2025 10:08

Copilot finished reviewing on behalf of hbredin November 25, 2025 10:10

Copilot AI reviewed Nov 25, 2025


clement-pages and others added 4 commits November 25, 2025 11:51

Apply suggestions from code review

a54f708

Co-authored-by: Copilot <[email protected]>

Apply suggestion from @Copilot

41d6a1a

Co-authored-by: Copilot <[email protected]>

print a warning instead of raising error when no-normalizer available

f2a9dc4

add normalization for chinese

fe48091
