@clement-pages

No description provided.

@hbredin hbredin requested a review from Copilot November 25, 2025 10:08
Copilot finished reviewing on behalf of hbredin November 25, 2025 10:10

Copilot AI left a comment

Pull request overview

This PR adds transcription metrics functionality to pyannote-metrics, implementing two Word Error Rate (WER) metrics: Time-Constrained minimum-Permutation Word Error Rate (tcpWER) and standard Word Error Rate (WER). The implementation includes comprehensive text normalization capabilities with a focus on English language support.

Key changes:

  • Added transcription metrics with tcpWER and WER implementations using the meeteval library
  • Introduced text normalization infrastructure with English-specific normalizers
  • Added new optional dependencies: meeteval, more-itertools, regex, cython, and kaldialign
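
For readers unfamiliar with the metric, plain WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition only. It is not the PR's implementation (which relies on meeteval), the function name `word_error_rate` is hypothetical, and it does not cover the time and speaker constraints of tcpWER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the reference prefix into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, the value can exceed 1 when the hypothesis is much longer than the reference.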

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `uv.lock` | Updated dependency lock file with new packages (meeteval, more-itertools, regex, cython, kaldialign) and version updates for pyannote-core and pyannote-database |
| `src/pyannote/metrics/transcription.py` | Implements two WER metric classes with normalization support and meeteval integration |
| `src/pyannote/metrics/normalizers/__init__.py` | Provides normalizer selection logic with support for per-file language-based normalization |
| `src/pyannote/metrics/normalizers/basic.py` | Basic text normalization utilities for symbol and diacritic removal |
| `src/pyannote/metrics/normalizers/english/english.py` | Comprehensive English text normalizer with number and contraction handling |
| `src/pyannote/metrics/normalizers/english/__init__.py` | Module exports for English normalizer |
| `src/pyannote/metrics/normalizers/english/english.json` | British-American spelling mappings (1739 entries) |
| `src/pyannote/metrics/normalizers/english/pre_english.json` | Pre-normalization contraction mappings |
| `pyproject.toml` | Added new "transcription" optional dependency group |

{
"shan't": "shall not",
"han't": "has not",
"ain't": "ain not"

Copilot AI Nov 25, 2025

The normalization of "ain't" to "ain not" appears incorrect. "Ain't" is typically considered a contraction for "am not", "is not", or "are not" (and sometimes "has not" or "have not"), but "ain not" is not grammatically valid. Consider changing this to "am not" or another appropriate form.

Suggested change:
- "ain't": "ain not"
+ "ain't": "am not"
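
To see why the replacement string matters for scoring, here is a minimal sketch of dictionary-based contraction expansion. The helper `expand_contractions` is hypothetical and is not taken from the PR; it only assumes that the JSON file is loaded as a plain string-to-string mapping and applied on lowercased text.

```python
import re
from typing import Mapping


def expand_contractions(text: str, mapping: Mapping[str, str]) -> str:
    """Replace each whole-word contraction found in `mapping` (hypothetical helper)."""
    # Try longer keys first so a longer contraction is matched in preference
    # to a shorter one that starts at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text.lower())


current = expand_contractions("I ain't going", {"ain't": "ain not"})
suggested = expand_contractions("I ain't going", {"ain't": "am not"})
```

With the current entry the result contains the token `ain`, which a hypothesis transcribed as "am not" would not match. With the suggested entry both sides normalize to the same words, although "am not" is only one of the possible readings of "ain't".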

r"\bwon't\b": "will not",
r"\bcan't\b": "can not",
r"\blet's\b": "let us",
r"\bain't\b": "aint",

Copilot AI Nov 25, 2025

The normalization of "ain't" to "aint" is inconsistent with the pre_english.json file which maps "ain't" to "ain not". This will cause the pre_english.json mapping to be overwritten by this replacement, resulting in "ain't" → "ain not" → "aint". Consider aligning these normalizations or clarifying the intended behavior.

Suggested change:
- r"\bain't\b": "aint",
+ r"\bain't\b": "ain not",
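
How the two mappings interact depends on the order in which they are applied, which is not visible in the excerpts quoted here. The sketch below tries both orders using only the two entries quoted in this review. The helper functions are hypothetical stand-ins, not the PR's code, and the plain `str.replace` used for the JSON mapping is an assumption.

```python
import re

# Entries quoted in this review; everything else is illustrative.
PRE_MAPPING = {"ain't": "ain not"}       # from pre_english.json
REGEX_RULES = {r"\bain't\b": "aint"}     # from the English normalizer


def apply_pre_mapping(text: str) -> str:
    """Apply the JSON contraction mapping as literal replacements (assumed)."""
    for source, target in PRE_MAPPING.items():
        text = text.replace(source, target)
    return text


def apply_regex_rules(text: str) -> str:
    """Apply the regex replacement rules in insertion order."""
    for pattern, target in REGEX_RULES.items():
        text = re.sub(pattern, target, text)
    return text


pre_then_regex = apply_regex_rules(apply_pre_mapping("it ain't so"))
regex_then_pre = apply_pre_mapping(apply_regex_rules("it ain't so"))
```

Under these assumptions, whichever stage runs first consumes the contraction and the other rule never fires, so the two orders yield different normalized text. Aligning both entries on the same target removes that order dependence.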
