Remove filler words that conflict with European languages #943

Open
AlexanderYastrebov wants to merge 1 commit into cjpais:main from AlexanderYastrebov:remove-some-filler-words

@AlexanderYastrebov

Before Submitting This PR

Please confirm you have done the following:

If this is a feature or change that was previously closed/rejected:

  • I have explained in the description below why this should be reconsidered
  • I have gathered community feedback (link to discussion below)

Human Written Description

Removed three filler words from the transcription filter that are actual words in European languages:

  • 'um' - Portuguese indefinite article meaning 'a/an' (masculine). Example: Portuguese 'Isto é um teste' (This is a test). Filtering it out was breaking Portuguese transcriptions.

  • 'ha' - Spanish/Italian/Norwegian/Swedish auxiliary verb meaning 'has/have'. Example: Spanish 'Él ha comido' (He has eaten). This is a very common verb form in Romance and Scandinavian languages.

  • 'eh' - Italian interjection and Canadian English discourse marker. While less critical, this can appear in legitimate Italian speech.

The remaining filler words are primarily English vocalized hesitations that don't conflict with common words in other European languages.

Updated tests to use 'uhm' instead of 'um' where needed.
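
To illustrate the conflict described above, here is a minimal sketch of a language-agnostic filter. It is not Handy's actual implementation: the `strip_fillers` name and the short word list are assumptions made up for this example.

```python
import re

# Hypothetical filter list for illustration; not Handy's actual list.
FILLER_WORDS = ("uh", "um", "uhm", "ha", "eh", "hmm")

# Any filler word on word boundaries, optionally followed by a comma or period.
FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in FILLER_WORDS) + r")\b[,.]?",
    re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove filler words with no knowledge of the text's language."""
    without_fillers = FILLER_PATTERN.sub("", text)
    # Collapse the gaps left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", without_fillers).strip()


# English: the hesitation is removed as intended.
print(strip_fillers("So, um, this is a test"))  # → So, this is a test
# Portuguese: the article "um" is lost.
print(strip_fillers("Isto é um teste"))  # → Isto é teste
# Spanish: the auxiliary verb "ha" is lost.
print(strip_fillers("Él ha comido"))  # → Él comido
```

The same pattern that cleans up the English sentence silently changes the meaning of the Portuguese and Spanish ones, because the filter has no language context.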

Related Issues/Discussions

Fixes #941
Discussion: #940

Community Feedback

Testing

Tested 'um' and 'ha' using Google Translate with the Parakeet model, as described in #940.

Screenshots/Videos (if applicable)

AI Assistance

  • No AI was used in this PR
  • AI was used (please describe below)

If AI was used:

  • Tools used: Copilot+Claude Sonnet 4.5
  • How extensively: asked about filler words:

Please summarize which of these filler words exist in European languages and therefore are not safe to remove.

@cjpais
Owner

cjpais commented Mar 3, 2026

I don't think this is a good solution. At minimum, we should probably ship this as a setting and allow people to change the list, as this will significantly impact English users.

@AlexanderYastrebov
Author

BTW, 'um' is also used in German; see https://de.wikipedia.org/wiki/Um-zu-Satz

IMO crude removal of words without language context is wrong.

@cjpais
Owner

cjpais commented Mar 3, 2026

I agree, but what I am saying is that this should be a customizable list. Alternatively, we could run language detection and pass the detected language along to the filter, so that a different list applies to each language. That should at least be possible with Whisper.

I don't think it's a fix to create a regression for other users. We can only improve both, not take one step forward and one step back.
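
The two options mentioned here (a user-customizable list and per-language lists) could be combined roughly as follows. This is only a sketch under stated assumptions: the function names, the language codes, and the contents of every list are hypothetical, and the `language` argument is assumed to come from whatever language detection the transcription model provides.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical per-language defaults; real lists would need review by
# speakers of each language.
DEFAULT_FILLERS = {
    "en": ("uh", "um", "uhm", "ah", "eh", "hmm", "ha"),
    "pt": ("uh", "uhm", "hmm"),  # "um" is an article in Portuguese
    "es": ("uh", "um", "uhm", "hmm"),  # "ha" is a verb form in Spanish
}
# Conservative fallback when the language is unknown or unsupported.
FALLBACK_FILLERS = ("uh", "uhm", "hmm")


def fillers_for(
    language: Optional[str], user_override: Optional[Iterable[str]] = None
) -> Tuple[str, ...]:
    """Pick the list: user setting first, then language default, then fallback."""
    if user_override is not None:
        return tuple(user_override)
    return DEFAULT_FILLERS.get(language, FALLBACK_FILLERS)


def strip_fillers(
    text: str,
    language: Optional[str] = None,
    user_override: Optional[Iterable[str]] = None,
) -> str:
    words = fillers_for(language, user_override)
    if not words:
        # An empty list means the user switched filtering off.
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b[,.]?",
        re.IGNORECASE,
    )
    return re.sub(r"\s{2,}", " ", pattern.sub("", text)).strip()


print(strip_fillers("So, um, this is a test", language="en"))  # → So, this is a test
print(strip_fillers("Isto é um teste", language="pt"))  # → Isto é um teste
print(strip_fillers("Él ha comido", language="es"))  # → Él ha comido
```

With this shape, English users keep the current behavior, while other languages get a list that avoids their real words, and a user setting can override either.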

@AlexanderYastrebov
Author

It does not look like #589 provided any statistics on why this specific set of words was removed. I think the magnitude of the problem also depends on the model used, and it might not be an issue at all on powerful models like Parakeet v3 (note that I am not arguing to rip out the stutter cleanup).
Maybe this can be accepted optimistically and reverted if English speakers report a regression.

Speaking of new features, please consider adding #930. It's a very powerful feature that likely requires less code than a customizable cleanup list, which is trivially implemented like:

#!/usr/bin/env python3
import re
import sys

FILLER_WORDS = (
    "uh", "um", "uhm", "umm", "uhh", "uhhh", "ah", "eh", "hmm", "hm", "mmm", "mm", "mh", "ha",
    "ehh",
)

# One case-insensitive pattern: any filler word on word boundaries,
# optionally followed by a comma or period
FILLER_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in FILLER_WORDS) + r')\b[,.]?',
    re.IGNORECASE,
)

text = sys.stdin.read()

# Remove filler words
text = FILLER_PATTERN.sub('', text)

# Collapse runs of whitespace into a single space
text = re.sub(r'\s{2,}', ' ', text)

sys.stdout.write(text)

@cjpais
Owner

cjpais commented Mar 3, 2026

@AlexanderYastrebov new features are not really being accepted at the moment.

What you are suggesting is a power user feature. Not something for the average person. Handy targets the most average grandma as the primary user.

As an English speaker, I know from my own usage of Parakeet that there will be regressions, so this is not going to be pulled in. I've given you options for the way forward for this PR. Regressions are unacceptable. We can discuss ways forward, but we cannot discuss what is effectively a hack. We must think about the proper way to fix this. I will not entertain hacks or regressions.

@AlexanderYastrebov
Author

AlexanderYastrebov commented Mar 3, 2026

> What you are suggesting is a power user feature. Not something for the average person. Handy targets the most average grandma as the primary user.

I'd argue it's not the case anymore: with everyone using ChatGPT, it is not a high bar to compose a one-off script that transforms stdin to stdout. Ship a stub hook like

#!/usr/bin/env python3
"""
This script reads text from standard input (stdin), allows for 
text transformations, and writes the result to standard output (stdout).

Ask ChatGPT or Claude to add code here to change transcription like you wish.
"""

import sys

text = sys.stdin.read()

sys.stdout.write(text)

and you'll be surprised how creative people are.
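
For context, an application could invoke such a hook roughly as sketched below. How Handy or #930 would actually do this is not specified in this thread, so the `run_hook` name, the timeout, and the fall-back-to-original-text behavior are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_hook(hook_path: str, text: str, timeout: float = 5.0) -> str:
    """Pipe text through a Python hook script; return the input unchanged on failure."""
    try:
        result = subprocess.run(
            [sys.executable, hook_path],
            input=text,
            capture_output=True,
            encoding="utf-8",
            timeout=timeout,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
    except (OSError, subprocess.SubprocessError):
        # Missing interpreter, crash, or timeout: keep the original transcription.
        return text
    return result.stdout


# Demo: a throwaway hook that replaces one word.
with tempfile.TemporaryDirectory() as tmp:
    hook = Path(tmp) / "hook.py"
    hook.write_text(
        "import sys\n"
        "sys.stdout.write(sys.stdin.read().replace('teh', 'the'))\n"
    )
    print(run_hook(str(hook), "fix teh typo"))  # → fix the typo
```

Falling back to the untouched text is one possible design choice: a broken or slow user script then degrades to "no post-processing" rather than losing the transcription.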

@cjpais
Owner

cjpais commented Mar 3, 2026

@AlexanderYastrebov grandmas are not using ChatGPT for an application like this. People are creative but you're still talking about 1%. I have a lot of normal people as friends that would never do this. This is not up for discussion. Your feature will be considered, but like everyone else's. New features are not accepted at this time. And if you really want a feature you will listen to my feedback. Arguing with me and the little time I have to maintain this application is not a good strategy. If you really want this in, help me work on a Beta Channel so we can push experimental features out to people without creating a massive set of issues in mainline.

It is also a departure from the discussion on this PR, for which I've given potential ways forward.

@rguerreiro

I'm OK with the filler words being a customizable list that users can change as they wish. Either way, it really needs to be implemented because, as Handy stands right now, it's unusable for me in Portuguese.

Development

Successfully merging this pull request may close these issues.

[BUG] Automatic filler word removal includes a legitimate word in portuguese

3 participants