Remove filler words that conflict with European languages by AlexanderYastrebov · Pull Request #943 · cjpais/Handy

AlexanderYastrebov · 2026-03-02T21:27:13Z

Before Submitting This PR

Please confirm you have done the following:

I have searched existing issues and pull requests (including closed ones) to ensure this isn't a duplicate
I have read CONTRIBUTING.md

If this is a feature or change that was previously closed/rejected:

I have explained in the description below why this should be reconsidered
I have gathered community feedback (link to discussion below)

Human Written Description

Removed three filler words from the transcription filter that are actual words in European languages:

'um' - Portuguese indefinite article meaning 'a/an' (masculine). Example: Portuguese 'Isto é um teste' (This is a test). Removing this was breaking Portuguese transcriptions.
'ha' - Spanish/Italian/Norwegian/Swedish auxiliary verb meaning 'has/have'. Example: Spanish 'Él ha comido' (He has eaten). This is a very common verb form in Romance and Scandinavian languages.
'eh' - Italian interjection and Canadian English discourse marker. While less critical, this can appear in legitimate Italian speech.

The remaining filler words are primarily English vocalized hesitations that don't conflict with common words in other European languages.

Updated tests to use 'uhm' instead of 'um' where needed.
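
The conflict described above is easy to reproduce outside Handy. The sketch below is not Handy's actual filter: it assumes a plain word-boundary regex, and the helper `strip_fillers` and the two lists `BEFORE` and `AFTER` are hypothetical names chosen for illustration.

```python
import re

def strip_fillers(text, fillers):
    """Remove each filler (whole word, case-insensitive), then tidy whitespace."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in fillers) + r")\b[,.]?",
        re.IGNORECASE,
    )
    return re.sub(r"\s{2,}", " ", pattern.sub("", text)).strip()

# Hypothetical lists: with and without the three contested words.
BEFORE = ("uh", "um", "uhm", "ah", "eh", "ha", "hmm")
AFTER = ("uh", "uhm", "ah", "hmm")

print(strip_fillers("Isto é um teste", BEFORE))  # the article 'um' is dropped
print(strip_fillers("Él ha comido", BEFORE))     # the verb 'ha' is dropped
print(strip_fillers("Isto é um teste", AFTER))   # sentence survives intact
```

With the language-agnostic list the Portuguese and Spanish sentences lose a real word, while the shortened list leaves them untouched but still strips 'uhm'.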

Related Issues/Discussions

Fixes #941
Discussion: #940

Community Feedback

Testing

Tested 'um' and 'ha' using Google Translate with the Parakeet model, as described in #940.

Screenshots/Videos (if applicable)

AI Assistance

No AI was used in this PR
AI was used (please describe below)

If AI was used:

Tools used: Copilot+Claude Sonnet 4.5
How extensively: asked about filler words:

Please summarize which of these filler words exist in European languages and therefore are not safe to remove.


cjpais · 2026-03-03T03:48:00Z

I don't think this is a good solution. At minimum, we should probably ship this as a setting and allow people to change the list, as this change will significantly impact English users.
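
A user-editable list of this kind could be sketched as follows. This is only an illustration under stated assumptions: the JSON settings format, the `filler_words` key, the `load_fillers` helper, and the default list are all hypothetical and do not describe Handy's real settings.

```python
import json

# Hypothetical default; the real default list lives in Handy's source.
DEFAULT_FILLERS = ["uh", "um", "uhm", "hmm"]

def load_fillers(settings_json):
    """Return the user's filler list, falling back to the default when unset."""
    settings = json.loads(settings_json) if settings_json else {}
    words = settings.get("filler_words", DEFAULT_FILLERS)
    # Normalise: lowercase, trim, drop empties and duplicates, keep order.
    cleaned = [w.strip().lower() for w in words if isinstance(w, str)]
    return list(dict.fromkeys(w for w in cleaned if w))

# A Portuguese speaker could drop 'um' without affecting anyone else's default.
print(load_fillers('{"filler_words": ["uh", "uhm", "hmm"]}'))
```

Keeping the current list as the default would leave English users unaffected, while an empty list would switch the filter off entirely.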

AlexanderYastrebov · 2026-03-03T07:41:15Z

BTW um is also used in German, see https://de.wikipedia.org/wiki/Um-zu-Satz

IMO crude removal of words without language context is wrong.

cjpais · 2026-03-03T08:46:59Z

I agree, but what I am saying is that this should be a customizable list. Alternatively, language detection could be done generally and the detected language passed along to the filter, so that different lists apply to different languages. At least this should be possible with Whisper.
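
The language-aware variant could look roughly like this. It is a sketch only: it assumes the transcription backend reports a two-letter language code, and `FILLERS_BY_LANGUAGE`, `CONSERVATIVE_FALLBACK`, and `remove_fillers` are hypothetical names with illustrative, unvetted word lists.

```python
import re

# Hypothetical per-language lists; the entries are illustrative, not vetted.
FILLERS_BY_LANGUAGE = {
    "en": ("uh", "um", "uhm", "hmm", "eh", "ha"),
    "pt": ("uh", "uhm", "hmm"),
    "es": ("uh", "uhm", "hmm"),
}
# Used when the language is unknown: only words unlikely to be real words.
CONSERVATIVE_FALLBACK = ("uh", "uhm", "hmm")

def remove_fillers(text, language=None):
    """Strip the fillers registered for `language` (whole words, any case)."""
    fillers = FILLERS_BY_LANGUAGE.get(language, CONSERVATIVE_FALLBACK)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, fillers)) + r")\b[,.]?",
        re.IGNORECASE,
    )
    return re.sub(r"\s{2,}", " ", pattern.sub("", text)).strip()

print(remove_fillers("So, um, this is a test", "en"))
print(remove_fillers("Isto é um teste", "pt"))
```

English keeps its full list, so nothing regresses there, while Portuguese and undetected input fall back to a list that leaves 'um' alone.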

I don't think it's a fix to create a regression for other users. We can only improve both, not one step forward one step back.

AlexanderYastrebov · 2026-03-03T09:21:28Z

It does not look like #589 provided any statistics on why this specific set of words was removed. I think the magnitude of the problem also depends on the model used, and it might not be an issue at all on powerful models like Parakeet v3 (note that I am not arguing for ripping out the stutter cleanup).
Maybe this can be accepted optimistically and reverted in case English speakers report a regression.

Speaking of new features, please consider adding #930. It's a very powerful feature that likely requires less code than a customizable cleanup list, which could then be trivially implemented as a script like:

```python
#!/usr/bin/env python3
import sys
import re

text = sys.stdin.read()

FILLER_WORDS = (
    "uh", "um", "uhm", "umm", "uhh", "uhhh", "ah", "eh", "hmm", "hm", "mmm", "mm", "mh", "ha",
    "ehh",
)

# Remove filler words (case-insensitive)
for word in FILLER_WORDS:
    # Match filler word with word boundaries, optionally followed by comma or period
    pattern = re.compile(r'\b' + re.escape(word) + r'\b[,.]?', re.IGNORECASE)
    text = pattern.sub('', text)

# Clean up multiple spaces to single space
text = re.sub(r'\s{2,}', ' ', text)

sys.stdout.write(text)
```

cjpais · 2026-03-03T09:45:32Z

@AlexanderYastrebov new features are not really being accepted at the moment.

What you are suggesting is a power user feature. Not something for the average person. Handy targets the most average grandma as the primary user.

As an English speaker, I know from my own usage of Parakeet that there will be regressions, so this is not going to be pulled in. I've given you options for the way forward for this PR. Regressions are unacceptable. We can discuss ways forward, but we cannot discuss what is effectively a hack. We must think about the proper way to fix this. I will not entertain hacks or regressions.

AlexanderYastrebov · 2026-03-03T09:59:30Z

> What you are suggesting is a power user feature. Not something for the average person. Handy targets the most average grandma as the primary user.

I'd argue it's not the case anymore - with everyone using ChatGPT, it is not a high bar to compose a one-off script that transforms stdin to stdout. Ship a stub hook like

```python
#!/usr/bin/env python3
"""
This script reads text from standard input (stdin), allows for
text transformations, and writes the result to standard output (stdout).

Ask ChatGPT or Claude to add code here to change transcription like you wish.
"""

import sys

text = sys.stdin.read()

sys.stdout.write(text)
```

and you'll be surprised how creative people are.
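
For context, the calling side of such a stdin-to-stdout hook could be sketched as below. This is an assumption about one possible design, not the design proposed in #930: `run_hook` is a hypothetical helper, and an inline Python one-liner stands in for the user's script.

```python
import subprocess
import sys

def run_hook(command, text, timeout=5.0):
    """Pipe text through an external command; return the original text on failure."""
    try:
        result = subprocess.run(
            command,
            input=text,
            capture_output=True,
            text=True,
            encoding="utf-8",
            timeout=timeout,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
    except (OSError, subprocess.SubprocessError):
        # Missing script, crash, or timeout: keep the untouched transcription.
        return text
    return result.stdout

# Stand-in for a user script: upper-cases whatever arrives on stdin.
hook = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(run_hook(hook, "hello world"))
```

Falling back to the original text keeps a broken or slow user script from losing a transcription.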

cjpais · 2026-03-03T10:24:23Z

@AlexanderYastrebov grandmas are not using ChatGPT for an application like this. People are creative but you're still talking about 1%. I have a lot of normal people as friends that would never do this. This is not up for discussion. Your feature will be considered, but like everyone else's. New features are not accepted at this time. And if you really want a feature you will listen to my feedback. Arguing with me and the little time I have to maintain this application is not a good strategy. If you really want this in, help me work on a Beta Channel so we can push experimental features out to people without creating a massive set of issues in mainline.

It is also a departure from the discussion on this PR, for which I've given potential ways forward.

rguerreiro · 2026-03-04T07:41:49Z

I'm OK with the filler words being a customizable list that the user can change as he/she wishes. Either way, it really needs to be implemented because, as Handy stands right now, it's unusable for me in Portuguese.

AlexanderYastrebov mentioned this pull request Mar 2, 2026

[BUG] Automatic filler word removal includes a legitimate word in portuguese #941

Open

AlexanderYastrebov wants to merge 1 commit into cjpais:main from AlexanderYastrebov:remove-some-filler-words
