Remove filler words that conflict with European languages #943
AlexanderYastrebov wants to merge 1 commit into cjpais:main
Removed three filler words from the transcription filter that are actual words in European languages:

- 'um' - Portuguese/Spanish indefinite article meaning 'a/an' (masculine). Example: Portuguese 'Isto é um teste' (This is a test). Removing this was breaking Portuguese transcriptions.
- 'ha' - Spanish/Italian/Norwegian/Swedish auxiliary verb meaning 'has/have'. Example: Spanish 'Él ha comido' (He has eaten). This is a very common verb form in Romance and Scandinavian languages.
- 'eh' - Italian interjection and Canadian English discourse marker. While less critical, this can appear in legitimate Italian speech.

The remaining filler words are primarily English vocalized hesitations that don't conflict with common words in other European languages.

Updated tests to use 'uhm' instead of 'um' where needed.

Fixes cjpais#941
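The conflict described above can be reproduced with a small sketch. It assumes the filter behaves like a case-insensitive, whole-word regex removal; the function name `strip_fillers` and the word list are illustrative and are not taken from Handy's actual implementation.

```python
import re

# Illustrative list only; assumed to resemble the filter under discussion.
FILLER_WORDS = ("uh", "um", "uhm", "ha", "eh", "hmm")


def strip_fillers(text, fillers=FILLER_WORDS):
    """Remove whole-word fillers, case-insensitively, then tidy whitespace."""
    alternation = "|".join(re.escape(word) for word in fillers)
    # A filler may be followed by a comma or period, which is removed with it.
    pattern = re.compile(r"\b(?:" + alternation + r")\b[,.]?", re.IGNORECASE)
    text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(strip_fillers("Isto é um teste"))  # the Portuguese article 'um' is lost
print(strip_fillers("Él ha comido"))     # the Spanish auxiliary 'ha' is lost
```

Because the filter has no notion of language, it cannot tell a hesitation from a real word that happens to be spelled the same way.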
I don't think this is a good solution. At minimum, we should probably ship this as a setting and let people change the list, as this will significantly impact English users.

BTW, IMO crude removal of words without language context is wrong.

I agree, but what I am saying is that this should be a customizable list. Alternatively, language detection could be done generally and passed along to a filter that applies different lists for different languages. At least this should be possible with Whisper. I don't think it's a fix to create a regression for other users. We can only improve both, not one step forward, one step back.
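The language-aware option raised in this exchange could look roughly like the sketch below. It assumes a language code (for example, one reported by the transcription model or chosen in settings) is available to the filter; `FILLERS_BY_LANGUAGE`, `DEFAULT_FILLERS`, and `strip_fillers` are hypothetical names, and the word lists are examples rather than a vetted set.

```python
import re

# Hypothetical per-language lists; entries are examples, not a vetted set.
FILLERS_BY_LANGUAGE = {
    "en": ("uh", "um", "uhm", "hmm", "ha", "eh"),
    "pt": ("uh", "uhm", "hmm"),  # 'um' is an article in Portuguese
    "es": ("uh", "uhm", "hmm"),  # 'ha' is an auxiliary verb in Spanish
}

# Conservative fallback used when the language is unknown or unlisted.
DEFAULT_FILLERS = ("uh", "uhm", "hmm")


def strip_fillers(text, language=None):
    """Remove fillers using the list for `language`, else the fallback list."""
    fillers = FILLERS_BY_LANGUAGE.get(language, DEFAULT_FILLERS)
    alternation = "|".join(re.escape(word) for word in fillers)
    pattern = re.compile(r"\b(?:" + alternation + r")\b[,.]?", re.IGNORECASE)
    text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(strip_fillers("Um, this is a test", language="en"))
print(strip_fillers("Isto é um teste", language="pt"))
```

With this shape, English keeps its current cleanup while Portuguese and Spanish text is left intact, so neither group of users sees a regression.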
It does not look like #589 provided any statistics on why this specific set of words was removed. I think the magnitude of the problem also depends on the model used, and it might not be an issue at all on powerful models like parakeet v3 (note that I am not arguing to rip out the stutter cleanup). Speaking of new features, please consider adding #930. It's a very powerful feature that likely requires less code than a customizable cleanup list, which is trivially implemented like:

```python
#!/usr/bin/env python3
import re
import sys

text = sys.stdin.read()

FILLER_WORDS = (
    "uh", "um", "uhm", "umm", "uhh", "uhhh", "ah", "eh", "hmm", "hm", "mmm", "mm", "mh", "ha",
    "ehh",
)

# Remove filler words (case-insensitive)
for word in FILLER_WORDS:
    # Match filler word with word boundaries, optionally followed by comma or period
    pattern = re.compile(r'\b' + re.escape(word) + r'\b[,.]?', re.IGNORECASE)
    text = pattern.sub('', text)

# Clean up multiple spaces to single space
text = re.sub(r'\s{2,}', ' ', text)

sys.stdout.write(text)
```
@AlexanderYastrebov new features are not really being accepted at the moment. What you are suggesting is a power-user feature, not something for the average person. Handy targets the most average grandma as the primary user. As an English speaker, I know from my own usage of Parakeet that there will be regressions, and it's not going to be pulled in. I've given you options for the way forward for this PR. Regressions are unacceptable. We can discuss ways forward, but we cannot discuss what is effectively a hack. We must think about the proper way to fix this. I will not entertain hacks or regressions.
I'd argue it's not the case anymore: with everyone using ChatGPT, it is not a high bar to compose a one-off script that transforms stdin to stdout. Ship a stub hook like

```python
#!/usr/bin/env python3
"""
This script reads text from standard input (stdin), allows for
text transformations, and writes the result to standard output (stdout).
Ask ChatGPT or Claude to add code here to change transcription like you wish.
"""
import sys

text = sys.stdin.read()
sys.stdout.write(text)
```

and you'll be surprised how creative people are.
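For context, the calling side of such a hook might look like the sketch below. This is only an assumption about how an application could pipe a transcription through a user script; `run_hook` is a hypothetical helper, not part of Handy or of #930, and it falls back to the unmodified text if the script is missing, fails, or times out.

```python
import subprocess
import sys


def run_hook(command, text, timeout=5.0):
    """Pipe `text` through `command`; return the input unchanged on failure."""
    try:
        result = subprocess.run(
            command,
            input=text,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing executable, non-zero exit status, or timeout.
        return text
    return result.stdout


# An inline pass-through script stands in for a user-provided hook file.
passthrough = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
print(run_hook(passthrough, "hello world"))
```

Falling back to the original text keeps a broken user script from swallowing the transcription.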
@AlexanderYastrebov grandmas are not using ChatGPT for an application like this. People are creative, but you're still talking about 1%. I have a lot of normal people as friends who would never do this. This is not up for discussion. Your feature will be considered, but like everyone else's. New features are not accepted at this time. And if you really want a feature, you will listen to my feedback. Arguing with me and the little time I have to maintain this application is not a good strategy. If you really want this in, help me work on a Beta Channel so we can push experimental features out to people without creating a massive set of issues in mainline. It is also a departure from the discussion on this PR, for which I've given potential ways forward.
I'm OK with the filler words being a customizable list that the user can change as he/she wishes. Either way, it really needs to be implemented because, as Handy stands right now, it's unusable for me in Portuguese.
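A user-editable list could be sketched as follows. The JSON settings format and the `filler_words` key are assumptions made for illustration, as are the helper names `load_fillers` and `strip_fillers`; the defaults stand in for whatever list the application ships.

```python
import json
import re

# Stand-in for the shipped defaults; the real list may differ.
DEFAULT_FILLERS = ["uh", "um", "uhm", "hmm", "ha", "eh"]


def load_fillers(settings_json):
    """Return the user's filler list, or the defaults if none is configured."""
    try:
        settings = json.loads(settings_json)
    except json.JSONDecodeError:
        return list(DEFAULT_FILLERS)
    if not isinstance(settings, dict):
        return list(DEFAULT_FILLERS)
    words = settings.get("filler_words")  # hypothetical settings key
    if not isinstance(words, list):
        return list(DEFAULT_FILLERS)
    return [w for w in words if isinstance(w, str) and w.strip()]


def strip_fillers(text, fillers):
    """Remove the given whole-word fillers; an empty list disables filtering."""
    if not fillers:
        return text
    alternation = "|".join(re.escape(word) for word in fillers)
    pattern = re.compile(r"\b(?:" + alternation + r")\b[,.]?", re.IGNORECASE)
    text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


# A Portuguese speaker drops 'um' and 'ha' from their personal list.
fillers = load_fillers('{"filler_words": ["uh", "hmm"]}')
print(strip_fillers("Isto é um teste, uh, ok", fillers))
```

Keeping the defaults when nothing is configured means English users see no change, while anyone affected by a conflict can remove the offending word.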
Related Issues/Discussions
Fixes #941
Discussion: #940
Testing
Tested using Google Translate with the Parakeet model, as described in #940, for 'um' and 'ha'.