
WorldWideWave - AI Voice Clone & Dubbing Pipeline


An end-to-end Python script that transcribes audio, translates the transcript into a target language, and re-synthesizes the speech in that language while preserving the original speaker's voice.


Features

  • Speech-to-Text: Utilizes OpenAI's Whisper for highly accurate transcription with word-level timestamps.
  • Translation: Translates the transcribed text into a wide variety of target languages using deep-translator.
  • Voice Cloning: Employs Coqui-AI's powerful XTTS-v2 model for zero-shot voice cloning. It analyzes a short sample of the original speaker's voice and uses it to generate new audio.
  • Text-to-Speech: Synthesizes the translated text in the target language using the cloned voice.
  • Duration Matching: Intelligently time-stretches the dubbed audio segments using Rubber Band to ensure they match the duration of the original speech, maintaining lip-sync compatibility.
  • End-to-End Pipeline: A single script integrates all components, providing a seamless workflow from input audio to final dubbed output.
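The duration-matching step above boils down to choosing a time-stretch rate for Rubber Band. A minimal sketch of that calculation (the helper name is illustrative, not from the repository; `pyrubberband.time_stretch` shortens audio when the rate is greater than 1):

```python
def stretch_rate(dubbed_seconds: float, original_seconds: float) -> float:
    """Rate to pass to pyrubberband.time_stretch so the dubbed
    segment ends up the same length as the original segment.
    rate > 1 shortens the audio, rate < 1 lengthens it."""
    if original_seconds <= 0:
        raise ValueError("original segment must have positive duration")
    return dubbed_seconds / original_seconds

# A 3.6 s synthesized segment must fit a 3.0 s original slot,
# so it needs to be played back 1.2x faster:
rate = stretch_rate(3.6, 3.0)
# The pipeline would then apply something like:
#   stretched = pyrubberband.time_stretch(audio, sample_rate, rate)
```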

Requirements

cupy-cuda12x==13.5.1
deep-translator==1.11.4
jupyter==1.1.1
meson==1.8.3
numpy==2.2.6
notebook==7.4.5
openai-whisper @ git+https://github.com/openai/whisper.git@c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
pydub==0.25.1
pyrubberband==0.4.0
torch==2.7.1+cu128
torchaudio==2.7.1
torchvision==0.22.1+cu128
transformers==4.39.3
TTS==0.22.0
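The `+cu128` builds of torch and torchvision are not published on PyPI. Assuming these pins live in a `requirements.txt`, installation likely needs PyTorch's CUDA 12.8 wheel index (the index URL below is PyTorch's standard one, not something confirmed by this repository):

```shell
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu128
```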

System-Level Dependencies


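This section appears to have lost its body. Judging from the Python dependencies above, the pipeline needs `ffmpeg` (used by openai-whisper and pydub for audio decoding) and the Rubber Band command-line tool (required by pyrubberband). On Debian/Ubuntu that would look like the following (package names assumed for apt; adjust for your distribution):

```shell
sudo apt-get update
sudo apt-get install -y ffmpeg rubberband-cli
```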
Tips for Best Results

  1. High-Quality Reference Audio: For the best voice cloning results, point INPUT_AUDIO_PATH at a clean, 15-30 second recording that contains only the target speaker's voice, with no background noise or music.
  2. Tweak TTS Speed: You can adjust the speed parameter within the tts.tts_to_file() function call to make the synthesized voice speak slightly faster or slower, which can sometimes result in a more natural cadence.
  3. GPU Acceleration: For any significant amount of audio, running this script on a machine with an NVIDIA GPU will be dramatically faster.
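Tip 2 can be automated: instead of hand-tuning `speed` by ear, derive it from the duration mismatch of a previous synthesis pass and clamp it to a range XTTS handles gracefully. A sketch under assumptions (the helper name and the 0.8-1.2 clamp range are illustrative, not values from this repository; the `speed` keyword of `tts.tts_to_file()` is the one Tip 2 refers to):

```python
def suggest_tts_speed(original_seconds: float,
                      synthesized_seconds: float,
                      lo: float = 0.8, hi: float = 1.2) -> float:
    """Suggest a `speed` value for tts.tts_to_file() so the next
    synthesis pass lands closer to the original segment's duration.
    speed > 1 makes the voice talk faster. Clamped to [lo, hi]
    because extreme values tend to sound unnatural."""
    raw = synthesized_seconds / original_seconds
    return max(lo, min(hi, raw))

# Synthesis came out 10% too long, so nudge speed up accordingly:
speed = suggest_tts_speed(3.0, 3.3)
# tts.tts_to_file(text=..., speaker_wav=..., language=...,
#                 file_path=..., speed=speed)
```

Re-synthesizing at the suggested speed reduces how aggressively the Rubber Band stretch has to work afterwards, which usually sounds better than time-stretching alone.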

*This README.md file has been improved for overall readability (grammar, sentence structure, and organization) using AI tools.

About

One script. One voice. Multiple languages.
