Open-source pipeline for dubbing English video into Bangla: translation, speech synthesis, voice conversion, and prosody transfer, assembled into a single CLI.
```
Input video (.mkv / .mp4)
  │
  ├─[ffmpeg]───────► english.wav + english.srt
  │
  ├─[Demucs]───────► vocals.wav + background.wav
  │
  ├─[pysrt]────────► segments[] with per-segment audio slices
  │
  ├─[NLLB-200]─────► segments[].bn (Bangla text)
  │
  ├─[MMS-TTS-ben]──► segments[].bn_audio (Bangla speech)
  │
  ├─[kNN-VC]───────► voice-converted Bangla audio (optional)
  │
  ├─[pyworld]──────► F0 + energy transferred from EN speaker (optional)
  │
  ├─[rubberband]───► duration-aligned, pitch-preserved audio
  │
  └─[ffmpeg]───────► dubbed.mkv (voice + background mixed)
```
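The `pysrt` stage turns subtitles into a list of segments that later stages annotate (`segments[].bn`, `segments[].bn_audio`). A minimal sketch of such a record, assuming SRT timestamps in `HH:MM:SS,mmm` form and the mono 16 kHz audio produced by the extraction step. `Segment`, `srt_time_to_seconds`, and `clean_tags` are hypothetical names, not OpenDub's actual code, and the real parser relies on `pysrt` rather than hand-rolled parsing.

```python
import re
from dataclasses import dataclass
from typing import Optional

SAMPLE_RATE = 16_000  # assumed: matches the mono 16 kHz WAV from the extractor

# HTML-style tags (<i>, <font ...>) and ASS-style override blocks ({\an8})
_TAG_RE = re.compile(r"<[^>]+>|\{[^}]*\}")


def srt_time_to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to seconds."""
    hms, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def clean_tags(text: str) -> str:
    """Strip formatting tags and collapse line breaks/whitespace."""
    return " ".join(_TAG_RE.sub("", text).split())


@dataclass
class Segment:
    index: int
    start: float  # seconds
    end: float  # seconds
    en: str  # cleaned English subtitle text
    bn: Optional[str] = None  # filled in by the translation stage
    bn_audio: Optional[str] = None  # path filled in by the TTS stage

    @property
    def duration(self) -> float:
        return self.end - self.start

    def sample_range(self, sr: int = SAMPLE_RATE) -> tuple[int, int]:
        """Start/end sample indices for slicing this segment from the source WAV."""
        return round(self.start * sr), round(self.end * sr)
```

Slicing then reduces to `audio[start:end]` on the decoded waveform, which keeps every later stage working on per-segment arrays with known target durations.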
```bash
git clone https://github.com/Y3454R/OpenDub.git
cd OpenDub
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```

macOS: rubberband backend (required for pitch-preserving time stretch):

```bash
brew install rubberband
pip install pyrubberband
```

Fast test (skips voice cloning and prosody transfer):
```bash
python cli.py \
  --input test_files/clip.mkv \
  --srt test_files/english.srt \
  --output outputs/dubbed/dubbed.mkv \
  --no-voice-clone \
  --no-prosody
```

Full pipeline (requires kNN-VC checkpoint):
```bash
python cli.py \
  --input movie.mkv \
  --output dubbed.mkv \
  --vc-checkpoint /path/to/knnvc_checkpoint
```

Clip a time window:
```bash
python cli.py \
  --input movie.mkv \
  --start 00:23:00 \
  --end 00:27:00 \
  --output outputs/dubbed/clip.mkv \
  --no-voice-clone --no-prosody
```

External SRT (skip subtitle extraction):
```bash
python cli.py \
  --input movie.mkv \
  --srt subtitles.srt \
  --output dubbed.mkv
```

| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Input video file |
| `--output` | `dubbed.mkv` | Output video file |
| `--srt` | None | External SRT file; skips subtitle extraction from video |
| `--start` | None | Clip start time, e.g. `00:23:00` |
| `--end` | None | Clip end time, e.g. `00:27:00` |
| `--no-voice-clone` | off | Skip kNN-VC voice conversion |
| `--no-prosody` | off | Skip pyworld prosody transfer |
| `--vc-checkpoint` | None | Path to kNN-VC model checkpoint |
| `--output-dir` | `outputs` | Working directory for intermediate files |
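The flag table maps directly onto a standard `argparse` parser. A hypothetical sketch (not the actual `cli.py`), assuming `--start`/`--end` are converted to seconds at parse time; `build_parser` and `parse_timestamp` are illustrative names.

```python
import argparse


def parse_timestamp(value: str) -> float:
    """Convert 'HH:MM:SS' (seconds may be fractional) to seconds."""
    parts = value.split(":")
    if len(parts) != 3:
        raise argparse.ArgumentTypeError(f"expected HH:MM:SS, got {value!r}")
    hours, minutes, seconds = int(parts[0]), int(parts[1]), float(parts[2])
    return hours * 3600 + minutes * 60 + seconds


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Dub English video into Bangla (sketch)")
    p.add_argument("--input", required=True, help="Input video file")
    p.add_argument("--output", default="dubbed.mkv", help="Output video file")
    p.add_argument("--srt", default=None, help="External SRT; skips subtitle extraction")
    p.add_argument("--start", type=parse_timestamp, default=None, help="Clip start, e.g. 00:23:00")
    p.add_argument("--end", type=parse_timestamp, default=None, help="Clip end, e.g. 00:27:00")
    # store_true flags default to False, i.e. the stage runs unless disabled
    p.add_argument("--no-voice-clone", action="store_true", help="Skip kNN-VC voice conversion")
    p.add_argument("--no-prosody", action="store_true", help="Skip pyworld prosody transfer")
    p.add_argument("--vc-checkpoint", default=None, help="Path to kNN-VC model checkpoint")
    p.add_argument("--output-dir", default="outputs", help="Working directory for intermediates")
    return p
```

`argparse` converts dashes to underscores, so the flags are read as `args.no_voice_clone`, `args.vc_checkpoint`, and `args.output_dir`.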
| Module | Model | Purpose |
|---|---|---|
| `audio/extractor.py` | ffmpeg | Extract mono 16 kHz WAV + SRT from video |
| `audio/separator.py` | Demucs htdemucs | Split vocals from background music/effects |
| `subtitles/parser.py` | pysrt | Parse SRT, clean tags, slice per-segment audio |
| `translation/translator.py` | NLLB-200-distilled-600M | English → Bangla text translation |
| `tts/synthesizer.py` | MMS-TTS-ben | Bangla text → speech |
| `voice/cloning.py` | kNN-VC | Convert TTS voice to match the source speaker |
| `voice/prosody.py` | pyworld (WORLD vocoder) | Transfer F0 contour and energy from EN speaker |
| `audio/stretcher.py` | rubberband / librosa TSM | Pitch-preserving duration alignment |
| `audio/mixer.py` | ffmpeg | Assemble segments, mix background, mux to video |
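`audio/extractor.py` and `audio/mixer.py` both shell out to ffmpeg. A hedged sketch of the argument lists such wrappers might build, using standard ffmpeg options (`-vn`, `-ac`, `-ar`, `-map`, and the `amix` filter). The function names are hypothetical, and the real modules may differ, for example in volume balancing or in how `--start`/`--end` trimming is applied.

```python
def extract_audio_cmd(video: str, wav: str) -> list[str]:
    """Mono 16 kHz WAV from the input's default audio stream."""
    # -vn: drop video; -ac 1: downmix to mono; -ar 16000: resample to 16 kHz
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", wav]


def extract_srt_cmd(video: str, srt: str, stream: int = 0) -> list[str]:
    """Convert the N-th subtitle stream to SRT (text-based subtitles only)."""
    # -map 0:s:N selects the N-th subtitle stream of the first input
    return ["ffmpeg", "-y", "-i", video, "-map", f"0:s:{stream}", srt]


def mux_cmd(video: str, voice: str, background: str, out: str) -> list[str]:
    """Mix dubbed voice with the background stem and mux onto the original video."""
    return [
        "ffmpeg", "-y",
        "-i", video,       # input 0: original video
        "-i", voice,       # input 1: assembled Bangla voice track
        "-i", background,  # input 2: Demucs background stem
        "-filter_complex", "[1:a][2:a]amix=inputs=2:duration=longest[aout]",
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy",    # keep the video stream as-is, re-encode only audio
        out,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Bitmap subtitle streams (such as PGS) cannot be converted to SRT this way, which is one case where `--srt` is needed, and `amix` scales its inputs by default, so relative voice/background levels may need adjusting.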
This pipeline is a baseline for studying isochrony-aware Bangla dubbing: matching dubbed speech duration to the original without degrading naturalness.
Key open problems:
- Voice cloning for Bangla: kNN-VC was trained on English; cross-lingual transfer to Bangla TTS output is untested
- Isochrony: Bangla text is typically longer than its English source; the isochrony report printed after each run quantifies the gap per segment
- Prosody transfer: the F0 contour of the English speaker is transferred via the WORLD vocoder; its interaction with Bangla tone patterns is an open question
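For the isochrony problem, the rubberband stage needs one time-stretch rate per segment. With `pyrubberband.time_stretch(y, sr, rate)` (as with `librosa.effects.time_stretch`), `rate > 1` speeds audio up, so the rate that fits a Bangla clip into its English slot is the ratio of the two durations. A minimal sketch; the clamp bounds of 0.8 and 1.5 are hypothetical values chosen only to illustrate capping extreme stretches, not OpenDub's settings.

```python
def stretch_rate(bn_duration: float, en_duration: float,
                 min_rate: float = 0.8, max_rate: float = 1.5) -> float:
    """Time-stretch rate that maps a Bangla clip onto its English slot."""
    if bn_duration <= 0 or en_duration <= 0:
        raise ValueError("durations must be positive")
    # rate > 1 shortens the clip, rate < 1 lengthens it
    rate = bn_duration / en_duration
    # hypothetical bounds: cap how far speech is sped up or slowed down
    return min(max(rate, min_rate), max_rate)


def stretched_duration(bn_duration: float, rate: float) -> float:
    """Duration of the clip after stretching by `rate`."""
    return bn_duration / rate
```

A 3.0 s Bangla clip in a 2.0 s slot gets rate 1.5 and fits exactly; a 5.0 s clip is capped at 1.5 and still overruns by about 1.33 s. That residual is exactly what a per-segment isochrony metric would surface.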
The isochrony report at the end of each run shows the mean, standard deviation, and maximum duration mismatch across segments; this is the metric the research aims to minimize.
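The report's summary statistics can be computed as below. A sketch assuming the mismatch is the absolute difference in seconds between each final Bangla segment and its English segment; OpenDub may define it differently (for example as a duration ratio), and `isochrony_report` is a hypothetical name.

```python
import statistics


def isochrony_report(durations):
    """durations: iterable of (en_seconds, bn_seconds) pairs, one per segment."""
    mismatches = [abs(bn - en) for en, bn in durations]
    if not mismatches:
        raise ValueError("no segments to report on")
    return {
        "mean": statistics.fmean(mismatches),
        # population std: defined even for a single segment
        "std": statistics.pstdev(mismatches),
        "max": max(mismatches),
    }
```

Using `pstdev` rather than `stdev` keeps the report defined for very short clips with a single subtitle segment.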
```
outputs/
├── segments/
│   ├── english.wav
│   ├── vocals.wav
│   ├── background.wav
│   ├── seg_0001_en.wav
│   ├── seg_0001_bn.wav
│   └── ...
└── dubbed/
    └── dubbed.mkv
```
- Python 3.10+
- ffmpeg (system install: `brew install ffmpeg` / `apt install ffmpeg`)
- rubberband (macOS: `brew install rubberband`)
- CUDA GPU recommended for the full pipeline; Apple Silicon (M-series) works for development
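Before a run, the system requirements above can be checked from Python. A minimal sketch using only the standard library; it verifies the interpreter version and that `ffmpeg` and the `rubberband` command-line tool (which `pyrubberband` invokes) are on `PATH`, but it does not check for a CUDA GPU. `check_requirements` is a hypothetical helper, not part of OpenDub.

```python
import shutil
import sys


def check_requirements() -> dict[str, bool]:
    """Report which system requirements are satisfied."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "rubberband": shutil.which("rubberband") is not None,
    }


if __name__ == "__main__":
    missing = [name for name, ok in check_requirements().items() if not ok]
    print("missing: " + ", ".join(missing) if missing else "all requirements found")
```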