Local voice cloning using Coqui TTS XTTS v2 on your Raspberry Pi CM5.
- 🎙️ Zero-shot voice cloning - Clone any voice from 6-10 seconds of audio
- 🌍 Multilingual - Supports 17+ languages
- 🔊 Integrated with Palmir hardware - Uses device microphone and speakers
- 💻 Fully local - No internet required after setup
- 🚀 Easy to use - Web GUI + CLI interface
- 🌐 Web Interface - Beautiful drag & drop web UI for easy voice cloning
The environment is already set up! Just activate it:
```bash
cd "/home/distiller/projects/voice cloning"
source venv/bin/activate
```

Launch the beautiful web interface:

```bash
./start_web.sh
```

Then open in your browser:
- Local: http://localhost:5001
- Network: http://<palmir-ip>:5001
Features:
- Drag & drop file upload
- Record directly from microphone
- Real-time voice cloning
- Audio playback & download
- File management
See WEB_GUI_README.md for details.
For CLI usage, see below or check QUICKSTART.md.
Record your voice and generate cloned speech in one command:
```bash
python voice_clone.py --mode full --text "Hello, this is my cloned voice!"
```

This will:
- Record 10 seconds of your voice as reference
- Clone your voice and generate the specified text
- Play back the result
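Under the hood, the full-mode workflow wraps Coqui TTS's XTTS v2 API. A minimal sketch of that call is below — `clone_once` is an illustrative helper, not part of `voice_clone.py`, and it assumes the `TTS` package installed in the project venv:

```python
def clone_once(reference_wav, text, out_path="output.wav", language="en"):
    """Generate `text` in the voice of `reference_wav` using XTTS v2.

    The import is deferred so this module loads even before Coqui TTS
    (a heavy torch-based dependency) is installed.
    """
    from TTS.api import TTS

    # The first call downloads ~2GB of model files; later calls use the cache.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,  # 6-10 s reference clip
        language=language,
        file_path=out_path,
    )
    return out_path
```

Recording from the microphone and playing the result back go through the Palmir SDK, as in the troubleshooting section below.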
```bash
python voice_clone.py --mode record --duration 10 --reference my_voice.wav
```

Speak naturally for the duration. The more expressive, the better!
```bash
python voice_clone.py --mode clone \
  --reference my_voice.wav \
  --text "This is the text I want to speak in the cloned voice" \
  --output output.wav
```

All options:

```bash
python voice_clone.py \
  --mode clone \
  --reference voice_reference.wav \
  --text "Your text here" \
  --output generated.wav \
  --language en \
  --no-play  # Don't auto-play the result
```

Supported languages:

- English: en
- Spanish: es
- French: fr
- German: de
- Italian: it
- Portuguese: pt
- Polish: pl
- Turkish: tr
- Russian: ru
- Dutch: nl
- Czech: cs
- Arabic: ar
- Chinese: zh-cn
- Japanese: ja
- Hungarian: hu
- Korean: ko
- Hindi: hi
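Passing an unsupported code to `--language` wastes a long model load before failing, so it can be worth validating up front. A small sketch (the codes are copied from the list above; `check_language` is an illustrative name, not part of the CLI):

```python
# XTTS v2 language codes, as listed above.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}

def check_language(code: str) -> str:
    """Return the normalized code, or raise before any model loading starts."""
    norm = code.strip().lower()
    if norm not in XTTS_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; choose one of {sorted(XTTS_LANGUAGES)}"
        )
    return norm
```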
On Raspberry Pi CM5 (ARM64, 4-core):
- First run: ~2-3 minutes (downloads models)
- Voice cloning: ~10-30 seconds per sentence
- RAM usage: ~3-4GB during inference
- Quality: Near-human naturalness
Reference audio quality:
- Use 6-10 seconds of clear speech
- Avoid background noise
- Include emotional variety if possible

Text generation:
- Shorter sentences generate faster
- Natural punctuation improves prosody
- XTTS handles complex text well

Performance:
- First generation takes longer (model loading)
- Subsequent generations are faster
- CPU mode is slower but works well
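Since shorter sentences generate faster on the CM5's CPU, one practical approach is to split a long passage into sentences and feed them to the model one at a time. A naive stdlib sketch (real text may need smarter handling of abbreviations and decimals):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split text on sentence-ending punctuation followed by whitespace.

    Feeding XTTS one sentence per call keeps per-call latency low
    and lets playback of early sentences overlap later generation.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```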
First run downloads ~2GB of models. Ensure a good internet connection:

```bash
python -c "from TTS.api import TTS; TTS('tts_models/multilingual/multi-dataset/xtts_v2')"
```

Test Palmir audio separately:

```bash
source /opt/distiller-cm5-sdk/activate.sh
python -c "from distiller_cm5_sdk.hardware.audio import Audio; a = Audio(); print('Audio OK')"
```

If you run low on memory during generation, close other applications or reduce the text length.
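If you are unsure whether the model download completed, you can check the cache on disk. Coqui TTS typically stores models under `~/.local/share/tts` on Linux — that path is an assumption; adjust it if your install differs:

```python
from pathlib import Path

def dir_size_mb(path: Path) -> float:
    """Total size in MB of all files under `path` (0.0 if the path is missing)."""
    if not path.exists():
        return 0.0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

# Assumed default Coqui TTS cache location; XTTS v2 weights land here
# after the first run, totalling roughly 2000 MB.
cache = Path.home() / ".local" / "share" / "tts"
print(f"TTS cache: {dir_size_mb(cache):.0f} MB")
```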
```bash
# Record yourself
python voice_clone.py --mode record --duration 10

# Generate speech
python voice_clone.py --mode clone \
  --text "I can now speak any text in my own voice!"

# Clone from any WAV file
python voice_clone.py --mode clone \
  --reference /path/to/audio.wav \
  --text "The quick brown fox jumps over the lazy dog"

# Generate in Spanish
python voice_clone.py --mode clone \
  --reference spanish_speaker.wav \
  --text "Hola, ¿cómo estás?" \
  --language es
```

Technical details:

- TTS Engine: Coqui TTS XTTS v2
- Backend: PyTorch (CPU mode)
- Audio I/O: Palmir SDK (ALSA)
- Sample Rate: 22050 Hz
- Format: WAV (16-bit PCM)
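Reference clips and generated files can be sanity-checked against the specs above with the stdlib `wave` module. A small sketch — `wav_info` is an illustrative helper, not part of the tool:

```python
import wave

def wav_info(path: str) -> dict:
    """Read basic WAV header fields with the stdlib `wave` module."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate": w.getframerate(),  # 22050 Hz for generated output
            "bits": 8 * w.getsampwidth(),     # 16 for 16-bit PCM
            "channels": w.getnchannels(),
            "seconds": w.getnframes() / w.getframerate(),
        }
```

For a reference recording, "seconds" should land in the 6-10 range recommended above.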
This tool uses Coqui TTS, which is open source. Check their license for commercial use.