darklite80/palmir-voice-cloning

XTTS v2 Voice Cloning for Palmir Device

Local voice cloning using Coqui TTS XTTS v2 on your Raspberry Pi CM5.

Features

  • 🎙️ Zero-shot voice cloning - Clone any voice from 6-10 seconds of audio
  • 🌍 Multilingual - Supports 17 languages (listed under Supported Languages)
  • 🔊 Integrated with Palmir hardware - Uses device microphone and speakers
  • 💻 Fully local - No internet required after setup
  • 🚀 Easy to use - Web GUI and CLI
  • 🌐 Web Interface - Beautiful drag & drop web UI for easy voice cloning

Installation

The environment is already set up! Just activate it:

cd "/home/distiller/projects/voice cloning"
source venv/bin/activate

Quick Start

Option 1: Web GUI (Recommended) 🌐

Launch the beautiful web interface:

./start_web.sh

Then open the web interface in your browser.

Features:

  • Drag & drop file upload
  • Record directly from microphone
  • Real-time voice cloning
  • Audio playback & download
  • File management

See WEB_GUI_README.md for details.

Option 2: Command Line Interface

For CLI usage, see below or check QUICKSTART.md.

CLI Usage

Quick Start (Full Pipeline)

Record your voice and generate cloned speech in one command:

python voice_clone.py --mode full --text "Hello, this is my cloned voice!"

This will:

  1. Record 10 seconds of your voice as reference
  2. Clone your voice and generate the specified text
  3. Play back the result

Step-by-Step Usage

1. Record Reference Audio

python voice_clone.py --mode record --duration 10 --reference my_voice.wav

Speak naturally for the duration. The more expressive, the better!
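
After recording, it can help to confirm that the clip has a usable length before cloning. The following is a minimal sketch using only Python's standard `wave` module; the helper names and the 6-10 second bounds (taken from the feature list) are illustrative assumptions, not part of `voice_clone.py`.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path, min_s=6.0, max_s=10.0):
    """True when the clip falls inside the suggested 6-10 second range.

    The bounds are assumptions based on the guidance in this README.
    """
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_good_reference_length("my_voice.wav")` should return `True` for a clip recorded with `--duration 10`.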

2. Clone Voice and Generate Speech

python voice_clone.py --mode clone \
  --reference my_voice.wav \
  --text "This is the text I want to speak in the cloned voice" \
  --output output.wav
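
For readers who want to call the model from Python directly, the clone step above roughly corresponds to the sketch below. It uses the Coqui `TTS.api.TTS` class and its `tts_to_file` method; the `clone_voice` helper and its defaults are hypothetical, and the actual internals of `voice_clone.py` may differ.

```python
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, reference_wav, output_wav="output.wav", language="en"):
    """Synthesize `text` in the voice of `reference_wav` (hypothetical helper)."""
    # Imported lazily so this module loads even where Coqui TTS is missing.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)  # downloads the model on first use
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,  # reference clip that defines the voice
        language=language,
        file_path=output_wav,
    )
    return output_wav
```

Loading the model is the slow part, so a longer-running script would probably create the `TTS` object once and reuse it across calls.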

Advanced Options

python voice_clone.py \
  --mode clone \
  --reference voice_reference.wav \
  --text "Your text here" \
  --output generated.wav \
  --language en \
  --no-play  # Don't auto-play the result
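
The flags used throughout this README could be declared with `argparse` roughly as follows. This is a reconstruction from the commands shown here, not the actual parser in `voice_clone.py`; every default value is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in this README (sketch)."""
    parser = argparse.ArgumentParser(description="XTTS v2 voice cloning (sketch)")
    parser.add_argument("--mode", choices=["full", "record", "clone"], required=True)
    parser.add_argument("--reference", default="voice_reference.wav")  # assumed default
    parser.add_argument("--text", help="text to speak in the cloned voice")
    parser.add_argument("--output", default="output.wav")  # assumed default
    parser.add_argument("--language", default="en")  # assumed default
    parser.add_argument("--duration", type=int, default=10)  # seconds, assumed default
    parser.add_argument("--no-play", action="store_true",
                        help="skip automatic playback of the result")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```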

Supported Languages

  • English: en
  • Spanish: es
  • French: fr
  • German: de
  • Italian: it
  • Portuguese: pt
  • Polish: pl
  • Turkish: tr
  • Russian: ru
  • Dutch: nl
  • Czech: cs
  • Arabic: ar
  • Chinese: zh-cn
  • Japanese: ja
  • Hungarian: hu
  • Korean: ko
  • Hindi: hi
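
A script that accepts a `--language` value may want to validate it against the list above before loading the model. A minimal sketch, where `SUPPORTED_LANGUAGES` and `normalize_language` are illustrative names rather than anything defined by this project:

```python
SUPPORTED_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}


def normalize_language(code):
    """Return a lowercase language code, or raise ValueError if unsupported."""
    code = code.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; choose one of "
            + ", ".join(sorted(SUPPORTED_LANGUAGES))
        )
    return code
```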

Performance

On Raspberry Pi CM5 (ARM64, 4-core):

  • First run: ~2-3 minutes (downloads models)
  • Voice cloning: ~10-30 seconds per sentence
  • RAM usage: ~3-4GB during inference
  • Quality: Near-human naturalness

Tips for Best Results

  1. Reference audio quality:

    • Use 6-10 seconds of clear speech
    • Avoid background noise
    • Include emotional variety if possible
  2. Text generation:

    • Shorter sentences generate faster
    • Natural punctuation improves prosody
    • XTTS handles complex text well
  3. Performance:

    • First generation takes longer (model loading)
    • Subsequent generations are faster
    • CPU mode is slower but works well
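
Since shorter sentences generate faster, one option is to split long text into sentences and synthesize them one at a time. The sketch below is a deliberately naive splitter (it will also break after abbreviations such as "Dr."); `split_sentences` is a hypothetical helper, not part of `voice_clone.py`.

```python
import re


def split_sentences(text):
    """Split text at whitespace that follows '.', '!' or '?'.

    Punctuation stays attached to the sentence it ends.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each returned sentence could then be passed as a separate `--text` value, at the cost of one output file per sentence.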

Troubleshooting

Model download fails

The first run downloads ~2GB of models. Make sure you have a stable internet connection, then retry the download manually:

python -c "from TTS.api import TTS; TTS('tts_models/multilingual/multi-dataset/xtts_v2')"

Audio hardware issues

Test Palmir audio separately:

source /opt/distiller-cm5-sdk/activate.sh
python -c "from distiller_cm5_sdk.hardware.audio import Audio; a = Audio(); print('Audio OK')"

Out of memory

Inference uses ~3-4GB of RAM. Close other applications or shorten the input text.
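
To check ahead of time whether enough memory is free, a script can read `MemAvailable` from `/proc/meminfo` on Linux. This is a sketch under stated assumptions: the helper names are hypothetical, and the 4 GB threshold is simply the upper end of the RAM usage listed under Performance.

```python
def mem_available_gb(meminfo_text):
    """Parse the MemAvailable line of /proc/meminfo text and return GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in KiB
            return kib / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def has_enough_memory(meminfo_text, required_gb=4.0):
    """True when at least `required_gb` GiB is available (assumed threshold)."""
    return mem_available_gb(meminfo_text) >= required_gb
```

On the device this could be called as `has_enough_memory(open("/proc/meminfo").read())` before starting a generation.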

Examples

Clone your voice

# Record yourself
python voice_clone.py --mode record --duration 10

# Generate speech
python voice_clone.py --mode clone \
  --text "I can now speak any text in my own voice!"

Use existing audio file

# Clone from any WAV file
python voice_clone.py --mode clone \
  --reference /path/to/audio.wav \
  --text "The quick brown fox jumps over the lazy dog"

Multiple languages

# Generate in Spanish
python voice_clone.py --mode clone \
  --reference spanish_speaker.wav \
  --text "Hola, ¿cómo estás?" \
  --language es

Architecture

  • TTS Engine: Coqui TTS XTTS v2
  • Backend: PyTorch (CPU mode)
  • Audio I/O: Palmir SDK (ALSA)
  • Sample Rate: 22050 Hz
  • Format: WAV (16-bit PCM)
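
To illustrate the output format listed above, the sketch below writes float samples as a 16-bit PCM WAV at 22050 Hz using only the standard library. The function name is hypothetical and the mono channel layout is an assumption; this README does not state the channel count.

```python
import struct
import wave

SAMPLE_RATE = 22050  # Hz, as listed under Architecture


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip out-of-range values, then scale to the signed 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack("<%dh" % len(clipped),
                         *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono (assumption)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
```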

License

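This tool uses Coqui TTS. The library is open source (MPL 2.0), but the XTTS v2 model weights are released under the Coqui Public Model License, which restricts commercial use. Check both licenses before any commercial use.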
