RobViren/kvoicewalk

KVoiceWalk

KVoiceWalk tries to create new Kokoro voice style tensors that clone target voices by using a random walk algorithm and a hybrid scoring method that combines Resemblyzer similarity, feature extraction, and self similarity. This is meant to be a step towards a more advanced genetic algorithm and to prove out the scoring function and general concept.

This project is only possible because of the incredible work of projects like Kokoro and Resemblyzer. I was struck by how small the Kokoro style tensors were and wondered if it would be possible to "evolve" new voice tensors more similar to target audio. The results are promising and this scoring method could be a valid option for a future genetic algorithm. I wanted more voice options for Kokoro, and now I have them.

Prerequisites

  • Python 3.10 to 3.12
  • uv installed (pip install uv)
  • For GUI mode: Python Tk support (tkinter)
  • Optional but recommended: ffmpeg for manual audio preprocessing

GPU/CUDA Prerequisites

  • NVIDIA GPU with recent NVIDIA driver (project tested on RTX 3060)
  • Driver must support modern CUDA runtime (CUDA 12.x class drivers work)
  • CUDA Toolkit is not required separately for this project; CUDA wheels are installed via Python dependencies

After uv sync, verify CUDA is available:

uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"

Example Audio

Target Audio File (Generated Using A different text to speech library)

target.mp4

The Most Similar Stock Trained Voice From Kokoro, af_heart.pt. Similarity score of 71%

baseline.mp4

KVoiceWalk Generated Voice Tensor After 10,000 steps. Similarity score of 93% (From Resemblyzer)

generated.mp4

Installation

  1. Clone this repository, change directory into it, set up the environment, and install dependencies
git clone https://github.com/RobViren/kvoicewalk.git
cd kvoicewalk
uv venv --python 3.10
source .venv/bin/activate # '.venv\Scripts\activate' if Windows
uv sync

Quick Start (Default Mode: GUI)

Launch the GUI first. This is the recommended/default workflow.

uv run gui.py

Use the GUI to queue multiple tasks, then run them sequentially with live logs.

CLI Usage (Optional)

If you prefer terminal-only runs, use main.py directly.

KVoiceWalk expects target audio files to be mono 24000 Hz WAV (around 20-30 s, single speaker). If needed, convert first (-ac 1 downmixes to mono, -ar 24000 resamples):

ffmpeg -i input_file.wav -ac 1 -ar 24000 target.wav
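
If you want to confirm a file meets these requirements before starting a long run, a small check like the one below may help. It is only a sketch using Python's standard wave module. The function name check_wav_format and the 20-30 s bounds are illustrative assumptions, not part of KVoiceWalk.

```python
import wave

EXPECTED_RATE = 24000    # Hz, per the requirement stated above
EXPECTED_CHANNELS = 1    # mono


def check_wav_format(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine.

    Hypothetical helper for illustration, not a KVoiceWalk function.
    """
    problems = []
    # The stdlib wave module only reads uncompressed PCM WAV files.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if rate != EXPECTED_RATE:
        problems.append(f"expected {EXPECTED_RATE} Hz, found {rate} Hz")
    if not 20 <= duration <= 30:
        problems.append(f"duration is {duration:.1f}s; around 20-30s is suggested")
    return problems
```

An empty list means the clip is mono, 24000 Hz, and within the suggested length. Anything else is worth fixing with the ffmpeg command above first.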

Example random walk run:

uv run main.py --target_text "The old lighthouse keeper never imagined that one day he'd be guiding ships from the comfort of his living room, but with modern technology and an array of cameras, he did just that, sipping tea while the storm raged outside and gulls shrieked overhead." --target_audio ./example/target.wav --device cuda --log_interval 100

Example test voice run:

uv run main.py --test_voice /path/to/voice.pt --target_text "Your really awesome text you want spoken"

This will generate an audio file called out.wav using the supplied *.pt file.

Interpolated Start

KVoiceWalk can interpolate around the trained voices to determine the best possible starting population of tensors, which then guides the random walk toward the target voice. Run the application as follows to run interpolation first. This does take a while, and a beefy GPU will help with processing time.

uv run main.py --target_text "The words the speaker says in the audio clip" --target_audio /path/to/target.wav --interpolate_start

This will run an interpolation search for the best voices and put them in a folder labeled interpolated which you can use as the basis for a new random walk later. It will also continue a random walk afterwards.
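
To give a feel for what an interpolation search can look like, here is a minimal sketch on plain Python lists. It assumes simple linear blending between pairs of voices over a fixed grid of weights. The function names, the weight grid, and the naming of results are all assumptions for illustration, and the actual search in KVoiceWalk may differ.

```python
def interpolate_voices(voice_a, voice_b, weight):
    """Blend two equal-length voice vectors.

    weight=0 returns voice_a, weight=1 returns voice_b.
    """
    if len(voice_a) != len(voice_b):
        raise ValueError("voice vectors must have the same length")
    return [(1.0 - weight) * a + weight * b for a, b in zip(voice_a, voice_b)]


def best_interpolations(voices, score_fn, weights=(0.1, 0.25, 0.5, 0.75, 0.9), keep=10):
    """Score every pairwise blend and return the top `keep` as (score, name, vector).

    `voices` maps a name to a vector; `score_fn` maps a vector to a number
    where higher is better. Both are stand-ins for the real tensors and scorer.
    """
    names = sorted(voices)
    candidates = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            for w in weights:
                blended = interpolate_voices(voices[name_a], voices[name_b], w)
                candidates.append((score_fn(blended), f"{name_a}_{name_b}_{w:.2f}", blended))
    # Sort on the score only, so vectors are never compared with each other.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:keep]
```

The returned top candidates play the role of the starting population that the random walk then builds on.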

Example Outputs

The closest voice in the trained models for the example/target.wav was af_heart.pt with the following stats.

af_heart.pt          Target Sim: 0.709, Self Sim: 0.978, Feature Sim: 0.47, Score: 81.22

Interpolation search gave a voice that had the following stats.

af_jessica.pt_if_sara.pt_0.10.pt Target Sim: 0.780, Self Sim: 0.973, Feature Sim: 0.34, Score: 84.20

The interpolation showed a big improvement. The population of interpolated voices is then used as the basis for standard deviation mutation of a supplied voice tensor. After 10,000 steps of random walking and replacing with the best, we get this.

Step:9371, Target Sim:0.917, Self Sim:0.971, Feature Sim:0.54, Score:92.99, Diversity:0.01

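An improvement of 13.7 percentage points in target similarity over the best interpolated voice (0.780 to 0.917), and 20.8 points over af_heart.pt (0.709), while still maintaining model stability and voice quality.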
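
The loop described above (mutate with noise scaled by the population's standard deviation, keep the candidate only if it scores better) can be sketched roughly as follows. This is a simplified illustration on plain lists with a caller-supplied score function. The names random_walk and dimension_spread and the scale parameter are assumptions, not the project's actual code.

```python
import random
import statistics


def dimension_spread(population):
    """Per-dimension standard deviation across a population of equal-length vectors."""
    return [statistics.pstdev(values) for values in zip(*population)]


def random_walk(start, population, score_fn, steps=1000, scale=0.1, seed=None):
    """Hill-climb from `start`: propose a noisy copy, keep it only if the score improves."""
    rng = random.Random(seed)
    spread = dimension_spread(population)
    best, best_score = list(start), score_fn(start)
    for _ in range(steps):
        # Noise in each dimension is proportional to how much the population varies there.
        candidate = [x + rng.gauss(0.0, 1.0) * s * scale for x, s in zip(best, spread)]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because a candidate only replaces the current best when it scores higher, the score never goes down, which is also why a run can sit on a plateau for a long time before the next improvement.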

Design

By far the hardest thing to get right was the scoring function. Earlier attempts that used Resemblyzer alone produced overfitted garbage. Self similarity was important for keeping the model producing the same-sounding output despite different input text. It represents the stability of the voice and was critical in evaluation.

But even self similarity combined with Resemblyzer target similarity was not enough. I had to add an audio feature similarity comparison to keep audio quality from degrading. Without it, the audio would pass the similarity and self similarity checks but again sound like a metal basket of tools being thrown down stairs. The feature comparison made the difference and prevented overfitting to a random sound that apparently sounded similar to the target wav file.

The other secret sauce was the harmonic mean calculation that controls the scoring. The harmonic mean allows some backsliding on self similarity, feature similarity, or target similarity so long as the overall score improves. This made exploring the space easier for the system than requiring that all three improve at every step, which led to quick and sad stagnation. I lowered the weighting on the feature similarity, since I mainly need it to prevent the voice from going completely out of bounds.
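
To make the harmonic mean idea concrete, here is a small sketch of a weighted harmonic mean over the three similarities. The weights below are placeholders chosen only to show a reduced weight on feature similarity. They are not the project's real weights, so this will not reproduce the scores listed in Example Outputs.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w) / sum(w / v). Pulled toward the smallest value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(v <= 0 for v in values):
        return 0.0  # a zero or negative component collapses the score
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def hybrid_score(target_sim, self_sim, feature_sim, weights=(1.0, 1.0, 0.5)):
    """Combine the three similarities into a 0-100 score.

    The default weights are assumed values for illustration only.
    """
    return 100.0 * weighted_harmonic_mean((target_sim, self_sim, feature_sim), weights)
```

With these placeholder weights, hybrid_score(0.80, 0.96, 0.5) comes out higher than hybrid_score(0.75, 0.98, 0.5): self similarity slipped a little, but the gain in target similarity outweighs it, so the step would still be accepted.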

Notes

This does not run in parallel, but it does return early on bad tensors. You can run multiple instances, assuming you have the GPU/CPU for it. I can run about 2 in parallel on my 3070 laptop. The results are random. Some runs lead to incredible sounding results after stagnating for a long time, and others crash and burn right away. Totally random. This is where a future genetic algorithm would be better. But the random walk proves out the theory.
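
One way to picture early returning is to evaluate the metrics in order and stop as soon as one is clearly too low, so the remaining and possibly more expensive metrics are never computed. The sketch below assumes per-metric floor thresholds. Which checks KVoiceWalk actually performs, and with what thresholds, is not stated here, so treat every name and number as hypothetical.

```python
def score_with_early_exit(metric_fns, floors, candidate):
    """Evaluate metrics in order; return None as soon as one falls below its floor.

    `metric_fns` and `floors` are parallel sequences. Both are illustrative
    stand-ins rather than the project's real metrics or thresholds.
    """
    results = []
    for metric, floor in zip(metric_fns, floors):
        value = metric(candidate)
        if value < floor:
            return None  # bad tensor: skip the remaining metrics
        results.append(value)
    return results
```

Ordering the metrics from cheapest to most expensive gets the most out of this pattern, since most rejected candidates then cost only one cheap evaluation.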

Other things you could do:

  • Populate a database with results from this and train a model to predict similarity and see if you can use that to more tightly guide voice creation
  • Use different methods for voice generation than my simple method, though PCA had some challenges
  • Implement your own genetic algorithm and evolve voice tensors instead of random walk

KVoiceWalk Features

Transcribe Start, --transcribe_start

KVoiceWalk can use Faster-Whisper to quickly convert your audio clip to text and update your --target_text. A copy of the transcription is also saved as a txt file in the ./texts folder. Txt files can be passed as the --target_text argument by path, e.g. /path/to/your/transcribed.txt. This can also be combined with --interpolate_start.

uv run main.py --target_text "This text will be replaced!" --target_audio /path/to/target.wav --transcribe_start

uv run main.py --target_text /path/to/your/transcribed.txt --target_audio /path/to/target.wav
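
Accepting either literal text or a path to a txt file in the same argument can be handled with a small resolver like the one below. This is a sketch of the general idea only. The helper name resolve_target_text and the exact rule (an existing file ending in .txt is read, anything else is used verbatim) are assumptions.

```python
from pathlib import Path


def resolve_target_text(value: str) -> str:
    """Return file contents if `value` names an existing .txt file, else the value itself.

    Hypothetical helper for illustration, not a KVoiceWalk function.
    """
    if value.lower().endswith(".txt"):
        path = Path(value)
        if path.is_file():
            return path.read_text(encoding="utf-8").strip()
    # Not an existing .txt file, so treat the argument as the spoken text.
    return value
```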

Transcribe Many, --transcribe_many

KVoiceWalk can also be used for file prep prior to your runs. With --transcribe_many, a single wav file or a folder of wav files is transcribed, and each transcription is saved as an individual txt file in the ./texts folder.

uv run main.py --target_audio /path/to/target.wav --transcribe_many

uv run main.py --target_audio /path/to/audio/Folder/ --transcribe_many
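
Handling both a single file and a folder usually comes down to expanding the argument into a list of wav paths first. The sketch below shows that step plus one plausible way to name the output txt files. The helper names and the stem-based naming are assumptions, and the real naming in ./texts may differ.

```python
from pathlib import Path


def collect_wav_files(target: str) -> list[Path]:
    """Expand a wav file path or a folder path into a sorted list of wav files."""
    path = Path(target)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".wav")
    if path.is_file() and path.suffix.lower() == ".wav":
        return [path]
    return []


def transcript_path(wav_path: Path, texts_dir: str = "./texts") -> Path:
    """Assumed naming: same stem as the audio file, with a .txt extension."""
    return Path(texts_dir) / (wav_path.stem + ".txt")
```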

Export Voices, --voices_folder and --export_bin

Voices in a folder can be exported by passing --voices_folder and --export_bin on the command line. All .pt voices in the --voices_folder /path/to/your/voices/ folder will be packaged together as 'voices.bin' and saved in the same folder.

uv run main.py --voices_folder ./voices --export_bin

All KVoiceWalk Arguments

## General Arguments
"--target_text", type=str, help="The words contained in the target audio file.
    Should be around 100-200 tokens (two sentences). Alternatively, can point to a txt file of the transcription."

"--other_text", type=str, help="A segment of text used to compare self similarity. Should be around 100-200 tokens.",
    default="If you mix vinegar, baking soda, and a bit of dish soap in a tall cylinder, the resulting eruption is both
    a visual and tactile delight, often used in classrooms to simulate volcanic activity on a miniature scale."

"--voice_folder", type=str, help="Path to the voices you want to use as part of the random walk.", default="./voices"

"--transcribe_start", help="Input: filepath to wav file. Output: transcription .txt in ./texts. Transcribes a target wav or wav folder and replaces --target_text"

"--interpolate_start", help="Goes through an interpolation search step before random walking", action='store_true'

"--population_limit", type=int, help="Limits the amount of voices used as part of the random walk", default=10

"--step_limit", type=int, help="Limits the amount of steps in the random walk", default=10000

"--output_name", type=str, help="Filename for the generated output audio", default="out.wav"

## Arguments for Random Walk mode
"--target_audio", type=str, help="Path to the target audio file. Must be 24000 Hz mono wav file."

"--starting_voice", type=str, help="Path to the starting voice tensor"

## Arguments for Test mode
"--test_voice", type=str, help="Path to the voice tensor you want to test"

## Arguments for Util mode
"--export_bin", help='Exports target voices in the --voice_folder directory', action='store_true'

"--transcribe_many", help='Input: filepath to wav file or folder\nOutput: Individualized transcriptions in ./texts folder\nTranscribes a target wav or wav folder. Replaces --target_text'

GPU / CUDA

KVoiceWalk now supports explicit device selection across Kokoro synthesis, Resemblyzer scoring, and Faster-Whisper transcription.

Use:

uv run main.py --target_text "..." --target_audio /path/to/target.wav --device cuda

Valid options are --device auto (default), --device cpu, and --device cuda. If CUDA is requested but unavailable, the app falls back to CPU.
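
The fallback behavior described above can be sketched as a small pure function. It takes the CUDA availability as a parameter so it can be read and tested without PyTorch installed. In practice that flag would come from torch.cuda.is_available(). The function name resolve_device and the choice that auto prefers CUDA when it is available are assumptions.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a --device value to the device that will actually be used.

    Hypothetical helper for illustration; "auto" preferring CUDA is an assumption.
    """
    requested = requested.lower()
    if requested not in {"auto", "cpu", "cuda"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "cpu":
        return "cpu"
    # Both "auto" and "cuda" use the GPU when present and fall back to CPU otherwise.
    return "cuda" if cuda_available else "cpu"
```

For example, resolve_device("cuda", False) returns "cpu", which matches the fallback described above.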

Recent Updates (April 2026)

  • Added a desktop GUI queue runner: uv run gui.py
  • Added queue controls for sequential task execution with live logs
  • Added --device {auto,cpu,cuda} support across CLI and GUI
  • Added --log_interval (default 100) to reduce log volume in non-interactive runs

Logging Control

For long runs launched from the GUI or any non-interactive process, progress logs are now emitted at intervals instead of every tick.

Example:

uv run main.py --target_text "..." --target_audio ./example/target.wav --device cuda --step_limit 10000 --log_interval 100

CUDA Performance Note

Observed on this project setup:

  • CUDA mode VRAM usage: about 4 GB
  • CPU run: about 26 hours
  • RTX 3060 CUDA run: about 4 hours
  • Speedup: about 6.5x

About

A random walk voice style cloning application for Kokoro text to speech
