Full GPU Rewrite, Performance Boost + misc by RobertAgee · Pull Request #7 · RobViren/kvoicewalk

RobertAgee · 2025-06-16T23:20:47Z

Full GPU Implementation and Optimization

Updated feature analysis to SOTA SpeechBrain, TorchAudio, nn.audio methods
All feature analysis and tensor operations now run on the GPU and can run in parallel
Reused/simplified tensor operations wherever possible
Added a minimum feature similarity gate (feat sim must be >= best feat sim - 0.01) before the self-sim check, to bypass expensive audio gen if feat sim regresses too much
~3x faster on RTX4070 8GB, 10-15% CPU utilization on Intel i9-14900HX
<0.75 GB allocated / 1.5 GB reserved
Patched Kokoro's memory leak (needs routine cache clearing, capped overhead memory usage)
10,000 iterations:

Random Walk Final Results for gravelierjej
Duration: 56.80 minutes
Best Voice: out/gravelierjej_jejraven_20250616_161107/gravelierjej_9550_0.42_0.48_jejraven.pt
Best Score: 0.42
Best Similarity: 0.48
Random Walk pt and wav files ---> out/gravelierjej_jejraven_20250616_161107
0it [56:48, ?it/s, GPU Stats: 0.6771GB allocated, 1.4491GB reserved
Process Times: Audio1 gen: 0.286750s, Audio2 gen: 0.222268s, Target Sim: 0.020977s,  Self Sim: 0.018055s, Feat Sim: 0.049192s, Total: 0.597364s]
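
The minimum feature similarity gate from the list above can be sketched roughly as follows. This is only an illustration, not the PR's actual code: `passes_feature_gate`, `score_candidate`, and the `expensive_self_sim` callback are hypothetical names, and only the 0.01 tolerance comes from the description.

```python
from typing import Callable, Optional


def passes_feature_gate(feat_sim: float, best_feat_sim: float,
                        tolerance: float = 0.01) -> bool:
    """Return True when feat_sim is within `tolerance` of the best seen so far."""
    return feat_sim >= best_feat_sim - tolerance


def score_candidate(feat_sim: float, best_feat_sim: float,
                    expensive_self_sim: Callable[[], float]) -> Optional[float]:
    """Run the costly self-similarity step only for candidates that pass the gate."""
    if not passes_feature_gate(feat_sim, best_feat_sim):
        return None  # rejected early, so the second audio generation is skipped
    return expensive_self_sim()


calls = []


def fake_self_sim() -> float:
    # Hypothetical stand-in for audio generation plus the self-similarity check.
    calls.append(1)
    return 0.67


rejected = score_candidate(0.25, 0.30, fake_self_sim)   # regressed too far
accepted = score_candidate(0.295, 0.30, fake_self_sim)  # within tolerance
```

Only the second candidate triggers the expensive callback, which is where the time saving would come from.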

Settings Configuration, Debug, Memory, Process Times logging, misc

set true in utilities/kvw_config.json
loads automatically at program start
lots of stuff.... sweats lol

Stuff ToDo:

Most noted in code
Revisit scoring methodologies for any new optimizations possible (penalty, weighing, etc)
- Notably SpeechBrain cosine similarity is more accurate than Resemblyzer
- Idea: Use collection of target audios compared to themselves to get average similarity, use that for confirmation (~0.80)
Clean up docstrings
Move scorevoice() -> FitnessScorer
Add more feature-wise mutation strategy
Clean up variable naming (make it more legible)
Add convenient save/reloads
Offer an option to disable checkpoint wav/pt saves (useful for early checkpoints, but a performance crippler)
Use smaller kokoro model
Diagnose where speech bottlenecks are, speed up
Clean up console printing
Consider if unifying speech/voice generators makes sense performance wise
Reduce signature objects in calls if possible
Add more GPU feature analysis
Merge some functionalities from my kokovoicelab fork
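
The to-do idea of comparing a collection of target audios against themselves could look roughly like this. It is a sketch under assumptions: the embeddings are plain lists of floats here (in the project they would presumably come from the SpeechBrain speaker encoder), and `self_similarity_baseline` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def self_similarity_baseline(embeddings: List[Sequence[float]]) -> float:
    """Average cosine similarity over every pair of clips from the same speaker."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two target clips")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings standing in for real speaker embeddings.
clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
baseline = self_similarity_baseline(clips)
```

The resulting average would then serve as the realistic ceiling for "same speaker" instead of a fixed threshold.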

Benchmark <0.75GB VRAM usage

Similarity checker

RobertAgee · 2025-06-16T23:22:51Z

Hey @RobViren Not going to push this on the main just yet, wanted to get your eyes on it and hopefully you get some time to try it out. 3x FASTER totally on GPU, and tiny footprint

RobViren · 2025-06-16T23:57:28Z

Oh dang! You've been at work on this. I had not heard of SpeechBrain. Is it still avoiding overfitting and sounding like a demon? Kudos on the better GPU usage, super impressive. Gonna run it tonight.

RobertAgee · 2025-06-17T00:15:35Z

Haha, yeah, I've noticed that you really need a voice that's already pretty close to get decent results. Resemblyzer will rate things as super similar when in reality they aren't. SpeechBrain (SB), by contrast, is much more critical: it even rates target audios by the same speaker from the same recording as only a partial match. Technically their cos sim threshold is only 0.25, though, so that's why I think benchmarking a speaker against themselves is the way to go.

Also, from working with kokolab before, I have a huge voice library and a toolbox of different voice model "surgery" methods that I think would fit nicely into a methodology here, so it's like surgery->randomwalk->surgery, repeated until it's pretty close.

Another idea is porting Kokoro to TPU, since Kokoro is by far the biggest time sink. That could give another 5x or more speed boost to search the latent space faster, plus allow batch evals to find the best directional heading for mutations.

RobViren · 2025-06-17T00:48:15Z

Yeah. I really think maintaining a map of the results would really help to guide other walks. Still believe genetic algo is the way to go. Only even remotely feasible because of the small size. TPU would be great. It is just monkeys on a keyboard trying to clone a voice.

RobertAgee · 2025-06-17T02:26:40Z

Agreed! I have some ideas for making a 'smart' randomwalk wherein it can do 3 things (together or as separate explorative modes).

A: Batch process and compare like 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those select nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning direction in batch comparison, rinse and repeat. It's like picking up on a signal without being sure where it's coming from: just keep going until it gets fainter, then reassess the next direction.

B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed, when close by go to impulse thrusters.

C: Feature-iterative randomwalk - target features in order of importance to human recognition: 1. Pitch, 2. Prosody, etc. (whatever that order might be), maximizing feature similarity for one feature at a time, then moving on to maximize the next without regressing the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.

C2: Feature-focused randomwalk - target a single feature and maximize for similarity. Create a matching voice for each feature. Then cobble them together like Frankenstein's monster.
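
Idea B (shrinking the mutation size as the voice approaches the target) could be sketched like this. It assumes a simple linear schedule; the `step_size` name and the `min_step`/`max_step` values are illustrative and not taken from the code base.

```python
def step_size(similarity: float, min_step: float = 0.01,
              max_step: float = 0.5) -> float:
    """Map target similarity in [0, 1] to a mutation scale.

    Far from the target (low similarity) the walk takes large steps;
    close to the target (high similarity) it takes small ones.
    """
    closeness = min(max(similarity, 0.0), 1.0)  # clamp out-of-range scores
    return min_step + (max_step - min_step) * (1.0 - closeness)


far_step = step_size(0.1)   # "lightspeed"
near_step = step_size(0.9)  # "impulse thrusters"
```

A non-linear schedule (for example exponential decay) would work the same way; the linear form is just the simplest thing to reason about.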

======

By the way, I just restarted my WSL instance completely, and on a fresh instance it's clocking a 7-8x speedup (as opposed to the 3x I'd thought). The audio is a little shorter, but fingers crossed the actual performance gains are higher than expected. If you get benchmarks on your system, please share!

Step:577  Target Sim:0.255 Self Sim:0.673 Feature Sim:0.296 Score:0.34 Diversity:0.10 | 577/10000 [01:10<17:52,  8.79it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved]
Process Times: [...] Sim: 0.010980s, Self Sim: 0.018089s, Feat Sim: 0.046040s, Total: 0.357781s

Step:580  Target Sim:0.261 Self Sim:0.699 Feature Sim:0.291 Score:0.34 Diversity:0.05 | 580/10000 [01:11<21:41,  7.24it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved]
Process Times: [...] Sim: 0.011041s, Self Sim: 0.014831s, Feat Sim: 0.046310s, Total: 0.345149s

Step:1228 Target Sim:0.262 Self Sim:0.689 Feature Sim:0.294 Score:0.35 Diversity:0.03 | 1228/10000 [02:19<15:05,  9.68it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved]
Process Times: [...] Sim: 0.010740s, Self Sim: 0.014637s, Feat Sim: 0.050137s, Total: 0.356808s

Step:1738 Target Sim:0.258 Self Sim:0.696 Feature Sim:0.304 Score:0.35 Diversity:0.06 | 1738/10000 [03:11<13:48,  9.97it/s, GPU Stats: 0.6759GB allocated, 1.2562GB reserved]

Random Walk Final Results for my_new_voice
Duration: 17.74 minutes
Best Voice: out/my_new_voice_tpih-78_20250616_215525/my_new_voice_5913_0.40_0.32_tpih-78.pt
Best Score: 0.40
Best Similarity: 0.32
Random Walk pt and wav files ---> out/my_new_voice_tpih-78_20250616_215525
0it [17:44, ?it/s, GPU Stats: 0.6800GB allocated, 1.2562GB reserved
Process Times: Audio1 gen: 0.090282s, Audio2 gen: 0.195020s, Target Sim: 0.010740s,  Self Sim: 0.016085s, Feat Sim: 0.052650s, Total: 0.364867s]

RobertAgee · 2025-06-17T02:49:46Z

Oh, and I should add: take the worst performers vs target_audio during the top_performers method and use them to push the starting voice strongly in the right direction. So even if there's no great matching voice (e.g. no deep masculine voices in KokoroTTS), you can still use really "off" voices to your advantage.
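
One way to read "push the starting voice away from the worst performers" is a negative blend against their centroid. The sketch below assumes voices are flat lists of floats (the real voice files are `.pt` tensors), and `push_away_from_worst` and `strength` are hypothetical names.

```python
from typing import List, Sequence


def push_away_from_worst(start: Sequence[float],
                         worst_voices: List[Sequence[float]],
                         strength: float = 0.5) -> List[float]:
    """Move `start` away from the average of the worst-scoring voices."""
    count = len(worst_voices)
    # Centroid of the worst performers, dimension by dimension.
    centroid = [sum(v[i] for v in worst_voices) / count
                for i in range(len(start))]
    # Step along the direction pointing from the centroid to the start voice.
    return [s + strength * (s - c) for s, c in zip(start, centroid)]


start_voice = [1.0, 1.0]
worst = [[0.0, 0.0], [0.0, 2.0]]  # toy stand-ins for poorly matching voices
pushed = push_away_from_worst(start_voice, worst)
```

With these toy values the centroid is `[0.0, 1.0]`, so only the first dimension moves.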

Gravel - https://voca.ro/17dMqrKJIXLR
The Narrator - https://vocaroo.com/1lS1gUoIZYRu
Narrator Lite - https://vocaroo.com/1ezeDU6Nzw9R
King Arthur - https://voca.ro/13JsCly5B1oX

Here's a map to see where they live in Kokoro latent space (right hand side):

hidoba · 2025-07-04T16:16:40Z

Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models, except there we optimize the weights.

RobertAgee · 2025-07-04T16:58:15Z

> Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models, except there we optimize the weights.

Depends on what you're trying to do (ie voice cloning vs voice crafting), and many different hammers can functionally do the same task.

tk-1001 · 2025-12-27T01:04:56Z

Every score for comparison between starting models within this main branch seems to be the same.
af_alloy.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_aoede.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_bella.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_heart.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_jessica.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_kore.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_nicole.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_nova.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_river.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_sarah.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03

b4silio · 2026-06-05T15:09:46Z

I know it's been a while for you but I've just discovered this and, congrats, this is really a beautiful and super-useful piece of software!

I ran the process on 20+ target samples the past 2 days, getting some hits and some misses, with self sim results between 0.80 and 0.90 with some lucky ones above that. I then started playing around with gradient-less methods instead of Random Walk. I ended up doing a couple of tests using BOBYQA (mostly because I'd been using it in the past, but it turns out it isn't a great choice for higher-dim noisy exploration spaces like what we have here). But I eventually landed on the CMA-ES optimizer, which has been able to land me in the 0.93-0.97 Self Sim range within 3000 steps for almost all of my 26 tests of very different voice targets, and this being much less dependent on the initial starting sample, as multiple samples tend to converge to pretty similar results. There were still some hard cases that weren't able to push much past 0.8, but clearly the results are good. Just mentioning this because your idea of an adaptive random walk is good but there are already strategies that you can plug and play that can do a lot of heavy lifting.

One bit of research that was interesting and allowed me to get consistently past the 0.90 threshold:
The target_feature_penalty uses a relative error:
penalty += abs((value - target_features[key]) / target_features[key])
When a target feature is near zero, 1/|target| explodes and a single feature dominates the entire penalty, effectively capping the score it can reach. (There's a boilerplate explanation from the AI agent about why that is, if you're interested.) I ended up dropping the target_feature_penalty and obtained results that are (numerically and perceptually) much better.
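
A small numeric illustration of the relative-error blow-up described above. The feature names and values are made up for the example; only the penalty formula mirrors the line quoted from `target_feature_penalty`.

```python
from typing import Dict


def relative_penalties(values: Dict[str, float],
                       targets: Dict[str, float]) -> Dict[str, float]:
    """Per-feature relative error, same form as the quoted penalty line."""
    return {key: abs((values[key] - targets[key]) / targets[key])
            for key in targets}


# Hypothetical features: one with a large target, one with a near-zero target.
targets = {"pitch_mean": 120.0, "spectral_flux": 0.001}
values = {"pitch_mean": 126.0, "spectral_flux": 0.011}

contrib = relative_penalties(values, targets)
total = sum(contrib.values())
# Fraction of the whole penalty caused by the near-zero-target feature.
flux_share = contrib["spectral_flux"] / total
```

Here the pitch feature is off by 5% and contributes about 0.05, while the tiny absolute miss of 0.01 on the near-zero feature contributes about 10, so it accounts for over 99% of the total penalty.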

Sadly, the code I've written is mostly Claude-coded and my coding capabilities in Python are far lower than my knowledge in ML, so I don't know to what extent you might want to integrate any of my code in your repo. But even if it never lands in the code itself (and for anyone who wants to instruct their AI agent to test something out), just know that using the CMA-ES optimizer instead of random walk converges much faster and much better than random walk on basically any test I ran.
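
For readers unfamiliar with this family of optimizers, here is a deliberately simplified population-based evolution strategy. It is not CMA-ES (there is no covariance adaptation, and the step size just decays on a fixed schedule) and it is not the code used for the results in this comment; it only shows the sample, rank, recombine loop that CMA-ES builds on. All names and the toy objective are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def simple_es(score: Callable[[Sequence[float]], float],
              start: Sequence[float], sigma: float = 0.3,
              population: int = 12, parents: int = 4,
              generations: int = 60, seed: int = 0) -> Tuple[List[float], float]:
    """Maximize `score` with a basic (mu, lambda) evolution strategy."""
    rng = random.Random(seed)
    mean = list(start)
    best, best_score = list(mean), score(mean)
    for _ in range(generations):
        # Sample a population of candidates around the current mean.
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(population)]
        candidates.sort(key=score, reverse=True)
        elite = candidates[:parents]
        # Recombine: the new mean is the average of the best candidates.
        mean = [sum(c[i] for c in elite) / parents for i in range(len(mean))]
        top_score = score(elite[0])
        if top_score > best_score:
            best, best_score = elite[0], top_score
        sigma *= 0.95  # fixed decay; CMA-ES adapts this from the search path
    return best, best_score


TARGET = [1.0, -2.0, 0.5]  # toy stand-in for "the voice we want"


def toy_score(x: Sequence[float]) -> float:
    # Higher is better; the maximum of 0 is reached at TARGET.
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))


start = [0.0, 0.0, 0.0]
best, best_score = simple_es(toy_score, start)
```

In a real run the score function would be the voice similarity score, and an existing CMA-ES implementation would likely replace this loop.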

Hope this is of some use in any case!

A direct comparison on the same samples:

Voice	Random Walk	CMA	Δ
Julie Andrews	0.93	0.98	+0.05
James Earl Jones	0.85	0.97	+0.12
Emma Thompson	0.85	0.97	+0.12
Morgan Freeman	0.88	0.97	+0.09
David Attenborough	0.89	0.97	+0.08
Barack Obama	0.95	0.97	+0.02
Carl Sagan	0.84	0.87	+0.03
Carl Sagan (better audio)	0.93	0.96	+0.03
Tom Hanks	0.90	0.98	+0.08

And the runs I'd done for all the voices (mostly to see the difference of seeds on results):

Person	Seed 1	tsim	Seed 2	tsim	Best	Δ(1−2)
alain-de-botton	bf_lily	0.940	pm_alex	0.950	0.950	−0.010
alan-rickman	bm_lewis	0.800	bf_lily	0.830	0.830	−0.030
alicia-vikander	af_aoede	0.930	af_bella	0.890	0.930	+0.040
audrey-hepburn	hf_alpha	0.960	hf_beta	0.970	0.970	−0.010
barack-obama	am_michael	0.970	am_onyx	0.960	0.970	+0.010
carl-sagan	em_santa	0.870	pm_santa	0.860	0.870	+0.010
carl-sagan (better audio)	am_onyx	0.950	am_santa	0.960	0.960	−0.010
cate-blanchett	ef_dora	0.900	pf_dora	0.890	0.900	+0.010
david-attenborough	bm_george	0.970	bm_daniel	0.950	0.970	+0.020
emma-thompson	af_nicole	0.920	bf_alice	0.970	0.970	−0.050
helen-mirren	af_kore	0.960	bf_lily	0.970	0.970	−0.010
ian-mckellen	em_santa	0.970	pm_santa	0.970	0.970	0.000
james-earl-jones	em_santa	0.970	pm_santa	0.970	0.970	0.000
jeff-hays	jm_kumo	0.960	am_puck	0.970	0.970	−0.010
jeremy-irons	im_nicola	0.960	bm_lewis	0.940	0.960	+0.020
jim-dale	im_nicola	0.960	pm_alex	0.970	0.970	−0.010
judi-dench	af_nicole	0.960	af_sky	0.970	0.970	−0.010
julie-andrews	jf_gongitsune	0.980	af_nicole	0.950	0.980	+0.030
maggie-smith	bm_daniel	0.900	zm_yunxia	0.870	0.900	+0.030
meryl-streep	af_alloy	0.860	pf_dora	0.870	0.870	−0.010
morgan2	im_nicola	0.970	hm_psi	0.960	0.970	+0.010
patrick-stewart	im_nicola	0.960	em_alex	0.970*	0.970	−0.010
sam-neill	em_santa	0.970	am_santa	0.970	0.970	0.000
stephen-fry	pm_santa	0.960	am_santa	0.960	0.960	0.000
tom-hanks	hm_psi	0.960	am_adam	0.980	0.980	−0.020
vanessa-redgrave	zm_yunxi	0.920	bm_fable	0.890	0.920	+0.030
walter-cronkite	af_sarah	0.840	bm_lewis	0.880	0.880	−0.040
--	--	--	--	--	--	--
average	--	0.936	--	0.935	0.944	+0.001

Starting seeds were chosen by measuring the similarity between the target sample and each stock voice's default generation, then picking the top 2.
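
The seed selection described here boils down to a top-k pick over per-voice similarity scores. A minimal sketch, assuming the similarities are already computed; the numbers below are toy values and `pick_seeds` is a hypothetical helper.

```python
import heapq
from typing import Dict, List


def pick_seeds(similarities: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k voice names with the highest similarity to the target."""
    return heapq.nlargest(k, similarities, key=similarities.get)


# Toy similarity of each stock voice's default generation to the target sample.
sims = {"af_bella": 0.31, "bm_lewis": 0.42, "am_onyx": 0.38, "af_sky": 0.12}
seeds = pick_seeds(sims)
```

With these toy scores the two seeds are `bm_lewis` and `am_onyx`, in that order.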

RobertAgee and others added 17 commits June 13, 2025 01:35

messy, started adding speechbrain implementation

96507c6

Create generated.mp3

aa3c0b6

upload readme audio files

7ccf4ea

Update and rename generated.mp3 to .gitkeep

a4912e8

add speechbrain to fitness_scorer for greater accuracy

ad8c9bb

GPU optimization in process

2fbc497

GPU optimization, tensor cleanup, memory use tracking

f2eede1

Further GPU optimization, memory use tracking

52dc9b6

Benchmark <0.75GB VRAM usage

Full GPU optimization, memory and timing logs, config settings

81f1ea2

messy, started adding speechbrain implementation

4675774

add speechbrain to fitness_scorer for greater accuracy

63910d3

GPU optimization in process

c7e6b6d

GPU optimization, tensor cleanup, memory use tracking

d9fcc8e

Further GPU optimization, memory use tracking

67f5f98

Benchmark <0.75GB VRAM usage

Full GPU optimization, memory and timing logs, config settings

4e67736

Merge branch 'RobViren:main' into similarity-checker

60aee79

Merge pull request #1 from RobertAgee/similarity-checker

b834cf1

Similarity checker

RobertAgee added 2 commits June 16, 2025 20:21

small lints

4535210

Merge remote-tracking branch 'origin/main'

5de2b40

# Conflicts: # utilities/fitness_scorer.py # utilities/initial_selector.py # utilities/kvoicewalk.py

RobertAgee added 3 commits June 16, 2025 21:48

Add packages to toml, update config settings

e607c0e

correct spelling in pyproject.toml

66b2688

correct spelling in pyproject.toml

cbc75f1

RobertAgee added 2 commits June 17, 2025 00:47

small bug fix for interpolate_start

f3cbf03

improve pt saving by moving to cpu first

30c1884

RobertAgee wants to merge 24 commits into RobViren:dev from RobertAgee:main

5 participants
