
Full GPU Rewrite, Performance Boost + misc #7

Open
RobertAgee wants to merge 24 commits into RobViren:dev from RobertAgee:main

Conversation

@RobertAgee

Collaborator

Full GPU Implementation and Optimization

  • Updated feature analysis to SOTA SpeechBrain, TorchAudio, and nn.audio methods
  • All feature analysis and tensor operations now run on the GPU and can run in parallel
  • Reused/simplified tensor operations wherever possible
  • Added a minimum feature similarity check (feat sim must be >= best feat sim - 0.01) that runs before the self-sim check, to bypass expensive audio gen if feat sim regresses too much
  • ~3x faster on RTX4070 8GB, 10-15% CPU utilization on Intel i9-14900HX
  • <0.75 GB allocated / 1.5 GB reserved
  • Patched Kokoro's memory leak (needs routine cache clearing, capped overhead memory usage)
  • 10,000 iterations:
Random Walk Final Results for gravelierjej
Duration: 56.80 minutes
Best Voice: out/gravelierjej_jejraven_20250616_161107/gravelierjej_9550_0.42_0.48_jejraven.pt
Best Score: 0.42
Best Similarity: 0.48
Random Walk pt and wav files ---> out/gravelierjej_jejraven_20250616_161107
0it [56:48, ?it/s, GPU Stats: 0.6771GB allocated, 1.4491GB reserved
Process Times: Audio1 gen: 0.286750s, Audio2 gen: 0.222268s, Target Sim: 0.020977s,  Self Sim: 0.018055s, Feat Sim: 0.049192s, Total: 0.597364s]
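
The minimum feature similarity gate from the list above could look roughly like the sketch below. This is an illustration only, not the code in this PR: `passes_feature_gate`, `score_candidate`, and the `expensive_self_sim` callable are hypothetical names, and only the 0.01 margin comes from the description.

```python
from typing import Callable, Optional

FEAT_SIM_MARGIN = 0.01  # margin taken from the PR description


def passes_feature_gate(feat_sim: float, best_feat_sim: float,
                        margin: float = FEAT_SIM_MARGIN) -> bool:
    """True when feat_sim is no more than `margin` below the best seen so far."""
    return feat_sim >= best_feat_sim - margin


def score_candidate(feat_sim: float, best_feat_sim: float,
                    expensive_self_sim: Callable[[], float]) -> Optional[float]:
    """Run the costly self-similarity step only for candidates that pass the gate.

    `expensive_self_sim` is a hypothetical stand-in for the extra audio
    generation and comparison; it is passed as a callable so it is never
    invoked for rejected candidates.
    """
    if not passes_feature_gate(feat_sim, best_feat_sim):
        return None  # rejected early, the expensive step never runs
    return expensive_self_sim()
```

Passing the expensive step as a callable keeps the gate cheap: a rejected candidate costs one comparison.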

Settings Configuration, Debug, Memory, Process Times logging, misc

  • set true in utilities/kvw_config.json
  • loads automatically at program start
  • lots of stuff.... sweats lol

Stuff ToDo:

  • Most noted in code
  • Revisit scoring methodologies for any new optimizations possible (penalty, weighting, etc)
    • Notably, SpeechBrain cosine similarity is more accurate than Resemblyzer
    • Idea: Compare a collection of target audios against each other to get an average similarity, and use that as the confirmation threshold (~0.80)
  • Clean up docstrings
  • Move scorevoice() -> FitnessScorer
  • Add more feature-wise mutation strategy
  • Clean up variable naming (make it more legible)
  • Add convenient save/reloads
  • Offer an option to disable checkpoint wav/pt saves (useful for early checkpoints, performance crippler)
  • Use smaller kokoro model
  • Diagnose where speech bottlenecks are, speed up
  • Clean up console printing
  • Consider if unifying speech/voice generators makes sense performance wise
  • Reduce signature objects in calls if possible
  • Add more GPU feature analysis
  • Merge some functionalities from my kokovoicelab fork

@RobertAgee

Collaborator Author

Hey @RobViren Not going to push this on the main just yet, wanted to get your eyes on it and hopefully you get some time to try it out. 3x FASTER totally on GPU, and tiny footprint

@RobViren

Owner

Oh dang! You've been at work on this. I had not heard of SpeechBrain. Is it still avoiding overfitting and sounding like a demon? Kudos on the better GPU usage, super impressive. Gonna run it tonight.

@RobertAgee

Collaborator Author

Haha, yeah I've noticed that you really need a voice that's already pretty close to get decent results. Resemblyzer will rate things as super similar when in reality they aren't. SB by contrast is much more critical, even rating target audios by the same speaker from the same recording as only a partial match, though technically their cos sim threshold is only 0.25, so that's why I think benchmarking a speaker against themselves is the way to go.
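
A rough sketch of benchmarking a speaker against themselves: average the pairwise cosine similarity across several clips of the target speaker, then read candidate scores relative to that baseline. The function names are hypothetical, and the plain lists stand in for whatever embeddings the speaker encoder returns.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def self_similarity_baseline(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity across clips of the same speaker."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def relative_to_baseline(candidate_sim: float, baseline: float) -> float:
    """Candidate similarity as a fraction of the speaker's own baseline."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return candidate_sim / baseline
```

With a baseline of 0.8, for example, a candidate at 0.4 target similarity sits at half of what the speaker scores against their own recordings.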

Also, from working with kokolab before, I have a huge voice library and a toolbox of different voice model "surgery" methods that I think would fit nicely into a methodology here, so it's like surgery->randomwalk->surgery, repeated until it's pretty close.

Another idea: porting Kokoro to TPU, as it's by far the biggest time sink. It would be possible to get maybe another 5x or more speed boost to search the latent space faster, plus do batch evals for the best directional heading for mutations.

# Conflicts:
#	utilities/fitness_scorer.py
#	utilities/initial_selector.py
#	utilities/kvoicewalk.py
@RobViren

RobViren commented Jun 17, 2025 via email

Owner

@RobertAgee

Collaborator Author

Agreed! I have some ideas for making a 'smart' randomwalk wherein it can do 3 things (together or as separate explorative modes).

A: Batch process and compare like 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those select nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning direction in batch comparison; rinse, repeat. Like picking up on a signal but not being sure where it's coming from: just keep going until it gets fainter, then reassess the next direction.

B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed; when close by, go to impulse thrusters.

C: Feature-iterative randomwalk - target features in order of human recognition importance: 1. Pitch, 2. Prosody, etc... whatever that order might be, but maximize feature similarity for one feature at a time, then move on to maximize the next without regressing on the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.

C2: Feature-focused randomwalk - target a single feature and maximize for similarity. Create a matching voice for each feature. Then cobble them together like Frankenstein's monster.
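
A toy sketch of how the batch ranking part of idea A and the shrinking step size of idea B might fit together. Everything here is an assumption for illustration: voices are flat lists of floats, `score_fn` is any callable returning a score where higher is better, and the `far`/`near` step sizes are made-up values. The direction-following and negative-blend parts of A are not covered.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def step_size(score: float, far: float = 0.5, near: float = 0.02) -> float:
    """Idea B: interpolate from `far` down to `near` as the score (clamped to 0..1) rises."""
    s = min(max(score, 0.0), 1.0)
    return far + (near - far) * s


def mutate(voice: Vector, sigma: float, rng: random.Random) -> Vector:
    """Add Gaussian noise with standard deviation `sigma` to every component."""
    return [v + rng.gauss(0.0, sigma) for v in voice]


def batch_walk(start: Vector, score_fn: Callable[[Vector], float],
               steps: int = 200, batch: int = 16,
               seed: int = 0) -> Tuple[Vector, float]:
    """Idea A (ranking part): score a batch of mutations per step and move
    to the best one only if it beats the current voice."""
    rng = random.Random(seed)
    best, best_score = list(start), score_fn(start)
    for _ in range(steps):
        sigma = step_size(best_score)
        scored = [(score_fn(c), c)
                  for c in (mutate(best, sigma, rng) for _ in range(batch))]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score
```

Because a step is accepted only when it improves the score, the returned score is never worse than the starting one; the trade-off is that `batch` evaluations are spent per accepted move.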

======

By the way, I just restarted my WSL instance completely, and on a fresh instance it's clocking a 7-8x speedup (as opposed to the 3x I'd thought). The audio is a little shorter, but fingers crossed the actual performance gains are higher than expected. If you get benchmarks on your system, please share!

Step:577  Target Sim:0.255 Self Sim:0.673 Feature Sim:0.296 Score:0.34 Diversity:0.10 | 577/10000 [01:10<17:52,  8.79it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved Sim: 0.010980s,  Self Sim: 0.018089s, Feat Sim: 0.046040s, Total: 0.357781s]
Step:580  Target Sim:0.261 Self Sim:0.699 Feature Sim:0.291 Score:0.34 Diversity:0.05 | 580/10000 [01:11<21:41,  7.24it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved Sim: 0.011041s,  Self Sim: 0.014831s, Feat Sim: 0.046310s, Total: 0.345149s]
Step:1228 Target Sim:0.262 Self Sim:0.689 Feature Sim:0.294 Score:0.35 Diversity:0.03 | 1228/10000 [02:19<15:05,  9.68it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved]]
0it [02:19, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [02:19, ?it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved Sim: 0.010740s,  Self Sim: 0.014637s, Feat Sim: 0.050137s, Total: 0.356808s]
Step:1738 Target Sim:0.258 Self Sim:0.696 Feature Sim:0.304 Score:0.35 Diversity:0.06 | 1738/10000 [03:11<13:48,  9.97it/s, GPU Stats: 0.6759GB allocated, 1.2562GB reserved]]

Random Walk Final Results for my_new_voice
Duration: 17.74 minutes
Best Voice: out/my_new_voice_tpih-78_20250616_215525/my_new_voice_5913_0.40_0.32_tpih-78.pt
Best Score: 0.40
Best Similarity: 0.32
Random Walk pt and wav files ---> out/my_new_voice_tpih-78_20250616_215525
0it [17:44, ?it/s, GPU Stats: 0.6800GB allocated, 1.2562GB reserved
Process Times: Audio1 gen: 0.090282s, Audio2 gen: 0.195020s, Target Sim: 0.010740s,  Self Sim: 0.016085s, Feat Sim: 0.052650s, Total: 0.364867s]

@RobertAgee

RobertAgee commented Jun 17, 2025

Collaborator Author

Oh, and I should add: take the worst performers vs. target_audio during the top_performers method and use them to push the starting voice strongly in the right direction. That way, even if there's no great matching voice (e.g. no deep masculine voices in KokoroTTS), you can still use really "off" voices to your advantage.
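
One way to read "push the starting voice away from the worst performers" is as a step along the direction from the worst voices' centroid to the starting voice. The sketch below assumes flat lists of floats rather than real Kokoro voice tensors, and `push_away` and `strength` are hypothetical names and values.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def push_away(start: Sequence[float], worst: Sequence[Sequence[float]],
              strength: float = 0.3) -> Vector:
    """Move `start` further along the direction pointing away from the
    centroid of the worst-scoring voices (a negative blend)."""
    centroid = mean_vector(worst)
    return [s + strength * (s - c) for s, c in zip(start, centroid)]
```

A `strength` of 0 leaves the starting voice unchanged; larger values extrapolate further from the poor matches, which may overshoot into regions that no longer sound natural.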

Gravel - https://voca.ro/17dMqrKJIXLR
The Narrator - https://vocaroo.com/1lS1gUoIZYRu
Narrator Lite - https://vocaroo.com/1ezeDU6Nzw9R
King Arthur - https://voca.ro/13JsCly5B1oX

Here's a map to see where they live in Kokoro latent space (right hand side):

[Image: voice_pca_plot (2)]

@hidoba

hidoba commented Jul 4, 2025


Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similarly to how we fine-tune other models, but there we optimize the weights.

@RobertAgee

Collaborator Author

> Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similarly to how we fine-tune other models, but there we optimize the weights.

Depends on what you're trying to do (i.e. voice cloning vs. voice crafting), and many different hammers can functionally do the same task.

@tk-1001

tk-1001 commented Dec 27, 2025


Every score for comparison between starting models within this main branch seems to be the same.
af_alloy.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_aoede.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_bella.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_heart.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_jessica.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_kore.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_nicole.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_nova.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_river.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03
af_sarah.pt Target Sim:-0.009 Self Sim:0.938 Feature Sim:0.12 Score:-0.03

@b4silio

b4silio commented Jun 5, 2026


> Agreed! I have some ideas for making a 'smart' randomwalk wherein it can do 3 things (together or as separate explorative modes).
>
> A: Batch process and compare like 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those select nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning direction in batch comparison; rinse, repeat. Like picking up on a signal but not being sure where it's coming from: just keep going until it gets fainter, then reassess the next direction.
>
> B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed; when close by, go to impulse thrusters.
>
> C: Feature-iterative randomwalk - target features in order of human recognition importance: 1. Pitch, 2. Prosody, etc... whatever that order might be, but maximize feature similarity for one feature at a time, then move on to maximize the next without regressing on the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.
>
> C2: Feature-focused randomwalk - target a single feature and maximize for similarity. Create a matching voice for each feature. Then cobble them together like Frankenstein's monster.

I know it's been a while for you but I've just discovered this and, congrats, this is really a beautiful and super-useful piece of software!

I ran the process on 20+ target samples the past 2 days, getting some hits and some misses, with self sim results between 0.80 and 0.90 with some lucky ones above that. I then started playing around with gradient-less methods instead of Random Walk. I ended up doing a couple of tests using BOBYQA (mostly because I'd been using it in the past, but it turns out it isn't a great choice for higher-dim noisy exploration spaces like what we have here). But I eventually landed on the CMA-ES optimizer, which has been able to land me in the 0.93-0.97 Self Sim range within 3000 steps for almost all of my 26 tests of very different voice targets, and this being much less dependent on the initial starting sample, as multiple samples tend to converge to pretty similar results. There were still some hard cases that weren't able to push much past 0.8, but clearly the results are good. Just mentioning this because your idea of an adaptive random walk is good but there are already strategies that you can plug and play that can do a lot of heavy lifting.
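
To show the general shape of swapping random walk for a population-based optimizer, here is a deliberately simplified (mu, lambda) evolution strategy in plain Python. It is not CMA-ES and not the code used for the results in this comment: real CMA-ES also adapts a full covariance matrix, and the names, defaults, and sigma update rule here are all assumptions.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def simple_es(x0: Vector, score_fn: Callable[[Vector], float],
              sigma: float = 0.3, popsize: int = 12, elite: int = 3,
              generations: int = 60, seed: int = 0) -> Tuple[Vector, float]:
    """Maximize `score_fn` with a basic (mu, lambda) evolution strategy.

    Each generation samples `popsize` candidates around the current mean and
    moves the mean to the average of the `elite` best. Sigma grows after a
    generation that beats the best score so far and shrinks otherwise.
    """
    if not 1 <= elite <= popsize:
        raise ValueError("elite must be between 1 and popsize")
    rng = random.Random(seed)
    mean = list(x0)
    best, best_score = list(x0), score_fn(x0)
    for _ in range(generations):
        population = [[m + rng.gauss(0.0, sigma) for m in mean]
                      for _ in range(popsize)]
        scored = sorted(((score_fn(p), p) for p in population),
                        key=lambda pair: pair[0], reverse=True)
        elites = [p for _, p in scored[:elite]]
        mean = [sum(column) / elite for column in zip(*elites)]
        if scored[0][0] > best_score:
            best_score, best = scored[0]
            sigma *= 1.1   # keep exploring while progress continues
        else:
            sigma *= 0.85  # tighten the search when a generation stalls
    return best, best_score
```

Unlike a single-candidate random walk, the mean is recombined from several good candidates each generation, which is part of why population methods tend to cope better with noisy scores.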

One bit of research that was interesting and allowed me to get consistently past the 0.90 threshold:
The target_feature_penalty uses a relative error:
penalty += abs((value - target_features[key]) / target_features[key])
When a target feature is near zero, 1/|target| explodes and a single feature dominates the entire penalty, effectively capping the score it can reach. (There's a boilerplate explanation from the AI agent about why that is, if you're interested.) I ended up dropping the target_feature_penalty and obtained results that are (numerically and perceptually) much better.
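
A small numeric illustration of that blow-up. The feature names and values are invented for the example, and `floored_penalty` is only one possible mitigation (an assumption, not what was done here, where the penalty was dropped entirely).

```python
from typing import Dict


def relative_penalty(features: Dict[str, float],
                     target: Dict[str, float]) -> Dict[str, float]:
    """Per-feature relative error, mirroring the quoted penalty line.

    Raises ZeroDivisionError if a target feature is exactly zero.
    """
    return {k: abs((features[k] - target[k]) / target[k]) for k in target}


def floored_penalty(features: Dict[str, float], target: Dict[str, float],
                    floor: float = 0.1) -> Dict[str, float]:
    """Same error, but the denominator never drops below `floor` (an assumed value)."""
    return {k: abs(features[k] - target[k]) / max(abs(target[k]), floor)
            for k in target}


# Hypothetical features: one with a large target value, one near zero.
target = {"pitch_mean": 120.0, "spectral_flatness": 0.001}
candidate = {"pitch_mean": 126.0, "spectral_flatness": 0.011}

raw = relative_penalty(candidate, target)
floored = floored_penalty(candidate, target)
print(raw)      # the near-zero feature contributes roughly 200x the pitch term
print(floored)  # with the floor, both terms stay on a comparable scale
```

An absolute miss of 0.01 on the near-zero feature outweighs a 6 Hz miss on pitch by about two orders of magnitude, which matches the description of a single feature dominating the penalty.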

Sadly, the code I've written is mostly Claude-coded, and my coding capabilities in Python are far lower than my knowledge of ML, so I don't know to what extent you might want to integrate any of my code in your repo. But even if it never lands in the code itself (and for anyone who wants to instruct their AI agent to test something out), just know that using the CMA-ES optimizer instead of random walk converged much faster and much better than random walk on basically every test I ran.

Hope this is of some use in any case!

A direct comparison on the same samples:

| Voice | Random Walk | CMA | Δ |
|---|---|---|---|
| Julie Andrews | 0.93 | 0.98 | +0.05 |
| James Earl Jones | 0.85 | 0.97 | +0.12 |
| Emma Thompson | 0.85 | 0.97 | +0.12 |
| Morgan Freeman | 0.88 | 0.97 | +0.09 |
| David Attenborough | 0.89 | 0.97 | +0.08 |
| Barack Obama | 0.95 | 0.97 | +0.02 |
| Carl Sagan | 0.84 | 0.87 | +0.03 |
| Carl Sagan (better audio) | 0.93 | 0.96 | +0.03 |
| Tom Hanks | 0.90 | 0.98 | +0.08 |

And the runs I'd done for all the voices (mostly to see the difference of seeds on results):

| Person | Seed 1 | Seed 1 tsim | Seed 2 | Seed 2 tsim | Best | Δ(1−2) |
|---|---|---|---|---|---|---|
| alain-de-botton | bf_lily | 0.940 | pm_alex | 0.950 | 0.950 | −0.010 |
| alan-rickman | bm_lewis | 0.800 | bf_lily | 0.830 | 0.830 | −0.030 |
| alicia-vikander | af_aoede | 0.930 | af_bella | 0.890 | 0.930 | +0.040 |
| audrey-hepburn | hf_alpha | 0.960 | hf_beta | 0.970 | 0.970 | −0.010 |
| barack-obama | am_michael | 0.970 | am_onyx | 0.960 | 0.970 | +0.010 |
| carl-sagan | em_santa | 0.870 | pm_santa | 0.860 | 0.870 | +0.010 |
| carl-sagan (better audio) | am_onyx | 0.950 | am_santa | 0.960 | 0.960 | −0.010 |
| cate-blanchett | ef_dora | 0.900 | pf_dora | 0.890 | 0.900 | +0.010 |
| david-attenborough | bm_george | 0.970 | bm_daniel | 0.950 | 0.970 | +0.020 |
| emma-thompson | af_nicole | 0.920 | bf_alice | 0.970 | 0.970 | −0.050 |
| helen-mirren | af_kore | 0.960 | bf_lily | 0.970 | 0.970 | −0.010 |
| ian-mckellen | em_santa | 0.970 | pm_santa | 0.970 | 0.970 | 0.000 |
| james-earl-jones | em_santa | 0.970 | pm_santa | 0.970 | 0.970 | 0.000 |
| jeff-hays | jm_kumo | 0.960 | am_puck | 0.970 | 0.970 | −0.010 |
| jeremy-irons | im_nicola | 0.960 | bm_lewis | 0.940 | 0.960 | +0.020 |
| jim-dale | im_nicola | 0.960 | pm_alex | 0.970 | 0.970 | −0.010 |
| judi-dench | af_nicole | 0.960 | af_sky | 0.970 | 0.970 | −0.010 |
| julie-andrews | jf_gongitsune | 0.980 | af_nicole | 0.950 | 0.980 | +0.030 |
| maggie-smith | bm_daniel | 0.900 | zm_yunxia | 0.870 | 0.900 | +0.030 |
| meryl-streep | af_alloy | 0.860 | pf_dora | 0.870 | 0.870 | −0.010 |
| morgan2 | im_nicola | 0.970 | hm_psi | 0.960 | 0.970 | +0.010 |
| patrick-stewart | im_nicola | 0.960 | em_alex | 0.970* | 0.970 | −0.010 |
| sam-neill | em_santa | 0.970 | am_santa | 0.970 | 0.970 | 0.000 |
| stephen-fry | pm_santa | 0.960 | am_santa | 0.960 | 0.960 | 0.000 |
| tom-hanks | hm_psi | 0.960 | am_adam | 0.980 | 0.980 | −0.020 |
| vanessa-redgrave | zm_yunxi | 0.920 | bm_fable | 0.890 | 0.920 | +0.030 |
| walter-cronkite | af_sarah | 0.840 | bm_lewis | 0.880 | 0.880 | −0.040 |
| average | -- | 0.936 | -- | 0.935 | 0.944 | 0.001 |

Starting seeds were chosen by running similarity measure on the sample vs default sample generation, and picking the top 2

[Image]


5 participants