The generation stage synthesizes positive and negative audio clips using a pluggable TTS backend (default: Piper VITS with SLERP speaker blending, or VoxCPM2 voice design) and phoneme-based adversarial phrase generation. Select the engine with tts_backend in the YAML config (piper_vits or voxcpm). Multilingual wake words require tts_backend: voxcpm. Piper follows a single checkpoint locale (bundled default is English–US). See configs/test_voxcpm.yaml and configs/prod_voxcpm.yaml for Chinese (你好 livekit) examples.
Source: src/livekit/wakeword/data/generate.py, src/livekit/wakeword/data/piper/synthesis.py, src/livekit/wakeword/data/tts/
CLI: livekit-wakeword generate <config>
System dependency: espeak-ng must be installed for phonemization (brew install espeak-ng on macOS, apt install espeak-ng on Linux).
Target phrases (e.g., "hey jarvis")
│
├──► TTS backend ──► Positive clips (.wav)
│ (Piper VITS+SLERP or VoxCPM2 voice design)
│
├──► Adversarial phrase generation
│ (phoneme substitution via CMU dict)
│ │
│ └──► TTS backend ──► Negative clips (.wav)
│
└──► Background noise sampling ──► Background clips (.wav)
(random slicing + tiling from background_paths)
| Split | Count | Config Field |
|---|---|---|
| Positive train | 10,000 | n_samples |
| Positive test | 2,000 | n_samples_val |
| Negative train | 10,000 | n_samples |
| Negative test | 2,000 | n_samples_val |
| Background train | 200 | n_background_samples |
| Background test | 40 | n_background_samples_val |
livekit-wakeword setup installs shared data (features, RIRs, backgrounds) under a root data_dir. Backend-specific TTS weights (Piper checkpoint or VoxCPM snapshot) are downloaded only for the backend your config selects.
Pass your wake word YAML so setup aligns with generation:
data_dirfrom the config is the root for all downloads (features, RIRs, backgrounds, Piper checkpoint, VoxCPM snapshot).- Piper is downloaded only if
tts_backendispiper_vits. VoxCPM weights are fetched withsnapshot_downloadonly iftts_backendisvoxcpmand the configured target directory is missing or empty (see VoxCPM2 below). Otherwise TTS-specific downloads are skipped. - Checkpoint path:
piper_tts.checkpoint_relpathis the path to the.ptstate_dict relative todata_dir. The matching JSON config is alwayssame_basename.jsonnext to that file. Defaults topiper/en-us-libritts-high.pt.
Example:
livekit-wakeword setup --config configs/prod.yamlIf you omit --config, setup uses --data-dir (default ./data) and always downloads the default Piper bundle to piper/en-us-libritts-high.pt under that root. This is handy when you do not have a YAML yet. Prefer --config for projects that use a non-default data_dir, a custom piper_tts.checkpoint_relpath, or VoxCPM (VoxCPM is not downloaded without --config).
synthesize_clips() generates speech clips using a VITS model with SLERP (Spherical Linear Interpolation) across 904 speaker embeddings. This approach, adapted from dscripka/piper-sample-generator, creates thousands of unique synthetic voices from speaker pairs.
The VITS model contains 904 speaker embeddings. For each batch:
- Speaker pairs are cycled through all
(i, j)combinations of speaker IDs - SLERP interpolates the two speaker embeddings at each configured weight (e.g., 0.2 = close to speaker 1, 0.5 = midpoint, 0.8 = close to speaker 2)
- The blended embedding is used for inference, producing a voice that sounds like neither original speaker. Multiple weights generate more diverse voices from each pair
- Audio is resampled from 22050 Hz to 16000 Hz and silence-trimmed via WebRTC VAD
With 904 speakers, there are ~409,000 unique speaker pairs, each producing distinct vocal characteristics.
SLERP uses per-element fallback: when a speaker pair's embeddings are nearly parallel (dot product > 0.9995), only that pair falls back to linear interpolation; other pairs in the batch still use true SLERP. Dot products are clamped to [-1, 1] to prevent NaN from acos on float precision overflow.
Each clip is synthesized with a combination of:
| Parameter | Default Values | Description |
|---|---|---|
noise_scales |
[0.98] |
Overall speech variability |
noise_scale_ws |
[0.98] |
Phoneme duration variability |
length_scales |
[0.75, 1.0, 1.25] |
Speaking rate (slow/normal/fast) |
slerp_weights |
[0.2, 0.35, 0.5, 0.65, 0.8] |
Speaker interpolation weights (0=speaker1, 1=speaker2) |
max_speakers |
null |
Cap on speaker IDs (null = all 904) |
The Cartesian product of slerp_weights, length_scales, noise_scales, and noise_scale_ws creates multiple prosody variations. Speaker pairs and settings are cycled until n_samples clips are generated.
The default en-us-libritts-high.pt VITS checkpoint (~166 MB) and its .json config are downloaded by livekit-wakeword setup (see Setup and TTS artifacts). Paths follow piper_tts.checkpoint_relpath under data_dir. The model must be present for piper_vits. Generation raises FileNotFoundError if the checkpoint is missing instead of emitting silent placeholders.
Generated audio is silence-trimmed via WebRTC VAD. If the VAD strips too aggressively (result shorter than 150ms of content beyond the leading samples), the original untrimmed audio is kept instead.
VoxCPM2 is used in voice design mode only: each clip uses a text persona description plus the wake phrase (no reference-audio cloning in this integration). The Python package is optional: install with uv sync --extra train --extra voxcpm (upstream recommends PyTorch ≥ 2.5; see their docs for CUDA).
Weights on disk: livekit-wakeword setup --config your.yaml runs snapshot_download(repo_id=voxcpm_tts.model_id, ...) into voxcpm_local_model_path. By default this is data_dir/voxcpm/VoxCPM2 (voxcpm_tts.model_cache_relpath), or voxcpm_tts.local_model_path if set (relative to data_dir or absolute). If that directory is already non-empty, setup skips the download (e.g. you prefetched or copied weights there).
Diversification: Defaults cover many voice_design_prompts × cfg_values × inference_timesteps_list (see VoxCpmTtsConfig in config.py). Clip i cycles through that Cartesian product so resumes stay aligned with start_index. Output is 16 kHz clip_%06d.wav (model native rate is resampled with librosa).
generate_adversarial_phrases() creates phonetically similar but incorrect phrases to train the model to reject near-misses. The resulting phrases are fed back through the active TTS backend to produce negative clips.
- Load the CMU Pronouncing Dictionary via NLTK
- For each target phrase:
- Expand unknown words: Words not in CMUDict are split into known subwords (e.g.,
"livekit"→["live", "kit"]). The split tries all positions and prefers the longest left match. This enables phoneme substitutions on made-up/compound words that CMUDict doesn't contain. - Get the phoneme sequence for each word (with regex stress wildcards on vowels)
- Generate regex patterns by replacing 1 to
max_replacephonemes (default:len(phones) - 2) with a wildcard(.){1,3} - Search CMUDict with each regex pattern via
pronouncing.search()to find phonetically similar words - Exclude homophones (same pronunciation = not adversarial)
- With probability
include_partial_phrase(default: 1.0), generate all partial phrases (each word removed in turn) - Include individual words with probability
include_input_words(default: 0.2)
- Expand unknown words: Words not in CMUDict are split into known subwords (e.g.,
- Remove exact target phrases from the adversarial list (safety filter)
- Deduplicate and shuffle. When
n_phrasesisNone(default), all unique phrases are returned with no cap.
For each word, phonemes are replaced with a broad regex wildcard (.){1,3} that matches any 1-3 phoneme characters. All combinations of 1 to max_replace replacement positions are tried, generating patterns that range from single-phoneme swaps (close neighbors) to multi-phoneme replacements (more distant matches). This is the same approach used by openWakeWord: broad enough to catch phonetically similar words without requiring a hand-curated substitution map.
Additional negative phrases can be specified via custom_negative_phrases in the config. These are appended to the auto-generated adversarial phrases before synthesis.
_generate_background_clips() samples clips from the raw background audio in augmentation.background_paths and writes them as .wav files into background_train/ and background_test/.
Each clip is generated by:
- Pick a random source file from the background audio pool
- Tile short files: if the source is shorter than
clip_duration, it is extended by concatenating segments with random roll offsets and 50% chance of reversal per segment, breaking audible periodicity - Random slice: pick a random start offset into the (possibly tiled) audio and extract exactly
clip_durationsamples
This ensures every clip sounds different even from the same source file. Set n_background_samples: 0 and n_background_samples_val: 0 to disable background noise training entirely.
Background clips go through the same augmentation pipeline as TTS clips (RIR, EQ, distortion, and additional background noise mixing), so the model sees realistic acoustic conditions on pure noise.
Errors during TTS synthesis are handled per-batch, not per-pipeline. If a batch fails (e.g., espeak-ng can't phonemize a phrase), that batch is skipped with a warning and generation continues. The pipeline raises RuntimeError if:
- 5 consecutive batches fail (indicates a systemic problem)
- Zero clips are generated (total failure)
After generation, the output directory contains:
output/<model_name>/
├── positive_train/ # Positive training clips (.wav)
├── positive_test/ # Positive validation clips (.wav)
├── negative_train/ # Adversarial negative training clips (.wav)
├── negative_test/ # Adversarial negative validation clips (.wav)
├── background_train/ # Background noise training clips (.wav)
└── background_test/ # Background noise validation clips (.wav)
Synthetic speech is pluggable so you can swap Piper VITS for another engine (e.g. cloud or on-device models). Follow this checklist.
- Add a new member to
TtsBackendinconfig.py(e.g.qwen_tts = "qwen_tts"). - Add any engine-specific fields: either nested models on
WakeWordConfig(pattern:PiperTtsConfig+piper_tts) or top-level keys, and document them in YAML. - Put backend defaults that
config.pymust import in a new module underdefaults/(e.g.defaults/qwen_tts.py), not underlivekit.wakeword.data, so you avoid a circular import (config→datapackage → modules that importconfig). Mirror the existing layout: per-backend module, noDEFAULT_prefix on the constants since the module path already qualifies them (e.g.defaults.piper.CHECKPOINT_RELPATH,defaults.voxcpm.MODEL_ID).
-
Add a module under
data/(e.g.data/qwen_tts/) with a class that satisfiesSpeechSynthesizerintts/backends.py:validate_artifacts(): ensure required weights, credentials, or binaries exist; raiseFileNotFoundErrorwith a clear message if not.synthesize_clips(phrases, output_dir, n_samples, *, start_index, batch_size): writeclip_%06d.wavat 16 kHz, honoringstart_indexfor resume the same way Piper does.
-
Put text normalization and voice / speaker diversification inside the backend (not in
run_generate). Piper uses CMUDict + SLERP; another engine should apply its own diversity strategy so training clips are not single-timbre.
- In
get_tts_backend()intts/backends.py, map yourTtsBackendvalue to your class (typicallyYourBackend.from_config(config)so hyperparameters are captured at construction time).
- If the engine needs downloaded assets, extend
setupincli.py: when--configis passed, branch onconfig.tts_backendand download only what that backend needs, using paths from the config (same pattern as Piper +piper_checkpoint_path).
- Add unit tests for config loading and, where feasible, artifact validation.
- Document YAML keys and setup steps in this file.