OpenAI-compatible speech server — any STT/TTS provider, one container.
Open Speech is a self-hosted speech server that speaks the OpenAI API. Plug in any STT or TTS backend, swap models at runtime, and hit the same endpoints your apps already use. One Docker image, CPU or GPU, no vendor lock-in.
🎙️ Speech-to-Text

- OpenAI-compatible `/v1/audio/transcriptions` and `/v1/audio/translations`
- Real-time streaming via WebSocket (`/v1/audio/stream`)
- Silero VAD — only transcribe when someone's talking
- SRT/VTT subtitle output
- Optional speaker diarization (`?diarize=true`) via pyannote
- Optional input audio preprocessing (noise reduction + normalization)
🔊 Text-to-Speech

- OpenAI-compatible `/v1/audio/speech`
- Streaming TTS with chunked transfer
- 50+ voices across backends
- Voice blending — mix voices with `af_bella(2)+af_sky(1)` syntax
- TTS response caching (configurable disk LRU)
- Pronunciation dictionary + SSML subset (`input_type=ssml`)
- Output postprocessing (silence trim + normalize)
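The blending syntax above is easy to parse client-side. A minimal sketch (illustrative only, not the server's actual parser):

```python
import re

def parse_voice_blend(spec: str) -> dict:
    """Parse 'af_bella(2)+af_sky(1)' into normalized voice weights."""
    weights = {}
    for part in spec.split("+"):
        m = re.fullmatch(r"\s*([\w-]+)(?:\((\d+(?:\.\d+)?)\))?\s*", part)
        if not m:
            raise ValueError(f"bad voice term: {part!r}")
        # Missing weight defaults to 1, matching plain 'af_heart' usage.
        weights[m.group(1)] = float(m.group(2) or 1.0)
    total = sum(weights.values())
    return {voice: w / total for voice, w in weights.items()}
```

A plain voice name parses as a single voice with weight 1.0, so the same helper handles both forms.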
🧩 Qwen3-TTS Deep Integration (Phase 7a)

- Official `qwen-tts` backend integration (Qwen3TTSModel)
- Three-model auto-selection per request: `CustomVoice` for named premium speakers, `VoiceDesign` for instruction-only voice creation, `Base` for voice cloning from reference audio
- 9 premium speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee
- Instruction control via `voice_design` field
- In-memory reusable clone-prompt cache (`create_voice_clone_prompt()`)
- On-demand model loading only (no startup model load)
- Practical GPU note: 8GB VRAM can run one 1.7B model comfortably; loading two 1.7B models is usually tight
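The three-way selection above can be sketched as a dispatch over request fields documented later (`voice_design`, `reference_audio`); the server's actual rule may differ in edge cases:

```python
PREMIUM_SPEAKERS = {
    "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
    "Ryan", "Aiden", "Ono_Anna", "Sohee",
}

def select_qwen3_model(voice, voice_design, reference_audio):
    """Pick the Qwen3-TTS variant a request would route to."""
    # Reference audio wins: cloning always needs the Base model.
    if reference_audio:
        return "Base"
    # An instruction without a premium speaker means voice creation.
    if voice_design and voice not in PREMIUM_SPEAKERS:
        return "VoiceDesign"
    return "CustomVoice"
```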
🧠 Model Management
- Nothing baked in — models download at runtime
- Unified model browser in the web UI
- Load, unload, hot-swap models via API
- TTL eviction and LRU lifecycle
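The TTL + LRU lifecycle can be pictured with a small sketch, assuming `OS_MODEL_TTL` governs idle eviction and `OS_MAX_LOADED_MODELS` caps the pool (the server's real registry is more involved):

```python
import time
from collections import OrderedDict

class ModelPool:
    """Track model usage; evict idle models and enforce an LRU cap."""

    def __init__(self, ttl=300.0, max_loaded=0):
        self.ttl, self.max_loaded = ttl, max_loaded
        self._last_used = OrderedDict()  # model id -> last-used timestamp

    def touch(self, model, now=None):
        """Mark `model` used; return models evicted to stay under the cap."""
        now = time.monotonic() if now is None else now
        self._last_used[model] = now
        self._last_used.move_to_end(model)
        evicted = []
        while self.max_loaded and len(self._last_used) > self.max_loaded:
            victim, _ = self._last_used.popitem(last=False)  # least recent
            evicted.append(victim)
        return evicted

    def expire(self, now=None):
        """Drop and return models idle longer than the TTL."""
        now = time.monotonic() if now is None else now
        stale = [m for m, t in self._last_used.items() if now - t > self.ttl]
        for m in stale:
            del self._last_used[m]
        return stale
```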
🔌 OpenAI Realtime API
- Drop-in replacement for OpenAI's `/v1/realtime` WebSocket endpoint
- Audio I/O only — STT + TTS, bring your own LLM
- Server VAD mode with Silero VAD for automatic speech detection
- Audio format negotiation: pcm16, g711_ulaw, g711_alaw
- Works with existing OpenAI Realtime API client libraries
🐳 Deployment
- Single image for CPU + GPU (NVIDIA CUDA)
- Web UI with light/dark mode at `/web`
- Self-signed HTTPS out of the box
- API key auth, rate limiting, CORS
```bash
docker run -d -p 8100:8100 jwindsor1/open-speech:cpu
```

Open https://localhost:8100/web — accept the self-signed cert, and you're in.
For GPU:

```bash
docker run -d -p 8100:8100 --gpus all jwindsor1/open-speech:latest
```
```bash
git clone https://github.com/will-assistant/open-speech.git
cd open-speech

pip install -e .              # Core (faster-whisper STT + Kokoro TTS)
pip install -e ".[piper]"     # + Piper TTS
pip install -e ".[qwen]"      # + Qwen3-TTS deep integration (qwen-tts)
pip install -e ".[all]"       # All core backends (keeps heavy optional extras separate)
pip install -e ".[diarize]"   # + Speaker diarization (pyannote)
pip install -e ".[noise]"     # + Noise reduction preprocessing
pip install -e ".[client]"    # + Python client SDK deps
pip install -e ".[dev]"       # Development tools (pytest, ruff, httpx)

pip install -r requirements.lock  # Reproducible pinned core dependencies
```

All config via environment variables: `OS_` for server, `STT_` for speech-to-text, `TTS_` for text-to-speech.
| Variable | Default | Description |
|---|---|---|
| `OS_HOST` | `0.0.0.0` | Bind address |
| `OS_PORT` | `8100` | Listen port |
| `OS_API_KEY` | | API key (empty = no auth) |
| `OS_CORS_ORIGINS` | `*` | Comma-separated CORS origins |
| `OS_SSL_ENABLED` | `true` | Enable HTTPS |
| `OS_MODEL_TTL` | `300` | Idle seconds before auto-unload |
| `OS_MAX_LOADED_MODELS` | `0` | Max models in memory (0 = unlimited) |
| `OS_RATE_LIMIT` | `0` | Requests/min per IP (0 = off) |
| `OS_MAX_UPLOAD_MB` | `100` | Max upload size (MB) |
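A wrapper script can read the same variables with the documented defaults; a hedged sketch (the server's own loader lives in src/config.py and may behave differently):

```python
import os

def env_str(name, default):
    """Read a string setting, falling back to the documented default."""
    return os.environ.get(name, default)

def env_bool(name, default):
    """Accept common truthy spellings: 1, true, yes."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in {"1", "true", "yes"}

def env_csv(name, default):
    """Split comma-separated values like OS_CORS_ORIGINS."""
    return [v.strip() for v in env_str(name, default).split(",") if v.strip()]

port = int(env_str("OS_PORT", "8100"))
ssl_enabled = env_bool("OS_SSL_ENABLED", True)
cors_origins = env_csv("OS_CORS_ORIGINS", "*")
```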
| Variable | Default | Description |
|---|---|---|
| `STT_MODEL` | `deepdml/faster-whisper-large-v3-turbo-ct2` | Default STT model |
| `STT_DEVICE` | `cuda` | `cuda` or `cpu` |
| `STT_COMPUTE_TYPE` | `float16` | `float16`, `int8`, `int8_float16` |
| `STT_PRELOAD_MODELS` | | Comma-separated models to preload |
| Variable | Default | Description |
|---|---|---|
| `TTS_ENABLED` | `true` | Enable TTS endpoints |
| `TTS_MODEL` | `kokoro` | Default TTS model |
| `TTS_VOICE` | `af_heart` | Default voice |
| `TTS_SPEED` | `1.0` | Default speed |
| `TTS_DEVICE` | (inherits STT) | `cuda` or `cpu` |
| `TTS_MAX_INPUT_LENGTH` | `4096` | Max input text length |
| `TTS_VOICES_CONFIG` | | Path to voice presets YAML |
| `TTS_QWEN3_SIZE` | `1.7B` | Qwen3 model size: `1.7B` or `0.6B` |
| `TTS_QWEN3_FLASH_ATTN` | `false` | Enable flash-attention-2 when installed |
| `TTS_QWEN3_DEVICE` | `cuda:0` | Device override for qwen-tts model loading |
Backwards compatibility: old env var names (`STT_PORT`, `STT_HOST`, etc.) still work but log deprecation warnings.
| Variable | Default | Description |
|---|---|---|
| `STT_VAD_ENABLED` | `true` | Enable Silero VAD for speech detection |
| `STT_VAD_THRESHOLD` | `0.5` | Speech probability threshold (0.0–1.0) |
| `STT_VAD_MIN_SPEECH_MS` | `250` | Minimum speech duration before triggering |
| `STT_VAD_SILENCE_MS` | `800` | Silence duration to trigger speech_end |
VAD (Voice Activity Detection) uses the Silero VAD ONNX model (<2MB) to detect when someone is speaking. When enabled:
- WebSocket streaming only forwards speech to the STT backend, saving compute
- VAD events (`speech_start` / `speech_end`) are sent to WebSocket clients
- Web UI mic shows visual indicators for speech detection state
- Wyoming protocol filters silence before transcription
The VAD model downloads automatically on first use. It runs on CPU with negligible overhead (<5% CPU).
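The threshold and duration settings combine into a small state machine. The sketch below (illustrative only, not Silero's or the server's code) shows how `speech_start`/`speech_end` events could be derived from per-frame speech probabilities:

```python
def vad_events(probs, frame_ms=32, threshold=0.5,
               min_speech_ms=250, silence_ms=800):
    """Turn per-frame speech probabilities into start/end events."""
    events, speaking, run_ms = [], False, 0
    for i, p in enumerate(probs):
        voiced = p >= threshold
        if not speaking:
            # Count consecutive voiced frames until min_speech_ms is met.
            run_ms = run_ms + frame_ms if voiced else 0
            if run_ms >= min_speech_ms:
                speaking, run_ms = True, 0
                events.append(("speech_start", i * frame_ms))
        else:
            # Count consecutive silent frames until silence_ms is met.
            run_ms = run_ms + frame_ms if not voiced else 0
            if run_ms >= silence_ms:
                speaking, run_ms = False, 0
                events.append(("speech_end", i * frame_ms))
    return events
```

With these defaults, a brief cough below `min_speech_ms` never triggers a start, and short pauses below `silence_ms` don't split an utterance.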
WebSocket VAD query parameter:

```text
ws://host:8100/v1/audio/stream?vad=true    # force enable
ws://host:8100/v1/audio/stream?vad=false   # force disable
ws://host:8100/v1/audio/stream             # use STT_VAD_ENABLED default
```
| Variable | Default | Description |
|---|---|---|
| `OS_WYOMING_ENABLED` | `false` | Enable Wyoming TCP server |
| `OS_WYOMING_PORT` | `10400` | Wyoming protocol port |
Open Speech supports the Wyoming protocol, making it a drop-in STT + TTS provider for Home Assistant.
Enable it:

```bash
OS_WYOMING_ENABLED=true
OS_WYOMING_PORT=10400  # default
```

Home Assistant setup: add to your Home Assistant configuration.yaml:

```yaml
wyoming:
  - host: "YOUR_OPEN_SPEECH_IP"
    port: 10400
```

Or add via Settings → Devices & Services → Add Integration → Wyoming Protocol and enter your Open Speech server's IP and port 10400.
Once connected, Open Speech appears as both an STT and TTS provider in your voice pipeline configuration.
| Variable | Default | Description |
|---|---|---|
| `OS_REALTIME_ENABLED` | `true` | Enable /v1/realtime WebSocket endpoint |
Open Speech implements the OpenAI Realtime API WebSocket protocol for audio I/O only (STT + TTS). Any app built for OpenAI's voice mode works as a drop-in replacement — just point it at your Open Speech server.
We handle: audio input → transcription, text → speech output, VAD-based turn detection. We don't handle: LLM conversation, function calling, text generation. Bring your own brain.
Usage example (Python):

```python
import asyncio, base64, json, websockets

async def main():
    uri = "ws://localhost:8100/v1/realtime?model=whisper-large-v3-turbo"
    async with websockets.connect(uri, subprotocols=["realtime"]) as ws:
        event = json.loads(await ws.recv())  # session.created
        print(f"Session: {event['session']['id']}")

        # Send audio (PCM16, 24kHz, mono)
        with open("audio.raw", "rb") as f:
            audio = f.read()
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                print(f"Transcript: {event['transcript']}")
                break

asyncio.run(main())
```

Drop-in replacement for OpenAI clients — just change the base URL:

```python
# Point any OpenAI Realtime client at Open Speech
REALTIME_URL = "ws://localhost:8100/v1/realtime"
```

Models are not baked into the image — they download on first use and persist in the Docker volume.
| Model | Size | Backend | Languages |
|---|---|---|---|
| `deepdml/faster-whisper-large-v3-turbo-ct2` | ~800MB | faster-whisper | 99+ |
| `Systran/faster-whisper-large-v3` | ~1.5GB | faster-whisper | 99+ |
| `Systran/faster-whisper-medium` | ~800MB | faster-whisper | 99+ |
| `Systran/faster-whisper-small` | ~250MB | faster-whisper | 99+ |
| `Systran/faster-whisper-base` | ~150MB | faster-whisper | 99+ |
| `Systran/faster-whisper-tiny` | ~75MB | faster-whisper | 99+ |
| Model | Size | Backend | Voices |
|---|---|---|---|
| `kokoro` | ~82MB | Kokoro | 52 voices, blending |
| `pocket-tts` | ~220MB | Pocket TTS | 8 built-in voices, streaming |
| `piper/en_US-lessac-medium` | ~35MB | Piper | 1 |
| `piper/en_US-joe-medium` | ~35MB | Piper | 1 |
| `piper/en_US-amy-medium` | ~35MB | Piper | 1 |
| `piper/en_US-arctic-medium` | ~35MB | Piper | 1 |
| `piper/en_GB-alan-medium` | ~35MB | Piper | 1 |
| `qwen3-tts/0.6B-CustomVoice` | ~1.2GB | Qwen3-TTS | 4 + voice design |
| `qwen3-tts/1.7B-CustomVoice` | ~3.4GB | Qwen3-TTS | 4 + voice design |
Switch models by changing STT_MODEL / TTS_MODEL and restarting, or use the API:

```bash
curl -sk -X POST https://localhost:8100/api/models/Systran%2Ffaster-whisper-small/load
```

All endpoints return JSON unless noted. Authentication via `Authorization: Bearer <key>` header when OS_API_KEY is set. Interactive docs at /docs (Swagger UI).
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/transcriptions | Transcribe audio to text (OpenAI-compatible) |
| POST | /v1/audio/translations | Translate audio to English text (OpenAI-compatible) |
| WS | /v1/audio/stream | Real-time streaming transcription via WebSocket |
| GET | /v1/audio/stream | Returns 426 — instructs HTTP clients to use WebSocket |
POST /v1/audio/transcriptions
Multipart form upload. Returns transcription in the requested format.
Params (form fields): file (required), model, language, prompt, response_format (json|text|verbose_json|srt|vtt), temperature, diarize (bool)
```bash
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
  -F "response_format=json"
```

Response (json): { "text": "..." }
Response (verbose_json): { "text", "segments": [{ "start", "end", "text" }], "language" }
Response (diarize=true): { "text", "segments": [{ "speaker", "start", "end", "text" }] }
POST /v1/audio/translations
Translates non-English audio to English. Same multipart interface as transcriptions.
Params: file (required), model, prompt, response_format, temperature
Response: { "text": "..." }
WS /v1/audio/stream
Real-time streaming STT via WebSocket. Send raw audio chunks, receive transcript events.
Query params: model, language, sample_rate (default 16000), encoding (pcm_s16le), interim_results (bool), endpointing (ms), vad (bool)
```js
const ws = new WebSocket("wss://localhost:8100/v1/audio/stream?vad=true");
ws.onmessage = (e) => console.log(JSON.parse(e.data));
ws.send(audioChunkArrayBuffer);  // PCM16 LE mono
```

Events: { "type": "transcript", "text", "is_final" }, { "type": "speech_start" }, { "type": "speech_end" }
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Synthesize speech from text (OpenAI-compatible) |
| POST | /v1/audio/speech/clone | Voice cloning via multipart upload |
| GET | /v1/audio/voices | List available TTS voices |
| GET | /v1/audio/models | List TTS models and their load status |
| POST | /v1/audio/models/load | Load a TTS model into memory |
| POST | /v1/audio/models/unload | Unload a TTS model from memory |
| GET | /api/tts/capabilities | Backend capabilities for a TTS model |
| GET | /api/voice-presets | List configured voice presets |
POST /v1/audio/speech
JSON body. Returns audio bytes in the requested format.
Body: model (string), input (text to speak), voice, speed (float), response_format (mp3|opus|aac|flac|wav|pcm), language, input_type (text|ssml), voice_design (string, Qwen3 only), reference_audio (base64, Qwen3 only), clone_transcript, effects (array)
Query: stream (bool), cache (bool)
```bash
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello world","voice":"af_heart"}' \
  -o output.mp3
```

Response: Audio bytes with Content-Type matching format. Header X-Cache: HIT on cache hit.
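The X-Cache header implies the cache is keyed on the synthesis parameters; a hypothetical key derivation (the server's actual scheme is not documented here — field choice and hashing are assumptions):

```python
import hashlib, json

def tts_cache_key(body):
    """Hash the request fields that determine the audio output."""
    relevant = {k: body.get(k) for k in
                ("model", "input", "voice", "speed", "response_format",
                 "language", "input_type")}
    # sort_keys makes the key independent of field order in the request.
    canonical = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```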
POST /v1/audio/speech/clone
Multipart form upload for voice cloning with a reference audio sample.
Params: input (required), model, reference_audio (file upload), voice_library_ref (string — use stored voice instead of upload), voice, speed, response_format, transcript, language
```bash
curl -sk https://localhost:8100/v1/audio/speech/clone \
  -F "input=Hello, I sound like the reference!" \
  -F "model=qwen3-tts/0.6B-CustomVoice" \
  -F "reference_audio=@reference.wav" \
  -o cloned.mp3
```

Response: Audio bytes.
GET /v1/audio/voices
List available TTS voices. Optionally filter by model/provider.
Query: model (optional — filter by provider, e.g. kokoro, piper)
Response: { "voices": [{ "id", "name", "language", "gender" }] }
GET /v1/audio/models
List TTS models and their runtime status (loaded/not loaded).
Response: { "models": [{ "model", "backend", "device", "status", "loaded_at", "last_used_at" }] }
POST /v1/audio/models/load
Body: { "model": "kokoro" } (optional — defaults to TTS_MODEL)
Response: { "status": "loaded", "model": "kokoro" }
POST /v1/audio/models/unload
Body: { "model": "kokoro" } (optional — defaults to TTS_MODEL)
Response: { "status": "unloaded", "model": "kokoro" }
GET /api/tts/capabilities
Returns capability flags for the selected TTS backend.
Query: model (optional — defaults to TTS_MODEL)
Response: { "backend": "kokoro", "capabilities": { "voice_design": false, "voice_clone": false, ... } }
| Method | Path | Description |
|---|---|---|
| WS | /v1/realtime | OpenAI Realtime API-compatible audio WebSocket |
WS /v1/realtime
Drop-in replacement for OpenAI's Realtime API. Audio I/O only (STT + TTS). Requires OS_REALTIME_ENABLED=true (default).
Query: model (optional)
Subprotocol: realtime
Client events: input_audio_buffer.append ({ "audio": "<base64>" }), input_audio_buffer.commit, response.create
Server events: session.created, conversation.item.input_audio_transcription.completed ({ "transcript" }), response.audio.delta, response.audio.done
| Method | Path | Description |
|---|---|---|
| GET | /v1/models | List all available models (OpenAI format) |
| GET | /v1/models/{model} | Get model details |
GET /v1/models
Lists both STT and TTS models. Includes loaded models and configured defaults.
Response: { "object": "list", "data": [{ "id", "object": "model", "owned_by" }] }
| Method | Path | Description |
|---|---|---|
| GET | /api/models | List all models (available + downloaded + loaded) |
| GET | /api/models/{id}/status | Get model state (loaded/downloaded/available) |
| GET | /api/models/{id}/progress | Download/load progress |
| POST | /api/models/{id}/load | Load a model into memory |
| POST | /api/models/{id}/download | Download model weights only |
| POST | /api/models/{id}/prefetch | Cache model weights (alias for download) |
| DELETE | /api/models/{id} | Unload a model from memory |
| DELETE | /api/models/{id}/artifacts | Delete cached model weights/artifacts |
GET /api/models
Returns all known models across STT and TTS with state info and TTS capabilities.
Response: { "models": [{ "id", "type", "backend", "state", "capabilities": {} }] }
POST /api/models/{id}/load
```bash
curl -sk -X POST https://localhost:8100/api/models/kokoro/load
```

Response: { "id", "type", "backend", "state": "loaded" }
Error: { "error": { "message", "code" } } (400 or 500)
GET /api/models/{id}/progress
Poll download/load progress for a model.
Response: { "status": "loading"|"downloading"|"ready"|"idle", "progress": 0.0–1.0 }
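Since loads are asynchronous, clients typically poll this endpoint until the status is ready. A sketch with an injected fetch callable so the loop works without a live server (the HTTP call itself is left to the caller):

```python
import time

def wait_until_ready(fetch_progress, poll_s=1.0, timeout_s=600.0):
    """Poll until status is 'ready'; fetch_progress returns the JSON dict."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_progress()
        if state["status"] == "ready":
            return state
        if time.monotonic() > deadline:
            raise TimeoutError(f"model not ready: {state}")
        time.sleep(poll_s)
```

In practice `fetch_progress` would GET `/api/models/{id}/progress` and return the decoded body.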
| Method | Path | Description |
|---|---|---|
| POST | /api/providers/install | Install provider Python dependencies (async) |
| GET | /api/providers/install/{job_id} | Poll provider install job status |
POST /api/providers/install
Starts an async pip install for a provider's dependencies. Returns a job ID to poll.
Body: { "provider": "piper", "model": null }
Response: { "job_id": "uuid", "status": "installing" }
GET /api/providers/install/{job_id}
Response: { "job_id", "status": "installing"|"done"|"failed", "output", "error" }
| Method | Path | Description |
|---|---|---|
| POST | /api/voices/library | Upload a named voice reference for cloning |
| GET | /api/voices/library | List all stored voice references |
| GET | /api/voices/library/{name} | Get voice reference metadata |
| DELETE | /api/voices/library/{name} | Delete a stored voice reference |
POST /api/voices/library
Multipart upload. Stores a named voice reference audio for use with voice_library_ref in clone requests.
Params: name (form), audio (file upload — WAV format)
Response (201): { "name", "content_type", "size_bytes", "created_at" }
| Method | Path | Description |
|---|---|---|
| POST | /api/profiles | Create a voice profile |
| GET | /api/profiles | List all profiles + default |
| GET | /api/profiles/{id} | Get a profile |
| PUT | /api/profiles/{id} | Update a profile |
| DELETE | /api/profiles/{id} | Delete a profile |
| POST | /api/profiles/{id}/default | Set as default profile |
POST /api/profiles
Body: { "name", "backend", "model", "voice", "speed", "format", "blend", "reference_audio_id", "effects": [] }
Response (201): { "id", "name", "backend", "model", "voice", "speed", "format", ... }
GET /api/profiles
Response: { "profiles": [{ "id", "name", "backend", "voice", ... }], "default_profile_id": "..." }
| Method | Path | Description |
|---|---|---|
| GET | /api/history | List generation history entries |
| DELETE | /api/history/{id} | Delete a history entry |
| DELETE | /api/history | Clear all history |
GET /api/history
Query: type (stt|tts|null), limit (default 50), offset (default 0)
Response: { "items": [{ "id", "type", "model", "created_at", ... }], "total", "limit", "offset" }
| Method | Path | Description |
|---|---|---|
| POST | /api/conversations | Create a conversation |
| GET | /api/conversations | List conversations |
| GET | /api/conversations/{id} | Get a conversation |
| POST | /api/conversations/{id}/turns | Add a turn to a conversation |
| DELETE | /api/conversations/{id}/turns/{turn_id} | Delete a turn |
| POST | /api/conversations/{id}/render | Render conversation to audio |
| GET | /api/conversations/{id}/audio | Get rendered audio file |
| DELETE | /api/conversations/{id} | Delete a conversation |
POST /api/conversations
Body: { "name": "Demo", "turns": [{ "speaker", "text", "profile_id", "effects" }] }
Response (201): { "id", "name", "turns": [...], "created_at" }
POST /api/conversations/{id}/render
Body: { "format": "wav", "sample_rate": 24000, "save_turn_audio": true }
Response: { "id", "render_output_path", "format", "duration_s" }
| Method | Path | Description |
|---|---|---|
| POST | /api/composer/render | Render a multi-track composition |
| GET | /api/composer/renders | List all composition renders |
| GET | /api/composer/render/{id}/audio | Get rendered composition audio |
| DELETE | /api/composer/render/{id} | Delete a composition render |
POST /api/composer/render
Body: { "name", "format": "wav", "sample_rate": 24000, "tracks": [{ "source_path", "offset_s", "volume", "muted", "solo", "effects" }] }
Response: { "id", "render_output_path", "format", "tracks_count" }
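The track fields (`offset_s`, `volume`, `muted`, `solo`) imply a straightforward mixdown; an illustrative sketch over lists of float samples (the server renders actual audio files, so this is only the mixing logic):

```python
def mix_tracks(tracks, sample_rate=24000):
    """Mix tracks of float samples honoring offset/volume/mute/solo."""
    # If any track is soloed, only soloed tracks play; otherwise skip muted ones.
    soloed = [t for t in tracks if t.get("solo")]
    active = soloed or [t for t in tracks if not t.get("muted")]
    length = max((int(t.get("offset_s", 0) * sample_rate) + len(t["samples"])
                  for t in active), default=0)
    out = [0.0] * length
    for t in active:
        start = int(t.get("offset_s", 0) * sample_rate)
        vol = t.get("volume", 1.0)
        for i, s in enumerate(t["samples"]):
            out[start + i] += s * vol
    return out
```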
| Method | Path | Description |
|---|---|---|
| GET | /api/ps | List loaded STT models |
| POST | /api/ps/{model} | Load an STT model |
| DELETE | /api/ps/{model} | Unload an STT model |
| POST | /api/pull/{model} | Download a model (load + unload) |
These legacy endpoints are kept for backwards compatibility. Prefer the unified `/api/models/*` endpoints.
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /web | Web UI (HTML) |
| GET | /docs | OpenAPI/Swagger interactive docs |
GET /health
Response: { "status": "ok", "version": "0.6.0", "models_loaded": 2 }
Transcribe audio
```bash
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
  -F "response_format=json"
```

Formats: json, text, verbose_json, srt, vtt

Text-to-speech

```bash
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello world","voice":"af_heart"}' \
  -o output.mp3
```

Formats: mp3, opus, aac, flac, wav, pcm
OpenAI Python SDK
```python
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://localhost:8100/v1",
    api_key="not-needed",
    http_client=httpx.Client(verify=False),
)

# STT
with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="deepdml/faster-whisper-large-v3-turbo-ct2", file=f
    )
print(result.text)

# TTS
response = client.audio.speech.create(
    model="kokoro", input="Hello world", voice="af_heart"
)
response.stream_to_file("output.mp3")
```

Open Speech Python client SDK
```python
import asyncio
from src.client import OpenSpeechClient

client = OpenSpeechClient(base_url="http://localhost:8100")

# Sync helpers
result = client.transcribe(open("audio.wav", "rb").read())
print(result["text"])

# Realtime session
rt = client.realtime_session()
rt.on_transcript(lambda ev: print("tx", ev))
rt.on_vad(lambda ev: print("vad", ev))
rt.send_audio(b"\x00\x00" * 2400)
rt.commit()
rt.create_response("Hello from Open Speech", voice="alloy")
rt.close()

# Async streaming transcription
async def run():
    async def chunks():
        yield b"\x00\x00" * 3200
    async for event in client.async_stream_transcribe(chunks()):
        print(event)

asyncio.run(run())
```

Open Speech JS/TS client SDK
```ts
import { OpenSpeechClient } from "@open-speech/client";

const client = new OpenSpeechClient({ baseUrl: "http://localhost:8100" });
const transcript = await client.transcribe(await (await fetch("/audio.wav")).arrayBuffer());
console.log(transcript.text);

const rt = client.realtimeSession();
rt.onTranscript((ev) => console.log(ev));
rt.sendAudio(pcmChunk);
rt.commit();
rt.createResponse("Hello there", "alloy");
```

Real-time streaming (WebSocket)
```js
const ws = new WebSocket("wss://localhost:8100/v1/audio/stream?model=deepdml/faster-whisper-large-v3-turbo-ct2");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    console.log(data.is_final ? "FINAL:" : "partial:", data.text);
  }
};

// Send PCM16 LE mono 16kHz chunks
ws.send(audioChunkArrayBuffer);
```

Open https://localhost:8100/web for a three-tab interface:
- Transcribe — Upload files, record from mic, or stream in real-time
- Speak — Enter text, pick a voice, generate audio
- Models — Browse available models, download, load/unload
Light and dark themes with OS auto-detection.
CPU:

```yaml
services:
  open-speech:
    image: jwindsor1/open-speech:cpu
    ports: ["8100:8100"]
    environment:
      - STT_MODEL=Systran/faster-whisper-base
      - STT_DEVICE=cpu
      - TTS_MODEL=kokoro
      - TTS_DEVICE=cpu
    volumes:
      - hf-cache:/root/.cache/huggingface

volumes:
  hf-cache:
```

GPU:

```yaml
services:
  open-speech:
    image: jwindsor1/open-speech:latest
    ports: ["8100:8100"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - STT_MODEL=deepdml/faster-whisper-large-v3-turbo-ct2
      - STT_DEVICE=cuda
      - STT_COMPUTE_TYPE=float16
      - TTS_MODEL=kokoro
      - TTS_DEVICE=cuda
    volumes:
      - hf-cache:/root/.cache/huggingface

volumes:
  hf-cache:
```

Default GPU image bakes kokoro — ready to use immediately with zero setup. You can customize which providers and model weights are baked at build time:
```bash
docker build \
  --build-arg BAKED_PROVIDERS=kokoro,pocket-tts,piper \
  --build-arg BAKED_TTS_MODELS=kokoro,pocket-tts,piper/en_US-ryan-medium \
  -t jwindsor1/open-speech:full .
```

- `BAKED_PROVIDERS`: controls which Python packages are pre-installed at build time (fast, adds ~200MB per provider). Default: `kokoro`.
- `BAKED_TTS_MODELS`: controls which model weights are baked into the image (slow, adds GB per model). Default: `kokoro`.
⚠️ Qwen3-TTS dependency conflict: `qwen-tts` hard-pins `transformers==4.57.3`, which raises `ImportError` at load time if `huggingface-hub>=1.0` is installed (needed by `faster-whisper`). This poisons the entire Python environment — kokoro and faster-whisper both break. To include qwen3, build with `BAKED_PROVIDERS=kokoro,qwen3` — the Dockerfile installs `qwen-tts` with `--no-deps` and wires compatible dependency versions. Qwen3 may fail at runtime if it uses transformers 4.x-only APIs; this is isolated and won't affect other backends.
Ensure NVIDIA Container Toolkit is installed in WSL2, then use the GPU compose file above.
Models download to /root/.cache/huggingface inside the container. Mount a named volume to persist across restarts — no re-downloading.
| Backend | Best for | Languages | Model prefix |
|---|---|---|---|
| faster-whisper | High accuracy, GPU | 99+ | `Systran/faster-whisper-*`, `deepdml/faster-whisper-*` |
| Backend | Best for | Voices | Status |
|---|---|---|---|
| Kokoro | Quality + variety | 52 voices, blending | ✅ Stable |
| Pocket TTS | CPU-first, low-latency | 8 built-in voices, streaming | ✅ Stable |
| Piper | Lightweight, fast | Per-model voices | ✅ Stable |
| Qwen3-TTS | Voice design + cloning | 4 built-in + custom | ✅ Stable |
See TTS-BACKENDS.md for the backend roadmap.
Kokoro supports mixing voices with weighted syntax:
```bash
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello","voice":"af_bella(2)+af_sky(1)"}'
```

This blends 2 parts af_bella with 1 part af_sky. See voice-presets.example.yml for saving named presets.
Clone any voice from a reference audio sample (Qwen3-TTS):
```bash
# Via multipart upload
curl -sk https://localhost:8100/v1/audio/speech/clone \
  -F "input=Hello, I sound like the reference!" \
  -F "model=qwen3-tts/0.6B-CustomVoice" \
  -F "reference_audio=@reference.wav" \
  -o cloned.mp3

# Via JSON with base64
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-tts/0.6B-CustomVoice","input":"Hello","voice":"default","reference_audio":"<base64>"}' \
  -o cloned.mp3
```

Describe a voice in natural language and Qwen3-TTS will generate it:
```bash
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-tts/0.6B-CustomVoice",
    "input": "Good morning!",
    "voice": "default",
    "voice_design": "A warm, deep male voice with a slight British accent"
  }' \
  -o designed.mp3
```

`voice_design` and `reference_audio` are ignored by backends that don't support them (Kokoro, Piper).
```bash
# API key (recommended for all non-local deployments)
OS_API_KEY=my-secret-key docker compose up -d
curl -sk -H "Authorization: Bearer my-secret-key" https://localhost:8100/health

# Enforce auth at startup (fails fast when API key missing)
OS_AUTH_REQUIRED=true

# Realtime/WebSocket origin allowlist (optional)
OS_WS_ALLOWED_ORIGINS=https://myapp.com,https://staging.myapp.com

# Wyoming bind host (default localhost)
OS_WYOMING_HOST=127.0.0.1

# Rate limiting (60 req/min, burst of 10)
OS_RATE_LIMIT=60 OS_RATE_LIMIT_BURST=10

# CORS
OS_CORS_ORIGINS=https://myapp.com,https://staging.myapp.com

# Custom SSL cert
OS_SSL_CERTFILE=/certs/cert.pem OS_SSL_KEYFILE=/certs/key.pem
```

Defaults from src/config.py (grouped by prefix).
| Variable | Default | Description |
|---|---|---|
| `OS_PORT` | `8100` | HTTP bind port |
| `OS_HOST` | `0.0.0.0` | HTTP bind host |
| `OS_API_KEY` | `""` | Bearer API key (empty disables auth) |
| `OS_AUTH_REQUIRED` | `false` | Fail startup if API key missing |
| `OS_CORS_ORIGINS` | `*` | Comma-separated CORS origins |
| `OS_WS_ALLOWED_ORIGINS` | `""` | Allowed WebSocket Origin values |
| `OS_TRUST_PROXY` | `false` | Trust X-Forwarded-For headers |
| `OS_MAX_UPLOAD_MB` | `100` | Max upload size in MB |
| `OS_RATE_LIMIT` | `0` | Requests/min/IP (0 disables) |
| `OS_RATE_LIMIT_BURST` | `0` | Burst bucket size |
| `OS_SSL_ENABLED` | `true` | Enable HTTPS |
| `OS_SSL_CERTFILE` | `""` | TLS cert path (auto-generated if empty) |
| `OS_SSL_KEYFILE` | `""` | TLS key path (auto-generated if empty) |
| `OS_VOICE_LIBRARY_PATH` | `/home/openspeech/data/voices` | Stored voice reference directory |
| `OS_VOICE_LIBRARY_MAX_COUNT` | `100` | Max stored voice references (0 = unlimited) |
| `OS_STUDIO_DB_PATH` | `/home/openspeech/data/studio.db` | SQLite database for profiles/history |
| `OS_HISTORY_ENABLED` | `true` | Enable TTS/STT history logging |
| `OS_HISTORY_MAX_ENTRIES` | `1000` | Maximum history rows retained |
| `OS_HISTORY_RETAIN_AUDIO` | `true` | Keep audio output path metadata |
| `OS_HISTORY_MAX_MB` | `2000` | Max retained audio footprint in MB |
| `OS_EFFECTS_ENABLED` | `true` | Enable TTS effects processing |
| `OS_CONVERSATIONS_DIR` | `/home/openspeech/data/conversations` | Conversation storage directory |
| `OS_COMPOSER_DIR` | `/home/openspeech/data/composer` | Composer storage directory |
| `OS_PROVIDERS_DIR` | `/home/openspeech/data/providers` | User-installed provider package directory |
| `OS_WYOMING_ENABLED` | `false` | Enable Wyoming TCP server |
| `OS_WYOMING_HOST` | `127.0.0.1` | Wyoming bind host |
| `OS_WYOMING_PORT` | `10400` | Wyoming port |
| `OS_REALTIME_ENABLED` | `true` | Enable /v1/realtime |
| `OS_REALTIME_MAX_BUFFER_MB` | `50` | Max realtime audio buffer per session |
| `OS_REALTIME_IDLE_TIMEOUT_S` | `120` | Realtime idle timeout seconds |
| `OS_MODEL_TTL` | `300` | Auto-unload idle model TTL (seconds) |
| `OS_MAX_LOADED_MODELS` | `0` | Max loaded models (0 = unlimited) |
| `OS_STREAM_CHUNK_MS` | `100` | Streaming chunk window (ms) |
| `OS_STREAM_VAD_THRESHOLD` | `0.5` | Streaming VAD threshold |
| `OS_STREAM_ENDPOINTING_MS` | `300` | Silence to finalize utterance (ms) |
| `OS_STREAM_MAX_CONNECTIONS` | `10` | Max concurrent streaming WS sessions |
| Variable | Default | Description |
|---|---|---|
| `STT_MODEL` | `deepdml/faster-whisper-large-v3-turbo-ct2` | Default STT model ID |
| `STT_DEVICE` | `cuda` | STT inference device |
| `STT_COMPUTE_TYPE` | `float16` | STT compute precision |
| `STT_MODEL_DIR` | `None` | Optional local model directory |
| `STT_PRELOAD_MODELS` | `""` | Comma-separated STT models to preload |
| `STT_VAD_ENABLED` | `true` | Enable VAD by default on streaming endpoint |
| `STT_VAD_THRESHOLD` | `0.5` | VAD speech probability threshold |
| `STT_VAD_MIN_SPEECH_MS` | `250` | Min speech duration before activation |
| `STT_VAD_SILENCE_MS` | `800` | Silence duration for endpointing |
| `STT_DIARIZE_ENABLED` | `false` | Enable diarization support |
| `STT_NOISE_REDUCE` | `false` | Enable STT denoise preprocessing |
| `STT_NORMALIZE` | `true` | Enable STT input normalization |
| Variable | Default | Description |
|---|---|---|
| `TTS_ENABLED` | `true` | Enable TTS endpoints |
| `TTS_MODEL` | `kokoro` | Default TTS model ID |
| `TTS_VOICE` | `af_heart` | Default TTS voice |
| `TTS_DEVICE` | `None` | TTS device override (falls back to STT device) |
| `TTS_MAX_INPUT_LENGTH` | `4096` | Max text chars per synthesis request |
| `TTS_DEFAULT_FORMAT` | `mp3` | Default output audio format |
| `TTS_SPEED` | `1.0` | Default synthesis speed |
| `TTS_PRELOAD_MODELS` | `""` | Comma-separated TTS models to preload |
| `TTS_VOICES_CONFIG` | `""` | YAML voice presets file path |
| `TTS_CACHE_ENABLED` | `false` | Enable on-disk synthesis cache |
| `TTS_CACHE_MAX_MB` | `500` | Max cache size in MB |
| `TTS_CACHE_DIR` | `/var/lib/open-speech/cache` | Cache directory |
| `TTS_TRIM_SILENCE` | `true` | Trim generated silence |
| `TTS_NORMALIZE_OUTPUT` | `true` | Normalize output loudness |
| `TTS_PRONUNCIATION_DICT` | `""` | Pronunciation dictionary path |
| `TTS_QWEN3_SIZE` | `1.7B` | Qwen3 model size selector |
| `TTS_QWEN3_FLASH_ATTN` | `false` | Enable Qwen3 flash attention |
| `TTS_QWEN3_DEVICE` | `cuda:0` | Qwen3 backend device override |
See ROADMAP.md for the full phase breakdown and current status.
- ✅ Phases 2–6 complete (multi-model, Docker, advanced TTS, Wyoming, production hardening)
- ✅ Phase 8a+8b — Voice Profiles + Generation History
- ✅ Web UI — ground-up rewrite: Transcribe / Speak / Models / History / Settings
- 🚧 Phase 8c–8e — Conversation mode, Voice effects, Composer
- 🔴 B6, B9 — critical backend bugs (provider install path, streaming event loop)
Open Speech uses a pluggable backend system. To add a new STT or TTS backend:
- Create a new file in
src/tts/orsrc/stt/ - Implement the backend interface (see existing backends for examples)
- Register it in the model registry
- Add tests
- Submit a PR
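As a purely illustrative shape of such a backend (method names here are assumptions; the real interface is defined in src/tts/ and src/stt/ — follow the existing backends, not this sketch):

```python
from abc import ABC, abstractmethod

class TTSBackend(ABC):
    """Hypothetical shape of a pluggable TTS backend."""

    @abstractmethod
    def load(self, model_id, device):
        """Load model weights onto the given device."""

    @abstractmethod
    def synthesize(self, text, voice, speed=1.0):
        """Return raw audio bytes for `text`."""

    @abstractmethod
    def voices(self):
        """Return the list of available voice IDs."""

class EchoBackend(TTSBackend):
    """Toy implementation used only to exercise the interface."""

    def load(self, model_id, device):
        self.model_id = model_id

    def synthesize(self, text, voice, speed=1.0):
        return f"{voice}:{text}".encode()

    def voices(self):
        return ["echo"]
```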
See TTS-BACKENDS.md for the TTS backend roadmap and design patterns.
MIT © 2026 Jeremy Windsor