
Open Speech

OpenAI-compatible speech server — any STT/TTS provider, one container.

Python 3.12+ · MIT License · Docker images on Docker Hub

What is Open Speech?

Open Speech is a self-hosted speech server that speaks the OpenAI API. Plug in any STT or TTS backend, swap models at runtime, and hit the same endpoints your apps already use. One Docker image, CPU or GPU, no vendor lock-in.

Features

🎙️ Speech-to-Text

  • OpenAI-compatible /v1/audio/transcriptions and /v1/audio/translations
  • Real-time streaming via WebSocket (/v1/audio/stream)
  • Silero VAD — only transcribe when someone's talking
  • SRT/VTT subtitle output
  • Optional speaker diarization (?diarize=true) via pyannote
  • Optional input audio preprocessing (noise reduction + normalization)

🔊 Text-to-Speech

  • OpenAI-compatible /v1/audio/speech
  • Streaming TTS with chunked transfer
  • 50+ voices across backends
  • Voice blending — mix voices with af_bella(2)+af_sky(1) syntax
  • TTS response caching (configurable disk LRU)
  • Pronunciation dictionary + SSML subset (input_type=ssml)
  • Output postprocessing (silence trim + normalize)

🧩 Qwen3-TTS Deep Integration (Phase 7a)

  • Official qwen-tts backend integration (Qwen3TTSModel)
  • Three-model auto-selection per request:
    • CustomVoice for named premium speakers
    • VoiceDesign for instruction-only voice creation
    • Base for voice cloning from reference audio
  • 9 premium speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee
  • Instruction control via voice_design field
  • In-memory reusable clone-prompt cache (create_voice_clone_prompt())
  • On-demand model loading only (no startup model load)
  • Practical GPU note: 8GB VRAM can run one 1.7B model comfortably; loading two 1.7B models is usually tight

🧠 Model Management

  • Nothing baked in — models download at runtime
  • Unified model browser in the web UI
  • Load, unload, hot-swap models via API
  • TTL eviction and LRU lifecycle

🔌 OpenAI Realtime API

  • Drop-in replacement for OpenAI's /v1/realtime WebSocket endpoint
  • Audio I/O only — STT + TTS, bring your own LLM
  • Server VAD mode with Silero VAD for automatic speech detection
  • Audio format negotiation: pcm16, g711_ulaw, g711_alaw
  • Works with existing OpenAI Realtime API client libraries

🐳 Deployment

  • Single image for CPU + GPU (NVIDIA CUDA)
  • Web UI with light/dark mode at /web
  • Self-signed HTTPS out of the box
  • API key auth, rate limiting, CORS

Quick Start

docker run -d -p 8100:8100 jwindsor1/open-speech:cpu

Open https://localhost:8100/web — accept the self-signed cert, and you're in.

For GPU: docker run -d -p 8100:8100 --gpus all jwindsor1/open-speech:latest

Installation (from source)

git clone https://github.com/will-assistant/open-speech.git
cd open-speech
pip install -e .                    # Core (faster-whisper STT + Kokoro TTS)
pip install -e ".[piper]"           # + Piper TTS
pip install -e ".[qwen]"            # + Qwen3-TTS deep integration (qwen-tts)
pip install -e ".[all]"             # All core backends (keeps heavy optional extras separate)
pip install -e ".[diarize]"         # + Speaker diarization (pyannote)
pip install -e ".[noise]"           # + Noise reduction preprocessing
pip install -e ".[client]"          # + Python client SDK deps
pip install -e ".[dev]"             # Development tools (pytest, ruff, httpx)
pip install -r requirements.lock      # Reproducible pinned core dependencies

Configuration

All configuration is via environment variables: OS_ for the server, STT_ for speech-to-text, TTS_ for text-to-speech.

Server (OS_)

Variable Default Description
OS_HOST 0.0.0.0 Bind address
OS_PORT 8100 Listen port
OS_API_KEY "" API key (empty = no auth)
OS_CORS_ORIGINS * Comma-separated CORS origins
OS_SSL_ENABLED true Enable HTTPS
OS_MODEL_TTL 300 Idle seconds before auto-unload
OS_MAX_LOADED_MODELS 0 Max models in memory (0 = unlimited)
OS_RATE_LIMIT 0 Requests/min per IP (0 = off)
OS_MAX_UPLOAD_MB 100 Max upload size

Speech-to-Text (STT_)

Variable Default Description
STT_MODEL deepdml/faster-whisper-large-v3-turbo-ct2 Default STT model
STT_DEVICE cuda cuda or cpu
STT_COMPUTE_TYPE float16 float16, int8, int8_float16
STT_PRELOAD_MODELS "" Comma-separated models to preload

Text-to-Speech (TTS_)

Variable Default Description
TTS_ENABLED true Enable TTS endpoints
TTS_MODEL kokoro Default TTS model
TTS_VOICE af_heart Default voice
TTS_SPEED 1.0 Default speed
TTS_DEVICE (inherits STT) cuda or cpu
TTS_MAX_INPUT_LENGTH 4096 Max input text length
TTS_VOICES_CONFIG "" Path to voice presets YAML
TTS_QWEN3_SIZE 1.7B Qwen3 model size: 1.7B or 0.6B
TTS_QWEN3_FLASH_ATTN false Enable flash-attention-2 when installed
TTS_QWEN3_DEVICE cuda:0 Device override for qwen-tts model loading

Backwards compatibility: Old env var names (STT_PORT, STT_HOST, etc.) still work but log deprecation warnings.

Voice Activity Detection (STT_VAD_)

Variable Default Description
STT_VAD_ENABLED true Enable Silero VAD for speech detection
STT_VAD_THRESHOLD 0.5 Speech probability threshold (0.0–1.0)
STT_VAD_MIN_SPEECH_MS 250 Minimum speech duration before triggering
STT_VAD_SILENCE_MS 800 Silence duration to trigger speech_end

VAD (Voice Activity Detection) uses the Silero VAD ONNX model (<2MB) to detect when someone is speaking. When enabled:

  • WebSocket streaming only forwards speech to the STT backend, saving compute
  • VAD events (speech_start / speech_end) are sent to WebSocket clients
  • Web UI mic shows visual indicators for speech detection state
  • Wyoming protocol filters silence before transcription

The VAD model downloads automatically on first use. It runs on CPU with negligible overhead (<5% CPU).

WebSocket VAD query parameter:

ws://host:8100/v1/audio/stream?vad=true    # force enable
ws://host:8100/v1/audio/stream?vad=false   # force disable
ws://host:8100/v1/audio/stream             # use STT_VAD_ENABLED default
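
A minimal Python listener for this stream and its VAD events — a sketch using the websockets package (also used in the Realtime example below); the audio file name and chunk size are placeholders:

import asyncio, json, websockets

async def listen(chunks):
    # Use wss:// when OS_SSL_ENABLED=true (the default)
    uri = "ws://localhost:8100/v1/audio/stream?vad=true"
    async with websockets.connect(uri) as ws:
        for chunk in chunks:                     # PCM16 LE mono, 16 kHz by default
            await ws.send(chunk)
        while True:
            event = json.loads(await ws.recv())
            if event["type"] in ("speech_start", "speech_end"):
                print("VAD:", event["type"])
            elif event["type"] == "transcript":
                print("FINAL:" if event["is_final"] else "partial:", event["text"])
                if event["is_final"]:
                    break

with open("audio.raw", "rb") as f:               # raw 16 kHz PCM16 mono audio
    pcm = f.read()
asyncio.run(listen([pcm[i:i + 3200] for i in range(0, len(pcm), 3200)]))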

Wyoming Protocol (OS_WYOMING_)

Variable Default Description
OS_WYOMING_ENABLED false Enable Wyoming TCP server
OS_WYOMING_PORT 10400 Wyoming protocol port

Open Speech supports the Wyoming protocol, making it a drop-in STT + TTS provider for Home Assistant.

Enable it:

OS_WYOMING_ENABLED=true
OS_WYOMING_PORT=10400  # default

Home Assistant setup:

Add to your Home Assistant configuration.yaml:

wyoming:
  - host: "YOUR_OPEN_SPEECH_IP"
    port: 10400

Or add via Settings → Devices & Services → Add Integration → Wyoming Protocol and enter your Open Speech server's IP and port 10400.

Once connected, Open Speech appears as both an STT and TTS provider in your voice pipeline configuration.

OpenAI Realtime API (OS_REALTIME_)

Variable Default Description
OS_REALTIME_ENABLED true Enable /v1/realtime WebSocket endpoint

Open Speech implements the OpenAI Realtime API WebSocket protocol for audio I/O only (STT + TTS). Any app built for OpenAI's voice mode works as a drop-in replacement — just point it at your Open Speech server.

We handle: audio input → transcription, text → speech output, VAD-based turn detection. We don't handle: LLM conversation, function calling, text generation. Bring your own brain.

Usage example (Python):

import asyncio, base64, json, websockets

async def main():
    uri = "ws://localhost:8100/v1/realtime?model=whisper-large-v3-turbo"
    async with websockets.connect(uri, subprotocols=["realtime"]) as ws:
        event = json.loads(await ws.recv())  # session.created
        print(f"Session: {event['session']['id']}")

        # Send audio (PCM16, 24kHz, mono)
        with open("audio.raw", "rb") as f:
            audio = f.read()
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                print(f"Transcript: {event['transcript']}")
                break

asyncio.run(main())

Drop-in replacement for OpenAI clients — just change the base URL:

# Point any OpenAI Realtime client at Open Speech
REALTIME_URL = "ws://localhost:8100/v1/realtime"

Models

Models are not baked into the image — they download on first use and persist in the Docker volume.

STT Models

Model Size Backend Languages
deepdml/faster-whisper-large-v3-turbo-ct2 ~800MB faster-whisper 99+
Systran/faster-whisper-large-v3 ~1.5GB faster-whisper 99+
Systran/faster-whisper-medium ~800MB faster-whisper 99+
Systran/faster-whisper-small ~250MB faster-whisper 99+
Systran/faster-whisper-base ~150MB faster-whisper 99+
Systran/faster-whisper-tiny ~75MB faster-whisper 99+

TTS Models

Model Size Backend Voices
kokoro ~82MB Kokoro 52 voices, blending
pocket-tts ~220MB Pocket TTS 8 built-in voices, streaming
piper/en_US-lessac-medium ~35MB Piper 1
piper/en_US-joe-medium ~35MB Piper 1
piper/en_US-amy-medium ~35MB Piper 1
piper/en_US-arctic-medium ~35MB Piper 1
piper/en_GB-alan-medium ~35MB Piper 1
qwen3-tts/0.6B-CustomVoice ~1.2GB Qwen3-TTS 4 + voice design
qwen3-tts/1.7B-CustomVoice ~3.4GB Qwen3-TTS 4 + voice design

Switch models by changing STT_MODEL / TTS_MODEL and restarting, or use the API:

curl -sk -X POST https://localhost:8100/api/models/Systran%2Ffaster-whisper-small/load

API Reference

All endpoints return JSON unless noted. Authentication via Authorization: Bearer <key> header when OS_API_KEY is set. Interactive docs at /docs (Swagger UI).

Speech-to-Text (STT)

Method Path Description
POST /v1/audio/transcriptions Transcribe audio to text (OpenAI-compatible)
POST /v1/audio/translations Translate audio to English text (OpenAI-compatible)
WS /v1/audio/stream Real-time streaming transcription via WebSocket
GET /v1/audio/stream Returns 426 — instructs HTTP clients to use WebSocket
POST /v1/audio/transcriptions

Multipart form upload. Returns transcription in the requested format.

Params (form fields): file (required), model, language, prompt, response_format (json|text|verbose_json|srt|vtt), temperature, diarize (bool)

curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
  -F "response_format=json"

Response (json): { "text": "..." }
Response (verbose_json): { "text", "segments": [{ "start", "end", "text" }], "language" }
Response (diarize=true): { "text", "segments": [{ "speaker", "start", "end", "text" }] }
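
The same request from Python with diarization enabled — a sketch using httpx (already used elsewhere in this README); diarization requires the pyannote extra:

import httpx

with httpx.Client(base_url="https://localhost:8100", verify=False) as client:
    with open("audio.wav", "rb") as f:
        resp = client.post(
            "/v1/audio/transcriptions",
            files={"file": ("audio.wav", f, "audio/wav")},
            data={
                "model": "deepdml/faster-whisper-large-v3-turbo-ct2",
                "diarize": "true",
            },
        )
    resp.raise_for_status()
    for seg in resp.json()["segments"]:
        print(seg["speaker"], seg["start"], seg["end"], seg["text"])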

POST /v1/audio/translations

Translates non-English audio to English. Same multipart interface as transcriptions.

Params: file (required), model, prompt, response_format, temperature

Response: { "text": "..." }

WS /v1/audio/stream

Real-time streaming STT via WebSocket. Send raw audio chunks, receive transcript events.

Query params: model, language, sample_rate (default 16000), encoding (pcm_s16le), interim_results (bool), endpointing (ms), vad (bool)

const ws = new WebSocket("wss://localhost:8100/v1/audio/stream?vad=true");
ws.onmessage = (e) => console.log(JSON.parse(e.data));
ws.send(audioChunkArrayBuffer); // PCM16 LE mono

Events: { "type": "transcript", "text", "is_final" }, { "type": "speech_start" }, { "type": "speech_end" }

Text-to-Speech (TTS)

Method Path Description
POST /v1/audio/speech Synthesize speech from text (OpenAI-compatible)
POST /v1/audio/speech/clone Voice cloning via multipart upload
GET /v1/audio/voices List available TTS voices
GET /v1/audio/models List TTS models and their load status
POST /v1/audio/models/load Load a TTS model into memory
POST /v1/audio/models/unload Unload a TTS model from memory
GET /api/tts/capabilities Backend capabilities for a TTS model
GET /api/voice-presets List configured voice presets
POST /v1/audio/speech

JSON body. Returns audio bytes in the requested format.

Body: model (string), input (text to speak), voice, speed (float), response_format (mp3|opus|aac|flac|wav|pcm), language, input_type (text|ssml), voice_design (string, Qwen3 only), reference_audio (base64, Qwen3 only), clone_transcript, effects (array)
Query: stream (bool), cache (bool)

curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello world","voice":"af_heart"}' \
  -o output.mp3

Response: Audio bytes with Content-Type matching format. Header X-Cache: HIT on cache hit.
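
A sketch of streamed synthesis using the stream query flag, writing audio to disk as chunks arrive (httpx, self-signed cert ignored as elsewhere in this README):

import httpx

payload = {"model": "kokoro", "input": "Streaming hello!", "voice": "af_heart"}

with httpx.Client(base_url="https://localhost:8100", verify=False) as client:
    with client.stream("POST", "/v1/audio/speech", params={"stream": "true"}, json=payload) as resp:
        resp.raise_for_status()
        with open("streamed.mp3", "wb") as out:
            for chunk in resp.iter_bytes():  # chunks arrive as the backend synthesizes
                out.write(chunk)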

POST /v1/audio/speech/clone

Multipart form upload for voice cloning with a reference audio sample.

Params: input (required), model, reference_audio (file upload), voice_library_ref (string — use stored voice instead of upload), voice, speed, response_format, transcript, language

curl -sk https://localhost:8100/v1/audio/speech/clone \
  -F "input=Hello, I sound like the reference!" \
  -F "model=qwen3-tts/0.6B-CustomVoice" \
  -F "reference_audio=@reference.wav" \
  -o cloned.mp3

Response: Audio bytes.

GET /v1/audio/voices

List available TTS voices. Optionally filter by model/provider.

Query: model (optional — filter by provider, e.g. kokoro, piper)

Response: { "voices": [{ "id", "name", "language", "gender" }] }

GET /v1/audio/models

List TTS models and their runtime status (loaded/not loaded).

Response: { "models": [{ "model", "backend", "device", "status", "loaded_at", "last_used_at" }] }

POST /v1/audio/models/load

Body: { "model": "kokoro" } (optional — defaults to TTS_MODEL)

Response: { "status": "loaded", "model": "kokoro" }

POST /v1/audio/models/unload

Body: { "model": "kokoro" } (optional — defaults to TTS_MODEL)

Response: { "status": "unloaded", "model": "kokoro" }

GET /api/tts/capabilities

Returns capability flags for the selected TTS backend.

Query: model (optional — defaults to TTS_MODEL)

Response: { "backend": "kokoro", "capabilities": { "voice_design": false, "voice_clone": false, ... } }

OpenAI Realtime API

Method Path Description
WS /v1/realtime OpenAI Realtime API-compatible audio WebSocket
WS /v1/realtime

Drop-in replacement for OpenAI's Realtime API. Audio I/O only (STT + TTS). Requires OS_REALTIME_ENABLED=true (default).

Query: model (optional)
Subprotocol: realtime

Client events: input_audio_buffer.append ({ "audio": "<base64>" }), input_audio_buffer.commit, response.create
Server events: session.created, conversation.item.input_audio_transcription.completed ({ "transcript" }), response.audio.delta, response.audio.done

Models (OpenAI-compatible)

Method Path Description
GET /v1/models List all available models (OpenAI format)
GET /v1/models/{model} Get model details
GET /v1/models

Lists both STT and TTS models. Includes loaded models and configured defaults.

Response: { "object": "list", "data": [{ "id", "object": "model", "owned_by" }] }

Unified Model Management

Method Path Description
GET /api/models List all models (available + downloaded + loaded)
GET /api/models/{id}/status Get model state (loaded/downloaded/available)
GET /api/models/{id}/progress Download/load progress
POST /api/models/{id}/load Load a model into memory
POST /api/models/{id}/download Download model weights only
POST /api/models/{id}/prefetch Cache model weights (alias for download)
DELETE /api/models/{id} Unload a model from memory
DELETE /api/models/{id}/artifacts Delete cached model weights/artifacts
GET /api/models

Returns all known models across STT and TTS with state info and TTS capabilities.

Response: { "models": [{ "id", "type", "backend", "state", "capabilities": {} }] }

POST /api/models/{id}/load
curl -sk -X POST https://localhost:8100/api/models/kokoro/load

Response: { "id", "type", "backend", "state": "loaded" } Error: { "error": { "message", "code" } } (400 or 500)

GET /api/models/{id}/progress

Poll download/load progress for a model.

Response: { "status": "loading"|"downloading"|"ready"|"idle", "progress": 0.0–1.0 }

Provider Management

Method Path Description
POST /api/providers/install Install provider Python dependencies (async)
GET /api/providers/install/{job_id} Poll provider install job status
POST /api/providers/install

Starts an async pip install for a provider's dependencies. Returns a job ID to poll.

Body: { "provider": "piper", "model": null }

Response: { "job_id": "uuid", "status": "installing" }

GET /api/providers/install/{job_id}

Response: { "job_id", "status": "installing"|"done"|"failed", "output", "error" }

Voice Library

Method Path Description
POST /api/voices/library Upload a named voice reference for cloning
GET /api/voices/library List all stored voice references
GET /api/voices/library/{name} Get voice reference metadata
DELETE /api/voices/library/{name} Delete a stored voice reference
POST /api/voices/library

Multipart upload. Stores a named voice reference audio for use with voice_library_ref in clone requests.

Params: name (form), audio (file upload — WAV format)

Response (201): { "name", "content_type", "size_bytes", "created_at" }

Voice Profiles (Studio)

Method Path Description
POST /api/profiles Create a voice profile
GET /api/profiles List all profiles + default
GET /api/profiles/{id} Get a profile
PUT /api/profiles/{id} Update a profile
DELETE /api/profiles/{id} Delete a profile
POST /api/profiles/{id}/default Set as default profile
POST /api/profiles

Body: { "name", "backend", "model", "voice", "speed", "format", "blend", "reference_audio_id", "effects": [] }

Response (201): { "id", "name", "backend", "model", "voice", "speed", "format", ... }

GET /api/profiles

Response: { "profiles": [{ "id", "name", "backend", "voice", ... }], "default_profile_id": "..." }

History

Method Path Description
GET /api/history List generation history entries
DELETE /api/history/{id} Delete a history entry
DELETE /api/history Clear all history
GET /api/history

Query: type (stt|tts|null), limit (default 50), offset (default 0)

Response: { "items": [{ "id", "type", "model", "created_at", ... }], "total", "limit", "offset" }

Conversations

Method Path Description
POST /api/conversations Create a conversation
GET /api/conversations List conversations
GET /api/conversations/{id} Get a conversation
POST /api/conversations/{id}/turns Add a turn to a conversation
DELETE /api/conversations/{id}/turns/{turn_id} Delete a turn
POST /api/conversations/{id}/render Render conversation to audio
GET /api/conversations/{id}/audio Get rendered audio file
DELETE /api/conversations/{id} Delete a conversation
POST /api/conversations

Body: { "name": "Demo", "turns": [{ "speaker", "text", "profile_id", "effects" }] }

Response (201): { "id", "name", "turns": [...], "created_at" }

POST /api/conversations/{id}/render

Body: { "format": "wav", "sample_rate": 24000, "save_turn_audio": true }

Response: { "id", "render_output_path", "format", "duration_s" }

Composer (Multi-track)

Method Path Description
POST /api/composer/render Render a multi-track composition
GET /api/composer/renders List all composition renders
GET /api/composer/render/{id}/audio Get rendered composition audio
DELETE /api/composer/render/{id} Delete a composition render
POST /api/composer/render

Body: { "name", "format": "wav", "sample_rate": 24000, "tracks": [{ "source_path", "offset_s", "volume", "muted", "solo", "effects" }] }

Response: { "id", "render_output_path", "format", "tracks_count" }

Legacy Endpoints

Method Path Description
GET /api/ps List loaded STT models
POST /api/ps/{model} Load an STT model
DELETE /api/ps/{model} Unload an STT model
POST /api/pull/{model} Download a model (load + unload)

These legacy endpoints are kept for backwards compatibility. Prefer the unified /api/models/* endpoints.

Health & UI

Method Path Description
GET /health Health check
GET /web Web UI (HTML)
GET /docs OpenAPI/Swagger interactive docs
GET /health

Response: { "status": "ok", "version": "0.6.0", "models_loaded": 2 }

Usage Examples

Transcribe audio
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
  -F "response_format=json"

Formats: json, text, verbose_json, srt, vtt

Text-to-speech
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello world","voice":"af_heart"}' \
  -o output.mp3

Formats: mp3, opus, aac, flac, wav, pcm

OpenAI Python SDK
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://localhost:8100/v1",
    api_key="not-needed",
    http_client=httpx.Client(verify=False),
)

# STT
with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="deepdml/faster-whisper-large-v3-turbo-ct2", file=f
    )
print(result.text)

# TTS
response = client.audio.speech.create(
    model="kokoro", input="Hello world", voice="af_heart"
)
response.stream_to_file("output.mp3")

Open Speech Python client SDK
import asyncio
from src.client import OpenSpeechClient

client = OpenSpeechClient(base_url="http://localhost:8100")

# Sync helpers
result = client.transcribe(open("audio.wav", "rb").read())
print(result["text"])

# Realtime session
rt = client.realtime_session()
rt.on_transcript(lambda ev: print("tx", ev))
rt.on_vad(lambda ev: print("vad", ev))
rt.send_audio(b"\x00\x00" * 2400)
rt.commit()
rt.create_response("Hello from Open Speech", voice="alloy")
rt.close()

# Async streaming transcription
async def run():
    async def chunks():
        yield b"\x00\x00" * 3200
    async for event in client.async_stream_transcribe(chunks()):
        print(event)

asyncio.run(run())

Open Speech JS/TS client SDK
import { OpenSpeechClient } from "@open-speech/client";

const client = new OpenSpeechClient({ baseUrl: "http://localhost:8100" });
const transcript = await client.transcribe(await (await fetch("/audio.wav")).arrayBuffer());
console.log(transcript.text);

const rt = client.realtimeSession();
rt.onTranscript((ev) => console.log(ev));
rt.sendAudio(pcmChunk);
rt.commit();
rt.createResponse("Hello there", "alloy");

Real-time streaming (WebSocket)
const ws = new WebSocket("wss://localhost:8100/v1/audio/stream?model=deepdml/faster-whisper-large-v3-turbo-ct2");
ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "transcript") {
        console.log(data.is_final ? "FINAL:" : "partial:", data.text);
    }
};
// Send PCM16 LE mono 16kHz chunks
ws.send(audioChunkArrayBuffer);

Web UI

Open https://localhost:8100/web for a three-tab interface:

  • Transcribe — Upload files, record from mic, or stream in real-time
  • Speak — Enter text, pick a voice, generate audio
  • Models — Browse available models, download, load/unload

Light and dark themes with OS auto-detection.

Docker

Docker Compose — CPU

services:
  open-speech:
    image: jwindsor1/open-speech:cpu
    ports: ["8100:8100"]
    environment:
      - STT_MODEL=Systran/faster-whisper-base
      - STT_DEVICE=cpu
      - TTS_MODEL=kokoro
      - TTS_DEVICE=cpu
    volumes:
      - hf-cache:/root/.cache/huggingface
volumes:
  hf-cache:

Docker Compose — GPU

services:
  open-speech:
    image: jwindsor1/open-speech:latest
    ports: ["8100:8100"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - STT_MODEL=deepdml/faster-whisper-large-v3-turbo-ct2
      - STT_DEVICE=cuda
      - STT_COMPUTE_TYPE=float16
      - TTS_MODEL=kokoro
      - TTS_DEVICE=cuda
    volumes:
      - hf-cache:/root/.cache/huggingface
volumes:
  hf-cache:

Build-time baked providers (GPU image)

The default GPU image bakes in kokoro, so it is ready to use immediately with zero setup. You can customize which providers and model weights are baked in at build time:

docker build \
  --build-arg BAKED_PROVIDERS=kokoro,pocket-tts,piper \
  --build-arg BAKED_TTS_MODELS=kokoro,pocket-tts,piper/en_US-ryan-medium \
  -t jwindsor1/open-speech:full .

  • BAKED_PROVIDERS: controls which Python packages are pre-installed at build time (fast, adds ~200MB per provider). Default: kokoro.
  • BAKED_TTS_MODELS: controls which model weights are baked into the image (slow, adds GB per model). Default: kokoro.

⚠️ Qwen3-TTS dependency conflict: qwen-tts hard-pins transformers==4.57.3, which raises an ImportError at load time when huggingface-hub>=1.0 (required by faster-whisper) is installed. A normal install therefore poisons the entire Python environment — kokoro and faster-whisper both break. To include qwen3, build with BAKED_PROVIDERS=kokoro,qwen3 — the Dockerfile installs qwen-tts with --no-deps and wires in compatible dependency versions. Qwen3 may still fail at runtime if it relies on transformers 4.x-only APIs, but that failure is isolated and won't affect other backends.

Windows GPU (WSL2)

Ensure NVIDIA Container Toolkit is installed in WSL2, then use the GPU compose file above.

Volume Strategy

Models download to /root/.cache/huggingface inside the container. Mount a named volume to persist across restarts — no re-downloading.

STT Backends

Backend Best for Languages Model prefix
faster-whisper High accuracy, GPU 99+ Systran/faster-whisper-*, deepdml/faster-whisper-*

TTS Backends

Backend Best for Voices Status
Kokoro Quality + variety 52 voices, blending ✅ Stable
Pocket TTS CPU-first, low-latency 8 built-in voices, streaming ✅ Stable
Piper Lightweight, fast Per-model voices ✅ Stable
Qwen3-TTS Voice design + cloning 4 built-in + custom ✅ Stable

See TTS-BACKENDS.md for the backend roadmap.

Voice Blending

Kokoro supports mixing voices with weighted syntax:

curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello","voice":"af_bella(2)+af_sky(1)"}'

This blends 2 parts af_bella with 1 part af_sky. See voice-presets.example.yml for saving named presets.

Voice Cloning

Clone any voice from a reference audio sample (Qwen3-TTS):

# Via multipart upload
curl -sk https://localhost:8100/v1/audio/speech/clone \
  -F "input=Hello, I sound like the reference!" \
  -F "model=qwen3-tts/0.6B-CustomVoice" \
  -F "reference_audio=@reference.wav" \
  -o cloned.mp3

# Via JSON with base64
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-tts/0.6B-CustomVoice","input":"Hello","voice":"default","reference_audio":"<base64>"}' \
  -o cloned.mp3
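
Producing the <base64> payload from Python for the JSON route — a sketch mirroring the curl call above:

import base64
import httpx

with open("reference.wav", "rb") as f:
    reference_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen3-tts/0.6B-CustomVoice",
    "input": "Hello",
    "voice": "default",
    "reference_audio": reference_b64,
}

with httpx.Client(base_url="https://localhost:8100", verify=False, timeout=300) as client:
    resp = client.post("/v1/audio/speech", json=payload)
    resp.raise_for_status()
    open("cloned.mp3", "wb").write(resp.content)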

Voice Design

Describe a voice in natural language and Qwen3-TTS will generate it:

curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-tts/0.6B-CustomVoice",
    "input": "Good morning!",
    "voice": "default",
    "voice_design": "A warm, deep male voice with a slight British accent"
  }' \
  -o designed.mp3

voice_design and reference_audio are ignored by backends that don't support them (Kokoro, Piper).

Security

# API key (recommended for all non-local deployments)
OS_API_KEY=my-secret-key docker compose up -d
curl -sk -H "Authorization: Bearer my-secret-key" https://localhost:8100/health

# Enforce auth at startup (fails fast when API key missing)
OS_AUTH_REQUIRED=true

# Realtime/WebSocket origin allowlist (optional)
OS_WS_ALLOWED_ORIGINS=https://myapp.com,https://staging.myapp.com

# Wyoming bind host (default localhost)
OS_WYOMING_HOST=127.0.0.1

# Rate limiting (60 req/min, burst of 10)
OS_RATE_LIMIT=60 OS_RATE_LIMIT_BURST=10

# CORS
OS_CORS_ORIGINS=https://myapp.com,https://staging.myapp.com

# Custom SSL cert
OS_SSL_CERTFILE=/certs/cert.pem OS_SSL_KEYFILE=/certs/key.pem

Environment Variables Reference

Defaults from src/config.py (grouped by prefix).

OS_* (server / shared)

Variable Default Description
OS_PORT 8100 HTTP bind port
OS_HOST 0.0.0.0 HTTP bind host
OS_API_KEY "" Bearer API key (empty disables auth)
OS_AUTH_REQUIRED false Fail startup if API key missing
OS_CORS_ORIGINS * Comma-separated CORS origins
OS_WS_ALLOWED_ORIGINS "" Allowed WebSocket Origin values
OS_TRUST_PROXY false Trust X-Forwarded-For headers
OS_MAX_UPLOAD_MB 100 Max upload size in MB
OS_RATE_LIMIT 0 Requests/min/IP (0 disables)
OS_RATE_LIMIT_BURST 0 Burst bucket size
OS_SSL_ENABLED true Enable HTTPS
OS_SSL_CERTFILE "" TLS cert path (auto-generated if empty)
OS_SSL_KEYFILE "" TLS key path (auto-generated if empty)
OS_VOICE_LIBRARY_PATH /home/openspeech/data/voices Stored voice reference directory
OS_VOICE_LIBRARY_MAX_COUNT 100 Max stored voice references (0 = unlimited)
OS_STUDIO_DB_PATH /home/openspeech/data/studio.db SQLite database for profiles/history
OS_HISTORY_ENABLED true Enable TTS/STT history logging
OS_HISTORY_MAX_ENTRIES 1000 Maximum history rows retained
OS_HISTORY_RETAIN_AUDIO true Keep audio output path metadata
OS_HISTORY_MAX_MB 2000 Max retained audio footprint in MB
OS_EFFECTS_ENABLED true Enable TTS effects processing
OS_CONVERSATIONS_DIR /home/openspeech/data/conversations Conversation storage directory
OS_COMPOSER_DIR /home/openspeech/data/composer Composer storage directory
OS_PROVIDERS_DIR /home/openspeech/data/providers User-installed provider package directory
OS_WYOMING_ENABLED false Enable Wyoming TCP server
OS_WYOMING_HOST 127.0.0.1 Wyoming bind host
OS_WYOMING_PORT 10400 Wyoming port
OS_REALTIME_ENABLED true Enable /v1/realtime
OS_REALTIME_MAX_BUFFER_MB 50 Max realtime audio buffer per session
OS_REALTIME_IDLE_TIMEOUT_S 120 Realtime idle timeout seconds
OS_MODEL_TTL 300 Auto-unload idle model TTL (seconds)
OS_MAX_LOADED_MODELS 0 Max loaded models (0 = unlimited)
OS_STREAM_CHUNK_MS 100 Streaming chunk window (ms)
OS_STREAM_VAD_THRESHOLD 0.5 Streaming VAD threshold
OS_STREAM_ENDPOINTING_MS 300 Silence to finalize utterance (ms)
OS_STREAM_MAX_CONNECTIONS 10 Max concurrent streaming WS sessions

STT_* (speech-to-text)

Variable Default Description
STT_MODEL deepdml/faster-whisper-large-v3-turbo-ct2 Default STT model ID
STT_DEVICE cuda STT inference device
STT_COMPUTE_TYPE float16 STT compute precision
STT_MODEL_DIR None Optional local model directory
STT_PRELOAD_MODELS "" Comma-separated STT models to preload
STT_VAD_ENABLED true Enable VAD by default on streaming endpoint
STT_VAD_THRESHOLD 0.5 VAD speech probability threshold
STT_VAD_MIN_SPEECH_MS 250 Min speech duration before activation
STT_VAD_SILENCE_MS 800 Silence duration for endpointing
STT_DIARIZE_ENABLED false Enable diarization support
STT_NOISE_REDUCE false Enable STT denoise preprocessing
STT_NORMALIZE true Enable STT input normalization

TTS_* (text-to-speech)

Variable Default Description
TTS_ENABLED true Enable TTS endpoints
TTS_MODEL kokoro Default TTS model ID
TTS_VOICE af_heart Default TTS voice
TTS_DEVICE None TTS device override (falls back to STT device)
TTS_MAX_INPUT_LENGTH 4096 Max text chars per synthesis request
TTS_DEFAULT_FORMAT mp3 Default output audio format
TTS_SPEED 1.0 Default synthesis speed
TTS_PRELOAD_MODELS "" Comma-separated TTS models to preload
TTS_VOICES_CONFIG "" YAML voice presets file path
TTS_CACHE_ENABLED false Enable on-disk synthesis cache
TTS_CACHE_MAX_MB 500 Max cache size in MB
TTS_CACHE_DIR /var/lib/open-speech/cache Cache directory
TTS_TRIM_SILENCE true Trim generated silence
TTS_NORMALIZE_OUTPUT true Normalize output loudness
TTS_PRONUNCIATION_DICT "" Pronunciation dictionary path
TTS_QWEN3_SIZE 1.7B Qwen3 model size selector
TTS_QWEN3_FLASH_ATTN false Enable Qwen3 flash attention
TTS_QWEN3_DEVICE cuda:0 Qwen3 backend device override

Roadmap

See ROADMAP.md for the full phase breakdown and current status.

Current status (v0.5.1)

  • ✅ Phases 2–6 complete (multi-model, Docker, advanced TTS, Wyoming, production hardening)
  • ✅ Phase 8a+8b — Voice Profiles + Generation History
  • ✅ Web UI — ground-up rewrite: Transcribe / Speak / Models / History / Settings
  • 🚧 Phase 8c–8e — Conversation mode, Voice effects, Composer
  • 🔴 B6, B9 — critical backend bugs (provider install path, streaming event loop)

Contributing

Open Speech uses a pluggable backend system. To add a new STT or TTS backend:

  1. Create a new file in src/tts/ or src/stt/
  2. Implement the backend interface (see existing backends for examples; a rough, hypothetical sketch follows below)
  3. Register it in the model registry
  4. Add tests
  5. Submit a PR

See TTS-BACKENDS.md for the TTS backend roadmap and design patterns.
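
Purely as an illustration of the shape a new backend might take — the method names below are hypothetical; the real interface is whatever the existing backends in src/tts/ and src/stt/ implement:

# Hypothetical sketch — copy an existing backend in src/tts/ for the actual interface.
class MyTTSBackend:
    """Minimal TTS backend shape: lazy-load weights, list voices, return audio bytes."""

    def __init__(self, model_id: str, device: str = "cpu"):
        self.model_id = model_id
        self.device = device
        self._model = None

    def load(self) -> None:
        # Download/load weights on demand, never at import time
        ...

    def voices(self) -> list[str]:
        return ["default"]

    def synthesize(self, text: str, voice: str = "default", speed: float = 1.0) -> bytes:
        # Return raw audio bytes for the requested voice
        raise NotImplementedError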

License

MIT © 2026 Jeremy Windsor
