stt-service is a high-performance, real-time speech-to-text microservice built with Python and FastAPI. It provides partial and final transcription with sub-300ms partial-result latency for conversational AI applications. The service features dual-mode processing, intelligent voice activity detection, and comprehensive monitoring for production deployment. Designed for seamless integration with conversational AI pipelines, it enables natural real-time interaction with interrupt handling (barge-in) support.
- Partial and Final Transcription: Real-time partial results (~300ms) with high-quality final transcriptions (~600ms)
- Enhanced Voice Activity Detection: Smart triggering with utterance boundary detection and barge-in support
- Dual Processing Modes: Optimized pipelines for speed (partial) vs. accuracy (final) with overlapping audio windows (sketched after this list)
- WebSocket Real-time API: FastAPI-based server with concurrent connections and intelligent buffering
- Conversational AI Optimized: Purpose-built for STT→LLM→TTS pipelines with interrupt handling
- GPU-Accelerated Processing: Whisper model integration with smart caching and resource management
- Production Monitoring: Comprehensive metrics, health checks, and performance analysis tools
- Backward Compatible: Legacy transcription API support for existing integrations
- Docker & GPU Ready: Complete containerization with CUDA support and scaling configurations
- Performance Testing: Built-in benchmarking tools for latency validation and optimization
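The overlapping-window idea behind the dual processing modes can be pictured with a short, purely illustrative sketch. The window and hop sizes below are hypothetical and are not taken from audio_stream_processor.py: short, frequent windows feed the low-latency partial pass, while longer overlapping windows feed the higher-accuracy final pass.

```python
# Illustrative sketch only -- window/hop sizes are hypothetical, not the
# values used by stt-service's audio_stream_processor.py.
import numpy as np

SAMPLE_RATE = 16000  # Hz, typical input rate for Whisper-family models

def sliding_windows(pcm: np.ndarray, window_s: float, hop_s: float):
    """Yield overlapping audio windows from a mono PCM buffer."""
    window = int(window_s * SAMPLE_RATE)
    hop = int(hop_s * SAMPLE_RATE)
    for start in range(0, max(len(pcm) - window + 1, 1), hop):
        yield pcm[start:start + window]

audio = np.zeros(SAMPLE_RATE * 5, dtype=np.float32)  # 5 s of silence as a stand-in

# Fast partial pass: short windows, frequent hops -> low latency, lower accuracy.
partial_windows = list(sliding_windows(audio, window_s=1.0, hop_s=0.3))

# Accurate final pass: longer windows that overlap the partial ones.
final_windows = list(sliding_windows(audio, window_s=3.0, hop_s=1.5))

print(f"{len(partial_windows)} partial windows, {len(final_windows)} final windows")
```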
└── stt-service/
    ├── Dockerfile
    ├── Dockerfile.gpu
    ├── app
    │   ├── __init__.py
    │   ├── core
    │   │   ├── audio_stream_processor.py    # Enhanced dual-mode processing
    │   │   ├── websocket_server.py          # FastAPI WebSocket server
    │   │   ├── whisper_handler.py           # Partial/final transcription
    │   │   └── microphone_capture.py
    │   ├── client_examples
    │   │   └── websocket_client_example.py  # Enhanced client with partial/final
    │   ├── main.py                          # CLI with WebSocket mode
    │   ├── monitoring
    │   └── utils
    ├── assets
    │   └── harvard.wav
    ├── docs
    │   ├── MONITORING.md
    │   ├── WebSocket_STT_Architecture.md    # Complete architecture guide
    │   └── PRODUCTION_MONITORING.md
    ├── main.py                              # Service entry point
    ├── test_partial_final_performance.py    # Performance validation
    ├── requirements.txt
    ├── scripts
    │   └── test_endpoints.bat
    └── tests
        ├── __init__.py
        ├── test_error_handler.py
        ├── test_logging.py
        ├── test_monitoring_simple.py
        └── test_realtime.py
STT-SERVICE/
- `__root__`
  - `requirements.txt`: Python package dependencies for running stt-service.
  - `dockerfile-gpu.txt`: Optional GPU-enabled Dockerfile snippet for building with CUDA/GPU support.
  - `Dockerfile`: Production-ready container build for the microservice.
- `scripts`
  - `test_endpoints.bat`: Windows batch script to exercise HTTP endpoints for quick local testing.
- `app`
  - `main.py`: FastAPI application entrypoint and CLI for running the service.
  - `core`
    - `whisper_handler.py`: Model adapter that wraps Whisper (or compatible) transcription models.
    - `realtime_transcription.py`: Implements the real-time transcription flow and request-handling utilities.
    - `microphone_capture.py`: Helpers for capturing audio from a microphone or system input device.
    - `audio_processor.py`: Audio preprocessing utilities (resampling, chunking, normalization).
  - `monitoring`
    - `monitoring_service.py`: Provides monitoring endpoints and routines used by health checks and probes.
    - `service_monitor.py`: Background routines that collect service metrics and write JSON test outputs.
  - `utils`
    - `logger.py`: Logging configuration and helpers used throughout the service.
    - `config.py`: Configuration helpers (environment variables, default settings).
    - `error_handler.py`: Centralized error-handling utilities and testable helpers.
    - `connection_manager.py`: Network and dependency connection helpers (external model backends, telemetry sinks).
Before getting started with stt-service, ensure your runtime environment meets the following requirements:
- Programming Language: Python
- Package Manager: Pip
- Container Runtime: Docker
Install stt-service using one of the following methods:
Build from source:
- Clone the stt-service repository:
❯ git clone https://github.com/Berkay2002/stt-service
- Navigate to the project directory:
❯ cd stt-service
- Install the project dependencies:
❯ pip install -r requirements.txt
Using docker:
❯ docker build -t berkay2002/stt-service .
# Start WebSocket server for real-time transcription
❯ python main.py websocket --port 8000
# With custom configuration
❯ python main.py websocket --host 0.0.0.0 --port 8080 --max-connections 100
# Process single audio file
❯ python main.py file input.wav
# With word-level timestamps
❯ python main.py file input.wav --timestamps
# Record and transcribe from microphone
❯ python main.py microphone --timestamps
# Show current configuration
❯ python main.py config --show
CPU Version:
❯ docker build -t stt-service .
❯ docker run -p 8000:8000 -p 9091:9091 stt-service
GPU Version:
❯ docker build -f Dockerfile.gpu -t stt-service-gpu .
❯ docker run --gpus all -p 8000:8000 -p 9091:9091 stt-service-gpu
Docker Compose with GPU:
❯ docker-compose -f docker-compose.gpu.yml up
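docker-compose.gpu.yml is referenced above but not shown in this section. If you need to author one, a minimal sketch might look like the following; the service name, port mappings, and GPU reservation are assumptions to adapt, not the contents of the project's actual file.

```yaml
# Hypothetical docker-compose.gpu.yml sketch -- adjust to match the real file.
services:
  stt-service:
    build:
      context: .
      dockerfile: Dockerfile.gpu
    ports:
      - "8000:8000"   # WebSocket / HTTP API
      - "9091:9091"   # monitoring endpoint
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```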
# Run all unit tests
❯ pytest
# Run specific test modules
❯ pytest tests/test_error_handler.py
❯ pytest tests/test_realtime.py
# Test partial/final transcription performance
❯ python test_partial_final_performance.py
# Expected output:
# ✅ EXCELLENT: Partial latency < 300ms
# ✅ EXCELLENT: Final latency < 600ms
# ✅ CONFIRMED: Partial transcription working
# ✅ CONFIRMED: Final transcription working
# Interactive client with partial/final support
❯ python app/client_examples/websocket_client_example.py
# Choose option 1 for microphone testing
# Choose option 2 for test audio file
# Choose option 3 for connection stress test
# Test HTTP endpoints (Windows)
❯ scripts/test_endpoints.bat
# Test WebSocket health
❯ curl http://localhost:8000/health
❯ curl http://localhost:8000/stats
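The same checks can be scripted. Below is a minimal Python smoke test; it assumes only that the `requests` package is installed and that the endpoints return JSON, without presuming any particular response shape.

```python
# Minimal smoke test for the HTTP endpoints; requires `pip install requests`.
import requests

for url in ("http://localhost:8000/health", "http://localhost:8000/stats"):
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()  # fail loudly on non-2xx responses
    print(url, "->", resp.json())
```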
This project is licensed under the MIT License.
Thanks to the open-source projects and communities that make this work possible:
- OpenAI and the Whisper model authors — for the foundational speech-to-text models and faster-whisper optimizations
- FastAPI community — for the high-performance async web framework enabling real-time WebSocket processing
- Python audio ecosystem — NumPy, soundfile, pyaudio, scipy for comprehensive audio processing capabilities
- GPU acceleration libraries — PyTorch, CUDA toolkit for efficient real-time transcription processing
- WebSocket and async communities — for enabling seamless real-time communication protocols
- Install dependencies: `pip install -r requirements.txt`
- Start WebSocket server: `python main.py websocket`
- Test with performance validator: `python test_partial_final_performance.py`
- Try interactive client: `python app/client_examples/websocket_client_example.py`
- Transcription: ws://localhost:8000/ws/transcribe (see the client sketch below)
- Health Check: http://localhost:8000/health
- Statistics: http://localhost:8000/stats
- Monitoring: http://localhost:9091/health
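For the transcription endpoint, a bare-bones streaming client might look like the sketch below. It uses the third-party `websockets` package, and both the raw-PCM framing and the message schema (JSON with `type` and `text` fields) are assumptions; see app/client_examples/websocket_client_example.py for the authoritative protocol.

```python
# Hypothetical minimal client -- audio framing and message schema are
# assumptions, not the documented protocol. Requires `pip install websockets`.
import asyncio
import json

import websockets

async def stream_file(path: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws/transcribe") as ws:
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # ~100 ms of 16 kHz 16-bit mono PCM
                await ws.send(chunk)      # binary audio frame
                try:
                    # Drain any pending transcription messages without stalling.
                    while True:
                        raw = await asyncio.wait_for(ws.recv(), timeout=0.01)
                        msg = json.loads(raw)
                        print(msg.get("type"), ":", msg.get("text"))
                except asyncio.TimeoutError:
                    pass

asyncio.run(stream_file("assets/harvard.wav"))
```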
- Partial Results: <300ms average latency
- Final Results: <600ms average latency
- Concurrent Connections: 50+ streams
- Real-time Factor: <0.3 on GPU