Independent Speech-to-Speech AI Model Training & Deployment
Build production-ready speech-to-speech AI models from scratch, with no dependency on external APIs. Train your own Luna-style conversational AI on RunPod with complete control over architecture, data, and costs.
✅ Duplex Streaming: ~400ms mean latency
✅ Public HiFiGAN Vocoder: MIT licensed, commercial-safe
✅ Turn-based Mode: VAD-based conversation management
✅ WebSocket API: Real-time browser integration
✅ Self-contained: Automatic model downloading and caching
✅ Trainable Speech Tokenizer: RVQ-based audio compression
✅ Hybrid S2S Model: End-to-end speech-to-speech transformer
✅ Checkpoint Management: Save/resume training anytime
✅ Cost-Optimized: Train for $300-1,700 (vs $10K+ for GPT-4o)
✅ LibriSpeech Ready: Works out-of-the-box with free datasets
✅ WandB Integration: Track training metrics in real-time
```bash
# Clone repository
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
# Install dependencies
pip install -r requirements.txt
# Start server (downloads models automatically)
REPLY_MODE=turn python src/server.py
# Access web interface
open http://localhost:8000
```

See TRAINING_GUIDE.md for complete instructions.
Quick overview:
```bash
# 1. Download LibriSpeech dataset (100 hours, free)
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz -C /workspace/data
# 2. Install training dependencies
pip install -r requirements-training.txt
# 3. Configure training
nano training/configs/tokenizer_config.yaml
# 4. Start training on RunPod A100
python training/train_tokenizer.py
# Estimated cost: $10-15 for 100h dataset (8-12 GPU hours)
```

Architecture:

```
User Audio (Browser)
        ↓ WebSocket
[FastAPI Server]
        ↓
[VAD]                  detects speech activity
        ↓
[Speech Tokenizer]     audio → discrete tokens
        ↓
[Hybrid S2S Model]     token → token generation
        ↓
[Speech Decoder]       tokens → mel spectrogram
        ↓
[HiFiGAN Vocoder]      mel → audio waveform
        ↓ WebSocket
AI Response (Browser)
```
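As a rough sketch of the data flow above (stub functions only: the hop size and shapes here are assumptions, and the real components live in src/models/):

```python
# Stub pipeline mirroring the diagram; every function is a placeholder for
# the corresponding module in src/models/, with assumed shapes.
import numpy as np

HOP = 320  # assumed samples per token at 24 kHz (~75 tokens/s)

def vad(audio: np.ndarray) -> bool:
    """[VAD] stand-in: simple RMS energy threshold."""
    return float(np.sqrt(np.mean(audio ** 2))) > 0.01

def tokenize(audio: np.ndarray) -> np.ndarray:
    """[Speech Tokenizer] stand-in: audio -> discrete token IDs."""
    return np.zeros(len(audio) // HOP, dtype=np.int64)

def s2s_generate(tokens: np.ndarray) -> np.ndarray:
    """[Hybrid S2S Model] stand-in: token -> token generation (echo)."""
    return tokens

def decode_mel(tokens: np.ndarray) -> np.ndarray:
    """[Speech Decoder] stand-in: tokens -> 80-bin mel spectrogram."""
    return np.zeros((80, len(tokens)), dtype=np.float32)

def vocode(mel: np.ndarray) -> np.ndarray:
    """[HiFiGAN Vocoder] stand-in: mel -> 24 kHz waveform."""
    return np.zeros(mel.shape[1] * HOP, dtype=np.float32)

def respond(user_audio: np.ndarray) -> np.ndarray:
    if not vad(user_audio):
        return np.zeros(0, dtype=np.float32)  # no speech detected
    return vocode(decode_mel(s2s_generate(tokenize(user_audio))))
```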
Phase 1: Speech Tokenizer
- Architecture: CNN Encoder + RVQ (8 quantizers) + CNN Decoder (see the sketch after this list)
- Dataset: LibriSpeech (100-960 hours)
- Training Time: 8-120 GPU hours
- Cost: $10-143 on RunPod A100
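For orientation, a minimal PyTorch sketch of the RVQ bottleneck named above, sized to the tokenizer config shown later in this README (codebook_size 1024, hidden_dim 512, 8 quantizers). The actual implementation is src/models/speech_tokenizer_trainable.py and may differ:

```python
# Minimal residual vector quantization (RVQ): each quantizer encodes the
# residual error left by the previous one. Illustrative, not the repo's code.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=512):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, x):  # x: (batch, time, dim) from the CNN encoder
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for cb in self.codebooks:
            w = cb.weight  # (codebook_size, dim)
            # Squared L2 distance from each residual vector to every code
            dist = (residual.pow(2).sum(-1, keepdim=True)
                    - 2 * residual @ w.T
                    + w.pow(2).sum(-1))
            idx = dist.argmin(-1)      # (batch, time)
            q = cb(idx)                # nearest codebook entries
            quantized = quantized + q
            residual = residual - q    # next quantizer codes the leftover error
            codes.append(idx)
        # Straight-through estimator: gradients bypass the argmin
        quantized = x + (quantized - x).detach()
        return quantized, torch.stack(codes, -1)  # codes: (batch, time, 8)
```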
Phase 2: Hybrid S2S Model (Coming Soon)
- Architecture: Transformer with dual-stream processing
- Dataset: Conversational pairs (synthetic or real)
- Training Time: 100-400 GPU hours
- Cost: $119-476
Phase 3: Emotional Control (Coming Soon)
- Fine-tuning on emotional speech datasets
- Supports laughter, sighs, whispers, emotions
- Training Time: 50-150 GPU hours
- Cost: $60-179
| Configuration | Dataset | GPU Hours | Cost | Quality |
|---|---|---|---|---|
| Budget | LibriSpeech 100h | 10 | $12 | Testing |
| Standard | LibriSpeech 360h | 40 | $48 | Production |
| Premium | LibriSpeech 960h | 120 | $143 | Excellent |
RunPod A100 80GB pricing: $1.19/hr (Secure Cloud), $0.89/hr (Community Cloud)
Full Training Pipeline (Budget to Production):
- Speech Tokenizer: $12-143
- S2S Model: $119-476
- Emotion Fine-tuning: $60-179
- Total: $191-798
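The per-phase costs above are simply GPU hours times the $1.19/hr Secure Cloud rate quoted below; a quick check that reproduces the totals (up to per-phase rounding):

```python
# GPU-hour ranges from the list above, priced at the Secure Cloud rate.
phases = {"tokenizer": (10, 120), "s2s_model": (100, 400), "emotion": (50, 150)}
rate = 1.19  # $/GPU-hour, RunPod A100 Secure Cloud

low = sum(lo for lo, _ in phases.values()) * rate   # 160 h -> $190.40
high = sum(hi for _, hi in phases.values()) * rate  # 670 h -> $797.30
print(f"${low:.0f}-${high:.0f}")  # $190-$797 (~$191-798 with per-phase rounding)
```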
Compare to:
- OpenAI GPT-4o Realtime API: $5-10/hr usage, no model ownership
- Luna AI (Pixa): API-only, pricing TBD, no self-hosting
- Your model: One-time cost, full ownership, unlimited usage
API endpoints:
- `GET /health` - Health check and model status
- `GET /api/stats` - Latency statistics and performance metrics
- `WebSocket /ws/stream` - Real-time audio streaming
- `GET /` - Built-in web interface with audio controls
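A quick way to exercise the REST endpoints, assuming the quick-start server is running locally on port 8000 (the JSON field names are whatever the server returns):

```python
# Smoke-test the health and stats endpoints with the requests library.
import requests

health = requests.get("http://localhost:8000/health", timeout=5)
print(health.status_code, health.json())

stats = requests.get("http://localhost:8000/api/stats", timeout=5)
print(stats.json())
```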
```javascript
// Browser WebSocket client
const ws = new WebSocket('wss://your-pod.runpod.net:8000/ws/stream');
ws.binaryType = 'arraybuffer';  // so event.data is an ArrayBuffer, not a Blob
// Send audio chunks (24kHz, mono, PCM)
ws.send(audioChunkFloat32Array.buffer);
// Receive AI audio response
ws.onmessage = (event) => {
const audioData = new Float32Array(event.data);
playAudio(audioData);
};
```
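For scripted testing outside the browser, a hedged Python equivalent of the client above, assuming the `websockets` package and the same binary frame format (24 kHz mono float32 PCM):

```python
# Send one audio chunk to /ws/stream and read back the AI's audio reply.
import asyncio
import numpy as np
import websockets

async def stream_once(url: str = "ws://localhost:8000/ws/stream") -> None:
    async with websockets.connect(url) as ws:
        chunk = np.zeros(24000, dtype=np.float32)  # 1 s of silence as a stand-in
        await ws.send(chunk.tobytes())             # binary frame, like the JS client
        reply = await ws.recv()                    # raw float32 PCM bytes back
        audio = np.frombuffer(reply, dtype=np.float32)
        print(f"received {audio.size} samples")

asyncio.run(stream_once())
```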
Project structure:

```
Testing-S2S/
├── src/
│   ├── models/
│   │   ├── speech_tokenizer.py            # Inference-only tokenizer
│   │   ├── speech_tokenizer_trainable.py  # Trainable with RVQ (NEW!)
│   │   ├── hybrid_s2s.py                  # S2S transformer model
│   │   ├── hifigan_public.py              # MIT-licensed vocoder
│   │   └── streaming_processor.py         # Real-time inference
│   └── server.py                          # FastAPI server
├── training/                              # Training scripts (NEW!)
│   ├── train_tokenizer.py
│   └── configs/
│       └── tokenizer_config.yaml
├── scripts/
│   └── test_tokenizer.py                  # Quality testing (NEW!)
├── requirements.txt
├── requirements-training.txt              # Training dependencies (NEW!)
├── TRAINING_GUIDE.md                      # Complete training guide (NEW!)
└── README.md
```
```bash
# Clone and setup
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -e .
# Run tests
python -m pytest tests/
# Test tokenizer quality
python scripts/test_tokenizer.py \
--checkpoint checkpoints/tokenizer_best.pt \
--input test_audio.wav \
--output reconstructed.wav
# Start development server
REPLY_MODE=turn python src/server.py
```

Environment variables:

```bash
# Inference
export REPLY_MODE=turn # 'stream' or 'turn'
export MODEL_CACHE_DIR=/workspace/cache
export CUDA_VISIBLE_DEVICES=0
# Training
export WANDB_API_KEY=your_key # Optional: WandB logging
export TORCH_HOME=/workspace/cache/torch
```
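For reference, a sketch of how inference code might read these variables (illustrative; see src/server.py for the actual handling, and the defaults here are assumptions):

```python
# Pick up the runtime configuration exported above.
import os

reply_mode = os.environ.get("REPLY_MODE", "turn")
assert reply_mode in ("stream", "turn"), f"unexpected REPLY_MODE: {reply_mode}"
cache_dir = os.environ.get("MODEL_CACHE_DIR", "/workspace/cache")
```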
Edit `training/configs/tokenizer_config.yaml`:

```yaml
model:
  sample_rate: 24000       # Match Luna AI
  codebook_size: 1024
  hidden_dim: 512
  num_quantizers: 8

data:
  data_dir: "/workspace/data"
  train_split: "train-clean-100"

training:
  epochs: 100
  batch_size: 16           # Adjust for GPU memory
  learning_rate: 1e-4
```
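A quick way to load and spot-check the config (assumes PyYAML; train_tokenizer.py may parse its config differently):

```python
# Load the tokenizer config and verify a couple of the values above.
import yaml

with open("training/configs/tokenizer_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["model"]["num_quantizers"] == 8
print(cfg["training"]["batch_size"])  # 16
```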
Documentation:
- TRAINING_GUIDE.md - Complete training tutorial
- RUNPOD_DEPLOYMENT.md - Cloud deployment guide
- VOCODER_NOTICE.md - HiFiGAN license details
- MODEL_STATUS_CRITICAL.md - Model architecture notes
| Mode | Mean Latency | P95 Latency | Notes |
|---|---|---|---|
| Stream | ~400ms | ~600ms | Continuous generation |
| Turn | ~350ms | ~500ms | Wait for speech end |
| Component | Dataset Size | Batch Size | Time/Epoch | Total Time |
|---|---|---|---|---|
| Tokenizer | 100h | 16 | 5 min | 8h (100 epochs) |
| Tokenizer | 360h | 16 | 18 min | 30h (100 epochs) |
| Tokenizer | 960h | 16 | 48 min | 80h (100 epochs) |
```bash
# 1. Launch RunPod A100 pod
# 2. Clone repo and install
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -r requirements.txt
# 3. Download trained checkpoints (or train your own)
wget -O checkpoints/tokenizer_best.pt YOUR_CHECKPOINT_URL
# 4. Start production server
uvicorn src.server:app --host 0.0.0.0 --port 8000 --workers 1
# 5. Access via RunPod proxy
# https://<pod-id>.direct.runpod.net:8000
```

Inference Costs:
- A40 48GB: $0.79/hr (~$570/month for 24/7)
- A100 80GB: $1.19/hr (~$860/month for 24/7)
- Community Cloud: spot instances, roughly 25-40% cheaper depending on GPU
Contributions welcome! Areas of focus:
- Hybrid S2S Model Training - Complete Phase 2 implementation
- Emotional Speech - Add laughter, sighs, breathing sounds
- Indian Languages - Multi-lingual training pipelines
- Latency Optimization - Achieve <500ms end-to-end
- Quality Improvements - Better audio reconstruction
Code: MIT License - Full commercial use allowed
HiFiGAN Vocoder: MIT License (see VOCODER_NOTICE.md)
Trained Models: You own 100% of models you train
Datasets:
- LibriSpeech: CC BY 4.0
- Common Voice: CC0 (Public Domain)
Official Links:
- GitHub: https://github.com/devasphn/Testing-S2S
- Training Guide: TRAINING_GUIDE.md
- RunPod: https://runpod.io
Inspiration:
- Luna AI (Pixa): First Indian speech-to-speech model
- Moshi (Kyutai Labs): Real-time duplex voice AI
- GLM-4-Voice (Tsinghua): Intelligent speech chatbot
Datasets:
- LibriSpeech: http://www.openslr.org/12/
- Common Voice: https://commonvoice.mozilla.org/
- IEMOCAP: https://sail.usc.edu/iemocap/
Why go independent:
- Ownership: Keep all model weights, no vendor lock-in
- Privacy: Audio data stays on your infrastructure
- Cost: One-time training vs ongoing API fees
- Customization: Full control over behavior, languages, voices
- Independence: No rate limits, no service outages
Luna AI (Pixa) is closed-source and API-only. This project:
- ✅ Open-source, self-hostable
- ✅ Similar architecture and capabilities
- ✅ Train on your own data
- ✅ $300-1,700 one-time cost vs ongoing API fees
- ⚠️ Requires ML expertise to train
Can I use this commercially? Yes! The MIT license allows full commercial use. You own:
- All code modifications
- All models you train
- All deployment infrastructure
Training: NVIDIA A100 80GB (RunPod: $0.89-1.19/hr)
Inference: NVIDIA A40 48GB or better ($0.49-0.79/hr)
Local Dev: Any GPU with 8GB+ VRAM (GTX 1080+)
Built with ❤️ by developers who believe in open, independent AI.
Version: 1.0 (Training Infrastructure Release)
Last Updated: November 17, 2025
Phase 1 (Current): ✅ Speech Tokenizer Training
Phase 2 (Coming Soon):
- Hybrid S2S Model training script
- Conversational dataset generation
- Checkpoint integration with inference server
Phase 3 (Future):
- Emotional speech fine-tuning
- Indian language support
- Real-time streaming optimizations
- <500ms end-to-end latency
Star this repo ⭐ to follow development!