Independent Speech-to-Speech AI Model Training & Deployment
Build production-ready speech-to-speech AI models from scratch, with no dependency on external APIs. Train your own Luna-style conversational AI on RunPod with complete control over architecture, data, and costs.
✅ Duplex Streaming: ~400ms mean latency
✅ Public HiFiGAN Vocoder: MIT licensed, commercial-safe
✅ Turn-based Mode: VAD-based conversation management
✅ WebSocket API: Real-time browser integration
✅ Self-contained: Automatic model downloading and caching
✅ Trainable Speech Tokenizer: RVQ-based audio compression
✅ Hybrid S2S Model: End-to-end speech-to-speech transformer
✅ Checkpoint Management: Save/resume training anytime
✅ Cost-Optimized: Train for $300-1,700 (vs $10K+ for GPT-4o)
✅ LibriSpeech Ready: Works out-of-the-box with free datasets
✅ WandB Integration: Track training metrics in real-time
```bash
# Clone repository
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
# Install dependencies
pip install -r requirements.txt
# Start server (downloads models automatically)
REPLY_MODE=turn python src/server.py
# Access web interface
open http://localhost:8000
```

See TRAINING_GUIDE.md for complete instructions.
Quick overview:
```bash
# 1. Download LibriSpeech dataset (100 hours, free)
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz -C /workspace/data
# 2. Install training dependencies
pip install -r requirements-training.txt
# 3. Configure training
nano training/configs/tokenizer_config.yaml
# 4. Start training on RunPod A100
python training/train_tokenizer.py
# Estimated cost: $10-15 for 100h dataset (8-12 GPU hours)
```

Architecture:

```
User Audio (Browser)
        ↓ WebSocket
[FastAPI Server]
        ↓
[VAD]                  detects speech activity
        ↓
[Speech Tokenizer]     audio → discrete tokens
        ↓
[Hybrid S2S Model]     token → token generation
        ↓
[Speech Decoder]       tokens → mel spectrogram
        ↓
[HiFiGAN Vocoder]      mel → audio waveform
        ↓ WebSocket
AI Response (Browser)
```
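As a rough sketch of the data flow above (stub functions only: the hop size and shapes here are assumptions, and the real components live in src/models/):

```python
# Stub pipeline mirroring the diagram; every function is a placeholder for
# the corresponding module in src/models/, with assumed shapes.
import numpy as np

HOP = 320  # assumed samples per token at 24 kHz (~75 tokens/s)

def vad(audio: np.ndarray) -> bool:
    """[VAD] stand-in: simple RMS energy threshold."""
    return float(np.sqrt(np.mean(audio ** 2))) > 0.01

def tokenize(audio: np.ndarray) -> np.ndarray:
    """[Speech Tokenizer] stand-in: audio -> discrete token IDs."""
    return np.zeros(len(audio) // HOP, dtype=np.int64)

def s2s_generate(tokens: np.ndarray) -> np.ndarray:
    """[Hybrid S2S Model] stand-in: token -> token generation (echo)."""
    return tokens

def decode_mel(tokens: np.ndarray) -> np.ndarray:
    """[Speech Decoder] stand-in: tokens -> 80-bin mel spectrogram."""
    return np.zeros((80, len(tokens)), dtype=np.float32)

def vocode(mel: np.ndarray) -> np.ndarray:
    """[HiFiGAN Vocoder] stand-in: mel -> 24 kHz waveform."""
    return np.zeros(mel.shape[1] * HOP, dtype=np.float32)

def respond(user_audio: np.ndarray) -> np.ndarray:
    if not vad(user_audio):
        return np.zeros(0, dtype=np.float32)  # no speech detected
    return vocode(decode_mel(s2s_generate(tokenize(user_audio))))
```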
Phase 1: Speech Tokenizer
- Architecture: CNN Encoder + RVQ (8 quantizers) + CNN Decoder (see the sketch after this list)
- Dataset: LibriSpeech (100-960 hours)
- Training Time: 8-120 GPU hours
- Cost: $10-143 on RunPod A100
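For orientation, a minimal PyTorch sketch of the RVQ bottleneck named above, sized to the tokenizer config shown later in this README (codebook_size 1024, hidden_dim 512, 8 quantizers). The actual implementation is src/models/speech_tokenizer_trainable.py and may differ:

```python
# Minimal residual vector quantization (RVQ): each quantizer encodes the
# residual error left by the previous one. Illustrative, not the repo's code.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=512):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, x):  # x: (batch, time, dim) from the CNN encoder
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for cb in self.codebooks:
            w = cb.weight  # (codebook_size, dim)
            # Squared L2 distance from each residual vector to every code
            dist = (residual.pow(2).sum(-1, keepdim=True)
                    - 2 * residual @ w.T
                    + w.pow(2).sum(-1))
            idx = dist.argmin(-1)      # (batch, time)
            q = cb(idx)                # nearest codebook entries
            quantized = quantized + q
            residual = residual - q    # next quantizer codes the leftover error
            codes.append(idx)
        # Straight-through estimator: gradients bypass the argmin
        quantized = x + (quantized - x).detach()
        return quantized, torch.stack(codes, -1)  # codes: (batch, time, 8)
```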
Phase 2: Hybrid S2S Model (Coming Soon)
- Architecture: Transformer with dual-stream processing
- Dataset: Conversational pairs (synthetic or real)
- Training Time: 100-400 GPU hours
- Cost: $119-476
Phase 3: Emotional Control (Coming Soon)
- Fine-tuning on emotional speech datasets
- Supports laughter, sighs, whispers, emotions
- Training Time: 50-150 GPU hours
- Cost: $60-179
| Configuration | Dataset | GPU Hours | Cost | Quality |
|---|---|---|---|---|
| Budget | LibriSpeech 100h | 10 | $12 | Testing |
| Standard | LibriSpeech 360h | 40 | $48 | Production |
| Premium | LibriSpeech 960h | 120 | $143 | Excellent |
RunPod A100 80GB pricing: $1.19/hr (Secure Cloud), $0.89/hr (Community Cloud)
Full Training Pipeline (Budget to Production):
- Speech Tokenizer: $12-143
- S2S Model: $119-476
- Emotion Fine-tuning: $60-179
- Total: $191-798
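The per-phase costs above are simply GPU hours times the $1.19/hr Secure Cloud rate quoted below; a quick check that reproduces the totals (up to per-phase rounding):

```python
# GPU-hour ranges from the list above, priced at the Secure Cloud rate.
phases = {"tokenizer": (10, 120), "s2s_model": (100, 400), "emotion": (50, 150)}
rate = 1.19  # $/GPU-hour, RunPod A100 Secure Cloud

low = sum(lo for lo, _ in phases.values()) * rate   # 160 h -> $190.40
high = sum(hi for _, hi in phases.values()) * rate  # 670 h -> $797.30
print(f"${low:.0f}-${high:.0f}")  # $190-$797 (~$191-798 with per-phase rounding)
```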
Compare to:
- OpenAI GPT-4o Realtime API: $5-10/hr usage, no model ownership
- Luna AI (Pixa): API-only, pricing TBD, no self-hosting
- Your model: One-time cost, full ownership, unlimited usage
API endpoints:
- `GET /health` - Health check and model status
- `GET /api/stats` - Latency statistics and performance metrics
- `WebSocket /ws/stream` - Real-time audio streaming
- `GET /` - Built-in web interface with audio controls
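A quick way to exercise the REST endpoints, assuming the quick-start server is running locally on port 8000 (the JSON field names are whatever the server returns):

```python
# Smoke-test the health and stats endpoints with the requests library.
import requests

health = requests.get("http://localhost:8000/health", timeout=5)
print(health.status_code, health.json())

stats = requests.get("http://localhost:8000/api/stats", timeout=5)
print(stats.json())
```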
```javascript
// Browser WebSocket client
const ws = new WebSocket('wss://your-pod.runpod.net:8000/ws/stream');
ws.binaryType = 'arraybuffer';  // so event.data is an ArrayBuffer, not a Blob
// Send audio chunks (24kHz, mono, PCM)
ws.send(audioChunkFloat32Array.buffer);
// Receive AI audio response
ws.onmessage = (event) => {
const audioData = new Float32Array(event.data);
playAudio(audioData);
};
```
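For scripted testing outside the browser, a hedged Python equivalent of the client above, assuming the `websockets` package and the same binary frame format (24 kHz mono float32 PCM):

```python
# Send one audio chunk to /ws/stream and read back the AI's audio reply.
import asyncio
import numpy as np
import websockets

async def stream_once(url: str = "ws://localhost:8000/ws/stream") -> None:
    async with websockets.connect(url) as ws:
        chunk = np.zeros(24000, dtype=np.float32)  # 1 s of silence as a stand-in
        await ws.send(chunk.tobytes())             # binary frame, like the JS client
        reply = await ws.recv()                    # raw float32 PCM bytes back
        audio = np.frombuffer(reply, dtype=np.float32)
        print(f"received {audio.size} samples")

asyncio.run(stream_once())
```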
Project structure:

```
Testing-S2S/
├── src/
│   ├── models/
│   │   ├── speech_tokenizer.py            # Inference-only tokenizer
│   │   ├── speech_tokenizer_trainable.py  # Trainable with RVQ (NEW!)
│   │   ├── hybrid_s2s.py                  # S2S transformer model
│   │   ├── hifigan_public.py              # MIT-licensed vocoder
│   │   └── streaming_processor.py         # Real-time inference
│   └── server.py                          # FastAPI server
├── training/                              # Training scripts (NEW!)
│   ├── train_tokenizer.py
│   └── configs/
│       └── tokenizer_config.yaml
├── scripts/
│   └── test_tokenizer.py                  # Quality testing (NEW!)
├── requirements.txt
├── requirements-training.txt              # Training dependencies (NEW!)
├── TRAINING_GUIDE.md                      # Complete training guide (NEW!)
└── README.md
```
```bash
# Clone and setup
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -e .
# Run tests
python -m pytest tests/
# Test tokenizer quality
python scripts/test_tokenizer.py \
--checkpoint checkpoints/tokenizer_best.pt \
--input test_audio.wav \
--output reconstructed.wav
# Start development server
REPLY_MODE=turn python src/server.py
```

Environment variables:

```bash
# Inference
export REPLY_MODE=turn # 'stream' or 'turn'
export MODEL_CACHE_DIR=/workspace/cache
export CUDA_VISIBLE_DEVICES=0
# Training
export WANDB_API_KEY=your_key # Optional: WandB logging
export TORCH_HOME=/workspace/cache/torch
```
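For reference, a sketch of how inference code might read these variables (illustrative; see src/server.py for the actual handling, and the defaults here are assumptions):

```python
# Pick up the runtime configuration exported above.
import os

reply_mode = os.environ.get("REPLY_MODE", "turn")
assert reply_mode in ("stream", "turn"), f"unexpected REPLY_MODE: {reply_mode}"
cache_dir = os.environ.get("MODEL_CACHE_DIR", "/workspace/cache")
```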
Edit `training/configs/tokenizer_config.yaml`:

```yaml
model:
  sample_rate: 24000       # Match Luna AI
  codebook_size: 1024
  hidden_dim: 512
  num_quantizers: 8

data:
  data_dir: "/workspace/data"
  train_split: "train-clean-100"

training:
  epochs: 100
  batch_size: 16           # Adjust for GPU memory
  learning_rate: 1e-4
```
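A quick way to load and spot-check the config (assumes PyYAML; train_tokenizer.py may parse its config differently):

```python
# Load the tokenizer config and verify a couple of the values above.
import yaml

with open("training/configs/tokenizer_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["model"]["num_quantizers"] == 8
print(cfg["training"]["batch_size"])  # 16
```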
Documentation:
- TRAINING_GUIDE.md - Complete training tutorial
- RUNPOD_DEPLOYMENT.md - Cloud deployment guide
- VOCODER_NOTICE.md - HiFiGAN license details
- MODEL_STATUS_CRITICAL.md - Model architecture notes
| Mode | Mean Latency | P95 Latency | Notes |
|---|---|---|---|
| Stream | ~400ms | ~600ms | Continuous generation |
| Turn | ~350ms | ~500ms | Wait for speech end |
| Component | Dataset Size | Batch Size | Time/Epoch | Total Time |
|---|---|---|---|---|
| Tokenizer | 100h | 16 | 5 min | 8h (100 epochs) |
| Tokenizer | 360h | 16 | 18 min | 30h (100 epochs) |
| Tokenizer | 960h | 16 | 48 min | 80h (100 epochs) |
```bash
# 1. Launch RunPod A100 pod
# 2. Clone repo and install
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -r requirements.txt
# 3. Download trained checkpoints (or train your own)
wget -O checkpoints/tokenizer_best.pt YOUR_CHECKPOINT_URL
# 4. Start production server
uvicorn src.server:app --host 0.0.0.0 --port 8000 --workers 1
# 5. Access via RunPod proxy
# https://<pod-id>.direct.runpod.net:8000
```

Inference Costs:
- A40 48GB: $0.79/hr (~$570/month for 24/7)
- A100 80GB: $1.19/hr (~$860/month for 24/7)
- Community Cloud: spot instances, roughly 25-40% cheaper depending on GPU
Contributions welcome! Areas of focus:
- Hybrid S2S Model Training - Complete Phase 2 implementation
- Emotional Speech - Add laughter, sighs, breathing sounds
- Indian Languages - Multi-lingual training pipelines
- Latency Optimization - Achieve <500ms end-to-end
- Quality Improvements - Better audio reconstruction
Code: MIT License - Full commercial use allowed
HiFiGAN Vocoder: MIT License (see VOCODER_NOTICE.md)
Trained Models: You own 100% of models you train
Datasets:
- LibriSpeech: CC BY 4.0
- Common Voice: CC0 (Public Domain)
Official Links:
- GitHub: https://github.com/devasphn/Testing-S2S
- Training Guide: TRAINING_GUIDE.md
- RunPod: https://runpod.io
Inspiration:
- Luna AI (Pixa): First Indian speech-to-speech model
- Moshi (Kyutai Labs): Real-time duplex voice AI
- GLM-4-Voice (Tsinghua): Intelligent speech chatbot
Datasets:
- LibriSpeech: http://www.openslr.org/12/
- Common Voice: https://commonvoice.mozilla.org/
- IEMOCAP: https://sail.usc.edu/iemocap/
Why go independent:
- Ownership: Keep all model weights, no vendor lock-in
- Privacy: Audio data stays on your infrastructure
- Cost: One-time training vs ongoing API fees
- Customization: Full control over behavior, languages, voices
- Independence: No rate limits, no service outages
Luna AI (Pixa) is closed-source and API-only. This project:
- ✅ Open-source, self-hostable
- ✅ Similar architecture and capabilities
- ✅ Train on your own data
- ✅ $300-1,700 one-time cost vs ongoing API fees
- ⚠️ Requires ML expertise to train
Can I use this commercially? Yes! The MIT license allows full commercial use. You own:
- All code modifications
- All models you train
- All deployment infrastructure
Training: NVIDIA A100 80GB (RunPod: $0.89-1.19/hr)
Inference: NVIDIA A40 48GB or better ($0.49-0.79/hr)
Local Dev: Any GPU with 8GB+ VRAM (GTX 1080+)
Built with ❤️ by developers who believe in open, independent AI.
Version: 1.0 (Training Infrastructure Release)
Last Updated: November 17, 2025
Phase 1 (Current): ✅ Speech Tokenizer Training
Phase 2 (Coming Soon):
- Hybrid S2S Model training script
- Conversational dataset generation
- Checkpoint integration with inference server
Phase 3 (Future):
- Emotional speech fine-tuning
- Indian language support
- Real-time streaming optimizations
- <500ms end-to-end latency
Star this repo ⭐ to follow development!