
Testing-S2S: Train Your Own Luna AI

Independent Speech-to-Speech AI Model Training & Deployment

Build production-ready speech-to-speech AI models from scratch, with no dependency on external APIs. Train your own Luna-style conversational AI on RunPod with complete control over architecture, data, and costs.


🌟 Features

Production-Ready Inference

✓ Duplex Streaming: ~400ms mean latency
✓ Public HiFiGAN Vocoder: MIT licensed, commercial-safe
✓ Turn-based Mode: VAD-based conversation management
✓ WebSocket API: Real-time browser integration
✓ Self-contained: Automatic model downloading and caching

Training Infrastructure (NEW!)

✓ Trainable Speech Tokenizer: RVQ-based audio compression
✓ Hybrid S2S Model: End-to-end speech-to-speech transformer
✓ Checkpoint Management: Save/resume training anytime
✓ Cost-Optimized: Train for $300-1,700 (vs $10K+ for GPT-4o)
✓ LibriSpeech Ready: Works out-of-box with free datasets
✓ WandB Integration: Track training metrics in real-time


🚀 Quick Start

Option 1: Use Pretrained Models (Inference Only)

# Clone repository
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S

# Install dependencies
pip install -r requirements.txt

# Start server (downloads models automatically)
REPLY_MODE=turn python src/server.py

# Access web interface
open http://localhost:8000

Option 2: Train Your Own Models

See TRAINING_GUIDE.md for complete instructions.

Quick overview:

# 1. Download LibriSpeech dataset (100 hours, free)
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
mkdir -p /workspace/data
tar -xzf train-clean-100.tar.gz -C /workspace/data

# 2. Install training dependencies
pip install -r requirements-training.txt

# 3. Configure training
nano training/configs/tokenizer_config.yaml

# 4. Start training on RunPod A100
python training/train_tokenizer.py

# Estimated cost: $10-15 for 100h dataset (8-12 GPU hours)

🏭 Architecture

Inference Pipeline (Current)

User Audio (Browser)
    ↓ WebSocket
[FastAPI Server]
    ↓
[VAD] → Detect speech activity
    ↓
[Speech Tokenizer] → Audio → Discrete tokens
    ↓
[Hybrid S2S Model] → Token → Token generation
    ↓
[Speech Decoder] → Tokens → Mel spectrogram
    ↓
[HiFiGAN Vocoder] → Mel → Audio waveform
    ↓ WebSocket
AI Response (Browser)
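
Read as code, this flow is a short loop per audio chunk. The sketch below is illustrative only: the component names and methods (vad.is_speech, tokenizer.encode, s2s_model.generate, and so on) are placeholders, not the repository's actual API.

# Illustrative sketch of the inference pipeline above (hypothetical names).
import numpy as np

def handle_chunk(audio_chunk: np.ndarray, vad, tokenizer, s2s_model, decoder, vocoder):
    """Turn one 24 kHz mono chunk of user audio into AI reply audio."""
    if not vad.is_speech(audio_chunk):           # [VAD]: gate on speech activity
        return None
    in_tokens = tokenizer.encode(audio_chunk)    # audio -> discrete tokens
    out_tokens = s2s_model.generate(in_tokens)   # token -> token generation
    mel = decoder.decode(out_tokens)             # tokens -> mel spectrogram
    return vocoder(mel)                          # mel -> audio waveform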

Training Pipeline (NEW!)

Phase 1: Speech Tokenizer

  • Architecture: CNN Encoder + RVQ (8 quantizers; sketched below) + CNN Decoder
  • Dataset: LibriSpeech (100-960 hours)
  • Training Time: 8-120 GPU hours
  • Cost: $10-143 on RunPod A100
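
For intuition, residual vector quantization (RVQ) encodes each frame with a stack of codebooks, where each stage quantizes the residual error left by the previous one. A minimal PyTorch sketch, using the codebook size and dimensions from the training config below; everything else, including the omission of the straight-through gradients and commitment losses needed for real training, is a simplification, not the repository's implementation.

# Minimal RVQ sketch: 8 codebooks of 1024 entries, each quantizing the
# residual of the previous stage. Training extras (straight-through
# estimator, commitment/codebook losses) are omitted for brevity.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=512):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)]
        )

    def forward(self, x):                                  # x: (batch, frames, dim)
        residual, quantized, indices = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb.weight.unsqueeze(0))  # (batch, frames, 1024)
            idx = dists.argmin(dim=-1)                     # nearest codebook entry
            q = cb(idx)
            quantized = quantized + q                      # running reconstruction
            residual = residual - q                        # error passed to next stage
            indices.append(idx)
        return quantized, torch.stack(indices, dim=-1)     # tokens: (batch, frames, 8)

Each frame thus compresses to 8 integer indices, which is the token stream the S2S transformer consumes and produces.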

Phase 2: Hybrid S2S Model (Coming Soon)

  • Architecture: Transformer with dual-stream processing
  • Dataset: Conversational pairs (synthetic or real)
  • Training Time: 100-400 GPU hours
  • Cost: $119-476

Phase 3: Emotional Control (Coming Soon)

  • Fine-tuning on emotional speech datasets
  • Supports laughter, sighs, whispers, emotions
  • Training Time: 50-150 GPU hours
  • Cost: $60-179

📊 Training Costs

Configuration  Dataset           GPU Hours  Cost  Quality
Budget         LibriSpeech 100h  10         $12   Testing
Standard       LibriSpeech 360h  40         $48   Production
Premium        LibriSpeech 960h  120        $143  Excellent

RunPod A100 80GB pricing: $1.19/hr (Secure Cloud), $0.89/hr (Community Cloud)

Full Training Pipeline (Budget to Production):

  • Speech Tokenizer: $12-143
  • S2S Model: $119-476
  • Emotion Fine-tuning: $60-179
  • Total: $191-798

Compare to:

  • OpenAI GPT-4o Realtime API: $5-10/hr usage, no model ownership
  • Luna AI (Pixa): API-only, pricing TBD, no self-hosting
  • Your model: One-time cost, full ownership, unlimited usage
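
A quick break-even check on those numbers: dividing the one-time pipeline cost by the quoted API rate gives the usage hours at which self-hosting pays off (inference hosting excluded).

# Break-even vs. a usage-priced API, using the figures quoted above.
for train_cost in (191, 798):                    # full pipeline, budget to premium
    for api_rate in (5, 10):                     # $/hr for a realtime API
        print(f"${train_cost} training ~= {train_cost / api_rate:.0f} hours "
              f"of API usage at ${api_rate}/hr")

So the full pipeline pays for itself after roughly 19-160 hours of equivalent API usage.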

πŸ› οΈ API Endpoints

Inference Server

  • GET /health - Health check and model status
  • GET /api/stats - Latency statistics and performance metrics
  • WebSocket /ws/stream - Real-time audio streaming
  • GET / - Built-in web interface with audio controls

Example Usage

// Browser WebSocket client
const ws = new WebSocket('wss://your-pod.runpod.net:8000/ws/stream');

// Send audio chunks (24kHz, mono, PCM)
ws.send(audioChunkFloat32Array.buffer);

// Receive AI audio response
ws.onmessage = (event) => {
    const audioData = new Float32Array(event.data);
    playAudio(audioData);
};
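
The same endpoint can also be exercised from Python, which is convenient for testing a pod from a terminal. A sketch using the third-party websockets and numpy packages, assuming (as the browser example above suggests) that the server exchanges raw float32 PCM as binary messages:

# Python test client for /ws/stream (sketch; pip install websockets numpy).
import asyncio
import numpy as np
import websockets

async def main():
    async with websockets.connect("ws://localhost:8000/ws/stream") as ws:
        chunk = np.zeros(24000, dtype=np.float32)   # 1 s of silence, 24 kHz mono
        await ws.send(chunk.tobytes())              # send one audio chunk
        reply = await ws.recv()                     # binary audio from the server
        audio = np.frombuffer(reply, dtype=np.float32)
        print(f"received {audio.size} samples")

asyncio.run(main())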

💻 Development

Project Structure

Testing-S2S/
├── src/
│   ├── models/
│   │   ├── speech_tokenizer.py            # Inference-only tokenizer
│   │   ├── speech_tokenizer_trainable.py  # Trainable with RVQ (NEW!)
│   │   ├── hybrid_s2s.py                  # S2S transformer model
│   │   ├── hifigan_public.py              # MIT-licensed vocoder
│   │   └── streaming_processor.py         # Real-time inference
│   └── server.py                          # FastAPI server
├── training/                              # Training scripts (NEW!)
│   ├── train_tokenizer.py
│   └── configs/
│       └── tokenizer_config.yaml
├── scripts/
│   └── test_tokenizer.py                  # Quality testing (NEW!)
├── requirements.txt
├── requirements-training.txt              # Training dependencies (NEW!)
├── TRAINING_GUIDE.md                      # Complete training guide (NEW!)
└── README.md

Local Development

# Clone and setup
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -e .

# Run tests
python -m pytest tests/

# Test tokenizer quality
python scripts/test_tokenizer.py \
    --checkpoint checkpoints/tokenizer_best.pt \
    --input test_audio.wav \
    --output reconstructed.wav

# Start development server
REPLY_MODE=turn python src/server.py

βš™οΈ Configuration

Environment Variables

# Inference
export REPLY_MODE=turn                    # 'stream' or 'turn'
export MODEL_CACHE_DIR=/workspace/cache
export CUDA_VISIBLE_DEVICES=0

# Training
export WANDB_API_KEY=your_key             # Optional: WandB logging
export TORCH_HOME=/workspace/cache/torch

Training Config

Edit training/configs/tokenizer_config.yaml:

model:
  sample_rate: 24000        # Match Luna AI
  codebook_size: 1024
  hidden_dim: 512
  num_quantizers: 8

data:
  data_dir: "/workspace/data"
  train_split: "train-clean-100"
  
training:
  epochs: 100
  batch_size: 16            # Adjust for GPU memory
  learning_rate: 1e-4
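
These values also pin down the token bitrate: each of the 8 quantizers emits one index per frame, and a 1024-entry codebook costs log2(1024) = 10 bits per index. The frame rate depends on the encoder's downsampling factor, which is set in code rather than in this config; the 320x hop below is an assumption for illustration.

# Bitrate implied by the config above (320x downsampling is hypothetical).
import math

sample_rate, num_quantizers, codebook_size, hop = 24_000, 8, 1024, 320

frames_per_sec = sample_rate / hop                            # 75 frames/s
bits_per_frame = num_quantizers * math.log2(codebook_size)    # 8 * 10 = 80 bits
print(f"{frames_per_sec * bits_per_frame / 1000:.1f} kbps")   # 6.0 kbps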

📚 Documentation

  • TRAINING_GUIDE.md - Complete training guide
  • VOCODER_NOTICE.md - HiFiGAN vocoder license notice

🎯 Performance Benchmarks

Inference Latency (Current System)

Mode    Mean Latency  P95 Latency  Notes
Stream  ~400ms        ~600ms       Continuous generation
Turn    ~350ms        ~500ms       Wait for speech end

Training Benchmarks (RunPod A100 80GB)

Component  Dataset Size  Batch Size  Time/Epoch  Total Time
Tokenizer  100h          16          5 min       8h (100 epochs)
Tokenizer  360h          16          18 min      30h (100 epochs)
Tokenizer  960h          16          48 min      80h (100 epochs)
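
Time per epoch scales linearly with dataset size here (about 5 minutes per 100 hours of audio), so total cost is easy to estimate up front:

# Rough cost estimator from the linear scaling above (~5 min/epoch per 100 h).
def training_cost(dataset_hours, epochs=100, price_per_hour=1.19):
    gpu_hours = (5 * dataset_hours / 100) * epochs / 60
    return gpu_hours, gpu_hours * price_per_hour

for h in (100, 360, 960):
    gpu_h, usd = training_cost(h)
    print(f"{h}h dataset: ~{gpu_h:.0f} GPU hours, ~${usd:.0f}")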

🚀 Deployment

RunPod (Recommended)

# 1. Launch RunPod A100 pod
# 2. Clone repo and install
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
pip install -r requirements.txt

# 3. Download trained checkpoints (or train your own)
mkdir -p checkpoints
wget -O checkpoints/tokenizer_best.pt YOUR_CHECKPOINT_URL

# 4. Start production server
uvicorn src.server:app --host 0.0.0.0 --port 8000 --workers 1

# 5. Access via RunPod proxy
# https://<pod-id>.direct.runpod.net:8000

Inference Costs:

  • A40 48GB: $0.79/hr (~$570/month for 24/7)
  • A100 80GB: $1.19/hr (~$860/month for 24/7)
  • Community Cloud: ~25% cheaper (e.g., A100 at $0.89/hr vs $1.19/hr)

🤝 Contributing

Contributions welcome! Areas of focus:

  1. Hybrid S2S Model Training - Complete Phase 2 implementation
  2. Emotional Speech - Add laughter, sighs, breathing sounds
  3. Indian Languages - Multi-lingual training pipelines
  4. Latency Optimization - Achieve <500ms end-to-end
  5. Quality Improvements - Better audio reconstruction

πŸ“ License

Code: MIT License - Full commercial use allowed

HiFiGAN Vocoder: MIT License (see VOCODER_NOTICE.md)

Trained Models: You own 100% of models you train

Datasets:

  • LibriSpeech: CC BY 4.0
  • Common Voice: CC0 (Public Domain)

🌐 Resources

Official Links:

  • GitHub: https://github.com/devasphn/Testing-S2S

Inspiration:

  • Luna AI (Pixa): First Indian speech-to-speech model
  • Moshi (Kyutai Labs): Real-time duplex voice AI
  • GLM-4-Voice (Tsinghua): Intelligent speech chatbot

Datasets:

  • LibriSpeech: http://www.openslr.org/resources/12/
  • Common Voice: https://commonvoice.mozilla.org/

❓ FAQ

Why build your own instead of using APIs?

  1. Ownership: Keep all model weights, no vendor lock-in
  2. Privacy: Audio data stays on your infrastructure
  3. Cost: One-time training vs ongoing API fees
  4. Customization: Full control over behavior, languages, voices
  5. Independence: No rate limits, no service outages

How does this compare to Luna AI?

Luna AI (Pixa) is closed-source API-only. This project:

  • ✓ Open-source, self-hostable
  • ✓ Similar architecture and capabilities
  • ✓ Train on your own data
  • ✓ $300-1,700 one-time cost vs ongoing API fees
  • ✗ Requires ML expertise to train

Can I use this commercially?

Yes! MIT license allows full commercial use. You own:

  • All code modifications
  • All models you train
  • All deployment infrastructure

What hardware do I need?

Training: NVIDIA A100 80GB (RunPod: $0.89-1.19/hr)

Inference: NVIDIA A40 48GB or better ($0.49-0.79/hr)

Local Dev: Any GPU with 8GB+ VRAM (GTX 1080+)
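
To check whether a local GPU clears the 8GB bar before installing everything:

# Quick VRAM check for local development (requires torch with CUDA).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    ok = "OK" if vram_gb >= 8 else "below the 8 GB recommendation"
    print(f"{props.name}: {vram_gb:.1f} GB VRAM ({ok})")
else:
    print("No CUDA GPU detected")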


Built with ❤️ by developers who believe in open, independent AI.

Version: 1.0 (Training Infrastructure Release)

Last Updated: November 17, 2025


📣 What's Next?

Phase 1 (Current): ✓ Speech Tokenizer Training

Phase 2 (Coming Soon):

  • Hybrid S2S Model training script
  • Conversational dataset generation
  • Checkpoint integration with inference server

Phase 3 (Future):

  • Emotional speech fine-tuning
  • Indian language support
  • Real-time streaming optimizations
  • <500ms end-to-end latency

Star this repo ⭐ to follow development!
