Commit 180428a

chore(oss/learn): swap to cartesia TTS model (#1799)

1 parent c8d0a8a
File tree: 1 file changed (+75 −75 lines)

src/oss/langchain/voice-agent.mdx

Lines changed: 75 additions & 75 deletions
@@ -75,9 +75,9 @@ flowchart LR
 
 This guide demonstrates the **sandwich architecture** to balance performance, controllability, and access to modern model capabilities. The sandwich can achieve sub-700ms latency with some STT and TTS providers while maintaining control over modular components.
 
-### Demo application overview
+### Demo Application Overview
 
-We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
+We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [Cartesia](https://cartesia.ai/) for TTS (although adapters can be built for most providers).
 
 An end-to-end reference application is available in the [voice-sandwich-demo](https://github.com/langchain-ai/voice-sandwich-demo) repository. We will walk through that application here.
 
@@ -104,7 +104,7 @@ The demo implements a streaming pipeline where each stage processes data asynchr
 - Orchestrates the three-step pipeline:
   - [Speech-to-text (STT)](#1-speech-to-text): Forwards audio to the STT provider (e.g., AssemblyAI), receives transcript events
   - [Agent](#2-langchain-agent): Processes transcripts with LangChain agent, streams response tokens
-  - [Text-to-speech (TTS)](#3-text-to-speech): Sends agent responses to the TTS provider (e.g., ElevenLabs), receives audio chunks
+  - [Text-to-speech (TTS)](#3-text-to-speech): Sends agent responses to the TTS provider (e.g., Cartesia), receives audio chunks
 
 - Returns synthesized audio to the client for playback
 
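
The three pipeline stages above compose as chained async transformers over a single event stream. A rough, self-contained sketch of that composition (the stage functions and dict-based events here are illustrative stand-ins, not the demo's actual types):

```python
import asyncio
from typing import AsyncIterator

Event = dict  # stand-in for the demo's VoiceAgentEvent


async def stt_stream(audio: AsyncIterator[bytes]) -> AsyncIterator[Event]:
    """Stand-in STT stage: turns audio chunks into transcript events."""
    async for chunk in audio:
        yield {"type": "stt_transcript", "text": chunk.decode()}


async def agent_stream(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    """Stand-in agent stage: passes events through and emits response tokens."""
    async for event in events:
        yield event
        if event["type"] == "stt_transcript":
            yield {"type": "agent_chunk", "text": f"echo: {event['text']}"}


async def tts_stream(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    """Stand-in TTS stage: passes events through and emits audio chunks."""
    async for event in events:
        yield event
        if event["type"] == "agent_chunk":
            yield {"type": "tts_chunk", "audio": event["text"].encode()}


async def main() -> list:
    async def mic() -> AsyncIterator[bytes]:
        yield b"hello"

    # Composition mirrors the pipeline: STT -> agent -> TTS.
    pipeline = tts_stream(agent_stream(stt_stream(mic())))
    return [event async for event in pipeline]


events = asyncio.run(main())
print([e["type"] for e in events])
```

Each stage yields upstream events unchanged and appends its own, which is the same passthrough pattern the real `tts_stream` uses below.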
@@ -478,15 +478,15 @@ The TTS stage synthesizes agent response text into audio and streams it back to
 - **Upstream processing**: Passes through all events and sends agent text chunks to the TTS provider
 - **Audio reception**: Receives synthesized audio chunks from the TTS provider
 
-**Streaming TTS**: Some providers (such as [ElevenLabs](https://elevenlabs.io/)) begin synthesizing audio as soon as it receives text, enabling audio playback to start before the agent finishes generating its complete response.
+**Streaming TTS**: Some providers (such as [Cartesia](https://cartesia.ai/)) begin synthesizing audio as soon as they receive text, enabling audio playback to start before the agent finishes generating its complete response.
 
 **Event Passthrough**: All upstream events flow through unchanged, allowing the client or other observers to track the full pipeline state.
 
 ### Implementation
 
 :::python
 ```python
-from elevenlabs_tts import ElevenLabsTTS
+from cartesia_tts import CartesiaTTS
 from utils import merge_async_iters
 
 async def tts_stream(
@@ -496,17 +496,17 @@ async def tts_stream(
     Transform stream: Voice Events → Voice Events (with Audio)
 
     Merges two concurrent streams:
-    1. process_upstream(): passes through events and sends text to ElevenLabs
-    2. tts.receive_events(): yields audio chunks from ElevenLabs
+    1. process_upstream(): passes through events and sends text to Cartesia
+    2. tts.receive_events(): yields audio chunks from Cartesia
     """
-    tts = ElevenLabsTTS()
+    tts = CartesiaTTS()
 
     async def process_upstream() -> AsyncIterator[VoiceAgentEvent]:
-        """Process upstream events and send agent text to ElevenLabs."""
+        """Process upstream events and send agent text to Cartesia."""
         async for event in event_stream:
             # Pass through all events
             yield event
-            # Send agent text to ElevenLabs for synthesis
+            # Send agent text to Cartesia for synthesis
            if event.type == "agent_chunk":
                 await tts.send_text(event.text)
 
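
The `merge_async_iters` helper imported above is not shown in the diff; a minimal sketch of what such a utility might look like (the demo's real implementation may handle cancellation and errors differently):

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def merge_async_iters(*iters: AsyncIterator[T]) -> AsyncIterator[T]:
    """Yield items from several async iterators as each produces them.

    A sketch of the helper the demo imports from utils, built on a shared
    queue with one drain task per source iterator.
    """
    queue: asyncio.Queue = asyncio.Queue()
    DONE = object()  # sentinel marking one source as exhausted

    async def drain(it: AsyncIterator[T]) -> None:
        try:
            async for item in it:
                await queue.put(item)
        finally:
            await queue.put(DONE)

    tasks = [asyncio.create_task(drain(it)) for it in iters]
    finished = 0
    try:
        while finished < len(iters):
            item = await queue.get()
            if item is DONE:
                finished += 1
            else:
                yield item
    finally:
        for task in tasks:
            task.cancel()


async def demo() -> list:
    async def gen(start: int) -> AsyncIterator[int]:
        for i in range(start, start + 2):
            yield i

    return [x async for x in merge_async_iters(gen(0), gen(10))]


merged = sorted(asyncio.run(demo()))
print(merged)
```

This is what lets `tts_stream` interleave passthrough events with audio chunks arriving concurrently from the provider.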
@@ -525,15 +525,15 @@ async def tts_stream(
 
 :::js
 ```typescript
-import { ElevenLabsTTS } from "./elevenlabs";
+import { CartesiaTTS } from "./cartesia";
 
 async function* ttsStream(
   eventStream: AsyncIterable<VoiceAgentEvent>
 ): AsyncGenerator<VoiceAgentEvent> {
-  const tts = new ElevenLabsTTS();
+  const tts = new CartesiaTTS();
   const passthrough = writableIterator<VoiceAgentEvent>();
 
-  // Producer: read upstream events and send text to ElevenLabs
+  // Producer: read upstream events and send text to Cartesia
   const producer = (async () => {
     try {
       for await (const event of eventStream) {
@@ -547,7 +547,7 @@ async function* ttsStream(
     }
   })();
 
-  // Consumer: receive audio from ElevenLabs
+  // Consumer: receive audio from Cartesia
   const consumer = (async () => {
     for await (const event of tts.receiveEvents()) {
       passthrough.push(event);
@@ -564,81 +564,89 @@ async function* ttsStream(
 ```
 :::
 
-The application implements an ElevenLabs client to manage the WebSocket connection and audio streaming. See below for implementations; similar adapters can be constructed for other TTS providers.
+The application implements a Cartesia client to manage the WebSocket connection and audio streaming. See below for implementations; similar adapters can be constructed for other TTS providers.
 
-<Accordion title="ElevenLabs Client">
+<Accordion title="Cartesia Client">
 
 :::python
 ```python
 import base64
 import json
+import os
+import time
+from typing import Optional
+
 import websockets
 
-class ElevenLabsTTS:
+class CartesiaTTS:
     def __init__(
         self,
-        api_key: str | None = None,
-        voice_id: str = "21m00Tcm4TlvDq8ikWAM",
-        model_id: str = "eleven_multilingual_v2",
-        output_format: str = "pcm_16000",
+        api_key: Optional[str] = None,
+        voice_id: str = "f6ff7c0c-e396-40a9-a70b-f7607edb6937",
+        model_id: str = "sonic-3",
+        sample_rate: int = 24000,
+        encoding: str = "pcm_s16le",
+        language: str = "en",
+        cartesia_version: str = "2024-06-10",  # Cartesia API version string
     ):
-        self.api_key = api_key or os.getenv("ELEVENLABS_API_KEY")
+        self.api_key = api_key or os.getenv("CARTESIA_API_KEY")
         self.voice_id = voice_id
         self.model_id = model_id
-        self.output_format = output_format
+        self.sample_rate = sample_rate
+        self.encoding = encoding
+        self.language = language
+        self.cartesia_version = cartesia_version
+        self._context_counter = 0
         self._ws: WebSocketClientProtocol | None = None
 
+    def _generate_context_id(self) -> str:
+        """Generate a valid context_id for Cartesia."""
+        timestamp = int(time.time() * 1000)
+        counter = self._context_counter
+        self._context_counter += 1
+        return f"ctx_{timestamp}_{counter}"
+
     async def send_text(self, text: str | None) -> None:
-        """Send text to ElevenLabs for synthesis."""
+        """Send text to Cartesia for synthesis."""
         if not text or not text.strip():
             return
 
         ws = await self._ensure_connection()
-        payload = {"text": text, "try_trigger_generation": False}
+        payload = {
+            "model_id": self.model_id,
+            "transcript": text,
+            "voice": {
+                "mode": "id",
+                "id": self.voice_id,
+            },
+            "output_format": {
+                "container": "raw",
+                "encoding": self.encoding,
+                "sample_rate": self.sample_rate,
+            },
+            "language": self.language,
+            "context_id": self._generate_context_id(),
+        }
         await ws.send(json.dumps(payload))
 
     async def receive_events(self) -> AsyncIterator[TTSChunkEvent]:
-        """Yield audio chunks as they arrive from ElevenLabs."""
+        """Yield audio chunks as they arrive from Cartesia."""
         async for raw_message in self._ws:
             message = json.loads(raw_message)
 
             # Decode and yield audio chunks
-            if "audio" in message and message["audio"]:
-                audio_chunk = base64.b64decode(message["audio"])
+            if "data" in message and message["data"]:
+                audio_chunk = base64.b64decode(message["data"])
                 if audio_chunk:
                     yield TTSChunkEvent.create(audio_chunk)
 
-            # Break on final message
-            if message.get("isFinal"):
-                break
-
     async def _ensure_connection(self) -> WebSocketClientProtocol:
         """Establish WebSocket connection if not already connected."""
         if self._ws is None:
             url = (
-                f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream-input"
-                f"?model_id={self.model_id}&output_format={self.output_format}"
+                f"wss://api.cartesia.ai/tts/websocket"
+                f"?api_key={self.api_key}&cartesia_version={self.cartesia_version}"
             )
             self._ws = await websockets.connect(url)
 
-            # Send initial configuration message
-            bos_message = {
-                "text": " ",
-                "voice_settings": {
-                    "stability": 0.5,
-                    "similarity_boost": 0.75,
-                },
-                "xi_api_key": self.api_key,
-            }
-            await self._ws.send(json.dumps(bos_message))
-
         return self._ws
 ```
 :::
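
The decode step in `receive_events` — pulling the base64 `data` field out of each JSON message — can be exercised in isolation. The sample messages below are fabricated for illustration, not real Cartesia API output:

```python
import base64
import json


def extract_audio_chunk(raw_message: str) -> "bytes | None":
    """Decode the base64 audio payload from a Cartesia-style JSON message.

    Mirrors the decode logic in CartesiaTTS.receive_events above.
    """
    message = json.loads(raw_message)
    if "data" in message and message["data"]:
        return base64.b64decode(message["data"])
    return None


# A fabricated chunk message carrying two bytes of PCM audio.
sample = json.dumps({"data": base64.b64encode(b"\x00\x01").decode()})
print(extract_audio_chunk(sample))  # b'\x00\x01'
print(extract_audio_chunk(json.dumps({"type": "done"})))  # None
```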
 
 :::js
 ```typescript
-export class ElevenLabsTTS {
+export class CartesiaTTS {
   protected _bufferIterator = writableIterator<VoiceAgentEvent.TTSEvent>();
   protected _connectionPromise: Promise<WebSocket> | null = null;
 
@@ -654,45 +662,37 @@ export class ElevenLabsTTS {
     yield* this._bufferIterator;
   }
 
+  protected _generateContextId(): string {
+    const timestamp = Date.now();
+    const counter = this._contextCounter++;
+    return `ctx_${timestamp}_${counter}`;
+  }
+
   protected get _connection(): Promise<WebSocket> {
     if (this._connectionPromise) return this._connectionPromise;
 
     this._connectionPromise = new Promise((resolve, reject) => {
-      const url = `wss://api.elevenlabs.io/v1/text-to-speech/${this.voiceId}/stream-input?model_id=${this.modelId}&output_format=${this.outputFormat}`;
+      const params = new URLSearchParams({
+        api_key: this.apiKey,
+        cartesia_version: this.cartesiaVersion,
+      });
+      const url = `wss://api.cartesia.ai/tts/websocket?${params.toString()}`;
       const ws = new WebSocket(url);
 
       ws.on("open", () => {
-        // Send initial configuration
-        const bosMessage = {
-          text: " ",
-          voice_settings: {
-            stability: 0.5,
-            similarity_boost: 0.75,
-          },
-          xi_api_key: this.apiKey,
-        };
-        ws.send(JSON.stringify(bosMessage));
         resolve(ws);
       });
 
-      ws.on("message", (data) => {
-        const message = JSON.parse(data.toString());
-
-        // Decode and push audio chunks
-        if (message.audio) {
-          const audioChunk = Buffer.from(message.audio, "base64");
-          if (audioChunk.length > 0) {
-            this._bufferIterator.push({
-              type: "tts_chunk",
-              audio: new Uint8Array(audioChunk),
-              ts: Date.now()
-            });
-          }
-        }
-
-        // Close iterator on final message
-        if (message.isFinal) {
-          this._bufferIterator.cancel();
+      ws.on("message", (data: WebSocket.RawData) => {
+        const message: CartesiaTTSResponse = JSON.parse(data.toString());
+        if (message.data) {
+          // Cartesia sends base64-encoded audio in the `data` field
+          this._bufferIterator.push({
+            type: "tts_chunk",
+            audio: new Uint8Array(Buffer.from(message.data, "base64")),
+            ts: Date.now(),
+          });
+        } else if (message.error) {
+          throw new Error(`Cartesia error: ${message.error}`);
         }
       });
     });
 