src/oss/langchain/voice-agent.mdx
Lines changed: 75 additions & 75 deletions
```diff
@@ -75,9 +75,9 @@ flowchart LR

 This guide demonstrates the **sandwich architecture** to balance performance, controllability, and access to modern model capabilities. The sandwich can achieve sub-700ms latency with some STT and TTS providers while maintaining control over modular components.

-### Demo application overview
+### Demo Application Overview

-We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
+We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [Cartesia](https://cartesia.ai/) for TTS (although adapters can be built for most providers).

 An end-to-end reference application is available in the [voice-sandwich-demo](https://github.com/langchain-ai/voice-sandwich-demo) repository. We will walk through that application here.
```
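The overview above notes that adapters can be built for most providers. A swap such as this commit's change from ElevenLabs to Cartesia is easiest when the rest of the pipeline depends only on an adapter interface.

The TypeScript below is a minimal sketch of that idea, not code from voice-sandwich-demo. The names `TextToSpeechAdapter`, `FakeProviderA`, `FakeProviderB`, and `countAudioBytes` are hypothetical. The fake providers turn characters into bytes instead of calling any real TTS API.

```typescript
// Hypothetical adapter interface: pipeline code depends only on this shape.
interface TextToSpeechAdapter {
  // Consume streamed text chunks and produce streamed "audio" chunks.
  synthesize(text: AsyncIterable<string>): AsyncIterable<Uint8Array>;
}

// Fake provider A: one audio chunk per text chunk (bytes are char codes, not real audio).
class FakeProviderA implements TextToSpeechAdapter {
  async *synthesize(text: AsyncIterable<string>): AsyncIterable<Uint8Array> {
    for await (const chunk of text) {
      yield Uint8Array.from(chunk, (ch) => ch.charCodeAt(0) & 0xff);
    }
  }
}

// Fake provider B: one audio chunk per character, a different streaming granularity.
class FakeProviderB implements TextToSpeechAdapter {
  async *synthesize(text: AsyncIterable<string>): AsyncIterable<Uint8Array> {
    for await (const chunk of text) {
      for (const ch of chunk) yield Uint8Array.of(ch.charCodeAt(0) & 0xff);
    }
  }
}

// Pipeline-side code written against the interface: swapping providers needs no change here.
async function countAudioBytes(tts: TextToSpeechAdapter, chunks: string[]): Promise<number> {
  async function* textStream(): AsyncGenerator<string> {
    for (const chunk of chunks) yield chunk;
  }
  let total = 0;
  for await (const audio of tts.synthesize(textStream())) total += audio.length;
  return total;
}
```

With the chunks `"Hello, "` and `"world"`, both fakes produce 12 bytes in total. Provider A emits two audio chunks and provider B emits twelve, and `countAudioBytes` is unaffected by that difference.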
```diff
@@ -104,7 +104,7 @@ The demo implements a streaming pipeline where each stage processes data asynchr
 - Orchestrates the three-step pipeline:
   - [Speech-to-text (STT)](#1-speech-to-text): Forwards audio to the STT provider (e.g., AssemblyAI), receives transcript events
   - [Agent](#2-langchain-agent): Processes transcripts with LangChain agent, streams response tokens
-  - [Text-to-speech (TTS)](#3-text-to-speech): Sends agent responses to the TTS provider (e.g., ElevenLabs), receives audio chunks
+  - [Text-to-speech (TTS)](#3-text-to-speech): Sends agent responses to the TTS provider (e.g., Cartesia), receives audio chunks

 - Returns synthesized audio to the client for playback
```
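The list above describes a backend that chains STT, the agent, and TTS. One way to sketch that chaining is to make each stage an async generator that forwards upstream events and appends its own.

The sketch assumes hypothetical event shapes (`VoiceEvent`). It also uses fake stages (`fakeStt`, `fakeAgent`, `fakeTts`) as stand-ins for AssemblyAI, the LangChain agent, and Cartesia:

```typescript
// Hypothetical event shapes; the demo's actual event types may differ.
type VoiceEvent =
  | { type: "stt_transcript"; text: string }
  | { type: "agent_chunk"; text: string }
  | { type: "tts_audio"; audio: string };

// Stage 1, fake STT: treats each incoming "audio" string as a finished transcript.
async function* fakeStt(audio: AsyncIterable<string>): AsyncGenerator<VoiceEvent> {
  for await (const chunk of audio) yield { type: "stt_transcript", text: chunk };
}

// Stage 2, fake agent: forwards every event and streams a two-token reply per transcript.
async function* fakeAgent(events: AsyncIterable<VoiceEvent>): AsyncGenerator<VoiceEvent> {
  for await (const event of events) {
    yield event;
    if (event.type === "stt_transcript") {
      yield { type: "agent_chunk", text: "Adding " };
      yield { type: "agent_chunk", text: event.text };
    }
  }
}

// Stage 3, fake TTS: forwards every event and emits placeholder audio per text chunk.
async function* fakeTts(events: AsyncIterable<VoiceEvent>): AsyncGenerator<VoiceEvent> {
  for await (const event of events) {
    yield event;
    if (event.type === "agent_chunk") {
      yield { type: "tts_audio", audio: `<audio:${event.text}>` };
    }
  }
}

// Orchestration: each stage consumes the previous stage's output.
function runPipeline(audio: AsyncIterable<string>): AsyncGenerator<VoiceEvent> {
  return fakeTts(fakeAgent(fakeStt(audio)));
}
```

Feeding one transcript through `runPipeline` yields `stt_transcript`, `agent_chunk`, `tts_audio`, `agent_chunk`, `tts_audio`, in that order. This pull-based chain handles one event at a time. The demo's TTS stage instead runs a producer and a consumer concurrently, presumably because a streaming provider returns audio on its own schedule rather than once per text chunk.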
```diff
@@ -478,15 +478,15 @@ The TTS stage synthesizes agent response text into audio and streams it back to
 - **Upstream processing**: Passes through all events and sends agent text chunks to the TTS provider
 - **Audio reception**: Receives synthesized audio chunks from the TTS provider

-**Streaming TTS**: Some providers (such as [ElevenLabs](https://elevenlabs.io/)) begin synthesizing audio as soon as it receives text, enabling audio playback to start before the agent finishes generating its complete response.
+**Streaming TTS**: Some providers (such as [Cartesia](https://cartesia.ai/)) begin synthesizing audio as soon as they receive text, enabling audio playback to start before the agent finishes generating its complete response.

 **Event Passthrough**: All upstream events flow through unchanged, allowing the client or other observers to track the full pipeline state.
-  // Producer: read upstream events and send text to ElevenLabs
+  // Producer: read upstream events and send text to Cartesia
   const producer = (async () => {
     try {
       for await (const event of eventStream) {
```
```diff
@@ -547,7 +547,7 @@ async function* ttsStream(
     }
   })();

-  // Consumer: receive audio from ElevenLabs
+  // Consumer: receive audio from Cartesia
   const consumer = (async () => {
     for await (const event of tts.receiveEvents()) {
       passthrough.push(event);
```
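The producer and consumer above run side by side, and both feed a single `passthrough` stream. The sketch below reproduces that shape under these assumptions:

- `PipelineEvent` is an invented event type.
- `AsyncQueue` is a hypothetical stand-in for whatever `passthrough` is in the demo.
- `FakeTtsConnection` replaces the Cartesia client with one that returns placeholder audio after a short delay.

```typescript
// Hypothetical event type for this sketch only.
type PipelineEvent =
  | { type: "agent_chunk"; text: string }
  | { type: "agent_end" }
  | { type: "tts_audio"; audio: string };

// Minimal unbounded single-consumer async queue: push() from any task, drain with for-await.
class AsyncQueue<T> implements AsyncIterable<T> {
  private items: T[] = [];
  private waiters: ((result: IteratorResult<T>) => void)[] = [];
  private closed = false;

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: item, done: false });
    else this.items.push(item);
  }

  close(): void {
    this.closed = true;
    for (const waiter of this.waiters.splice(0)) waiter({ value: undefined, done: true });
  }

  async *[Symbol.asyncIterator](): AsyncIterator<T> {
    while (true) {
      if (this.items.length > 0) {
        yield this.items.shift()!;
        continue;
      }
      if (this.closed) return;
      const next = await new Promise<IteratorResult<T>>((resolve) => this.waiters.push(resolve));
      if (next.done) return;
      yield next.value;
    }
  }
}

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the provider client (not Cartesia's API): text goes in
// through send(), placeholder audio comes back later through receiveEvents().
class FakeTtsConnection {
  private audio = new AsyncQueue<string>();
  private work: Promise<void> = Promise.resolve();

  send(text: string): void {
    // Simulated latency; replies are chained so they keep the order text was sent in.
    this.work = this.work.then(() => delay(5)).then(() => this.audio.push(`<audio:${text}>`));
  }

  finish(): void {
    // Close the audio stream only after every pending reply has been pushed.
    this.work = this.work.then(() => this.audio.close());
  }

  receiveEvents(): AsyncIterable<string> {
    return this.audio;
  }
}

// TTS stage: producer and consumer run concurrently and merge into one stream.
async function* ttsStage(upstream: AsyncIterable<PipelineEvent>): AsyncGenerator<PipelineEvent> {
  const tts = new FakeTtsConnection();
  const passthrough = new AsyncQueue<PipelineEvent>();
  let running = 2;
  const finished = () => {
    if (--running === 0) passthrough.close();
  };

  // Producer: forward every upstream event and send agent text to the provider.
  const producer = (async () => {
    try {
      for await (const event of upstream) {
        passthrough.push(event);
        if (event.type === "agent_chunk") tts.send(event.text);
      }
    } finally {
      tts.finish(); // no more text is coming
      finished();
    }
  })();

  // Consumer: forward audio as soon as the provider returns it.
  const consumer = (async () => {
    try {
      for await (const audio of tts.receiveEvents()) passthrough.push({ type: "tts_audio", audio });
    } finally {
      finished();
    }
  })();

  // allSettled attaches handlers immediately, so a failing loop is not an unhandled rejection.
  const settled = Promise.allSettled([producer, consumer]);
  yield* passthrough;
  for (const result of await settled) {
    if (result.status === "rejected") throw result.reason;
  }
}
```

The stage ends only after both loops finish. The producer finishes when upstream is exhausted, and then calls `finish()`. The consumer finishes when the fake provider closes its audio stream. Upstream events appear in the output immediately, while audio follows a few milliseconds later. An error in either loop is rethrown once the merged stream has drained.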
````diff
@@ -564,81 +564,89 @@ async function* ttsStream(
 ```
 :::

-The application implements an ElevenLabs client to manage the WebSocket connection and audio streaming. See below for implementations; similar adapters can be constructed for other TTS providers.
+The application implements a Cartesia client to manage the WebSocket connection and audio streaming. See below for implementations; similar adapters can be constructed for other TTS providers.
````
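The TTS client described above owns the WebSocket connection. One detail such a client has to handle is that a WebSocket cannot send while it is still connecting (the browser `WebSocket.send()` throws in that state), so early text must be buffered.

The sketch below is not Cartesia's protocol, and it makes these assumptions:

- `SocketLike` is a hypothetical interface loosely modeled on the browser WebSocket event handlers.
- The JSON message `{ text }` is an invented wire format.
- `FakeSocket` lets the client run without a network.

```typescript
// Hypothetical socket surface, loosely modeled on the browser WebSocket API.
interface SocketLike {
  send(data: string): void;
  onopen: (() => void) | null;
  onmessage: ((event: { data: string }) => void) | null;
  onclose: (() => void) | null;
}

// Hypothetical TTS client: buffers outgoing text until the socket opens and exposes
// incoming audio messages as an async iterable that ends when the socket closes.
class StreamingTtsClient {
  private socket: SocketLike;
  private isOpen = false;
  private isClosed = false;
  private outbox: string[] = []; // messages queued while connecting
  private inbox: string[] = []; // audio messages not yet consumed
  private wake: (() => void) | null = null; // resolver for a waiting receiveEvents()

  constructor(socket: SocketLike) {
    this.socket = socket;
    socket.onopen = () => {
      this.isOpen = true;
      // Flush everything queued while connecting, in order.
      for (const message of this.outbox.splice(0)) socket.send(message);
    };
    socket.onmessage = (event) => {
      this.inbox.push(event.data);
      this.notify();
    };
    socket.onclose = () => {
      this.isClosed = true;
      this.notify();
    };
  }

  private notify(): void {
    const wake = this.wake;
    this.wake = null;
    wake?.();
  }

  sendText(text: string): void {
    // Invented wire format; a real provider defines its own message schema.
    const message = JSON.stringify({ text });
    if (this.isOpen) this.socket.send(message);
    else this.outbox.push(message);
  }

  async *receiveEvents(): AsyncGenerator<string> {
    while (true) {
      if (this.inbox.length > 0) {
        yield this.inbox.shift()!;
        continue;
      }
      if (this.isClosed) return;
      await new Promise<void>((resolve) => {
        this.wake = resolve;
      });
    }
  }
}

// In-memory fake socket for trying the client without a network.
class FakeSocket implements SocketLike {
  onopen: (() => void) | null = null;
  onmessage: ((event: { data: string }) => void) | null = null;
  onclose: (() => void) | null = null;
  sent: string[] = [];
  send(data: string): void {
    this.sent.push(data);
  }
}
```

In use, text passed to `sendText` before `onopen` fires is sent in order once the socket opens. `receiveEvents()` yields each incoming message until `onclose` fires.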