speech is a modular service that provides text-to-speech (TTS) and speech-to-text (STT) capabilities for machines running on the Viam platform.
This module implements the Speech Service API (viam-labs:service:speech). See the documentation for that API to learn more about using it with the Viam SDKs.
On Linux, build.sh will automatically include the following system dependencies as part of the PyInstaller executable:

- python3-pyaudio
- portaudio19-dev
- alsa-tools
- alsa-utils
- flac
On macOS, build.sh will include the following dependencies using Homebrew:

```shell
brew install portaudio
```

Before configuring your speech service, you must also create a machine.
To use this module, follow these instructions to add a module from the Viam Registry and select the viam-labs:speech:speechio model from the speech module.
Navigate to the Config tab of your machine's page in the Viam app.
Click on the Services subtab and click Create service.
Select the speech type, then select the speech:speechio model.
Click Add module, then enter a name for your speech service and click Create.
On the new service panel, copy and paste the following attribute template into your service's Attributes box:
```json
{
  "speech_provider": "google|elevenlabs",
  "speech_provider_key": "<SECRET-KEY>",
  "speech_voice": "<VOICE-OPTION>",
  "completion_provider": "openai",
  "completion_model": "gpt-4o|gpt-4o-mini",
  "completion_provider_org": "<org-abc123>",
  "completion_provider_key": "<sk-mykey>",
  "completion_persona": "<PERSONA>",
  "listen": true,
  "stt_provider": "google",
  "use_vosk_vad": false,
  "listen_trigger_say": "<TRIGGER-PHRASE>",
  "listen_trigger_completion": "<COMPLETION-PHRASE>",
  "listen_trigger_command": "<COMMAND-TO-RETRIEVE-STORED-TEXT>",
  "listen_command_buffer_length": 10,
  "listen_phrase_time_limit": 5,
  "mic_device_name": "myMic",
  "cache_ahead_completions": false,
  "disable_mic": false
}
```

Note
For more information, see Configure a Machine.
The following attributes are available for the viam-labs:speech:speechio speech service:
| Name | Type | Inclusion | Description |
|---|---|---|---|
| `speech_provider` | string | Optional | The speech provider for the voice service: `"google"` or `"elevenlabs"`. Default: `"google"`. |
| `speech_provider_key` | string | Required | The secret key for the provider; only required for `elevenlabs`. Default: `""`. |
| `speech_voice` | string | Optional | If the `speech_provider` (for example, `elevenlabs`) provides voice options, you can select the voice here. Default: `"Josh"`. |
| `completion_provider` | string | Optional | `"openai"`. Other providers may be supported in the future. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"openai"`. |
| `completion_model` | string | Optional | `gpt-4o`, `gpt-4o-mini`, etc. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"gpt-4o"`. |
| `completion_provider_org` | string | Optional | Your organization for the completion provider. Default: `""`. |
| `completion_provider_key` | string | Optional | Your key for the completion provider. Default: `""`. |
| `completion_persona` | string | Optional | If set, passes "As <completion_persona> respond to '<completion_text>'" to all `completion()` requests. Default: `""`. |
| `listen` | boolean | Optional | If set to true and the robot has an available microphone device, enables listening in the background. If enabled, the service responds to the configured `listen_trigger_say`, `listen_trigger_completion`, and `listen_trigger_command` phrases, based on input audio being converted to text. If `listen` is enabled and `listen_triggers_active` is disabled, triggers occur when `listen_trigger` is called. Note that background (ambient) noise and microphone quality are important factors in the quality of the STT conversion. Currently, Google STT is leveraged. Default: `false`. |
| `stt_provider` | string | Optional | This can be set to the name of a configured speech service that provides a `to_text` command, like `stt-vosk`. Otherwise, the Google STT API will be used. Default: `"google"`. |
| `listen_phrase_time_limit` | float | Optional | The maximum number of seconds that a phrase may continue before it is stopped and the part of the phrase processed before the limit is returned. The resulting audio is the phrase cut off at the time limit. If set to None, there is no phrase time limit. Note: if you are seeing instances where phrases are not being returned for much longer than you expect, try changing this to ~5 or so. Works with both the default VAD and Vosk VAD. Default: None. |
| `listen_trigger_say` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_say` will be converted to speech and repeated back by the robot. Default: `"robot say"`. |
| `listen_trigger_completion` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_completion` will be sent to the completion provider (if configured), converted to speech, and repeated back by the robot. Default: `"hey robot"`. |
| `listen_trigger_command` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_command` will be stored in a LIFO buffer (list of strings) of size `listen_command_buffer_length` that can be retrieved via `get_commands()` from the Speech Service API, enabling programmatic voice control of the robot. Default: `"robot can you"`. |
| `listen_command_buffer_length` | integer | Optional | The buffer length for stored commands. Default: 10. |
| `mic_device_name` | string | Optional | If not set, the first available microphone device is used. If set, the service attempts to use the specifically labeled device name. Available microphone device names will be logged on module startup. Default: `""`. |
| `cache_ahead_completions` | boolean | Optional | If true, reads a second completion for the request and caches it for the next time a matching request is made. This is useful for faster completions when completion text is less variable. Default: `false`. |
| `disable_mic` | boolean | Optional | If true, no listening capabilities are configured. This must be set to true if you do not have a valid microphone attached to your system. Default: `false`. |
| `disable_audioout` | boolean | Optional | If true, no audio output capabilities are configured. This must be set to true if you do not have a valid audio output device attached to your system. Default: `false`. |
| `use_vosk_vad` | boolean | Optional | If true, uses Vosk for Voice Activity Detection (VAD) instead of the default `speech_recognition` VAD. The Vosk model (~40MB) will be downloaded automatically on first use. Default: `false`. |
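The LIFO command buffer behavior described for `listen_trigger_command` can be sketched in plain Python. This is an illustration of the documented behavior, not the module's actual implementation; `CommandBuffer` is an invented name:

```python
from collections import deque

class CommandBuffer:
    """Sketch of the command buffer: newest commands first,
    capped at listen_command_buffer_length entries."""

    def __init__(self, buffer_length: int = 10):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.buffer = deque(maxlen=buffer_length)

    def store(self, text: str) -> None:
        self.buffer.appendleft(text)  # LIFO: newest at the front

    def get_commands(self, number: int) -> list:
        # Pop up to `number` commands, newest first, removing them from the buffer
        return [self.buffer.popleft() for _ in range(min(number, len(self.buffer)))]

buf = CommandBuffer(buffer_length=3)
for heard in ["turn left", "turn right", "stop"]:
    buf.store(heard)
print(buf.get_commands(2))  # ['stop', 'turn right']
```

Because the buffer is bounded, a burst of spoken commands beyond `listen_command_buffer_length` simply drops the oldest entries rather than growing without limit.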
The following configuration sets up listening mode with Google speech-to-text, uses the ElevenLabs voice "Antoni", makes AI completions available, and uses a 'Gollum' persona for AI completions:
```json
{
  "completion_provider_org": "org-abc123",
  "completion_provider_key": "sk-mykey",
  "completion_persona": "Gollum",
  "listen": true,
  "stt_provider": "google",
  "speech_provider": "elevenlabs",
  "speech_provider_key": "keygoeshere",
  "speech_voice": "Antoni",
  "mic_device_name": "myMic"
}
```

The speech service supports two Voice Activity Detection (VAD) systems:
Default VAD:

- Uses the built-in VAD from the `speech_recognition` library
- Good for basic voice detection
- Works out of the box with no additional setup

Vosk VAD:

- Better handling of background noise
- More precise speech boundary detection
- Automatic model downloading (~40MB) on first use
- Fallback protection - automatically uses the default VAD if Vosk fails
- Phrase time limiting - respects the `listen_phrase_time_limit` parameter
To use Vosk VAD, simply set "use_vosk_vad": true in your configuration. The system will:
- Automatically download the Vosk model on first use
- Extract and verify the model
- Start Vosk VAD
- Fall back gracefully to default VAD if anything fails
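The graceful-fallback step above can be sketched as follows. The function names are hypothetical, and `load_vosk_model` is a stand-in callback for the real download/extract/verify step:

```python
import logging

def failing_loader():
    # Stand-in for a Vosk setup failure (e.g. a download error)
    raise RuntimeError("no network")

def choose_vad(use_vosk_vad: bool, load_vosk_model) -> str:
    """Pick a VAD backend, falling back gracefully to the default
    speech_recognition VAD if Vosk setup fails for any reason."""
    if not use_vosk_vad:
        return "default"
    try:
        load_vosk_model()  # download + extract + verify (~40MB on first use)
        return "vosk"
    except Exception as exc:
        logging.warning("Vosk VAD unavailable (%s); using default VAD", exc)
        return "default"

print(choose_vad(True, lambda: None))   # vosk
print(choose_vad(True, failing_loader)) # default (graceful fallback)
```

The key design point is that a Vosk failure is logged and absorbed rather than propagated, so the service keeps listening either way.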
The speech service supports fuzzy matching for wake word detection using Levenshtein distance (edit distance) via the rapidfuzz library. This improves accuracy when speech recognition produces slight variations while preventing partial-word false positives.
To enable fuzzy wake word matching, add to your configuration:
```json
{
  "listen_trigger_fuzzy_matching": true,
  "listen_trigger_fuzzy_threshold": 2
}
```

Fuzzy matching uses word-boundary matching to allow wake words to trigger even when transcribed slightly differently, while preventing false matches:
- "hey robot" will match "hey Robert" (distance = 2) ✓
- "robot say" will match "robotic say" (distance = 2) ✓
- "robot can you" will match "robot can u" (distance = 1) ✓
- "hey robot" will NOT match "they robotic" (word boundaries prevent partial-word matches) ✗
The system automatically checks alternative transcriptions from Google Speech Recognition for better accuracy. The word-boundary approach achieves 100% accuracy in testing versus 87.5% for character-level matching.
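A minimal sketch of this word-boundary approach in plain Python, using a hand-rolled edit distance in place of rapidfuzz; `fuzzy_prefix_match` is an invented helper, not the module's API:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def fuzzy_prefix_match(trigger: str, heard: str, threshold: int = 2) -> bool:
    # Compare the trigger against the same number of *whole words* at the
    # start of the transcription, so partial-word matches can't slip through.
    n = len(trigger.split())
    prefix = " ".join(heard.lower().split()[:n])
    return levenshtein(trigger.lower(), prefix) <= threshold

print(fuzzy_prefix_match("hey robot", "hey Robert say hi"))  # True  (distance 2)
print(fuzzy_prefix_match("hey robot", "they robotic arm"))   # False (distance 3)
```

Matching against whole-word prefixes is what prevents "they robotic" from triggering "hey robot": the comparison never slices a word in half, so partial-word overlap does not count toward a match.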
| Attribute | Default | Description |
|---|---|---|
| `listen_trigger_fuzzy_matching` | `false` | Enable/disable fuzzy matching |
| `listen_trigger_fuzzy_threshold` | `2` | Maximum edit distance (0-5). Lower = stricter matching |
- Threshold 1: Very strict, for short wake words
- Threshold 2-3: Recommended for most wake words (default: 2)
- Threshold 4-5: Lenient, for noisy environments
Not triggering enough? Increase the threshold: `{"listen_trigger_fuzzy_threshold": 3}`

Triggering too much? Decrease the threshold: `{"listen_trigger_fuzzy_threshold": 1}`

Not working at all? Check that fuzzy matching is enabled and that the logs don't show import errors.
This service queries the device for available microphone devices to be used for speech-to-text services.
The output from this discovery can be used to configure a viam-labs:speech:speechio service.
If an expected device doesn't appear, try following the troubleshooting steps in the module README.
The following attribute template can be used to configure this model:

```json
{}
```

The following attributes are available for this model:

| Name | Type | Inclusion | Description |
|---|---|---|---|

```json
{}
```

The speechio service requires specific ALSA configuration to work properly with USB microphones and audio output devices. This guide provides step-by-step instructions for setting up persistent ALSA configuration that survives machine reboots.
- The speechio service needs a 16000Hz sample rate for speech recognition
- USB microphones typically run at 44100Hz (some at 48000Hz)
- Rate conversion is required using the ALSA `plug` plugin
- Configuration must persist across reboots
First, identify your audio devices:
```shell
# List playback devices
aplay -l

# List capture devices
arecord -l

# Check current card assignments
cat /proc/asound/cards
```

Typical setup:
- Audio output device: Usually card 0 (speakers/DAC)
- USB Microphone: Usually card 1 (microphone input)
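If you script device discovery, the stable bracketed names in /proc/asound/cards can be extracted with a few lines of Python. This is a sketch (`card_names` is an invented helper) and `SAMPLE` mimics typical /proc/asound/cards output:

```python
import re

SAMPLE = """\
 0 [sndrpihifiberry]: HifiBerry-DAC - HifiBerry DAC
 1 [Device         ]: USB-Audio - USB PnP Sound Device
"""

def card_names(text: str) -> dict:
    # Map card number -> stable bracketed device name, usable as
    # `card <name>` entries in /etc/asound.conf.
    return {int(m.group(1)): m.group(2).strip()
            for m in re.finditer(r"^\s*(\d+)\s+\[([^\]]+)\]", text, re.M)}

print(card_names(SAMPLE))  # {0: 'sndrpihifiberry', 1: 'Device'}
```

On a real system you would read the file with `open("/proc/asound/cards").read()` instead of the sample string.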
Test if you get rate conversion warnings:
```shell
# Test recording at 16000Hz (required by speechio)
arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test_default.wav &
sleep 3
kill %1
```

Common issue: `Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz)`
First, find your device names:
```shell
# Find device names (more stable than card numbers)
cat /proc/asound/cards
```

Then create /etc/asound.conf using device names:
```shell
sudo tee /etc/asound.conf > /dev/null << 'EOF'
# ALSA Configuration for Speechio Service
# Uses device names for stability across reboots

pcm.!default {
    type asym
    playback.pcm {
        type hw
        card YourOutputDevice  # Replace with your audio output device name
        device 0
    }
    capture.pcm {
        type plug              # CRITICAL: Use plug for rate conversion
        slave.pcm {
            type hw
            card YourMicDevice # Replace with your USB microphone device name
            device 0
        }
    }
}

ctl.!default {
    type hw
    card YourOutputDevice
}
EOF
```

Verify rate conversion works without warnings:
```shell
# Test recording - should work without rate warnings
arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test_fixed.wav &
sleep 3
kill %1

# Test playback
speaker-test -D default -t wav -c 2 -l 1
```

Expected result: No rate conversion warnings
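Beyond watching for warnings, you can confirm the recorded file's actual sample rate with Python's standard wave module. A quick sanity check (`check_rate` is an invented helper; the path assumes the test recording from the commands above):

```python
import wave

def check_rate(path: str, expected: int = 16000) -> bool:
    # speechio needs 16000Hz; verify what was actually recorded.
    with wave.open(path, "rb") as w:
        return w.getframerate() == expected

# Example (after running the arecord test above):
# check_rate("/tmp/test_fixed.wav")  # True if the plug conversion worked
```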
Ensure the configuration will survive reboots:
```shell
# Check that the file is not managed by packages
dpkg -S /etc/asound.conf 2>/dev/null || echo "✓ File not managed by packages (good)"

# Verify file exists and has correct content
cat /etc/asound.conf
```

Expected logs:
- `"Will listen in background"` - Service initialized
- `"speechio heard audio"` - Audio detected when you speak
- `"speechio heard <text>"` - Speech successfully transcribed
Symptom: Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz)
Solution: Use `type plug` for the capture device in /etc/asound.conf
Symptom: No "speechio heard audio" logs when speaking
Solution:
- Verify the microphone with `arecord -l`
- Check the ALSA configuration with `arecord -D default -f S16_LE -r 16000 -c 1 -t wav /tmp/test.wav`
- Ensure /etc/asound.conf uses correct device names
Symptom: Audio setup works until machine restarts
Solution:
- Use /etc/asound.conf (not ~/.asoundrc)
- Verify the file is not managed by packages: `dpkg -S /etc/asound.conf`
- Check file permissions: `sudo chmod 644 /etc/asound.conf`
Symptom: Speechio uses wrong microphone or speaker, or stops working after reboot
Cause: Card numbers (0, 1, 2) can change between reboots based on USB detection order
Solution: Use device names instead of card numbers in /etc/asound.conf
```shell
# Find stable device names
cat /proc/asound/cards

# Example output:
# 0 [sndrpihifiberry]: HifiBerry-DAC - HifiBerry DAC
# 1 [Device         ]: USB-Audio - USB PnP Sound Device

# Use device names in config:
card sndrpihifiberry  # Instead of card 0
card Device           # Instead of card 1
```

Before deploying the speechio service:
- Audio devices detected: `aplay -l` and `arecord -l`
- Rate conversion works: `arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test.wav` (no warnings)
- ALSA config persists: /etc/asound.conf exists and is not package-managed
- Speechio logs show: "Will listen in background"
- Speaking generates: "speechio heard audio" logs
- Configuration survives reboot
If you prefer to use card numbers instead of device names, you can ensure consistent card ordering:
- Check existing ALSA modules:

```shell
cat /proc/asound/modules
```

Example output:

```
0 snd_usb_audio
2 snd_soc_meson_card_utils
3 snd_usb_audio
```

- Set the module loading order:

```shell
sudo tee /etc/modprobe.d/alsa-base.conf > /dev/null << 'EOF'
# Ensure USB audio devices load first
options snd slots=snd-usb-audio,snd_soc_meson_card_utils
EOF
```

- Use card numbers in the ALSA config:
```shell
sudo tee /etc/asound.conf > /dev/null << 'EOF'
pcm.!default {
    type asym
    playback.pcm "plughw:0,0"  # First USB device
    capture.pcm "plughw:0,0"   # Same device for capture
}
EOF
```

Note: Device names are still recommended over this approach for better stability.
- The `plug` plugin automatically handles rate conversion (44100Hz → 16000Hz)
- Device names are more stable than card numbers across reboots
- System-wide configuration (/etc/asound.conf) is required for viam-server access
- Restart viam-server after ALSA configuration changes (speechio reinitializes audio devices on service restart)
- Test thoroughly before deploying to multiple devices