speech is a modular service that provides text-to-speech (TTS) and speech-to-text (STT) capabilities for machines running on the Viam platform.
This module implements the Speech Service API (viam-labs:service:speech). See the documentation for that API to learn more about using it with the Viam SDKs.
On Linux, build.sh will automatically include the following system dependencies as part of the PyInstaller executable:

- python3-pyaudio
- portaudio19-dev
- alsa-tools
- alsa-utils
- flac
On macOS, build.sh will include the following dependencies using Homebrew:

```shell
brew install portaudio
```

Before configuring your speech service, you must also create a machine.
To use this module, follow these instructions to add a module from the Viam Registry and select the viam-labs:speech:speechio model from the speech module.
Navigate to the Config tab of your machine's page in the Viam app.
Click on the Services subtab and click Create service.
Select the speech type, then select the speech:speechio model.
Click Add module, then enter a name for your speech service and click Create.
On the new service panel, copy and paste the following attribute template into your service's Attributes box:
```json
{
  "speech_provider": "google|elevenlabs",
  "speech_provider_key": "<SECRET-KEY>",
  "speech_voice": "<VOICE-OPTION>",
  "completion_provider": "openai",
  "completion_model": "gpt-4o|gpt-4o-mini",
  "completion_provider_org": "<org-abc123>",
  "completion_provider_key": "<sk-mykey>",
  "completion_persona": "<PERSONA>",
  "listen": true,
  "stt_provider": "google",
  "use_vosk_vad": false,
  "listen_trigger_say": "<TRIGGER-PHRASE>",
  "listen_trigger_completion": "<COMPLETION-PHRASE>",
  "listen_trigger_command": "<COMMAND-TO-RETRIEVE-STORED-TEXT>",
  "listen_command_buffer_length": 10,
  "listen_phrase_time_limit": 5,
  "mic_device_name": "myMic",
  "cache_ahead_completions": false,
  "disable_mic": false
}
```

Note
For more information, see Configure a Machine.
The following attributes are available for the viam-labs:speech:speechio speech service:
| Name | Type | Inclusion | Description |
|---|---|---|---|
| `speech_provider` | string | Optional | The speech provider for the voice service: `"google"` or `"elevenlabs"`. Default: `"google"`. |
| `speech_provider_key` | string | Required | The secret key for the provider; only required for `elevenlabs`. Default: `""`. |
| `speech_voice` | string | Optional | If the `speech_provider` (for example, `elevenlabs`) provides voice options, you can select the voice here. Default: `"Josh"`. |
| `completion_provider` | string | Optional | `"openai"`. Other providers may be supported in the future. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"openai"`. |
| `completion_model` | string | Optional | `gpt-4o`, `gpt-4o-mini`, etc. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"gpt-4o"`. |
| `completion_provider_org` | string | Optional | Your organization for the completion provider. Default: `""`. |
| `completion_provider_key` | string | Optional | Your key for the completion provider. Default: `""`. |
| `completion_persona` | string | Optional | If set, passes "As <completion_persona> respond to '<completion_text>'" to all `completion()` requests. Default: `""`. |
| `listen` | boolean | Optional | If set to true and the robot has an available microphone device, enables listening in the background. If enabled, the service responds to the configured `listen_trigger_say`, `listen_trigger_completion`, and `listen_trigger_command` phrases, based on input audio being converted to text. If `listen` is enabled and `listen_triggers_active` is disabled, triggers occur when `listen_trigger` is called. Note that background (ambient) noise and microphone quality are important factors in the quality of the STT conversion. Currently, Google STT is leveraged. Default: `false`. |
| `stt_provider` | string | Optional | This can be set to the name of a configured speech service that provides a `to_text` command, like `stt-vosk`. Otherwise, the Google STT API will be used. Default: `"google"`. |
| `listen_phrase_time_limit` | float | Optional | The maximum number of seconds that a phrase may continue before it is stopped and the part of the phrase processed before the limit is returned. The resulting audio is the phrase cut off at the time limit. If set to None, there is no phrase time limit. Note: if you are seeing instances where phrases are not being returned for much longer than you expect, try changing this to ~5 or so. Works with both the default VAD and Vosk VAD. Default: None. |
| `listen_trigger_say` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_say` will be converted to speech and repeated back by the robot. Default: `"robot say"`. |
| `listen_trigger_completion` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_completion` will be sent to the completion provider (if configured), converted to speech, and repeated back by the robot. Default: `"hey robot"`. |
| `listen_trigger_command` | string | Optional | If `listen` is true, any audio converted to text that is prefixed with `listen_trigger_command` will be stored in a LIFO buffer (list of strings) of size `listen_command_buffer_length` that can be retrieved via `get_commands()` from the Speech Service API, enabling programmatic voice control of the robot. Default: `"robot can you"`. |
| `listen_command_buffer_length` | integer | Optional | The buffer length for stored commands. Default: 10. |
| `mic_device_name` | string | Optional | If not set, the first available microphone device is used. If set, the service attempts to use the specifically labeled device name. Available microphone device names will be logged on module startup. Default: `""`. |
| `cache_ahead_completions` | boolean | Optional | If true, reads a second completion for the request and caches it for the next time a matching request is made. This is useful for faster completions when completion text is less variable. Default: `false`. |
| `disable_mic` | boolean | Optional | If true, no listening capabilities are configured. This must be set to true if you do not have a valid microphone attached to your system. Default: `false`. |
| `disable_audioout` | boolean | Optional | If true, no audio output capabilities are configured. This must be set to true if you do not have a valid audio output device attached to your system. Default: `false`. |
| `use_vosk_vad` | boolean | Optional | If true, uses Vosk for Voice Activity Detection (VAD) instead of the default `speech_recognition` VAD. The Vosk model (~40MB) will be downloaded automatically on first use. Default: `false`. |
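The LIFO command buffer behavior described for `listen_trigger_command` can be sketched in plain Python. This is an illustration of the documented behavior, not the module's actual implementation; `CommandBuffer` is an invented name:

```python
from collections import deque

class CommandBuffer:
    """Sketch of the command buffer: newest commands first,
    capped at listen_command_buffer_length entries."""

    def __init__(self, buffer_length: int = 10):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.buffer = deque(maxlen=buffer_length)

    def store(self, text: str) -> None:
        self.buffer.appendleft(text)  # LIFO: newest at the front

    def get_commands(self, number: int) -> list:
        # Pop up to `number` commands, newest first, removing them from the buffer
        return [self.buffer.popleft() for _ in range(min(number, len(self.buffer)))]

buf = CommandBuffer(buffer_length=3)
for heard in ["turn left", "turn right", "stop"]:
    buf.store(heard)
print(buf.get_commands(2))  # ['stop', 'turn right']
```

Because the buffer is bounded, a burst of spoken commands beyond `listen_command_buffer_length` simply drops the oldest entries rather than growing without limit.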
The following configuration sets up listening mode with Google speech-to-text, uses the ElevenLabs voice "Antoni", makes AI completions available, and uses a 'Gollum' persona for AI completions:
```json
{
  "completion_provider_org": "org-abc123",
  "completion_provider_key": "sk-mykey",
  "completion_persona": "Gollum",
  "listen": true,
  "stt_provider": "google",
  "speech_provider": "elevenlabs",
  "speech_provider_key": "keygoeshere",
  "speech_voice": "Antoni",
  "mic_device_name": "myMic"
}
```

The speech service supports two Voice Activity Detection (VAD) systems:
Default VAD:

- Uses the built-in VAD from the `speech_recognition` library
- Good for basic voice detection
- Works out of the box with no additional setup

Vosk VAD:

- Better handling of background noise
- More precise speech boundary detection
- Automatic model downloading (~40MB) on first use
- Fallback protection - automatically uses the default VAD if Vosk fails
- Phrase time limiting - respects the `listen_phrase_time_limit` parameter
To use Vosk VAD, simply set "use_vosk_vad": true in your configuration. The system will:
- Automatically download the Vosk model on first use
- Extract and verify the model
- Start Vosk VAD
- Fall back gracefully to default VAD if anything fails
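The graceful-fallback step above can be sketched as follows. The function names are hypothetical, and `load_vosk_model` is a stand-in callback for the real download/extract/verify step:

```python
import logging

def failing_loader():
    # Stand-in for a Vosk setup failure (e.g. a download error)
    raise RuntimeError("no network")

def choose_vad(use_vosk_vad: bool, load_vosk_model) -> str:
    """Pick a VAD backend, falling back gracefully to the default
    speech_recognition VAD if Vosk setup fails for any reason."""
    if not use_vosk_vad:
        return "default"
    try:
        load_vosk_model()  # download + extract + verify (~40MB on first use)
        return "vosk"
    except Exception as exc:
        logging.warning("Vosk VAD unavailable (%s); using default VAD", exc)
        return "default"

print(choose_vad(True, lambda: None))   # vosk
print(choose_vad(True, failing_loader)) # default (graceful fallback)
```

The key design point is that a Vosk failure is logged and absorbed rather than propagated, so the service keeps listening either way.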
The speech service supports fuzzy matching for wake word detection using Levenshtein distance (edit distance) via the rapidfuzz library. This improves accuracy when speech recognition produces slight variations while preventing partial-word false positives.
To enable fuzzy wake word matching, add to your configuration:
```json
{
  "listen_trigger_fuzzy_matching": true,
  "listen_trigger_fuzzy_threshold": 2
}
```

Fuzzy matching uses word-boundary matching to allow wake words to trigger even when transcribed slightly differently, while preventing false matches:
- "hey robot" will match "hey Robert" (distance = 2) ✓
- "robot say" will match "robotic say" (distance = 2) ✓
- "robot can you" will match "robot can u" (distance = 1) ✓
- "hey robot" will NOT match "they robotic" (word boundaries prevent partial-word matches) ✗
The system automatically checks alternative transcriptions from Google Speech Recognition for better accuracy. The word-boundary approach achieves 100% accuracy in testing versus 87.5% for character-level matching.
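A minimal sketch of this word-boundary approach in plain Python, using a hand-rolled edit distance in place of rapidfuzz; `fuzzy_prefix_match` is an invented helper, not the module's API:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def fuzzy_prefix_match(trigger: str, heard: str, threshold: int = 2) -> bool:
    # Compare the trigger against the same number of *whole words* at the
    # start of the transcription, so partial-word matches can't slip through.
    n = len(trigger.split())
    prefix = " ".join(heard.lower().split()[:n])
    return levenshtein(trigger.lower(), prefix) <= threshold

print(fuzzy_prefix_match("hey robot", "hey Robert say hi"))  # True  (distance 2)
print(fuzzy_prefix_match("hey robot", "they robotic arm"))   # False (distance 3)
```

Matching against whole-word prefixes is what prevents "they robotic" from triggering "hey robot": the comparison never slices a word in half, so partial-word overlap does not count toward a match.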
| Attribute | Default | Description |
|---|---|---|
| `listen_trigger_fuzzy_matching` | `false` | Enable/disable fuzzy matching |
| `listen_trigger_fuzzy_threshold` | `2` | Maximum edit distance (0-5). Lower = stricter matching |
- Threshold 1: Very strict, for short wake words
- Threshold 2-3: Recommended for most wake words (default: 2)
- Threshold 4-5: Lenient, for noisy environments
Not triggering enough? Increase the threshold: `{"listen_trigger_fuzzy_threshold": 3}`

Triggering too much? Decrease the threshold: `{"listen_trigger_fuzzy_threshold": 1}`

Not working at all? Check that fuzzy matching is enabled and that the logs don't show import errors.
This service queries the device for available microphone devices to be used for speech-to-text services.
The output from this discovery can be used to configure a viam-labs:speech:speechio service.
If an expected device doesn't appear, try following the troubleshooting steps in the module README.
The following attribute template can be used to configure this model:

```json
{}
```

The following attributes are available for this model:

| Name | Type | Inclusion | Description |
|---|---|---|---|

```json
{}
```

The speechio service requires specific ALSA configuration to work properly with USB microphones and audio output devices. This guide provides step-by-step instructions for setting up persistent ALSA configuration that survives machine reboots.
- The speechio service needs a 16000Hz sample rate for speech recognition
- USB microphones typically run at 44100Hz (some at 48000Hz)
- Rate conversion is required using the ALSA `plug` plugin
- Configuration must persist across reboots
First, identify your audio devices:
```shell
# List playback devices
aplay -l

# List capture devices
arecord -l

# Check current card assignments
cat /proc/asound/cards
```

Typical setup:
- Audio output device: Usually card 0 (speakers/DAC)
- USB Microphone: Usually card 1 (microphone input)
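If you script device discovery, the stable bracketed names in /proc/asound/cards can be extracted with a few lines of Python. This is a sketch (`card_names` is an invented helper) and `SAMPLE` mimics typical /proc/asound/cards output:

```python
import re

SAMPLE = """\
 0 [sndrpihifiberry]: HifiBerry-DAC - HifiBerry DAC
 1 [Device         ]: USB-Audio - USB PnP Sound Device
"""

def card_names(text: str) -> dict:
    # Map card number -> stable bracketed device name, usable as
    # `card <name>` entries in /etc/asound.conf.
    return {int(m.group(1)): m.group(2).strip()
            for m in re.finditer(r"^\s*(\d+)\s+\[([^\]]+)\]", text, re.M)}

print(card_names(SAMPLE))  # {0: 'sndrpihifiberry', 1: 'Device'}
```

On a real system you would read the file with `open("/proc/asound/cards").read()` instead of the sample string.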
Test if you get rate conversion warnings:
```shell
# Test recording at 16000Hz (required by speechio)
arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test_default.wav &
sleep 3
kill %1
```

Common issue: `Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz)`
First, find your device names:
```shell
# Find device names (more stable than card numbers)
cat /proc/asound/cards
```

Then create /etc/asound.conf using device names:
```shell
sudo tee /etc/asound.conf > /dev/null << 'EOF'
# ALSA Configuration for Speechio Service
# Uses device names for stability across reboots

pcm.!default {
    type asym
    playback.pcm {
        type hw
        card YourOutputDevice  # Replace with your audio output device name
        device 0
    }
    capture.pcm {
        type plug              # CRITICAL: Use plug for rate conversion
        slave.pcm {
            type hw
            card YourMicDevice # Replace with your USB microphone device name
            device 0
        }
    }
}

ctl.!default {
    type hw
    card YourOutputDevice
}
EOF
```

Verify rate conversion works without warnings:
```shell
# Test recording - should work without rate warnings
arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test_fixed.wav &
sleep 3
kill %1

# Test playback
speaker-test -D default -t wav -c 2 -l 1
```

Expected result: No rate conversion warnings
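Beyond watching for warnings, you can confirm the recorded file's actual sample rate with Python's standard wave module. A quick sanity check (`check_rate` is an invented helper; the path assumes the test recording from the commands above):

```python
import wave

def check_rate(path: str, expected: int = 16000) -> bool:
    # speechio needs 16000Hz; verify what was actually recorded.
    with wave.open(path, "rb") as w:
        return w.getframerate() == expected

# Example (after running the arecord test above):
# check_rate("/tmp/test_fixed.wav")  # True if the plug conversion worked
```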
Ensure the configuration will survive reboots:
```shell
# Check that the file is not managed by packages
dpkg -S /etc/asound.conf 2>/dev/null || echo "✓ File not managed by packages (good)"

# Verify file exists and has correct content
cat /etc/asound.conf
```

Expected logs:
- `"Will listen in background"` - Service initialized
- `"speechio heard audio"` - Audio detected when you speak
- `"speechio heard <text>"` - Speech successfully transcribed
Symptom: Warning: rate is not accurate (requested = 16000Hz, got = 44100Hz)
Solution: Use `type plug` for the capture device in /etc/asound.conf
Symptom: No "speechio heard audio" logs when speaking
Solution:
- Verify the microphone with `arecord -l`
- Check the ALSA configuration with `arecord -D default -f S16_LE -r 16000 -c 1 -t wav /tmp/test.wav`
- Ensure /etc/asound.conf uses correct device names
Symptom: Audio setup works until machine restarts
Solution:
- Use /etc/asound.conf (not ~/.asoundrc)
- Verify the file is not managed by packages: `dpkg -S /etc/asound.conf`
- Check file permissions: `sudo chmod 644 /etc/asound.conf`
Symptom: Speechio uses wrong microphone or speaker, or stops working after reboot
Cause: Card numbers (0, 1, 2) can change between reboots based on USB detection order
Solution: Use device names instead of card numbers in /etc/asound.conf
```shell
# Find stable device names
cat /proc/asound/cards

# Example output:
# 0 [sndrpihifiberry]: HifiBerry-DAC - HifiBerry DAC
# 1 [Device         ]: USB-Audio - USB PnP Sound Device

# Use device names in config:
card sndrpihifiberry  # Instead of card 0
card Device           # Instead of card 1
```

Before deploying the speechio service:
- Audio devices detected: `aplay -l` and `arecord -l`
- Rate conversion works: `arecord -f S16_LE -r 16000 -c 1 -t wav /tmp/test.wav` (no warnings)
- ALSA config persists: /etc/asound.conf exists and is not package-managed
- Speechio logs show: "Will listen in background"
- Speaking generates: "speechio heard audio" logs
- Configuration survives reboot
If you prefer to use card numbers instead of device names, you can ensure consistent card ordering:
- Check existing ALSA modules:

```shell
cat /proc/asound/modules
```

Example output:

```
0 snd_usb_audio
2 snd_soc_meson_card_utils
3 snd_usb_audio
```

- Set the module loading order:

```shell
sudo tee /etc/modprobe.d/alsa-base.conf > /dev/null << 'EOF'
# Ensure USB audio devices load first
options snd slots=snd-usb-audio,snd_soc_meson_card_utils
EOF
```

- Use card numbers in the ALSA config:
```shell
sudo tee /etc/asound.conf > /dev/null << 'EOF'
pcm.!default {
    type asym
    playback.pcm "plughw:0,0"  # First USB device
    capture.pcm "plughw:0,0"   # Same device for capture
}
EOF
```

Note: Device names are still recommended over this approach for better stability.
- The `plug` plugin automatically handles rate conversion (44100Hz → 16000Hz)
- Device names are more stable than card numbers across reboots
- System-wide configuration (/etc/asound.conf) is required for viam-server access
- Restart viam-server after ALSA configuration changes (speechio reinitializes audio devices on service restart)
- Test thoroughly before deploying to multiple devices