Description
Is your feature request related to a problem? Please describe.
In the Gemini Live API, key real-time voice agent controls appear to be unavailable or not exposed as first-class configuration. Specifically:
- Speech rate is fixed or only indirectly controllable.
- Session turn-taking behavior is not configurable, such as whether the agent should speak first at session start.
- Interruption and barge-in behavior is not configurable, such as whether users can interrupt the agent mid-utterance.
These controls are required for production voice agents and contact center deployments. Without them, the experience can feel unnatural, inaccessible, and inconsistent across languages, especially Arabic dialects and English. It also makes it hard to meet different business requirements for inbound versus outbound calls and compliance flows.
Describe the solution you would like
Expose explicit configuration options in the Live API for:
- Speech speed adjustment for generated audio output.
- Speak first behavior toggle to control whether the agent initiates the conversation automatically at session start.
- Allow interruptions toggle to control whether user speech can interrupt or stop the agent while it is speaking.
These settings should be settable at session creation and updatable during an active session without requiring a session restart.
Proposed API capability
A) Speech speed
- Add a parameter in Live API request configuration to set speech speed for generated audio output.
- Allow updates mid-session.
- Preserve pitch by default (time stretching rather than pitch shifting), unless pitch shifting is explicitly enabled.
Suggested parameter:
- speech_rate: float
  - Default: 1.0
  - Range: 0.5 to 2.0
B) Speak first toggle
- speak_first: boolean
  - Default: false
  - When true, the agent produces an initial greeting immediately after session start without waiting for user audio.
  - Works for both inbound and outbound scenarios.
C) Allow interruptions toggle
- allow_interruptions: boolean
  - Default: true
  - When true, user speech should barge in and interrupt agent playback, and the system should stop current agent audio output promptly.
  - When false, user speech should be ignored or buffered until the agent finishes speaking, depending on session mode.
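To make the intended allow_interruptions semantics concrete, here is a minimal sketch of the decision I have in mind when user speech is detected. The names `UserSpeechAction` and `on_user_speech` are hypothetical and only illustrate the requested behavior; they are not part of the Live API or any SDK.

```python
from enum import Enum


class UserSpeechAction(Enum):
    """Hypothetical outcomes when user speech is detected during a session."""
    PROCESS = "process"                 # agent is silent: handle the speech normally
    STOP_AGENT_AUDIO = "stop_agent"     # barge in: stop agent playback, then handle the speech
    HOLD = "hold"                       # agent keeps speaking: ignore or buffer the speech


def on_user_speech(agent_is_speaking: bool, allow_interruptions: bool) -> UserSpeechAction:
    """Decide how to treat detected user speech under the proposed toggle."""
    if not agent_is_speaking:
        # Nothing to interrupt, so the toggle has no effect.
        return UserSpeechAction.PROCESS
    if allow_interruptions:
        return UserSpeechAction.STOP_AGENT_AUDIO
    # Whether HOLD means "ignore" or "buffer" would depend on session mode.
    return UserSpeechAction.HOLD
```

With the proposed default (allow_interruptions set to true), `on_user_speech(True, True)` yields `STOP_AGENT_AUDIO`, which is the barge-in case described above.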
Optional refinement:
- interrupt_policy: enum
  - barge_in_stop_audio
  - barge_in_duck_audio
  - no_barge_in
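Putting the proposed parameters together, the configuration could look roughly like the sketch below. `VoiceControlConfig` and `InterruptPolicy` are hypothetical names used only to illustrate the requested defaults and ranges; nothing here exists in the current Live API, and how interrupt_policy should interact with allow_interruptions is left open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InterruptPolicy(Enum):
    """Proposed optional refinement of the interruption toggle."""
    BARGE_IN_STOP_AUDIO = "barge_in_stop_audio"
    BARGE_IN_DUCK_AUDIO = "barge_in_duck_audio"
    NO_BARGE_IN = "no_barge_in"


@dataclass(frozen=True)
class VoiceControlConfig:
    """Hypothetical container for the proposed settings; not part of any SDK."""
    speech_rate: float = 1.0
    speak_first: bool = False
    allow_interruptions: bool = True
    interrupt_policy: Optional[InterruptPolicy] = None

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly for speech_rate.
        if isinstance(self.speech_rate, bool) or not isinstance(self.speech_rate, (int, float)):
            raise TypeError("speech_rate must be a number")
        if not 0.5 <= self.speech_rate <= 2.0:
            raise ValueError(
                f"speech_rate must be between 0.5 and 2.0, got {self.speech_rate}"
            )
        if not isinstance(self.speak_first, bool):
            raise TypeError("speak_first must be a boolean")
        if not isinstance(self.allow_interruptions, bool):
            raise TypeError("allow_interruptions must be a boolean")
        if self.interrupt_policy is not None and not isinstance(
            self.interrupt_policy, InterruptPolicy
        ):
            raise TypeError("interrupt_policy must be an InterruptPolicy or None")
```

For example, an outbound-call agent that speaks slightly slower and leads the conversation would be `VoiceControlConfig(speech_rate=0.8, speak_first=True)`.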
Acceptance criteria
- Parameters are documented in the Live API reference.
- Behavior is deterministic and consistent across voices and languages, including Arabic.
- Parameters can be updated during an active session without restarting the session.
- Interruption behavior has low latency, meaning agent audio stops quickly when barge in is enabled.
- Invalid values produce clear errors, and defaults are clearly defined.
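As a rough illustration of the mid-session update and error criteria, the sketch below validates every requested change before applying any of them, so a rejected update leaves the active session untouched. `LiveSessionSettings` is a hypothetical in-memory stand-in, not a real Live API object, and the validation rules simply mirror the defaults and range proposed above.

```python
from typing import Any, Callable, Dict


def _check_speech_rate(value: Any) -> None:
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("speech_rate must be a number")
    if not 0.5 <= value <= 2.0:
        raise ValueError(f"speech_rate must be between 0.5 and 2.0, got {value}")


def _check_bool(name: str) -> Callable[[Any], None]:
    def check(value: Any) -> None:
        if not isinstance(value, bool):
            raise ValueError(f"{name} must be true or false, got {value!r}")
    return check


# Proposed defaults from this request.
DEFAULTS: Dict[str, Any] = {
    "speech_rate": 1.0,
    "speak_first": False,
    "allow_interruptions": True,
}

VALIDATORS: Dict[str, Callable[[Any], None]] = {
    "speech_rate": _check_speech_rate,
    "speak_first": _check_bool("speak_first"),
    "allow_interruptions": _check_bool("allow_interruptions"),
}


class LiveSessionSettings:
    """Hypothetical stand-in for the settings of an active session."""

    def __init__(self, **overrides: Any) -> None:
        self._values = dict(DEFAULTS)
        self.update(**overrides)

    def update(self, **changes: Any) -> None:
        # Validate everything first; apply only if all changes are valid.
        for name, value in changes.items():
            if name not in VALIDATORS:
                raise ValueError(f"unknown setting: {name!r}")
            VALIDATORS[name](value)
        self._values.update(changes)

    def get(self, name: str) -> Any:
        return self._values[name]
```

A call such as `settings.update(speech_rate=1.25)` would take effect without recreating the session, while `settings.update(speech_rate=5.0)` would raise an error naming the allowed range.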
Describe alternatives you have considered
- Prompting the model to speak slower or faster, or to wait for user input. This is unreliable.
- External audio time stretching for speech rate, which adds latency and can degrade quality.
- Building custom interruption logic by managing audio streams externally, which is complex and brittle.
- Switching to another TTS stack that has speech rate and barge in controls, which reduces the value of using Gemini Live API.
Additional context
These controls are foundational for real time voice agents:
- Speech speed improves clarity and accessibility, especially for medical instructions and confirmations.
- Speak first is required for outbound calls and for agents that must lead the interaction.
- Allow interruptions is required for natural conversation, higher user satisfaction, and reduced call duration.