Audio WebSocket (Deprecated)
Retell AI connects to your user audio source, and reads and sends audio bytes through this WebSocket.
Overview
This WebSocket connects directly to the user audio source, whether that is a web frontend or a phone call through a Twilio WebSocket. You should first Register Call, then construct this endpoint and pass it to your frontend or to Twilio.
Different audio sources use different protocols and have different requirements for WebSocket messages, leading to different ways to read and send audio bytes. Our system supports multiple protocols, only one of which is defined by us.
Your web frontend or a programmable communication tool like Twilio initiates this WebSocket, and Retell AI handles it. Once this socket is received, the call is considered ongoing, and Retell AI will initiate the LLM WebSocket to your server to get agent text responses.
Endpoint
wss://api.retellai.com/audio-websocket/{call_id}
Path Parameters
call_id: Call id used to identify the call spec and to authenticate. Generated from Register Call.
Query Parameters
enable_update
Whether to send live updates about the call (currently the transcript) over the WebSocket. Only supported in the Web protocol. Commonly used to render a live transcript in the web frontend.
enable_audio_alignment
With this field set to true, audio is not sent as raw audio bytes, but is base64 encoded and sent in a JSON payload that also contains the text corresponding to the audio. This is useful for alignment and animation purposes. Currently the frontend SDK does not support this option.
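As a sketch, the endpoint and the query parameters above can be assembled like this. The helper name is illustrative; the parameter names (enable_update, enable_audio_alignment) come from this page.

```typescript
// Illustrative helper: build the audio WebSocket URL from a call_id
// returned by Register Call, with the optional query parameters.
function buildAudioWebSocketUrl(
  callId: string,
  opts: { enableUpdate?: boolean; enableAudioAlignment?: boolean } = {}
): string {
  const params = new URLSearchParams();
  if (opts.enableUpdate) params.set("enable_update", "true");
  if (opts.enableAudioAlignment) params.set("enable_audio_alignment", "true");
  const q = params.toString();
  return `wss://api.retellai.com/audio-websocket/${callId}` + (q ? `?${q}` : "");
}
```

Pass the resulting URL to your frontend (Web protocol) or to Twilio.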
Web Frontend Code Walkthrough
We provide example React frontend code. Detailed guide and GitHub code here.
Supported Protocol: Web
This protocol is defined by us, and is only relevant to you when you directly handle user audio bytes (typically in a web frontend).
Raw user audio bytes with no encoding should be sent in chunks in real time (streaming buffer sizes should be between 20 milliseconds and 250 milliseconds of audio; smaller chunks have lower latency), while the server sends back raw audio bytes in chunks for the frontend to buffer for smooth playback. The audio encoding and sample rate are contained in the call detail. The message type is “binary”.
When the turn changes from agent back to user (for example, when the user interrupts the agent), the server sends a “clear” event to signal the frontend to clear its audio buffer, so that the agent stops speaking as soon as it should.
The connection closes on errors, and a reason is included in the close event when the connection closed due to an error.
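A minimal browser-side sketch of this protocol, assuming 16-bit PCM (so 2 bytes per sample; confirm the encoding in the call detail). The function names are illustrative, and the WebSocket wiring is browser-only:

```typescript
// How many bytes one streaming chunk holds for a given sample rate and
// duration, assuming 16-bit (2-byte) mono PCM. The page recommends
// chunks between 20 ms and 250 ms of audio.
function chunkSizeBytes(sampleRate: number, ms: number, bytesPerSample = 2): number {
  return Math.round((sampleRate * ms) / 1000) * bytesPerSample;
}

// Browser-only wiring (sketch): binary frames carry audio in both
// directions; a string "clear" frame tells us to flush the playback buffer.
function connectAudioSocket(
  url: string,
  onAgentAudio: (buf: ArrayBuffer) => void,
  onClear: () => void
) {
  const ws = new (globalThis as any).WebSocket(url);
  ws.binaryType = "arraybuffer"; // agent audio arrives as binary frames
  ws.onmessage = (ev: any) => {
    if (typeof ev.data === "string") {
      if (ev.data === "clear") onClear(); // user interrupted: flush playback
    } else {
      onAgentAudio(ev.data); // raw agent audio bytes: buffer for smooth playback
    }
  };
  return ws;
}
```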
Frontend -> Retell Message Event Spec
The frontend sends mic input through the WebSocket to the Retell server for speech recognition.
content: raw audio bytes
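Browser microphone APIs typically deliver Float32 samples, so they must be converted to raw bytes before sending. A sketch, assuming the call's audio encoding (from the call detail) is 16-bit linear PCM:

```typescript
// Convert Float32 mic samples in [-1, 1] to 16-bit PCM, the form
// assumed here for the raw audio bytes sent over the WebSocket.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

In a browser you would then call `ws.send(pcm.buffer)` for each chunk.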
Retell -> Frontend Message Event Spec
There are a couple of events that can be sent from the Retell server to your frontend, depending on your configuration. They are as follows:
Raw Audio Response Event
This event contains the agent's audio response in raw audio bytes. The audio is sent whenever generation completes, so the audio can arrive faster than real time.
- Populated when the query parameter enable_audio_alignment is false.
- content: raw audio bytes
Clear Event
Retell sends audio as it is synthesized, and the frontend buffers it to play in real time. However, there are cases where the buffered audio needs to be cleared, for example when the user interrupts the agent. This event signals the frontend to clear all audio in its buffer.
- content: “clear”
Update Event
Contains real-time information about the call, such as the transcript and turn-taking information.
- Populated when the query parameter enable_update is true.
- content: JSON that contains the following properties:
  - Complete live transcript collected in the call so far, presented as a list of utterances.
  - Indication of a change of speaker (turn taking). This field is present when the speaker changes to the user (user turn), or right before the agent is about to speak (agent turn). It can be helpful for determining when to call functions during the call. Available options: agent_turn, user_turn
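A hedged sketch of consuming the Update Event, for example to render a live transcript. The field and property names used here (transcript, turntaking, role, content) are assumptions inferred from the descriptions above; check the sample events for the exact payload shape.

```typescript
// Assumed shapes for the Update Event payload (names are illustrative).
interface Utterance { role: string; content: string; }
interface UpdateEvent {
  transcript?: Utterance[];
  turntaking?: "agent_turn" | "user_turn";
}

// Return the most recent user utterance, e.g. to highlight it in the UI.
function latestUserUtterance(update: UpdateEvent): string | null {
  const t = update.transcript ?? [];
  for (let i = t.length - 1; i >= 0; i--) {
    if (t[i].role === "user") return t[i].content;
  }
  return null;
}
```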
Text Audio Alignment Event
Sometimes you might want to show text in the frontend; usually that is achievable by subscribing to the Update Event, but sometimes you need to align the text with the audio precisely (for example, for avatar use cases). This event sends audio along with text alignment information. Note that the frontend SDK does not yet support this event.
- Populated when the query parameter enable_audio_alignment is true. The Raw Audio Response Event will not be sent when this event is sent.
- content: JSON that contains the following properties:
  - The text content corresponding to the audio.
  - Base64 encoded raw audio bytes.
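Because this event carries audio as base64 rather than binary frames, the frontend has to decode it back into bytes before playback. A sketch (the function name is illustrative; `atob` is a browser global, also available in Node 16+):

```typescript
// Decode the base64 audio field of an alignment event into raw bytes.
function decodeAlignmentAudio(b64: string): Uint8Array {
  const bin: string = (globalThis as any).atob(b64); // base64 -> binary string
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}
```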
Metadata Event
When your custom LLM passes a metadata event to Retell, it is forwarded to your frontend here.
- Differentiates what this event is. Available options: metadata
- You can put anything here that can be JSON serialized.
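Since metadata arrives on the same socket as other JSON events, the frontend needs to tell them apart. A hedged sketch: the field names (event_type, metadata) are assumptions inferred from the descriptions above.

```typescript
// Assumed shape of the forwarded metadata event (names are illustrative).
interface MetadataEvent { event_type: "metadata"; metadata: unknown; }

// Narrow an incoming parsed JSON message to a metadata event.
function isMetadataEvent(msg: unknown): msg is MetadataEvent {
  return (
    typeof msg === "object" &&
    msg !== null &&
    (msg as { event_type?: string }).event_type === "metadata"
  );
}
```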
Sample Events
Frontend -> Retell
Retell -> Frontend
Supported Protocol: Twilio WebSocket
You don't need to worry about handling the socket and format, as we connect directly to the Twilio server. When using Twilio for phone calls, you don't need to touch any audio bytes; responding to Twilio with the correct WebSocket URL is enough.
See this websocket protocol for details. In short, audio is base64 encoded and stored in the media event under media.payload.
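Twilio's `<Connect><Stream>` TwiML verb opens a bidirectional Media Stream to a WebSocket URL, which is how the audio endpoint above would typically be handed to Twilio. A hedged sketch of building that webhook response (the helper name is illustrative):

```typescript
// Build a TwiML response that streams the phone call's audio to the
// Retell audio WebSocket for the given call_id.
function twimlForCall(callId: string): string {
  const url = `wss://api.retellai.com/audio-websocket/${callId}`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Connect><Stream url="${url}" /></Connect>`,
    "</Response>",
  ].join("\n");
}
```

Your server would return this XML (content type `text/xml`) from the Twilio voice webhook after registering the call.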
Android and iOS support
We don't have an Android or iOS SDK for now, but you can follow the web implementation to implement for Android and iOS:
- Implement an audio WebSocket connection to the Retell server. It is responsible for sending and receiving audio bytes between the client and the Retell server. (Code)
- Call register-call on your server to get the call id.
- Pass the microphone stream to the WebSocket. (Code)
- Handle the “clear” event from the server; it means the user has interrupted the agent and you need to clear the local audio data.
FAQ