We recommend following the custom Twilio phone call setup guide and the web call setup guide. In those guides, the SDK does all the work and you can ignore this doc. This doc shows the underlying protocols for the few folks who want to work with audio bytes, WebSockets, and native JS directly.

Overview

This WebSocket connects directly to the user's audio source, whether that is the web or a phone via the Twilio WebSocket. You should first Register Call, then construct this endpoint and pass it to your frontend or to Twilio.

Different audio sources use different protocols with different requirements on WebSocket messages, leading to different ways to read and send audio bytes. Our system supports multiple protocols, only one of which is defined by us.

The web frontend or a programmable communication tool like Twilio initiates this WebSocket, and Retell AI handles it. Once this socket is established, the call is considered ongoing, and Retell AI initiates the LLM WebSocket to your server to get agent text responses.

Endpoint

Websocket Endpoint: wss://api.retellai.com/audio-websocket/{call_id}

Path Parameters

call_id
string
required

Call id used to identify the call spec and to authenticate. Generated from Register Call.

Query Parameters

enable_update
boolean
default: "false"

Whether to send live updates about the call (currently the transcript) over the WebSocket. Only supported in the Web protocol. Commonly used to render a live transcript in the web frontend.

enable_audio_alignment
boolean
default: "false"

With this field set to true, the audio is not sent as raw audio bytes, but encoded as base64 and sent in a JSON payload that also contains the text corresponding to the audio. This is useful for alignment and animation purposes. Currently the frontend SDK does not support this option.
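For example, a Web protocol endpoint URL with both optional query parameters enabled would look like this (the call id is a placeholder):

// illustrative only — substitute the call id returned by Register Call
const callId = "your-call-id";
const url = `wss://api.retellai.com/audio-websocket/${callId}?enable_update=true&enable_audio_alignment=true`;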

Web Frontend Code Walkthrough

We provide example React frontend code. Detailed guide and GitHub code here.

Supported Protocol: Web

This protocol is defined by us, and is only relevant to you when you directly handle user audio bytes (typically in web frontend).

Raw, unencoded user audio bytes should be sent in chunks in real time (streaming buffer sizes should be between 20 milliseconds and 250 milliseconds of audio; smaller chunks have lower latency), while the server sends back raw audio bytes in chunks for the frontend to buffer for smooth playback. The audio encoding and sample rate are contained in the call detail. The message type is “binary”.
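As a concrete sketch of the sending side, the browser TypeScript below captures the microphone and streams raw PCM chunks over the socket. It assumes mono 16-bit PCM at 24 kHz purely for illustration; use whatever encoding and sample rate your call detail actually specifies.

async function streamMic(callId: string): Promise<WebSocket> {
  const ws = new WebSocket(`wss://api.retellai.com/audio-websocket/${callId}`);
  ws.binaryType = "arraybuffer";

  // Assumed format for illustration: mono 16-bit PCM at 24 kHz.
  const ctx = new AudioContext({ sampleRate: 24000 });
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);

  // 2048 samples / 24 kHz ≈ 85 ms per chunk, within the 20–250 ms guideline.
  const processor = ctx.createScriptProcessor(2048, 1, 1);
  processor.onaudioprocess = (e) => {
    if (ws.readyState !== WebSocket.OPEN) return;
    const f32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i])); // clamp to [-1, 1]
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;  // float -> signed 16-bit
    }
    ws.send(pcm16.buffer); // “binary” message: raw audio bytes, no framing
  };
  source.connect(processor);
  processor.connect(ctx.destination); // keeps the node processing; output is silent
  return ws;
}

ScriptProcessorNode is deprecated but keeps the sketch short; production code would typically use an AudioWorklet instead.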

When the turn changes from agent back to user (for example, when the user interrupts the agent), the server sends a “clear” event to signal the frontend to clear its audio buffer, so that the agent stops speaking as soon as it should.

The connection closes on errors, and the close event includes a reason when the connection was closed due to an error.

Frontend -> Retell Message Event Spec

The frontend sends mic input over this WebSocket to the Retell server for speech recognition.

content: raw audio bytes

Retell -> Frontend Message Event Spec

There are a couple of events that the Retell server can send to your frontend, depending on your configuration. They are as follows:

Raw Audio Response Event

This event contains the agent's audio response as raw audio bytes. The audio is sent whenever generation completes, so it can arrive faster than real time.

  • This event is populated when query parameter enable_audio_alignment is false.
  • content: raw audio bytes

Clear Event

Retell sends audio as it is synthesized, and the frontend buffers it for real-time playback. However, there are cases where the buffered audio needs to be cleared, for example when the user interrupts the agent. This event signals the frontend to clear all audio in its buffer.

  • content: “clear”
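Putting the receive side together, the sketch below (reusing the ws returned by streamMic above, and the same assumed 24 kHz 16-bit PCM format) schedules incoming audio chunks back to back for smooth playback and flushes them on “clear”:

const playbackCtx = new AudioContext({ sampleRate: 24000 });
let playhead = 0; // AudioContext time at which the next chunk should start
let playing: AudioBufferSourceNode[] = [];

ws.onmessage = (msg) => {
  if (typeof msg.data === "string") {
    if (msg.data === "clear") {
      // User interrupted the agent: stop and drop everything buffered.
      playing.forEach((node) => node.stop());
      playing = [];
      playhead = playbackCtx.currentTime;
    }
    return; // other string messages (update, etc.) are handled elsewhere
  }
  // Binary frame: one chunk of raw agent audio.
  const pcm16 = new Int16Array(msg.data as ArrayBuffer);
  const buffer = playbackCtx.createBuffer(1, pcm16.length, 24000);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm16.length; i++) channel[i] = pcm16[i] / 0x8000;
  const node = playbackCtx.createBufferSource();
  node.buffer = buffer;
  node.connect(playbackCtx.destination);
  playhead = Math.max(playhead, playbackCtx.currentTime);
  node.start(playhead); // chunks can arrive faster than real time; queue them
  playhead += buffer.duration;
  playing.push(node); // finished nodes are never pruned here, for brevity
};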

Update Event

Contains real-time information about the call, such as the transcript and turn-taking information.

  • This event is populated when query parameter enable_update is true.
  • content: json that contains the following properties
event_type
enum<string>
required

Differentiates what this event is.

Available options: update

transcript
object[]
required

Complete live transcript collected in the call so far. Presented in the form of a list of utterances.

turntaking
enum<string>

Indicates a change of speaker (turn taking). This field is present when the speaker changes to the user (user turn), or right before the agent is about to speak (agent turn). It can be helpful for determining when to call functions during the call. Available options: agent_turn, user_turn
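Typed out, the update event might look like the following; the exact shape of each utterance is not spelled out above, so the role/content field names are assumptions for illustration:

interface Utterance {
  role: "agent" | "user"; // assumed field names, for illustration only
  content: string;
}

interface UpdateEvent {
  event_type: "update";
  transcript: Utterance[];                 // full live transcript so far
  turntaking?: "agent_turn" | "user_turn"; // only present on speaker change
}

// Hypothetical UI hook: replace with your own rendering logic.
function renderTranscript(transcript: Utterance[]): void {
  console.log(transcript.map((u) => `${u.role}: ${u.content}`).join("\n"));
}

ws.addEventListener("message", (msg) => {
  // Update events arrive as JSON strings; binary frames are agent audio.
  if (typeof msg.data !== "string" || !msg.data.startsWith("{")) return;
  const event = JSON.parse(msg.data);
  if (event.event_type === "update") {
    renderTranscript((event as UpdateEvent).transcript);
  }
});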

Text Audio Alignment Event

Sometimes you might want to show text in the frontend. Usually that is achievable by subscribing to the Update Event, but sometimes you need to align the text with the audio precisely (for example, for avatar use cases). This event sends audio along with its text alignment information. Note that the frontend SDK does not yet support this event.

  • This event is populated when query parameter enable_audio_alignment is true.
  • The Raw Audio Response Event will not be sent when this event is sent.
  • content: json that contains the following properties
event_type
enum<string>
required

Differentiates what this event is.

Available options: audio_alignment

text
string
required

The text content corresponding to the audio.

audio
string
required

Base64 encoded raw audio bytes.
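A small sketch of consuming this event, with the payload typed from the fields above; decoding recovers the same raw bytes the Raw Audio Response Event would otherwise carry:

interface AudioAlignmentEvent {
  event_type: "audio_alignment";
  text: string;  // text corresponding to this audio chunk
  audio: string; // base64 encoded raw audio bytes
}

// Decode the base64 payload back into raw bytes for playback, keeping
// `text` alongside it for caption or lip-sync timing.
function decodeAlignment(event: AudioAlignmentEvent): Uint8Array {
  const binary = atob(event.audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}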

Metadata Event

When your custom LLM passes a metadata event to Retell, it is forwarded to your frontend as this event.

event_type
enum<string>
required

Differentiates what this event is.

Available options: metadata

metadata
object
required

You can put anything here that can be json serialized.
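Typed out, the forwarded event has this shape (inferred from the fields above):

interface MetadataEvent {
  event_type: "metadata";
  metadata: Record<string, unknown>; // whatever JSON-serializable payload your custom LLM sent
}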

Sample Events

Frontend -> Retell

// audio bytes
[0x24, 0xFF, 0xAE, 0x11, ... , 0x1A, 0x3F]

Retell -> Frontend

// audio bytes
[0x24, 0xFF, 0xAE, 0x11, ... , 0x1A, 0x3F]

Supported Protocol: Twilio WebSocket

You don’t need to worry about handling the socket and audio format, as we connect directly to Twilio’s server. When using Twilio for phone calls, you never touch audio bytes; responding to Twilio with the correct WebSocket URL is enough.

See Twilio’s WebSocket protocol for details. In short, audio is base64 encoded and stored in the media event under media.payload.
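For example, a minimal Node/TypeScript webhook that answers Twilio’s incoming-call request with a Connect/Stream pointing at this endpoint could look like the sketch below. The Express wiring and the registerCall helper are illustrative, not a prescribed setup; only the WebSocket URL format comes from this doc.

import express from "express";
import twilio from "twilio";

// Hypothetical helper: calls Retell’s Register Call API and returns the call id.
async function registerCall(): Promise<string> {
  // ... your Register Call request goes here ...
  return "your-call-id";
}

const app = express();

app.post("/twilio-voice-webhook", async (_req, res) => {
  const callId = await registerCall();
  const response = new twilio.twiml.VoiceResponse();
  // Point Twilio’s media stream at the Retell audio WebSocket.
  response.connect().stream({
    url: `wss://api.retellai.com/audio-websocket/${callId}`,
  });
  res.type("text/xml").send(response.toString());
});

app.listen(3000);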

Android and iOS support

We don’t have an Android or iOS SDK for now, but you can follow the web implementation to build one for Android or iOS:

  1. Implement an audio WebSocket that connects to the Retell server. It is responsible for sending and receiving audio bytes between the client and the Retell server. (Code)

  2. Call register-call on your server to get the call id.

  3. Pass the microphone stream to the WebSocket. (Code)

  4. Handle the “clear” event from the server; it means the user has interrupted the agent and you need to clear the locally buffered audio.

FAQ