We recommend following the custom Twilio phone call setup guide and the web call setup guide. In those guides, the SDK does all the work and you can ignore this doc. This doc shows the underlying protocols for the few folks who want to work with audio bytes, WebSockets, and native JS directly.

Overview

This WebSocket connects directly to the user's audio source, whether that is the web or a phone via the Twilio WebSocket. You should first Register Call, then construct this endpoint and pass it to your frontend or to Twilio.

Different audio sources use different protocols with different requirements on WebSocket messages, leading to different ways to read and send audio bytes. Our system supports multiple protocols, only one of which is defined by us.

The web frontend or a programmable communication tool like Twilio initiates this WebSocket, and Retell AI handles it. Once this socket is established, the call is considered ongoing, and Retell AI initiates the LLM WebSocket to your server to get agent text responses.

Endpoint

Websocket Endpoint: wss://api.retellai.com/audio-websocket/{call_id}

Path Parameters

call_id
string
required

Call id used to identify the call spec and to authenticate. Generated from Register Call.

Query Parameters

enable_update
boolean
default: "false"

Whether to send live updates about the call (currently the transcript) over the WebSocket. Only supported in the Web protocol. Commonly used to render a live transcript in the web frontend.

enable_audio_alignment
boolean
default: "false"

With this field set to true, the audio is not sent as raw audio bytes, but encoded as base64 and sent in a JSON payload that also contains the text corresponding to the audio. This is useful for alignment and animation purposes. Currently the frontend SDK does not support this option.
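For example, a Web protocol endpoint URL with both optional query parameters enabled would look like this (the call id is a placeholder):

// illustrative only — substitute the call id returned by Register Call
const callId = "your-call-id";
const url = `wss://api.retellai.com/audio-websocket/${callId}?enable_update=true&enable_audio_alignment=true`;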

Web Frontend Code Walkthrough

We provide example React frontend code. Detailed guide and GitHub code here.

Supported Protocol: Web

This protocol is defined by us, and is only relevant to you when you directly handle user audio bytes (typically in web frontend).

Raw, unencoded user audio bytes should be sent in chunks in real time (streaming buffer sizes should be between 20 milliseconds and 250 milliseconds of audio; smaller chunks have lower latency), while the server sends back raw audio bytes in chunks for the frontend to buffer for smooth playback. The audio encoding and sample rate are contained in the call detail. The message type is “binary”.
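As a concrete sketch of the sending side, the browser TypeScript below captures the microphone and streams raw PCM chunks over the socket. It assumes mono 16-bit PCM at 24 kHz purely for illustration; use whatever encoding and sample rate your call detail actually specifies.

async function streamMic(callId: string): Promise<WebSocket> {
  const ws = new WebSocket(`wss://api.retellai.com/audio-websocket/${callId}`);
  ws.binaryType = "arraybuffer";

  // Assumed format for illustration: mono 16-bit PCM at 24 kHz.
  const ctx = new AudioContext({ sampleRate: 24000 });
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);

  // 2048 samples / 24 kHz ≈ 85 ms per chunk, within the 20–250 ms guideline.
  const processor = ctx.createScriptProcessor(2048, 1, 1);
  processor.onaudioprocess = (e) => {
    if (ws.readyState !== WebSocket.OPEN) return;
    const f32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i])); // clamp to [-1, 1]
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;  // float -> signed 16-bit
    }
    ws.send(pcm16.buffer); // “binary” message: raw audio bytes, no framing
  };
  source.connect(processor);
  processor.connect(ctx.destination); // keeps the node processing; output is silent
  return ws;
}

ScriptProcessorNode is deprecated but keeps the sketch short; production code would typically use an AudioWorklet instead.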

When the turn changes from agent back to user (for example, when the user interrupts the agent), the server sends a “clear” event to signal the frontend to clear its audio buffer, so that the agent stops speaking as soon as it should.

The connection closes on errors, and the close event includes a reason when the connection was closed due to an error.

Frontend -> Retell Message Event Spec

The frontend sends mic input over this WebSocket to the Retell server for speech recognition.

content: raw audio bytes

Retell -> Frontend Message Event Spec

There are a couple of events that the Retell server can send to your frontend, depending on your configuration. They are as follows:

Raw Audio Response Event

This event contains the agent's audio response as raw audio bytes. The audio is sent whenever generation completes, so it can arrive faster than real time.

  • This event is populated when query parameter enable_audio_alignment is false.
  • content: raw audio bytes

Clear Event

Retell sends audio as it is synthesized, and the frontend buffers it for real-time playback. However, there are cases where the buffered audio needs to be cleared, for example when the user interrupts the agent. This event signals the frontend to clear all audio in its buffer.

  • content: “clear”
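Putting the receive side together, the sketch below (reusing the ws returned by streamMic above, and the same assumed 24 kHz 16-bit PCM format) schedules incoming audio chunks back to back for smooth playback and flushes them on “clear”:

const playbackCtx = new AudioContext({ sampleRate: 24000 });
let playhead = 0; // AudioContext time at which the next chunk should start
let playing: AudioBufferSourceNode[] = [];

ws.onmessage = (msg) => {
  if (typeof msg.data === "string") {
    if (msg.data === "clear") {
      // User interrupted the agent: stop and drop everything buffered.
      playing.forEach((node) => node.stop());
      playing = [];
      playhead = playbackCtx.currentTime;
    }
    return; // other string messages (update, etc.) are handled elsewhere
  }
  // Binary frame: one chunk of raw agent audio.
  const pcm16 = new Int16Array(msg.data as ArrayBuffer);
  const buffer = playbackCtx.createBuffer(1, pcm16.length, 24000);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm16.length; i++) channel[i] = pcm16[i] / 0x8000;
  const node = playbackCtx.createBufferSource();
  node.buffer = buffer;
  node.connect(playbackCtx.destination);
  playhead = Math.max(playhead, playbackCtx.currentTime);
  node.start(playhead); // chunks can arrive faster than real time; queue them
  playhead += buffer.duration;
  playing.push(node); // finished nodes are never pruned here, for brevity
};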

Update Event

Contains real-time information about the call, such as the transcript and turn-taking information.

  • This event is populated when query parameter enable_update is true.
  • content: json that contains the following properties
event_type
enum<string>
required

Differentiates what this event is.

Available options: update

transcript
object[]
required

Complete live transcript collected in the call so far. Presented in the form of a list of utterances.

turntaking
enum<string>

Indicates a change of speaker (turn taking). This field is present when the speaker changes to the user (user turn), or right before the agent is about to speak (agent turn). It can be helpful for determining when to call functions during the call. Available options: agent_turn, user_turn
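Typed out, the update event might look like the following; the exact shape of each utterance is not spelled out above, so the role/content field names are assumptions for illustration:

interface Utterance {
  role: "agent" | "user"; // assumed field names, for illustration only
  content: string;
}

interface UpdateEvent {
  event_type: "update";
  transcript: Utterance[];                 // full live transcript so far
  turntaking?: "agent_turn" | "user_turn"; // only present on speaker change
}

// Hypothetical UI hook: replace with your own rendering logic.
function renderTranscript(transcript: Utterance[]): void {
  console.log(transcript.map((u) => `${u.role}: ${u.content}`).join("\n"));
}

ws.addEventListener("message", (msg) => {
  // Update events arrive as JSON strings; binary frames are agent audio.
  if (typeof msg.data !== "string" || !msg.data.startsWith("{")) return;
  const event = JSON.parse(msg.data);
  if (event.event_type === "update") {
    renderTranscript((event as UpdateEvent).transcript);
  }
});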

Text Audio Alignment Event

Sometimes you might want to show text in the frontend. Usually that is achievable by subscribing to the Update Event, but sometimes you need to align the text with the audio precisely (for example, for avatar use cases). This event sends audio along with its text alignment information. Note that the frontend SDK does not yet support this event.

  • This event is populated when query parameter enable_audio_alignment is true.
  • The Raw Audio Response Event will not be sent when this event is sent.
  • content: json that contains the following properties
event_type
enum<string>
required

Differentiates what this event is.

Available options: audio_alignment

text
string
required

The text content corresponding to the audio.

audio
string
required

Base64 encoded raw audio bytes.
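A small sketch of consuming this event, with the payload typed from the fields above; decoding recovers the same raw bytes the Raw Audio Response Event would otherwise carry:

interface AudioAlignmentEvent {
  event_type: "audio_alignment";
  text: string;  // text corresponding to this audio chunk
  audio: string; // base64 encoded raw audio bytes
}

// Decode the base64 payload back into raw bytes for playback, keeping
// `text` alongside it for caption or lip-sync timing.
function decodeAlignment(event: AudioAlignmentEvent): Uint8Array {
  const binary = atob(event.audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}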

Metadata Event

When your custom LLM passes a metadata event to Retell, it is forwarded to your frontend as this event.

event_type
enum<string>
required

Differentiates what this event is.

Available options: metadata

metadata
object
required

You can put anything here that can be json serialized.
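Typed out, the forwarded event has this shape (inferred from the fields above):

interface MetadataEvent {
  event_type: "metadata";
  metadata: Record<string, unknown>; // whatever JSON-serializable payload your custom LLM sent
}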

Sample Events

Frontend -> Retell

// audio bytes
[0x24, 0xFF, 0xAE, 0x11, ... , 0x1A, 0x3F]

Retell -> Frontend

// audio bytes
[0x24, 0xFF, 0xAE, 0x11, ... , 0x1A, 0x3F]

Supported Protocol: Twilio WebSocket

You don’t need to worry about handling the socket and audio format, as we connect directly to Twilio’s server. When using Twilio for phone calls, you never touch audio bytes; responding to Twilio with the correct WebSocket URL is enough.

See Twilio’s WebSocket protocol for details. In short, audio is base64 encoded and stored in the media event under media.payload.
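For example, a minimal Node/TypeScript webhook that answers Twilio’s incoming-call request with a Connect/Stream pointing at this endpoint could look like the sketch below. The Express wiring and the registerCall helper are illustrative, not a prescribed setup; only the WebSocket URL format comes from this doc.

import express from "express";
import twilio from "twilio";

// Hypothetical helper: calls Retell’s Register Call API and returns the call id.
async function registerCall(): Promise<string> {
  // ... your Register Call request goes here ...
  return "your-call-id";
}

const app = express();

app.post("/twilio-voice-webhook", async (_req, res) => {
  const callId = await registerCall();
  const response = new twilio.twiml.VoiceResponse();
  // Point Twilio’s media stream at the Retell audio WebSocket.
  response.connect().stream({
    url: `wss://api.retellai.com/audio-websocket/${callId}`,
  });
  res.type("text/xml").send(response.toString());
});

app.listen(3000);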

Android and iOS support

We don’t have an Android or iOS SDK for now, but you can follow the web implementation to build one for Android or iOS:

  1. Implement an audio WebSocket that connects to the Retell server. It is responsible for sending and receiving audio bytes between the client and the Retell server. (Code)

  2. Call register-call on your server to get the call id.

  3. Pass the microphone stream to the WebSocket. (Code)

  4. Handle the “clear” event from the server; it means the user has interrupted the agent and you need to clear the locally buffered audio.

FAQ