Background

With the advancement of generative AI, we have witnessed significant growth in chatbot products that dominate the market. Simultaneously, voice AI has improved to the extent that smooth conversations with AI are now feasible. Whether you are building AI for inbound and outbound calls, professional services, companion apps, etc., voice remains a core part of the experience and is important for conversion. We can all recall frustrating experiences with AI during calls — robotic voices, awkward silences, long latency periods, and the need to press buttons to interact, which collectively diminish the human-like quality of the experience and occasionally irritate users.

How Humans Perform

Before we jump into how to build a great voice experience, let’s take a moment to recap how humans usually interact in a conversation. We operate with < 200ms latency when turn-taking happens, we backchannel as needed, we subconsciously sense when the other party has finished their turn, we understand the other party’s meaning and emotions, we use filler words within our sentences, we stop talking when interrupted… The list goes on, but the essential point is that many little mechanisms are at work behind the scenes during a simple, smooth conversation, and it is extremely HARD for machines to account for all of them and perform like humans.

Components & Work Needed

One question we get asked a lot is why you have to use the Retell API — can’t you just stitch together ASR (speech-to-text), an LLM, and TTS (text-to-speech) to build a voice conversation? Well, you totally should try if you have the time, and see how far a simple stitching approach gets you. The number one problem we hear from teams that build their own voice system is that latency is hard to cut down; the number two problem is that interruption handling is hard to implement with a simple setup; the number three problem is that the agent’s responses don’t sound conversational enough to be human. To tackle all of these, let’s go over the components that need to be in place and the work that needs to be done for a good conversational voice AI experience.

  • Integrate with web frontend or programmable communication tools like Twilio to get user audio.
  • Work with audio bytes and streaming protocols: user audio from different frontends (web, phone call) arrives in different encodings and formats and is sent over different streaming protocols. This is a strenuous task, as audio bytes are hard to manipulate and time-consuming to work with; ask any engineer you know who works with audio signals and they will say the same (see the decoding sketch after this list).
  • Understand the audio: There are various signals from audio that are vital for a smooth conversation.
    • Text: usually generated by ASR (automatic speech recognition); needs to be streaming, and as fast and accurate as possible.
    • Emotion: understanding the emotional state of the other party is vital for humans to make a good response in conversation.
    • Audio signal quality: whether there’s background noise, whether there’s echo.
    • Speaker diarization: if multiple people are talking, identify who is speaking and whether they are talking to you or to someone else.
    • Tonality and other speaker-specific traits.
    • Pause: whether the user stops talking, usually deduced from VAD (Voice Activity Detection).
  • Decide whether to speak: understand whether the other party is about to finish their turn or has already finished, and whether they are awaiting a response or just pausing to formulate their thoughts. This decision needs to combine text, emotion, tonality, pause, and other audio signals (a baseline endpointing sketch follows this list).
    • The tricky part is that users you don’t know well can stop at any point, and the AI has to be prepared for those abrupt ends.
    • A user continuing to speak unexpectedly is less of a problem, since the AI can simply keep listening and revert its decision to talk.
  • Generating the responses: generating a good response to what the user has said is hard, and very scenario-specific. There are various ways to do this part, and it is customized for each use case, so here I will just share one simple flow of response generation (sketched in code after this list).
    • RAG (retrieval augmented generation): Embeds documents, data, knowledge base, etc., into a vector database and retrieves only the relevant information out of it. This relevant info will be part of the prompt that gets fed into the LLM.
    • LLM: This can be a fine-tuned self-hosted model, or API calls to providers. Based on the relevant information from the last step and some other prompts that are customized for the user, generate a response, and stream the response back to a voice generation system.
    • Verification: for certain high-stakes phrases / words, consider verifying before streaming back.
  • Synthesize the audio: usually achieved with TTS (text-to-speech) models that transform the response text into audio. The voice needs tone and emotion variance suited to the scenario to sound humanlike, and ideally the TTS output is streamed back for lower latency.
  • Taking actions: AI that can talk is cool, and AI that can take actions is cooler. This is usually achieved with the function-calling capabilities of certain models, or with structured data output, so that downstream systems can book appointments or look up information when appropriate (see the function-calling sketch after this list).
    • When taking actions, make sure your agent can still respond.
    • A specific use case of this is to transfer the call or end the call.
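
To make the audio-bytes point concrete, here is a minimal sketch of decoding a single Twilio Media Streams websocket message into linear PCM. It uses Python’s standard-library audioop module (deprecated since 3.11 and removed in 3.13, where a pip-installable fork such as audioop-lts can stand in); the message shape follows Twilio’s documented media event.

```python
import audioop  # stdlib; deprecated in Python 3.11, removed in 3.13
import base64
import json

def decode_twilio_media(message: str) -> bytes:
    """Turn one Twilio Media Streams message into 16 kHz, 16-bit mono PCM.

    Twilio phone audio arrives as base64-encoded 8 kHz mu-law; most ASR
    models want 16-bit linear PCM, often at 16 kHz.
    """
    event = json.loads(message)
    if event.get("event") != "media":
        return b""                                    # ignore start/stop/mark events
    mulaw = base64.b64decode(event["media"]["payload"])
    pcm_8k = audioop.ulaw2lin(mulaw, 2)               # mu-law -> 16-bit linear PCM
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)  # resample to 16 kHz
    return pcm_16k
```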
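Next, the baseline endpointing sketch referenced above: a silence-only end-of-turn detector built on the webrtcvad package. The 700ms threshold is an illustrative assumption; a fixed threshold either adds latency or cuts users off mid-thought, which is exactly why production systems layer text and prosody on top.

```python
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20                                     # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono
END_OF_TURN_MS = 700                              # assumed silence threshold

class NaiveEndpointer:
    """Silence-only end-of-turn detection: a baseline, not a turn-taking model."""

    def __init__(self) -> None:
        self.vad = webrtcvad.Vad(2)               # aggressiveness 0 (lenient) to 3 (strict)
        self.silence_ms = 0

    def feed(self, frame: bytes) -> bool:
        """Feed one FRAME_BYTES-sized PCM frame; return True at end of turn."""
        if self.vad.is_speech(frame, SAMPLE_RATE):
            self.silence_ms = 0
        else:
            self.silence_ms += FRAME_MS
        return self.silence_ms >= END_OF_TURN_MS
```

Lowering END_OF_TURN_MS makes the agent snappier but more likely to talk over people; raising it does the opposite. A learned turn-taking model escapes that trade-off by also reading what was said.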
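The simple response-generation flow above might look like this sketch: retrieval feeds the prompt, the LLM streams tokens, and complete sentences are flushed to TTS so synthesis starts before generation finishes. The retrieve and speak callables are hypothetical stand-ins for your vector database and TTS queue, and the model name is just an example.

```python
import re
from openai import OpenAI  # pip install openai

client = OpenAI()

def generate_reply(transcript: str, retrieve, speak) -> None:
    """Stream an LLM reply, handing complete sentences to TTS as they form."""
    context = retrieve(transcript)  # RAG step: fetch only the relevant chunks
    stream = client.chat.completions.create(
        model="gpt-4o-mini",        # example model; use whatever fits your stack
        stream=True,
        messages=[
            {"role": "system",
             "content": f"Be brief and conversational.\nContext:\n{context}"},
            {"role": "user", "content": transcript},
        ],
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush finished sentences so TTS can start before the LLM is done.
        *sentences, buffer = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in sentences:
            speak(sentence)
    if buffer.strip():
        speak(buffer)               # whatever trailed after the last boundary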
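```

Finally, the function-calling sketch for taking actions. book_appointment is a made-up tool; the point is that the model returns structured arguments your backend can act on, while the voice loop says a filler line (e.g., “one moment while I check the calendar”) so the agent doesn’t go silent.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "book_appointment",     # hypothetical downstream action
        "description": "Book an appointment for the caller.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO 8601 date"},
                "time": {"type": "string", "description": "24h clock, e.g. 14:30"},
            },
            "required": ["date", "time"],
        },
    },
}]

def respond_or_act(messages: list) -> tuple:
    """One LLM turn that either answers in text or requests a tool call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=TOOLS
    )
    message = response.choices[0].message
    if message.tool_calls:
        call = message.tool_calls[0]
        return call.function.name, json.loads(call.function.arguments)
    return None, message.content
```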

What Retell AI Offers

I think by now, most folks would agree that this is not as easy as stitching together ASR, an LLM, and TTS. Thus, let me (shamelessly) introduce how Retell AI can help here. By integrating with Retell AI, you can save months of development, enjoy a state-of-the-art voice experience, and get all of the following covered:

  • Low Latency: we apply optimizations at every step to drive the audio-side latency as low as possible. Note that the response generation part is still in your hands, so we have little control over that. Our demo runs at ~800ms between the user finishing speaking and the agent responding (a rough latency budget is sketched after this list).
  • Robustness to noise: we optimize performance for different scenarios and kinds of background noise to achieve robust, consistent performance across environments.
  • Interruption Handling: the user can interrupt at any time, and the agent reacts to it blazingly fast, like a real human would.
  • Audio Integration: We can connect directly with Twilio, and we’ve open-sourced web frontend code.
  • Work with audio bytes: we’ve got it covered and can save you hundreds of hours.
  • Audio understanding: we mine insights from audio with minimal latency and send real-time transcripts (other signals coming soon) to you.
  • When to speak decision: we have our own turn-taking model and will keep iterating it to make better decisions on when to speak.
  • Audio synthesis: we integrate with various TTS providers and have hired voice actors to create humanlike voices suited for conversation.
  • Quick demo & dashboard: we have set up our dashboard to support building a demo within 2 minutes.
  • Domain expertise & support: we have domain expertise in audio, modeling, agent creation. We are here to help whenever you run into issues.
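
For a feel of what that ~800ms means, here is a back-of-the-envelope latency budget for a stitched pipeline. Every number below is an illustrative assumption, not a measurement:

```python
# Rough voice-to-voice latency budget (all numbers are illustrative assumptions)
budget_ms = {
    "ASR finalizes transcript after user stops": 300,
    "turn-taking decision": 50,
    "LLM time to first token": 250,
    "TTS time to first audio byte": 150,
    "network and audio buffering": 100,
}
print(f"{sum(budget_ms.values())} ms")  # 850 ms; shaving any single stage matters
```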

What you need to do: keep iterating on your core product to make it better, while we take care of the audio part. Here are the parts you need to work on:

  • Response generation: although we support demo creation on the dashboard, we realize that the response generation process can differ drastically from scenario to scenario. Our API therefore integrates easily with your custom response generation solution, whether it’s a simple OpenAI call or a complicated agent setup (a websocket skeleton follows this list).
  • Taking actions: for the voice agent to take action, you probably have to integrate with different tools (calendars, CRM, etc.). It’s up to you to decide when to take action and which action to take. Don’t forget to keep generating responses while taking actions.
  • Specific call logic: different use cases will transfer / end calls differently, and it’s up to you to decide details about call flow. This can be set up via function calling.
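
As a sketch of what plugging in custom response generation can look like, here is a bare websocket server in Python. The message shapes (“transcript” in, “response” out) and the my_agent function are invented for illustration; follow your provider’s actual protocol for the real thing.

```python
import asyncio
import json
import websockets  # pip install websockets (>= 10.1 for one-argument handlers)

def my_agent(text: str) -> str:
    """Placeholder: swap in RAG + LLM calls, rules, or a full agent here."""
    return f"You said: {text}"

async def handle_call(websocket):
    """Receive transcript events, send back responses (message shapes invented)."""
    async for raw in websocket:
        event = json.loads(raw)
        if event.get("type") != "transcript":
            continue                      # ignore other event types
        reply = my_agent(event["text"])
        await websocket.send(json.dumps({"type": "response", "text": reply}))

async def main():
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()            # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```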

I hope this blog gives you a high-level idea of how to build a great voice agent, and that my (shameless) pitch for Retell AI sheds light on how we can help. Happy building!
