Retell provides a sophisticated voice AI orchestration layer that seamlessly integrates frontier audio technologies to create natural, responsive voice interactions optimized for phone call conditions.
Traditional voice AI systems consist of three core components:
Speech-to-Text (STT): Converts spoken words into written text
Large Language Model (LLM): Processes and generates contextual responses
Text-to-Speech (TTS): Converts text responses into natural speech
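As a rough illustration, the three-stage cascade above can be sketched as a chain of function calls. This is a minimal sketch, not Retell's implementation: `run_turn` and the `fake_*` stages are hypothetical names, and real STT, LLM, and TTS providers expose their own (usually streaming) APIs.

```python
from typing import Callable

# Hypothetical stage signatures, used only for illustration.
Transcriber = Callable[[bytes], str]   # STT: audio -> text
Responder = Callable[[str], str]       # LLM: text -> text
Synthesizer = Callable[[str], bytes]   # TTS: text -> audio


def run_turn(audio_in: bytes, stt: Transcriber, llm: Responder, tts: Synthesizer) -> bytes:
    """Run one conversational turn through the three stages in sequence."""
    transcript = stt(audio_in)    # speech -> text
    reply_text = llm(transcript)  # text -> text
    return tts(reply_text)        # text -> speech


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Because each stage waits for the previous one to finish, the delays of all three stages add up within a single turn.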
Recently, Speech-to-Speech (S2S) models have emerged as a new building block in the voice AI stack. An S2S model understands audio input and generates audio output directly, without an intermediate text step.
However, simply wiring these building blocks together often produces interactions that are high-latency, unnatural, and easily disrupted by background noise, and that lack critical capabilities.
Retell’s orchestration layer addresses the challenges of optimizing real-time operations, managing scalable infrastructure, and ensuring human-like conversations. It organizes and connects the following systems:
Audio Models
Manages and scales the building blocks described above, so you do not need to worry about rate limits or latency
Multiple choices of models and providers to meet different use cases
Advanced status check and automatic fallback mechanisms to ensure minimal disruption to calls
Unified configuration options, providing flexibility with ease of use
Security and compliance propagated down to every underlying provider
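The automatic fallback idea in the list above can be sketched generically as trying providers in priority order until one succeeds. This is a minimal sketch under stated assumptions: `call_with_fallback`, `AllProvidersFailed`, and the two stand-in providers are hypothetical and do not reflect Retell's internal status checks or any real provider API.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical provider shape: a name plus a callable that may raise on failure.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_fallback(providers: Sequence[Provider], request: str) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, result) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(request)
        except Exception as exc:  # any failure triggers the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers: the primary is rate limited, the backup works.
def primary(text: str) -> str:
    raise TimeoutError("rate limited")


def backup(text: str) -> str:
    return text.upper()


used, result = call_with_fallback([("primary", primary), ("backup", backup)], "hi")
```

A production system would likely also track provider health over time rather than discovering each failure on a live request.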
Noise Management
Advanced streaming background noise filtering
Echo cancellation
Intelligent Endpointing & Turn-taking
Precise detection of speech completion
Context-aware turn-taking with configurable thresholds
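To make the configurable-threshold part concrete, here is a minimal sketch of silence-based endpointing over per-frame voice activity flags. It covers only the threshold aspect: the function name, the 20 ms frame size, and the 600 ms default are assumptions for illustration, and a context-aware system would also weigh what the user has said so far.

```python
from typing import Optional, Sequence


def find_endpoint(
    is_speech: Sequence[bool],
    frame_ms: int = 20,               # assumed frame duration
    silence_threshold_ms: int = 600,  # assumed, configurable threshold
) -> Optional[int]:
    """Return the index of the frame at which the turn is considered complete.

    The turn ends once the user has spoken and then stayed silent for at least
    silence_threshold_ms. Returns None if that never happens.
    """
    heard_speech = False
    silent_ms = 0
    for index, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_threshold_ms:
                return index
    return None


# 5 frames of leading silence, 10 frames of speech, 30 frames of trailing silence.
frames = [False] * 5 + [True] * 10 + [False] * 30
endpoint = find_endpoint(frames)
```

A lower threshold makes the agent respond sooner but risks cutting in during a natural pause; a higher one does the opposite.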
Dynamic Interruption Handling
Graceful handling of mid-conversation interruptions
Adaptive response timing based on user speech patterns
Configurable interruption sensitivity levels
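One way to picture a sensitivity setting is as a mapping from sensitivity to the amount of user speech required before the agent stops talking. The sketch below is purely illustrative: the function name, the 0 to 1 sensitivity scale, and the 1000 ms ceiling are assumptions, not Retell's actual parameters or algorithm.

```python
def should_interrupt(
    agent_speaking: bool,
    user_speech_ms: float,
    sensitivity: float,
    max_required_ms: float = 1000.0,  # assumed ceiling for illustration
) -> bool:
    """Decide whether detected user speech should cut off the agent.

    sensitivity is assumed to range from 0.0 (hard to interrupt) to 1.0
    (any detected speech interrupts).
    """
    if not agent_speaking or user_speech_ms <= 0:
        return False  # nothing to interrupt, or no user speech detected
    required_ms = max_required_ms * (1.0 - sensitivity)
    return user_speech_ms >= required_ms


long_utterance = should_interrupt(True, 600, 0.5)   # 600 ms >= 500 ms required
short_utterance = should_interrupt(True, 300, 0.5)  # 300 ms < 500 ms required
```

Requiring a minimum amount of speech helps keep brief noises or short acknowledgements from cutting the agent off.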
Reminders & Backchanneling
Reminders when the user is not responding
Backchanneling to keep the conversation engaging and natural
Background Sound
Background sound to create a more natural calling experience
Telephony Features
Voicemail detection, call transfer, and press digits (DTMF), integrated via function calling
End-call Criteria
End the call via a function call, or when the user stops responding
Maximum call time to prevent runaway charges
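The end-call criteria above can be sketched as a single check evaluated during the call. This is a minimal sketch: `CallState`, `EndCallPolicy`, the reason strings, and the default limits are hypothetical names and values chosen for illustration, not Retell configuration fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallState:
    elapsed_s: float                  # total call duration so far
    silence_s: float                  # time since the user last spoke
    end_call_requested: bool = False  # set when the end-call function is invoked


@dataclass
class EndCallPolicy:
    max_call_s: float = 3600.0   # assumed maximum call time
    max_silence_s: float = 30.0  # assumed limit on user silence


def end_reason(state: CallState, policy: EndCallPolicy) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    if state.end_call_requested:
        return "end_call_function"
    if state.elapsed_s >= policy.max_call_s:
        return "max_duration"
    if state.silence_s >= policy.max_silence_s:
        return "user_unresponsive"
    return None


policy = EndCallPolicy()
ongoing = end_reason(CallState(elapsed_s=120, silence_s=2), policy)
timed_out = end_reason(CallState(elapsed_s=3600, silence_s=0), policy)
```

Checking the explicit end-call request first means a deliberate hang-up is reported as such even if a limit is reached at the same moment.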