Orchestration Overview
Retell provides a voice AI orchestration layer that integrates frontier audio models to create natural, responsive voice interactions optimized for phone call conditions.
Building Blocks
Traditional voice AI systems consist of three core components:
- Speech-to-Text (STT): Converts spoken words into written text
- Large Language Model (LLM): Processes and generates contextual responses
- Text-to-Speech (TTS): Converts text responses into natural speech
Recently, Speech-to-Speech (S2S) models have emerged as a new building block in the voice AI stack. These models understand audio input and generate audio output directly, without generating text as an intermediate step.
However, simply wiring these building blocks together often produces interactions that are high-latency, unnatural, and easily disrupted by background noise, and that lack critical capabilities.
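To make the building blocks concrete, here is a minimal sketch of the cascaded STT, LLM, TTS arrangement described above. The class name `CascadedPipeline` and the three `fake_*` stages are hypothetical placeholders for illustration, not part of any Retell API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedPipeline:
    """Chains the three classic stages; each stage is a plain callable."""
    stt: Callable[[bytes], str]   # audio in -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Placeholder stages (hypothetical): real ones would call model providers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.respond(b"hello")
```

Because each stage waits for the previous one, their latencies add up, which is one reason a naive chain feels slow. In this framing, an S2S model would replace all three stages with a single audio-to-audio callable.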
Our Orchestration Solution
Retell’s orchestration layer addresses the challenges of optimizing real-time operations, managing scalable infrastructure, and keeping conversations human-like. It organizes and connects the following systems:
Audio Models
- Manages and scales the building blocks described in the previous section, so you do not need to worry about rate limits or latency
- Multiple choices of models and providers to meet different use cases
- Advanced status checks and automatic fallback mechanisms to ensure minimal disruption to calls
- Unified configuration options, providing flexibility with ease of use
- Security and compliance propagated down to every underlying provider
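The automatic fallback idea can be sketched as trying providers in priority order and moving on when one fails. This is an illustrative assumption about how such a mechanism might look, not Retell's implementation; `synthesize_with_fallback`, `AllProvidersFailed`, and the provider functions are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list has failed."""


def synthesize_with_fallback(
    text: str,
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (name, synthesize) pair in order; return the first success."""
    errors: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical providers: the primary is down, the backup works.
def primary_tts(text: str) -> bytes:
    raise TimeoutError("provider unavailable")


def backup_tts(text: str) -> bytes:
    return text.encode("utf-8")


used, audio = synthesize_with_fallback(
    "hello", [("primary", primary_tts), ("backup", backup_tts)]
)
```

Here `used` ends up as `"backup"`, so the call continues even though the first provider failed.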
Noise Management
- Advanced streaming background noise filtering
- Echo cancellation
Intelligent Endpointing & Turn-taking
- Precise detection of speech completion
- Context-aware turn-taking with configurable thresholds
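As a simplified illustration of a configurable endpointing threshold, the sketch below declares the user's turn finished once a run of silence following speech reaches a set duration. It assumes per-frame voice-activity flags are already available and is a plain silence timer, not the context-aware detection described above; `find_endpoint` and its parameters are hypothetical.

```python
import math
from typing import Iterable, Optional


def find_endpoint(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    silence_threshold_ms: int = 600,
) -> Optional[int]:
    """Return the frame index where the turn ends, or None if it has not ended.

    speech_flags holds one boolean per audio frame (True = speech detected).
    """
    frames_needed = math.ceil(silence_threshold_ms / frame_ms)
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# 100 ms of leading silence, 200 ms of speech, then 800 ms of silence.
flags = [False] * 5 + [True] * 10 + [False] * 40
endpoint = find_endpoint(flags, frame_ms=20, silence_threshold_ms=600)
```

A lower threshold makes the agent respond faster but risks cutting in during natural pauses; a higher one does the opposite.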
Dynamic Interruption Handling
- Graceful handling of mid-conversation interruptions
- Adaptive response timing based on user speech patterns
- Configurable interruption sensitivity levels
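One way to picture a configurable interruption sensitivity is as a mapping from a 0 to 1 setting to how much user speech must be heard, while the agent is talking, before the agent yields. The function below is a hypothetical sketch under that assumption; the name `should_interrupt`, the linear mapping, and the `max_required_ms` default are illustrative, not documented behavior.

```python
def should_interrupt(
    user_speech_ms: int,
    sensitivity: float,
    max_required_ms: int = 1000,
) -> bool:
    """Decide whether the agent should stop talking and yield the turn.

    sensitivity is in [0, 1]: 0 means never interrupt, 1 means any speech
    interrupts immediately.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    if sensitivity == 0.0:
        return False
    # Higher sensitivity lowers the amount of speech required.
    required_ms = (1.0 - sensitivity) * max_required_ms
    return user_speech_ms >= required_ms
```

For example, `should_interrupt(200, 0.9)` yields the turn after a short utterance, while `should_interrupt(200, 0.5)` treats the same utterance as too brief to count as an interruption.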
Reminders & Backchanneling
- Reminders when the user is not responding
- Backchanneling to keep the conversation engaging and natural
Background Sound
- Background sound to create a more natural calling experience
Telephony Features
- Telephony features such as voicemail detection, call transfer, and pressing digits (DTMF)
- Integrates seamlessly via function calling
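Exposing telephony actions through function calling can be sketched as a set of tool definitions the LLM may invoke, plus a dispatcher that routes each call to a handler. The tool names, schemas, and handlers below are hypothetical illustrations in a generic JSON Schema style, not Retell's actual definitions.

```python
import json

# Hypothetical tool definitions an LLM could be given.
TOOLS = [
    {
        "name": "transfer_call",
        "description": "Transfer the caller to another phone number.",
        "parameters": {
            "type": "object",
            "properties": {"number": {"type": "string"}},
            "required": ["number"],
        },
    },
    {
        "name": "press_digits",
        "description": "Send DTMF digits, e.g. to navigate a phone menu.",
        "parameters": {
            "type": "object",
            "properties": {"digits": {"type": "string"}},
            "required": ["digits"],
        },
    },
]


def transfer_call(number: str) -> str:
    return f"transferring to {number}"


def press_digits(digits: str) -> str:
    # Restrict to the common keypad symbols for this sketch.
    if not digits or not set(digits) <= set("0123456789*#"):
        raise ValueError(f"invalid DTMF digits: {digits!r}")
    return f"sending DTMF {digits}"


HANDLERS = {"transfer_call": transfer_call, "press_digits": press_digits}


def dispatch(tool_call_json: str) -> str:
    """Route a tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    handler = HANDLERS[call["name"]]
    return handler(**call["arguments"])


result = dispatch('{"name": "press_digits", "arguments": {"digits": "1#"}}')
```

Keeping the handlers behind a single dispatcher means a new telephony action only needs one tool definition and one handler entry.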
End-call Criteria
- End the call via a function call, or automatically when the user is not responding
- Maximum call time to prevent unexpected charges
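The end-call criteria above can be combined into a single check, sketched below. The `EndCallPolicy` fields, their default values, and the reason strings are assumptions chosen for illustration rather than documented settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndCallPolicy:
    max_call_ms: int = 3_600_000    # hypothetical cap: one hour
    max_silence_ms: int = 30_000    # hypothetical cap: 30 s without a reply


def end_call_reason(
    policy: EndCallPolicy,
    call_ms: int,
    silence_ms: int,
    end_call_requested: bool,
) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    if end_call_requested:              # the LLM invoked an end-call function
        return "function_call"
    if call_ms >= policy.max_call_ms:   # hard limit on total duration
        return "max_duration"
    if silence_ms >= policy.max_silence_ms:
        return "user_silent"
    return None


policy = EndCallPolicy()
reason = end_call_reason(policy, call_ms=120_000, silence_ms=45_000,
                         end_call_requested=False)
```

In this example `reason` is `"user_silent"`, because the caller has been quiet longer than the configured limit.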