Building Blocks
Traditional voice AI systems consist of three core components:
- Speech-to-Text (STT): Converts spoken words into written text
- Large Language Model (LLM): Processes and generates contextual responses
- Text-to-Speech (TTS): Converts text responses into natural speech
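The three components above form a pipeline that handles one conversational turn at a time. The following is a minimal sketch of that flow, not any provider's actual API: the `run_turn` function, the type aliases, and the `fake_*` stand-in stages are all hypothetical names introduced for illustration.

```python
from typing import Callable

# Hypothetical stage signatures; real STT, LLM, and TTS providers expose their own APIs.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """Process one conversational turn: caller audio in, agent audio out."""
    transcript = stt(audio_in)    # STT: spoken words -> written text
    reply_text = llm(transcript)  # LLM: text -> contextual response
    return tts(reply_text)        # TTS: response text -> speech audio


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    print(run_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

In a real deployment each stage would stream partial results rather than wait for the previous stage to finish, which is part of what an orchestration layer has to manage.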
Our Orchestration Solution
Retell’s orchestration layer addresses the challenges of optimizing real-time operations, managing scalable infrastructure, and ensuring human-like conversations. It organizes and connects the following systems:
Audio Models
- Manages and scales the building blocks described in the previous section, so you do not need to worry about rate limits or latency
- Multiple choices of models and providers to meet different use cases
- Advanced status checks and automatic fallback mechanisms to minimize disruption to calls
- Unified configuration options, providing flexibility with ease of use
- Security and compliance propagated down to every underlying provider
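The automatic fallback idea in the list above can be illustrated with a small sketch. It assumes each provider is a plain callable that raises an exception when it is unavailable. `call_with_fallback` and `AllProvidersFailed` are hypothetical names, and this is not how Retell implements fallback internally.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    text: str,
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, result) from the first that succeeds."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(text)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

A production system would typically combine this with periodic status checks, so that a provider known to be down is skipped instead of being retried on every call.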
Noise Management
- Advanced streaming background noise filtering
- Echo cancellation
Intelligent Endpointing & Turn-taking
- Precise detection of speech completion
- Context-aware turn-taking with configurable thresholds
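To make the idea of a configurable endpointing threshold concrete, here is a deliberately simple silence-timeout sketch. It is not the context-aware approach described above: it only counts trailing silence after speech has been heard. The function name, the 20 ms frame size, and the 600 ms default are assumptions chosen for illustration.

```python
from typing import Iterable, Optional


def detect_end_of_turn(
    frames_have_speech: Iterable[bool],
    frame_ms: int = 20,
    silence_threshold_ms: int = 600,
) -> Optional[int]:
    """Return the index of the frame at which the turn is considered finished, or None.

    Each input item says whether a voice-activity detector found speech in that frame.
    """
    heard_speech = False
    silence_ms = 0
    for i, has_speech in enumerate(frames_have_speech):
        if has_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                return i
    return None
```

A lower threshold makes the agent respond faster but risks cutting in during a natural pause. A higher threshold is safer but adds latency to every turn.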
Dynamic Interruption Handling
- Graceful handling of mid-conversation interruptions
- Adaptive response timing based on user speech patterns
- Configurable interruption sensitivity levels
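One way to picture a configurable interruption sensitivity is as a mapping from a 0.0 to 1.0 setting to how long the user must speak over the agent before the agent stops talking. The sketch below assumes a linear mapping and a 500 ms ceiling. Both are illustrative choices, and `should_interrupt` is a hypothetical helper, not a documented function.

```python
def should_interrupt(
    user_speech_ms: float,
    sensitivity: float,
    max_required_ms: float = 500.0,
) -> bool:
    """Decide whether the agent should stop speaking because the user is talking.

    sensitivity: 0.0 disables interruption; 1.0 interrupts on any detected speech.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    if sensitivity == 0.0 or user_speech_ms <= 0:
        return False
    # Higher sensitivity means less overlapping speech is required.
    required_ms = (1.0 - sensitivity) * max_required_ms
    return user_speech_ms >= required_ms
```

With these assumed defaults, a sensitivity of 0.8 requires roughly 100 ms of user speech, while 0.2 requires roughly 400 ms, so short acknowledgements such as "uh-huh" are less likely to cut the agent off at lower settings.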
Reminders & Backchanneling
- Reminders when the user is not responding
- Backchanneling to keep the conversation engaging and natural
Background Sound
- Background sound to create a more natural calling experience
Telephony Features
- Voicemail detection, call transfer, and press digits (DTMF)
- Integrates seamlessly via function calling
End-call Criteria
- End the call via a function call, or automatically when the user is not responding
- Maximum call time to prevent unexpected charges
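The end-call criteria above can be read as a small decision function evaluated throughout the call. The sketch below is one possible shape under assumed names and defaults: `end_call_reason`, the reason strings, the one-hour maximum, and the 30-second silence limit are all hypothetical.

```python
from typing import Optional


def end_call_reason(
    elapsed_s: float,
    user_silence_s: float,
    end_call_requested: bool,
    max_call_s: float = 3600.0,
    max_silence_s: float = 30.0,
) -> Optional[str]:
    """Return why the call should end now, or None if it should continue."""
    if end_call_requested:  # the LLM invoked an end-call function
        return "function_call"
    if elapsed_s >= max_call_s:  # hard cap on call duration
        return "max_duration"
    if user_silence_s >= max_silence_s:  # the user has stopped responding
        return "user_silent"
    return None
```

Checking the duration cap independently of the other criteria means a call still terminates even if the model never requests it and the line never goes quiet.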