Latency is a core problem in conversational AI, and for some use cases it’s important to keep it as low as possible to ensure a smooth conversation. This doc covers which configuration options impact latency, and what you can do to reduce latency in your agent.

What Impacts Latency

There are a few factors that impact latency in your agent:

LLM Response Time

The time it takes for the LLM to generate a response. This is usually the biggest factor in latency.

To minimize latency here, choose a faster (smaller) LLM model like gpt-3.5-turbo or claude-haiku. Note that during periods of heavier traffic, a model provider’s API can slow down and show more variation in latency.

Longer prompts (including tool call definitions) also lead to longer response times, so keep the prompt short and concise.
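
As an illustration, here is a minimal sketch of an LLM configuration tuned for low latency. The field names ("model", "general_prompt", "tools") are assumptions for illustration, not the exact API schema:

    # A minimal sketch of an LLM config tuned for low latency.
    # Field names here are illustrative assumptions, not the exact schema.
    llm_config = {
        # Smaller models generate responses faster than larger ones.
        "model": "gpt-3.5-turbo",
        # Keep the prompt short and concise: long prompts increase
        # response time.
        "general_prompt": (
            "You are a concise booking assistant. "
            "Answer in one or two sentences."
        ),
        # Tool definitions are sent along with the prompt, so trimming
        # unused tools also shortens the effective prompt.
        "tools": [],
    }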

Audio Generation Time

The time it takes to synthesize audio from the text response varies with the choice of TTS provider.

Here’s a general latency comparison of different TTS providers:

  • ElevenLabs: ~450ms
  • OpenAI: ~650ms
  • Deepgram: ~300ms
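
If your use case is latency sensitive, pick a faster provider. A hedged sketch of a voice configuration follows; the field names and voice ID are placeholders, not the exact schema:

    # Hypothetical voice config: field names and values are placeholders.
    voice_config = {
        # Of the providers compared above, Deepgram currently has the
        # lowest synthesis latency.
        "tts_provider": "deepgram",
        "voice_id": "example-voice-id",
    }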

Language

Languages with more traffic (en, es, multi) will have better latency compared to other languages.

Responsiveness

This setting controls how responsive the agent is. If it is set to a lower value, the agent tends to wait longer before speaking, which can increase perceived latency.
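
Putting the two settings above together, a hedged sketch; the field names and the responsiveness value range are assumptions:

    # Hypothetical agent settings: names and value range are assumptions.
    agent_settings = {
        # Higher-traffic languages (en, es, multi) get better latency.
        "language": "en",
        # A higher value makes the agent start speaking sooner after
        # the user stops talking.
        "responsiveness": 1.0,
    }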

Features that Add Latency

Some of the features we provide add latency to the overall pipeline because they require additional processing. If you want to reduce latency, consider avoiding these features; a configuration sketch that leaves them all off appears at the end of this section.

Audio Speed Adjustment

This feature controls how fast the agent speaks. It adds ~50ms of additional processing time.

Normalize Text for Speech

This feature converts numbers, dates, and other entities into spoken form for more consistent speech synthesis. It adds ~75ms of additional processing time.

Boosted Keywords & Disable Transcript Formatting

These features will add around 300-500ms of additional processing time.
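
Here is the sketch referenced above: a configuration that leaves all of these latency-adding features off. The flag names are illustrative assumptions, not the exact schema:

    # Hypothetical feature flags: names are illustrative assumptions.
    latency_saving_settings = {
        # Skip audio speed adjustment: saves ~50ms.
        "enable_audio_speed_adjustment": False,
        # Skip text normalization: saves ~75ms.
        "normalize_for_speech": False,
        # No boosted keywords: avoids ~300-500ms of extra processing.
        "boosted_keywords": [],
        # Keep transcript formatting on (do not disable it), since the
        # disable option itself adds ~300-500ms.
        "disable_transcript_formatting": False,
    }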

V1 vs V2 APIs

V2 APIs are generally faster than V1 APIs, as more optimizations are baked in.
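
In practice this usually just means pointing your client at the V2 endpoints where they exist. The URLs below are placeholders, not the real hosts:

    # Placeholder base URLs: substitute the real hosts from the API docs.
    BASE_URL_V1 = "https://api.example.com/v1"  # legacy, slower
    BASE_URL_V2 = "https://api.example.com/v2"  # preferred, more optimized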

Network

Our servers are deployed mostly in the US West region. Locating your resources closer to our servers can reduce network latency.
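
To estimate how much your network path contributes, you can time a few round trips from your infrastructure to the API host. A rough sketch using only the Python standard library; the URL is a placeholder:

    import time
    import urllib.request

    URL = "https://api.example.com/health"  # placeholder endpoint

    # Time several round trips to estimate network + server overhead.
    samples = []
    for _ in range(5):
        start = time.perf_counter()
        urllib.request.urlopen(URL, timeout=5).read()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    print(f"median round trip: {samples[len(samples) // 2]:.0f} ms")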