Latency is a core concern in conversational AI, and for many use cases it’s important to keep it as low as possible to ensure a smooth conversation. This doc covers which configuration options impact latency, and what you can do to reduce latency in your agent.

What Impacts Latency

There are a few factors that impact latency in your agent:

LLM Response Time

The time it takes for the LLM to generate a response. This is usually the biggest contributor to latency.

To minimize latency here, choose a faster (smaller) LLM model like gpt-3.5-turbo or claude-haiku. Note that under heavier traffic, the model provider’s API can slow down and latency becomes more variable.

Longer prompts (including tool definitions) also lead to longer response times, so keep the prompt short and concise.
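As an illustration, here’s a minimal sketch of switching an agent to a faster model with a trimmed prompt. The endpoint, agent ID, and field names are hypothetical placeholders, not a real schema; check your platform’s API reference for the actual names.

```python
import requests

# All field names below are illustrative assumptions, not a confirmed schema.
LOW_LATENCY_LLM_CONFIG = {
    "llm_model": "gpt-3.5-turbo",  # smaller, faster model
    # A short, focused prompt means fewer input tokens to process.
    "prompt": "You are a concise booking assistant. Answer in 1-2 sentences.",
    # Fewer tool definitions also means a shorter effective prompt.
    "tools": ["check_availability"],
}

# Hypothetical endpoint and agent ID, shown only to illustrate the shape of the call.
resp = requests.patch(
    "https://api.example.com/v1/agents/agent_123",
    json=LOW_LATENCY_LLM_CONFIG,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=10,
)
resp.raise_for_status()
```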

Audio Generation Time

The time it takes to synthesize audio from the text response. The choice of TTS provider has the largest impact here.

Here’s a general latency comparison of different TTS providers (see the sketch after this list):

  • ElevenLabs: ~450ms
  • OpenAI: ~650ms
  • Deepgram: ~300ms
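To make the comparison concrete, here’s a small Python sketch that picks the provider with the lowest approximate latency from the numbers above. The provider keys are illustrative identifiers, not official SDK names, and real-world latency varies with load and region.

```python
# Approximate median TTS latencies (ms), taken from the list above.
TTS_LATENCY_MS = {
    "deepgram": 300,
    "elevenlabs": 450,
    "openai": 650,
}

def fastest_tts(latencies: dict[str, int] = TTS_LATENCY_MS) -> str:
    """Return the provider name with the lowest approximate latency."""
    return min(latencies, key=latencies.get)

print(fastest_tts())  # -> "deepgram"
```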

Responsiveness

This controls how quickly the agent responds after the user stops speaking. Lower values make the agent wait longer before speaking, which increases latency.
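For example, you might raise responsiveness to favor low latency. The field name and 0–1 scale below are assumptions for illustration, not a confirmed schema:

```python
# Hypothetical agent settings -- field name and scale are assumptions.
agent_config = {
    # Higher values make the agent start speaking sooner after the user stops.
    "responsiveness": 0.9,
}
```

Note the tradeoff: a very responsive agent may start talking while a user is only pausing mid-sentence, so tune this for your audience rather than simply maximizing it.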

Features that Add Latency

Some features add latency to the overall pipeline because they require additional processing. If you want to minimize latency, avoid enabling these features (a combined configuration sketch follows the last feature below).

Audio Speed Adjustment

This feature controls how fast the agent speaks. It adds ~50ms of additional processing time.

Ambient Sound

This feature adds ambient environment sound to the call to make the experience more realistic. It adds ~75ms of additional processing time.

Normalize Text for Speech

This feature converts numbers, dates, and other entities into their spoken form for more consistent speech synthesis. It adds ~75ms of additional processing time.
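If you need the lowest possible latency, you can leave all three features off. The sketch below shows what that might look like; the field names are assumptions, not a real schema.

```python
# Hypothetical low-latency overrides -- field names are illustrative.
LOW_LATENCY_OVERRIDES = {
    "audio_speed": None,             # skip speed adjustment   (~50ms saved)
    "ambient_sound": None,           # skip ambient sound      (~75ms saved)
    "normalize_for_speech": False,   # skip text normalization (~75ms saved)
}
```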