You have selected a voice for your agent, but there’s more to how your agent speaks that you might want to customize, such as how certain words are pronounced and where to pause. This doc covers how you can control and direct the speech generated by the agent.

Some of the features listed here are processed sequentially, so you can combine them to achieve the desired effect. When multiple features are used, they are applied in this order:

  1. Pauses: Adds a pause wherever " - " appears in the text
  2. Normalize text for speech: Converts numbers, dates, and other entities into spoken form
  3. Pronunciation: Pronounces words in a certain way

Normalize Text for Speech

Normalizes certain parts of the text (numbers, currency, dates, etc.) into their spoken form for more consistent speech synthesis. Without this step, the voice synthesis system itself might sometimes read the raw text incorrectly.

For example, before audio generation starts, it converts the first line below into the second:

Call my number 2137112342 on Jul 5th, 2024 for the $24.12 payment


Call my number two one three seven one one two three four two on july fifth, twenty twenty four for the twenty four dollars twelve cents payment

Note that this feature adds a bit of latency (~100ms) to the whole process.
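
To make the idea concrete, here is a minimal Python sketch of one small piece of such a normalization: spelling out long digit runs (like phone numbers) digit by digit. This is only an illustration and not the platform's actual implementation; the function name `spell_out_digit_runs` and the 7-digit threshold are assumptions made for this example, and dates and currency are left untouched.

```python
import re

# Illustrative only: spoken names for the digits 0-9.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_out_digit_runs(text: str, min_len: int = 7) -> str:
    """Replace runs of at least `min_len` digits with digit-by-digit words.

    Hypothetical helper: the threshold is an assumption chosen so that
    years and prices are not affected.
    """
    def replace(match: re.Match) -> str:
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())

    # Builds a pattern such as [0-9]{7,} for min_len=7.
    return re.sub(rf"[0-9]{{{min_len},}}", replace, text)


print(spell_out_digit_runs("Call my number 2137112342 on Jul 5th, 2024"))
```

A full normalizer would also need to handle dates, ordinals, and currency, which is considerably more involved than this sketch.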

Add Pauses (How to Read Slowly)

Although you can adjust the overall speed of the audio by changing the voice speed, you might want to slow down the agent’s speech only at certain points (such as when reading phone numbers). You can do this by prompting the LLM to generate text with " - " between the items (note that the spaces around the hyphen are required):

The number is 2 - 1 - 3 - 4

Note that this feature requires the text generated by the LLM to already contain the hyphens. Text normalization does not insert them.
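
If the number is known to your own code before it reaches the LLM (for example, when you pass it into the prompt as a variable), one option is to pre-format it with the pause marker yourself. The following Python sketch shows this; the helper name `with_pauses` is hypothetical, and whether pre-formatting fits your setup is an assumption.

```python
def with_pauses(number: str) -> str:
    """Join the digits of `number` with ' - ' so a pause follows each digit.

    Non-digit characters (spaces, parentheses, dashes) are dropped first.
    """
    digits = [ch for ch in number if ch.isdigit()]
    # The spaces around the hyphen are required for the pause to be detected.
    return " - ".join(digits)


print("The number is " + with_pauses("2134"))
```

The same helper works on formatted input such as "(213) 711-2342", since only the digits are kept.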


Pronunciation

This feature only works for English agents using 11labs voices.

You can also control how certain words are pronounced. This is useful when you want to make sure certain uncommon words are pronounced correctly.

To use the feature, you simply set a pronunciation dictionary for the agent, which consists of:

  • the word to be pronounced. For example, actually
  • the phonetic alphabet to use. Right now, ipa and cmu are supported.
  • the phonetic pronunciation of the word. For example, æktʃuəli
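
As a rough illustration, the three parts above could be modeled in Python as shown below. The field names `word`, `alphabet`, and `phoneme`, the class name, and the list shape are assumptions made for this sketch, so check the API reference for the exact schema your agent expects.

```python
from dataclasses import dataclass, asdict

# The two alphabets listed above as currently supported.
SUPPORTED_ALPHABETS = {"ipa", "cmu"}


@dataclass(frozen=True)
class PronunciationEntry:
    """One dictionary entry; field names are hypothetical."""
    word: str      # the word to be pronounced
    alphabet: str  # the phonetic alphabet: "ipa" or "cmu"
    phoneme: str   # the phonetic pronunciation in that alphabet

    def __post_init__(self) -> None:
        # Reject entries that could not be valid before sending them anywhere.
        if self.alphabet not in SUPPORTED_ALPHABETS:
            raise ValueError(f"unsupported alphabet: {self.alphabet!r}")
        if not self.word or not self.phoneme:
            raise ValueError("word and phoneme must be non-empty")


# A dictionary with a single entry, using the example values from the list.
pronunciation_dictionary = [
    asdict(PronunciationEntry("actually", "ipa", "æktʃuəli")),
]
print(pronunciation_dictionary)
```

Validating the alphabet locally catches typos such as "IPA" or "arpabet" early, before the dictionary is attached to an agent.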

You can search online to find the phonetic pronunciation of a word, or use tools like: