Speech Controllability
Control and direct how the speech is generated.
You have selected a voice for your agent, but you may want to customize more of how it speaks, such as how certain words are pronounced or where it pauses. This doc covers how to control and direct the speech the agent generates.
Some of the features listed here are processed sequentially, so you can combine them to achieve the desired effect. If multiple features are used, they are processed in this order:
- Pauses: Adds a pause wherever " - " is found in the text
- Normalize text for speech: Converts numbers, dates, and other entities into spoken form
- Pronunciation: Pronounces words in a certain way
Normalize Text for Speech
Converts certain parts of the text (numbers, currency, dates, etc.) to their spoken form for more consistent speech synthesis. Given the raw text, the voice synthesis system itself sometimes reads these incorrectly.
For example, before audio generation starts, it converts
Call my number 2137112342 on Jul 5th, 2024 for the $24.12 payment
to
Call my number two one three seven one one two three four two on july fifth, twenty twenty four for the twenty four dollars twelve cents payment
Note that this feature adds a bit of latency (~100ms) to the whole process.
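To make the idea concrete, here is a minimal sketch of one kind of normalization: spelling out long digit runs such as phone numbers. It is not the platform's implementation; the function name spell_out_long_digit_runs and the min_length threshold are assumptions made for illustration, and dates and currency amounts are deliberately left untouched.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_out_long_digit_runs(text: str, min_length: int = 7) -> str:
    """Replace runs of at least `min_length` ASCII digits with digit words.

    Hypothetical helper: shorter runs (years, prices) are left as-is because
    they need different rules than phone-number-like sequences.
    """
    def replace(match: re.Match) -> str:
        # Read the run digit by digit, e.g. "213" -> "two one three".
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())

    return re.sub(rf"[0-9]{{{min_length},}}", replace, text)


print(spell_out_long_digit_runs("Call my number 2137112342 on Jul 5th, 2024"))
```

Only the ten-digit run is rewritten here; "5th" and "2024" are shorter than the threshold, so the sketch leaves them for other rules.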
Add Pauses (How to Read Slowly)
Although you can adjust the overall speed of the audio by changing the voice speed, you might want to slow down the agent’s speech only at certain points (like when reading phone numbers). You can do this by prompting the LLM to generate text with " - " in between (note that the spaces around the hyphen are important).
Note that this feature requires the text generated by the LLM to already contain the hyphens; text normalization does not add them.
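As an illustration of the target format, the sketch below builds a phone number string with " - " between digits. This could be used, for instance, to pre-format a number before placing it in the prompt so the LLM can repeat it verbatim; that workflow and the helper name with_pauses are assumptions, not part of the product.

```python
def with_pauses(number: str, separator: str = " - ") -> str:
    """Join the digits of `number` with the pause marker.

    Hypothetical helper: non-digit characters (dashes, spaces, parentheses)
    are dropped so only the digits are separated by " - ".
    """
    digits = [ch for ch in number if ch in "0123456789"]
    return separator.join(digits)


print(with_pauses("213-711-2342"))
# 2 - 1 - 3 - 7 - 1 - 1 - 2 - 3 - 4 - 2
```

The separator keeps a space on each side of the hyphen, which is the form the pause feature looks for.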
Pronunciation
You can also control how certain words are pronounced. This is useful when you want to make sure certain uncommon words are pronounced correctly.
To use this feature, set a pronunciation dictionary for the agent. Each entry consists of:
- the word to be pronounced. For example, actually
- the phonetic alphabet to use. Right now, ipa and cmu are supported.
- the phonetic pronunciation of the word. For example, æktʃuəli
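The three parts of an entry can be sketched as a small data structure. The field names word, alphabet, and phoneme, and the helper make_pronunciation_entry, are assumptions for illustration; check the agent configuration reference for the exact names the platform expects.

```python
SUPPORTED_ALPHABETS = {"ipa", "cmu"}  # the two alphabets listed above


def make_pronunciation_entry(word: str, alphabet: str, phoneme: str) -> dict:
    """Build one pronunciation dictionary entry (hypothetical field names)."""
    if alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    if not word or not phoneme:
        raise ValueError("word and phoneme must be non-empty")
    return {"word": word, "alphabet": alphabet, "phoneme": phoneme}


# A dictionary is a list of entries, one per word.
pronunciation_dictionary = [
    make_pronunciation_entry("actually", "ipa", "æktʃuəli"),
]
print(pronunciation_dictionary)
```

Validating the alphabet up front catches typos such as "IPA" or "arpabet" before the dictionary is sent anywhere.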
You can search online to find the phonetic pronunciation of a word, or use tools like: