Understand user speech better
Balance between transcription accuracy and latency
Guide on how to select the right transcription mode for your agent.
This guide applies only to cascading agents. If you are using speech-to-speech models, this feature does not apply.
Real-time transcription is often a trade-off between latency and accuracy. Relying on interim results gives the lowest latency but a higher chance of errors, because the recognizer has less context to work with. Relying on results generated with more context improves accuracy, but you risk waiting longer after the user stops speaking.
Transcription modes
- optimize for speed: uses the latest interim results with a low endpointing setting for downstream processing.
- optimize for accuracy: uses results produced with a higher endpointing setting for downstream processing, waiting longer to gather more context and generate more accurate transcripts. This adds roughly 200 ms of latency.
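The difference between the two modes can be sketched as choosing which recognizer result is handed to downstream processing once an endpointing window closes. The following is a minimal illustration, not the product's implementation: the names `TranscriptionMode`, `ENDPOINTING_MS`, `TranscriptEvent`, and `select_transcript` are hypothetical, the absolute window values are invented, and only the roughly 200 ms gap between the modes comes from this guide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TranscriptionMode(Enum):
    OPTIMIZE_FOR_SPEED = "optimize_for_speed"
    OPTIMIZE_FOR_ACCURACY = "optimize_for_accuracy"


# Hypothetical endpointing windows in milliseconds after the user stops speaking.
# Only the ~200 ms difference reflects the guide; the absolute values are made up.
ENDPOINTING_MS = {
    TranscriptionMode.OPTIMIZE_FOR_SPEED: 100,
    TranscriptionMode.OPTIMIZE_FOR_ACCURACY: 300,
}


@dataclass
class TranscriptEvent:
    at_ms: int  # emission time relative to end of speech (negative = during speech)
    text: str


def select_transcript(
    events: List[TranscriptEvent], mode: TranscriptionMode
) -> Optional[TranscriptEvent]:
    """Return the latest result emitted before the mode's endpointing window closes."""
    deadline = ENDPOINTING_MS[mode]
    available = [e for e in events if e.at_ms <= deadline]
    if not available:
        return None
    return max(available, key=lambda e: e.at_ms)


# Illustrative recognizer output for one utterance (invented, not benchmark data).
events = [
    TranscriptEvent(at_ms=-400, text="my code is for"),
    TranscriptEvent(at_ms=50, text="my code is for one five"),
    TranscriptEvent(at_ms=250, text="my code is four one five"),
]

fast = select_transcript(events, TranscriptionMode.OPTIMIZE_FOR_SPEED)
accurate = select_transcript(events, TranscriptionMode.OPTIMIZE_FOR_ACCURACY)
```

In this sketch the speed mode forwards the result emitted at 50 ms, while the accuracy mode waits for the later, revised result at the cost of a longer window.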
Which mode to use?
From our benchmarking, the optimize for speed mode and the optimize for accuracy mode have similar WER (Word Error Rate). The difference lies mainly in capturing entities such as numbers and dates. If your use case relies heavily on capturing these entities correctly, use the optimize for accuracy mode. Otherwise, use the optimize for speed mode for the best latency.
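To see why similar WER can still hide a difference in entity capture, it helps to compute WER directly. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the example sentences are illustrative assumptions, not data from the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: both hypotheses contain exactly one wrong word.
reference = "please call me back on march fifteenth at four thirty"
entity_error = "please call me back on march fifteenth at for thirty"
filler_error = "please call me back in march fifteenth at four thirty"

wer_entity = word_error_rate(reference, entity_error)
wer_filler = word_error_rate(reference, filler_error)
```

Both hypotheses score the same WER, yet only the first one corrupts the time the caller asked for. This is the kind of error that an aggregate WER figure does not distinguish and that matters when your agent must act on numbers or dates.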