This guide applies only to cascading agents; if you are using speech-to-speech models, this feature does not apply.

Real-time transcription involves a trade-off between latency and accuracy. Relying on interim results gives the lowest latency but a higher chance of errors, because the model has less context. Relying on results generated with more context means waiting longer after the user stops speaking.
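To make the trade-off concrete, here is a minimal sketch (not the platform's actual implementation) of how an endpointing setting delays finalization: the recognizer keeps replacing interim results as speech events arrive, then commits the transcript only after a configurable silence window has passed.

```python
def finalize_transcript(events, endpoint_ms):
    """Return (transcript, finalized_at_ms) given recognizer events.

    `events` is a list of (timestamp_ms, partial_text) pairs; each
    interim result supersedes the previous one. The result is declared
    final `endpoint_ms` milliseconds after the last speech event, so
    the endpointing setting is pure added latency after the user stops.
    """
    transcript = ""
    last_ts = 0
    for ts, text in events:
        transcript = text  # newer interim result replaces the old one
        last_ts = ts
    return transcript, last_ts + endpoint_ms

# User stops speaking at t=900 ms; interim results arrive while speaking.
events = [(100, "call"), (500, "call me to"), (900, "call me tomorrow")]

fast_text, fast_done = finalize_transcript(events, endpoint_ms=100)
slow_text, slow_done = finalize_transcript(events, endpoint_ms=300)
# Both modes produce the same text in this toy example; the accuracy
# mode simply waits longer (300 ms vs 100 ms) before committing.
```

In practice the longer window also lets the recognizer revise the interim hypothesis with more acoustic context, which is where the accuracy gain comes from.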

Transcription modes

  • optimize for speed: uses the latest interim results with a low endpointing setting for downstream processing, minimizing latency.
  • optimize for accuracy: uses results produced with a higher endpointing setting for downstream processing, essentially waiting longer so the model has more context and can generate more accurate transcripts. This adds roughly 200 ms of latency.

Which mode to use?

From our benchmarking, the optimize for speed mode and the optimize for accuracy mode have similar WER (word error rate). The difference mainly lies in capturing entities such as numbers and dates. If your use case relies heavily on capturing these entities correctly, use the optimize for accuracy mode. Otherwise, use the optimize for speed mode for the best latency.
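The decision rule above can be captured in a small helper. The mode names here are illustrative placeholders, so match them to whatever identifiers your agent configuration actually uses.

```python
def pick_transcription_mode(captures_entities: bool) -> str:
    """Choose a transcription mode per the benchmarking guidance.

    Entity-heavy use cases (numbers, dates) benefit from the extra
    context of the accuracy mode; everything else should take the
    latency win of the speed mode. Mode strings are hypothetical.
    """
    return "optimize_for_accuracy" if captures_entities else "optimize_for_speed"

# E.g. an agent that collects phone numbers or appointment dates:
mode = pick_transcription_mode(captures_entities=True)
```

For a typical conversational agent with no critical entity capture, passing `captures_entities=False` keeps the default low-latency behavior.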