This guide applies only to cascading agents; if you are using speech-to-speech models, this feature does not apply.

Real-time transcription involves a trade-off between latency and accuracy. Relying on interim results gives the lowest latency but a higher chance of errors, because the model has less context. Relying on results generated with more context means waiting longer after the user stops speaking.
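To make the trade-off concrete, here is a minimal sketch (not the platform's actual implementation) of how an endpointing setting delays finalization: the recognizer keeps replacing interim results as speech events arrive, then commits the transcript only after a configurable silence window has passed.

```python
def finalize_transcript(events, endpoint_ms):
    """Return (transcript, finalized_at_ms) given recognizer events.

    `events` is a list of (timestamp_ms, partial_text) pairs; each
    interim result supersedes the previous one. The result is declared
    final `endpoint_ms` milliseconds after the last speech event, so
    the endpointing setting is pure added latency after the user stops.
    """
    transcript = ""
    last_ts = 0
    for ts, text in events:
        transcript = text  # newer interim result replaces the old one
        last_ts = ts
    return transcript, last_ts + endpoint_ms

# User stops speaking at t=900 ms; interim results arrive while speaking.
events = [(100, "call"), (500, "call me to"), (900, "call me tomorrow")]

fast_text, fast_done = finalize_transcript(events, endpoint_ms=100)
slow_text, slow_done = finalize_transcript(events, endpoint_ms=300)
# Both modes produce the same text in this toy example; the accuracy
# mode simply waits longer (300 ms vs 100 ms) before committing.
```

In practice the longer window also lets the recognizer revise the interim hypothesis with more acoustic context, which is where the accuracy gain comes from.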

Transcription modes

  • optimize for speed: uses the latest interim results with a low endpointing setting for downstream processing, minimizing latency.
  • optimize for accuracy: uses results produced with a higher endpointing setting for downstream processing, essentially waiting longer so the model has more context and can generate more accurate transcripts. This adds roughly 200 ms of latency.

Which mode to use?

From our benchmarking, the optimize for speed mode and the optimize for accuracy mode have similar WER (word error rate). The difference mainly lies in capturing entities such as numbers and dates. If your use case relies heavily on capturing these entities correctly, use the optimize for accuracy mode. Otherwise, use the optimize for speed mode for the best latency.
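The decision rule above can be captured in a small helper. The mode names here are illustrative placeholders, so match them to whatever identifiers your agent configuration actually uses.

```python
def pick_transcription_mode(captures_entities: bool) -> str:
    """Choose a transcription mode per the benchmarking guidance.

    Entity-heavy use cases (numbers, dates) benefit from the extra
    context of the accuracy mode; everything else should take the
    latency win of the speed mode. Mode strings are hypothetical.
    """
    return "optimize_for_accuracy" if captures_entities else "optimize_for_speed"

# E.g. an agent that collects phone numbers or appointment dates:
mode = pick_transcription_mode(captures_entities=True)
```

For a typical conversational agent with no critical entity capture, passing `captures_entities=False` keeps the default low-latency behavior.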