Normalizes certain parts of the text (numbers, currency, dates, etc.) to their spoken form for more consistent speech synthesis. Without this step, the voice synthesis system may sometimes misread the raw text.

For example, before starting audio generation, it will convert

Call my number 2137112342 on Jul 5th, 2024 for the $24.12 payment

to

Call my number two one three seven one one two three four two on july fifth, twenty twenty four for the twenty four dollars twelve cents payment

Note that this feature adds a small amount of latency (about 100 ms) to the overall process.
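
To make the example conversion above concrete, here is a minimal Python sketch of rule-based normalization. It is not the actual implementation behind this feature. The function name `normalize_for_speech`, the regular expressions, and the helper names are all assumptions made for illustration. It only covers the three patterns from the example: long digit runs read digit by digit, dates of the form "Mon D, YYYY", and dollar amounts under $100.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
MONTHS = ("january february march april may june july august september "
          "october november december").split()
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def under_100(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def ordinal(n):
    """Spell out an ordinal in the range 1..99, e.g. 5 -> 'fifth'."""
    words = under_100(n).split()
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return " ".join(words[:-1] + [last])


def year(n):
    """Read a four-digit year as two pairs (naive: 2000 becomes 'twenty hundred')."""
    hi, lo = divmod(n, 100)
    if lo == 0:
        return under_100(hi) + " hundred"
    if lo < 10:
        return under_100(hi) + " oh " + ONES[lo]
    return under_100(hi) + " " + under_100(lo)


def _currency(m):
    dollars, cents = int(m.group(1)), int(m.group(2) or 0)
    if dollars >= 100:  # out of scope for this sketch: leave the text unchanged
        return m.group(0)
    out = f"{under_100(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        out += f" {under_100(cents)} cent{'' if cents == 1 else 's'}"
    return out


def _date(m):
    word, day, yr = m.group(1).lower(), int(m.group(2)), int(m.group(3))
    # Accept a full month name or an abbreviation of it ("jul", "sept").
    month = next((name for name in MONTHS if name.startswith(word)), None)
    if month is None or not 1 <= day <= 31:
        return m.group(0)
    return f"{month} {ordinal(day)}, {year(yr)}"


def _digits(m):
    return " ".join(ONES[int(d)] for d in m.group(0))


def normalize_for_speech(text):
    """Hypothetical normalizer covering only the patterns described above."""
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?(?!\d)", _currency, text)
    text = re.sub(r"\b([A-Za-z]{3,9})\.? (\d{1,2})(?:st|nd|rd|th)?, (\d{4})\b",
                  _date, text)
    # Runs of 7+ digits are treated as phone-like and read one digit at a time.
    text = re.sub(r"(?<!\$)\b\d{7,}\b", _digits, text)
    return text


print(normalize_for_speech(
    "Call my number 2137112342 on Jul 5th, 2024 for the $24.12 payment"))
```

A production normalizer would need to handle many more cases (larger amounts, other currencies, times, other date formats, locale differences), which is one plausible reason the step adds measurable latency.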