Audio Synthesis Benchmarks

A comparative synthesis of 10 speech-to-text (STT) and 16 text-to-speech (TTS) engines: key metrics, pipeline latency budgets, and decision stakes (2025–2026).

Pipeline latency budgets

End-to-end latency breakdown by architecture profile (STT + LLM + TTS + network). Conversational target: < 1,000 ms.

| Profile | STT | LLM | TTS | Network | Estimated total | Status |
|---|---|---|---|---|---|---|
| Voice agent (cloud) | 150 ms | 600 ms | 120 ms | 60 ms | 930 ms | ACCEPTABLE |
| Voice agent (hybrid) | 100 ms | 400 ms | 80 ms | 50 ms | 630 ms | TARGET OK |
| Self-hosted sovereign | 200 ms | 350 ms | 100 ms | 40 ms | 690 ms | TARGET OK |
| End-to-end (Ultravox/Moshi) | 300 ms (single model) | | | 60 ms | 360 ms | TARGET OK |

* Best-case estimates. The end-to-end profile (Ultravox/Moshi) merges STT + LLM + TTS into a single model.
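
The budget arithmetic above can be sketched as a small checker. The 70% margin separating TARGET OK from ACCEPTABLE is an assumption chosen to reproduce the table's labels, not a published threshold:

```python
# Sanity-check a pipeline latency budget against the < 1,000 ms
# conversational target used in the table above.
CONVERSATIONAL_TARGET_MS = 1000

def budget_status(stt_ms, llm_ms, tts_ms, network_ms,
                  target_ms=CONVERSATIONAL_TARGET_MS):
    """Sum per-stage latencies and classify the total against the target."""
    total = stt_ms + llm_ms + tts_ms + network_ms
    if total <= 0.7 * target_ms:   # comfortably under budget (assumed margin)
        return total, "TARGET OK"
    if total <= target_ms:         # under budget, but little headroom
        return total, "ACCEPTABLE"
    return total, "TOO SLOW"

# Voice agent (cloud) profile from the table:
print(*budget_status(150, 600, 120, 60))  # 930 ACCEPTABLE
```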

STT — Speech Recognition (10 engines)

WER = Word Error Rate (lower is better). TTFA = Time to First Audio chunk (streaming latency); the TTFA and Typical columns give best-case and typical figures.

| Engine | WER % | TTFA (ms) | Typical (ms) | $/min | Languages | Sovereignty | Streaming | Note |
|---|---|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.2 | 75 | 200 | $0.0036 | 36 | SELF-HOST | STREAM | Best cloud latency |
| Inworld STT | 5.0 | 92 | 150 | $0.006 | 20 | CLOUD | STREAM | Voice-agent optimized |
| Whisper Turbo | 3.0 | 100 | 200 | Free | 99 | SELF-HOST | STREAM | Speed/quality balance |
| Voxtral ASR (Mistral) | 5.0 | 120 | 250 | $0.003 | 30 | SELF-HOST | STREAM | EU-sovereign, open-weights |
| AssemblyAI Universal-2 | 4.9 | 150 | 300 | $0.0062 | 99 | CLOUD | STREAM | Best cloud WER |
| faster-whisper (CTranslate2) | 2.7 | 150 | 300 | Free | 99 | SELF-HOST | STREAM | Whisper quality + streaming |
| Azure Speech (Microsoft) | 5.9 | 180 | 350 | $0.0167 | 100 | SELF-HOST | STREAM | EU on-premise available |
| Google Speech-to-Text v2 | 6.8 | 200 | 400 | $0.006 | 125 | CLOUD | STREAM | Largest language coverage |
| Audiogami (Gamilab) | 3.5 | 200 | 400 | Free | 5 | SELF-HOST | STREAM | CH-hosted, FR/DE/Swiss-DE |
| Whisper Large v3 | 2.7 | 300 | 800 | Free | 99 | SELF-HOST | BATCH | Best open WER overall |
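WER, the headline metric here, is the word-level edit distance between a reference transcript and the engine's hypothesis, divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference
    words, computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words, so roughly 0.167:
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```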

TTS — Speech Synthesis (16 engines)

TTFA = Time to First Audio; the TTFA and Typical columns give best-case and typical figures. ELO = Artificial Analysis score (n/a = not evaluated). Price = cost per minute of generated speech.

| Engine | TTFA (ms) | Typical (ms) | ELO | $/min | Sovereignty | Note |
|---|---|---|---|---|---|---|
| Cartesia Sonic 3 | 40 | 90 | 1054 | $0.047 | CLOUD | Fastest TTFA (SSM architecture) |
| Kokoro 82M v1.0 | 60 | 120 | 1059 | $0.0007 | SELF-HOST | Best open-source quality/cost |
| ElevenLabs v3 | 75 | 200 | 1108 | $0.206 | CLOUD | Top-3 quality, best cloning |
| Deepgram Aura 2 | 80 | 150 | n/a | $0.015 | CLOUD | Voice-agent optimized |
| Hume AI Octave 2 | 100 | 200 | 1046 | $0.0076 | CLOUD | Emotion-aware TTS |
| Kyutai TTS 1.6B | 100 | 200 | n/a | Free | SELF-HOST | Open, multilingual |
| Ultravox v0.5 | 100 | 300 | n/a | $0.05 | SELF-HOST | End-to-end speech LLM |
| Inworld TTS-1.5 | 130 | 250 | 1160 | $0.01 | SELF-HOST | ELO #1, low cost, on-premise |
| Chatterbox (Resemble AI) | 150 | 300 | 1050 | $0.04 | SELF-HOST | Expressive open-source |
| Voxtral TTS (Mistral) | 150 | 300 | n/a | $0.02 | SELF-HOST | EU-sovereign, open-weights |
| Fish Audio OpenAudio S1 | 200 | 400 | 1074 | $0.015 | CLOUD | Best multilingual cloning |
| Moshi (Kyutai) | 200 | 500 | n/a | Free | SELF-HOST | Full-duplex end-to-end |
| Orpheus 3B | 200 | 500 | n/a | Free | SELF-HOST | Emotional open-source |
| OpenAI Realtime API | 300 | 700 | 1106 | $0.10 | CLOUD | Full-duplex, GPT-4o native |
| Dia (Nari Labs) | 300 | 800 | n/a | Free | SELF-HOST | Multi-speaker dialogue |
| Sesame CSM | 400 | 1000 | n/a | Free | SELF-HOST | Context-aware prosody |
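
TTFA can be measured generically against any streaming client: start a timer, block on the first chunk, then drain the rest. A sketch, where `fake_engine` is a stand-in for a real engine SDK:

```python
import time
from typing import Iterable, Iterator

def measure_ttfa(chunks: Iterable[bytes]) -> tuple[float, bytes]:
    """Return (TTFA in ms, full audio) for a streaming TTS response."""
    start = time.perf_counter()
    it = iter(chunks)
    first = next(it)                        # blocks until the first audio chunk
    ttfa_ms = (time.perf_counter() - start) * 1000.0
    return ttfa_ms, first + b"".join(it)    # drain the rest for the full clip

# Stand-in generator: ~80 ms of "synthesis" before audio starts streaming.
def fake_engine() -> Iterator[bytes]:
    time.sleep(0.08)
    yield b"\x00" * 320
    yield b"\x00" * 320

ttfa_ms, audio = measure_ttfa(fake_engine())
print(f"TTFA {ttfa_ms:.0f} ms, {len(audio)} bytes")
```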

Key insights

The Quality / Latency / Cost trilemma

No current approach optimizes all three dimensions at once: Cartesia delivers the lowest latency but average quality; Whisper has the best WER but is not streaming-native; Inworld TTS is ELO #1 at low cost but US-cloud-hosted. Breaking this trilemma will take fundamental research.
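
One way to make the trilemma concrete is a weighted ranking over the table's figures. The min-max normalization, the weights, and the four-engine subset below are illustrative assumptions, not a standard benchmark:

```python
TTS_ENGINES = {
    # name: (ELO, TTFA ms, $/min) -- figures from the TTS table above
    "Cartesia Sonic 3": (1054, 40, 0.047),
    "Kokoro 82M v1.0":  (1059, 60, 0.0007),
    "ElevenLabs v3":    (1108, 75, 0.206),
    "Inworld TTS-1.5":  (1160, 130, 0.01),
}

def rank(w_quality: float, w_latency: float, w_cost: float) -> list[str]:
    """Rank engines by a weighted sum of min-max-normalized dimensions."""
    names = list(TTS_ENGINES)
    elo, ttfa, cost = zip(*TTS_ENGINES.values())

    def norm(vals, higher_is_better):
        lo, hi = min(vals), max(vals)
        if hi == lo:
            return [1.0] * len(vals)
        return [(v - lo) / (hi - lo) if higher_is_better
                else (hi - v) / (hi - lo) for v in vals]

    q, l, c = norm(elo, True), norm(ttfa, False), norm(cost, False)
    scores = {n: w_quality * q[i] + w_latency * l[i] + w_cost * c[i]
              for i, n in enumerate(names)}
    return sorted(names, key=scores.get, reverse=True)

# A different engine wins depending on which axis you weight:
print(rank(1, 0, 0)[0], "|", rank(0, 1, 0)[0], "|", rank(0, 0, 1)[0])
# Inworld TTS-1.5 | Cartesia Sonic 3 | Kokoro 82M v1.0
```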

The open-source window is closing fast

In 2025, open-source models (Whisper, Kokoro, Chatterbox) reach 80–90% of cloud quality at zero marginal cost. But cloud platforms are investing heavily: ElevenLabs ($180M raised), Deepgram ($1.3B valuation), AssemblyAI ($158M raised). Quality parity, in either direction, is likely within 12–18 months.

Sovereignty: the criterion that changes everything

Of the engines above, 3 of 10 STT and 6 of 16 TTS are cloud-only, with no on-premise option. For a project subject to the GDPR or the Swiss nLPD, the shortlist narrows to Whisper/faster-whisper (STT), Kokoro/Chatterbox/Voxtral (TTS), and Audiogami (CH-hosted STT). The architecture should be designed to swap engines without major refactoring.
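
The "swap without major refactoring" requirement is typically met with thin provider-agnostic interfaces. Everything below (class names, stub backends) is an illustrative sketch, not any vendor's real SDK:

```python
from abc import ABC, abstractmethod

class SpeechToText(ABC):
    """Engine-agnostic STT contract; one subclass per provider."""
    @abstractmethod
    def transcribe(self, audio: bytes, language: str = "en") -> str: ...

class TextToSpeech(ABC):
    """Engine-agnostic TTS contract."""
    @abstractmethod
    def synthesize(self, text: str, voice: str = "default") -> bytes: ...

class WhisperSTT(SpeechToText):
    """Sovereign path: would wrap a local Whisper/faster-whisper model."""
    def transcribe(self, audio: bytes, language: str = "en") -> str:
        return "<local transcript stub>"

class KokoroTTS(TextToSpeech):
    """Sovereign path: would wrap a local Kokoro model."""
    def synthesize(self, text: str, voice: str = "default") -> bytes:
        return b"\x00" * 160  # placeholder PCM frame

def build_pipeline(stt: SpeechToText, tts: TextToSpeech):
    """Callers depend only on the interfaces, so backends swap freely."""
    def respond(audio: bytes) -> bytes:
        text = stt.transcribe(audio)
        # ...LLM turn would go here...
        return tts.synthesize(text)
    return respond

# Swapping a cloud engine for a sovereign one changes only this line:
respond = build_pipeline(WhisperSTT(), KokoroTTS())
```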

The end-to-end approach: a bet on the future

Ultravox, Moshi, and OpenAI Realtime API merge STT + LLM + TTS into a single model, reducing total latency to 300–400ms. But these approaches sacrifice modularity, controllability, and sovereignty. They are relevant for pure real-time use cases, but risky for applications requiring fine control of content or personality.