# Speech Recognition & Synthesis Benchmarks
A comparative evaluation of 10 STT and 16 TTS engines: key metrics, pipeline latency budgets, and decision stakes (2025–2026).
## Pipeline latency budgets
End-to-end latency breakdown by architecture profile (STT + LLM + TTS + network). Conversational target: < 1,000 ms.
| Profile | STT (ms) | LLM (ms) | TTS (ms) | Network (ms) | Estimated total (ms) | Status |
|---|---|---|---|---|---|---|
| Voice agent (cloud) | 150 | 600 | 120 | 60 | 930 | ACCEPTABLE |
| Voice agent (hybrid) | 100 | 400 | 80 | 50 | 630 | TARGET OK |
| Self-hosted sovereign | 200 | 350 | 100 | 40 | 690 | TARGET OK |
| End-to-end (Ultravox/Moshi) | — | — | 300 | 60 | 360 | TARGET OK |
* Best-case estimates. The end-to-end profile (Ultravox/Moshi) merges STT + LLM + TTS into a single model.
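The budget arithmetic above is a simple sum of per-stage latencies checked against the conversational target; a minimal sketch (stage values taken from the best-case estimates in the table):

```python
# Sum per-stage latencies and check them against the conversational target.
CONVERSATIONAL_TARGET_MS = 1_000

def pipeline_latency(stages: dict[str, int]) -> tuple[int, bool]:
    """Return (total latency in ms, whether it fits the conversational target)."""
    total = sum(stages.values())
    return total, total < CONVERSATIONAL_TARGET_MS

# Voice agent (cloud) profile from the table above.
cloud_agent = {"stt": 150, "llm": 600, "tts": 120, "network": 60}
total, ok = pipeline_latency(cloud_agent)
print(f"{total} ms, within target: {ok}")  # 930 ms, within target: True
```

The same helper makes it easy to see how much LLM time-to-first-token dominates the budget: halving the LLM stage saves more than eliminating STT and TTS combined.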
## STT — Speech Recognition (10 engines)
WER = Word Error Rate (lower is better). TTFA = Time to First Audio chunk (streaming latency).
| Engine | WER % | TTFA (ms) | Typical (ms) | $/min | Languages | Sovereignty | Streaming | Note |
|---|---|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.2 | 75 | 200 | $0.0036 | 36 | SELF-HOST | STREAM | Best cloud latency |
| Inworld STT | 5.0 | 92 | 150 | $0.006 | 20 | CLOUD | STREAM | Voice-agent optimized |
| Whisper Turbo | 3.0 | 100 | 200 | Free | 99 | SELF-HOST | STREAM | Speed/quality balance |
| Voxtral ASR (Mistral) | 5.0 | 120 | 250 | $0.003 | 30 | SELF-HOST | STREAM | EU-sovereign, open-weights |
| AssemblyAI Universal-2 | 4.9 | 150 | 300 | $0.0062 | 99 | CLOUD | STREAM | Best cloud WER |
| faster-whisper (CTranslate2) | 2.7 | 150 | 300 | Free | 99 | SELF-HOST | STREAM | Whisper quality + streaming |
| Azure Speech (Microsoft) | 5.9 | 180 | 350 | $0.0167 | 100 | SELF-HOST | STREAM | EU on-premise available |
| Google Speech-to-Text v2 | 6.8 | 200 | 400 | $0.006 | 125 | CLOUD | STREAM | Largest language coverage |
| Audiogami (Gamilab) | 3.5 | 200 | 400 | Free | 5 | SELF-HOST | STREAM | CH-hosted, FR/DE/Swiss-DE |
| Whisper Large v3 | 2.7 | 300 | 800 | Free | 99 | SELF-HOST | BATCH | Best open WER overall |
## TTS — Speech Synthesis (16 engines)
TTFA = Time to First Audio. ELO = Artificial Analysis score (— = not evaluated). Price = cost per minute of generated speech.
| Engine | TTFA (ms) | Typical (ms) | ELO | $/min | Sovereignty | Note |
|---|---|---|---|---|---|---|
| Cartesia Sonic 3 | 40 | 90 | 1054 | $0.047 | CLOUD | Fastest TTFA (SSM architecture) |
| Kokoro 82M v1.0 | 60 | 120 | 1059 | $0.0007 | SELF-HOST | Best open-source quality/cost |
| ElevenLabs v3 | 75 | 200 | 1108 | $0.206 | CLOUD | Top-3 quality, best cloning |
| Deepgram Aura 2 | 80 | 150 | — | $0.015 | CLOUD | Voice-agent optimized |
| Hume AI Octave 2 | 100 | 200 | 1046 | $0.0076 | CLOUD | Emotion-aware TTS |
| Kyutai TTS 1.6B | 100 | 200 | — | Free | SELF-HOST | Open, multilingual |
| Ultravox v0.5 | 100 | 300 | — | $0.05 | SELF-HOST | End-to-end speech LLM |
| Inworld TTS-1.5 | 130 | 250 | 1160 | $0.01 | SELF-HOST | ELO #1, low cost, on-premise |
| Chatterbox (Resemble AI) | 150 | 300 | 1050 | $0.04 | SELF-HOST | Expressive open-source |
| Voxtral TTS (Mistral) | 150 | 300 | — | $0.02 | SELF-HOST | EU-sovereign, open-weights |
| Fish Audio OpenAudio S1 | 200 | 400 | 1074 | $0.015 | CLOUD | Best multilingual cloning |
| Moshi (Kyutai) | 200 | 500 | — | Free | SELF-HOST | Full-duplex end-to-end |
| Orpheus 3B | 200 | 500 | — | Free | SELF-HOST | Emotional open-source |
| OpenAI Realtime API | 300 | 700 | 1106 | $0.10 | CLOUD | Full-duplex, GPT-4o native |
| Dia (Nari Labs) | 300 | 800 | — | Free | SELF-HOST | Multi-speaker dialogue |
| Sesame CSM | 400 | 1000 | — | Free | SELF-HOST | Context-aware prosody |
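The $/min spread (from $0.0007 to $0.206) matters more at volume than in isolation. A quick sketch, using a hypothetical workload of 1,000 minutes of generated speech per day and prices from the table:

```python
# Monthly TTS bill at a given volume of generated speech.
def monthly_cost(price_per_min: float, minutes_per_day: float, days: int = 30) -> float:
    return price_per_min * minutes_per_day * days

# 1,000 minutes/day — roughly $21, $300, and $6,180 per month at these prices.
for engine, price in [("Kokoro 82M", 0.0007),
                      ("Inworld TTS-1.5", 0.01),
                      ("ElevenLabs v3", 0.206)]:
    print(f"{engine}: ${monthly_cost(price, 1000):,.0f}/month")
```

At this volume the gap between the cheapest and most expensive engine is nearly 300×, which usually outweighs per-request latency differences in total-cost decisions.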
## Key insights
### The Quality / Latency / Cost trilemma
No current engine optimizes all three dimensions at once. Cartesia delivers the lowest latency but average quality; Whisper has the best WER but is not streaming-native; Inworld TTS ranks #1 on ELO at low cost but runs in a US cloud. Breaking this trilemma will require fundamental research.
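The trilemma can be restated as a Pareto frontier: no engine dominates another on all three axes at once. A sketch using four TTS engines from the table above (ELO as quality, TTFA as latency, $/min as cost; the selection is illustrative):

```python
# Engine a dominates b if it is at least as good on every axis
# (higher ELO, lower TTFA, lower price) and strictly better on one.
def dominates(a: tuple, b: tuple) -> bool:
    elo_a, ttfa_a, price_a = a
    elo_b, ttfa_b, price_b = b
    no_worse = elo_a >= elo_b and ttfa_a <= ttfa_b and price_a <= price_b
    strictly_better = elo_a > elo_b or ttfa_a < ttfa_b or price_a < price_b
    return no_worse and strictly_better

engines = {
    "Cartesia Sonic 3": (1054, 40, 0.047),
    "Kokoro 82M":       (1059, 60, 0.0007),
    "ElevenLabs v3":    (1108, 75, 0.206),
    "Inworld TTS-1.5":  (1160, 130, 0.01),
}
pareto = [name for name, v in engines.items()
          if not any(dominates(w, v) for w in engines.values() if w != v)]
print(pareto)  # all four engines survive: each wins on at least one axis
```

That every engine in the sample is Pareto-optimal is the trilemma in miniature: picking one means explicitly ranking the three axes for your use case.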
### The open-source window is closing fast
In 2025, open-source models (Whisper, Kokoro, Chatterbox) reach 80–90% of cloud quality at zero marginal cost. But cloud platforms are investing heavily: ElevenLabs ($180M raised), Deepgram ($1.3B valuation), AssemblyAI ($158M raised). Within 12–18 months the gap could close entirely or widen again.
### Sovereignty: the criterion that changes everything
7 of 10 cloud STT engines have no on-premise option, and 8 of 16 TTS engines are cloud-only. For a project subject to GDPR or the Swiss nLPD, the choice narrows to Whisper/faster-whisper (STT), Kokoro/Chatterbox/Voxtral (TTS), and Audiogami (CH-hosted STT). The architecture must therefore be designed to switch engines without major refactoring.
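The switch-without-refactoring requirement usually means putting engines behind a minimal interface from day one. A sketch of that pattern (class and method names are illustrative, not real vendor SDK calls):

```python
from typing import Protocol

class TTSEngine(Protocol):
    """Minimal surface every TTS backend must implement."""
    def synthesize(self, text: str) -> bytes: ...

class KokoroLocal:
    """Self-hosted backend on the sovereign path (placeholder body)."""
    def synthesize(self, text: str) -> bytes:
        return b"<pcm audio for: " + text.encode() + b">"

class CloudVendor:
    """Cloud backend, swappable behind the same interface (placeholder body)."""
    def synthesize(self, text: str) -> bytes:
        return b"<pcm audio for: " + text.encode() + b">"

def speak(engine: TTSEngine, text: str) -> bytes:
    # Application code depends only on the Protocol, never on a vendor SDK,
    # so a GDPR/nLPD-driven migration is a one-line change at the call site.
    return engine.synthesize(text)
```

With a structural `Protocol`, the backends need no shared base class; any object with a matching `synthesize` method type-checks, which keeps vendor SDKs quarantined in their own modules.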
### The end-to-end approach: a bet on the future
Ultravox, Moshi, and OpenAI Realtime API merge STT + LLM + TTS into a single model, reducing total latency to 300–400ms. But these approaches sacrifice modularity, controllability, and sovereignty. They are relevant for pure real-time use cases, but risky for applications requiring fine control of content or personality.