# Speech Recognition & Synthesis Benchmarks
A comparative evaluation of 10 STT and 16 TTS engines: key metrics, pipeline latency budgets, and decision stakes (2025–2026).
## Pipeline latency budgets
End-to-end latency breakdown by architecture profile (STT + LLM + TTS + network). Conversational target: < 1,000 ms.
| Profile | STT (ms) | LLM (ms) | TTS (ms) | Network (ms) | Estimated total (ms) | Status |
|---|---|---|---|---|---|---|
| Voice agent (cloud) | 150 | 600 | 120 | 60 | 930 | ACCEPTABLE |
| Voice agent (hybrid) | 100 | 400 | 80 | 50 | 630 | TARGET OK |
| Self-hosted sovereign | 200 | 350 | 100 | 40 | 690 | TARGET OK |
| End-to-end (Ultravox/Moshi) | — | — | 300 | 60 | 360 | TARGET OK |
* Best-case estimates. The end-to-end profile (Ultravox/Moshi) merges STT + LLM + TTS into a single model.
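The budget arithmetic above is a simple sum of per-stage latencies checked against the conversational target; a minimal sketch (stage values taken from the best-case estimates in the table):

```python
# Sum per-stage latencies and check them against the conversational target.
CONVERSATIONAL_TARGET_MS = 1_000

def pipeline_latency(stages: dict[str, int]) -> tuple[int, bool]:
    """Return (total latency in ms, whether it fits the conversational target)."""
    total = sum(stages.values())
    return total, total < CONVERSATIONAL_TARGET_MS

# Voice agent (cloud) profile from the table above.
cloud_agent = {"stt": 150, "llm": 600, "tts": 120, "network": 60}
total, ok = pipeline_latency(cloud_agent)
print(f"{total} ms, within target: {ok}")  # 930 ms, within target: True
```

The same helper makes it easy to see how much LLM time-to-first-token dominates the budget: halving the LLM stage saves more than eliminating STT and TTS combined.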
## STT — Speech Recognition (10 engines)
WER = Word Error Rate (lower is better). TTFA = Time to First Audio chunk (streaming latency).
| Engine | WER % | TTFA (ms) | Typical (ms) | $/min | Languages | Sovereignty | Streaming | Note |
|---|---|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.2 | 75 | 200 | $0.0036 | 36 | SELF-HOST | STREAM | Best cloud latency |
| Inworld STT | 5.0 | 92 | 150 | $0.006 | 20 | CLOUD | STREAM | Voice-agent optimized |
| Whisper Turbo | 3.0 | 100 | 200 | Free | 99 | SELF-HOST | STREAM | Speed/quality balance |
| Voxtral ASR (Mistral) | 5.0 | 120 | 250 | $0.003 | 30 | SELF-HOST | STREAM | EU-sovereign, open-weights |
| AssemblyAI Universal-2 | 4.9 | 150 | 300 | $0.0062 | 99 | CLOUD | STREAM | Best cloud WER |
| faster-whisper (CTranslate2) | 2.7 | 150 | 300 | Free | 99 | SELF-HOST | STREAM | Whisper quality + streaming |
| Azure Speech (Microsoft) | 5.9 | 180 | 350 | $0.0167 | 100 | SELF-HOST | STREAM | EU on-premise available |
| Google Speech-to-Text v2 | 6.8 | 200 | 400 | $0.006 | 125 | CLOUD | STREAM | Largest language coverage |
| Audiogami (Gamilab) | 3.5 | 200 | 400 | Free | 5 | SELF-HOST | STREAM | CH-hosted, FR/DE/Swiss-DE |
| Whisper Large v3 | 2.7 | 300 | 800 | Free | 99 | SELF-HOST | BATCH | Best open WER overall |
## TTS — Speech Synthesis (16 engines)
TTFA = Time to First Audio. ELO = Artificial Analysis score (— = not evaluated). Price = cost per minute of generated speech.
| Engine | TTFA (ms) | Typical (ms) | ELO | $/min | Sovereignty | Note |
|---|---|---|---|---|---|---|
| Cartesia Sonic 3 | 40 | 90 | 1054 | $0.047 | CLOUD | Fastest TTFA (SSM architecture) |
| Kokoro 82M v1.0 | 60 | 120 | 1059 | $0.0007 | SELF-HOST | Best open-source quality/cost |
| ElevenLabs v3 | 75 | 200 | 1108 | $0.206 | CLOUD | Top-3 quality, best cloning |
| Deepgram Aura 2 | 80 | 150 | — | $0.015 | CLOUD | Voice-agent optimized |
| Hume AI Octave 2 | 100 | 200 | 1046 | $0.0076 | CLOUD | Emotion-aware TTS |
| Kyutai TTS 1.6B | 100 | 200 | — | Free | SELF-HOST | Open, multilingual |
| Ultravox v0.5 | 100 | 300 | — | $0.05 | SELF-HOST | End-to-end speech LLM |
| Inworld TTS-1.5 | 130 | 250 | 1160 | $0.01 | SELF-HOST | ELO #1, low cost, on-premise |
| Chatterbox (Resemble AI) | 150 | 300 | 1050 | $0.04 | SELF-HOST | Expressive open-source |
| Voxtral TTS (Mistral) | 150 | 300 | — | $0.02 | SELF-HOST | EU-sovereign, open-weights |
| Fish Audio OpenAudio S1 | 200 | 400 | 1074 | $0.015 | CLOUD | Best multilingual cloning |
| Moshi (Kyutai) | 200 | 500 | — | Free | SELF-HOST | Full-duplex end-to-end |
| Orpheus 3B | 200 | 500 | — | Free | SELF-HOST | Emotional open-source |
| OpenAI Realtime API | 300 | 700 | 1106 | $0.10 | CLOUD | Full-duplex, GPT-4o native |
| Dia (Nari Labs) | 300 | 800 | — | Free | SELF-HOST | Multi-speaker dialogue |
| Sesame CSM | 400 | 1000 | — | Free | SELF-HOST | Context-aware prosody |
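The $/min spread (from $0.0007 to $0.206) matters more at volume than in isolation. A quick sketch, using a hypothetical workload of 1,000 minutes of generated speech per day and prices from the table:

```python
# Monthly TTS bill at a given volume of generated speech.
def monthly_cost(price_per_min: float, minutes_per_day: float, days: int = 30) -> float:
    return price_per_min * minutes_per_day * days

# 1,000 minutes/day — roughly $21, $300, and $6,180 per month at these prices.
for engine, price in [("Kokoro 82M", 0.0007),
                      ("Inworld TTS-1.5", 0.01),
                      ("ElevenLabs v3", 0.206)]:
    print(f"{engine}: ${monthly_cost(price, 1000):,.0f}/month")
```

At this volume the gap between the cheapest and most expensive engine is nearly 300×, which usually outweighs per-request latency differences in total-cost decisions.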
## Key insights
### The Quality / Latency / Cost trilemma
No current engine optimizes all three dimensions at once. Cartesia delivers the lowest latency but average quality; Whisper has the best WER but is not streaming-native; Inworld TTS ranks #1 on ELO at low cost but runs in a US cloud. Breaking this trilemma will require fundamental research.
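The trilemma can be restated as a Pareto frontier: no engine dominates another on all three axes at once. A sketch using four TTS engines from the table above (ELO as quality, TTFA as latency, $/min as cost; the selection is illustrative):

```python
# Engine a dominates b if it is at least as good on every axis
# (higher ELO, lower TTFA, lower price) and strictly better on one.
def dominates(a: tuple, b: tuple) -> bool:
    elo_a, ttfa_a, price_a = a
    elo_b, ttfa_b, price_b = b
    no_worse = elo_a >= elo_b and ttfa_a <= ttfa_b and price_a <= price_b
    strictly_better = elo_a > elo_b or ttfa_a < ttfa_b or price_a < price_b
    return no_worse and strictly_better

engines = {
    "Cartesia Sonic 3": (1054, 40, 0.047),
    "Kokoro 82M":       (1059, 60, 0.0007),
    "ElevenLabs v3":    (1108, 75, 0.206),
    "Inworld TTS-1.5":  (1160, 130, 0.01),
}
pareto = [name for name, v in engines.items()
          if not any(dominates(w, v) for w in engines.values() if w != v)]
print(pareto)  # all four engines survive: each wins on at least one axis
```

That every engine in the sample is Pareto-optimal is the trilemma in miniature: picking one means explicitly ranking the three axes for your use case.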
### The open-source window is closing fast
In 2025, open-source models (Whisper, Kokoro, Chatterbox) reach 80–90% of cloud quality at zero marginal cost. But cloud platforms are investing heavily: ElevenLabs ($180M raised), Deepgram ($1.3B valuation), AssemblyAI ($158M raised). Within 12–18 months the gap could close entirely or widen again.
### Sovereignty: the criterion that changes everything
7 of 10 cloud STT engines have no on-premise option, and 8 of 16 TTS engines are cloud-only. For a project subject to GDPR or the Swiss nLPD, the choice narrows to Whisper/faster-whisper (STT), Kokoro/Chatterbox/Voxtral (TTS), and Audiogami (CH-hosted STT). The architecture must therefore be designed to switch engines without major refactoring.
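The switch-without-refactoring requirement usually means putting engines behind a minimal interface from day one. A sketch of that pattern (class and method names are illustrative, not real vendor SDK calls):

```python
from typing import Protocol

class TTSEngine(Protocol):
    """Minimal surface every TTS backend must implement."""
    def synthesize(self, text: str) -> bytes: ...

class KokoroLocal:
    """Self-hosted backend on the sovereign path (placeholder body)."""
    def synthesize(self, text: str) -> bytes:
        return b"<pcm audio for: " + text.encode() + b">"

class CloudVendor:
    """Cloud backend, swappable behind the same interface (placeholder body)."""
    def synthesize(self, text: str) -> bytes:
        return b"<pcm audio for: " + text.encode() + b">"

def speak(engine: TTSEngine, text: str) -> bytes:
    # Application code depends only on the Protocol, never on a vendor SDK,
    # so a GDPR/nLPD-driven migration is a one-line change at the call site.
    return engine.synthesize(text)
```

With a structural `Protocol`, the backends need no shared base class; any object with a matching `synthesize` method type-checks, which keeps vendor SDKs quarantined in their own modules.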
### The end-to-end approach: a bet on the future
Ultravox, Moshi, and OpenAI Realtime API merge STT + LLM + TTS into a single model, reducing total latency to 300–400ms. But these approaches sacrifice modularity, controllability, and sovereignty. They are relevant for pure real-time use cases, but risky for applications requiring fine control of content or personality.