TTS & Voice Synthesis
Comparison of voice synthesis solutions for conversational pipelines (2025–2026). Benchmarks, strategic stakes, and decision questions.
Strategic Framing — Beyond Latency & Cost
The Real Question
Choosing a TTS solution is not just about comparing benchmarks. The real question is: what level of data sovereignty, infrastructure control, and deployment flexibility does your use case require?
Validation vs Production
In the validation phase, cloud APIs allow rapid iteration on quality and experience. In production, sovereignty, cost at scale, and vendor dependency become structural. The key question: does the architecture allow migration without major rework?
2026 Market Signal
ElevenLabs ($11B) is going off-cloud. Inworld is ELO #1 at 75% lower cost. Chatterbox beats ElevenLabs in blind tests. The quality gap between cloud and open-source is closing fast.
Questions to ask before choosing: What is the sensitivity level of the voice data being processed (GDPR, nLPD, HIPAA)? What is the exit strategy if the provider raises prices or is acquired? Does the architecture allow migration to open-source without major rework?
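One way to keep the migration question answerable is to hide the provider behind a thin interface from day one. The sketch below is only an illustration of that idea: `TTSBackend`, `CloudBackend`, `SelfHostedBackend`, and `speak` are hypothetical names, and the two backends return placeholder bytes instead of calling any real vendor SDK or model.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class TTSBackend(ABC):
    """Hypothetical provider-neutral interface (not any vendor's SDK)."""

    @abstractmethod
    def stream(self, text: str, voice: str) -> Iterator[bytes]:
        """Yield audio chunks for `text` as they become available."""


class CloudBackend(TTSBackend):
    """Placeholder for a cloud API adapter; the network call is omitted."""

    def stream(self, text: str, voice: str) -> Iterator[bytes]:
        # A real adapter would call the vendor's streaming endpoint here.
        yield f"cloud:{voice}:{text}".encode()


class SelfHostedBackend(TTSBackend):
    """Placeholder for an open-source model served on your own infrastructure."""

    def stream(self, text: str, voice: str) -> Iterator[bytes]:
        # A real adapter would run local inference here.
        yield f"local:{voice}:{text}".encode()


def speak(backend: TTSBackend, text: str, voice: str = "default") -> bytes:
    """Pipeline code depends only on TTSBackend, so backends are swappable."""
    return b"".join(backend.stream(text, voice))


if __name__ == "__main__":
    print(speak(CloudBackend(), "hello"))
    print(speak(SelfHostedBackend(), "hello"))
```

If the rest of the pipeline only ever calls `speak`, moving from a cloud API in validation to a self-hosted model in production is limited to writing one new adapter class.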
This section covers cloud streaming TTS APIs (2025–2026). They enable fast integration and currently offer best-in-class quality, at the cost of vendor dependency and sovereignty constraints that must be evaluated against the deployment context.
| Solution | Phase | Positioning | TTFA | ELO | Cloning | Emotion | Multilingual (languages) | Price/1M |
|---|---|---|---|---|---|---|---|---|
| ElevenLabs v3 | Phase 1 MVP | Quality reference | 75ms | 1108 | ✓ | ✓ | ✓ (70) | $206 |
| Cartesia Sonic 3 | Phase 1 MVP | Latency-critical | 40ms | 1054 | ✓ | ✓ | ✓ (40) | $46.7 |
| Inworld TTS-1.5 + Realtime API | Phase 1 MVP | Quality + sovereignty + full pipeline | 130ms | 1160 | ✓ | ✓ | ✓ (9) | $10 |
| Hume AI Octave 2 | Phase 1 MVP | Emotional expressiveness | 100ms | 1046 | ✓ | ✓ | ✓ (11) | $7.6 |
| Fish Audio OpenAudio S1 | Phase 1 MVP | Cost/sovereignty | 200ms | 1074 | ✓ | ✓ | ✓ (13) | $15 |
| Deepgram Aura 2 | Phase 1 MVP | Integrated ASR+TTS stack | 80ms | — | ✗ | ✗ | ✗ | $15 |
| OpenAI Realtime API | Phase 1 MVP | Benchmark reference | 300ms | 1106 | ✗ | ✓ | ✓ (50) | Free |
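The table can be turned into a first-pass shortlist by filtering on hard constraints and ranking what remains by ELO. The sketch below is a minimal illustration using the figures from the table. The `Solution` type and `shortlist` function are hypothetical names, the "Free" price is encoded as 0.0, and a missing ELO is ranked last; both are assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Solution:
    name: str
    ttfa_ms: int
    elo: Optional[int]   # None where the table shows no ELO
    cloning: bool
    price_per_1m: float  # USD; "Free" is encoded as 0.0 (assumption)


# Values copied from the comparison table.
SOLUTIONS = [
    Solution("ElevenLabs v3", 75, 1108, True, 206.0),
    Solution("Cartesia Sonic 3", 40, 1054, True, 46.7),
    Solution("Inworld TTS-1.5 + Realtime API", 130, 1160, True, 10.0),
    Solution("Hume AI Octave 2", 100, 1046, True, 7.6),
    Solution("Fish Audio OpenAudio S1", 200, 1074, True, 15.0),
    Solution("Deepgram Aura 2", 80, None, False, 15.0),
    Solution("OpenAI Realtime API", 300, 1106, False, 0.0),
]


def shortlist(max_ttfa_ms: int, need_cloning: bool, max_price: float) -> List[Solution]:
    """Keep solutions that meet every constraint, best ELO first."""
    hits = [
        s for s in SOLUTIONS
        if s.ttfa_ms <= max_ttfa_ms
        and (s.cloning or not need_cloning)
        and s.price_per_1m <= max_price
    ]
    # Solutions without an ELO score sort last.
    return sorted(hits, key=lambda s: s.elo or 0, reverse=True)


if __name__ == "__main__":
    for s in shortlist(max_ttfa_ms=150, need_cloning=True, max_price=50.0):
        print(s.name, s.ttfa_ms, s.elo, s.price_per_1m)
```

With a 150ms TTFA ceiling, cloning required, and a $50 price cap, this keeps Inworld, Cartesia, and Hume in that order. Sovereignty and exit-strategy criteria are not captured by these columns and still need a separate review.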
ElevenLabs v3
Industry reference — 380+ voices, 70+ languages, emotional range
Cartesia Sonic 3
Fastest TTFA on the market — 40ms, State Space Model architecture
Inworld TTS-1.5 + Realtime API
#1 quality benchmark — ELO 1160, sub-120ms Mini, Realtime S2S + STT + LLM Router
Hume AI Octave 2
LLM-based emotional TTS — natural language emotion control
Fish Audio OpenAudio S1
Pay-as-you-go voice cloning — 70% cheaper than ElevenLabs
Deepgram Aura 2
Ultra-low latency TTS optimized for voice agents — <100ms
OpenAI Realtime API
GPT-4o speech-to-speech — integrated LLM + voice, WebSocket
Architecture Question: Cascade vs End-to-End Voice-to-Voice
Two approaches compete. (A) A cascading pipeline (ASR → LLM → TTS) is more controllable, supports voice cloning, and can be deployed with full sovereignty, at ~400–800ms latency. (B) End-to-end voice-to-voice models (Ultravox, Moshi, Sesame) reach ~100ms latency but are less controllable and do not support voice cloning. The choice depends on priorities: if voice cloning and persona control are essential, (A) is the only option. If ultra-low latency takes precedence, (B) deserves evaluation. The two approaches can also coexist, each serving different use cases.
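To see where the cascade's ~400–800ms comes from, it can help to add up a per-stage latency budget. The sketch below assumes the three stages run strictly in sequence. The ASR and LLM figures are illustrative assumptions, not measurements, and only the TTS values are TTFA numbers taken from the comparison table. Streaming overlap between stages would lower the totals.

```python
# Illustrative assumptions, not benchmark results.
ASSUMED_ASR_MS = 150
ASSUMED_LLM_FIRST_TOKEN_MS = 300


def cascade_latency_ms(asr_ms: int, llm_first_token_ms: int, tts_ttfa_ms: int) -> int:
    """Time to first audio when ASR, LLM, and TTS run one after another."""
    return asr_ms + llm_first_token_ms + tts_ttfa_ms


def within_budget(total_ms: int, budget_ms: int) -> bool:
    """True if the end-to-end latency fits the target budget."""
    return total_ms <= budget_ms


if __name__ == "__main__":
    # TTS TTFA values from the table: 40ms (fastest) and 300ms (slowest).
    fastest = cascade_latency_ms(ASSUMED_ASR_MS, ASSUMED_LLM_FIRST_TOKEN_MS, 40)
    slowest = cascade_latency_ms(ASSUMED_ASR_MS, ASSUMED_LLM_FIRST_TOKEN_MS, 300)
    print(fastest, within_budget(fastest, 500))  # 490 True
    print(slowest, within_budget(slowest, 500))  # 750 False
```

Under these assumed ASR and LLM figures, the choice of TTS alone moves the total from 490ms to 750ms, both inside the ~400–800ms range quoted for approach (A) and well above the ~100ms quoted for approach (B).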