Inworld TTS-1.5 + Realtime API
#1 quality benchmark — ELO 1160, sub-120ms Mini, Realtime S2S + STT + LLM Router
Top candidate for Phase 1 MVP. Best quality/cost ratio (#1 ELO, lowest price). The Realtime API (S2S + STT + LLM Router) enables a full voice pipeline from a single provider. The on-premise option aligns with the Swiss sovereignty requirement. Viseme timestamps are directly usable for avatar lip-sync (Axis 2). An ElevenLabs migration tool simplifies the transition.
Analysis
Inworld TTS-1.5 holds the #1 position on Artificial Analysis (ELO 1160, March 2026), with 59–61% win rates against ElevenLabs, Cartesia, and OpenAI. Best price-performance: 116 ELO points per dollar (ELO 1160 divided by the $10 per 1M characters Max price) vs 5.4 for ElevenLabs. TTS-1.5 Mini achieves <120ms P90 time to first audio, the fastest realtime TTS available. Full platform: TTS + STT (voice profiling, semantic VAD) + Realtime API (full-duplex S2S, tool calling) + LLM Router (200+ models, A/B testing). On-premise deployment is available for sovereignty, the training framework is open-sourced, and an ElevenLabs migration tool is available.
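The price-performance figure and the win rates can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the ELO 1160 score is paired with the $10 per 1M characters price listed under Pricing, and that the arena uses the standard Elo logistic model (a 400-point scale). Neither assumption is confirmed by the benchmark source, and the function names are illustrative only.

```python
import math


def elo_per_dollar(elo: float, usd_per_million_chars: float) -> float:
    """ELO points per dollar of synthesis cost (per 1M characters)."""
    return elo / usd_per_million_chars


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Rating gap implied by a head-to-head win rate.

    Assumption: standard Elo logistic curve, where the expected score is
    1 / (1 + 10 ** (-gap / 400)). Solving for gap gives the formula below.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Figures quoted in this analysis.
print(elo_per_dollar(1160, 10.0))
for rate in (0.59, 0.61):
    print(rate, round(elo_gap_from_win_rate(rate), 1))
```

Under these assumptions, a 59–61% win rate corresponds to a lead of roughly 60 to 80 rating points, so the ranking reflects a consistent but not overwhelming preference.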
Strengths
- ELO 1160 — #1 quality benchmark (March 2026)
- <120ms P90 Mini — fastest realtime TTS
- 116 ELO/dollar — best price-performance
- Full platform: TTS + STT + Realtime S2S + LLM Router
- On-premise option (GDPR, HIPAA, SOC2 Type II)
- Free zero-shot voice cloning from 5–15s of reference audio
- Viseme timestamps for avatar lip-sync
- Open-sourced training framework
Weaknesses
- 9 languages (vs 70+ for ElevenLabs)
- Emotion tags experimental outside English
- Realtime API still maturing vs OpenAI Realtime
- LLM Router model list changes frequently
Voice Capabilities
Voice cloning: zero-shot from 5–15 seconds of reference audio (free). Text-based voice design (describe the voice in natural language). ElevenLabs migration tool available. Cloned voices remain stable across extended outputs.
Expressive control: [happy], [sad], [whisper] audio markup. Non-verbals: [cough], [sigh], [breathe]. Word/char/phoneme/viseme timestamps. +30% expressiveness vs TTS-1. Conversational intelligence: acoustic and metadata signals condition what is said, when, and how.
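Because emotion tags are flagged as experimental outside English, it can help to lint scripts before synthesis. The following is a minimal sketch that only knows the six tags listed above. The bracket syntax `[name]`, the lowercase-only pattern, and the idea that any other tag is "unknown" are assumptions for illustration, not Inworld's documented grammar.

```python
import re
from typing import List, Tuple

# Tags taken from the list above; the real set may be larger.
STYLE_TAGS = {"happy", "sad", "whisper"}
NON_VERBAL_TAGS = {"cough", "sigh", "breathe"}
KNOWN_TAGS = STYLE_TAGS | NON_VERBAL_TAGS

# Assumed syntax: a lowercase word wrapped in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def check_markup(text: str) -> Tuple[List[str], List[str]]:
    """Return (recognized_tags, unknown_tags), each in order of appearance."""
    recognized: List[str] = []
    unknown: List[str] = []
    for name in TAG_PATTERN.findall(text):
        if name in KNOWN_TAGS:
            recognized.append(name)
        else:
            unknown.append(name)
    return recognized, unknown


def strip_markup(text: str) -> str:
    """Remove all bracket tags and collapse whitespace, e.g. for captions."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())


script = "[happy] Great news! [sigh] Finally. [yell] Stop"
print(check_markup(script))
print(strip_markup(script))
```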
Streaming: native streaming over WebSocket. P90 time to first audio (TTFA): <120ms Mini / <250ms Max. Median TTFA: <100ms Mini / <200ms Max. ~4× faster than TTS-1. Realtime API: full-duplex WebSocket/WebRTC speech-to-speech with turn detection, mid-session tool calling, and provider-agnostic LLM routing.
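The latency targets above are stated as percentiles, so any in-house verification needs to measure time to first audio per request and then take the P90 rather than the mean. Below is a vendor-neutral sketch: `fake_stream` stands in for a real streaming client (which is not modeled here), and the nearest-rank percentile method is one common convention, not necessarily the one the vendor uses.

```python
import math
import time
from typing import Iterable, Iterator, List


def time_to_first_audio_ms(chunks: Iterable[bytes]) -> float:
    """Milliseconds from starting to read the stream until the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def percentile_nearest_rank(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)   # simulated network + synthesis delay
    yield b"\x00" * 320   # first audio chunk


latencies = [time_to_first_audio_ms(fake_stream(0.01)) for _ in range(5)]
p90 = percentile_nearest_rank(latencies, 90)
print(f"P90 TTFA: {p90:.1f} ms, within 120 ms budget: {p90 < 120.0}")
```

In a real test, the measurement should run from the client's actual network location, since round-trip time is included in the perceived TTFA.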
Lip-sync: word, character, phoneme, and viseme-level timestamps. Unity/Unreal SDKs with lip-sync templates.
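To illustrate how viseme timestamps could drive an avatar, the sketch below resamples timed viseme events onto a fixed animation frame rate. The `VisemeEvent` shape, the millisecond units, and the labels `"aa"`, `"oh"`, and `"sil"` are hypothetical; the actual payload format returned by the API is not described in this evaluation and would need to be mapped onto this structure.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VisemeEvent:
    viseme: str    # mouth-shape label (hypothetical naming)
    start_ms: int  # inclusive start time
    end_ms: int    # exclusive end time


def visemes_to_frames(events: List[VisemeEvent], fps: int, rest: str = "sil") -> List[str]:
    """Return one viseme label per animation frame, using `rest` where no event is active."""
    if not events:
        return []
    duration_ms = max(e.end_ms for e in events)
    n_frames = math.ceil(duration_ms * fps / 1000)
    frames: List[str] = []
    for i in range(n_frames):
        t_ms = i * 1000 / fps  # timestamp at the start of frame i
        active = next(
            (e.viseme for e in events if e.start_ms <= t_ms < e.end_ms),
            rest,
        )
        frames.append(active)
    return frames


events = [VisemeEvent("aa", 0, 200), VisemeEvent("oh", 300, 500)]
print(visemes_to_frames(events, fps=10))
```

Sampling at frame start is the simplest policy; a production rig would more likely blend between neighbouring visemes to avoid visible snapping.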
Pricing
TTS-1.5-Max: $10/1M chars. TTS-1.5-Mini: $5/1M chars. Zero-shot cloning: free.
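For budgeting, the per-character prices translate directly into a monthly cost estimate. This is a minimal sketch using only the two list prices above; the dictionary keys are illustrative labels rather than official model identifiers, and volume discounts or on-premise licensing are not modeled.

```python
# List prices from this section, in USD per 1M characters.
PRICE_USD_PER_MILLION_CHARS = {
    "tts-1.5-max": 10.0,
    "tts-1.5-mini": 5.0,
}


def synthesis_cost_usd(model: str, characters: int) -> float:
    """Estimated synthesis cost for a given number of input characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_USD_PER_MILLION_CHARS[model] * characters / 1_000_000


# Example: 2.5M characters per month on each tier.
for model in PRICE_USD_PER_MILLION_CHARS:
    print(model, synthesis_cost_usd(model, 2_500_000))
```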
Sovereignty & Compliance
On-premise deployment available. SOC 2 Type II, GDPR, and HIPAA compliant.
Data residency: US, EU, India
Inworld TTS-1.5 + Realtime API — Strategic Positioning
Beyond technical specs: where does this tool sit in the ecosystem, and what are the risks and strategic implications for DigiDouble?
Inworld TTS-1.5 is the sovereignty-ready challenger: ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency that are already available rather than roadmap promises.
A. Strategic Positioning
Target customer: Enterprise / Developer — regulated industries, gaming, real-time agents
ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency — the sovereignty-ready premium TTS.
B. Competitive Moat
- ELO #1 benchmark quality (independently verified), with no latency penalty when deployed on-premise
- 75% cheaper than ElevenLabs — disrupting the premium TTS pricing model
- Full on-premise + EU/India data residency + zero-retention option — sovereignty trifecta
Vulnerability: Open-source models (Chatterbox, Kokoro) are closing the quality gap, potentially commoditizing the premium TTS segment.
E. Strategic Questions for DigiDouble
Sovereignty fit
Full on-premise deployment with zero latency penalty + EU data residency + zero-retention option. Best sovereignty fit among premium TTS providers.
Build vs. Buy
Strong buy case for both Phase 1 and Phase 2. Best quality-sovereignty-cost combination among cloud TTS providers.
Lock-in risk
Proprietary models create some lock-in, but flexible deployment options and competitive pricing reduce dependency risk.
Roadmap alignment
Excellent alignment: on-premise already available (Phase 2 ready), EU data residency, competitive pricing for scale.
Data Freshness
Source: Artificial Analysis Speech Arena, March 2026. Four of the top seven models are Inworld models.
Update note (April 2026): Inworld TTS-1.5 Max is at ELO 1207 (rank #1) and Mini at ELO 1149; these supersede the March 2026 figure of ELO 1160. Realtime API S2S latency of <120ms with Mini confirmed. Pricing unchanged: $10/1M chars (Max), $5/1M chars (Mini). On-premise available.