Cloud API | #1 Artificial Analysis | Commercial (training framework open-sourced)

Inworld TTS-1.5 + Realtime API

#1 quality benchmark — ELO 1160, sub-120ms Mini, Realtime S2S + STT + LLM Router

TTFA (best case): 130ms
TTFA (typical): 250ms
Price per million chars: $10
ELO Score: 1160

Comparative Scores

Voice quality: 10/10
Latency: 8/10
Voice cloning: 9/10
Expressiveness: 9/10
Sovereignty: 6/10
Price accessibility: 8/10
Multilingual: 6/10

Architecture

Architecture: SpeechLM (streaming-native, quantization-aware)
Parameters: N/A (cloud)
Languages: 9
Self-hostable: Yes
Streaming: Yes
DigiDouble
Phase 1 MVP: Quality + Sovereignty + Full pipeline

Top candidate for Phase 1 MVP. Best quality/cost ratio (#1 ELO, lowest price). The Realtime API (S2S + STT + LLM Router) enables a full voice pipeline from a single provider. The on-premise option aligns with the Swiss sovereignty requirement. Viseme timestamps are directly usable for avatar lip-sync (Axis 2). The ElevenLabs migration tool simplifies the transition.

Analysis

Inworld TTS-1.5 holds the #1 position on Artificial Analysis (ELO 1160, March 2026) with 59–61% win rates vs ElevenLabs, Cartesia, and OpenAI. Best price-performance: 116 ELO per dollar of price per 1M chars (1160 / $10) vs 5.4 for ElevenLabs. TTS-1.5 Mini achieves <120ms P90 TTFA, the fastest realtime TTS available. Full platform: TTS + STT (voice profiling, semantic VAD) + Realtime API (full-duplex S2S, tool calling) + LLM Router (200+ models, A/B testing). On-premise for sovereignty. Training framework open-sourced. ElevenLabs migration tool available.
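To put the 59–61% win rates in perspective, they can be translated into a rating gap. The sketch below assumes the standard logistic Elo model; the page does not say how Artificial Analysis computes its ratings, so treat the result as an order-of-magnitude illustration rather than a reproduction of the leaderboard. The function names are hypothetical.

```python
import math


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Rating gap implied by an expected win rate (standard logistic Elo, assumed)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def win_rate_from_elo_gap(gap: float) -> float:
    """Inverse mapping: expected win rate for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


for rate in (0.59, 0.61):
    gap = elo_gap_from_win_rate(rate)
    print(f"{rate:.0%} win rate -> about {gap:.0f} Elo points")
```

Under that assumption, a 59–61% head-to-head win rate corresponds to a lead of roughly 60 to 80 rating points: a clear preference, but not a blowout.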

Strengths

  • ELO 1160 — #1 quality benchmark (March 2026)
  • <120ms P90 Mini — fastest realtime TTS
  • 116 ELO/dollar — best price-performance
  • Full platform: TTS + STT + Realtime S2S + LLM Router
  • On-premise option (GDPR, HIPAA, SOC2 Type II)
  • Free zero-shot voice cloning from 5–15s
  • Viseme timestamps for avatar lip-sync
  • Open-sourced training framework
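The "116 ELO/dollar" figure appears to be the ELO score divided by the list price per 1M characters (1160 / $10). A minimal sketch of that ratio follows, assuming this is how the metric is defined; the page does not spell out the formula, and the function name is hypothetical.

```python
def elo_per_dollar(elo: float, price_per_million_chars: float) -> float:
    """ELO points per dollar of list price for 1M characters (assumed definition)."""
    if price_per_million_chars <= 0:
        raise ValueError("price must be positive")
    return elo / price_per_million_chars


# Figures from this page: ELO 1160 at $10 per 1M chars.
print(elo_per_dollar(1160, 10.0))
```

Because the ratio scales inversely with price, it rewards cheap models strongly; it is best read alongside the raw ELO score rather than in place of it.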

Weaknesses

  • 9 languages (vs 70+ for ElevenLabs)
  • Emotion tags experimental outside English
  • Realtime API still maturing vs OpenAI Realtime
  • LLM Router model list changes frequently

Voice Capabilities

Voice Cloning: Yes

Zero-shot from 5–15 seconds (free). Text-based voice design (describe voice in natural language). ElevenLabs migration tool available. Cloned voices stable across extended outputs.

Emotion Control: Yes

[happy], [sad], [whisper] audio markup. Non-verbals: [cough], [sigh], [breathe]. Word/char/phoneme/viseme timestamps. +30% expressiveness vs TTS-1. Conversational intelligence: acoustic + metadata signals condition what is said, when, and how.
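As an illustration of the bracketed markup, the sketch below builds a marked-up input string and flags tags that are not in the list above. Only the tag names come from this page; where a tag must sit relative to the text it affects, and the helper names, are assumptions to check against the provider's documentation.

```python
import re

# Tag names listed on this page; anything else is treated as unknown.
EMOTION_TAGS = {"happy", "sad", "whisper"}
NON_VERBAL_TAGS = {"cough", "sigh", "breathe"}
KNOWN_TAGS = EMOTION_TAGS | NON_VERBAL_TAGS
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def with_emotion(emotion: str, text: str) -> str:
    """Prefix text with an emotion tag (prefix placement is an assumption)."""
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion tag: {emotion}")
    return f"[{emotion}] {text}"


def unknown_tags(marked_up: str) -> list[str]:
    """Return bracketed tags that are not in the known set, sorted."""
    return sorted({t for t in TAG_PATTERN.findall(marked_up) if t not in KNOWN_TAGS})


line = with_emotion("whisper", "Keep this between us. [sigh] It has been a long day.")
print(line)
print(unknown_tags(line + " [laugh]"))
```

Validating tags before sending text is cheap insurance, since emotion tags are noted as experimental outside English.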

Streaming: Yes

Streaming-native via WebSocket. P90 TTFA: <120ms Mini / <250ms Max. Median: <100ms Mini / <200ms Max. ~4× faster than TTS-1. Realtime API: full-duplex WebSocket/WebRTC speech-to-speech with turn detection, tool calling mid-session, provider-agnostic LLM routing.
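The median and P90 TTFA figures above can be checked independently against any streaming client. The sketch below measures time to first audio over a generic iterable of byte chunks and summarises samples with a nearest-rank percentile. It does not call the Inworld API: `fake_stream` is a stand-in, and in practice `open_stream` would wrap whatever WebSocket client is in use.

```python
import math
import time
from typing import Callable, Iterable, Iterator, Sequence


def measure_ttfa(open_stream: Callable[[], Iterable[bytes]],
                 clock: Callable[[], float] = time.perf_counter) -> float:
    """Seconds from opening the stream until the first non-empty audio chunk."""
    start = clock()
    for chunk in open_stream():
        if chunk:
            return clock() - start
    raise RuntimeError("stream ended before any audio arrived")


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fake_stream(delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS stream: waits, then yields audio chunks."""
    time.sleep(delay_s)
    yield b"\x00\x01"
    yield b"\x02\x03"


samples = [measure_ttfa(fake_stream) for _ in range(10)]
print(f"median ~ {percentile(samples, 50) * 1000:.0f} ms, "
      f"P90 ~ {percentile(samples, 90) * 1000:.0f} ms")
```

Measured this way, TTFA includes network round-trip and connection setup, so numbers from a client in Switzerland may differ from vendor-published figures.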

Lip-sync Data: Yes

Word, character, phoneme, and viseme-level timestamps. Unity/Unreal SDKs with lipsync templates.
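For an avatar pipeline that does not use the Unity/Unreal SDKs, viseme timestamps would typically be snapped to animation frames. The sketch below shows one way to do that. The `VisemeEvent` shape (label plus start time in seconds) is a hypothetical format for illustration; the actual timestamp payload is not described on this page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VisemeEvent:
    """Hypothetical timestamp record: viseme label and start time in seconds."""
    viseme: str
    start_s: float


def to_keyframes(events: Iterable[VisemeEvent], fps: int = 30) -> list[tuple[int, str]]:
    """Snap viseme events to the nearest frame index, ordered by frame.

    If two events land on the same frame, the later one wins.
    """
    frames: dict[int, str] = {}
    for event in sorted(events, key=lambda e: e.start_s):
        frames[round(event.start_s * fps)] = event.viseme
    return sorted(frames.items())


events = [VisemeEvent("aa", 0.12), VisemeEvent("sil", 0.0), VisemeEvent("oh", 0.31)]
for frame, viseme in to_keyframes(events, fps=30):
    print(frame, viseme)
```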

Pricing

Price / 1M chars: $10
Price / minute: $0.0100
Free tier: Limited free tier available

TTS-1.5-Max: $10/1M chars. TTS-1.5-Mini: $5/1M chars. Zero-shot cloning: free.
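A quick way to budget from these list prices is sketched below. The per-1M-character prices come from this page; the model keys are hypothetical identifiers, and the default of 1,000 characters per minute of speech is inferred from the $0.0100 per-minute figure at $10/1M chars rather than stated anywhere.

```python
# List prices from this page, in USD per 1M characters. Keys are hypothetical labels.
PRICE_PER_MILLION_CHARS = {"tts-1.5-max": 10.0, "tts-1.5-mini": 5.0}


def synthesis_cost(chars: int, model: str) -> float:
    """Estimated USD cost to synthesise `chars` characters at list price."""
    if chars < 0:
        raise ValueError("chars must be non-negative")
    return chars * PRICE_PER_MILLION_CHARS[model] / 1_000_000


def cost_per_minute(model: str, chars_per_minute: int = 1000) -> float:
    """Estimated USD per minute of speech; 1000 chars/min is an inferred assumption."""
    return synthesis_cost(chars_per_minute, model)


for model in PRICE_PER_MILLION_CHARS:
    print(f"{model}: ${cost_per_minute(model):.4f}/min, "
          f"${synthesis_cost(250_000, model):.2f} per 250k chars")
```

Real speaking rates vary by language and voice, so the per-minute figure is a planning estimate; billing is per character.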

Sovereignty & Compliance

On-premise: Yes

On-premise deployment available. EU + India data residency. SOC2 Type II, GDPR, HIPAA compliant.

GDPR: Compliant

Data residency: US, EU, India

Strategic & Business Analysis

Inworld TTS-1.5 + Realtime API — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for DigiDouble?

Inworld TTS-1.5 is the sovereignty-ready challenger: ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency already available — not promised for 2026.

Cloud + On-premise
Lock-in risk: Medium
Sovereignty fit: High
Open-source threat: Medium
Pricing: Falling ↓

A. Strategic Positioning

Target customer: Enterprise / Developer — regulated industries, gaming, real-time agents

ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency — the sovereignty-ready premium TTS.

B. Competitive Moat

  • ELO #1 benchmark quality (independent verification) with zero on-premise latency penalty
  • 75% cheaper than ElevenLabs — disrupting the premium TTS pricing model
  • Full on-premise + EU/India data residency + zero-retention option — sovereignty trifecta

Vulnerability: Open-source models (Chatterbox, Kokoro) are closing the quality gap, potentially commoditizing the premium TTS segment.

E. Strategic Questions for DigiDouble

Sovereignty fit

Full on-premise deployment with zero latency penalty + EU data residency + zero-retention option. Best sovereignty fit among premium TTS providers.

Build vs. Buy

Strong buy case for both Phase 1 and Phase 2. Best quality-sovereignty-cost combination among cloud TTS providers.

Lock-in risk

Proprietary models create some lock-in, but flexible deployment options and competitive pricing reduce dependency risk.

Roadmap alignment

Excellent alignment: on-premise already available (Phase 2 ready), EU data residency, competitive pricing for scale.

Data Freshness

Updated 30 April 2026

Artificial Analysis Speech Arena, March 2026. 4 of top 7 models are Inworld models.

Update note: Inworld TTS-1.5 Max ELO 1207 (rank #1, Apr 2026). Mini ELO 1149. Realtime API S2S latency <120ms Mini confirmed. Pricing: $10/1M chars (Max), $5/1M (Mini). On-premise available.