Cloud API · #1 Artificial Analysis · Commercial (training framework open-sourced)

Inworld TTS-1.5 + Realtime API

#1 quality benchmark — ELO 1160, sub-120ms Mini, Realtime S2S + STT + LLM Router

TTFA (best case): 130ms

TTFA (typical): 250ms

Price per million chars: $10

ELO Score: 1160

Comparative Scores

Voice quality: 10/10

Latency: 8/10

Voice cloning: 9/10

Expressiveness: 9/10

Sovereignty: 6/10

Price accessibility: 8/10

Multilingual: 6/10

Architecture

Architecture: SpeechLM (streaming-native, quantization-aware)

Parameters: N/A (cloud)

Languages: 9

Self-hostable: Yes

Streaming: Yes

DigiDouble

Phase 1 MVP: Quality + Sovereignty + Full pipeline

Top candidate for Phase 1 MVP. Best quality/cost ratio (#1 ELO, lowest price). Realtime API (S2S + STT + LLM Router) enables full voice pipeline in a single provider. On-premise option aligns with Swiss sovereignty requirement. Viseme timestamps directly usable for avatar lip-sync (Axis 2). ElevenLabs migration tool simplifies transition.

Analysis

Inworld TTS-1.5 holds #1 position on Artificial Analysis (ELO 1160, March 2026) with 59–61% win rates vs ElevenLabs, Cartesia, and OpenAI. Best price-performance: 116 ELO/dollar vs 5.4 for ElevenLabs. TTS-1.5 Mini achieves <120ms P90 — fastest realtime TTS available. Full platform: TTS + STT (voice profiling, semantic VAD) + Realtime API (full-duplex S2S, tool calling) + LLM Router (200+ models, A/B testing). On-premise for sovereignty. Training framework open-sourced. ElevenLabs migration tool available.
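The price-performance figure above can be reproduced with simple arithmetic. The sketch below assumes that "ELO/dollar" means the ELO score divided by the list price per million characters; the function name is illustrative, and only the Inworld inputs (ELO 1160, $10/1M chars) come from this page.

```python
def elo_per_dollar(elo: float, price_per_million_chars: float) -> float:
    """ELO points per dollar of list price for one million characters.

    Assumption: this is how the page's "ELO/dollar" metric is defined.
    """
    if price_per_million_chars <= 0:
        raise ValueError("price_per_million_chars must be positive")
    return elo / price_per_million_chars


# Inworld TTS-1.5 Max, using the figures quoted on this page.
print(elo_per_dollar(1160, 10.0))  # 116.0
```

Because the metric divides by list price, it moves whenever either the benchmark score or the pricing changes, so it is best recomputed from current figures rather than quoted as a constant.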

Strengths

ELO 1160 — #1 quality benchmark (March 2026)
<120ms P90 Mini — fastest realtime TTS
116 ELO/dollar — best price-performance
Full platform: TTS + STT + Realtime S2S + LLM Router
On-premise option (GDPR, HIPAA, SOC2 Type II)
Free zero-shot voice cloning from 5–15s
Viseme timestamps for avatar lip-sync
Open-sourced training framework

Weaknesses

9 languages (vs 70+ for ElevenLabs)
Emotion tags experimental outside English
Realtime API still maturing vs OpenAI Realtime
LLM Router model list changes frequently

Voice Capabilities

Voice Cloning: Yes

Zero-shot from 5–15 seconds (free). Text-based voice design (describe voice in natural language). ElevenLabs migration tool available. Cloned voices stable across extended outputs.

Emotion Control: Yes

[happy], [sad], [whisper] audio markup. Non-verbals: [cough], [sigh], [breathe]. Word/char/phoneme/viseme timestamps. +30% expressiveness vs TTS-1. Conversational intelligence: acoustic + metadata signals condition what is said, when, and how.
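As a rough illustration of the bracketed audio markup, the sketch below checks a script for tags outside the six listed above and strips the markup to recover the plain text. The tag set is limited to what this page names and may not be exhaustive; the helper names and the regular expression are assumptions, not part of any Inworld SDK.

```python
import re
from typing import List

# Tags named on this page; the provider's real list may be longer.
EMOTION_TAGS = {"happy", "sad", "whisper"}
NON_VERBAL_TAGS = {"cough", "sigh", "breathe"}
KNOWN_TAGS = EMOTION_TAGS | NON_VERBAL_TAGS

# Assumed syntax: a lowercase word wrapped in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def unknown_tags(text: str) -> List[str]:
    """Return bracketed tags that are not in KNOWN_TAGS, in order of appearance."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


def strip_tags(text: str) -> str:
    """Remove all bracketed tags and collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


script = "[happy] Hello! [sigh] Long day. [angry] What?"
print(unknown_tags(script))  # ['angry']
print(strip_tags(script))    # Hello! Long day. What?
```

A check like this is mainly useful before sending non-English text, where the page notes that emotion tags are still experimental.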

Streaming: Yes

Streaming-native via WebSocket. P90 TTFA: <120ms Mini / <250ms Max. Median: <100ms Mini / <200ms Max. ~4× faster than TTS-1. Realtime API: full-duplex WebSocket/WebRTC speech-to-speech with turn detection, tool calling mid-session, provider-agnostic LLM routing.
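To check the P90 and median TTFA figures against your own network path, one option is to time the first audio chunk of each request and take percentiles over many runs. The sketch below is provider-agnostic: `open_stream` stands in for whatever client call yields audio chunks, and `fake_stream` is a hypothetical stand-in that only simulates latency. No real Inworld API is called.

```python
import math
import time
from typing import Callable, Iterable, Iterator, List


def time_to_first_audio(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from starting the request until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stream: waits delay_s, then yields one 320-byte chunk."""
    time.sleep(delay_s)
    yield b"\x00" * 320


samples = [time_to_first_audio(lambda d=d: fake_stream(d)) for d in (0.01, 0.02, 0.03)]
print(f"median={percentile(samples, 50):.3f}s  p90={percentile(samples, 90):.3f}s")
```

Measured this way, TTFA includes connection setup and network round trips, so results from a client in Switzerland may differ from the vendor's published numbers.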

Lip-sync Data: Yes

Word, character, phoneme, and viseme-level timestamps. Unity/Unreal SDKs with lipsync templates.
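For avatar lip-sync, viseme timestamps are typically consumed by looking up which viseme is active at the current playback time. The sketch below assumes each event carries a start time in seconds and a viseme label; the `VisemeEvent` shape and the labels are hypothetical, since the actual payload format is not described on this page.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VisemeEvent:
    start_s: float  # when this viseme becomes active, in seconds
    viseme: str     # hypothetical label, e.g. "sil", "aa", "PP"


def active_viseme(events: Sequence[VisemeEvent], t: float) -> Optional[str]:
    """Return the viseme active at time t, or None before the first event.

    events must be sorted by start_s.
    """
    starts = [event.start_s for event in events]
    index = bisect_right(starts, t) - 1
    return events[index].viseme if index >= 0 else None


timeline = [VisemeEvent(0.0, "sil"), VisemeEvent(0.12, "aa"), VisemeEvent(0.30, "PP")]
print(active_viseme(timeline, 0.20))  # aa
```

In a render loop, the start times would normally be extracted once rather than on every call; the list is rebuilt here only to keep the example short.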

Pricing

Price / 1M chars: $10

Price / minute: $0.0100

Free tier: Limited free tier available

TTS-1.5-Max: $10/1M chars. TTS-1.5-Mini: $5/1M chars. Zero-shot cloning: free.
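The per-character prices translate into a budget estimate as sketched below. The prices are the ones listed above; the speaking rate of 1,000 characters per minute is an assumption chosen because it reproduces the $0.0100 per-minute figure, and the function names are illustrative.

```python
# List prices from this page, in USD per one million characters.
PRICE_PER_MILLION_CHARS = {"TTS-1.5-Max": 10.0, "TTS-1.5-Mini": 5.0}


def synthesis_cost(num_chars: int, model: str) -> float:
    """Estimated USD cost of synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * PRICE_PER_MILLION_CHARS[model] / 1_000_000


def cost_per_minute(model: str, chars_per_minute: int = 1000) -> float:
    """Estimated USD cost per minute of speech.

    Assumption: about 1,000 characters are spoken per minute.
    """
    return synthesis_cost(chars_per_minute, model)


print(cost_per_minute("TTS-1.5-Max"))           # 0.01
print(synthesis_cost(2_000_000, "TTS-1.5-Mini"))  # 10.0
```

Real speaking rates vary by language and voice, so the per-minute figure is best treated as an order-of-magnitude guide.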

Sovereignty & Compliance

On-premise: Yes

On-premise deployment available. EU + India data residency. SOC2 Type II, GDPR, HIPAA compliant.

GDPR: Compliant

Data residency: US, EU, India

Strategic & Business Analysis

Inworld TTS-1.5 + Realtime API — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for DigiDouble?

Inworld TTS-1.5 is the sovereignty-ready challenger: ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency already available — not promised for 2026.

Cloud + On-premise

Lock-in risk: Medium

Sovereignty fit: High

Open-source threat: Medium

Pricing: Falling ↓

A. Strategic Positioning

Target customer: Enterprise / Developer — regulated industries, gaming, real-time agents

ELO #1 quality at 75% lower cost than ElevenLabs, with full on-premise deployment and EU data residency — the sovereignty-ready premium TTS.

B. Competitive Moat

ELO #1 benchmark quality (independent verification) with zero on-premise latency penalty
75% cheaper than ElevenLabs — disrupting the premium TTS pricing model
Full on-premise + EU/India data residency + zero-retention option — sovereignty trifecta

Vulnerability: Open-source models (Chatterbox, Kokoro) are closing the quality gap, potentially commoditizing the premium TTS segment.

E. Strategic Questions for DigiDouble

Sovereignty fit

Full on-premise deployment with zero latency penalty + EU data residency + zero-retention option. Best sovereignty fit among premium TTS providers.

Build vs. Buy

Strong buy case for both Phase 1 and Phase 2. Best quality-sovereignty-cost combination among cloud TTS providers.

Lock-in risk

Proprietary models create some lock-in, but flexible deployment options and competitive pricing reduce dependency risk.

Roadmap alignment

Excellent alignment: on-premise already available (Phase 2 ready), EU data residency, competitive pricing for scale.

Data Freshness

Updated 30 April 2026

Artificial Analysis Speech Arena, March 2026. 4 of top 7 models are Inworld models.

Update note: Inworld TTS-1.5 Max ELO 1207 (rank #1, Apr 2026). Mini ELO 1149. Realtime API S2S latency <120ms Mini confirmed. Pricing: $10/1M chars (Max), $5/1M (Mini). On-premise available.

Reference Sources

Inworld Pricing (pricing)
Inworld TTS Docs (docs)
Artificial Analysis TTS Arena (benchmark)
Inworld Realtime API (docs)