Inworld STT
Voice profiling STT: <100ms latency, Realtime API at $0.015/min, native voice cloning, 100+ languages
Comparative Scores
Architecture
High strategic value for GamiWays Axis 2 (Avatar Behavior) and the Emotional Toolbox. Voice profiling enables real-time emotion detection without a separate model, feeding directly into avatar expression selection and LLM prompt conditioning. The <100ms latency is compatible with the Phase 1 target. Evaluate as the primary STT if emotion-aware routing is required in Phase 1.
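A minimal sketch of how a per-chunk emotion label could drive avatar expression selection and LLM prompt conditioning. The four emotion labels are the ones named in this review; the `VoiceProfile` type, the confidence field, the expression names, and the prompt hints are all hypothetical and do not reflect Inworld's actual response schema.

```python
from dataclasses import dataclass


# Hypothetical shape of a voice-profile signal; not Inworld's actual schema.
@dataclass(frozen=True)
class VoiceProfile:
    emotion: str       # one of: happy, calm, angry, frustrated
    confidence: float  # assumed to lie between 0.0 and 1.0


# Illustrative mapping: emotion -> (avatar expression, prompt hint).
EMOTION_ROUTES = {
    "happy": ("smile", "The user sounds upbeat; match their energy."),
    "calm": ("neutral", "The user sounds calm; keep a steady tone."),
    "angry": ("concerned", "The user sounds angry; de-escalate and be concise."),
    "frustrated": ("empathetic", "The user sounds frustrated; acknowledge it and simplify."),
}
DEFAULT_ROUTE = ("neutral", "")


def route(profile: VoiceProfile, min_confidence: float = 0.6) -> tuple[str, str]:
    """Return (avatar_expression, prompt_hint); fall back to neutral when unsure."""
    if profile.confidence < min_confidence:
        return DEFAULT_ROUTE
    return EMOTION_ROUTES.get(profile.emotion, DEFAULT_ROUTE)


def condition_prompt(base_prompt: str, profile: VoiceProfile) -> str:
    """Append the emotion hint to the system prompt when one applies."""
    _, hint = route(profile)
    return f"{base_prompt}\n{hint}" if hint else base_prompt
```

The confidence threshold is a design choice for this sketch: a low-confidence label falls back to a neutral expression rather than risking a visibly wrong avatar reaction.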
Analysis
Inworld STT (2025–2026) is the most feature-rich cloud STT API for interactive voice agents. It documents sub-100ms latency and covers 100+ languages through multi-provider routing (Whisper large-v3 + AssemblyAI). Its distinguishing feature is real-time voice profiling, which extracts emotion (happy/calm/angry/frustrated), accent, age, pitch, and vocal style on every streaming chunk. The Realtime API (full STT+LLM+TTS pipeline) starts at $0.015/min, 4x cheaper than OpenAI Realtime ($0.06/min). Native voice cloning covers built-in, cloned, and custom voices (up to 3,000 custom voices on the Growth plan). RAG is supported via function calling (tool calls mid-conversation). Zero Data Retention (ZDR) is supported, and on-premise deployment is available on the Enterprise plan. The API is drop-in compatible with the OpenAI Realtime API.
Strengths
- <100ms documented streaming latency (92ms TTFA)
- Realtime API $0.015/min — 4x cheaper than OpenAI Realtime
- Native voice cloning: built-in + cloned + custom (up to 3,000 voices)
- RAG via function calling mid-conversation
- Voice profiling: emotion, accent, age, pitch, vocal style
- 100+ languages (Whisper large-v3 backend)
- Semantic + acoustic VAD
- ZDR — audio never stored
- On-premise available (Enterprise)
- Drop-in compatible with OpenAI Realtime API
- Condition Router: route by emotion/language/tier
Weaknesses
- On-premise Enterprise only (not available on lower tiers)
- Pricing less transparent than Deepgram (credit-based system)
- Multi-provider adds latency variability
- Vendor lock-in risk if using full Inworld stack (STT+TTS+LLM)
- No fine-tuning on custom data
- 400%+ price increase reported in 2026 (market consolidation risk)
STT Capabilities
Bidirectional WebSocket streaming with <100ms documented latency (92ms TTFA). Interim results carry voice profile signals on every audio chunk. Semantic + acoustic VAD with configurable endpointing.
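Because profile signals arrive with every audio chunk, a consumer will probably want to smooth them before acting, so that a single outlier chunk does not flip the avatar's expression. The sketch below shows one possible approach, a sliding-window majority vote. The class name, the window size, and the assumption that each chunk yields one plain string label are illustrative and not part of Inworld's API.

```python
from collections import Counter, deque


class EmotionSmoother:
    """Majority vote over the last `window` per-chunk emotion labels."""

    def __init__(self, window: int = 5) -> None:
        self._recent = deque(maxlen=window)

    def update(self, label: str) -> str:
        """Record the newest label and return the current majority label."""
        self._recent.append(label)
        counts = Counter(self._recent)
        top = max(counts.values())
        # Break ties in favour of the most recent label among the leaders.
        for candidate in reversed(self._recent):
            if counts[candidate] == top:
                return candidate
        return label  # unreachable: the deque is never empty here
```

With a window of 3, the sequence calm, calm, angry still reports "calm"; only a second consecutive "angry" chunk changes the output. A larger window is steadier but reacts more slowly, which trades against the latency target.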
Languages supported: 100+
Pricing
STT only: $0.006–0.012/min depending on the model. Realtime API (full STT+LLM+TTS pipeline): from $0.015/min (vs. $0.06/min for OpenAI Realtime). Growth plan: 40% discount ($1,500 in credits/month). On-premise: Enterprise only. Free tier available.
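The "4x cheaper" figure follows directly from the two per-minute prices quoted in this review. The short sketch below reproduces the arithmetic and projects a monthly bill. The 10,000 minutes/month volume in the usage note is a hypothetical workload, not a GamiWays estimate, and plan discounts and credits are ignored.

```python
from decimal import Decimal

# Per-minute list prices quoted in this review (USD).
INWORLD_REALTIME = Decimal("0.015")
OPENAI_REALTIME = Decimal("0.06")


def monthly_cost(minutes: int, price_per_min: Decimal) -> Decimal:
    """Cost in USD for a given number of billed minutes at a flat rate."""
    return minutes * price_per_min


def price_ratio(expensive: Decimal, cheap: Decimal) -> Decimal:
    """How many times more expensive one per-minute rate is than another."""
    return expensive / cheap
```

For example, `price_ratio(OPENAI_REALTIME, INWORLD_REALTIME)` evaluates to 4, and at a hypothetical 10,000 minutes/month `monthly_cost` gives $150 for Inworld Realtime against $600 for OpenAI Realtime. `Decimal` is used instead of floats so the ratio comes out exact.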
Sovereignty & Compliance
On-premise: available on the Enterprise plan only; lower tiers are cloud API only.
Data residency: US (default). EU data residency on Enterprise.
Zero Data Retention (ZDR) available: audio is processed in real time and never stored.
Inworld STT — Strategic Positioning
Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for GamiWays?
Inworld STT offers the best sovereignty credentials among commercial STT providers: on-premise deployment (Enterprise), EU data residency, and SOC 2 Type II + GDPR + HIPAA compliance. However, the reported 400%+ price increases are a strategic red flag that could accelerate migration to open-source alternatives.
A. Strategic Positioning
Target customer: Developer / Enterprise — gaming, regulated industries, real-time interactive experiences
Value proposition: sub-100ms real-time STT (92ms TTFA documented) with on-premise deployment on Enterprise, EU/India data residency, and SOC 2 Type II + GDPR + HIPAA compliance.
B. Competitive Moat
- Sub-100ms real-time STT (92ms TTFA documented) with high accuracy
- Full on-premise deployment + EU/India data residency — sovereignty trifecta
- SOC 2 Type II, GDPR, HIPAA — enterprise compliance for regulated industries
Vulnerability: Significant pricing increases (400%+) reported by users could push clients toward open-source alternatives, and open-source models are closing the quality gap.
E. Strategic Questions for GamiWays
Sovereignty fit
Full on-premise deployment + EU data residency + SOC 2 Type II + GDPR + HIPAA. Best sovereignty fit among commercial STT providers.
Build vs. Buy
Buy for Phase 1 real-time requirements. Monitor pricing carefully. For Phase 2, evaluate Whisper/Voxtral self-hosted as cost-effective sovereignty alternative.
Lock-in risk
Proprietary models + significant pricing increases create high lock-in risk. On-premise deployment reduces cloud dependency but not vendor dependency.
Roadmap alignment
Good for Phase 1 real-time agents. Phase 2 sovereignty is technically satisfied but pricing risk is high. Monitor pricing trajectory carefully.
Data Freshness
https://inworld.ai/resources/best-speech-to-text-apis
Update note: Realtime API at $0.015/min confirmed (May 2026), 4x cheaper than OpenAI Realtime. Native voice cloning confirmed: built-in + cloned + custom (up to 3,000 custom voices on Growth). RAG via function calling mid-conversation. On-premise available on Enterprise. ZDR on all plans. EU data residency on Enterprise. 400%+ price increase reported in 2026.