Cloud API · Commercial

Inworld STT

Voice profiling STT: <100ms, Realtime API $0.015/min, native voice cloning, 100+ languages

Latency (best case): 92ms
Latency (typical): 150ms
WER (general audio): 5%
Price per minute: $0.0060

Comparative Scores

Accuracy (WER): 9/10
Streaming latency: 10/10
Multilingual: 10/10
Sovereignty: 3/10
Price accessibility: 7/10
Streaming quality: 10/10

Architecture

Architecture: Multi-provider router (Whisper large-v3 + AssemblyAI Universal Streaming + proprietary)
Parameters: N/A (cloud router)
Languages: 100+
Self-hostable: No
Streaming: Yes
WER clean audio: 3%
GamiWays
Phase 1 MVP: emotional STT + Axis 2 Avatar Behavior

High strategic value for GamiWays Axis 2 (Avatar Behavior) and Emotional Toolbox. Voice profiling enables real-time emotion detection without a separate model — feeding directly into avatar expression selection and LLM prompt conditioning. <100ms latency compatible with Phase 1 target. Evaluate as primary STT if emotion-aware routing is required in Phase 1.
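As a rough illustration of how a per-chunk emotion signal could feed avatar expression selection and LLM prompt conditioning, here is a minimal sketch. The emotion labels (happy, calm, angry, frustrated) come from this profile; the `VoiceProfile` shape, the expression names, the confidence threshold, and both function names are hypothetical and do not reflect Inworld's actual response schema.

```python
from dataclasses import dataclass

# Emotion labels are the ones listed in this profile.
# The expression names on the right are hypothetical placeholders.
EXPRESSIONS = {
    "happy": "smile",
    "calm": "neutral",
    "angry": "concerned",
    "frustrated": "empathetic",
}


@dataclass
class VoiceProfile:
    """Hypothetical shape of one voice-profile signal (not Inworld's schema)."""
    emotion: str
    confidence: float


def select_expression(profile: VoiceProfile, min_confidence: float = 0.6) -> str:
    """Pick an avatar expression, falling back to neutral on weak or unknown signals."""
    if profile.confidence < min_confidence:
        return "neutral"
    return EXPRESSIONS.get(profile.emotion, "neutral")


def condition_prompt(base_prompt: str, profile: VoiceProfile,
                     min_confidence: float = 0.6) -> str:
    """Append an emotion hint to the LLM prompt only when the signal is usable."""
    if profile.confidence < min_confidence or profile.emotion not in EXPRESSIONS:
        return base_prompt
    return f"{base_prompt}\nThe user currently sounds {profile.emotion}. Adapt your tone accordingly."
```

Falling back to a neutral expression and an unmodified prompt on low-confidence signals keeps the avatar from flickering between states when the emotion estimate is noisy.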

Analysis

Inworld STT (2025–2026) is the most feature-rich cloud STT API for interactive voice agents. It documents sub-100ms latency and covers 100+ languages through multi-provider routing (Whisper large-v3 + AssemblyAI). Its real-time voice profiling extracts emotion (happy/calm/angry/frustrated), accent, age, pitch, and vocal style on every streaming chunk. The Realtime API (full STT+LLM+TTS pipeline) starts at $0.015/min, a quarter of the OpenAI Realtime price ($0.06/min). Native voice cloning covers built-in, cloned, and custom voices (up to 3,000 custom voices on the Growth plan). It also supports RAG via function calling (tool calling mid-conversation) and Zero Data Retention (ZDR). On-premise deployment is available on the Enterprise plan, and the API is drop-in compatible with the OpenAI Realtime API.

Strengths

  • <100ms documented streaming latency (92ms TTFA)
  • Realtime API $0.015/min — 4x cheaper than OpenAI Realtime
  • Native voice cloning: built-in + cloned + custom (up to 3,000 voices)
  • RAG via function calling mid-conversation
  • Voice profiling: emotion, accent, age, pitch, vocal style
  • 100+ languages (Whisper large-v3 backend)
  • Semantic + acoustic VAD
  • ZDR — audio never stored
  • On-premise available (Enterprise)
  • Drop-in compatible with OpenAI Realtime API
  • Condition Router: route by emotion/language/tier
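
The Condition Router item above (routing by emotion, language, or tier) can be pictured as a first-match rule table. The sketch below is illustrative only: the `Rule` class, the target names, and the example rules are hypothetical and are not Inworld's configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule; a field left as None matches any value."""
    target: str
    emotion: Optional[str] = None
    language: Optional[str] = None
    tier: Optional[str] = None

    def matches(self, emotion: str, language: str, tier: str) -> bool:
        pairs = ((self.emotion, emotion), (self.language, language), (self.tier, tier))
        return all(want is None or want == got for want, got in pairs)


# Example rules, evaluated top to bottom; the first match wins.
RULES = [
    Rule(target="human_escalation", emotion="angry", tier="premium"),
    Rule(target="deescalation_flow", emotion="frustrated"),
    Rule(target="french_agent", language="fr"),
]


def route(emotion: str, language: str, tier: str, default: str = "default_agent") -> str:
    """Return the target of the first matching rule, or the default target."""
    for rule in RULES:
        if rule.matches(emotion, language, tier):
            return rule.target
    return default
```

For example, `route("frustrated", "fr", "free")` returns `"deescalation_flow"` because the emotion rule is listed before the language rule; ordering the table is how priorities are expressed in this sketch.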

Weaknesses

  • On-premise Enterprise only (not available on lower tiers)
  • Pricing less transparent than Deepgram (credit-based system)
  • Multi-provider adds latency variability
  • Vendor lock-in risk if using full Inworld stack (STT+TTS+LLM)
  • No fine-tuning on custom data
  • 400%+ price increase reported in 2026 (market consolidation risk)

STT Capabilities

Streaming: Yes

Bidirectional WebSocket streaming. <100ms documented latency (92ms TTFA). Interim results with voice profile signals on every audio chunk. Semantic + acoustic VAD. Configurable endpointing.

Diarization: Yes
Custom Vocabulary: Yes
Word Timestamps: Yes
Auto Punctuation: Yes
Multilingual: Yes (100+ languages)

Pricing

Price / minute: $0.0060
Price / hour: $0.360
Free tier: Available. Growth plan: $1,500 credits/month with 40% off.

STT only: $0.006–0.012/min depending on model. Realtime API (full STT+LLM+TTS pipeline): from $0.015/min (vs OpenAI Realtime at $0.06/min). Growth plan: 40% discount ($1,500 credits/month). On-prem: Enterprise only. Free tier available.
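
The arithmetic behind these figures can be checked with a short sketch. The per-minute prices are the ones quoted in this profile; the monthly usage volume is a hypothetical input, and applying the 40% Growth discount as a flat multiplier on usage is an assumption, since the credit-based billing details are not given here.

```python
# Per-minute prices quoted in this profile (USD).
STT_PRICE_PER_MIN = 0.006
INWORLD_REALTIME_PER_MIN = 0.015
OPENAI_REALTIME_PER_MIN = 0.06


def hourly(price_per_min: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return round(price_per_min * 60, 4)


def monthly_cost(price_per_min: float, minutes_per_month: int,
                 discount: float = 0.0) -> float:
    """Estimate monthly spend. The flat discount multiplier is an assumption."""
    return round(price_per_min * minutes_per_month * (1 - discount), 2)


def price_ratio(reference: float, candidate: float) -> float:
    """How many times more expensive the reference is than the candidate."""
    return round(reference / candidate, 2)


print(hourly(STT_PRICE_PER_MIN))                                    # 0.36
print(price_ratio(OPENAI_REALTIME_PER_MIN, INWORLD_REALTIME_PER_MIN))  # 4.0
print(monthly_cost(INWORLD_REALTIME_PER_MIN, 10_000))               # 150.0
print(monthly_cost(INWORLD_REALTIME_PER_MIN, 10_000, discount=0.40))   # 90.0
```

The first two results reproduce the $0.360/hour and 4x figures stated in this profile; the 10,000 minutes/month volume is only a placeholder for a real usage estimate.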

Sovereignty & Compliance

On-premise: Enterprise plan only

Lower tiers are cloud API only. Zero Data Retention (ZDR) is available: audio is processed in real time and never stored.

GDPR: Compliant

Data residency: US (default). EU data residency on Enterprise.

Strategic & Business Analysis

Inworld STT — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for GamiWays?

Inworld STT offers the best sovereignty credentials among commercial STT providers — full on-premise, EU data residency, SOC 2 Type II + GDPR + HIPAA. But its 400%+ pricing increases are a strategic red flag that accelerates open-source migration.

Cloud + On-premise
Lock-in risk: High
Sovereignty fit: High
Open-source threat: Medium
Pricing: Rising ↑

A. Strategic Positioning

Target customer: Developer / Enterprise — gaming, regulated industries, real-time interactive experiences

Sub-200ms real-time STT with full on-premise deployment, EU/India data residency, and SOC 2 Type II + GDPR + HIPAA compliance.

B. Competitive Moat

  • Sub-200ms real-time STT with high accuracy — surpassing comparable models in latency
  • Full on-premise deployment + EU/India data residency — sovereignty trifecta
  • SOC 2 Type II, GDPR, HIPAA — enterprise compliance for regulated industries

Vulnerability: Significant pricing increases (400%+) reported by users could push clients toward open-source alternatives. Open-source models closing the quality gap.

E. Strategic Questions for GamiWays

Sovereignty fit

Full on-premise deployment + EU data residency + SOC 2 Type II + GDPR + HIPAA. Best sovereignty fit among commercial STT providers.

Build vs. Buy

Buy for Phase 1 real-time requirements. Monitor pricing carefully. For Phase 2, evaluate Whisper/Voxtral self-hosted as cost-effective sovereignty alternative.

Lock-in risk

Proprietary models + significant pricing increases create high lock-in risk. On-premise deployment reduces cloud dependency but not vendor dependency.

Roadmap alignment

Good for Phase 1 real-time agents. Phase 2 sovereignty is technically satisfied but pricing risk is high. Monitor pricing trajectory carefully.

Data Freshness

Updated 2 May 2026

https://inworld.ai/resources/best-speech-to-text-apis

Update note: Realtime API at $0.015/min confirmed (May 2026), 4x cheaper than OpenAI Realtime. Native voice cloning confirmed: built-in + cloned + custom (up to 3,000 custom voices on Growth). RAG via function calling mid-conversation. On-premise available on Enterprise. ZDR on all plans. EU data residency on Enterprise. Price increase of 400%+ reported in 2026.