Voice PipelineSTT / Speech-to-Text

STT / Speech-to-Text

Comparison of speech recognition solutions for conversational pipelines (2025–2026). Benchmarks, strategic stakes, and decision questions.

🎯

Strategic Framing — STT: Much More Than a WER

The Real Question

WER doesn't tell the whole story. The real question is: where is your voice data processed? Who has access to it? What is your strategy if the provider raises prices or gets acquired?

Infrastructure Spectrum

Cloud-only (Google, AssemblyAI) → Cloud+VPC (Deepgram, Azure) → On-premise (Azure containers, Inworld) → Self-hosted open-source (Whisper, Voxtral). Each step increases sovereignty and reduces lock-in.

2026 Market Signal

Deepgram ($1.3B) and AssemblyAI ($158M) are acquisition targets. Voxtral and Whisper are closing the quality gap. Inworld STT raised prices 400%+. The market is consolidating fast.

Questions to ask before choosing: Is your users' voice data subject to GDPR or Swiss nLPD? Do you need diarization or PII redaction? Does your architecture allow switching to a self-hosted model without major rework? What is your acceptable WER threshold for your specific domain?

Real-time cloud STT APIs — Deepgram Nova-3 is the latency reference (75ms TTFA). Whisper large-v3 is the open-source quality standard. AssemblyAI Universal-2 leads the multilingual WER benchmark.

SolutionTTFA ?WER ?Streaming ?MultilingualDiarization ?Price/hrSovereign ?
Deepgram Nova-3
Phase 1 MVP — ASR streaming
75ms7.2%36 langs$0.216/hr
AssemblyAI Universal-2
Référence précision
150ms4.9%99 langs$0.372/hr
Google Speech-to-Text v2
Multilingue — option secondaire
200ms6.8%125 langs$0.36/hr
Azure Speech (Microsoft)
Souveraineté suisse — institutionnel
180ms5.9%100 langs$1/hr
Inworld STT
Phase 1 MVP — STT émotionnel + Axe 2 Avatar Behavior
92ms5%100 langs$0.36/hr
Deepgram Nova-3
Phase 1 MVP — ASR streaming
Latency75ms
10
Accuracy7.2% WER
8
Price access7/10
7

Deepgram Nova-3 is the industry reference for real-time voice agent ASR. 75ms P90 streaming latency, 36 languages, built-in VAD and endpointing. Optimized for conversational AI pipelines (LiveKit, Pipecat integrations). On-premise option available for partial sovereignty. Used by Tavus, Simli, and most commercial avatar platforms.

Full details →
AssemblyAI Universal-2
Référence précision
Latency150ms
7
Accuracy4.9% WER
10
Price access5/10
5

AssemblyAI Universal-2 achieves 4.9% WER — best-in-class accuracy among cloud ASR APIs. 99 languages, speaker diarization, sentiment analysis, and LeMUR AI features (summarization, Q&A on transcripts). 150ms streaming latency. No on-premise option limits sovereignty. Best choice when accuracy is the primary requirement.

Full details →
Google Speech-to-Text v2
Multilingue — option secondaire
Latency200ms
6
Accuracy6.8% WER
8
Price access6/10
6

Google STT v2 with Chirp 2 (USM 2B) covers 125 languages with competitive accuracy. gRPC bidirectional streaming. Deep integration with Google ecosystem (Dialogflow, Vertex AI). 200ms typical streaming latency. EU data residency available. No on-premise option.

Full details →
Azure Speech (Microsoft)
Souveraineté suisse — institutionnel
Latency180ms
7
Accuracy5.9% WER
9
Price access3/10
3

Azure Speech offers enterprise-grade STT with 100+ languages, custom model training, and Swiss data center (Zurich). 5.9% WER on English. Disconnected container deployment for partial sovereignty. Most expensive option but strongest enterprise compliance. Swiss German custom model available.

Full details →
Inworld STT
Phase 1 MVP — STT émotionnel + Axe 2 Avatar Behavior
Latency92ms
10
Accuracy5% WER
9
Price access7/10
7

Inworld STT (2025–2026) is the most feature-rich cloud STT API for interactive voice agents. Sub-100ms documented latency, 100+ languages via multi-provider routing (Whisper large-v3 + AssemblyAI). Unique real-time voice profiling extracts emotion (happy/calm/angry/frustrated), accent, age, pitch, and vocal style on every streaming chunk — enabling downstream routing (Condition Router) and adaptive TTS (Condition TTS). ZDR support. Integrates natively with Inworld TTS, LLM Router, and Realtime API for end-to-end voice pipelines.

Full details →