Deepgram Nova-3
Fastest streaming ASR: 75 ms latency, 36 languages, optimized for voice agents
Architecture
Primary candidate for the Phase 1 MVP ASR. Its 75 ms latency is critical for keeping the end-to-end pipeline under 2 s, and the on-premise option aligns with Swiss sovereignty requirements. Audiogami (Gamilab) is already in production, so Deepgram serves as a fallback and comparison baseline.
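The sub-2 s target can be checked with simple arithmetic. A minimal sketch, using only the 2,000 ms target and the 75 ms ASR figure from this note; it assumes per-stage latencies add linearly, which is a simplification (P90 values do not strictly sum), and the stage names in the usage line are hypothetical.

```python
# Minimal latency-budget sketch for the Phase 1 voice pipeline.
# Assumption: per-stage latencies are treated as additive (a simplification).

PIPELINE_BUDGET_MS = 2000  # "sub-2s pipeline" target from this note
ASR_P90_MS = 75            # Nova-3 P90 streaming latency claimed in this note


def remaining_budget_ms(other_stages_ms: dict[str, int],
                        budget_ms: int = PIPELINE_BUDGET_MS,
                        asr_ms: int = ASR_P90_MS) -> int:
    """Return the milliseconds left after ASR and the other listed stages.

    A negative result means the pipeline exceeds its budget.
    """
    return budget_ms - asr_ms - sum(other_stages_ms.values())


if __name__ == "__main__":
    # Budget left for everything downstream of ASR (LLM, TTS, avatar, network).
    print(remaining_budget_ms({}))
    # Hypothetical downstream stage figures, for illustration only.
    print(remaining_budget_ms({"llm": 800, "tts": 300}))
```

With ASR at 75 ms, roughly 1.9 s remains for the rest of the pipeline, which is why ASR latency is rarely the bottleneck once it is this low.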
Analysis
Deepgram Nova-3 is the industry reference for real-time voice-agent ASR, with 75 ms P90 streaming latency, 36 languages, and built-in VAD and endpointing. It is optimized for conversational AI pipelines through native LiveKit and Pipecat integrations, and an on-premise option is available for partial sovereignty. It is used by Tavus, Simli, and most commercial avatar platforms.
Strengths
- 75ms P90 streaming latency
- Built-in VAD + endpointing
- 36 languages
- On-premise option
- LiveKit/Pipecat native integration
- Speaker diarization + word timestamps
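The streaming features listed above (model selection, built-in VAD, endpointing, diarization) are configured through query parameters on Deepgram's live `/v1/listen` WebSocket endpoint. The sketch below only builds the connection URL; the parameter names reflect Deepgram's public API documentation as we understand it and should be checked against the current reference. The audio format (16 kHz linear PCM) and the 300 ms endpointing value are illustrative assumptions, and the self-hosted host used in the test is hypothetical.

```python
from urllib.parse import urlencode

# Standard Deepgram cloud host; an EU or self-hosted host can be passed instead.
DEFAULT_HOST = "api.deepgram.com"


def build_listen_url(language: str = "en",
                     endpointing_ms: int = 300,
                     host: str = DEFAULT_HOST,
                     secure: bool = True) -> str:
    """Build a live-streaming URL for Nova-3 (sketch; verify parameter names)."""
    params = {
        "model": "nova-3",
        "language": language,
        "encoding": "linear16",         # assumption: raw 16-bit PCM from the client
        "sample_rate": 16000,           # assumption: 16 kHz capture
        "interim_results": "true",      # partial transcripts while the user speaks
        "vad_events": "true",           # speech-start events from the built-in VAD
        "endpointing": endpointing_ms,  # silence (ms) before a segment is finalized
        "diarize": "true",              # per-word speaker labels
    }
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}/v1/listen?{urlencode(params)}"


if __name__ == "__main__":
    print(build_listen_url(language="de"))
```

A client would open a WebSocket to this URL with an `Authorization: Token <API_KEY>` header and stream raw audio frames; word-level timestamps arrive in each result's `words` array.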
Weaknesses
- Cloud-first (limited sovereignty)
- 7.2% WER on English (not best-in-class)
- French/German WER higher (~15%)
- No open weights
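The WER figures above (7.2% English, ~15% French/German) are word error rates: word-level substitutions, deletions, and insertions divided by the number of reference words. A minimal sketch of that definition, which could be used to rerun the comparison on in-house French and German recordings; it applies no text normalization (casing, punctuation, numerals), which published benchmarks usually do, so its numbers will not match vendor figures directly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed as a word-level Levenshtein distance; no normalization applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming: prev holds edit distances between the
    # previous reference prefix and every prefix of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```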
Pricing
$0.0036/min (Pay-as-you-go). $0.0024/min (Growth plan). On-premise: custom pricing.
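At per-minute pricing, the monthly bill is audio minutes times rate. A sketch using the rates stated in this note, including the $0.0043/min PAYG figure from the update note at the end, and a hypothetical volume of 10,000 minutes per month; rates should be confirmed against Deepgram's current price list, since streaming and pre-recorded audio may be billed differently.

```python
# Per-minute rates quoted in this note (USD). Treat them as inputs to confirm,
# not as authoritative prices.
RATES_PER_MIN = {
    "payg": 0.0036,
    "growth": 0.0024,
    "payg_jan_2026": 0.0043,  # figure from the Jan 2026 update note
}


def monthly_cost_usd(minutes_per_month: float, plan: str) -> float:
    """Estimate the monthly STT bill as minutes x per-minute rate, in USD."""
    return round(minutes_per_month * RATES_PER_MIN[plan], 2)


if __name__ == "__main__":
    # Hypothetical volume: 10,000 audio minutes per month.
    for plan in RATES_PER_MIN:
        print(plan, monthly_cost_usd(10_000, plan))
```

At MVP volumes the STT line item stays small, so the Growth plan matters mainly once traffic scales.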
Sovereignty & Compliance
Self-hosted Docker deployment. Enterprise on-premise available. Partial sovereignty.
Data residency: US (default). EU data residency available.
On-premise deployment available (enterprise). Self-hosted via Docker.
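The residency options above (US default, EU residency, enterprise self-hosting) can be enforced in code by routing each request according to the sensitivity of its data. A minimal sketch: the data classes, their preference order, and the EU and self-hosted host names are hypothetical placeholders; only `api.deepgram.com` is Deepgram's standard API host.

```python
from enum import Enum


class Residency(Enum):
    US = "us"                    # Deepgram cloud default
    EU = "eu"                    # EU data residency option
    SELF_HOSTED = "self_hosted"  # enterprise Docker deployment


# Placeholder hosts: the EU and self-hosted values must be confirmed.
HOSTS = {
    Residency.US: "api.deepgram.com",
    Residency.EU: "api.eu.deepgram.com",               # assumption
    Residency.SELF_HOSTED: "stt.internal.example:8080",  # hypothetical
}

# Hypothetical data classes, each with acceptable deployments in preference order.
POLICY = {
    "public": [Residency.US, Residency.EU, Residency.SELF_HOSTED],
    "eu_personal_data": [Residency.EU, Residency.SELF_HOSTED],
    "swiss_sovereign": [Residency.SELF_HOSTED],
}


def select_host(data_class: str, available: set[Residency]) -> str:
    """Return the first available host that satisfies the residency policy."""
    for option in POLICY[data_class]:
        if option in available:
            return HOSTS[option]
    raise RuntimeError(f"no deployment satisfies residency policy {data_class!r}")


if __name__ == "__main__":
    print(select_host("eu_personal_data", {Residency.US, Residency.EU}))
```

Failing closed (raising rather than silently falling back to the US default) is the point of the design: without a self-hosted deployment, sovereign traffic has nowhere to go.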
Deepgram Nova-3: Strategic Positioning
Beyond the technical specs: where does this tool sit in the ecosystem, and what are its risks and strategic implications for DigiDouble?
Deepgram Nova-3 is the enterprise STT leader, with 54.2% lower WER on noisy audio, but its VPC-only stance (no full on-premise) limits its sovereignty appeal for regulated European deployments.
A. Strategic Positioning
Target customers: developers and enterprises building voice agents and real-time transcription.
A unified STT+TTS+LLM platform with 54.2% lower WER on noisy audio than competitors, positioned as the backbone of voice AI infrastructure.
B. Competitive Moat
- 54.2% lower WER on noisy audio vs competitors including hyperscalers
- Unified STT+TTS+LLM API, reducing integration complexity and end-to-end latency
- Series C: $130M (Jan 2026) at a $1.3B valuation, giving financial strength for R&D
Vulnerabilities: open-source models (Whisper, Voxtral) are catching up; there is no full on-premise STT option (VPC only); and competitors exert pricing pressure.
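The 54.2% figure is a relative reduction, so its absolute meaning depends on the competitor baseline. With W_comp the competitor's WER and W_nova Nova-3's WER on the same noisy test set:

$$
\frac{W_{\text{comp}} - W_{\text{nova}}}{W_{\text{comp}}} = 0.542
\quad\Longrightarrow\quad
W_{\text{nova}} = (1 - 0.542)\,W_{\text{comp}} = 0.458\,W_{\text{comp}}
$$

For a hypothetical 20% baseline this gives 9.16% for Nova-3, an absolute gap of 10.84 points; the same relative claim against a 10% baseline would be a gap of only about 5.4 points.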
E. Strategic Questions for DigiDouble
Sovereignty fit
EU data residency is available via VPC, but there is no full on-premise STT. Moderate sovereignty fit: better than pure cloud, worse than self-hosted.
Build vs. Buy
Buy for Phase 1 (best accuracy, unified platform). Evaluate Whisper/Voxtral self-hosted for Phase 2 sovereignty.
Lock-in risk
Unified STT+TTS platform creates integration lock-in. VPC deployment and competitive pricing reduce dependency risk.
Roadmap alignment
Good for Phase 1 voice agents. Phase 2 sovereignty requires on-premise STT, so consider self-hosted Whisper or Voxtral.
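Both the lock-in question and the Phase 2 path suggest keeping Deepgram behind a thin provider interface, so that swapping in a self-hosted Whisper or Voxtral engine, or running Audiogami as primary with Deepgram as fallback, becomes an adapter change rather than a rewrite. A minimal sketch of that pattern; `SttProvider`, `FallbackStt`, and the two stub classes are hypothetical names, and the stubs stand in for real vendor adapters.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Transcript:
    text: str
    provider: str


class SttProvider(Protocol):
    """Minimal provider contract; real adapters would wrap each vendor SDK."""
    name: str

    def transcribe(self, audio: bytes, language: str) -> Transcript: ...


class FallbackStt:
    """Try providers in order and move to the next one if a call fails."""

    def __init__(self, providers: list[SttProvider]) -> None:
        if not providers:
            raise ValueError("at least one provider is required")
        self.providers = providers

    def transcribe(self, audio: bytes, language: str) -> Transcript:
        errors: list[str] = []
        for provider in self.providers:
            try:
                return provider.transcribe(audio, language)
            except Exception as exc:  # a production version would narrow this
                errors.append(f"{provider.name}: {exc}")
        raise RuntimeError("all STT providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for an Audiogami (primary) and Deepgram (fallback) adapter.
class FailingStub:
    name = "audiogami"

    def transcribe(self, audio: bytes, language: str) -> Transcript:
        raise ConnectionError("timeout")


class EchoStub:
    name = "deepgram"

    def transcribe(self, audio: bytes, language: str) -> Transcript:
        return Transcript(text=f"<{len(audio)} bytes, {language}>", provider=self.name)


stt = FallbackStt([FailingStub(), EchoStub()])
result = stt.transcribe(b"\x00" * 320, "fr")
print(result.provider, result.text)
```

The abstraction costs little and enables A/B comparison and failover; a production version would also need streaming support, timeouts, and narrower exception handling.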
Data Freshness
Sources: Inworld benchmark (2026); Koenecke et al.
Update note: pricing updated to $0.0043/min PAYG (Jan 2026); Nova-3 expanded to 36 languages (Jan 2026); 75 ms P90 TTFA confirmed by the Inworld benchmark.