PHASE 1 MVP · Reference architecture

Voice-to-Voice Pipeline

Interactive diagram of the complete voice pipeline for DigiDouble Phase 1 MVP. Select components for each block to visualize cumulative latency and estimated cost. Compare the Cascade approach (ASR → LLM → TTS) with end-to-end Voice-to-Voice.

Estimated Latency & Cost

Best case: 525ms
Typical: 1240ms
Cost/min: $0.082
Cost/hour: $4.94
✓ Target <2s
User: 10ms · free
ASR: 75ms · $0.004/min
Memory: 20ms · $0.004/min
LLM: 350ms · $0.060/min
TTS: 40ms · $0.014/min
Transport: 30ms · free
Cost breakdown
ASR: 5% · Memory: 5% · LLM: 73% · TTS: 17%

DigiDouble Phase 1 target: <2s end-to-end (voice pipeline only, excluding avatar generation). Avatar generation adds 80–300ms (BeyondPresence) or 3–8s (HeyGen).
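The best-case total and the cost split above are plain sums over the per-block values. The sketch below reproduces that arithmetic; the `Stage` class and the function names are hypothetical helpers introduced for illustration, not part of the configurator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int      # best-case latency contributed by this block
    cost_per_min: float  # USD per minute; 0.0 for free blocks


# Best-case values copied from the per-block breakdown above.
STAGES = [
    Stage("User", 10, 0.0),
    Stage("ASR", 75, 0.004),
    Stage("Memory", 20, 0.004),
    Stage("LLM", 350, 0.060),
    Stage("TTS", 40, 0.014),
    Stage("Transport", 30, 0.0),
]

TARGET_MS = 2000  # Phase 1 end-to-end target (voice pipeline only)


def total_latency_ms(stages):
    """Cumulative latency: blocks run in sequence, so latencies add up."""
    return sum(s.latency_ms for s in stages)


def total_cost_per_min(stages):
    """Total cost per minute across all paid blocks."""
    return sum(s.cost_per_min for s in stages)


def cost_shares(stages):
    """Fraction of the total cost attributable to each paid block."""
    total = total_cost_per_min(stages)
    return {s.name: s.cost_per_min / total for s in stages if s.cost_per_min > 0}


if __name__ == "__main__":
    latency = total_latency_ms(STAGES)
    print(f"best case: {latency} ms, under target: {latency < TARGET_MS}")
    print(f"cost: ${total_cost_per_min(STAGES):.3f}/min")
    for name, share in cost_shares(STAGES).items():
        print(f"  {name}: {share:.0%}")
```

With the rounded per-block rates, the hourly cost is 60 × $0.082 = $4.92. The $4.94 shown above is consistent with using the unrounded Deepgram rate of $0.0043/min (60 × $0.0823 ≈ $4.94).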

Pipeline flow

Converts the audio stream to text. Streaming ASR sends partial transcripts downstream, which reduces the LLM's time-to-first-token. This is a critical latency block.
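To illustrate why partial transcripts help, here is a small simulation. The event timings, the `TranscriptEvent` type and the "at least three words" trigger are all assumptions made for this example; no real ASR API is called. It measures how much earlier downstream work (for example memory retrieval or prompt preparation) could start if it acts on a partial instead of waiting for the final transcript.

```python
from typing import NamedTuple, Optional, Sequence


class TranscriptEvent(NamedTuple):
    t_ms: int       # arrival time of the event (hypothetical values)
    text: str       # transcript so far
    is_final: bool  # True once the ASR has committed the utterance


# Simulated stream of ASR events for one utterance (made-up timings).
EVENTS = [
    TranscriptEvent(120, "what", False),
    TranscriptEvent(260, "what is the", False),
    TranscriptEvent(410, "what is the weather", False),
    TranscriptEvent(640, "What is the weather in Geneva?", True),
]


def first_actionable(events: Sequence[TranscriptEvent],
                     min_words: int = 3) -> Optional[TranscriptEvent]:
    """First event that is final or long enough to start downstream work on."""
    for ev in events:
        if ev.is_final or len(ev.text.split()) >= min_words:
            return ev
    return None


def head_start_ms(events: Sequence[TranscriptEvent], min_words: int = 3) -> int:
    """Time gained by acting on a partial instead of waiting for the final."""
    final = next(ev for ev in events if ev.is_final)
    first = first_actionable(events, min_words)
    return final.t_ms - first.t_ms


if __name__ == "__main__":
    print(first_actionable(EVENTS))
    print(f"head start: {head_start_ms(EVENTS)} ms")
```

A real pipeline would also have to handle partials that are later revised, which this sketch ignores.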

Deepgram Nova-3
Recommended · Cloud
75–200ms
$0.0043/min streaming

Industry reference for real-time ASR. Supports 30+ languages. Used by most voice agent frameworks.

Whisper.cpp (local)
Alternative · Local
200–500ms
Free (MIT)
Sovereign

Best sovereign option. Use faster-whisper (a separate, CTranslate2-based Whisper implementation) with streaming VAD for near-real-time transcription. Quantized small.en: ~200ms on CPU.

AssemblyAI Universal-2
Alternative · Cloud
100–250ms
$0.0065/min streaming

Good alternative with EU data residency option. Better for multi-speaker scenarios.

Inworld STT
Alternative · Cloud
80–180ms
Included in Inworld platform (TTS + STT + Realtime API bundle)
Sovereign

Key advantage: when combined with Inworld TTS and the LLM Router, it eliminates inter-component network hops. Semantic VAD reduces hallucination triggers. Use for the Inworld Single-Provider stack.

Cascade Pipeline

ASR → LLM → TTS — modular, controllable, production-ready

Best: 505ms
Typical: 1150ms

Recommended for MVP Phase 1. Best stack: Deepgram Nova-3 + GPT-4o streaming + Cartesia Sonic 3. Sovereign alternative: Whisper.cpp + Llama 3.1 8B + Kokoro 82M.
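The cascade can be pictured as chained streaming stages. The sketch below uses stand-in functions only (`fake_llm_tokens` and `fake_tts` are hypothetical stubs, not Deepgram, GPT-4o or Cartesia calls) and assumes sentence-level chunking between the LLM and TTS, which is one common way to let speech synthesis start before the LLM has finished its reply.

```python
from typing import Iterable, Iterator


def fake_llm_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields a canned reply token by token."""
    reply = "Sure. It is sunny in Geneva today. Enjoy your walk!"
    for token in reply.split(" "):
        yield token + " "


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush a trailing fragment without end punctuation
        yield buffer.strip()


def fake_tts(sentence: str) -> bytes:
    """Stand-in for TTS: returns bytes in place of synthesized audio."""
    return sentence.encode("utf-8")


def cascade(transcript: str) -> Iterator[bytes]:
    """ASR output -> LLM -> TTS, yielding audio chunk by chunk."""
    for sentence in sentence_chunks(fake_llm_tokens(transcript)):
        yield fake_tts(sentence)


if __name__ == "__main__":
    for chunk in cascade("What is the weather in Geneva?"):
        print(chunk)
```

Because every stage is a generator, the first audio chunk is available after the first sentence rather than after the whole reply, and each stage can be swapped independently, which is the modularity the cascade approach is recommended for.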

End-to-End Voice-to-Voice

Direct audio-in → audio-out — lowest latency, natural prosody

Best: 150ms
Typical: 350ms

Recommended for R&D Axis 1 exploration (H2 2026). Not suitable for Phase 1 MVP due to lack of voice cloning. Monitor Voxtral TTS (Mistral) and Ultravox v0.5.

Recommended Stacks


MVP RECOMMENDED

MVP Cloud Stack

Fastest path to working prototype

Best latency: 505ms
Typical: 1130ms
Cost/min: $0.085
US Cloud — acceptable for prototype

Optimal for Phase 1 MVP validation. WebRTC (10ms) + Deepgram (75ms) + GPT-4o streaming (350ms) + Cartesia (40ms) + WebRTC (30ms) = ~505ms best-case. Voice cloning via Cartesia. No sovereignty, which is acceptable for a prototype.

Sovereign Stack

Full Swiss sovereignty — Exoscale/OVH deployment

Best latency: 710ms
Typical: 1620ms
Cost/min: $0.015
Full sovereignty — Exoscale/OVH deployable

Full sovereignty for Swiss/EU institutional partners. Itemized best case: Whisper.cpp (200ms) + Llama 3.1 8B (150ms) + Chatterbox (150ms) + Mem0 (20ms) + WebRTC (30ms) = 550ms. The ~710ms headline figure is higher than this sum, and the difference is not itemized. Voice cloning via Chatterbox. Deployable on Exoscale Geneva.

MVP RECOMMENDED

Inworld Single-Provider Stack

One provider: STT + LLM Router + TTS — no inter-component latency

Best latency: 490ms
Typical: 1050ms
Cost/min: $0.060
Full sovereignty — Exoscale/OVH deployable

Key advantage: Inworld STT, LLM Router and TTS share the same internal infrastructure, so there is no inter-component network serialization. Estimated 30–50ms latency gain vs a multi-provider cascade. Itemized best case: WebRTC (10ms) + Inworld STT (80ms) + Mem0 (20ms) + GPT-4o streaming (350ms) + Inworld TTS Mini (120ms) + WebRTC (30ms) = 610ms. The ~490ms headline figure matches the sum without the 120ms TTS stage. On-premise option enables Swiss sovereign deployment. Voice cloning and viseme timestamps included.

Hybrid Stack

Best quality/sovereignty balance for production

Best latency: 555ms
Typical: 1230ms
Cost/min: $0.030
US Cloud — acceptable for prototype

EU-sovereign LLM (Mistral) + sovereign TTS (Kokoro) + best ASR (Deepgram) + Mem0 long-term memory. Itemized best case: Deepgram (75ms) + Mistral Nemo (200ms) + Kokoro (60ms) + Mem0 (20ms) + WebRTC (30ms) = 385ms. The ~555ms headline figure is higher than this sum, and the difference is not itemized. No voice cloning; add Chatterbox for the persona.
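The four stacks can be compared programmatically against the 2s target. The sketch below uses the headline figures from the stack cards above; the `Stack` class, the `candidates` filter and the `voice_cloning` flags (set from each card's description) are illustrative assumptions, not configurator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    name: str
    best_ms: int
    typical_ms: int
    cost_per_min: float   # USD per minute
    voice_cloning: bool   # as stated in each stack's description


# Headline figures copied from the stack cards above.
STACKS = [
    Stack("MVP Cloud", 505, 1130, 0.085, True),
    Stack("Sovereign", 710, 1620, 0.015, True),
    Stack("Inworld Single-Provider", 490, 1050, 0.060, True),
    Stack("Hybrid", 555, 1230, 0.030, False),
]

TARGET_MS = 2000  # Phase 1 end-to-end target


def candidates(stacks, require_cloning=True, target_ms=TARGET_MS):
    """Stacks under the latency target, optionally requiring voice cloning."""
    return [
        s for s in stacks
        if s.typical_ms < target_ms and (s.voice_cloning or not require_cloning)
    ]


def headroom_ms(stack, target_ms=TARGET_MS):
    """Typical-latency margin left before the target is exceeded."""
    return target_ms - stack.typical_ms


if __name__ == "__main__":
    for s in sorted(candidates(STACKS), key=lambda s: s.typical_ms):
        print(f"{s.name}: typical {s.typical_ms} ms, "
              f"headroom {headroom_ms(s)} ms, ${s.cost_per_min:.3f}/min")
```

The headroom matters once avatar generation is added on top of the voice pipeline, since that stage is excluded from the 2s target.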

Key decision: voice cloning

Voice cloning is critical for the DigiDouble persona. Options: Cartesia (cloud, 40ms), Chatterbox (local, 150ms, MIT), ElevenLabs (cloud, 75ms, $75/1M). The sovereign stack requires a Chatterbox + Kokoro combination.

Compare all TTS solutions

14 TTS/V2V solutions with comparative scores on 7 axes.

TTS State of the Art →