
PHASE 1 MVP · Reference architecture

Voice-to-Voice Pipeline

Interactive diagram of the complete voice pipeline for DigiDouble Phase 1 MVP. Select components for each block to visualize cumulative latency and estimated cost. Compare the Cascade approach (ASR → LLM → TTS) with end-to-end Voice-to-Voice.

Estimated Latency & Cost

Best case

525ms

Typical

1240ms

Cost/min

$0.082

Cost/hour

$4.94

✓ Target <2s


User: 10ms · free

ASR: 75ms · $0.004/min

Memory: 20ms · $0.004/min

LLM: 350ms · $0.060/min

TTS: 40ms · $0.014/min

Transport: 30ms · free

Cost breakdown

■ ASR: 5% ■ Memory: 5% ■ LLM: 73% ■ TTS: 17%

DigiDouble Phase 1 target: <2s end-to-end (voice pipeline only, excluding avatar generation). Avatar generation adds 80–300ms (BeyondPresence) or 3–8s (HeyGen).
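The summary figures above can be reproduced by summing the per-block values. The following is a minimal sketch, assuming the ASR block uses the unrounded Deepgram streaming price of $0.0043/min listed for that component (the panel shows it rounded to $0.004); the names `BLOCKS`, `TARGET_MS` and `shares` are illustrative only.

```python
# Best-case latency (ms) and cost ($/min) per block, as listed in the summary panel.
# Assumption: ASR uses the unrounded Deepgram price ($0.0043/min), not the displayed $0.004.
BLOCKS = {
    "User":      (10, 0.0),
    "ASR":       (75, 0.0043),
    "Memory":    (20, 0.004),
    "LLM":       (350, 0.060),
    "TTS":       (40, 0.014),
    "Transport": (30, 0.0),
}
TARGET_MS = 2000  # Phase 1 target: <2s end-to-end, voice pipeline only

best_case_ms = sum(ms for ms, _ in BLOCKS.values())
cost_per_min = sum(cost for _, cost in BLOCKS.values())
cost_per_hour = cost_per_min * 60

# Share of the total cost for each paid block, rounded to a whole percent.
shares = {
    name: round(100 * cost / cost_per_min)
    for name, (_, cost) in BLOCKS.items()
    if cost > 0
}

print(f"best case: {best_case_ms} ms (within target: {best_case_ms < TARGET_MS})")
print(f"cost: ${cost_per_min:.3f}/min, ${cost_per_hour:.2f}/hour")
print(f"shares: {shares}")
```

Under that pricing assumption the sums match the panel: 525 ms best case, $0.082/min, $4.94/hour, with the LLM accounting for about 73% of the cost.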

Pipeline flow

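ASR: converts the incoming audio stream to text. Streaming ASR emits partial transcripts while the user is still speaking, so downstream work can start earlier and the LLM's time-to-first-token is reduced. This is a latency-critical block.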
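To illustrate why partial transcripts help, here is a small sketch of the consumer side. It does not use any real ASR SDK: `Partial`, `fake_streaming_asr` and `run_turn` are hypothetical names, and the "warm-up" step only records when preparatory work could begin. The idea is that prompt assembly or memory lookup can start on the first partial, while the actual LLM request waits for the final transcript.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Partial:
    text: str       # transcript so far
    is_final: bool  # True once the ASR has committed the utterance


def fake_streaming_asr(words: List[str]) -> Iterator[Partial]:
    """Hypothetical stand-in for a streaming ASR client: yields a growing transcript."""
    for i in range(1, len(words) + 1):
        yield Partial(" ".join(words[:i]), is_final=(i == len(words)))


def run_turn(partials: Iterable[Partial]) -> Tuple[List[str], str]:
    """Begin preparatory work on the first partial; send the LLM request on the final one."""
    events: List[str] = []
    final_text = ""
    warmed_up = False
    for p in partials:
        if not warmed_up:
            # e.g. fetch memory, build the system prompt, open the LLM connection
            events.append(f"warm-up on partial: {p.text!r}")
            warmed_up = True
        if p.is_final:
            final_text = p.text
            events.append(f"LLM request on final: {p.text!r}")
    return events, final_text


events, final_text = run_turn(fake_streaming_asr(["book", "a", "table", "for", "two"]))
for event in events:
    print(event)
```

With a real streaming client, the same loop shape would apply to whatever interim and final result events that client exposes.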

Deepgram Nova-3

Recommended · Cloud

75–200ms

$0.0043/min streaming

Industry reference for real-time ASR. Supports 30+ languages. Used by most voice agent frameworks.

Whisper.cpp (local)

Alternative · Local

200–500ms

Free (MIT)

Sovereign

Best sovereign option. Use faster-whisper with streaming VAD for near-real-time. Quantized small.en: ~200ms on CPU.

AssemblyAI Universal-2

Alternative · Cloud

100–250ms

$0.0065/min streaming

Good alternative with EU data residency option. Better for multi-speaker scenarios.

Inworld STT

Alternative · Cloud

80–180ms

Included in Inworld platform (TTS + STT + Realtime API bundle)

Sovereign

Key advantage: when combined with Inworld TTS and LLM Router, eliminates inter-component network hops. Semantic VAD reduces hallucination triggers. Use for Inworld Single-Provider stack.
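The ASR options above can be captured as plain data and filtered by constraint. This is a sketch only: the latency ranges and the sovereign/local tags are copied from this page, while `AsrOption` and `pick_asr` are illustrative names, and ranking by the lower bound of the latency range is an assumption about what "best" means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AsrOption:
    name: str
    best_ms: int     # lower bound of the quoted latency range
    worst_ms: int    # upper bound of the quoted latency range
    deployment: str  # "cloud" or "local"
    sovereign: bool  # carries the "Sovereign" tag on this page


ASR_OPTIONS = [
    AsrOption("Deepgram Nova-3", 75, 200, "cloud", False),
    AsrOption("Whisper.cpp (local)", 200, 500, "local", True),
    AsrOption("AssemblyAI Universal-2", 100, 250, "cloud", False),
    AsrOption("Inworld STT", 80, 180, "cloud", True),
]


def pick_asr(
    options: List[AsrOption],
    require_sovereign: bool = False,
    require_local: bool = False,
) -> Optional[AsrOption]:
    """Return the option with the lowest best-case latency that meets the constraints."""
    candidates = [
        o for o in options
        if (o.sovereign or not require_sovereign)
        and (o.deployment == "local" or not require_local)
    ]
    return min(candidates, key=lambda o: o.best_ms, default=None)


print(pick_asr(ASR_OPTIONS).name)
print(pick_asr(ASR_OPTIONS, require_sovereign=True).name)
print(pick_asr(ASR_OPTIONS, require_local=True).name)
```

Ranking by `worst_ms` instead would be the more conservative choice if tail latency matters more than the best case.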

Cascade Pipeline

ASR → LLM → TTS — modular, controllable, production-ready

Best

505ms

Typical

1150ms

Recommended for MVP Phase 1. Best stack: Deepgram Nova-3 + GPT-4o streaming + Cartesia Sonic 3. Sovereign alternative: Whisper.cpp + Llama 3.1 8B + Kokoro 82M.
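The modularity of the cascade can be sketched as three swappable callables run in order, with per-stage timing. The stubs below are placeholders, not real provider SDK calls, and `run_cascade` is a hypothetical helper. The sketch runs the stages strictly in sequence for clarity; a real-time pipeline would normally stream between stages instead.

```python
import time
from typing import Callable, Dict, Tuple


def run_cascade(
    audio_in: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run ASR -> LLM -> TTS in order and record each stage's wall-clock time in ms."""
    timings: Dict[str, float] = {}
    value = audio_in
    for name, stage in (("asr", asr), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        value = stage(value)
        timings[name] = (time.perf_counter() - start) * 1000
    return value, timings


# Placeholder stages; each could be replaced by a cloud or a local implementation.
def stub_asr(audio: bytes) -> str:
    return "hello"


def stub_llm(text: str) -> str:
    return f"You said: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_out, timings = run_cascade(b"\x00\x01", stub_asr, stub_llm, stub_tts)
print(audio_out)
print(sorted(timings))
```

Because each stage is only a function from one type to the next, moving from the cloud stack to the sovereign one amounts to passing different callables.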

End-to-End Voice-to-Voice

Direct audio-in → audio-out — lowest latency, natural prosody

Best

150ms

Typical

350ms

Recommended for R&D Axis 1 exploration (H2 2026). Not suitable for Phase 1 MVP due to lack of voice cloning. Monitor Voxtral TTS (Mistral) and Ultravox v0.5.

Recommended Stacks


MVP RECOMMENDED

MVP Cloud Stack

Fastest path to working prototype

Best latency

505ms

Typical

1130ms

Cost/min

$0.085

US Cloud — acceptable for prototype

Optimal for Phase 1 MVP validation. User (10ms) + Deepgram (75ms) + GPT-4o streaming (350ms) + Cartesia (40ms) + WebRTC (30ms) = ~505ms best-case. Voice cloning via Cartesia. No sovereignty, which is acceptable for a prototype.

Sovereign Stack

Full Swiss sovereignty — Exoscale/OVH deployment

Best latency

710ms

Typical

1620ms

Cost/min

$0.015

Full sovereignty — Exoscale/OVH deployable

Full sovereignty for Swiss/EU institutional partners. Whisper.cpp (200ms) + Llama 3.1 8B (150ms) + Chatterbox (150ms) + Mem0 (20ms) + WebRTC (30ms) = ~710ms best-case. Voice cloning via Chatterbox. Deployable on Exoscale Geneva.

MVP RECOMMENDED

Inworld Single-Provider Stack

One provider: STT + LLM Router + TTS — no inter-component latency

Best latency

490ms

Typical

1050ms

Cost/min

$0.060

Full sovereignty — Exoscale/OVH deployable

Key advantage: Inworld STT + LLM Router + TTS share the same internal infrastructure — no inter-component network serialization. Estimated 30–50ms latency gain vs multi-provider cascade. WebRTC (10ms) + Inworld STT (80ms) + Mem0 (20ms) + GPT-4o streaming (350ms) + Inworld TTS Mini (120ms) + WebRTC (30ms) = ~490ms best-case. On-premise option enables Swiss sovereign deployment. Voice cloning + viseme timestamps included.

Hybrid Stack

Best quality/sovereignty balance for production

Best latency

555ms

Typical

1230ms

Cost/min

$0.030

US Cloud — acceptable for prototype

EU-sovereign LLM (Mistral) + sovereign TTS (Kokoro) + best ASR (Deepgram) + Mem0 long-term memory. Deepgram (75ms) + Mistral Nemo (200ms) + Kokoro (60ms) + Mem0 (20ms) + WebRTC (30ms) = ~555ms best-case. No voice cloning — add Chatterbox for persona.
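A quick way to sanity-check a headline latency is to sum the components it is built from. The sketch below does this with the figures exactly as quoted for the four stacks on this page. It makes no claim about which number is authoritative where the two differ, and `STACKS` and `itemized_gap` are illustrative names.

```python
# Itemized best-case latencies (ms) as quoted in the stack descriptions,
# together with the headline best-case figure shown for each stack.
STACKS = {
    "MVP Cloud": {
        "headline": 505,
        # Assumption: "User" (10 ms) is taken from the summary panel at the top of the page.
        "items": {"User": 10, "Deepgram": 75, "GPT-4o streaming": 350,
                  "Cartesia": 40, "WebRTC": 30},
    },
    "Sovereign": {
        "headline": 710,
        "items": {"Whisper.cpp": 200, "Llama 3.1 8B": 150, "Chatterbox": 150,
                  "Mem0": 20, "WebRTC": 30},
    },
    "Inworld Single-Provider": {
        "headline": 490,
        "items": {"WebRTC (in)": 10, "Inworld STT": 80, "Mem0": 20,
                  "GPT-4o streaming": 350, "Inworld TTS Mini": 120, "WebRTC (out)": 30},
    },
    "Hybrid": {
        "headline": 555,
        "items": {"Deepgram": 75, "Mistral Nemo": 200, "Kokoro": 60,
                  "Mem0": 20, "WebRTC": 30},
    },
}


def itemized_gap(stack: dict) -> tuple:
    """Return (sum of itemized components, headline minus that sum), both in ms."""
    total = sum(stack["items"].values())
    return total, stack["headline"] - total


for name, stack in STACKS.items():
    total, gap = itemized_gap(stack)
    print(f"{name}: itemized {total} ms, headline {stack['headline']} ms, gap {gap:+d} ms")
```

A non-zero gap means the itemized components do not account for the headline value, so that stack is worth re-measuring before its numbers are used to make a decision.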

Key decision: voice cloning

Voice cloning is critical for the DigiDouble persona. Options: Cartesia (cloud, 40ms), Chatterbox (local, 150ms, MIT), ElevenLabs (cloud, 75ms, $75/1M). The Sovereign stack requires a Chatterbox + Kokoro combination.

Compare all TTS solutions

14 TTS/V2V solutions with comparative scores on 7 axes.
