RESEARCH PORTAL · INNOSUISSE / IDIAP

DigiDouble

Platform for creating interactive conversational experiences with video avatars — combining real-time AI dialogue, photorealistic avatar generation, and intelligent cinematographic sequencing.

This portal documents the fundamental research challenges of the Memoways × Gamilab × IDIAP collaboration, carried out as part of an Innosuisse project.

Current latency: 15–40s per exchange
Target latency: <2s end-to-end
Reduction required: 10–20× improvement
Parallel streams: 5, synchronized

Research Axes

| Axis | Challenge | Researcher | IDIAP Group | Status |
|---|---|---|---|---|
| AX1: Conversational Memory | The Conversational Memory Gap | Dr. Elena Epure | Language & Information Technologies | PRIMARY |
| AX2: Expressive Avatar & Behavioral Fidelity | The Behavioral Fidelity Gap | Dr. Mathew Magimai-Doss | Speech & Audio Processing | PRIMARY |
| AX3: Deterministic-Organic Orchestration | Balance narrative constraints / AI conversational freedom | Internal team | Architecture | SECONDARY |
| AX4: Multi-Stream Synchronization | Coordinate 5 streams <100ms desync | Memoways | Internal Engineering | INTERNAL |
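
The AX4 constraint (5 streams, under 100ms of desync) can be illustrated with a small check on presentation timestamps. This is a minimal sketch, not the project's synchronization mechanism: the stream names `stream_1` to `stream_5`, the function names, and the choice to measure desync as the spread between the earliest and latest stream are all assumptions made for illustration.

```python
# Hypothetical sketch: the portal does not name the 5 streams or define
# how desync is measured. Here desync is the spread (max - min) of the
# streams' current presentation timestamps, in milliseconds.

MAX_DESYNC_MS = 100.0  # threshold taken from the AX4 row


def max_desync_ms(timestamps_ms: dict[str, float]) -> float:
    """Return the gap between the most advanced and most delayed stream."""
    values = list(timestamps_ms.values())
    return max(values) - min(values)


def in_sync(timestamps_ms: dict[str, float], limit_ms: float = MAX_DESYNC_MS) -> bool:
    """True when every stream is within limit_ms of every other stream."""
    return max_desync_ms(timestamps_ms) < limit_ms


def most_delayed(timestamps_ms: dict[str, float]) -> str:
    """Name of the stream with the smallest timestamp (the laggard)."""
    return min(timestamps_ms, key=timestamps_ms.get)


if __name__ == "__main__":
    # Placeholder stream names and made-up timestamps, for illustration only.
    sample = {
        "stream_1": 1000.0,
        "stream_2": 1020.0,
        "stream_3": 985.0,
        "stream_4": 1060.0,
        "stream_5": 1010.0,
    }
    print(max_desync_ms(sample), in_sync(sample), most_delayed(sample))
```

A real implementation would also need a shared clock and a policy for which stream to delay or drop; neither is specified here.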

Conversational Pipeline

Each exchange passes through 6 stages — avatar generation is the main bottleneck (5–15s currently, target 500ms).

END-TO-END CONVERSATIONAL PIPELINE

User (speech) → ASR / STT → Routing → LLM → TTS → Avatar → User (response)

ASR / STT (Audiogami): 2–5s → 300ms
Routing (orchestration): 1–2s → 200ms
LLM (memory + RAG): 3–8s → 500ms
TTS (speech synthesis): 2–4s → 200ms
Avatar (video generation): 5–15s → 500ms ⚠ BOTTLENECK

Legend: current latency → target; ⚠ marks the bottleneck.

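
The per-stage figures in the pipeline diagram can be turned into a simple latency budget. The sketch below uses the diagram's numbers and assumes the stages run strictly one after another with no overlap or streaming; that assumption is not stated in the source. The `Stage` class and the function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    current_min_s: float  # lower bound of current latency, seconds
    current_max_s: float  # upper bound of current latency, seconds
    target_s: float       # target latency, seconds


# Figures copied from the pipeline diagram.
STAGES = [
    Stage("ASR / STT", 2.0, 5.0, 0.3),
    Stage("Routing", 1.0, 2.0, 0.2),
    Stage("LLM (memory + RAG)", 3.0, 8.0, 0.5),
    Stage("TTS", 2.0, 4.0, 0.2),
    Stage("Avatar video generation", 5.0, 15.0, 0.5),
]


def totals(stages):
    """Sum latencies, assuming purely sequential stages (an assumption)."""
    lo = sum(s.current_min_s for s in stages)
    hi = sum(s.current_max_s for s in stages)
    target = sum(s.target_s for s in stages)
    return lo, hi, target


def reduction_range(stage):
    """Speed-up factor needed at the low and high end of current latency."""
    return stage.current_min_s / stage.target_s, stage.current_max_s / stage.target_s


if __name__ == "__main__":
    lo, hi, target = totals(STAGES)
    print(f"current total: {lo:.0f}-{hi:.0f}s, target total: {target:.1f}s")
    for s in STAGES:
        r_lo, r_hi = reduction_range(s)
        print(f"{s.name}: {r_lo:.1f}x to {r_hi:.1f}x reduction needed")
```

Under this sequential assumption the per-stage targets sum to 1.7s, which fits within the <2s end-to-end goal, and the avatar stage needs the largest speed-up (10× to 30×). The current per-stage ranges sum to 13–34s, slightly below the 15–40s headline figure; the source does not say where the difference comes from.
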
THE GAP

Why This Research?

Current avatar platforms (HeyGen, Synthesia) produce high-quality video but with 15–40 second latency per exchange — incompatible with natural conversation. Real-time solutions (NVIDIA ACE, Beyond Presence) require proprietary infrastructure and do not allow behavioral personalization from existing archives. DigiDouble aims to bridge this gap: sovereign, open, personalized, and real-time.

The fundamental challenge: achieve a 10–20× latency reduction while preserving behavioral fidelity of the specific person — a problem at the intersection of speech processing, computer vision, NLP, and systems engineering.