02 · Research Challenges — IDIAP

Fundamental Research Challenges

Two primary axes for IDIAP collaboration, long-duration conversational memory and personalized expressive avatar generation, plus a cross-cutting engineering challenge: end-to-end latency.

01
Overview — 3 Research Challenges


01
End-to-End Latency
6–10×
reduction required

Avatar generation is the main bottleneck (5–10s). Target: <2s total via distillation + streaming.

Dr. Petr Motlicek · IDIAP
H1: Streaming pipeline (LLM → TTS → Avatar) reduces latency by 60%
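
A minimal asyncio sketch of the H1 idea: rather than waiting for the full LLM reply, TTS synthesizes sentence by sentence and the avatar renders per audio chunk, so the first frames reach WebRTC early. All stage internals and timings below are placeholders, not measurements:

import asyncio

async def llm_tokens(prompt: str):
    # Stand-in for a streaming LLM: yields tokens as they are generated.
    for token in ["Hello", " there", ".", " How", " can", " I", " help", "?"]:
        await asyncio.sleep(0.05)               # placeholder per-token latency
        yield token

async def tts_chunks(tokens):
    # Synthesize sentence by sentence instead of waiting for the full reply.
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.endswith((".", "?", "!")):
            await asyncio.sleep(0.1)            # placeholder TTS synthesis time
            yield f"<audio: {buffer.strip()}>"
            buffer = ""

async def avatar_frames(chunks):
    # Render avatar video per audio chunk; frames stream out incrementally.
    async for chunk in chunks:
        await asyncio.sleep(0.2)                # placeholder avatar rendering time
        yield f"<frames for {chunk}>"

async def main():
    async for frames in avatar_frames(tts_chunks(llm_tokens("demo"))):
        print("stream to WebRTC:", frames)      # first frames arrive early

asyncio.run(main())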
02
Conversational Memory
−90%
token reduction (Mem0)

Maintain coherence over 1h+ sessions without exploding LLM context. 3-layer architecture.

Dr. Petr Motlicek · IDIAP
H2: 3-layer memory reduces context by 90% while maintaining 95% coherence
03
Expressive Avatar & Behavioral Fidelity
80%
behavioral fidelity target

Beyond lip-sync: extract micro-expressions, gestures, posture from video archives.

Dr. Mathew Magimai-Doss · IDIAP
H3: Behavioral extraction from 10 min of video achieves 80% fidelity
02
Latency — Before & After

The bottleneck: avatar video generation (5–10s)


CURRENT STATE · 6–12s per exchange

ASR/STT (transcription): 300–500ms
LLM (generation): 800ms–2s
TTS (voice synthesis): 200–500ms
Avatar (video generation): 5–10s
WebRTC (transport): 30–80ms

The bottleneck is avatar video generation (5–10s). All other components are already within targets.

R&D TARGET · <2s end-to-end

ASR/STT (streaming): <200ms
SLM (distilled + RAG): 150–400ms
TTS (streaming): <200ms
Avatar (IDIAP R&D): <500ms
WebRTC (adaptive): <50ms

6–10× reduction via avatar distillation + streaming pipeline + intelligent cache.
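
As a sanity check, the two tables can be read as serial worst-case sums (a simplification, since streaming lets stages overlap):

# Serial sums of the component latencies quoted above. Real end-to-end
# figures sit below these sums once stages overlap; the point is the
# order of magnitude, not the exact number.
current = {"ASR/STT": (0.30, 0.50), "LLM": (0.80, 2.00), "TTS": (0.20, 0.50),
           "Avatar": (5.00, 10.00), "WebRTC": (0.03, 0.08)}
target = {"ASR/STT": 0.20, "SLM": 0.40, "TTS": 0.20, "Avatar": 0.50, "WebRTC": 0.05}

cur_lo = sum(lo for lo, _ in current.values())   # ~6.3 s
cur_hi = sum(hi for _, hi in current.values())   # ~13.1 s
tgt = sum(target.values())                       # ~1.35 s, inside the <2 s budget

print(f"current: {cur_lo:.1f}-{cur_hi:.1f}s per exchange")
print(f"target:  {tgt:.2f}s end-to-end")
print(f"reduction: {cur_lo / tgt:.0f}-{cur_hi / tgt:.0f}x")  # ~5-10x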

03
Axis 1 — Conversational Memory

3-layer memory architecture

Dr. Petr Motlicek · IDIAP

Maintain coherence over 1h+ sessions without exploding LLM context. Mem0 (2025): −90% tokens, +26% accuracy.

[Diagram: "Conversational Memory Architecture — 3 Layers". An LLM/agent orchestrator draws on L1 node memory (short term: current conversation, node variables, emotional state; LLM context window, cost: high), L2 session memory (medium term: visited node path, progression score, summarized history; vector DB / RAG, cost: medium), and L3 user memory (long term: learning profile, historical sessions, cross-session patterns; PostgreSQL + SLM, cost: low). Goal: −90% context-window tokens, +26% accuracy (Mem0, 2025).]
L1 · Working Memory (short term)
· Current conversation context
· Node-specific variables
· Covered concepts tracker
· Emotional state detection
· Selective forgetting on exit
Storage: LLM context window · Cost: High

L2 · Episodic Memory (medium term)
· Path of visited nodes
· Global progression score
· Engagement level tracking
· Decisions and branches taken
· Summarized conversation history
Storage: Vector DB / RAG · Cost: Medium

L3 · Semantic Memory (long term)
· Learning profile + preferences
· Historical session summaries
· Knowledge level by topic
· Detected interaction style
· Inter-session patterns
Storage: PostgreSQL + SLM · Cost: Low
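
A minimal sketch of how the three layers could compose at inference time. Storage backends are faked with in-memory structures (in the proposed design, L1 lives in the LLM context window, L2 in a vector DB, L3 in PostgreSQL + SLM), and every class and method name is illustrative:

from dataclasses import dataclass, field

@dataclass
class WorkingMemory:          # L1: current node, full detail, lives in context
    turns: list = field(default_factory=list)
    emotional_state: str = "neutral"

    def on_node_exit(self) -> str:
        summary = " / ".join(self.turns[-3:])   # stand-in for LLM summarization
        self.turns.clear()                      # selective forgetting on exit
        return summary

@dataclass
class EpisodicMemory:         # L2: session path + summaries (vector DB in design)
    node_path: list = field(default_factory=list)
    summaries: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list:
        return self.summaries[-k:]              # stand-in for vector search

@dataclass
class SemanticMemory:         # L3: cross-session profile (PostgreSQL + SLM)
    profile: dict = field(default_factory=lambda: {"style": "unknown", "level": {}})

def build_context(l1: WorkingMemory, l2: EpisodicMemory,
                  l3: SemanticMemory, query: str) -> dict:
    """Assemble a compact prompt: full L1, retrieved L2, distilled L3 only."""
    return {"profile": l3.profile,              # a few hundred tokens at most
            "episodes": l2.retrieve(query),     # only relevant session summaries
            "turns": l1.turns}                  # only the current node's turns

Only L1 is ever passed verbatim to the LLM; L2 and L3 contribute retrieved or distilled slices, which is where the −90% token reduction would come from.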
04
Axis 2 — Expressive Avatar & TTS

Beyond lip-sync: behavioral fidelity

Dr. Mathew Magimai-Doss · IDIAP

Current platforms produce "standardized talking heads" — visually photorealistic but behaviorally generic. This creates an uncanny valley of familiarity.

[Diagram: "Expressive Avatar Pipeline — Behavioral Extraction → Real-Time Generation". P1 video archives (existing videos, variable angles, uncontrolled lighting) → P2 behavioral extraction (micro-expressions, gestural vocabulary, postural patterns, individual prosody) → P3 behavioral model (fine-tuned SLM, LoRA adapters, prosodic encoder, gesture library) → P4 real-time generation, target <500ms (lip-sync + body, emotional coherence, personalized TTS, video streaming). Open questions: extraction from uncontrolled images; required audio/video quantity; minimal compute for <500ms. ⚠ Uncanny valley of familiarity: users recognize the face but not the behavior, and suspension of disbelief is destroyed.]
AXIS 2 · 2a

Behavioral Extraction from Archives

Extract individual behavioral patterns from existing videos — without new capture sessions. Identify: micro-expression repertoire, gestural vocabulary, gesture-speech temporal relationships, postural habits.

Key question:

Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?
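
One way to make this concrete, assuming per-frame pose keypoints have already been extracted from the archive footage with an off-the-shelf estimator: segment the keypoint stream by motion energy, then cluster fixed-length segments into a discrete vocabulary. All thresholds and sizes are illustrative, and the feature choice (flattened normalized trajectories + k-means) is a deliberate baseline, not a claim about the right representation.

import numpy as np
from sklearn.cluster import KMeans

def gesture_segments(keypoints: np.ndarray, thresh: float = 0.02):
    """keypoints: (frames, joints, 2) normalized coordinates.
    Yields (start, end) frame ranges where motion energy exceeds thresh."""
    energy = np.linalg.norm(np.diff(keypoints, axis=0), axis=(1, 2))
    start = None
    for i, active in enumerate(energy > thresh):
        if active and start is None:
            start = i
        elif not active and start is not None:
            yield start, i
            start = None

def build_vocabulary(keypoints: np.ndarray, n_gestures: int = 20, seg_len: int = 30):
    """Cluster fixed-length, position-normalized motion bursts into prototypes."""
    feats = []
    for s, _ in gesture_segments(keypoints):
        seg = keypoints[s:s + seg_len]
        if len(seg) == seg_len:
            seg = seg - seg.mean(axis=(0, 1))   # remove absolute position
            feats.append(seg.reshape(-1))
    return KMeans(n_clusters=n_gestures, n_init=10).fit(np.stack(feats))

The cluster centroids then serve as a first-cut gestural vocabulary; whether such a baseline survives uncontrolled footage is exactly the open question.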

AXIS 2 · 2b

Coherent Body Language Generation

Go beyond lip-sync. Generate coordinated body behavior: synchronized with speech content and emotional tone, culturally appropriate, consistent with the defined personality.

Key question:

How can coherent full-body behavior be generated? Most current systems focus on the face only; the body is absent or drawn from a template library.
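
A toy sketch of the coordination problem: pick gestures from an extracted, emotion-tagged vocabulary so that strokes land just before emphasized words (gesture strokes tend to slightly precede the stressed syllable). The library, its tags, and the timing offset are all hypothetical.

import random

# Hypothetical emotion-tagged gesture vocabulary (e.g. from the 2a extraction).
GESTURE_LIBRARY = {
    "calm": ["open_palm", "slow_nod"],
    "excited": ["beat_gesture", "raised_hands"],
}

def schedule_gestures(words, emphasis, emotion="calm", rng=random.Random(0)):
    """words: list of (word, start_seconds); emphasis: indices of stressed words.
    Returns (time, gesture) pairs whose stroke aligns with each stressed word."""
    timeline = []
    for i, (word, start) in enumerate(words):
        if i in emphasis:
            gesture = rng.choice(GESTURE_LIBRARY[emotion])
            timeline.append((round(start - 0.15, 2), gesture))  # stroke just before word
    return timeline

print(schedule_gestures([("this", 0.0), ("really", 0.4), ("matters", 0.8)],
                        emphasis={1}, emotion="excited"))
# Prints one timed gesture for the stressed word "really".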

AXIS 2 · 2c

Personalized Expressive TTS

Generate speech capturing not only vocal timbre but the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, emotional modulation.

Key question:

How much source audio is needed to capture prosodic individuality? Minutes or hours?
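
One first-pass concretization of a "prosodic fingerprint", built from standard librosa routines (pyin pitch tracking, energy-based silence splitting). Which statistics actually carry prosodic individuality, and how many minutes of audio they need to stabilize, is precisely the open question above.

import numpy as np
import librosa

def prosodic_fingerprint(path: str) -> dict:
    """Summary statistics over pitch, pauses and speech rate for one speaker."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)   # frame-level pitch
    f0 = f0[voiced & ~np.isnan(f0)]
    spans = librosa.effects.split(y, top_db=30)                 # non-silent intervals
    pauses = np.diff(spans.reshape(-1))[1::2] / sr              # gaps between spans
    return {
        "f0_median_hz": float(np.median(f0)),
        "f0_spread_hz": float(np.percentile(f0, 95) - np.percentile(f0, 5)),
        "speech_bursts_per_s": len(spans) / (len(y) / sr),      # crude rhythm proxy
        "pause_mean_s": float(pauses.mean()) if pauses.size else 0.0,
        "pause_p90_s": float(np.percentile(pauses, 90)) if pauses.size else 0.0,
    }

Comparing how fast these statistics converge as more source audio is added would give an empirical answer to the minutes-or-hours question.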

AXIS 2 · 2d

Cost / Quality / Latency Optimization

Approaches: pre-rendered base + real-time lip-sync, model distillation, intelligent cache, graceful degradation (full video → face → stylized avatar → audio only).

Key question:

What is the minimum compute for acceptable personalized avatar generation at <500ms?
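
Stated as a policy, the degradation ladder picks the richest rendering mode whose expected latency still fits the budget. A minimal sketch, with illustrative mode names and nominal latencies:

# Graceful degradation: full video -> face -> stylized avatar -> audio only.
# Nominal per-mode latencies (seconds) are illustrative placeholders.
LADDER = [
    ("full_video", 0.45),       # personalized full-body video
    ("face_only", 0.25),        # talking head over a pre-rendered body
    ("stylized_avatar", 0.10),  # lightweight 2D/3D puppet
    ("audio_only", 0.02),       # personalized TTS with a static image
]

def pick_mode(load_factor: float, budget_s: float = 0.5) -> str:
    """load_factor scales nominal latencies (1.0 = nominal, 3.0 = overloaded)."""
    for mode, nominal in LADDER:
        if nominal * load_factor <= budget_s:
            return mode
    return "audio_only"         # last resort: always ship something

print(pick_mode(1.0))   # full_video: 0.45s fits the 0.5s budget
print(pick_mode(3.0))   # stylized_avatar: 0.30s is the richest mode that fits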

05
IDIAP Partnership — Mutual Contributions

DigiDouble brings

Sovereign ASR pipeline (Audiogami) — operational, Swiss-hosted
Multi-stream expertise — 14 years of synchronized multimedia delivery
Two validated prototypes with real user testing and documented feedback
Domain expertise in interactive narrative design and pedagogical structuring
Swiss GPU infrastructure — Exoscale partnership for sovereign compute

DigiDouble expects from IDIAP

Fundamental research on memory architectures for long-duration conversational AI
Research in speech synthesis for personalized, expressive, real-time TTS
Evaluation frameworks — scientific metrics for behavioral authenticity and engagement
Publications in relevant venues (Interspeech, SIGDIAL, ACL, CHI, CVPR)
PhD/postdoc capacity to advance these axes over the project duration