Two primary axes for IDIAP collaboration: long-duration conversational memory and personalized expressive avatar generation.
Avatar generation is the main bottleneck (5–10s). Target: <2s total via distillation + streaming.
Maintain coherence over 1h+ sessions without exploding LLM context. 3-layer architecture.
Beyond lip-sync: extract micro-expressions, gestures, posture from video archives.
The bottleneck is avatar video generation (5–10s). All other components are already within targets.
6–10× reduction via avatar distillation + streaming pipeline + intelligent cache.
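As a quick sanity check on these figures, the sketch below applies the 6–10× reduction to the 5–10s avatar generation time and compares the result with the <2s target. It assumes the reduction factor applies to the avatar generation stage alone; the function and constant names are illustrative only.

```python
# Figures taken from the text: avatar generation currently takes 5-10 s,
# and the proposed pipeline aims for a 6-10x reduction with a < 2 s target.
CURRENT_LATENCY_S = (5.0, 10.0)
REDUCTION_FACTOR = (6.0, 10.0)
TARGET_S = 2.0


def projected_latency_range(current, factor):
    """Return (best, worst) projected latency in seconds."""
    best = min(current) / max(factor)    # fastest case, largest speed-up
    worst = max(current) / min(factor)   # slowest case, smallest speed-up
    return best, worst


best, worst = projected_latency_range(CURRENT_LATENCY_S, REDUCTION_FACTOR)
print(f"projected: {best:.2f}-{worst:.2f} s, target < {TARGET_S:.0f} s")
```

Under that assumption, even the least favorable combination (10s at 6×) lands at about 1.67s, which leaves limited headroom for the other components within the 2s total.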
Dr. Petr Motlicek · IDIAP
Maintain coherence over 1h+ sessions without exploding LLM context. For reference, Mem0 (2025) reports about 90% fewer tokens and 26% higher accuracy.
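The text mentions a 3-layer architecture without naming the layers. The sketch below shows one possible reading, purely as an assumption: a verbatim window of recent turns, a compressed summary of older turns, and a long-term fact store. The class `LayeredMemory`, its methods, and the word-overlap retrieval are hypothetical placeholders and are not Mem0's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Hypothetical 3-layer memory; the layer names are assumptions."""
    window_size: int = 4                            # layer 1 capacity
    recent: deque = field(default_factory=deque)    # layer 1: verbatim turns
    summary: list = field(default_factory=list)     # layer 2: compressed turns
    facts: list = field(default_factory=list)       # layer 3: long-term facts

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent.append((speaker, text))
        # When the window overflows, move the oldest turn to the summary layer.
        while len(self.recent) > self.window_size:
            old_speaker, old_text = self.recent.popleft()
            # Placeholder compression: keep only the first 8 words.
            # A real system would call a summarization model here.
            short = " ".join(old_text.split()[:8])
            self.summary.append(f"{old_speaker}: {short}")

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def build_context(self, query: str, max_facts: int = 2) -> str:
        # Rank facts by naive word overlap with the query
        # (a stand-in for embedding-based retrieval).
        q = set(query.lower().split())
        ranked = sorted(
            self.facts,
            key=lambda f: len(q & set(f.lower().split())),
            reverse=True,
        )
        parts = ["FACTS:"] + ranked[:max_facts]
        parts += ["SUMMARY:"] + self.summary
        parts += ["RECENT:"] + [f"{s}: {t}" for s, t in self.recent]
        return "\n".join(parts)


memory = LayeredMemory(window_size=2)
memory.remember("user prefers short answers")
for i in range(5):
    memory.add_turn("user", f"message number {i}")
context = memory.build_context("what does the user prefer")
print(context)
```

Only the last `window_size` turns reach the prompt verbatim; older turns survive in shortened form. The summary list in this sketch still grows with session length, so a real system would need to re-summarize or prune it periodically.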
Dr. Mathew Magimai-Doss · IDIAP
Current platforms produce "standardized talking heads": visually photorealistic but behaviorally generic. This creates an uncanny valley of familiarity, where the avatar looks like the person but does not behave like them.
Extract individual behavioral patterns from existing videos — without new capture sessions. Identify: micro-expression repertoire, gestural vocabulary, gesture-speech temporal relationships, postural habits.
Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?
Go beyond lip-sync. Generate coordinated body behavior: synchronized with speech content and emotional tone, culturally appropriate, consistent with the defined personality.
Most current systems focus on the face only. The body is either absent or drawn from a template library.
Generate speech capturing not only vocal timbre but the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, emotional modulation.
How much source audio is needed to capture prosodic individuality? Minutes or hours?
Approaches: pre-rendered base + real-time lip-sync, model distillation, intelligent cache, graceful degradation (full video → face → stylized avatar → audio only).
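The graceful degradation chain above can be sketched as a tier selector that picks the richest output fitting a latency budget. The per-tier latency estimates below are made-up placeholders rather than measurements, and `Tier` and `select_tier` are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    # Declared from richest to most degraded, matching the chain in the text.
    FULL_VIDEO = "full video"
    FACE_ONLY = "face"
    STYLIZED = "stylized avatar"
    AUDIO_ONLY = "audio only"


# Placeholder latency estimates in milliseconds (assumptions, not measurements).
ESTIMATED_MS = {
    Tier.FULL_VIDEO: 1800,
    Tier.FACE_ONLY: 900,
    Tier.STYLIZED: 400,
    Tier.AUDIO_ONLY: 150,
}


def select_tier(budget_ms: float, estimates=ESTIMATED_MS) -> Tier:
    """Return the richest tier whose estimated latency fits the budget."""
    for tier in Tier:  # Enum iteration follows definition order
        if estimates[tier] <= budget_ms:
            return tier
    # Nothing fits: fall back to the cheapest tier rather than failing.
    return Tier.AUDIO_ONLY


for budget in (2000, 500, 100):
    print(budget, "ms ->", select_tier(budget).value)
```

In practice the estimates would be updated from live measurements, so the selector can step down a tier when the device or network slows and step back up when conditions recover.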
What is the minimum compute for acceptable personalized avatar generation at <500ms?