Research Challenges
Three research axes converge toward a central goal: a fluid, personalized, near-real-time conversational experience. These challenges will be addressed through an Innosuisse project with IDIAP as research partner (expected start: autumn 2026).
Target architecture: available blocks (green), R&D required (blue), Memoways internal (yellow). The <2s target latency budget is the constraint that structures all architectural choices.
6–12 seconds break the illusion of presence
Latency is not just a technical problem — it is a user experience problem. Beyond 2 seconds, users lose their train of thought, the avatar stops being a presence and becomes a tool. DigiDouble's goal is to cross the conversational naturalness threshold: <2s end-to-end, with first sound within 500ms.
Cognitive thresholds of perceived latency
| Threshold | Qualification | UX Impact | Achievable (DigiDouble) |
|---|---|---|---|
| 100ms | Instantaneous | 'Immediate' response threshold. User perceives no delay. Target for micro-interactions (click, hover). | ✓ Yes |
| 300ms | Fluid | Perceptive fluidity threshold. User perceives slight delay but interaction remains natural. Target for TTS first audio. | ✓ Yes |
| 1s | Acceptable | Conversational comfort threshold. Beyond this, users start anticipating the wait. Target for TTFB (first video frame). | ✓ Yes |
| 2s | Natural limit | Conversational naturalness threshold (Nielsen 1993, validated by human dialogue research). Beyond this, conversation becomes a series of waits. DigiDouble TTFR target. | R&D Goal |
| 6–12s | Engagement break | Current DigiDouble latency (HeyGem OS). User loses the thread, avatar stops being a presence. High drop-off rate. This is the problem to solve. | Current problem |
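These thresholds only matter if they are measured at the right points in the pipeline. Below is a minimal sketch of how they could be instrumented in a streaming exchange; the event names and the `LatencyProbe` class are illustrative assumptions, not DigiDouble's actual telemetry.

```python
import time

# Targets from the table above, in milliseconds since the user's turn ended.
THRESHOLDS_MS = {
    "first_audio": 300,     # perceptive fluidity (TTS first audio)
    "first_frame": 1000,    # conversational comfort (first video frame)
    "full_response": 2000,  # conversational naturalness (TTFR, R&D goal)
}

class LatencyProbe:
    def __init__(self) -> None:
        self.t0 = time.monotonic()          # started when the user stops speaking
        self.marks: dict[str, float] = {}

    def mark(self, event: str) -> None:
        """Record elapsed milliseconds for a pipeline event."""
        self.marks[event] = (time.monotonic() - self.t0) * 1000

    def report(self) -> dict[str, bool]:
        """True where the measured latency meets its cognitive threshold."""
        return {event: self.marks.get(event, float("inf")) <= budget
                for event, budget in THRESHOLDS_MS.items()}

# probe = LatencyProbe()
# ... pipeline runs, calling probe.mark("first_audio"), etc. ...
# print(probe.report())  # e.g. {'first_audio': True, 'first_frame': False, ...}
```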
Comparative latency benchmark (March 2026)
Data from analysis of 11 solutions. Full technical profiles in State of the Art.
| Solution | Latency | Type | Sovereign | Cost | Note |
|---|---|---|---|---|---|
| Beyond Presence | <100ms | Commercial | ✗ | Enterprise | Proprietary infra |
| NVIDIA ACE | <100ms | Commercial | ✗ | NVIDIA infra | NVIDIA lock-in |
| Simli Trinity-1 | <300ms | Commercial | ✗ | $0.009/min | Gaussian Splatting |
| Anam | Good | Commercial | ✗ | ~$0.18/min | WebRTC Pion |
| Runway Characters | <500ms | Commercial | ✗ | $0.20/min | WebRTC GWM-1 |
| D-ID V4 | Improved in V4 | Commercial | ✗ | ~$0.35/min | WebRTC Janus |
| HeyGen | 2–5s | Commercial | ✗ | High | Streaming |
| DigiDouble (current) | 6–12s | Open-source | ✓ | Exoscale GPU | HeyGem OS |
| DigiDouble (R&D target) | <2s | R&D | ✓ | Sovereign GPU | Axis 1 R&D |
| SoulX-FlashTalk | 0.87s startup | Research | ✗ | 8×H800 | 14B DiT |
| AvatarForcing | Real-time | Research | ✗ | Research GPU | 1-step diffusion |
Competitive positioning: Latency × Sovereignty
The DigiDouble gap is visible: fast solutions have no sovereignty, sovereign solutions are not fast. The R&D goal is to bridge this gap (dashed arrow).
Target UX metrics
Beyond 2s, users lose their train of thought. The conversation becomes a series of waits, not a natural exchange.
Audio must precede or accompany video. Prolonged silence before speech breaks the illusion of presence.
The first video frame must appear within a second. A frozen avatar while audio plays creates cognitive dissonance.
Complete sequence of a conversational exchange with latency budget per component. The main bottleneck is avatar video generation (5–8s out of the 6–12s total).
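One way to make the budget concrete is to allocate it per component so that the cumulative sums hit the UX targets above. The split below is a hypothetical placeholder; only the 2s end-to-end, 500ms first-sound, and 1s first-frame targets come from the project.

```python
# Illustrative decomposition of the <2s end-to-end budget (all values in ms).
# Cumulative checkpoints: 500ms at first audio, 1000ms at first video frame,
# 2000ms at full response (TTFR).
BUDGET_MS = {
    "asr_end_of_turn": 150,       # streaming ASR + end-of-turn detection
    "llm_first_token": 200,       # orchestration + first LLM tokens
    "tts_first_chunk": 150,       # cumulative 500ms = first-sound target
    "video_first_frame": 500,     # cumulative 1000ms = first-frame target
    "steady_state_streaming": 1000,  # cumulative 2000ms = TTFR target
}
assert sum(BUDGET_MS.values()) <= 2000  # the constraint that structures the architecture
```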
3-layer memory: coherence without context overload
Speech & Audio Processing
Memory is a sub-problem of latency: each memory layer must be accessible without adding perceptible delay. Mem0 (2025) reports a 90% token reduction and a 26% accuracy gain, but the impact on generation latency remains to be measured in our context.
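A hedged sketch of the constraint as code: query all layers in parallel and drop any layer that misses the budget, so a slow layer degrades recall quality rather than latency. The layer names (`working`, `episodic`, `semantic`), the `retrieve` stub, and the 150ms budget are assumptions for illustration.

```python
import asyncio

async def retrieve(layer: str, query: str) -> list[str]:
    # Placeholder: a real layer would hit a vector store or key-value cache.
    await asyncio.sleep(0.01)
    return [f"{layer}: memory relevant to {query!r}"]

async def recall(query: str, budget_s: float = 0.15) -> list[str]:
    """Query the three layers concurrently; cancel any that miss the budget."""
    tasks = [asyncio.create_task(retrieve(layer, query))
             for layer in ("working", "episodic", "semantic")]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()                     # slow layer -> degraded recall, not delay
    return [memory for task in done for memory in task.result()]

# print(asyncio.run(recall("what did we discuss last session?")))
```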
Personalization & evaluation metrics
Does the generated voice match the individual prosodic fingerprint (rhythm, emphasis, pauses)? Metric: MOS + DTW on pause patterns (see the sketch after these metrics).
Do micro-expressions and gestures match the extracted behavioral repertoire? Metric: FID (Fréchet Inception Distance) adapted to facial sequences.
Does the user maintain engagement over time? Metrics: session duration, completion rate, subjective naturalness score (Likert 1–5).
Does the avatar correctly recall relevant information from previous sessions? Metric: LoCoMo benchmark (Snap Research 2024).
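As a reference point for the first metric above, here is a minimal, dependency-free DTW over pause-duration sequences. The example sequences are made up; in practice they would be extracted from aligned audio (see the prosody sketch under "Personalized Expressive TTS").

```python
def dtw(a: list[float], b: list[float]) -> float:
    """Dynamic time warping distance between two pause-duration sequences (s)."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

reference = [0.42, 0.18, 0.95, 0.30]  # pauses in the source speaker's audio
generated = [0.40, 0.22, 0.70, 0.33]  # pauses in the synthesized speech
print(dtw(reference, generated))      # lower = closer prosodic fingerprint
```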
Two independent streams, one dual-stream output
Computer Vision & Speech
The system strictly separates source video analysis (Stream A, offline, non-critical) from avatar construction (Stream B, main R&D). The avatar training video is never played in the experience. Axis 2's challenge is making Stream B fast enough to meet Axis 1's latency budget.
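The separation can be stated as two interfaces with opposite constraints. The class and function names below are assumptions; the text only fixes that Stream A runs offline and Stream B must fit the Axis 1 latency budget.

```python
from dataclasses import dataclass

@dataclass
class BehaviorProfile:
    """Output of Stream A: patterns extracted from archive footage."""
    gestures: list[str]
    micro_expressions: list[str]

def stream_a_analyze(archive_videos: list[str]) -> BehaviorProfile:
    """Offline batch analysis; no latency constraint. Source video is never replayed."""
    return BehaviorProfile(gestures=["open_palm"], micro_expressions=["half_smile"])

def stream_b_generate(profile: BehaviorProfile, reply_text: str) -> bytes:
    """Real-time avatar synthesis; must meet the <2s end-to-end budget."""
    return b""  # placeholder for a streamed audio/video chunk
```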
Behavioral Extraction from Archives
Extract individual behavioral patterns from existing videos — without new capture sessions. Identify: micro-expression repertoire, gestural vocabulary, gesture-speech temporal relationships, postural habits.
Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?
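One plausible first approach to this question, sketched under stated assumptions: estimate per-frame pose features (e.g. from a 2D keypoint detector, which is assumed here and replaced by random data) and cluster them into a discrete gestural vocabulary whose frequency profile characterizes the speaker.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pose_features = rng.normal(size=(5000, 34))  # 5000 frames × 17 keypoints (x, y)

# Each cluster centroid is one recurring pose/gesture unit for this speaker.
vocab = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pose_features)
gesture_ids = vocab.labels_       # frame -> gesture-unit id

# How often each unit occurs: the speaker's gestural "vocabulary" profile.
print(np.bincount(gesture_ids))
```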
Coherent Body Language Generation
Go beyond lip-sync. Generate coordinated body behavior: synchronized with speech content and emotional tone, culturally appropriate, consistent with the defined personality.
Most current systems focus on the face only; the body is either absent or drawn from a generic template library.
Personalized Expressive TTS
Generate speech capturing not only vocal timbre but the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, emotional modulation. The voice must match the avatar's emotional state.
How much source audio is needed to capture prosodic individuality? Minutes or hours?
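A sketch of extracting one component of the prosodic fingerprint, the pause distribution, from source audio. The input file is hypothetical, and the RMS silence threshold and frame sizes are illustrative choices, not calibrated values.

```python
import librosa
import numpy as np

y, sr = librosa.load("source_speaker.wav", sr=16000)  # hypothetical source file
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256)[0]
silent = rms < 0.02                                   # assumed silence threshold

# Collapse consecutive silent frames into pause durations (seconds).
pauses, run = [], 0
for is_silent in silent:
    if is_silent:
        run += 1
    elif run:
        pauses.append(run * 256 / sr)
        run = 0

print(np.mean(pauses), np.std(pauses))  # two simple fingerprint statistics
```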
Cost / Quality / Latency Optimization
Approaches: pre-rendered base + real-time lip-sync, model distillation, intelligent cache, graceful degradation. The goal is an acceptable personalized avatar at <500ms on accessible hardware.
What is the minimum compute for acceptable personalized avatar generation at <500ms?
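The graceful-degradation approach listed above can be made concrete as a tier selector: pick the best quality tier whose estimated generation time still fits the remaining budget. Tier names and timings are hypothetical placeholders.

```python
TIERS = [  # (name, estimated generation latency in ms), best quality first
    ("full_diffusion", 1800),
    ("distilled_model", 700),
    ("prerendered_base_plus_lipsync", 350),
    ("audio_only_fallback", 100),
]

def pick_tier(remaining_budget_ms: float) -> str:
    """Return the highest-quality tier that fits the remaining latency budget."""
    for name, est_ms in TIERS:
        if est_ms <= remaining_budget_ms:
            return name
    return TIERS[-1][0]  # always answer, even if only with audio

print(pick_tier(500))  # -> 'prerendered_base_plus_lipsync'
```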
Deterministic vs organic: the orchestration trilemma
Each conversation node can define its own degree of freedom (0% = scripted, 90%+ = free AI). The R&D challenge: guarantee mandatory content coverage while maintaining conversational naturalness, without adding latency from the orchestration decision itself.
Orchestration relies on a multi-agent architecture of specialized agents, each responsible for one dimension of the conversation (content coverage, narrative progression, evaluation, memory). The challenge is coordinating these agents without introducing perceptible latency or behavioral divergence.
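A minimal sketch of the per-node degree of freedom and a naive coverage check, as performed by a content-coverage agent. The node schema and the string-matching repair are assumptions for illustration; the project only fixes the scripted-to-free range and the coverage guarantee.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationNode:
    node_id: str
    freedom: float                      # 0.0 = fully scripted, 1.0 = free AI
    script: str = ""                    # used verbatim when freedom is 0
    mandatory_points: list[str] = field(default_factory=list)

def respond(node: ConversationNode, llm_reply: str) -> str:
    if node.freedom == 0.0:
        return node.script              # deterministic branch: no LLM involved
    # Coverage agent: flag mandatory points absent from the free reply.
    missing = [p for p in node.mandatory_points
               if p.lower() not in llm_reply.lower()]
    if missing:
        llm_reply += " " + " ".join(missing)  # naive repair, for illustration only
    return llm_reply
```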
Cinema-grade character design
A conversational avatar is not just a talking face. Behavioral fidelity requires an explicit emotional design layer: defining, encoding, and activating a repertoire of emotional states consistent with the character's personality, history, and interaction context.
Emotional repertoire
Define a set of discrete and continuous emotional states per character. Each state encodes: facial expression, vocal prosody, cadence, posture, micro-behaviors.
Transition & coherence
Transitions between emotional states must be smooth, personality-consistent, and not create perceptible breaks in the experience. Challenge: avoiding the 'emotional uncanny valley' effect.
Contextual activation
Emotional state is activated by conversation content, interaction history, and user signals (tone, rhythm, content). Research: real-time detection of incoming emotional signals.
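One way to encode the three requirements above is a shared continuous space (here valence/arousal) under the named discrete states, with interpolated transitions to avoid perceptible breaks. The state parameters and the linear blend are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EmotionalState:
    name: str
    valence: float      # -1 (negative) .. +1 (positive)
    arousal: float      #  0 (calm)     ..  1 (excited)
    speech_rate: float  # multiplier on the character's base cadence

CALM     = EmotionalState("calm", 0.2, 0.2, 0.95)
ENTHUSED = EmotionalState("enthused", 0.8, 0.7, 1.15)

def blend(a: EmotionalState, b: EmotionalState, t: float) -> EmotionalState:
    """Linear interpolation between states. A real system would constrain the
    path to the personality-consistent region of the space to avoid the
    'emotional uncanny valley'."""
    lerp = lambda x, y: x + t * (y - x)
    return EmotionalState(f"{a.name}->{b.name}",
                          lerp(a.valence, b.valence),
                          lerp(a.arousal, b.arousal),
                          lerp(a.speech_rate, b.speech_rate))

print(blend(CALM, ENTHUSED, 0.5))  # midpoint of a smooth transition
```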
Key differentiation dimension
No current commercial platform offers an explicit, creator-configurable emotional design system. Most let the LLM decide emotional state implicitly, with no guaranteed control or coherence. DigiDouble targets an emotional toolbox inspired by actor-direction methods, accessible to non-technical creators.