
Research Challenges

Three research axes, all converging toward a central goal: a fluid, personalized, near-real-time conversational experience. These challenges will be addressed through an Innosuisse project with IDIAP as research partner (expected start: autumn 2026).

Axis 1 — Latency & UX · Axis 2 — Avatar Behavior · Axis 3 — Orchestration
01
Target Architecture

Target architecture: available blocks (green), R&D required (blue), Memoways internal (yellow). The <2s target latency budget is the constraint that structures all architectural choices.

Legend: Available · R&D · Memoways internal · Critical bottleneck
USER · Voice/Text
AVAILABLE (Gamilab) · Sovereign ASR + STT · Swiss-hosted · HITL optional · ~300ms target
R&D, Axis 1 · Memory · 3-layer architecture
R&D, Axis 2a · Expressive TTS · Personalized prosody
R&D, Axis 2b ⚠ · Avatar generation · Behavioral fidelity · <500ms target (critical bottleneck)
R&D, Axis 3 · Orchestration · Deterministic-organic · Architecture challenge
INTERNAL (Memoways) · Node Editor (conversation graph) · Configurable Player (pedagogical mode / narrative mode)
EXPERIENCE · <2s target
TARGET LATENCY BUDGET: ASR+STT <300ms · Orchestration <200ms · SLM+LLM <500ms · TTS <200ms · Avatar (R&D) <500ms · Streaming <300ms = <2s total target
All values are R&D targets; end-to-end benchmarks planned spring 2026
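As a sanity check, the per-component targets in the diagram sum exactly to the 2-second budget on the worst-case serial path. A minimal sketch in Python (values copied from the diagram, all of them R&D targets, not measurements):

# Minimal sketch: verify that the per-component R&D targets fit the <2s end-to-end budget.
# Component names and values are taken from the target-architecture diagram above.

BUDGET_MS = {
    "asr_stt": 300,        # sovereign ASR + STT
    "orchestration": 200,  # deterministic-organic orchestrator
    "slm_llm": 500,        # language model, first response
    "tts": 200,            # expressive TTS, first audio
    "avatar": 500,         # avatar video generation (Axis 2b)
    "streaming": 300,      # WebRTC delivery
}
TOTAL_TARGET_MS = 2000

total = sum(BUDGET_MS.values())
assert total <= TOTAL_TARGET_MS, f"budget exceeded: {total}ms > {TOTAL_TARGET_MS}ms"
print(f"worst-case serial path: {total}ms (target {TOTAL_TARGET_MS}ms)")

Any component that overruns its slot must be compensated elsewhere, or by overlapping stages (see the streaming sketch further down).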
02
Axis 1 — Latency & UX Fluidity

6–12 seconds break the illusion of presence

Latency is not just a technical problem — it is a user experience problem. Beyond 2 seconds, users lose their train of thought, the avatar stops being a presence and becomes a tool. DigiDouble's goal is to cross the conversational naturalness threshold: <2s end-to-end, with first sound within 500ms.

Cognitive thresholds of perceived latency

Threshold · Qualification · UX impact · Achievable (DigiDouble)
100ms · Instantaneous · 'Immediate' response threshold: the user perceives no delay. Target for micro-interactions (click, hover). · Yes
300ms · Fluid · Perceived fluidity threshold: the user notices a slight delay but the interaction remains natural. Target for TTS first audio. · Yes
1s · Acceptable · Conversational comfort threshold: beyond this, users start anticipating the wait. Target for TTFB (first video frame). · Yes
2s · Natural limit · Conversational naturalness threshold (Nielsen 1993, validated by human dialogue research): beyond this, the conversation becomes a series of waits. DigiDouble TTFR target. · R&D goal
6–12s · Engagement break · Current DigiDouble latency (HeyGem OS): the user loses the thread and the avatar stops being a presence. High drop-off rate. This is the problem to solve. · Current problem

Comparative latency benchmark (March 2026)

Data from analysis of 11 solutions. Full technical profiles in State of the Art.

Solution · Latency · Type · Cost / Infra · Note
Beyond Presence · <100ms · Commercial · Enterprise · Proprietary infra
NVIDIA ACE · <100ms · Commercial · NVIDIA infra · NVIDIA lock-in
Simli Trinity-1 · <300ms · Commercial · $0.009/min · Gaussian Splatting
Anam · Good · Commercial · ~$0.18/min · WebRTC Pion
Runway Characters · <500ms · Commercial · $0.20/min · WebRTC GWM-1
D-ID V4 · Improved (V4) · Commercial · ~$0.35/min · WebRTC Janus
HeyGen · 2–5s · Commercial · High · Streaming
DigiDouble (current) · 6–12s · Open-source · Exoscale GPU · HeyGem OS
DigiDouble (R&D target) · <2s · R&D · Sovereign GPU · Axis 1 R&D
SoulX-FlashTalk · 0.87s startup · Research · 8xH800 · 14B DiT
AvatarForcing · Real-time · Research · Research GPU · 1-step diffusion

Competitive positioning: Latency × Sovereignty

The DigiDouble gap is visible: fast solutions have no sovereignty, sovereign solutions are not fast. The R&D goal is to bridge this gap (dashed arrow).

[Positioning chart] Axes: latency (slow → fast) × sovereignty (dependent → sovereign). Point categories: Commercial, Open-source, Research, DigiDouble (current), DigiDouble (R&D target). Plotted solutions: Beyond Presence, NVIDIA ACE, Simli Trinity-1, Runway Characters, Anam, D-ID V4, HeyGen, HeyGem OS, bitHuman, SoulX-FlashTalk, AvatarForcing. The ideal zone is fast and sovereign (R&D target); the problem zone is slow and dependent. A dashed arrow marks the R&D trajectory from DigiDouble (current) to DigiDouble (R&D target).


Target UX metrics

TTFR · Time to First Response
Current: 6–12s · Target: <2s

Beyond 2s, users lose their train of thought. The conversation becomes a series of waits, not a natural exchange.

TTFA · Time to First Audio
Current: 3–6s · Target: <500ms

Audio must precede or accompany video. Prolonged silence before speech breaks the illusion of presence.

TTFB · Time to First Frame
Current: 5–10s · Target: <1s

The first video frame must appear within a second. A frozen avatar while audio plays creates cognitive dissonance.
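These three metrics can be captured client-side with simple timestamps. A sketch with hypothetical player callbacks (the real hook names in the Configurable Player may differ):

import time

class UXMetrics:
    """Records TTFR / TTFA / TTFB relative to the end of the user's utterance.
    Hook names are illustrative; they are not the player's actual API."""

    def __init__(self):
        self.t0 = None
        self.marks = {}

    def on_user_utterance_end(self):
        self.t0 = time.monotonic()

    def _mark(self, name):
        if self.t0 is not None and name not in self.marks:
            self.marks[name] = (time.monotonic() - self.t0) * 1000  # milliseconds

    def on_first_response_token(self):   # TTFR: first text of the response
        self._mark("ttfr_ms")

    def on_first_audio_chunk(self):      # TTFA: first audible sound
        self._mark("ttfa_ms")

    def on_first_video_frame(self):      # TTFB: first avatar frame
        self._mark("ttfb_ms")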

Complete sequence of a conversational exchange with latency budget per component. The main bottleneck is avatar video generation (5–8s out of the 6–12s total).

SEQUENCE DIAGRAM · CONVERSATIONAL EXCHANGE (TARGET <2s)
Participants: User, ASR/STT, Orchestrator, Memory, LLM, TTS, Avatar
t=0 · User speech (audio)
+300ms · Transcribed text
+350ms · Context query
+400ms · Context + profile
+420ms · Enriched prompt
+900ms · Response (stream)
+950ms · Text → synthesis
+960ms · Memory update
+1100ms · Audio + phonemes
+1500ms · Video + audio sync ✓
Target timings · TTS + avatar parallelization possible · LLM → TTS streaming without waiting for full generation
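The two notes at the bottom of the diagram (LLM → TTS streaming, TTS/avatar parallelization) are the main levers. A minimal sketch of the streaming idea, with simulated coroutines standing in for the real LLM, TTS, and avatar services: synthesis starts on the first complete clause instead of waiting for the full LLM response, and the avatar consumes audio chunks as they arrive.

import asyncio

async def llm_stream(prompt):
    # Stand-in for a token-streaming LLM.
    for token in ["Sure, ", "I ", "can ", "help. ", "What ", "would ", "you ", "like?"]:
        await asyncio.sleep(0.05)          # simulated token latency
        yield token

async def tts(text_chunk):
    await asyncio.sleep(0.1)               # simulated synthesis latency
    return f"<audio:{text_chunk.strip()}>"

async def avatar(audio_queue):
    # Stand-in for video generation: consumes audio chunks as soon as they exist.
    while (chunk := await audio_queue.get()) is not None:
        print("avatar renders", chunk)

async def pipeline(prompt):
    audio_queue = asyncio.Queue()
    avatar_task = asyncio.create_task(avatar(audio_queue))
    buffer = ""
    async for token in llm_stream(prompt):
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):   # flush on clause boundary
            await audio_queue.put(await tts(buffer))
            buffer = ""
    if buffer:
        await audio_queue.put(await tts(buffer))
    await audio_queue.put(None)            # end-of-stream signal
    await avatar_task

asyncio.run(pipeline("demo prompt"))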
03
Axis 1b — Conversational Memory & Personalization

3-layer memory: coherence without context overload

Speech & Audio Processing

Memory is a sub-problem of latency: each memory layer must be accessible without adding perceptible delay. Mem0 (2025) reports −90% context tokens and +26% accuracy, but the impact on generation latency remains to be measured in our context.

CONVERSATIONAL MEMORY ARCHITECTURE · 3 LAYERS (all feeding the LLM / agent via the orchestrator)
L1 · Node Memory (short-term): current conversation, node variables, emotional state · LLM context window · cost: high
L2 · Session Memory (medium-term): visited node path, progression score, summarized history · Vector DB / RAG · cost: medium
L3 · User Memory (long-term): learning profile, historical sessions, cross-session patterns · PostgreSQL + SLM · cost: low
Goal: −90% context-window tokens · +26% accuracy (Mem0, 2025)
L1
Working Memory
Short term · Active session
·Current conversation context
·Node-specific variables
·Covered concepts tracker
·Emotional state detection
·Selective forgetting on node exit
LLM Context Window · Cost: High
Added latency: None (already in context)
L2
Episodic Memory
Medium term · Multi-node
·Path of visited nodes
·Global progression score
·Engagement level tracking
·Decisions and branches taken
·Summarized conversation history
Vector DB / RAG · Cost: Medium
Added latency: +50–200ms (retrieval)
L3
Semantic Memory
Long term · Multi-session
·Learning profile + preferences
·Historical session summaries
·Knowledge level by topic
·Detected interaction style
·Inter-session patterns
PostgreSQL + SLM · Cost: Low
Added latency: +10–50ms (SQL query)
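A minimal sketch of how the three layers could be combined per turn, with stand-in functions in place of the Vector DB and PostgreSQL lookups: L1 is already in the LLM context, while L2 and L3 are fetched concurrently so their latency targets overlap instead of adding up.

import asyncio

async def query_episodic(session_id: str, query: str) -> list[str]:
    # Stand-in for Vector DB / RAG retrieval over the summarized session history (L2).
    await asyncio.sleep(0.1)
    return ["visited: intro -> module_2", "progression score: 0.6"]

async def query_semantic(user_id: str) -> dict:
    # Stand-in for a PostgreSQL lookup of the long-term learner profile (L3).
    await asyncio.sleep(0.02)
    return {"style": "prefers examples", "level_topic_x": "intermediate"}

async def build_prompt(l1_context: str, user_id: str, session_id: str, query: str) -> str:
    episodic, profile = await asyncio.gather(
        query_episodic(session_id, query),
        query_semantic(user_id),
    )
    return "\n".join([
        "## Working memory (L1)", l1_context,
        "## Episodic memory (L2)", *episodic,
        "## Learner profile (L3)", str(profile),
        "## User", query,
    ])

print(asyncio.run(build_prompt("current node: module_2", "u42", "s7", "Can you rephrase that?")))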

Personalization & evaluation metrics

PM1 · Prosodic coherence

Does the generated voice match the individual prosodic fingerprint (rhythm, emphasis, pauses)? Metric: MOS + DTW on pause patterns (sketched after these metric definitions).

PM2 · Behavioral fidelity

Do micro-expressions and gestures match the extracted behavioral repertoire? Metric: FID (Fréchet Inception Distance) adapted to facial sequences.

PM3 · Conversational engagement

Does the user maintain engagement over time? Metrics: session duration, completion rate, subjective naturalness score (Likert 1–5).

PM4 · Memory accuracy

Does the avatar correctly recall relevant information from previous sessions? Metric: LoCoMo benchmark (Snap Research 2024).
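As an illustration of PM1, the DTW component could be prototyped as below; the pause-duration sequences and the plain dynamic-time-warping implementation are illustrative stand-ins, not the project's evaluation code.

import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic time warping over two 1-D sequences (e.g. pause durations in ms)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Illustrative pause-duration sequences (ms), e.g. obtained by forced alignment of the
# source speaker's audio vs. the generated TTS for the same utterance.
source_pauses = np.array([420, 180, 650, 200])
synth_pauses = np.array([400, 210, 580, 260, 120])
print("pause-pattern DTW distance:", dtw_distance(source_pauses, synth_pauses))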

04
Axis 2 — Avatar Behavior & Expressiveness

Two independent streams, one dual-stream output

Computer Vision & Speech

The system strictly separates source video analysis (Stream A, offline, non-critical) from avatar construction (Stream B, main R&D). The avatar training video is never played in the experience. Axis 2's challenge is making Stream B fast enough to meet Axis 1's latency budget.

Stream A: Offline analysis · standard, non-critical
Stream B: Avatar construction · main R&D challenge (Axis 2b)
Output: Synchronized dual-stream · internal expertise
STREAM A · Source Video Analysis (offline processing, not a major R&D challenge): video archives → frame extraction (JPEG, 1 fps) → semantic analysis (CLIP, BLIP2; tags, embeddings) → video descriptor DB (vector DB, semantic search) → dynamic video playlist, updated in real time based on the conversation. Illustrative videos play alongside the avatar (secondary stream, avatar keeps speaking); informative / interview videos deliver the spoken content while the avatar pauses.
STREAM B · Avatar Construction (offline training + real-time inference, main R&D challenge): single training video (never played in the experience) → behavioral fingerprint (micro-expressions, gestures, rhythm, prosody) → avatar model (R&D challenge, Axis 2b: diffusion distillation, intelligent cache, <500ms target) → real-time avatar that speaks and pauses during informative videos.
OUTPUT · Dual-Stream Experience (multi-stream sync, <100ms, Memoways internal expertise): main stream = real-time avatar (WebRTC, H.264, <100ms; pauses during informative videos); secondary stream = dynamic video playlist (illustrative videos as inserts alongside the avatar, informative videos full-screen while the avatar pauses). Smart orchestration: the avatar yields to informative videos.
Research axes involved: Axis 2b (avatar <500ms) · Axis 2a (expressive TTS) · Axis 1 (conversational memory). Source video analysis (Stream A) is NOT a research challenge: standard offline processing.
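Stream A reduces to standard embedding search. A minimal sketch, with random vectors standing in for CLIP/BLIP2 segment embeddings and a hand-rolled cosine similarity instead of a real vector index:

import numpy as np

# Sketch of the Stream A lookup: given the embedding of the current conversation turn,
# rank archived video segments by similarity and keep their playback role (illustrative
# = played alongside the avatar, informative = avatar pauses). IDs are hypothetical.

rng = np.random.default_rng(0)
segments = [
    {"id": "clip_012", "kind": "illustrative", "emb": rng.normal(size=512)},
    {"id": "itv_004", "kind": "informative", "emb": rng.normal(size=512)},
    {"id": "clip_087", "kind": "illustrative", "emb": rng.normal(size=512)},
]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def playlist_for(turn_embedding, k=2):
    ranked = sorted(segments, key=lambda s: cosine(turn_embedding, s["emb"]), reverse=True)
    return [(s["id"], s["kind"]) for s in ranked[:k]]

print(playlist_for(rng.normal(size=512)))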
Axis 2 · 2A

Behavioral Extraction from Archives

Extract individual behavioral patterns from existing videos — without new capture sessions. Identify: micro-expression repertoire, gestural vocabulary, gesture-speech temporal relationships, postural habits.

Key question:

Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?

Axis 2 · 2B

Coherent Body Language Generation

Go beyond lip-sync. Generate coordinated body behavior: synchronized with speech content and emotional tone, culturally appropriate, consistent with the defined personality.

Key question:

Most current systems focus on the face only, with the body absent or drawn from a template library. Can coordinated full-body behavior be generated for a specific individual?

Axis 2 · 2C

Personalized Expressive TTS

Generate speech capturing not only vocal timbre but the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, emotional modulation. The voice must match the avatar's emotional state.

Key question:

How much source audio is needed to capture prosodic individuality? Minutes or hours?

Axis 2 · 2D

Cost / Quality / Latency Optimization

Approaches: pre-rendered base + real-time lip-sync, model distillation, intelligent cache, graceful degradation. The goal is an acceptable personalized avatar at <500ms on accessible hardware.

Key question:

What is the minimum compute for acceptable personalized avatar generation at <500ms?
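One way to frame question 2D is a quality ladder with graceful degradation: when the remaining latency budget shrinks, the system falls back to a cheaper rendering tier instead of blocking. The tiers and timings below are assumptions for illustration, not measured figures.

# Sketch of a graceful-degradation policy for avatar generation (question 2D).
# Each tier trades visual quality for latency; estimated costs are placeholders.

TIERS = [
    {"name": "full_generation", "est_ms": 450, "quality": "personalized body + face"},
    {"name": "cached_base_plus_lipsync", "est_ms": 250, "quality": "pre-rendered base, live lip-sync"},
    {"name": "still_plus_audio", "est_ms": 80, "quality": "static portrait, audio only"},
]

def pick_tier(remaining_budget_ms: int) -> dict:
    """Return the highest-quality tier that still fits the remaining latency budget."""
    for tier in TIERS:
        if tier["est_ms"] <= remaining_budget_ms:
            return tier
    return TIERS[-1]  # worst case: never block the conversation

print(pick_tier(500)["name"])   # full_generation
print(pick_tier(300)["name"])   # cached_base_plus_lipsync
print(pick_tier(100)["name"])   # still_plus_audio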

05
Axis 3 — Orchestration & Degree of Freedom

Deterministic vs organic: the orchestration trilemma

Each conversation node can define its own degree of freedom (0% = fully scripted, 90%+ = open conversational AI). The R&D challenge: guarantee mandatory content coverage while maintaining conversational naturalness, and without adding latency from the orchestration decision.

ARCH

Orchestration relies on a multi-agent architecture: specialized agents, each responsible for one dimension of the conversation (content coverage, narrative progression, evaluation, memory). The challenge is coordinating them without introducing perceptible latency or behavioral divergence.
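A minimal sketch of that coordination constraint, assuming an asyncio-based orchestrator and purely illustrative agent names and timings: agents are run concurrently under a hard deadline (here 200ms, matching the orchestration slot in the latency budget), and late agents are dropped for the current turn rather than delaying the response.

import asyncio

async def coverage_agent(turn):
    await asyncio.sleep(0.05)                     # fast enough for this turn
    return {"must_cover": ["concept_3"]}

async def progression_agent(turn):
    await asyncio.sleep(0.03)
    return {"next_node": "module_2"}

async def memory_agent(turn):
    await asyncio.sleep(0.30)                     # deliberately misses the deadline here
    return {"profile_update": "pending"}

async def orchestrate(turn, deadline_s=0.2):
    tasks = {
        "coverage": asyncio.create_task(coverage_agent(turn)),
        "progression": asyncio.create_task(progression_agent(turn)),
        "memory": asyncio.create_task(memory_agent(turn)),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=deadline_s)
    for task in pending:
        task.cancel()                             # late agents never block the response
    return {name: task.result() for name, task in tasks.items() if task in done}

print(asyncio.run(orchestrate("user turn")))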

DEGREE OF FREEDOM · PER CONVERSATION NODE (each node defines its own deterministic ↔ organic balance)
0% · Scripted: fixed sequence, no AI generation
30% · Guided: mandatory content, AI rephrases only
50% · Balanced: hard + soft constraints, AI adapts to the user
70% · Creative: few constraints, AI drives the dialogue
90%+ · Open: topic boundaries only, full conversational AI
PEDAGOGICAL MODE · 0–50% · mandatory content coverage · strong pedagogical control · AI adapts tone, not content
NARRATIVE MODE · 50–90%+ · topic boundaries only · character personality · AI drives narrative evolution
R&D challenge: guarantee mandatory content coverage (deterministic) while maintaining conversational naturalness (organic). Hypothesis H4 · Axis 3
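On the deterministic side, the per-node contract can be made explicit. A minimal sketch with an assumed schema (field names are illustrative, not the Node Editor's actual format): each node carries a degree of freedom plus a list of mandatory concepts, and exit is gated on full coverage.

from dataclasses import dataclass, field

# Hypothetical per-node contract. freedom = 0.0 means fully scripted,
# 0.9+ means open conversation bounded only by topic.

@dataclass
class ConversationNode:
    node_id: str
    freedom: float                      # 0.0 (scripted) .. 0.9+ (open)
    mandatory_concepts: list[str] = field(default_factory=list)
    covered: set[str] = field(default_factory=set)

    def mark_covered(self, concept: str) -> None:
        if concept in self.mandatory_concepts:
            self.covered.add(concept)

    def may_exit(self) -> bool:
        """Deterministic guarantee: node exit requires full mandatory coverage."""
        return set(self.mandatory_concepts) <= self.covered

node = ConversationNode("photosynthesis_intro", freedom=0.3,
                        mandatory_concepts=["chlorophyll", "light_reaction"])
node.mark_covered("chlorophyll")
print(node.may_exit())   # False until 'light_reaction' is also covered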
05b
Emotional Toolbox & Character Design

Cinema-grade character design

A conversational avatar is not just a talking face. Behavioral fidelity requires an explicit emotional design layer: defining, encoding, and activating a repertoire of emotional states consistent with the character's personality, history, and interaction context.

ET-1

Emotional repertoire

Define a set of discrete and continuous emotional states per character. Each state encodes: facial expression, vocal prosody, cadence, posture, micro-behaviors.

ET-2

Transition & coherence

Transitions between emotional states must be smooth, personality-consistent, and not create perceptible breaks in the experience. Challenge: avoiding the 'emotional uncanny valley' effect.

ET-3

Contextual activation

Emotional state is activated by conversation content, interaction history, and user signals (tone, rhythm, content). Research: real-time detection of incoming emotional signals.
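A minimal sketch of ET-1 and ET-2, with an assumed parameter set (valence, arousal, speech rate, pause scale) standing in for the project's actual emotional encoding: discrete states are explicit and creator-configurable, and transitions interpolate between them instead of switching abruptly.

from dataclasses import dataclass

@dataclass(frozen=True)
class EmotionalState:
    name: str
    valence: float      # -1 (negative) .. +1 (positive)
    arousal: float      #  0 (calm)     .. 1 (excited)
    speech_rate: float  # multiplier applied to TTS cadence
    pause_scale: float  # multiplier applied to pause durations

CALM = EmotionalState("calm", valence=0.3, arousal=0.2, speech_rate=0.95, pause_scale=1.1)
ENTHUSIASTIC = EmotionalState("enthusiastic", valence=0.8, arousal=0.8, speech_rate=1.1, pause_scale=0.8)

def blend(a: EmotionalState, b: EmotionalState, t: float) -> EmotionalState:
    """Linear interpolation between two states; t in [0, 1] drives a gradual transition
    instead of an abrupt switch (one lever against the 'emotional uncanny valley')."""
    lerp = lambda x, y: x + (y - x) * t
    return EmotionalState(f"{a.name}->{b.name}",
                          lerp(a.valence, b.valence), lerp(a.arousal, b.arousal),
                          lerp(a.speech_rate, b.speech_rate), lerp(a.pause_scale, b.pause_scale))

print(blend(CALM, ENTHUSIASTIC, 0.25))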

Key differentiation dimension

No current commercial platform offers an explicit, creator-configurable emotional design system. Most leave the LLM to implicitly decide emotional state, without guaranteed control or coherence. DigiDouble targets an emotional toolbox inspired by actor direction methods, accessible to non-technical creators.

06
Research Collaboration — Mutual Contributions

DigiDouble brings

Sovereign ASR pipeline (Audiogami) — operational, Swiss-hosted
Multi-stream expertise — 14 years of synchronized multimedia delivery
Two validated prototypes with real user testing and documented feedback
Domain expertise in interactive narrative design and pedagogical structuring
Swiss GPU infrastructure — Exoscale partnership for sovereign compute

DigiDouble seeks

Fundamental research on memory architectures for long-duration conversational AI
Research on speech synthesis for personalized, expressive, real-time TTS
Evaluation frameworks — scientific metrics for behavioral authenticity and engagement
Publications in relevant venues (Interspeech, SIGDIAL, ACL, CHI, CVPR)
PhD/postdoc capacity to advance these axes over the project duration