Research Challenges
Three research axes converge toward a central goal: a fluid, personalized, near-real-time conversational experience. These challenges will be addressed through an Innosuisse project with IDIAP as research partner (expected start: autumn 2026).
Target architecture: available blocks (green), R&D required (blue), Memoways internal (yellow). The <2s target latency budget is the constraint that structures all architectural choices.
6–12 seconds break the illusion of presence
Latency is not just a technical problem — it is a user experience problem. Beyond 2 seconds, users lose their train of thought, the avatar stops being a presence and becomes a tool. DigiDouble's goal is to cross the conversational naturalness threshold: <2s end-to-end, with first sound within 500ms.
Cognitive thresholds of perceived latency
| Threshold | Qualification | UX Impact | Achievable (DigiDouble) |
|---|---|---|---|
| 100ms | Instantaneous | 'Immediate' response threshold. User perceives no delay. Target for micro-interactions (click, hover). | ✓ Yes |
| 300ms | Fluid | Perceptive fluidity threshold. User perceives slight delay but interaction remains natural. Target for TTS first audio. | ✓ Yes |
| 1s | Acceptable | Conversational comfort threshold. Beyond this, users start anticipating the wait. Target for TTFB (first video frame). | ✓ Yes |
| 2s | Natural limit | Conversational naturalness threshold (Nielsen 1993, validated by human dialogue research). Beyond this, conversation becomes a series of waits. DigiDouble TTFR target. | R&D Goal |
| 6–12s | Engagement break | Current DigiDouble latency (HeyGem OS). User loses the thread, avatar stops being a presence. High drop-off rate. This is the problem to solve. | Current problem |
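These thresholds only matter if they are measured at the right points in the pipeline. Below is a minimal sketch of how they could be instrumented in a streaming exchange; the event names and the `LatencyProbe` class are illustrative assumptions, not DigiDouble's actual telemetry.

```python
import time

# Targets from the table above, in milliseconds since the user's turn ended.
THRESHOLDS_MS = {
    "first_audio": 300,     # perceptive fluidity (TTS first audio)
    "first_frame": 1000,    # conversational comfort (first video frame)
    "full_response": 2000,  # conversational naturalness (TTFR, R&D goal)
}

class LatencyProbe:
    def __init__(self) -> None:
        self.t0 = time.monotonic()          # started when the user stops speaking
        self.marks: dict[str, float] = {}

    def mark(self, event: str) -> None:
        """Record elapsed milliseconds for a pipeline event."""
        self.marks[event] = (time.monotonic() - self.t0) * 1000

    def report(self) -> dict[str, bool]:
        """True where the measured latency meets its cognitive threshold."""
        return {event: self.marks.get(event, float("inf")) <= budget
                for event, budget in THRESHOLDS_MS.items()}

# probe = LatencyProbe()
# ... pipeline runs, calling probe.mark("first_audio"), etc. ...
# print(probe.report())  # e.g. {'first_audio': True, 'first_frame': False, ...}
```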
Comparative latency benchmark (March 2026)
Data from analysis of 11 solutions. Full technical profiles in State of the Art.
| Solution | Latency | Type | Sovereign | Cost | Note |
|---|---|---|---|---|---|
| Beyond Presence | <100ms | Commercial | ✗ | Enterprise | Proprietary infra |
| NVIDIA ACE | <100ms | Commercial | ✗ | NVIDIA infra | NVIDIA lock-in |
| Simli Trinity-1 | <300ms | Commercial | ✗ | $0.009/min | Gaussian Splatting |
| Anam | Good | Commercial | ✗ | ~$0.18/min | WebRTC Pion |
| Runway Characters | <500ms | Commercial | ✗ | $0.20/min | WebRTC GWM-1 |
| D-ID V4 | Improved in V4 | Commercial | ✗ | ~$0.35/min | WebRTC Janus |
| HeyGen | 2–5s | Commercial | ✗ | High | Streaming |
| DigiDouble (current) | 6–12s | Open-source | ✓ | Exoscale GPU | HeyGem OS |
| DigiDouble (R&D target) | <2s | R&D | ✓ | Sovereign GPU | Axis 1 R&D |
| SoulX-FlashTalk | 0.87s startup | Research | ✗ | 8×H800 | 14B DiT |
| AvatarForcing | Real-time | Research | ✗ | Research GPU | 1-step diffusion |
Competitive positioning: Latency × Sovereignty
The DigiDouble gap is visible: fast solutions have no sovereignty, sovereign solutions are not fast. The R&D goal is to bridge this gap (dashed arrow).
Target UX metrics
Beyond 2s, users lose their train of thought. The conversation becomes a series of waits, not a natural exchange.
Audio must precede or accompany video. Prolonged silence before speech breaks the illusion of presence.
The first video frame must appear within a second. A frozen avatar while audio plays creates cognitive dissonance.
Complete sequence of a conversational exchange with latency budget per component. The main bottleneck is avatar video generation (5–8s out of the 6–12s total).
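One way to make the budget concrete is to allocate it per component so that the cumulative sums hit the UX targets above. The split below is a hypothetical placeholder; only the 2s end-to-end, 500ms first-sound, and 1s first-frame targets come from the project.

```python
# Illustrative decomposition of the <2s end-to-end budget (all values in ms).
# Cumulative checkpoints: 500ms at first audio, 1000ms at first video frame,
# 2000ms at full response (TTFR).
BUDGET_MS = {
    "asr_end_of_turn": 150,       # streaming ASR + end-of-turn detection
    "llm_first_token": 200,       # orchestration + first LLM tokens
    "tts_first_chunk": 150,       # cumulative 500ms = first-sound target
    "video_first_frame": 500,     # cumulative 1000ms = first-frame target
    "steady_state_streaming": 1000,  # cumulative 2000ms = TTFR target
}
assert sum(BUDGET_MS.values()) <= 2000  # the constraint that structures the architecture
```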
3-layer memory: coherence without context overload
Speech & Audio Processing
Memory is a sub-problem of latency: each memory layer must be accessible without adding perceptible delay. Mem0 (2025) reports a 90% token reduction and a 26% accuracy gain, but the impact on generation latency remains to be measured in our context.
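A hedged sketch of the constraint as code: query all layers in parallel and drop any layer that misses the budget, so a slow layer degrades recall quality rather than latency. The layer names (`working`, `episodic`, `semantic`), the `retrieve` stub, and the 150ms budget are assumptions for illustration.

```python
import asyncio

async def retrieve(layer: str, query: str) -> list[str]:
    # Placeholder: a real layer would hit a vector store or key-value cache.
    await asyncio.sleep(0.01)
    return [f"{layer}: memory relevant to {query!r}"]

async def recall(query: str, budget_s: float = 0.15) -> list[str]:
    """Query the three layers concurrently; cancel any that miss the budget."""
    tasks = [asyncio.create_task(retrieve(layer, query))
             for layer in ("working", "episodic", "semantic")]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()                     # slow layer -> degraded recall, not delay
    return [memory for task in done for memory in task.result()]

# print(asyncio.run(recall("what did we discuss last session?")))
```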
Personalization & evaluation metrics
Does the generated voice match the individual prosodic fingerprint (rhythm, emphasis, pauses)? Metric: MOS + DTW on pause patterns (see the sketch after these metrics).
Do micro-expressions and gestures match the extracted behavioral repertoire? Metric: FID (Fréchet Inception Distance) adapted to facial sequences.
Does the user maintain engagement over time? Metrics: session duration, completion rate, subjective naturalness score (Likert 1–5).
Does the avatar correctly recall relevant information from previous sessions? Metric: LoCoMo benchmark (Snap Research 2024).
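As a reference point for the first metric above, here is a minimal, dependency-free DTW over pause-duration sequences. The example sequences are made up; in practice they would be extracted from aligned audio (see the prosody sketch under "Personalized Expressive TTS").

```python
def dtw(a: list[float], b: list[float]) -> float:
    """Dynamic time warping distance between two pause-duration sequences (s)."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

reference = [0.42, 0.18, 0.95, 0.30]  # pauses in the source speaker's audio
generated = [0.40, 0.22, 0.70, 0.33]  # pauses in the synthesized speech
print(dtw(reference, generated))      # lower = closer prosodic fingerprint
```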
Two independent streams, one dual-stream output
Computer Vision & Speech
The system strictly separates source video analysis (Stream A, offline, non-critical) from avatar construction (Stream B, main R&D). The avatar training video is never played in the experience. Axis 2's challenge is making Stream B fast enough to meet Axis 1's latency budget.
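The separation can be stated as two interfaces with opposite constraints. The class and function names below are assumptions; the text only fixes that Stream A runs offline and Stream B must fit the Axis 1 latency budget.

```python
from dataclasses import dataclass

@dataclass
class BehaviorProfile:
    """Output of Stream A: patterns extracted from archive footage."""
    gestures: list[str]
    micro_expressions: list[str]

def stream_a_analyze(archive_videos: list[str]) -> BehaviorProfile:
    """Offline batch analysis; no latency constraint. Source video is never replayed."""
    return BehaviorProfile(gestures=["open_palm"], micro_expressions=["half_smile"])

def stream_b_generate(profile: BehaviorProfile, reply_text: str) -> bytes:
    """Real-time avatar synthesis; must meet the <2s end-to-end budget."""
    return b""  # placeholder for a streamed audio/video chunk
```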
Behavioral Extraction from Archives
Extract individual behavioral patterns from existing videos — without new capture sessions. Identify: micro-expression repertoire, gestural vocabulary, gesture-speech temporal relationships, postural habits.
Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?
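One plausible first approach to this question, sketched under stated assumptions: estimate per-frame pose features (e.g. from a 2D keypoint detector, which is assumed here and replaced by random data) and cluster them into a discrete gestural vocabulary whose frequency profile characterizes the speaker.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pose_features = rng.normal(size=(5000, 34))  # 5000 frames × 17 keypoints (x, y)

# Each cluster centroid is one recurring pose/gesture unit for this speaker.
vocab = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pose_features)
gesture_ids = vocab.labels_       # frame -> gesture-unit id

# How often each unit occurs: the speaker's gestural "vocabulary" profile.
print(np.bincount(gesture_ids))
```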
Coherent Body Language Generation
Go beyond lip-sync. Generate coordinated body behavior: synchronized with speech content and emotional tone, culturally appropriate, consistent with the defined personality.
Most current systems focus on the face only; the body is either absent or drawn from a generic template library.
Personalized Expressive TTS
Generate speech capturing not only vocal timbre but the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, emotional modulation. The voice must match the avatar's emotional state.
How much source audio is needed to capture prosodic individuality? Minutes or hours?
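A sketch of extracting one component of the prosodic fingerprint, the pause distribution, from source audio. The input file is hypothetical, and the RMS silence threshold and frame sizes are illustrative choices, not calibrated values.

```python
import librosa
import numpy as np

y, sr = librosa.load("source_speaker.wav", sr=16000)  # hypothetical source file
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256)[0]
silent = rms < 0.02                                   # assumed silence threshold

# Collapse consecutive silent frames into pause durations (seconds).
pauses, run = [], 0
for is_silent in silent:
    if is_silent:
        run += 1
    elif run:
        pauses.append(run * 256 / sr)
        run = 0

print(np.mean(pauses), np.std(pauses))  # two simple fingerprint statistics
```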
Cost / Quality / Latency Optimization
Approaches: pre-rendered base + real-time lip-sync, model distillation, intelligent cache, graceful degradation. The goal is an acceptable personalized avatar at <500ms on accessible hardware.
What is the minimum compute for acceptable personalized avatar generation at <500ms?
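The graceful-degradation approach listed above can be made concrete as a tier selector: pick the best quality tier whose estimated generation time still fits the remaining budget. Tier names and timings are hypothetical placeholders.

```python
TIERS = [  # (name, estimated generation latency in ms), best quality first
    ("full_diffusion", 1800),
    ("distilled_model", 700),
    ("prerendered_base_plus_lipsync", 350),
    ("audio_only_fallback", 100),
]

def pick_tier(remaining_budget_ms: float) -> str:
    """Return the highest-quality tier that fits the remaining latency budget."""
    for name, est_ms in TIERS:
        if est_ms <= remaining_budget_ms:
            return name
    return TIERS[-1][0]  # always answer, even if only with audio

print(pick_tier(500))  # -> 'prerendered_base_plus_lipsync'
```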
Deterministic vs organic: the orchestration trilemma
Each conversation node can define its own degree of freedom (0% = scripted, 90%+ = free AI). The R&D challenge: guarantee mandatory content coverage while maintaining conversational naturalness, without adding latency from the orchestration decision itself.
Orchestration relies on a multi-agent architecture of specialized agents, each responsible for one dimension of the conversation (content coverage, narrative progression, evaluation, memory). The challenge is coordinating these agents without introducing perceptible latency or behavioral divergence.
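A minimal sketch of the per-node degree of freedom and a naive coverage check, as performed by a content-coverage agent. The node schema and the string-matching repair are assumptions for illustration; the project only fixes the scripted-to-free range and the coverage guarantee.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationNode:
    node_id: str
    freedom: float                      # 0.0 = fully scripted, 1.0 = free AI
    script: str = ""                    # used verbatim when freedom is 0
    mandatory_points: list[str] = field(default_factory=list)

def respond(node: ConversationNode, llm_reply: str) -> str:
    if node.freedom == 0.0:
        return node.script              # deterministic branch: no LLM involved
    # Coverage agent: flag mandatory points absent from the free reply.
    missing = [p for p in node.mandatory_points
               if p.lower() not in llm_reply.lower()]
    if missing:
        llm_reply += " " + " ".join(missing)  # naive repair, for illustration only
    return llm_reply
```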
Cinema-grade character design
A conversational avatar is not just a talking face. Behavioral fidelity requires an explicit emotional design layer: defining, encoding, and activating a repertoire of emotional states consistent with the character's personality, history, and interaction context.
Emotional repertoire
Define a set of discrete and continuous emotional states per character. Each state encodes: facial expression, vocal prosody, cadence, posture, micro-behaviors.
Transition & coherence
Transitions between emotional states must be smooth, personality-consistent, and not create perceptible breaks in the experience. Challenge: avoiding the 'emotional uncanny valley' effect.
Contextual activation
Emotional state is activated by conversation content, interaction history, and user signals (tone, rhythm, content). Research: real-time detection of incoming emotional signals.
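One way to encode the three requirements above is a shared continuous space (here valence/arousal) under the named discrete states, with interpolated transitions to avoid perceptible breaks. The state parameters and the linear blend are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EmotionalState:
    name: str
    valence: float      # -1 (negative) .. +1 (positive)
    arousal: float      #  0 (calm)     ..  1 (excited)
    speech_rate: float  # multiplier on the character's base cadence

CALM     = EmotionalState("calm", 0.2, 0.2, 0.95)
ENTHUSED = EmotionalState("enthused", 0.8, 0.7, 1.15)

def blend(a: EmotionalState, b: EmotionalState, t: float) -> EmotionalState:
    """Linear interpolation between states. A real system would constrain the
    path to the personality-consistent region of the space to avoid the
    'emotional uncanny valley'."""
    lerp = lambda x, y: x + t * (y - x)
    return EmotionalState(f"{a.name}->{b.name}",
                          lerp(a.valence, b.valence),
                          lerp(a.arousal, b.arousal),
                          lerp(a.speech_rate, b.speech_rate))

print(blend(CALM, ENTHUSED, 0.5))  # midpoint of a smooth transition
```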
Key differentiation dimension
No current commercial platform offers an explicit, creator-configurable emotional design system. Most let the LLM decide emotional state implicitly, with no guaranteed control or coherence. DigiDouble targets an emotional toolbox inspired by actor-direction methods, accessible to non-technical creators.