DigiDouble Research

Academic Assessment

Status of publications and recent work in key domains (2023–2026), selected for their relevance to the DigiDouble project's needs.

DOMAIN A — Conversational Memory

MemoryOS (NeurIPS, 2025)

NeurIPS 2025 ↗

Memory operating system for personalized AI agents. 3-layer architecture (STM, LTM, Knowledge). +37.6% accuracy vs baseline.

Relevance: Very high

LOCRET (ICLR, 2025)

ICLR 2025 ↗

Stateful context compression for LLM inference. 128K tokens without degradation. Applicable to long session memory.

Relevance: High

LTM-Benchmark (arXiv, 2024)

arXiv:2410.10813 ↗

Benchmark for long-term memory capabilities of LLM assistants. Opens the path toward more personalized assistants.

Relevance: High

Mem0 (2025)

arXiv:2504.19413 ↗

+26% accuracy, -91% latency, -90% tokens vs baseline. Persistent structured memory for AI agents.

Relevance: Very high

RAG-Driven Memory (IEEE, 2025)

IEEE Access ↗

Review of RAG memory architectures for conversational LLMs. Synthesis of vector DB approaches.

Relevance: High

Conversational Agents: From RAG to LTM (ACM, 2025)

ACM ↗

Transition from RAG approaches to long-term memory. Agentic memory management via RL.

Relevance: High

DOMAIN B — Avatar & Voice Synthesis

VASA-1 (Microsoft, 2024)

NeurIPS 2024 ↗

Photorealistic talking faces with nuanced expressions. 40 FPS online at 512×512. Not commercialized, so the published details may be incomplete.

Relevance: Very high

A²-LLM (2026)

arXiv:2602.04913 ↗

End-to-end audio-avatar LLM. Emotionally rich facial movements beyond lip-sync. 8B + 0.16B LoRA architecture.

Relevance: Very high

Hi-Reco (HKUST, 2025)

Conference ↗

Complete digital human: 3D avatar + expressive speech + grounded dialogue. Rare integrated approach.

Relevance: High

Survey Talking Head (ACM, 2025)

ACM Computing Surveys ↗

Comprehensive review of talking head synthesis techniques. Real-time / expressiveness / quality trilemma documented.

Relevance: High

EmergentTTS-Eval (NeurIPS, 2025)

NeurIPS 2025 ↗

Benchmark for complex style control in TTS. Evaluates ElevenLabs, Deepgram, and OpenAI 4o-mini-TTS.

Relevance: High

PerTTS (2026)

ResearchGate ↗

Personalized and controllable zero-shot spontaneous TTS. Speech style encoder + local prosody encoder.

Relevance: Very high

AvatarForcing (arXiv, March 2026)

arXiv:2603.14331 ↗

One-step streaming diffusion for talking avatars. Local-Future Sliding-Window Denoising. Single image + streaming audio → real-time long-form video. Directly applicable to Axis 1 (latency).

Relevance: Very high

Summary — Maturity levels by domain

Domain	Academic maturity	Commercial availability	DigiDouble gap
Conversational memory	High (Mem0, MemoryOS)	Partial (Mem0 API)	Multi-session avatar integration
Talking head / avatar	High (VASA-1, A²-LLM)	Partial (HeyGen, Tavus)	Latency < 500 ms + full body
Personalized expressive TTS	High (PerTTS, EmergentTTS)	Good (ElevenLabs, Cartesia)	Individual prosodic fingerprint
Conversational orchestration	Medium (SIGDIAL)	Low (Flowise custom)	Configurable freedom spectrum
