Axis 2 — Avatar Behavior & Expressiveness
Going beyond lip-sync: behavioral extraction, coherent body language, expressive TTS, and latency optimization.
The system strictly separates video source analysis (Stream A, offline, non-critical) from avatar construction (Stream B, the main R&D stream). The avatar's training video is never played back in the experience. The Axis 2 challenge is making Stream B fast enough to meet the Axis 1 latency budget.
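To make the split concrete, here is a minimal sketch of the contract between the two streams; the names (`BehaviorProfile`, `stream_a_analyze`, `stream_b_respond`, `render_avatar`) are illustrative assumptions, not the system's actual interfaces, and the 500ms figure is the target quoted below. Stream A may take hours and runs once per person; Stream B only ever sees the distilled profile and is checked against the budget.

```python
import time
from dataclasses import dataclass, field

LATENCY_BUDGET_MS = 500  # the Axis 1 budget Stream B must meet

@dataclass
class BehaviorProfile:
    """Distilled output of Stream A: the only artifact Stream B ever sees."""
    gesture_vocabulary: list = field(default_factory=list)
    prosody_stats: dict = field(default_factory=dict)

def stream_a_analyze(video_paths):
    """Offline and non-critical: may take hours, runs once per person."""
    profile = BehaviorProfile()
    # ... heavy archive analysis would populate the profile here ...
    return profile

def render_avatar(profile, utterance):
    """Placeholder for the real-time generation stack (lip-sync, body, TTS)."""
    return b""  # encoded frames/audio in a real system

def stream_b_respond(profile, utterance):
    """Real-time and latency-critical: source video never enters this path."""
    t0 = time.perf_counter()
    frames = render_avatar(profile, utterance)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    assert elapsed_ms < LATENCY_BUDGET_MS, f"budget exceeded: {elapsed_ms:.0f} ms"
    return frames
```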
Behavioral Extraction from Archives
Extract individual behavioral patterns from existing video, with no new capture sessions. Identify the micro-expression repertoire, the gestural vocabulary, gesture-speech temporal relationships, and postural habits.
Can we automatically extract an individual's gestural vocabulary from uncontrolled footage?
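As one possible starting point, a minimal sketch of the extraction step, assuming MediaPipe Pose for landmarks and k-means for the codebook; the function name `extract_gesture_vocabulary`, the upper-body landmark range, and the window and cluster sizes are all illustrative. Real archive footage would additionally need shot segmentation, speaker identification, and pose normalization.

```python
import cv2
import numpy as np
import mediapipe as mp
from sklearn.cluster import KMeans

def extract_gesture_vocabulary(video_path, n_clusters=20, window=30):
    """Cluster fixed-length windows of upper-body pose into a gesture codebook."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            # Landmarks 11-22 cover shoulders, elbows, wrists, and hands.
            pts = result.pose_landmarks.landmark[11:23]
            frames.append([c for p in pts for c in (p.x, p.y, p.z)])
    cap.release()
    frames = np.asarray(frames)
    n = (len(frames) // window) * window   # drop the trailing partial window
    windows = frames[:n].reshape(-1, window * frames.shape[1])
    # Each centroid is one recurring motion pattern: an entry in the vocabulary.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(windows).cluster_centers_
```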
Coherent Body Language Generation
Go beyond lip-sync: generate coordinated body behavior that is synchronized with speech content and emotional tone, culturally appropriate, and consistent with the defined personality.
Most current systems focus on the face alone; the body is either absent or drawn from a template library.
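One way to frame the conditioning problem, sketched below in PyTorch with assumed dimensions and an assumed architecture (a GRU over mel frames, not the project's actual model): emotion and personality codes are broadcast over time, so the same audio can yield different motion for different styles.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Map per-frame speech features plus style codes to joint rotations."""

    def __init__(self, audio_dim=80, emotion_dim=8, person_dim=32,
                 hidden=256, n_joints=24):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + emotion_dim + person_dim,
                          hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_joints * 3)   # axis-angle per joint

    def forward(self, audio_feats, emotion, personality):
        # audio_feats: (B, T, audio_dim) mel frames aligned to the video rate;
        # emotion, personality: (B, dim) codes held constant across the clip.
        T = audio_feats.size(1)
        style = torch.cat([emotion, personality], dim=-1)
        style = style.unsqueeze(1).expand(-1, T, -1)  # broadcast over time
        h, _ = self.rnn(torch.cat([audio_feats, style], dim=-1))
        return self.head(h)                           # (B, T, n_joints * 3)

# Same speech, two personality codes -> two different motion sequences.
gen = GestureGenerator()
audio = torch.randn(1, 150, 80)
motion_a = gen(audio, torch.zeros(1, 8), torch.randn(1, 32))
motion_b = gen(audio, torch.zeros(1, 8), torch.randn(1, 32))
```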
Personalized Expressive TTS
Generate speech that captures not only the vocal timbre but also the prosodic fingerprint: rhythm, emphasis patterns, pause distribution, and emotional modulation. The voice must match the avatar's current emotional state.
How much source audio is needed to capture prosodic individuality? Minutes or hours?
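Probing that question empirically first requires a measurable prosodic fingerprint. Below is a crude sketch using librosa; the feature set (pYIN pitch statistics, an energy-variation proxy for emphasis, and an RMS-threshold pause detector) is an assumption for illustration, not the project's actual feature set. Computing it over increasing amounts of source audio and watching when the statistics stabilize gives one rough handle on "minutes or hours".

```python
from itertools import groupby

import numpy as np
import librosa

def prosodic_fingerprint(path):
    """Summary statistics of a speaker's prosody from one recording."""
    y, sr = librosa.load(path, sr=16000)
    # Frame-level pitch; pYIN leaves NaN on unvoiced frames.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    silent = rms < 0.1 * np.median(rms)              # crude pause detector
    # Lengths of consecutive silent runs approximate pause durations.
    pauses = [sum(1 for _ in run) for is_sil, run in groupby(silent) if is_sil]
    return {
        "f0_median": float(np.nanmedian(f0)),        # habitual pitch level
        "f0_range": float(np.nanpercentile(f0, 95) - np.nanpercentile(f0, 5)),
        "energy_cv": float(rms.std() / rms.mean()),  # emphasis-dynamics proxy
        "pause_fraction": float(silent.mean()),      # share of time spent silent
        "mean_pause_frames": float(np.mean(pauses)) if pauses else 0.0,
    }
```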
Cost / Quality / Latency Optimization
Candidate approaches: a pre-rendered base with real-time lip-sync, model distillation, intelligent caching, and graceful degradation. The goal is an acceptable personalized avatar at <500ms on accessible hardware.
What is the minimum compute for acceptable personalized avatar generation at <500ms?
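A sketch of the graceful-degradation piece, with illustrative tier names and thresholds (nothing here is the project's actual policy): keep a recent latency history per quality tier and serve the richest tier whose p95 still fits the budget.

```python
import time

BUDGET_MS = 500

# Quality tiers from richest to cheapest; names are illustrative only.
TIERS = ["full_body_generation", "face_plus_template_body", "prerendered_base_lipsync"]

class DegradationController:
    """Serve the richest tier whose recent p95 latency fits the budget."""

    def __init__(self, window=20):
        self.window = window
        self.history = {tier: [] for tier in TIERS}

    def choose(self):
        for tier in TIERS:
            recent = self.history[tier][-self.window:]
            if not recent:
                return tier  # no data yet: try the richer tier optimistically
            p95 = sorted(recent)[int(0.95 * (len(recent) - 1))]
            if p95 < BUDGET_MS:
                return tier
        return TIERS[-1]     # everything is slow: cheapest fallback

    def record(self, tier, started_at):
        elapsed_ms = (time.perf_counter() - started_at) * 1000
        self.history[tier].append(elapsed_ms)

# Usage: time each response and let the controller adapt per session.
ctl = DegradationController()
tier = ctl.choose()
t0 = time.perf_counter()
# ... generate the response at `tier` quality ...
ctl.record(tier, t0)
```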