Commercial · Sovereignty 1/5

Tavus (Phoenix-4 + Raven-1)

Conversational Video Interface with a full emotional intelligence stack: Raven-1 perception, Phoenix-4 rendering, and Sparrow-1 turn-taking.

TTFR Latency

~500ms

real-time

Cost / minute

$0.320/min

real-time

Visual Quality

10/10

estimated score

Protocols

WebRTC, REST, Daily SDK

Avatar Customisation

RAG / Knowledge Base

Native RAG via Persona: provide document_ids or document_tags. Avatar queries knowledge base in real-time during conversation.

Behavior & Personality

Persona system_prompt + Objectives (guided conversational goals) + Guardrails (strict behavioral limits). 30+ languages.
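
As a rough illustration of the Persona fields named above, the sketch below assembles a persona definition as a plain dictionary. Only `system_prompt`, `document_ids`, and `document_tags` come from this page; the helper name `build_persona_payload`, the sample values, and the overall payload shape are assumptions, and no request is sent. Check the vendor's API reference for the actual endpoint and schema.

```python
import json


def build_persona_payload(system_prompt, document_ids=None, document_tags=None):
    """Assemble a persona definition as a plain dict (hypothetical helper, no network call)."""
    payload = {"system_prompt": system_prompt}
    # Knowledge-base references are optional; include only what was supplied.
    if document_ids:
        payload["document_ids"] = list(document_ids)
    if document_tags:
        payload["document_tags"] = list(document_tags)
    return payload


payload = build_persona_payload(
    system_prompt="You are a concise onboarding guide.",
    document_ids=["doc-123"],      # hypothetical document identifier
    document_tags=["onboarding"],  # hypothetical tag
)
print(json.dumps(payload, indent=2))
```

Objectives and Guardrails are left out on purpose, because this page does not give their field names.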

Body Language & Gestures

Phoenix-4 (2026): generates full head, hair, eyes, pose and expression from scratch every frame — no pre-recorded video loops. Active listening: nods, tilts, microexpressions react to what the user says in real time.

Facial Expressions

Phoenix-4 emotional intelligence: smooth transitions between emotional states, emergent microexpressions, no brute-force emotion. Raven-1 perception layer feeds emotional context to Phoenix-4 in real time (context freshness < 300ms).

Voice & Voice Cloning

Tone and accent customisation. 30+ languages. Echo mode: drive avatar with external audio stream.

Persona Fine-Tuning

Objectives + Guardrails system allows fine-grained persona control. Text Respond mode for scripted interactions.

Avatar Training

Video required: Yes

Duration: 2 minutes (1 min speech + 1 min neutral listening)

Resolution: 1080p minimum, 4K recommended

Format: MP4 (H.264/AAC) or WebM, 25 fps minimum

Consent required: Yes (mandatory)

Processing time: 4–5 hours

Best Practices

01. 1 min continuous speech (clear articulation, teeth visible)
02. 1 min neutral listening (closed mouth, no expression)
03. Waist-up, seated, ~1 m from camera
04. Diffuse lighting, static background
05. Verbal consent declaration required
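
The training requirements above can be turned into a simple pre-upload check. The sketch below is a minimal local validator, not part of any Tavus SDK. The `VideoInfo` type and `check_training_video` function are hypothetical, and the metadata is assumed to have been extracted already by a tool of your choice. It treats the 2-minute duration as a minimum, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class VideoInfo:
    """Metadata of a candidate training video (hypothetical container type)."""
    container: str     # e.g. "mp4" or "webm"
    height_px: int     # vertical resolution in pixels
    fps: float         # frames per second
    duration_s: float  # total length in seconds


def check_training_video(info):
    """Return a list of problems; an empty list means the listed requirements are met."""
    problems = []
    if info.container.lower() not in {"mp4", "webm"}:
        problems.append("container must be MP4 or WebM")
    if info.height_px < 1080:
        problems.append("resolution below 1080p")
    if info.fps < 25:
        problems.append("frame rate below 25 fps")
    if info.duration_s < 120:  # assumption: 2 minutes is a lower bound
        problems.append("shorter than 2 minutes")
    return problems


print(check_training_video(VideoInfo("mp4", 2160, 30.0, 125.0)))
print(check_training_video(VideoInfo("mov", 720, 24.0, 60.0)))
```

Content requirements such as the verbal consent declaration, framing, and lighting cannot be verified from metadata and still need a manual review.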

API Analysis

Protocols

REST, WebRTC (Daily), WebSocket

SDKs

JavaScript/React, Python

Webhooks: Yes

Concurrent Sessions

1 (Free) → 15+ (Growth) → unlimited (Enterprise)

Rate Limits

S3 pre-signed URLs required for training media

Key Features

Raven-1 (2026): multimodal perception with audio-visual fusion. Tone, expression, gaze, and posture are converted into natural-language output for LLMs. Context is less than 300 ms stale; audio perception latency is under 100 ms.
Phoenix-4 (2026): fully generated face/hair/eyes/pose every frame — no video loops. Active listening behaviors. Smooth emotional transitions with microexpressions.
Sparrow-1: turn-taking model for natural conversation flow
Raven-1 tool calling: OpenAI-compatible schema, callbacks on user laughter, emotional thresholds, attention shifts
Echo mode: lip-sync on external audio stream
Text Respond: generate response from text input
Cerebras chip integration for ultra-fast LLM inference
Webhooks for training completion and conversation state
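
To make the "OpenAI-compatible schema" mentioned for Raven-1 tool calling more concrete, the sketch below shows a tool definition in the widely used OpenAI function-calling layout (`type`, `function.name`, `function.description`, `function.parameters` as JSON Schema). The tool name `on_user_laughter`, its `intensity` parameter, and the `is_openai_style_tool` helper are hypothetical. How such a tool is registered with Raven-1 is not described on this page.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling layout.
laughter_tool = {
    "type": "function",
    "function": {
        "name": "on_user_laughter",  # hypothetical callback name
        "description": "Called when the perception layer detects that the user laughed.",
        "parameters": {
            "type": "object",
            "properties": {
                "intensity": {
                    "type": "number",
                    "description": "Estimated laughter intensity between 0 and 1 (assumed scale).",
                },
            },
            "required": ["intensity"],
        },
    },
}


def is_openai_style_tool(tool):
    """Shallow structural check for the OpenAI function-tool layout."""
    fn = tool.get("function")
    return (
        tool.get("type") == "function"
        and isinstance(fn, dict)
        and isinstance(fn.get("name"), str)
        and isinstance(fn.get("parameters"), dict)
        and fn["parameters"].get("type") == "object"
    )


print(is_openai_style_tool(laughter_tool))
print(json.dumps(laughter_tool, indent=2))
```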

API Constraints

Dependency on Daily for WebRTC layer
S3 pre-signed URLs required for media upload
4–5h training time for custom replicas

Pricing Model

Model: Monthly subscription + pay-as-you-go

Plan	Price	Included minutes	Overage
Free	$0/mo	25 min conversation	N/A
Starter	$59/mo	100 min	$0.37/min
Growth	$397/mo	1250 min	$0.32/min
Enterprise	Custom	Custom	Negotiated
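
The table above implies a simple monthly cost formula: the plan price plus overage minutes multiplied by the overage rate. The sketch below applies it to the listed plans. The `PLANS` mapping and `monthly_cost` function are illustrative names. Enterprise is omitted because its terms are custom, and the Free plan is assumed to have no overage option since the table lists it as N/A.

```python
# (monthly price in USD, included minutes, overage rate in USD/min or None)
PLANS = {
    "Free": (0.0, 25, None),
    "Starter": (59.0, 100, 0.37),
    "Growth": (397.0, 1250, 0.32),
}


def monthly_cost(plan, minutes):
    """Estimate the monthly bill for conversation minutes on a given plan."""
    base, included, overage_rate = PLANS[plan]
    extra_minutes = max(0, minutes - included)
    if extra_minutes and overage_rate is None:
        # Assumption: a plan with overage "N/A" cannot exceed its included minutes.
        raise ValueError(f"{plan} plan has no overage option beyond {included} min")
    return round(base + extra_minutes * (overage_rate or 0.0), 2)


for plan, minutes in [("Starter", 150), ("Growth", 2000)]:
    print(plan, minutes, monthly_cost(plan, minutes))
```

For example, 150 minutes on Starter come to $59 + 50 × $0.37 = $77.50, and 2,000 minutes on Growth come to $397 + 750 × $0.32 = $637.00. Replica training and video generation are billed separately and are not covered by this estimate.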

Free tier

Cloud only

Enterprise pricing

Hidden costs / watch out

Replica training: $40–$65 per extra training
Video generation billed separately from conversation minutes

Sovereignty & Hosting

Sovereignty Score

1/5

Hosting

AWS US

GDPR

Yes

On-premise

Sovereignty detail

AWS US. SOC2 Type II + HIPAA (Growth+). No EU hosting.

Constraints & Limits

No manual control of specific hand/arm gestures
4–5h training time for custom replicas
High-quality video required (1080p min, 4K recommended)
US hosting only
SOC2/HIPAA only on Growth+ plans
Mandatory verbal consent for personal replicas

GamiWays Relevance

Score

9/10

As of April 2026, Tavus is the most advanced commercial platform for emotional intelligence in conversational video avatars. The Raven-1 + Phoenix-4 + Sparrow-1 stack is the reference architecture for GamiWays's target capabilities. Raven-1's perception layer (audio-visual fusion, < 300ms context freshness) directly addresses GamiWays Axis 3 (Contextual Awareness). Phoenix-4's fully-generated rendering (no video loops, active listening behaviors) sets the quality benchmark. Main limitations for GamiWays: US-only hosting (GDPR sovereignty concern), high cost ($0.32/min), and no open-source equivalent available yet.
