Open Source: Apache 2.0 (research)

Sesame CSM

Conversational Speech Model — crosses the uncanny valley of voice

TTFA (best case): 400ms

TTFA (typical): 1000ms

Price per million chars: Free

ELO Score: not available

Comparative Scores

Voice quality: 9/10

Latency: 3/10

Voice cloning: 7/10

Expressiveness: 10/10

Sovereignty: 10/10

Price accessibility: 10/10

Multilingual: 1/10

Architecture

Architecture: Multimodal backbone + audio decoder (RVQ tokens)

Parameters: 1B+

Languages: 1

Self-hostable: Yes

Streaming: No

DigiDouble

Axis 2 R&D: Conversational behavior

Highly relevant for Axis 2 R&D (conversational behavior). Context-aware prosody and natural backchannels are exactly what DigiDouble needs for authentic conversation. Not suitable for the real-time Phase 1 MVP; evaluate it as part of Axis 2 research instead.

Analysis

Sesame CSM (Conversational Speech Model) is a multimodal model that generates RVQ (residual vector quantization) audio codes from text and audio inputs. It is designed to 'cross the uncanny valley of conversational voice' through context-aware prosody, natural backchannels, and turn-taking behavior. It is research-grade and not production-optimized for streaming. Released March 2025.
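To make the "RVQ audio codes" idea concrete, here is a minimal sketch of residual vector quantization in pure Python. It is a toy illustration only: the codebooks are random, and the function names (`rvq_encode`, `rvq_decode`, `make_codebooks`, `reconstruction_error`) are hypothetical and not part of Sesame's code or of its actual audio tokenizer. Each stage quantizes whatever error the previous stage left behind, so one frame becomes a short list of integer codes.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Greedy RVQ: each stage encodes the residual left by the previous stage."""
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = nearest(book, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

def make_codebooks(num_stages=4, size=16, dim=2, seed=0):
    """Random toy codebooks (assumption: a real codec learns these from audio)."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage  # later stages model finer detail
        book = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size)]
        book[0] = [0.0] * dim  # zero entry: a stage can never increase the error
        books.append(book)
    return books

def reconstruction_error(vec, codebooks):
    """Squared error between vec and its RVQ reconstruction."""
    rec = rvq_decode(rvq_encode(vec, codebooks), codebooks)
    return sum((a - b) ** 2 for a, b in zip(vec, rec))

books = make_codebooks()
frame = [0.3, -0.7]
codes = rvq_encode(frame, books)  # one integer code per stage
```

Using more stages can only keep or reduce the reconstruction error in this sketch, which is the basic trade-off an RVQ-based decoder works with: more codes per frame give finer audio detail at the price of more tokens to generate.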

Strengths

Context-aware conversational prosody
Natural backchannels and turn-taking
Crosses uncanny valley of voice
Apache 2.0 — full sovereignty
Unique multimodal architecture

Weaknesses

Not production-optimized for streaming
TTFA of ~400ms at best, ~1000ms typical
English only
Research-grade stability

Voice Capabilities

Voice Cloning: Yes

Context-aware voice generation. Generates RVQ audio codes from text + audio inputs. Conversational context maintained.

Emotion Control: Yes

Context-aware prosody. Conversational backchannels. Natural turn-taking behavior.

Streaming: No

Not optimized for real-time streaming. Research model.
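The TTFA figures on this page can be checked on your own hardware with a small timing harness. The sketch below assumes TTFA means the time from sending the request to receiving the first audio chunk, and it uses a hypothetical stand-in backend (`fake_synthesize`) rather than any real CSM API.

```python
import time
from typing import Callable, Iterable, Iterator

def measure_ttfa_ms(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Milliseconds from the request until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("synthesizer produced no audio")

def fake_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical backend: simulates 50 ms of model latency, then yields audio."""
    time.sleep(0.05)
    yield b"\x00" * 960
    yield b"\x00" * 960

ttfa_ms = measure_ttfa_ms(fake_synthesize, "Hello there")
```

For a non-streaming model, the wrapper passed as `synthesize` would yield a single chunk only after the whole utterance has been generated, so the measured TTFA grows with the length of the text.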

Lip-sync Data: No

No native lip-sync data.

Pricing

Price / 1M chars: Free

Price / minute: Free

Free tier: Free (open weights)

Open weights — self-hosting cost only. Research use.
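"Free" here refers to licensing; the effective price of self-hosting depends on GPU cost and throughput. The following back-of-the-envelope sketch shows the arithmetic. The function name and every number in it are hypothetical placeholders, not measured CSM figures.

```python
def cost_per_million_chars(gpu_usd_per_hour: float, chars_per_second: float) -> float:
    """Estimated USD to synthesize one million characters on a rented GPU."""
    if gpu_usd_per_hour < 0 or chars_per_second <= 0:
        raise ValueError("GPU price must be >= 0 and throughput must be > 0")
    seconds_needed = 1_000_000 / chars_per_second
    return gpu_usd_per_hour * seconds_needed / 3600.0

# Hypothetical inputs: a 1.00 USD/hour GPU sustaining 100 characters per second.
estimate = cost_per_million_chars(gpu_usd_per_hour=1.0, chars_per_second=100.0)
```

Replacing the placeholders with a measured throughput and an actual hosting price gives a figure that can be compared directly with per-character pricing from hosted TTS APIs.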

Sovereignty & Compliance

On-premise: Yes

Full self-hosting under Apache 2.0.

GDPR: Compliant

Data residency: Fully local when self-hosted.

Strategic & Business Analysis

Sesame CSM — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for DigiDouble?

Sesame CSM brings 'voice presence' to open-source TTS — emotional intelligence and contextual adaptation backed by $307M, Apache 2.0 licensed, with planned 20+ language expansion.

Open-source / self-hosted

Lock-in risk: Low

Sovereignty fit: High

Open-source threat: Low

Pricing: Stable

A. Strategic Positioning

Target customer: Developer / Enterprise — conversational AI, emotional voice presence

Open-source conversational speech model (Apache 2.0) by Sesame AI — 'voice presence' with emotional intelligence and contextual adaptation.

B. Competitive Moat

Focus on 'voice presence' — emotional intelligence and contextual adaptation beyond standard TTS
Apache 2.0 license — full commercial use, self-hostable
$307M funding from Sesame AI — significant R&D backing for an open-source model

Vulnerability: Reported slowness in online discussions. Monetization challenge vs proprietary solutions. Competitive pressure from Chatterbox and Dia.

E. Strategic Questions for DigiDouble

Sovereignty fit

Fully self-hostable on Swiss/EU infrastructure. Apache 2.0 license. $307M backing ensures long-term model development.

Build vs. Buy

Build (integrate open-source) for Phase 2 sovereignty. Evaluate performance vs Chatterbox for Phase 1 quality requirements.

Lock-in risk

Apache 2.0 open-source — zero vendor lock-in. Sesame AI's commercial services create soft dependency if used.

Roadmap alignment

Good for both phases. Planned multilingual expansion aligns with DigiDouble international requirements. Performance concerns to validate.

Data Freshness

Updated 30 April 2026

Sesame AI research blog, Mar 2025

Update note: Sesame CSM released Mar 2025. Apache 2.0. 1B params. Conversational speech model with natural prosody. Self-hosted.

Reference Sources

Sesame AI Blog (news)
Sesame CSM GitHub (docs)
HuggingFace CSM (docs)