Open Source · CC-BY 4.0

Moshi (Kyutai)

Full-duplex spoken dialogue — simultaneous listening and speaking

TTFA (time to first audio, best case): 200 ms
TTFA (typical): 500 ms
Price per million chars: Free
ELO Score: (no value listed)

Comparative Scores

Voice quality: 7/10
Latency: 8/10
Voice cloning: 1/10
Expressiveness: 6/10
Sovereignty: 10/10
Price accessibility: 10/10
Multilingual: 1/10

Architecture

Architecture: Speech-text foundation model + Mimi codec (full-duplex)
Parameters: 7B
Languages: 1 (English)
Self-hostable: Yes
Streaming: Yes
DigiDouble
Axis 1 R&D: Full-duplex conversation

Key reference for Axis 1 R&D (full-duplex conversation). Full-duplex capability is the long-term goal for DigiDouble because it enables natural interruption handling. The CC-BY 4.0 license permits sovereign deployment. Not suitable for the Phase 1 MVP; evaluate it for Axis 1 advanced research.

Analysis

Moshi is a full-duplex speech-text foundation model from Kyutai, a French AI lab. Unlike turn-based systems, it listens and speaks at the same time. Audio is encoded and decoded with Mimi, a streaming neural audio codec. The CC-BY 4.0 license permits sovereign deployment. Moshi serves as a reference for full-duplex conversation research.

Strengths

  • Full-duplex: simultaneous listen + speak
  • CC-BY 4.0 — commercial sovereign deployment
  • Mimi streaming codec
  • From Kyutai (European AI lab)
  • Research reference for full-duplex

Weaknesses

  • English only
  • Requires A100 for real-time
  • No voice cloning
  • Research-grade stability

Voice Capabilities

Voice Cloning: No

No voice cloning. Fixed voice output.

Emotion Control: No

Natural prosody from end-to-end training. No explicit emotion control.

Streaming: Yes

Full-duplex: simultaneous listening and speaking. Streaming neural audio codec (Mimi). Real-time capable on A100.
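As a rough sizing aid for the hardware note above, the sketch below estimates the memory needed just to hold 7B parameters at several numeric precisions. The precisions listed and the function name are illustrative assumptions, not a statement of how Moshi is actually shipped. The estimate ignores activations, attention caches, and the Mimi codec, so real requirements will be higher.

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Assumption: only the weights are counted; runtime overhead is excluded.

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Comparing these figures with the memory of a candidate GPU gives a first filter before any real-time benchmarking.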

Lip-sync Data: No

No native lip-sync data.

Pricing

Price / 1M chars: Free
Price / minute: Free
Free tier: Free (open weights)

Open weights — self-hosting cost only.
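Because compute is the only cost, a per-minute figure can be approximated from a GPU's hourly price and the number of conversations it serves at once. The sketch below is a minimal illustration: the hourly rate and concurrency in the example are placeholder inputs, not measured or quoted values, and the function name is hypothetical.

```python
def cost_per_conversation_minute(gpu_hourly_cost: float,
                                 concurrent_sessions: int = 1) -> float:
    """Approximate compute cost of one minute of conversation.

    gpu_hourly_cost: price of one GPU for one hour, in any currency.
    concurrent_sessions: conversations served in parallel on that GPU
        (assumed to share the cost evenly).
    """
    if gpu_hourly_cost < 0:
        raise ValueError("gpu_hourly_cost must be non-negative")
    if concurrent_sessions < 1:
        raise ValueError("concurrent_sessions must be at least 1")
    return gpu_hourly_cost / 60.0 / concurrent_sessions


if __name__ == "__main__":
    # Placeholder rate of 3.00 per GPU-hour, one session per GPU.
    print(f"{cost_per_conversation_minute(3.0):.3f} per minute")
```

The estimate leaves out idle time, storage, networking, and operations staff, all of which raise the effective cost of self-hosting.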

Sovereignty & Compliance

On-premise: Yes

Full self-hosting under CC-BY 4.0. Commercial use allowed.

GDPR: Compliant

Data residency: Fully local when self-hosted.

Strategic & Business Analysis

Moshi (Kyutai) — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for DigiDouble?

Moshi is Kyutai's open-source model for real-time full-duplex voice AI, with 160 ms theoretical latency (about 200 ms in practice), a French research lab behind it, and weights released under CC-BY 4.0. It is available today for research and self-hosted use.

Open-source / self-hosted
Lock-in risk: Low
Sovereignty fit: High
Open-source threat: Low
Pricing: Stable

A. Strategic Positioning

Target customer: Researcher / Developer — real-time duplex voice, French lab

Kyutai's full-duplex real-time voice AI — handles interruptions, overlapping speech, and natural conversation flow at 160ms latency.
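The 160 ms figure can be reproduced as simple arithmetic. Kyutai's published description of Moshi states that Mimi emits frames at 12.5 Hz (80 ms each) and that the model adds one frame of acoustic delay. Both values come from that description rather than from this page, so treat them as assumptions here; the function name is illustrative.

```python
# Assumed values, taken from Kyutai's published description of Moshi.
MIMI_FRAME_RATE_HZ = 12.5   # Mimi codec frames per second
ACOUSTIC_DELAY_FRAMES = 1   # extra frames of delay before audio is emitted


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """Latency = one frame to buffer input + the acoustic delay frames."""
    frame_ms = 1000.0 / frame_rate_hz
    return frame_ms * (1 + delay_frames)


if __name__ == "__main__":
    latency = theoretical_latency_ms(MIMI_FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES)
    print(f"{latency:.0f} ms")  # prints "160 ms"
```

This is a lower bound set by the codec framing alone. Model inference time, audio I/O buffering, and network transport add to it, which is why measured time to first audio is higher.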

B. Competitive Moat

  • Full-duplex speech — handles interruptions and overlapping speech without turn-taking
  • 160 ms theoretical latency (about 200 ms in practice), competitive with commercial real-time voice solutions
  • €300M Kyutai research backing — long-term open-source commitment

Vulnerability: Research-grade model; production readiness and enterprise support are uncertain. The CC-BY 4.0 weights license permits commercial use but requires attribution.

E. Strategic Questions for DigiDouble

Sovereignty fit

French lab, EU-aligned, self-hostable. CC-BY 4.0 permits commercial use with attribution, so research, prototype, and production use can all remain sovereign.

Build vs. Buy

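Use for research and prototyping rather than the Phase 1 MVP. CC-BY 4.0 already permits commercial use with attribution, so Phase 2 does not require a license negotiation. Permissively licensed alternatives (Ultravox, Chatterbox) remain fallbacks if Moshi's English-only, research-grade limits block production use.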

Lock-in risk

Open weights under CC-BY 4.0: no vendor lock-in for research or commercial use. Attribution is required.

Roadmap alignment

Well suited to Axis 1 research and prototyping. CC-BY 4.0 poses no license barrier to Phase 2 commercial deployment; the open questions are production stability and language coverage.

Data Freshness

Updated 30 April 2026

Kyutai GitHub + research blog, 2024–2025

Update note: Moshi released Sep 2024 by Kyutai. CC-BY 4.0. Full-duplex S2S with inner monologue. 7B params. Self-hosted on GPU.