Moshi (Kyutai)
Full-duplex spoken dialogue — simultaneous listening and speaking
Architecture
Key reference for Axis 1 R&D (full-duplex conversation). Full-duplex capability is the long-term goal for DigiDouble because it enables natural interruption handling. The CC-BY 4.0 license on the weights permits sovereign deployment. Not suitable for the Phase 1 MVP; evaluate for Axis 1 advanced research.
Analysis
Moshi is a full-duplex speech-text foundation model from Kyutai, a French AI lab. Unlike turn-based systems, it listens and speaks at the same time. It uses the Mimi streaming neural audio codec. The model weights are released under CC-BY 4.0, which permits sovereign deployment. It serves as a reference for full-duplex conversation research.
Strengths
- Full-duplex: simultaneous listen + speak
- CC-BY 4.0 — commercial sovereign deployment
- Mimi streaming codec
- From Kyutai (European AI lab)
- Research reference for full-duplex
Weaknesses
- English only
- Requires A100 for real-time
- No voice cloning
- Research-grade stability
Voice Capabilities
No voice cloning. Fixed voice output.
Natural prosody from end-to-end training. No explicit emotion control.
Full-duplex: simultaneous listening and speaking. Streaming neural audio codec (Mimi). Real-time capable on A100.
No native lip-sync data.
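To make the contrast between full-duplex and turn-based operation concrete, here is a toy sketch in Python. It is not Moshi's API or algorithm: `toy_policy`, the one-character-per-frame encoding, and the yield-when-the-user-speaks rule are all hypothetical stand-ins. It only illustrates the frame-synchronous idea, where every step reads one input frame and writes one output frame, so an interruption takes effect within a single frame.

```python
SILENCE = "."  # one character stands for one audio frame (hypothetical encoding)
USER = "u"     # a frame in which the user is speaking
MODEL = "m"    # a frame in which the system is speaking


def toy_policy(heard: str) -> str:
    """Hypothetical stand-in for the model: talk unless the user is talking."""
    return SILENCE if heard == USER else MODEL


def full_duplex(user_stream: str) -> str:
    """Frame-synchronous loop: one input frame in, one output frame out, every step."""
    return "".join(toy_policy(frame) for frame in user_stream)


def turn_based(user_stream: str, reply_frames: int = 2) -> str:
    """Turn-based baseline: stay silent until the user's input has fully ended."""
    return SILENCE * len(user_stream) + MODEL * reply_frames


if __name__ == "__main__":
    user = "..uu.."
    print("user       :", user)
    print("full-duplex:", full_duplex(user))
    print("turn-based :", turn_based(user))
```

In this toy run the full-duplex output goes quiet on exactly the frames where the user speaks, while the turn-based baseline produces nothing until the input is over.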
Pricing
Open weights — self-hosting cost only.
Sovereignty & Compliance
Full self-hosting under CC-BY 4.0. Commercial use allowed.
Data residency: Fully local when self-hosted.
Moshi (Kyutai) — Strategic Positioning
Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for DigiDouble?
Moshi is Kyutai's open-weights model for real-time full-duplex voice AI: 160 ms theoretical latency, built by a European research lab, with weights released under CC-BY 4.0. It is available today as a research-grade reference for natural voice interaction.
A. Strategic Positioning
Target customer: Researcher / Developer — real-time duplex voice, French lab
Kyutai's full-duplex real-time voice AI handles interruptions, overlapping speech, and natural conversation flow at a quoted theoretical latency of 160 ms.
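The 160 ms figure can be reconstructed from the codec's frame timing. The sketch below assumes values taken from Kyutai's public description of Mimi and Moshi rather than from this evaluation: 24 kHz audio, 12.5 frames per second, 8 codebooks of 2048 entries each, and a one-frame acoustic delay. Treat the constants as assumptions to verify against the release you deploy; the function names are illustrative.

```python
import math

# Assumed constants (not stated in this evaluation; check against the Moshi release).
SAMPLE_RATE_HZ = 24_000      # audio sample rate
FRAME_RATE_HZ = 12.5         # codec frames per second
NUM_CODEBOOKS = 8            # codebooks per frame
CODEBOOK_SIZE = 2048         # entries per codebook
ACOUSTIC_DELAY_FRAMES = 1    # delay applied to the acoustic tokens


def frame_duration_ms() -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / FRAME_RATE_HZ


def samples_per_frame() -> int:
    """Number of audio samples covered by one codec frame."""
    return round(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def theoretical_latency_ms() -> float:
    """One frame to buffer the input plus the acoustic delay."""
    return frame_duration_ms() * (1 + ACOUSTIC_DELAY_FRAMES)


def codec_bitrate_bps() -> float:
    """Bits per second: codebooks x bits per codebook index x frames per second."""
    return NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE) * FRAME_RATE_HZ


if __name__ == "__main__":
    print(f"frame duration     : {frame_duration_ms():.0f} ms")
    print(f"samples per frame  : {samples_per_frame()}")
    print(f"theoretical latency: {theoretical_latency_ms():.0f} ms")
    print(f"codec bitrate      : {codec_bitrate_bps():.0f} bit/s")
```

Under these assumptions the budget is two 80 ms frames, so the figure is a floor set by the codec. Model compute, audio I/O, and network transport add to it in a real deployment.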
B. Competitive Moat
- Full-duplex speech — handles interruptions and overlapping speech without turn-taking
- 160 ms theoretical latency (about 200 ms reported in practice), competitive with commercial real-time voice solutions
- €300M Kyutai research backing — long-term open-source commitment
Vulnerability: research model, so production readiness and enterprise support are uncertain. The CC-BY 4.0 weights license permits commercial use but requires attribution.
E. Strategic Questions for DigiDouble
Sovereignty fit
French lab, EU-aligned, self-hostable. The CC-BY 4.0 weights license permits research, prototype, and commercial use with attribution, so deployment can remain sovereign.
Build vs. Buy
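Use for research/prototype (Phase 1). For Phase 2 commercial deployment, CC-BY 4.0 already permits commercial use with attribution; compare against permissively licensed alternatives (Ultravox, Chatterbox) on maturity and support rather than on licensing.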
Lock-in risk
Open weights under CC-BY 4.0 with self-hosting: no vendor lock-in. Commercial deployment requires attribution, not license negotiation.
Roadmap alignment
Excellent for research and Phase 1. Phase 2 commercial deployment is permitted under CC-BY 4.0 with attribution; the open question is production readiness, not licensing.
Data Freshness
Kyutai GitHub + research blog, 2024–2025
Update note: Kyutai released Moshi in September 2024. Weights are CC-BY 4.0. Full-duplex speech-to-speech (S2S) with inner monologue. 7B parameters. Self-hosted on GPU.
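As a rough sizing aid for self-hosting, the weight memory implied by the 7B parameter count can be estimated as below. This is a back-of-the-envelope sketch: the byte widths per precision are standard, but the function name is illustrative, and the estimate covers model weights only, not the KV cache, activations, the audio codec, or framework overhead.

```python
# Bytes needed to store one parameter at each numeric precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

The result is a lower bound on GPU memory; actual headroom depends on the runtime, the context length, and whether a quantized build is used.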