Open Source | Commercial API (CC-BY-NC-4.0 weights)

Ultravox v0.5

Speech-to-speech model — ~100ms latency, no ASR/TTS pipeline needed

TTFA (time to first audio), best case: 100 ms
TTFA, typical: 300 ms
Price per million chars: Free
ELO Score: not listed

Comparative Scores

Voice quality: 7/10
Latency: 10/10
Voice cloning: 1/10
Expressiveness: 6/10
Sovereignty: 7/10
Price accessibility: 7/10
Multilingual: 5/10

Architecture

Architecture: End-to-end speech-to-speech (no cascading ASR+LLM+TTS)
Parameters: 8B (Llama-based)
Languages: 10
Self-hostable: Yes
Streaming: Yes
GamiWays
Phase 1 MVP — Architecture V2V end-to-end

Critical reference for the Phase 1 MVP architecture decision: cascading (ASR+LLM+TTS) vs end-to-end (Ultravox). An end-to-end model avoids the latency that accumulates across pipeline stages, but gives up the per-stage control a cascade offers. GamiWays Phase 1 should benchmark both approaches on the same test utterances. The CC-BY-NC-4.0 license limits commercial self-hosting.
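One way to frame the benchmark recommended above is a small harness that feeds the same utterances to both pipelines and records time to first audio and completed turns. This is a minimal sketch under stated assumptions: the `Pipeline` interface (audio bytes in, an iterator of audio chunks out) and the two stand-in pipelines are hypothetical placeholders, not the Ultravox API or any vendor SDK.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator

# Hypothetical interface: a pipeline takes input audio bytes and yields
# output audio chunks as they become available.
Pipeline = Callable[[bytes], Iterator[bytes]]


def time_to_first_audio(pipeline: Pipeline, utterance: bytes) -> float:
    """Seconds from submitting the utterance to receiving the first chunk."""
    start = time.perf_counter()
    for _chunk in pipeline(utterance):
        return time.perf_counter() - start
    raise RuntimeError("pipeline produced no audio")


def benchmark(pipeline: Pipeline, utterances: Iterable[bytes]) -> dict:
    """Run each utterance once; report median TTFA and completed turns."""
    latencies = []
    total = 0
    for utterance in utterances:
        total += 1
        try:
            latencies.append(time_to_first_audio(pipeline, utterance))
        except Exception:
            # A turn that errors or returns no audio counts as failed.
            pass
    return {
        "turns": total,
        "completed": len(latencies),
        "median_ttfa_s": statistics.median(latencies) if latencies else None,
    }


# Stand-in pipelines so the sketch runs; replace with real clients.
def fake_cascading(utterance: bytes) -> Iterator[bytes]:
    time.sleep(0.03)  # placeholder for ASR + LLM + TTS stages in sequence
    yield b"audio"


def fake_end_to_end(utterance: bytes) -> Iterator[bytes]:
    time.sleep(0.01)  # placeholder for a single model call
    yield b"audio"


if __name__ == "__main__":
    test_set = [b"hello"] * 5
    print("cascading :", benchmark(fake_cascading, test_set))
    print("end-to-end:", benchmark(fake_end_to_end, test_set))
```

Reporting completed turns alongside the median keeps a pipeline from looking fast simply because its slow turns failed.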

Analysis

Ultravox v0.5 is an end-to-end speech-to-speech model that replaces the cascading ASR+LLM+TTS pipeline. Best-case time to first audio is ~100ms (about 300ms typical), versus 800ms–2s for cascading systems. Turn reliability: 300/300 vs 296/300 for GPT-4o Realtime. Median response latency in the AIEWF eval: 0.864s vs 1.536s for GPT-4o. Best suited to latency-critical voice agents.
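The comparison figures above can be put on a common footing with a few lines of arithmetic. The sketch below uses only the numbers quoted in this analysis, plus one clearly assumed stage budget for a cascading pipeline; the per-stage values are illustrative, not measurements.

```python
# Figures quoted in the analysis.
ULTRAVOX_MEDIAN_S = 0.864
GPT4O_MEDIAN_S = 1.536


def reliability(completed: int, total: int) -> float:
    """Fraction of turns that completed successfully."""
    return completed / total


latency_ratio = GPT4O_MEDIAN_S / ULTRAVOX_MEDIAN_S
saved_per_turn_s = GPT4O_MEDIAN_S - ULTRAVOX_MEDIAN_S

print(f"median latency ratio: {latency_ratio:.2f}x")         # 1.78x
print(f"median time saved per turn: {saved_per_turn_s:.3f}s")  # 0.672s
print(f"Ultravox reliability: {reliability(300, 300):.1%}")  # 100.0%
print(f"GPT-4o reliability:   {reliability(296, 300):.1%}")  # 98.7%

# Assumed per-stage budget for a cascading pipeline (illustrative only).
# In a cascade, each stage waits on the previous one, so delays add up.
cascade_stages_ms = {"asr": 300, "llm_first_token": 400, "tts_first_audio": 200}
cascade_total_ms = sum(cascade_stages_ms.values())
print(f"assumed cascade total: {cascade_total_ms} ms")       # 900 ms
```

The assumed 900 ms total sits inside the 800ms–2s range quoted for cascading systems, which shows how three individually modest stages reach that range once they run in sequence.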

Strengths

  • ~100ms response latency
  • Eliminates cascading pipeline overhead
  • 300/300 turn reliability
  • 0.864s median latency vs 1.536s GPT-4o
  • End-to-end speech understanding

Weaknesses

  • No voice cloning
  • CC-BY-NC-4.0 — non-commercial self-hosting only
  • Less controllable than cascading systems
  • No lip-sync data

Voice Capabilities

Voice Cloning: No

No voice cloning. Fixed voice output.

Emotion Control: No

Natural prosody from end-to-end training. Limited explicit emotion control.

Streaming: Yes

~100ms response latency. Full-duplex capable. Eliminates ASR+LLM+TTS pipeline latency accumulation.

Lip-sync Data: No

No native lip-sync data.

Pricing

Price / 1M chars: Free
Price / minute: $0.05
Free tier: Limited

~$0.05/min API. Self-hosted: GPU cost only.
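To compare the per-minute API price with a GPU bill, it helps to express both in the same unit. The sketch below uses the $0.05/min figure from this section; the GPU hourly price is a hypothetical placeholder, since no GPU cost is given here, and the calculation ignores engineering and operations overhead.

```python
API_PRICE_PER_MIN_USD = 0.05  # per-minute API price quoted in this section


def api_cost_usd(minutes: float) -> float:
    """Cost of a given number of conversation minutes on the managed API."""
    return minutes * API_PRICE_PER_MIN_USD


def breakeven_minutes_per_gpu_hour(gpu_cost_per_hour_usd: float) -> float:
    """Conversation minutes served per GPU-hour at which self-hosting
    costs the same as the API (GPU cost only)."""
    return gpu_cost_per_hour_usd / API_PRICE_PER_MIN_USD


if __name__ == "__main__":
    assumed_gpu_hourly_usd = 2.00  # hypothetical GPU rental price
    print(f"10,000 API minutes: ${api_cost_usd(10_000):.2f}")
    breakeven = breakeven_minutes_per_gpu_hour(assumed_gpu_hourly_usd)
    print(f"break-even: {breakeven:.0f} conversation minutes per GPU-hour")
```

Because one GPU-hour contains 60 wall-clock minutes, any break-even figure below 60 means a single continuously busy stream already covers the assumed GPU cost; above 60, several concurrent streams per GPU are needed.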

Sovereignty & Compliance

On-premise: Yes

Weights available under CC-BY-NC-4.0. Non-commercial self-hosting.

GDPR: Compliant

Data residency: fully local when self-hosted.

Strategic & Business Analysis

Ultravox v0.5 — Strategic Positioning

Beyond technical specs: where does this tool sit in the ecosystem, what are the risks and strategic implications for GamiWays?

Ultravox eliminates the STT→LLM→TTS pipeline latency with a single multimodal LLM — open-source, self-hostable, created by the inventor of WebRTC. The architecture is the moat.

Open-source / self-hosted
Lock-in risk: Low
Sovereignty fit: High
Open-source threat: Low
Pricing: Commoditizing ↓↓

A. Strategic Positioning

Target customer: Developer / Enterprise — real-time voice agents, multimodal LLM

Open-source multimodal LLM (speech+text, no separate ASR) by Fixie AI — real-time voice agents without the STT→LLM→TTS pipeline latency.

B. Competitive Moat

  • Multimodal LLM architecture — processes speech directly without separate ASR step
  • WebRTC creator (Justin Uberti) involvement — real-time communication expertise
  • Open-source + managed service hybrid — flexibility for enterprise deployment

Vulnerability: Open-source nature could lead to rapid commoditization. Small team (Fixie AI) — long-term maintenance risk.

E. Strategic Questions for GamiWays

Sovereignty fit

Fully self-hostable on Swiss/EU infrastructure. Open-source license. Multimodal architecture reduces data processing surface.

Build vs. Buy

Build (integrate open-source) for Phase 2 sovereignty. Use managed service for Phase 1 speed. Multimodal architecture reduces pipeline complexity.

Lock-in risk

Open-source model — zero vendor lock-in. Managed service creates soft dependency but self-hosted alternative always available.

Roadmap alignment

Good for voice agent use cases. Multimodal LLM approach aligns with GamiWays's need for natural real-time interaction.

Data Freshness

Updated 30 April 2026

Ultravox AIEWF eval, Feb 2026

Update note: Ultravox v0.5 released Feb 2026. MIT license. End-to-end S2S, no separate TTS. Llama 3.1 backbone. Self-hosted or Fixie.ai API. AIEWF eval: 0.864s median latency.