Ultravox v0.5
Speech-to-speech model — ~100ms latency, no ASR/TTS pipeline needed
Architecture
Critical reference for the Phase 1 MVP architecture decision: cascading (ASR+LLM+TTS) vs end-to-end (Ultravox). End-to-end eliminates latency accumulation across pipeline stages but gives up per-stage controllability; GamiWays should benchmark both approaches in Phase 1. Note the license conflict (CC-BY-NC-4.0 in the eval vs MIT in the Feb 2026 update note) before planning commercial self-hosting.
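Benchmarking both approaches can use one harness: a sketch (function name and warmup policy are my assumptions) that times any voice pipeline callable and reports the median latency, mirroring the median metric used in the eval below.

```python
import statistics
import time

def measure_median_latency(pipeline, utterances, warmup=1):
    """Time a voice pipeline over test utterances; return median latency (s).

    `pipeline` is any callable taking one utterance and returning a response.
    Both a cascading ASR+LLM+TTS stack and an end-to-end model can be wrapped
    this way for an apples-to-apples comparison.
    """
    for u in utterances[:warmup]:
        pipeline(u)  # warm caches / model load before timing
    latencies = []
    for u in utterances:
        start = time.perf_counter()
        pipeline(u)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)
```

Wrapping each candidate stack behind the same callable keeps the comparison fair: only the pipeline internals differ, not the measurement path.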
Analysis
Ultravox v0.5 is an end-to-end speech-to-speech model that replaces the cascading ASR+LLM+TTS pipeline with a single model. It starts responding in ~100ms, vs 800ms–2s typical for cascading systems. Turn reliability: 300/300 vs 296/300 for GPT-4o Realtime. Median response latency (AIEWF eval): 0.864s vs 1.536s for GPT-4o Realtime. Best fit: latency-critical voice agents.
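The latency gap follows from simple addition: in a cascading stack, each stage's latency accumulates before any audio can play. A back-of-the-envelope sketch (the per-stage numbers are illustrative assumptions, not measurements):

```python
# Why cascading stacks drift toward the 800ms-2s range while an
# end-to-end model can start responding in ~100ms: stage latencies add.
CASCADING_MS = {
    "ASR (final transcript)": 300,
    "LLM (first token)": 350,
    "TTS (first audio)": 200,
    "network / glue": 100,
}

END_TO_END_MS = {"speech-native LLM (first audio)": 100}

def total(stages):
    # Total time before the user hears anything: the sum of all stages.
    return sum(stages.values())

print(total(CASCADING_MS), "ms cascading")
print(total(END_TO_END_MS), "ms end-to-end")
```

Tightening any single stage helps only marginally; removing the stage boundaries entirely is what collapses the budget.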
Strengths
- ~100ms response latency
- Eliminates cascading pipeline overhead
- 300/300 turn reliability
- 0.864s median latency vs 1.536s GPT-4o
- End-to-end speech understanding
Weaknesses
- No voice cloning
- License uncertainty: the eval lists CC-BY-NC-4.0 (non-commercial self-hosting only), but the Feb 2026 update note says MIT; verify before commercial use
- Less controllable than cascading systems
- No lip-sync data
Voice Capabilities
- Voice cloning: none; fixed voice output.
- Prosody/emotion: natural prosody from end-to-end training; limited explicit emotion control.
- Latency: ~100ms response latency; full-duplex capable; avoids ASR+LLM+TTS latency accumulation.
- Lip-sync: no native lip-sync data.
Pricing
Managed API: ~$0.05/min. Self-hosted: GPU cost only.
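A quick break-even sketch comparing the two options; the GPU hourly price and per-GPU session count are assumptions to replace with real quotes:

```python
# Rough break-even between the managed API (~$0.05/min, from the pricing
# above) and self-hosting on a rented GPU. GPU price and throughput are
# illustrative assumptions.
API_PER_MIN = 0.05        # $/minute of conversation via the managed API
GPU_PER_HOUR = 2.00       # assumed $/hour for an A100-class instance
CONCURRENT_SESSIONS = 10  # assumed sessions one GPU can serve at once

# Self-hosted cost per conversation-minute: GPU-minute cost split
# across concurrent sessions.
self_hosted_per_min = GPU_PER_HOUR / 60 / CONCURRENT_SESSIONS

print(f"self-hosted: ${self_hosted_per_min:.4f}/min vs API ${API_PER_MIN:.2f}/min")
```

Under these assumptions self-hosting is far cheaper per minute, but the comparison flips at low utilization, since an idle GPU still bills by the hour while the API bills only for minutes used.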
Sovereignty & Compliance
Weights listed under CC-BY-NC-4.0 (non-commercial self-hosting) in the AIEWF eval; the Feb 2026 update note reports MIT. Confirm the current license before commercial deployment.
Data residency: fully local when self-hosted.
Ultravox v0.5 — Strategic Positioning
Beyond technical specs: where does this tool sit in the ecosystem, and what are the risks and strategic implications for GamiWays?
Ultravox eliminates the STT→LLM→TTS pipeline latency with a single multimodal LLM — open-source, self-hostable, built by a team led by Justin Uberti, a co-creator of WebRTC. The architecture is the moat.
A. Strategic Positioning
Target customer: Developer / Enterprise — real-time voice agents, multimodal LLM
Open-source multimodal LLM (speech+text, no separate ASR) by Fixie AI — real-time voice agents without the STT→LLM→TTS pipeline latency.
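Speech-native APIs generally ingest raw PCM in small fixed-duration frames over a streaming connection rather than a finished recording. A minimal framing helper as a sketch; the frame size and function name are my assumptions, not the Fixie AI wire format:

```python
def pcm_frames(pcm: bytes, sample_rate=16000, frame_ms=20, sample_width=2):
    """Split mono PCM audio into fixed-duration frames for streaming.

    Each frame covers `frame_ms` of audio. Any trailing partial frame is
    dropped, matching the usual real-time streaming convention where
    incomplete frames are held until more audio arrives.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    return [pcm[i:i + frame_bytes]
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
```

Each frame would then be sent over the streaming connection as it is captured, which is what lets a speech-native model begin responding before the user has finished speaking.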
B. Competitive Moat
- Multimodal LLM architecture — processes speech directly without separate ASR step
- WebRTC co-creator (Justin Uberti) involvement — real-time communication expertise
- Open-source + managed service hybrid — flexibility for enterprise deployment
Vulnerability: Open-source nature could lead to rapid commoditization. Small team (Fixie AI) — long-term maintenance risk.
E. Strategic Questions for GamiWays
Sovereignty fit
Fully self-hostable on Swiss/EU infrastructure. Open-source license. Multimodal architecture reduces data processing surface.
Build vs. Buy
Build (integrate open-source) for Phase 2 sovereignty. Use managed service for Phase 1 speed. Multimodal architecture reduces pipeline complexity.
Lock-in risk
Open-source model — zero vendor lock-in. Managed service creates soft dependency but self-hosted alternative always available.
Roadmap alignment
Good for voice agent use cases. Multimodal LLM approach aligns with GamiWays's need for natural real-time interaction.
Data Freshness
Ultravox AIEWF eval, Feb 2026
Update note: Ultravox v0.5 released Feb 2026 under the MIT license (contradicts the CC-BY-NC-4.0 listed above; verify before commercial self-hosting). End-to-end S2S with no separate TTS stage, Llama 3.1 backbone, available self-hosted or via the Fixie.ai API. AIEWF eval: 0.864s median response latency.