Custom Voice Tool Ranking

Weight criteria to your context and get a dynamic TTS / STT ranking.

Criteria Weights

Weights range from 0 to 10. Current values:
Voice Quality: 5
Latency (TTFA): 5
Voice Cloning: 5
Expressiveness: 5
Data Sovereignty: 5
Cost / Pricing: 5
Multilingual: 5

Each criterion is weighted from 0 to 10. A weight of 0 excludes the criterion from the calculation. A tool's final score is the weighted average of its raw criterion scores.
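
The weighted average described above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function name `weighted_score`, the lowercase criterion keys, and the three example scores are all hypothetical.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of raw criterion scores.

    A criterion with weight 0 contributes nothing to the numerator or the
    denominator, so it is effectively excluded from the calculation.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical raw scores for a single tool on three criteria.
example_scores = {"quality": 8, "latency": 6, "pricing": 10}

# Equal weights give the plain average: (8 + 6 + 10) / 3.
equal = weighted_score(example_scores, {"quality": 5, "latency": 5, "pricing": 5})

# Setting pricing to 0 drops it: (8 + 6) / 2.
no_price = weighted_score(example_scores, {"quality": 5, "latency": 5, "pricing": 0})

print(round(equal, 1), round(no_price, 1))  # prints: 8.0 7.0
```

With all weights equal, the result is the plain mean, which is why every card shows the same multiplier (×5) under the default settings.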

TTS Ranking — 16 tools

Sorted by weighted score
🥇 Inworld TTS-1.5 + Realtime API (Cloud API)

#1 quality benchmark: ELO 1160, sub-120ms Mini, Realtime S2S + STT + LLM Router

Sovereignty: high | Lock-in: medium
Weighted score: 8.0 / 10
Quality (×5): 10 | Latency (×5): 8 | Cloning (×5): 9 | Sovereignty (×5): 6 | Pricing (×5): 8
130ms TTFA | ELO 1160 | Commercial (training framework open-sourced)

🥈 Orpheus 3B (Open Source)

LLM-based TTS: ultra-natural speech with emotion tags and non-verbals

Sovereignty: high | Lock-in: low
Weighted score: 7.9 / 10
Quality (×5): 8 | Latency (×5): 6 | Cloning (×5): 7 | Sovereignty (×5): 10 | Pricing (×5): 10
200ms TTFA | Apache 2.0

🥉 Voxtral TTS (Mistral) (Open Source)

Open-weights TTS from Mistral: fast, adaptable, 9 languages (Mar 2026)

Sovereignty: high | Lock-in: low
Weighted score: 7.6 / 10
Quality (×5): 8 | Latency (×5): 8 | Cloning (×5): 7 | Sovereignty (×5): 9 | Pricing (×5): 9
150ms TTFA | Open weights (Mistral license)

4. ElevenLabs v3 (Cloud API)

Industry reference: 380+ voices, 70+ languages, emotional range

Sovereignty: medium | Lock-in: medium
Weighted score: 7.3 / 10
Quality (×5): 9 | Latency (×5): 8 | Cloning (×5): 10 | Sovereignty (×5): 2 | Pricing (×5): 2
75ms TTFA | ELO 1108 | Commercial

5. Chatterbox (Resemble AI) (Open Source)

MIT license; beats ElevenLabs in blind tests (63.75% preference)

Sovereignty: high | Lock-in: low
Weighted score: 7.1 / 10
Quality (×5): 7 | Latency (×5): 7 | Cloning (×5): 8 | Sovereignty (×5): 10 | Pricing (×5): 9
150ms TTFA | ELO 1050 | MIT

6. Sesame CSM (Open Source)

Conversational Speech Model: crosses the uncanny valley of voice

Sovereignty: high | Lock-in: low
Weighted score: 7.1 / 10
Quality (×5): 9 | Latency (×5): 3 | Cloning (×5): 7 | Sovereignty (×5): 10 | Pricing (×5): 10
400ms TTFA | Apache 2.0 (research)

7. Dia (Nari Labs) (Open Source)

Ultra-realistic dialogue generation: multi-speaker, emotion, non-verbals

Sovereignty: high | Lock-in: low
Weighted score: 7.0 / 10
Quality (×5): 8 | Latency (×5): 4 | Cloning (×5): 7 | Sovereignty (×5): 10 | Pricing (×5): 10
300ms TTFA | Apache 2.0

8. Kyutai TTS 1.6B (Open Source)

Delayed streams modeling: streaming-native, timestamps, batching

Sovereignty: high | Lock-in: low
Weighted score: 7.0 / 10
Quality (×5): 7 | Latency (×5): 8 | Cloning (×5): 6 | Sovereignty (×5): 10 | Pricing (×5): 10
100ms TTFA | CC-BY 4.0

9. Cartesia Sonic 3 (Cloud API)

Fastest TTFA on the market: 40ms, State Space Model architecture

Sovereignty: low | Lock-in: medium
Weighted score: 6.6 / 10
Quality (×5): 7 | Latency (×5): 10 | Cloning (×5): 8 | Sovereignty (×5): 2 | Pricing (×5): 5
40ms TTFA | ELO 1054 | Commercial

10. Hume AI Octave 2 (Cloud API)

LLM-based emotional TTS: natural language emotion control

Sovereignty: low | Lock-in: high
Weighted score: 6.4 / 10
Quality (×5): 7 | Latency (×5): 8 | Cloning (×5): 6 | Sovereignty (×5): 2 | Pricing (×5): 8
100ms TTFA | ELO 1046 | Commercial

11. Fish Audio OpenAudio S1 (Cloud API)

Pay-as-you-go voice cloning: 70% cheaper than ElevenLabs

Sovereignty: high | Lock-in: low
Weighted score: 6.3 / 10
Quality (×5): 7 | Latency (×5): 6 | Cloning (×5): 8 | Sovereignty (×5): 4 | Pricing (×5): 7
200ms TTFA | ELO 1074 | Commercial

12. Kokoro 82M v1.0 (Open Source)

Highest-ranked open-weight TTS: ELO 1059, 82M params, Apache 2.0

Sovereignty: high | Lock-in: low
Weighted score: 6.3 / 10
Quality (×5): 7 | Latency (×5): 9 | Cloning (×5): 1 | Sovereignty (×5): 10 | Pricing (×5): 10
60ms TTFA | ELO 1059 | Apache 2.0

13. Ultravox v0.5 (Open Source)

Speech-to-speech model: ~100ms latency, no ASR/TTS pipeline needed

Sovereignty: high | Lock-in: low
Weighted score: 6.1 / 10
Quality (×5): 7 | Latency (×5): 10 | Cloning (×5): 1 | Sovereignty (×5): 7 | Pricing (×5): 7
100ms TTFA | Commercial API (CC-BY-NC-4.0 weights)

14. Moshi (Kyutai) (Open Source)

Full-duplex spoken dialogue: simultaneous listening and speaking

Sovereignty: high | Lock-in: low
Weighted score: 6.1 / 10
Quality (×5): 7 | Latency (×5): 8 | Cloning (×5): 1 | Sovereignty (×5): 10 | Pricing (×5): 10
200ms TTFA | CC-BY 4.0

15. OpenAI Realtime API (Cloud API)

GPT-4o speech-to-speech: integrated LLM + voice, WebSocket

Sovereignty: low | Lock-in: high
Weighted score: 5.1 / 10
Quality (×5): 8 | Latency (×5): 6 | Cloning (×5): 1 | Sovereignty (×5): 1 | Pricing (×5): 4
300ms TTFA | ELO 1106 | Commercial

16. Deepgram Aura 2 (Cloud API)

Ultra-low latency TTS optimized for voice agents: <100ms

Sovereignty: medium | Lock-in: medium
Weighted score: 4.4 / 10
Quality (×5): 6 | Latency (×5): 9 | Cloning (×5): 1 | Sovereignty (×5): 3 | Pricing (×5): 7
80ms TTFA | Commercial

Methodology: Raw scores (1–10) are sourced from public benchmarks (Artificial Analysis ELO, Koenecke WER, measured TTFA). Weighting is applied via weighted average. Sovereignty and lock-in badges come from the DigiDouble strategic analysis.
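
To illustrate how changing the weights reorders the list, here is a rough Python sketch that re-ranks four of the tools above under a custom profile. The raw scores are copied from the cards, but only for the five criteria the cards display. Expressiveness and Multilingual scores are not shown, so they are left out and the results will not match the page's own scores. The "latency-first" weight profile, the dictionary layout, and the function names are assumptions made for illustration.

```python
# Raw scores taken from the ranking cards (five displayed criteria only).
TOOLS = {
    "Inworld TTS-1.5": {"quality": 10, "latency": 8, "cloning": 9, "sovereignty": 6, "pricing": 8},
    "Orpheus 3B": {"quality": 8, "latency": 6, "cloning": 7, "sovereignty": 10, "pricing": 10},
    "Cartesia Sonic 3": {"quality": 7, "latency": 10, "cloning": 8, "sovereignty": 2, "pricing": 5},
    "Kokoro 82M v1.0": {"quality": 7, "latency": 9, "cloning": 1, "sovereignty": 10, "pricing": 10},
}

# Hypothetical latency-first profile (not one of the page's presets).
# Cloning has weight 0, so it is excluded from the calculation.
WEIGHTS = {"quality": 5, "latency": 10, "cloning": 0, "sovereignty": 2, "pricing": 3}


def weighted_score(scores, weights):
    """Weighted average of raw scores; zero-weight criteria drop out."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank(tools, weights):
    """Return (tool, score) pairs sorted by weighted score, best first."""
    scored = [(name, weighted_score(scores, weights)) for name, scores in tools.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank(TOOLS, WEIGHTS)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {score:.2f}")
# 1. Kokoro 82M v1.0: 8.75
# 2. Inworld TTS-1.5: 8.30
# 3. Cartesia Sonic 3: 7.70
# 4. Orpheus 3B: 7.50
```

Under this profile Kokoro moves ahead of Inworld because its weak cloning score no longer counts, while its latency and pricing scores carry more of the total.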