
The voice AI landscape.

AI speech and voice generation splits into seven distinct categories, each with its own leaders, latency norms, and pricing shape. This leaf maps the categories, identifies who leads where, reports vendor latency figures with their caveats, covers the voice-rights regulation that arrived in 2026, and names the South African languages gap. Reference knowledge, verified against current vendor state.

7 categories · Reference knowledge · Hot · quarterly review

Voice AI is not one thing.

“Voice AI” collapses seven distinct problems into one phrase. They have different leaders, different latency budgets, and different pricing models. Knowing which category you are actually buying is the first decision.

01

Text-to-speech (TTS)

Converts written text into natural speech. Evaluated on naturalness, latency, language coverage, and prosody control. The most mature category — sub-100ms synthesis is commoditising.

ElevenLabs, Cartesia, Google
02

Speech-to-text (STT)

Transcription and translation of spoken audio. Measured on word error rate (WER), latency, accent robustness, and noise tolerance. Streaming and batch are distinct use cases.

Deepgram, AssemblyAI, Whisper
03

Voice cloning / voice design

Capturing a speaker's voice from short samples and synthesising new speech in it. Ranges from instant clones (~30s of audio) to professional clones. Tightly coupled to consent and ethics.

ElevenLabs, Resemble, WellSaid
04

Conversational voice agents

End-to-end real-time systems: speech in, reasoning, speech out. Typically a stack of STT + LLM + TTS. The category replacing chatbots. Sub-500ms response is the bar.

ElevenLabs Agents, OpenAI, Hume
05

Dubbing / speech translation

Content in one language re-produced in another while preserving speaker identity, emotion, and (for video) lip sync. More than translation — audio-visual coordination.

ElevenLabs, OpenAI Realtime-Translate
06

Speech-to-speech / voice changing

Transforming one speaker's voice into another's while preserving content and emotional delivery. Used for voice conversion, accessibility, and creative work.

ElevenLabs, Resemble
07

Music generation (adjacent)

Generative music and sound design. Often bundled with voice vendors (ElevenLabs now ships music). Licensed-data models give commercial-use clarity.

ElevenLabs Music, Suno, Udio
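The STT card above names word error rate (WER) as its headline metric. As a minimal sketch, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the simple lowercase-and-split normalisation are assumptions for illustration; real evaluations usually apply a more careful text normaliser before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please transfer fifty rand to my savings account"
    hyp = "please transfer fifteen rand to savings account"
    # One substitution plus one deletion over eight reference words.
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth remembering when comparing vendor-quoted figures.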

The vendors that matter, mid-2026.

Not exhaustive — the vendors that show up in production voice stacks. Model names and latency figures are vendor-stated; treat latency as best-case, not guaranteed.

Vendor | Flagship | Strength | Open / closed
ElevenLabs | Flash (75ms), Multilingual v2, v3, Scribe v2 | Deepest platform: TTS, cloning, dubbing, agents, music. The default for creators and developers. | Closed; open API
Cartesia | Sonic-3 (~90ms model latency) | The latency king. Purpose-built for real-time voice agents; 40+ languages. | Closed
Hume AI | EVI (empathic voice interface) | Emotion as a first-class feature: reads tone and responds with matching prosody. | Closed; open API
OpenAI | gpt-realtime, Realtime-Translate, Realtime-Whisper | Voice-in-LLM: integrated listening, reasoning, speaking. Live translation across 70+ languages. | Closed; API
Google | Gemini TTS, Chirp 3 HD | Native audio inside Gemini; production TTS on Vertex AI. 100+ languages. | Closed; API
Deepgram | Nova (STT), Aura-2 (TTS) | STT accuracy leader; Aura-2 closes the TTS gap with enterprise pronunciation handling. | Closed; some on-prem
AssemblyAI | Universal-3 (streaming STT) | Audio analysis: transcription plus summarisation, entity detection, topic classification. | Closed; API
Whisper (OpenAI) | large-v3 | The open-weights STT standard. 99 languages, MIT-licensed, self-hostable or $0.006/min API. | Open (MIT)
Kokoro | Kokoro-82M | Lightweight open TTS: runs on consumer hardware, quality near larger models. No voice cloning. | Open (Apache-2)

The shape of it. ElevenLabs is the breadth leader — it does every category competently and owns creator and developer mind-share. Cartesia owns latency, Hume owns emotion. The big labs (OpenAI, Google, Anthropic) are pulling voice inside the LLM, which over time favours them for conversational agents. Deepgram and AssemblyAI own the STT side. Whisper and Kokoro prove open-weights models are competitive on accuracy and cost — the commodity floor keeps rising.
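To make the cost side of the open-weights argument concrete, here is a rough arithmetic sketch using the $0.006/min Whisper API rate from the table. The function names and the self-hosting bill are hypothetical; the break-even figure only means something once you plug in your own infrastructure cost.

```python
from decimal import Decimal

WHISPER_API_USD_PER_MIN = Decimal("0.006")  # rate quoted in the vendor table


def api_cost_usd(audio_hours: Decimal) -> Decimal:
    """Cost of transcribing `audio_hours` of audio at the per-minute API rate."""
    return audio_hours * 60 * WHISPER_API_USD_PER_MIN


def break_even_hours(self_host_usd_per_month: Decimal) -> Decimal:
    """Monthly audio hours at which a fixed self-hosting bill equals the API bill."""
    return self_host_usd_per_month / (60 * WHISPER_API_USD_PER_MIN)


if __name__ == "__main__":
    hours = Decimal("1000")
    print(f"{hours} h via API: ${api_cost_usd(hours):.2f}")

    hypothetical_gpu_bill = Decimal("500")  # hypothetical figure, not a vendor quote
    print(f"Break-even: {break_even_hours(hypothetical_gpu_bill):.0f} h/month")
```

The sketch ignores engineering time, idle capacity, and accuracy differences between hosted and self-hosted deployments, all of which can move the break-even point substantially.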

Sub-100ms TTS is commoditising. The round trip isn't.

What “fast” means in 2026

Human conversational baseline: ~200ms average response time. That is the target a voice agent is measured against.

TTS alone: the fast tier is now 75–90ms (ElevenLabs Flash, Cartesia Sonic-3). This is close to a solved problem.

End-to-end (STT + LLM reasoning + TTS): this is the real bottleneck. Sub-500ms is the bar for natural conversation; ~800ms is the practical ceiling before dialogue feels laggy. The LLM reasoning step, not the speech synthesis, is what eats the budget.

Architecture matters. Most production voice agents in 2026 are still turn-based (STT then LLM then TTS, in sequence), not true streaming. Streaming architectures shave latency but are harder to build. When a vendor quotes a latency number, ask whether it is the model alone or the full round trip.
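The budget described above can be sketched as simple arithmetic. The example below compares a turn-based pipeline, where each stage waits for the previous one to finish, with a streaming pipeline, where synthesis starts as soon as the LLM begins producing output. The thresholds come from the text; the stage timings, class name, and function names are illustrative assumptions, and the streaming model is simplified (real systems usually wait for a first clause or sentence, not a single token).

```python
from dataclasses import dataclass

NATURAL_BAR_MS = 500    # sub-500ms: the bar for natural conversation
LAGGY_CEILING_MS = 800  # ~800ms: practical ceiling before dialogue feels laggy


@dataclass(frozen=True)
class StageLatency:
    """Per-stage timings in milliseconds. All values are illustrative."""
    stt_final_ms: int        # end of user speech -> final transcript
    llm_first_token_ms: int  # transcript -> first LLM output
    llm_full_reply_ms: int   # transcript -> complete LLM reply
    tts_first_audio_ms: int  # text in -> first audio out


def turn_based_ms(s: StageLatency) -> int:
    """STT, then the whole LLM reply, then TTS, strictly in sequence."""
    return s.stt_final_ms + s.llm_full_reply_ms + s.tts_first_audio_ms


def streaming_ms(s: StageLatency) -> int:
    """TTS starts on the first LLM output instead of the full reply."""
    return s.stt_final_ms + s.llm_first_token_ms + s.tts_first_audio_ms


def verdict(total_ms: int) -> str:
    if total_ms < NATURAL_BAR_MS:
        return "natural"
    if total_ms <= LAGGY_CEILING_MS:
        return "tolerable"
    return "laggy"


if __name__ == "__main__":
    # Hypothetical timings; only the 75ms TTS figure is a vendor-stated number.
    sample = StageLatency(stt_final_ms=150, llm_first_token_ms=200,
                          llm_full_reply_ms=600, tts_first_audio_ms=75)
    for name, fn in (("turn-based", turn_based_ms), ("streaming", streaming_ms)):
        total = fn(sample)
        print(f"{name}: {total} ms -> {verdict(total)}")
```

With these assumed timings the TTS stage contributes 75ms in both designs; the difference between them comes entirely from how long the pipeline waits on the LLM, which is the point the section makes.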

Voice cloning sits inside real law now.

By mid-2026 the regulatory frame around synthetic voice has moved from “coming soon” to enforceable. Anyone cloning a voice or shipping AI audio needs to know this.

Three regulatory developments that have landed

EU Digital Creativity Integrity Act: voice rights are recognised as monetisable creative assets; vendors must track and compensate voice actors whose voices are used in cloning.

US AI Transparency and Accountability Act: individuals have legal grounds to seek recurring compensation when AI companies use their voices without consent.

Tennessee ELVIS Act: unauthorised use of a person's voice or likeness is a civil tort plus a Class A misdemeanour; the scope has expanded beyond music to all voice contexts.

Practical rules for anyone shipping voice

— Get explicit, documented consent before cloning any real person's voice.

— Label AI-generated audio clearly. Disclosure is becoming a legal expectation, not a courtesy.

— Prefer vendors that enforce consent and actor compensation (ElevenLabs, WellSaid). The compliance posture is becoming a competitive moat.

— Expect rising voice-cloned fraud — mimicked executives, family members. Voice authentication alone is no longer trustworthy as a security factor.
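The first two rules above can be enforced in code before any synthesis call is made. The sketch below is one possible shape for a consent record and a gate function; every field name, the `may_synthesise` function, and the label wording are hypothetical, and the checks a real deployment needs depend on the jurisdiction and the contract.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VoiceConsent:
    """Hypothetical record of a speaker's documented consent."""
    speaker_name: str
    document_ref: str               # pointer to the signed consent document
    granted_on: date
    expires_on: Optional[date]      # None means no expiry was agreed
    permitted_uses: FrozenSet[str]  # e.g. {"narration", "ivr"}


def may_synthesise(consent: Optional[VoiceConsent], use: str, today: date) -> bool:
    """Allow synthesis only with documented, current, in-scope consent."""
    if consent is None or not consent.document_ref:
        return False  # no explicit, documented consent
    if consent.granted_on > today:
        return False  # consent not yet in force
    if consent.expires_on is not None and today > consent.expires_on:
        return False  # consent has lapsed
    return use in consent.permitted_uses


def disclosure_label(vendor: str) -> str:
    """Plain-language label to attach to generated audio."""
    return f"This audio was generated with AI ({vendor})."


if __name__ == "__main__":
    consent = VoiceConsent(
        speaker_name="Example Speaker",
        document_ref="consent/2026-001.pdf",  # hypothetical path
        granted_on=date(2026, 1, 15),
        expires_on=date(2027, 1, 15),
        permitted_uses=frozenset({"narration"}),
    )
    print(may_synthesise(consent, "narration", date(2026, 6, 1)))
    print(may_synthesise(consent, "advertising", date(2026, 6, 1)))
    print(disclosure_label("ExampleVendor"))
```

Failing closed, so that missing or expired consent blocks synthesis instead of logging a warning, keeps the default behaviour aligned with the rules.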

The coverage is uneven. Name it plainly.

South African English is broadly supported — ElevenLabs, Google Cloud TTS, Deepgram, and Whisper all handle it, though no major vendor has tuned a model specifically for the SA-English accent to the quality bar they hit for US or UK English. Accent prompting on ElevenLabs is the usual workaround.

Afrikaans has moderate coverage: several cloud TTS providers offer af-ZA voices, though vendor support is thinner than for English and still growing.

isiZulu and isiXhosa are severely under-resourced in commercial TTS. Qfrency (South African Voices) is the notable exception that offers both. On the recognition side, Whisper's large-v3 is sometimes assumed to cover them through its multilingual training, but neither language appears on its published language list and there is no region-specific optimisation, so test before relying on it. Open datasets exist for research: OpenSLR #32 covers Afrikaans, Sesotho, Setswana, and isiXhosa.

The honest read. If you are building voice for South African English, default to ElevenLabs with accent prompting. For isiZulu, isiXhosa, and the other indigenous languages, expect a custom solution or an academic partnership — commercial TTS quality for these languages is likely 18+ months from parity, and pretending otherwise sets a project up to fail.
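The coverage picture above can be captured as a small lookup that a project uses when scoping a voice build. This is a sketch that restates the recommendations in this section; the `TtsGuidance` type, the `recommend` function, and the use of BCP 47 style tags as keys are assumptions for illustration, and the entries will need revisiting as vendor coverage changes.

```python
from typing import Dict, NamedTuple


class TtsGuidance(NamedTuple):
    coverage: str  # how well commercial TTS covers the language today
    approach: str  # the starting point this leaf suggests


GUIDANCE: Dict[str, TtsGuidance] = {
    "en-ZA": TtsGuidance("broad", "ElevenLabs with accent prompting"),
    "af-ZA": TtsGuidance("moderate", "cloud TTS provider with af-ZA voices"),
    "zu-ZA": TtsGuidance("severely under-resourced",
                         "Qfrency, or a custom solution or academic partnership"),
    "xh-ZA": TtsGuidance("severely under-resourced",
                         "Qfrency, or a custom solution or academic partnership"),
}

# Fallback for tags this leaf does not assess.
UNASSESSED = TtsGuidance("not assessed here", "evaluate vendors directly before committing")


def recommend(language_tag: str) -> TtsGuidance:
    """Return the guidance for a language tag, or the fallback if unlisted."""
    return GUIDANCE.get(language_tag, UNASSESSED)


if __name__ == "__main__":
    for tag in ("en-ZA", "zu-ZA", "st-ZA"):
        g = recommend(tag)
        print(f"{tag}: {g.coverage} -> {g.approach}")
```

Keeping the table in data instead of scattered conditionals makes the quarterly review of this leaf a one-place update.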

Primary sources.

Voice AI vendors move fast and over-claim — verify versions and latency figures against the date on this leaf.