know.2nth.ai Media
media · Top-level domain hub

Generative media. Voice first.

The Media domain covers generative media — the AI systems that synthesise speech, images, video, and music. It launches voice-first, because AI speech is the most production-ready of the four and the one most teams are wiring into products in 2026. Image, video, and music generation follow as later bands. Every leaf here is verified against current vendor state, with the South African languages context called out where it matters.

1 band live · 2 live leaves · 7 voice categories · 3 bands to come
01 · What this domain is

Four kinds of generative media. One opening band.

Generative media is the part of AI that produces things people watch, hear, and look at — as distinct from the Agents domain (systems that do work) and the Data domain (systems that turn events into decisions). It splits cleanly into four bands: voice & speech, image generation, video generation, and music generation.

This domain opens with voice because, of the four, AI speech is the most mature in production as of 2026 — sub-100ms text-to-speech is commoditising, conversational voice agents are replacing chatbots, and the regulatory frame (consent, voice rights, disclosure) has started to settle. The other three bands will land as the content firms up; the structure below leaves room for them.

02 · The bands

Voice & speech — the live band.

Start with the landscape leaf to orient across the seven categories of voice AI, then drill into a specific vendor. ElevenLabs is the first per-vendor deep dive; others follow.

Voice & speech · 2 Live
The voice AI landscape · Live
media/landscape

The seven categories of voice AI — TTS, STT, voice cloning, conversational voice agents, dubbing, speech-to-speech, music. Who leads where, latency reality, the regulation that arrived in 2026, and the South African languages gap.

ElevenLabs · Live
media/elevenlabs

The deepest voice AI platform — ElevenCreative (TTS, cloning, music), ElevenAgents (conversational voice), ElevenAPI. Flagship models Flash, Multilingual v2, v3, Scribe v2. The default pick for creators and developers.

Cartesia · Soon
media/cartesia

The latency king — Sonic-3 at ~90ms model latency, purpose-built for real-time voice agents.

Hume AI · Soon
media/hume

The empathic voice interface — emotion detection and emotionally-responsive generation as a first-class feature.

Big-lab voice · Soon
media/big-lab-voice

OpenAI gpt-realtime, Google Gemini TTS / Chirp 3, Anthropic Claude voice — voice as a first-class LLM feature.

STT · transcription · Soon
media/speech-to-text

Deepgram Nova, AssemblyAI, Speechmatics, OpenAI Whisper — the speech-to-text side of the arena.

Voice agent platforms · Soon
media/voice-agents

Vapi, Retell, Bland, LiveKit Agents, Pipecat — the platforms that orchestrate STT + LLM + TTS into real-time conversation.

03 · Bands to come

Three more generative-media bands, scoped but not yet built.

The domain is structured for these from day one. They land as the content firms up — voice was first because it is furthest along in production.

Band 2

Image generation

Diffusion and autoregressive image models — the text-to-image and image-editing arena.

Band 3

Video generation

Text-to-video and video-editing models — the fastest-moving and least production-settled of the four.

Band 4

Music generation

Generative music and sound design — licensed-data models, commercial-use clarity, the creator tooling.

04 · Principles

What every Media leaf inherits.

Verified, not vendor copy

Voice AI vendors over-claim — latency numbers, language counts, "most realistic ever." Every model name, price, and capability on a Media leaf is checked against current primary sources, and anything that can't be verified is marked as such.

Consent and disclosure are load-bearing

By 2026 voice cloning sits inside real law — the EU Digital Creativity Integrity Act, the US AI Transparency Act, Tennessee's ELVIS Act. Every leaf treats consent, voice rights, and AI-audio disclosure as part of the engineering decision, not an afterthought.

The South African languages gap is named

SA English is broadly supported; isiZulu and isiXhosa are severely under-resourced in commercial TTS. Media leaves call this out plainly rather than pretending the coverage is even.

Latency is the 2026 UX battleground

Sub-100ms text-to-speech is commoditising; the practical bottleneck is now the full STT + reasoning + TTS round trip. Leaves report end-to-end latency honestly, using the typical gap between turns in human conversation (~200ms) as the reference.
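
To make the round-trip point concrete, here is a minimal sketch of a per-turn latency budget. The stage names and every millisecond figure are hypothetical placeholders chosen for illustration, not measurements of any vendor; only the ~200ms conversational reference comes from the text above.

```python
from dataclasses import dataclass

# Reference gap between turns in human conversation, as quoted above.
HUMAN_TURN_GAP_MS = 200.0


@dataclass(frozen=True)
class Stage:
    """One step of a voice-agent turn and its latency contribution."""
    name: str
    latency_ms: float


def budget_report(stages, baseline_ms=HUMAN_TURN_GAP_MS):
    """Sum the stage latencies and name the slowest stage."""
    total = sum(s.latency_ms for s in stages)
    slowest = max(stages, key=lambda s: s.latency_ms)
    return {
        "total_ms": total,
        "over_baseline_ms": max(0.0, total - baseline_ms),
        "bottleneck": slowest.name,
    }


# Hypothetical figures for a single turn (assumptions, not benchmarks).
turn = [
    Stage("stt_final_transcript", 150.0),
    Stage("llm_first_token", 350.0),
    Stage("tts_first_audio", 90.0),
    Stage("network_overhead", 60.0),
]

report = budget_report(turn)
print(report)
```

With these placeholder numbers the text-to-speech stage is the smallest term in the sum, which is the pattern the principle describes: a fast TTS model alone does not bring the whole turn near the conversational baseline.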

05 · Connections

Where Media touches the rest of the tree.