The Media domain covers generative media: the AI systems that synthesise speech, images, video, and music. It launches voice-first, because AI speech is the most production-ready of the four and the one most teams are wiring into products in 2026. Image, video, and music generation follow as later bands. Every leaf here is verified against the current state of each vendor's offering, with the South African language context called out where it matters.
Generative media is the part of AI that produces things people watch, hear, and look at — as distinct from the Agents domain (systems that do work) and the Data domain (systems that turn events into decisions). It splits cleanly into four bands: voice & speech, image generation, video generation, and music generation.
This domain opens with voice because, of the four, AI speech is the most mature in production as of 2026 — sub-100ms text-to-speech is commoditising, conversational voice agents are replacing chatbots, and the regulatory frame (consent, voice rights, disclosure) has started to settle. The other three bands will land as the content firms up; the structure below leaves room for them.
Start with the landscape leaf to orient across the seven categories of voice AI, then drill into a specific vendor. ElevenLabs is the first per-vendor deep dive; others follow.
The seven categories of voice AI: text-to-speech (TTS), speech-to-text (STT), voice cloning, conversational voice agents, dubbing, speech-to-speech, and music. Who leads where, latency reality, the regulation in force by 2026, and the South African language gap.
The deepest voice AI platform — ElevenCreative (TTS, cloning, music), ElevenAgents (conversational voice), ElevenAPI. Flagship models Flash, Multilingual v2, v3, Scribe v2. The default pick for creators and developers.
The latency king: Cartesia's Sonic-3 at ~90ms model latency, purpose-built for real-time voice agents.
Hume AI's empathic voice interface: emotion detection and emotionally responsive generation as a first-class feature.
OpenAI gpt-realtime, Google Gemini TTS / Chirp 3, Anthropic Claude voice — voice as a first-class LLM feature.
Deepgram Nova, AssemblyAI, Speechmatics, OpenAI Whisper — the speech-to-text side of the arena.
Vapi, Retell, Bland, LiveKit Agents, Pipecat — the platforms that orchestrate STT + LLM + TTS into real-time conversation.
The domain is structured for the image, video, and music bands from day one. They land as the content firms up; voice came first because it is furthest along in production.
Diffusion and autoregressive image models — the text-to-image and image-editing arena.
Text-to-video and video-editing models — the fastest-moving and least production-settled of the four.
Generative music and sound design — licensed-data models, commercial-use clarity, the creator tooling.
Voice AI vendors over-claim on latency numbers, language counts, and "most realistic ever" quality. Every model name, price, and capability on a Media leaf is checked against current primary sources, and anything that can't be verified is marked as such.
By 2026 voice cloning sits inside real law — the EU Digital Creativity Integrity Act, the US AI Transparency Act, Tennessee's ELVIS Act. Every leaf treats consent, voice rights, and AI-audio disclosure as part of the engineering decision, not an afterthought.
South African English is broadly supported; isiZulu and isiXhosa are severely under-resourced in commercial TTS. Media leaves call this out plainly rather than pretending the coverage is even.
Sub-100ms text-to-speech is commoditising; the practical bottleneck is the STT + reasoning + TTS round trip. Leaves report latency honestly, with the human-conversation baseline (~200ms) as the reference.
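The round-trip point can be made concrete with a small latency budget. The sketch below is illustrative only: it assumes a strictly sequential STT, reasoning, TTS cascade with no streaming overlap and no network time, and the STT and reasoning figures are hypothetical placeholders rather than vendor measurements. Only the ~90ms TTS figure and the ~200ms conversational baseline come from the text above.

```python
from dataclasses import dataclass

# Human-conversation baseline cited in the text (~200ms).
HUMAN_TURN_GAP_MS = 200


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def round_trip_ms(stages):
    """Total latency when stages run strictly one after another."""
    return sum(stage.latency_ms for stage in stages)


def bottleneck(stages):
    """Return the single slowest stage in the cascade."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Hypothetical figures for illustration; not measured vendor numbers.
pipeline = [
    Stage("stt", 150.0),        # assumed
    Stage("reasoning", 400.0),  # assumed LLM time to first usable output
    Stage("tts", 90.0),         # the ~90ms model latency quoted above
]

total = round_trip_ms(pipeline)
slowest = bottleneck(pipeline)
ratio = total / HUMAN_TURN_GAP_MS

print(f"round trip: {total:.0f} ms ({ratio:.1f}x the ~200 ms baseline)")
print(f"slowest stage: {slowest.name} ({slowest.latency_ms:.0f} ms)")
```

Under these assumed numbers the TTS stage is a small share of the total, which is why a fast TTS model alone does not make a voice agent feel conversational. Real pipelines usually stream between stages, so a measured round trip can be lower than this simple sum.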