Gemini Omni is the generation half of the Gemini family. Where the reasoning tiers read media, Omni creates it — conversational, multi-turn video and image generation and editing where every edit builds on the last and the scene stays coherent. Hand it any reference — image, text, video, or audio — and it returns one cohesive output, with physics-aware motion (gravity, kinetic energy, fluid dynamics) and Gemini's world knowledge baked in. It ships as Gemini Omni Flash and is surfaced today through the Gemini app, Google Flow, and YouTube Shorts. Think of it as Nano Banana — but for video.
Every other tier in the Gemini line is an understanding model: you feed it text, images, video, or audio and it reasons about them. Gemini Omni inverts that. It's Google DeepMind's native multimodal generation model — the same family lineage, the same world knowledge, but pointed at producing video and images rather than analysing them. Google's one-line pitch is literally "create anything from any input — starting with video."
The defining property isn't raw generation — plenty of models can turn a prompt into a clip. It's conversational, multi-turn editing that stays coherent. "Every edit you make builds on the one before, maintaining a consistent, coherent scene." You don't re-roll the whole video each time you want a change; you direct it the way you'd direct an editor — swap this character, move the camera over the shoulder, sync the lights to the music — and the model preserves what already worked while applying the new instruction.
The other half is input fusion: "turn any reference — image, text, video, or audio — into a single, cohesive output." A reference photo for a character, a sketch for the motion path, an audio track for the beat, a sentence for the style — Omni resolves all of them into one result rather than treating them as separate, bolted-together steps. Underneath, it "combines an intuitive understanding of physics with Gemini's knowledge of history, science, and cultural context — bridging the gap from photorealism to meaningful storytelling."
Google frames Omni as "Nano Banana, but for video." Nano Banana is the Gemini-family image generation-and-editing model that made conversational, reference-driven image edits feel native. Omni extends that exact interaction model — iterative, prompt-driven, consistency-preserving — into the temporal dimension, where keeping a subject, a style, and a physics model stable across frames and across edits is the genuinely hard part. If you've used Nano Banana for stills, Omni is the same muscle memory for motion.
Omni isn't one trick. The demonstrated capability surface spans transformation, simulation, replacement, and synchronisation — all reachable through the same conversational interface, all preserving the rest of the scene while one thing changes.
Change the visual style — line art, stuffed puppet, hologram, voxel, claymation — while the underlying motion and scene structure stay intact.
Gravity, kinetic energy, and fluid dynamics are modelled, not faked — so ripples, falls, and flows behave the way the eye expects them to.
Swap any element by natural language or reference image — "change the butterfly to a bee" — without disturbing the rest of the frame.
Re-frame the shot ("over the violinist's shoulder") or transfer a movement pattern from one video onto another subject.
Coordinate on-screen events to a beat — "lights turn on in sync with the music" — and align overlaid text with the action.
Hand it a sketch as a movement guide and it renders realistic footage that follows your drawn path. Storyboard beats in, structured sequence out.
Single-shot text-to-video is a solved-enough problem. The thing that makes Omni a tool rather than a slot machine is that the seventh edit doesn't throw away the first six. A creator can land a scene, then refine it the way they'd refine a draft — lighting, then a character swap, then a camera move, then a style pass — and each step inherits the last. That turns generation from "regenerate and pray" into something closer to a non-destructive editing timeline.
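The "each step inherits the last" idea can be sketched as a small data model. This is a conceptual illustration only, not a description of how Omni works internally: the `EditTimeline` class, its attribute names, and the example instructions are all hypothetical. It shows that an edit overrides one property while every earlier decision carries forward.

```python
from dataclasses import dataclass, field


@dataclass
class EditTimeline:
    """Hypothetical model of a multi-turn session (not Omni's internals)."""
    states: list = field(default_factory=list)        # one scene snapshot per turn
    instructions: list = field(default_factory=list)  # the instruction that produced each edit

    def start(self, **attributes):
        # The first generation establishes the scene.
        self.states = [dict(attributes)]
        self.instructions = []

    def edit(self, instruction, **changes):
        # Copy the previous snapshot and override only what this turn changes.
        new_state = {**self.states[-1], **changes}
        self.states.append(new_state)
        self.instructions.append(instruction)
        return new_state


timeline = EditTimeline()
timeline.start(subject="violinist", lighting="daylight", camera="wide", style="realistic")
timeline.edit("warm the lighting", lighting="warm stage lights")
timeline.edit("move the camera over the violinist's shoulder", camera="over the shoulder")
final = timeline.edit("make it claymation", style="claymation")

# The third edit still carries the lighting and camera choices from turns one and two.
print(final)
```

Every earlier snapshot stays in `timeline.states`, which is the sense in which the workflow resembles a non-destructive editing timeline rather than a fresh roll each turn.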
What follows walks through the kinds of edit Google demonstrates, each one a single conversational instruction. The interface is the prompt: you describe the change, and Omni applies it to the existing scene and hands the result back for the next edit.
Illustrative: the prompt text and edit types are from Google's Gemini Omni gallery and prompt guide. The rendered clips are not reproduced here; watch the live examples on the source page linked in Resources.
Swap one subject for another by name. The flight path, lighting, and background carry through untouched — only the creature changes.
Re-stage the shot without re-rendering the performer. Omni recomputes the viewpoint while holding the subject and motion stable.
Tie an on-screen event to a beat. Action synchronisation lets the visual respond to the audio track frame-for-frame.
Metamorphose a subject — line art, stuffed puppet, hologram, voxel — while the original motion and timing are preserved.
A rough sketch becomes the motion guide for a photoreal clip — your hand-drawn path drives the camera or subject.
Fluid dynamics and material transformation in one instruction — the kind of edit that only holds together when physics is modelled, not painted on.
The point of Omni is the chain, not any single prompt. Here's the violinist scene refined the way you'd actually work — each turn inherits the last.
Google's prompt guide breaks an effective Omni prompt into five components. The headline rule: the more detail you add, the more control you'll have over the final output — but you don't have to be exhaustive, because the model's reasoning and world knowledge fill the gaps.
| Element | What to specify |
|---|---|
| Shot framing & motion | How the shot is framed — wide-angle, medium, or close-up — and how the camera moves. |
| Style | The aesthetic register: "realistic, or cinematic? Grounded, or majestic?" |
| Lighting | Source and quality: "where does the light come from — the sun, a streetlamp?" |
| Location | The landscape you imagine — e.g. "an alien landscape with clear, azure water." |
| Action | Who the characters and objects are, and how they're moving. |
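As a minimal sketch, the five elements in the table can be kept as separate fields and joined into one prompt string. The `OmniPrompt` class, its field names, and the ordering of the sentences are assumptions made for illustration; the guide does not prescribe a format, and nothing here calls a Google API. The example text for the explorer, the framing, and the lighting is invented; only the location phrase comes from the guide.

```python
from dataclasses import dataclass


@dataclass
class OmniPrompt:
    """Hypothetical container for the five prompt elements; all are optional."""
    framing: str = ""
    style: str = ""
    lighting: str = ""
    location: str = ""
    action: str = ""

    def render(self) -> str:
        # Assumed ordering: action first, then where, then how it is shot and lit.
        parts = [self.action, self.location, self.framing, self.lighting, self.style]
        sentences = [p.strip().rstrip(".") for p in parts if p.strip()]
        return (". ".join(sentences) + ".") if sentences else ""


prompt = OmniPrompt(
    framing="Wide-angle shot, slow push-in",
    style="Cinematic",
    lighting="Low sun from the left",
    location="An alien landscape with clear, azure water",
    action="A lone explorer wades towards the shore",
)
text = prompt.render()
print(text)
```

Leaving a field empty simply drops it from the rendered text, which matches the guide's point that a prompt does not have to be exhaustive.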
The guide is explicit: edit through conversation rather than starting over. The model "will preserve your video across multiple amends — keeping what works." And lean on its world knowledge for complex actions — "when you refer to a complex action, Gemini Omni understands your intention, and how this action should be applied across your video." You name the action; Omni works out how it should play out across the frames.
The recurring edit categories: object / subject swap ("change the butterfly to a bee"); camera direction ("over the violinist's shoulder"); style transfer (new aesthetic, original motion preserved); action synchronisation ("lights turn on in sync with the music"); and storyboard-based generation (share visual narrative beats for a structured sequence). Naming the category you want tends to land the edit more cleanly than describing it from scratch.
The version Google actually ships is Gemini Omni Flash — positioned for "exceptional results in Video Editing, Text to Video, Image to Video, and Reference to Video." Google's reported human-rater and benchmark results across those four tasks:
| Task | Benchmark / method | Reported result |
|---|---|---|
| Video editing | Head-to-head, human raters | Leads on Overall Preference & Instruction Following |
| Text-to-video | MovieGenBench (Meta · 1,003 prompts) | Leads on Overall Preference & Instruction Following |
| Image-to-video | VBench I2V (355 image-text pairs) | Tied performance |
| Reference-to-video | Human raters | Leads on Overall Preference & Speech Adherence |
These are Google's own reported figures: internal head-to-head ratings plus two public benchmarks (Meta's MovieGenBench, VBench I2V). They're a credible capability signal, not an independent leaderboard. The honest read: Omni Flash is competitive-to-leading across the four video tasks Google chose to report, with the strongest claims on instruction following and preference, which is exactly what the conversational-editing pitch needs to be true. Treat the "tied" result on image-to-video as the weak spot. The model card (linked in Resources) carries the fuller methodology.
At launch Omni is surfaced through Google's creative and consumer products rather than as a standalone developer endpoint. Every output it produces carries provenance signals.
The main consumer interface at gemini.google.com — the simplest way to drive Omni conversationally.
Google's AI creative studio, with Omni integrated for longer-form, multi-shot creative work.
Omni-powered video creation wired directly into the Shorts creation flow.
"Google AI subscription required. Features vary by tier and geography." Availability is not uniform — check your tier and region.
Every Omni output ships with two provenance layers: SynthID — Google DeepMind's imperceptible digital watermark embedded in the pixels — and C2PA Content Credentials, the cross-industry metadata standard for authenticity. Google also states the model went through both human and automated red-teaming, and that outputs adhere to its AI Principles and Generative AI Prohibited Use Policy. For anyone publishing Omni media, the practical upshot: assume it's detectable as AI-generated by design, and don't strip the credentials if your platform or jurisdiction expects disclosure.
The single most important thing for a builder to internalise: at launch Omni is a product capability, not a confirmed Vertex / AI Studio API SKU. It's reachable through the Gemini app, Flow, and Shorts — not (yet, publicly) as a programmatic endpoint you can wire into an agent. Plan accordingly.
If a proposal depends on calling Omni from code, flag it as unconfirmed until the Gemini model availability matrix lists a programmatic Omni / Omni Flash endpoint. The capability is real and shipping — through products. The API surface is the open question. Building a client deliverable around a programmatic Omni call today is a planning risk, not a feature.
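One way to make that flag mechanical is a small check over whatever list of model IDs you already have, for example one copied from the availability matrix. This is a sketch under stated assumptions: the function names are hypothetical, the model IDs in the example are placeholders, and matching on the substring "omni" is a guess, since Google has not published a programmatic model ID for Omni.

```python
def find_omni_endpoints(model_names, needle="omni"):
    """Return the model IDs that look like a programmatic Omni endpoint.

    `model_names` is any iterable of model ID strings. Matching on the
    substring "omni" is an assumption, not a documented naming scheme.
    """
    return sorted(name for name in model_names if needle in name.lower())


def api_dependency_status(model_names):
    """Summarise whether a proposal may rely on calling Omni from code."""
    matches = find_omni_endpoints(model_names)
    if matches:
        return "confirmed: " + ", ".join(matches)
    return "unconfirmed: treat programmatic Omni as a planning risk"


# Placeholder IDs standing in for a real listing that has no Omni entry.
listed = ["example-reasoning-model", "example-image-model"]
print(api_dependency_status(listed))
```

Re-running the check whenever the availability matrix changes keeps the "unconfirmed" label in a proposal tied to evidence, not to memory.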
For South African (SA) agencies and content studios, Omni's value is immediate and the access path is simple: a Google AI subscription and the Gemini app or Flow. For short-form social, pitch animatics, explainer sequences, and ad concepting, the conversational edit loop is a genuine production accelerant where the alternative is a full motion-graphics pass. The constraint is Google's own availability caveat, "features vary by tier and geography", so confirm the specific Omni features are live for SA accounts before you quote a client a turnaround that depends on them.
Unlike the Gemini reasoning tiers, which run SA-resident from Vertex AI's africa-south1 region in Johannesburg, Omni has no regional hosting at launch and no inference path that satisfies POPIA (the Protection of Personal Information Act). For regulated SA clients (banks, insurers, government) who need generated media produced under data-residency controls, Omni isn't the answer today. Keep generative-video work for those clients in the "watch this space" column and revisit when (if) a Vertex SKU lands in africa-south1.
SynthID + C2PA on every output is a selling point in the SA market, not a limitation. For brand-safety-conscious clients and any public-sector work, "every clip is watermarked and carries content credentials" is exactly the disclosure posture procurement increasingly asks for. Lead with it.
africa-south1 residency story — would live. The hosting question is a GCP question.