agents · Gemini Omni · Skill Leaf

Create anything from any input. Starting with video.

Gemini Omni is the generation half of the Gemini family. Where the reasoning tiers read media, Omni creates it — conversational, multi-turn video and image generation and editing where every edit builds on the last and the scene stays coherent. Hand it any reference — image, text, video, or audio — and it returns one cohesive output, with physics-aware motion (gravity, kinetic energy, fluid dynamics) and Gemini's world knowledge baked in. It ships as Gemini Omni Flash and is surfaced today through the Gemini app, Google Flow, and YouTube Shorts. Think of it as Nano Banana — but for video.

Native multimodal generation · Gemini Omni Flash · Conversational, multi-turn editing · Physics-aware motion · SynthID + C2PA provenance

01 · What it is

The model that generates media instead of just reading it.

Every other tier in the Gemini line is an understanding model: you feed it text, images, video, or audio and it reasons about them. Gemini Omni inverts that. It's Google DeepMind's native multimodal generation model — the same family lineage, the same world knowledge, but pointed at producing video and images rather than analysing them. Google's one-line pitch is literally "create anything from any input — starting with video."

The defining property isn't raw generation — plenty of models can turn a prompt into a clip. It's conversational, multi-turn editing that stays coherent. "Every edit you make builds on the one before, maintaining a consistent, coherent scene." You don't re-roll the whole video each time you want a change; you direct it the way you'd direct an editor — swap this character, move the camera over the shoulder, sync the lights to the music — and the model preserves what already worked while applying the new instruction.

The other half is input fusion: "turn any reference — image, text, video, or audio — into a single, cohesive output." A reference photo for a character, a sketch for the motion path, an audio track for the beat, a sentence for the style — Omni resolves all of them into one result rather than treating them as separate, bolted-together steps. Underneath, it "combines an intuitive understanding of physics with Gemini's knowledge of history, science, and cultural context — bridging the gap from photorealism to meaningful storytelling."

The Nano Banana analogy

Google frames Omni as "Nano Banana, but for video." Nano Banana is the Gemini-family image generation-and-editing model that made conversational, reference-driven image edits feel native. Omni extends that exact interaction model — iterative, prompt-driven, consistency-preserving — into the temporal dimension, where keeping a subject, a style, and a physics model stable across frames and across edits is the genuinely hard part. If you've used Nano Banana for stills, Omni is the same muscle memory for motion.

02 · What it does

Six capabilities that define the model.

Omni isn't one trick. The demonstrated capability surface spans transformation, simulation, replacement, camera direction, synchronisation, and sketch-driven origination. All six are reachable through the same conversational interface, and each preserves the rest of the scene while one thing changes.

Transform

Aesthetic restyle

Change the visual style — line art, stuffed puppet, hologram, voxel, claymation — while the underlying motion and scene structure stay intact.

Simulate

Physics-aware motion

Gravity, kinetic energy, and fluid dynamics are modelled, not faked — so ripples, falls, and flows behave the way the eye expects them to.

Replace

Object & character swap

Swap any element by natural language or reference image — "change the butterfly to a bee" — without disturbing the rest of the frame.

Direct

Camera & motion transfer

Re-frame the shot ("over the violinist's shoulder") or transfer a movement pattern from one video onto another subject.

Synchronise

Action & text sync

Coordinate on-screen events to a beat — "lights turn on in sync with the music" — and align overlaid text with the action.

Originate

Drawing-to-video

Hand it a sketch as a movement guide and it renders realistic footage that follows your drawn path. Storyboard beats in, structured sequence out.

Why "coherent across edits" is the whole game

Single-shot text-to-video is a solved-enough problem. The thing that makes Omni a tool rather than a slot machine is that the seventh edit doesn't throw away the first six. A creator can land a scene, then refine it the way they'd refine a draft — lighting, then a character swap, then a camera move, then a style pass — and each step inherits the last. That turns generation from "regenerate and pray" into something closer to a non-destructive editing timeline.

03 · Omni in action

Demos — the prompts that drive each edit.

A walk through the kinds of edit Google demonstrates, each one a single conversational instruction. The interface is the prompt: you describe the change, Omni applies it to the existing scene and hands the result back to edit again.

Illustrative: the prompt text and edit types are from Google's Gemini Omni gallery and prompt guide. The entries below stand in for the rendered clips; the live examples are on the source page linked in Resources.

Object swap 🐝

Replace · language-driven

Change the butterfly to a bee

Swap one subject for another by name. The flight path, lighting, and background carry through untouched — only the creature changes.

Camera 🎻

Direct · re-frame

Change the camera angle to be over the violinist's shoulder

Re-stage the shot without re-rendering the performer. Omni recomputes the viewpoint while holding the subject and motion stable.

Sync 🎵

Synchronise · to audio

The lights of the apartments start turning on in sync with the music

Tie an on-screen event to a beat. Action synchronisation lets the visual respond to the audio track frame-for-frame.

Restyle 🧸

Transform · aesthetic

Change the astronaut to a sea anemone

Metamorphose a subject — line art, stuffed puppet, hologram, voxel — while the original motion and timing are preserved.

Drawing-to-video ✏️

Originate · sketch guide

Use this drawing as the movement path; render it as realistic footage

A rough sketch becomes the motion guide for a photoreal clip — your hand-drawn path drives the camera or subject.

Physics 💧

Simulate · material + motion

The mirror surface ripples like water as she steps through it

Fluid dynamics and material transformation in one instruction — the kind of edit that only holds together when physics is modelled, not painted on.

One scene, four turns — the conversational loop

The point of Omni is the chain, not any single prompt. Here's the violinist scene refined the way you'd actually work — each turn inherits the last.

Turn 1 · A violinist plays alone in a sunlit stone hall

Base scene generated from a text prompt: subject, setting, and lighting established.

Turn 2 · Change the camera angle to be over the violinist's shoulder

Re-framed to the new viewpoint. The performer, bowing motion, and hall are preserved exactly; only the camera moved.

Turn 3 · Restyle the whole scene as a hand-painted watercolour

Aesthetic transfer applied on top of the new angle. Motion and composition from turn 2 carry through; only the look changes.

Turn 4 · As the music swells, light blooms through the windows in time with the melody

Action synchronised to the audio, layered on the watercolour, the over-shoulder angle, and the original performance. Four turns, one coherent scene.
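
The chain above can be modelled locally as an append-only list of turns, which is a handy way to log a session or replay it by hand. This is a minimal sketch that assumes nothing about Omni's internals: `Turn`, `Scene`, `apply`, and the `changes` labels are hypothetical names chosen for illustration, and no Omni API is called.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Turn:
    """One conversational instruction and the aspect of the scene it changes."""
    prompt: str
    changes: str  # illustrative label, e.g. "camera", "style", "sync"


@dataclass
class Scene:
    """Append-only record of a session: later turns never erase earlier ones."""
    turns: list[Turn] = field(default_factory=list)

    def apply(self, prompt: str, changes: str) -> "Scene":
        # Return a new Scene so every earlier state stays available.
        return Scene(self.turns + [Turn(prompt, changes)])

    def preserved(self) -> list[str]:
        # Everything changed before the latest turn is expected to carry through.
        return [t.changes for t in self.turns[:-1]]


scene = (
    Scene()
    .apply("A violinist plays alone in a sunlit stone hall", "base")
    .apply("Change the camera angle to be over the violinist's shoulder", "camera")
    .apply("Restyle the whole scene as a hand-painted watercolour", "style")
    .apply("As the music swells, light blooms through the windows "
           "in time with the melody", "sync")
)

print(len(scene.turns))   # 4
print(scene.preserved())  # ['base', 'camera', 'style']
```

Returning a new `Scene` on each turn mirrors the non-destructive idea: any earlier state can be kept in a variable and branched from, rather than overwritten.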

04 · How to prompt it

Five elements, then iterate.

Google's prompt guide breaks an effective Omni prompt into five components. The headline rule: the more detail you add, the more control you'll have over the final output — but you don't have to be exhaustive, because the model's reasoning and world knowledge fill the gaps.

Element	What to specify
Shot framing & motion	How the shot is framed — wide-angle, medium, or close-up — and how the camera moves.
Style	The aesthetic register: "realistic, or cinematic? Grounded, or majestic?"
Lighting	Source and quality: "where does the light come from — the sun, a streetlamp?"
Location	The landscape you imagine — e.g. "an alien landscape with clear, azure water."
Action	Who the characters and objects are, and how they're moving.
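
The five elements can be kept as named fields and joined into a single prompt string, which makes it easy to see which ones a draft is missing. This is a sketch only: the `OmniPrompt` class, its field names, the element order, and the choice to require only `action` are assumptions made for illustration, not part of Google's guide. The example values other than the location are invented.

```python
from dataclasses import dataclass


@dataclass
class OmniPrompt:
    """The five prompt-guide elements; only `action` is required in this sketch."""
    action: str
    framing: str = ""
    style: str = ""
    lighting: str = ""
    location: str = ""

    def render(self) -> str:
        # Element order is an arbitrary choice here; empty elements are skipped.
        parts = [self.framing, self.style, self.lighting, self.location, self.action]
        return ". ".join(p.strip().rstrip(".") for p in parts if p.strip()) + "."


prompt = OmniPrompt(
    framing="Wide-angle shot, slow push-in",
    style="Cinematic and grounded",
    lighting="Low sun from the left",
    location="An alien landscape with clear, azure water",
    action="A lone explorer wades toward the shore",
)
print(prompt.render())
```

Leaving a field empty is deliberate: the guide says the model's reasoning and world knowledge fill the gaps, so a partial prompt is still a valid one.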

Iterate, don't regenerate

The guide is explicit: edit through conversation rather than starting over. The model "will preserve your video across multiple amends — keeping what works." And lean on its world knowledge for complex actions — "when you refer to a complex action, Gemini Omni understands your intention, and how this action should be applied across your video." You name the action; Omni works out how it should play out across the frames.

Edit categories worth knowing by name

Object / subject swap ("change the butterfly to a bee") · camera direction ("over the violinist's shoulder") · style transfer (new aesthetic, original motion preserved) · action synchronisation ("lights turn on in sync with the music") · storyboard-based generation (share visual narrative beats for a structured sequence). Naming the category you want tends to land the edit more cleanly than describing it from scratch.
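
One way to make a habit of naming the category is to keep the five names in an enum and prefix each instruction with one. This is an illustrative sketch: `EditCategory` and `tagged_instruction` are hypothetical helpers, and the "Category: instruction" phrasing is an assumption, not a format Omni is documented to require.

```python
from enum import Enum


class EditCategory(Enum):
    SUBJECT_SWAP = "object / subject swap"
    CAMERA_DIRECTION = "camera direction"
    STYLE_TRANSFER = "style transfer"
    ACTION_SYNC = "action synchronisation"
    STORYBOARD = "storyboard-based generation"


def tagged_instruction(category: EditCategory, instruction: str) -> str:
    """Prefix a plain-language instruction with its edit category name."""
    return f"{category.value.capitalize()}: {instruction}"


print(tagged_instruction(EditCategory.SUBJECT_SWAP, "change the butterfly to a bee"))
# Object / subject swap: change the butterfly to a bee
```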

05 · Omni Flash & benchmarks

The deployable variant, and how it scores.

The version Google actually ships is Gemini Omni Flash — positioned for "exceptional results in Video Editing, Text to Video, Image to Video, and Reference to Video." Google's reported human-rater and benchmark results across those four tasks:

Task	Benchmark / method	Reported result
Video editing	Head-to-head, human raters	Leads on Overall Preference & Instruction Following
Text-to-video	MovieGenBench (Meta · 1,003 prompts)	Leads on Overall Preference & Instruction Following
Image-to-video	VBench I2V (355 image-text pairs)	Tied performance
Reference-to-video	Human raters	Leads on Overall Preference & Speech Adherence

Read these as vendor benchmarks

These are Google's own reported figures: internal head-to-head ratings plus two public benchmarks (Meta's MovieGenBench, VBench I2V). They're a credible capability signal, not an independent leaderboard. The honest read: Omni Flash is competitive-to-leading across the four video tasks Google chose to report, with the strongest claims on instruction following and preference, which is exactly what the conversational-editing pitch needs to be true. Treat "tied" on image-to-video as the weakest of the four results. The model card (linked in Resources) carries the fuller methodology.

06 · Where to use it & provenance

Consumer surfaces today, provenance built in.

At launch Omni is surfaced through Google's creative and consumer products rather than as a standalone developer endpoint. Every output it produces carries provenance signals.

Surface

Gemini app

The main consumer interface at gemini.google.com — the simplest way to drive Omni conversationally.

Surface

Google Flow

Google's AI creative studio, with Omni integrated for longer-form, multi-shot creative work.

Surface

YouTube Shorts

Omni-powered video creation wired directly into the Shorts creation flow.

Access · the catch

Subscription-gated

"Google AI subscription required. Features vary by tier and geography." Availability is not uniform — check your tier and region.

Provenance is not optional

Every Omni output ships with two provenance layers: SynthID — Google DeepMind's imperceptible digital watermark embedded in the pixels — and C2PA Content Credentials, the cross-industry metadata standard for authenticity. Google also states the model went through both human and automated red-teaming, and that outputs adhere to its AI Principles and Generative AI Prohibited Use Policy. For anyone publishing Omni media, the practical upshot: assume it's detectable as AI-generated by design, and don't strip the credentials if your platform or jurisdiction expects disclosure.

07 · Builder's reality

When to reach for Omni, and when not to.

The single most important thing for a builder to internalise: at launch Omni is a product capability, not a confirmed Vertex / AI Studio API SKU. It's reachable through the Gemini app, Flow, and Shorts — not (yet, publicly) as a programmatic endpoint you can wire into an agent. Plan accordingly.

Reach for Omni when

You're producing video / image creative and want conversational, iterative editing
Scene coherence across many edits matters — it's the model's core strength
You're fusing mixed references (image + audio + text + sketch) into one output
Physics-realistic motion (fluids, falls, ripples) is part of the brief
You're working inside the Gemini app, Google Flow, or YouTube Shorts
You need provenance (SynthID + C2PA) on generated media by default

Don't design around Omni when

You need a programmatic API today (treat Omni as a capability signal and verify the model availability matrix first)
You're building an autonomous agent that must generate video unattended
Your users are outside the supported subscription tiers or geographies
You need on-prem or South Africa-resident (SA-resident) generation; there is no regional hosting story yet
You need understanding, not generation — use the Gemini reasoning tiers instead
Deterministic, repeatable output is a hard requirement — generative media isn't that

The one-line risk to write into any plan

If a proposal depends on calling Omni from code, flag it as unconfirmed until the Gemini model availability matrix lists a programmatic Omni / Omni Flash endpoint. The capability is real and shipping — through products. The API surface is the open question. Building a client deliverable around a programmatic Omni call today is a planning risk, not a feature.
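
The rule above can be written down as a small planning check, so the risk flag is derived from the availability matrix rather than from memory. This is a sketch under stated assumptions: the model IDs are placeholders, the matrix is a hand-maintained list you copy from Google's published availability page, and matching on the substring "omni" is a guess at how such an endpoint might be named. No Google API is called.

```python
def omni_endpoint_confirmed(listed_model_ids: list[str]) -> bool:
    """True if any listed model ID mentions 'omni' (case-insensitive).

    Substring matching is an assumption; adjust it once real IDs are published.
    """
    return any("omni" in model_id.lower() for model_id in listed_model_ids)


def plan_risk(listed_model_ids: list[str]) -> str:
    """Return the status line to write into a proposal."""
    if omni_endpoint_confirmed(listed_model_ids):
        return "confirmed: a programmatic Omni endpoint is listed"
    return "unconfirmed: Omni is reachable through products only; flag as planning risk"


# Hypothetical, hand-maintained copy of the availability matrix (placeholder IDs).
matrix = ["example-reasoning-model", "example-image-model"]
print(plan_risk(matrix))
```

Re-running the check whenever the matrix is refreshed keeps the flag honest: it flips to confirmed only when a listed ID actually appears.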

08 · South African context

Where Omni lands in SA creative & delivery work.

Creative studios · the realistic entry point

For SA agencies and content studios, Omni's value is immediate and the access path is simple: a Google AI subscription and the Gemini app or Flow. For short-form social, pitch animatics, explainer sequences, and ad concepting, the conversational edit loop is a genuine production accelerant where the alternative is a full motion-graphics pass. The constraint is the one quoted in the access note in section 06: "features vary by tier and geography." Confirm the specific Omni features are live for SA accounts before you quote a client a turnaround that depends on them.

Enterprise · no residency story yet

Unlike the Gemini reasoning tiers — which run SA-resident from Vertex AI's africa-south1 in Johannesburg — Omni has no regional hosting or POPIA-clean inference path at launch. For regulated SA clients (banks, insurers, government) who need generated media produced under data-residency controls, Omni isn't the answer today. Keep generative-video work for those clients in the "watch this space" column and revisit when (if) a Vertex SKU lands in africa-south1.

Provenance is an asset, not a tax

SynthID + C2PA on every output is a selling point in the SA market, not a limitation. For brand-safety-conscious clients and any public-sector work, "every clip is watermarked and carries content credentials" is exactly the disclosure posture procurement increasingly asks for. Lead with it.

09 · Connections