agents · Gemini · Skill Leaf

One million tokens of context. Native voice. Native search.

Gemini is Google DeepMind's flagship model family. The current line in mid-2026 is Gemini 3.x: Gemini 3.1 Pro for the hardest reasoning and agentic coding, Gemini 3.5 Flash for frontier-class agentic and coding work, and Gemini 3.1 Flash-Lite for high-volume routing. The 2.5 line is now legacy. The differentiators that matter: a 1M+ token context window (reportedly up to ~2M on the Pro tier, which would make it the largest in the field), native multimodal input across text, image, video, and audio, native Google Search grounding, native bidirectional voice via the Live API, and now native multimodal generation via Gemini Omni (conversational video and image creation from any input). It is hosted via Google AI Studio, Vertex AI (including africa-south1 in Johannesburg), and OpenRouter. It is the default closed-frontier model when extreme context, search grounding, voice, or media generation is the load-bearing capability.

Gemini 3.x family · current | 3.1 Pro · 3.5 Flash · 3.1 Flash-Lite | 1M+ context (to ~2M) | Native search grounding | Live API · voice + video | Gemini Omni · media generation

01 · What it is

Google DeepMind's flagship multimodal family.

Gemini is the model family from Google DeepMind, the merged Google research lab that combines DeepMind's frontier-AI work with the production model engineering that powers Google's products. The first Gemini family shipped in late 2023 (Gemini 1.0 Pro, Ultra, Nano); subsequent generations followed at a pace of roughly one or two a year, with Gemini 1.5 expanding context to 1M tokens, Gemini 2.0 adding native multimodal generation, Gemini 2.5 bringing production-grade reasoning and the Live API, and the current Gemini 3.x line pushing frontier agentic and coding performance.

Where Claude leads on instruction-following and GPT on function-calling reach, Gemini has historically led on three things: extreme context length (1M+ tokens reliably used, not just nominally supported), native multimodality (text, image, video, audio in a single model rather than bolted-together components), and native search grounding (the model can search Google and ground responses in current information without external RAG infrastructure). Those three properties make it the default choice for use cases where any of them is structurally needed.

Distribution: Gemini is available via Google AI Studio (the developer-friendly direct path, also free up to a quota), Vertex AI (the enterprise GCP path, including the JHB africa-south1 region for SA-resident calls), Gemini.google.com (the consumer product), and through every major aggregator (OpenRouter, LiteLLM). Inside Google's own products — Search AI Overviews, Workspace AI features, Pixel devices — Gemini is the model behind much of what users actually interact with daily, making it one of the most-deployed model families by raw query count.

Naming · how the family lines up

Gemini's naming pattern: Gemini {version} {tier}. Version is the generation (1.0 → 1.5 → 2.0 → 2.5 → the current 3.x). Tier is the size class — Pro (the flagship reasoning SKU at the top), Flash (frontier-class, faster), Flash-Lite (cheapest / fastest), and Nano (on-device, mobile-targeted). The 3.x line dropped the separate "Ultra" SKU; Pro is now the top. API model IDs use dashes: gemini-3.1-pro, gemini-3.5-flash, gemini-3.1-flash-lite. The whole Gemini 2.5 line (gemini-2.5-pro, -flash, -flash-lite) is still served but now legacy — check the model availability matrix before pinning anything older.
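The naming pattern above is regular enough to parse mechanically, which helps when auditing a config for pinned legacy models. The sketch below is a minimal illustration: the helper names parse_model_id and is_legacy are made up for this example, and the "anything before 3.x is legacy" rule is this leaf's framing rather than an official flag. Preview or dated ID suffixes are deliberately not handled.

```python
import re
from typing import NamedTuple


class GeminiModelId(NamedTuple):
    version: str  # generation, e.g. "3.1"
    tier: str     # size class, e.g. "flash-lite"


# Tiers named in the text above; "flash-lite" is listed before "flash".
_TIERS = ("flash-lite", "flash", "pro", "nano")
_PATTERN = re.compile(r"^gemini-(\d+(?:\.\d+)?)-(" + "|".join(_TIERS) + r")$")


def parse_model_id(model_id: str) -> GeminiModelId:
    """Split a plain gemini-{version}-{tier} ID into its two parts."""
    match = _PATTERN.match(model_id)
    if match is None:
        raise ValueError(f"not a plain gemini-{{version}}-{{tier}} ID: {model_id!r}")
    return GeminiModelId(version=match.group(1), tier=match.group(2))


def is_legacy(model_id: str) -> bool:
    """True for generations before 3.x (assumption: mirrors this leaf's framing)."""
    major = int(parse_model_id(model_id).version.split(".")[0])
    return major < 3
```

A startup check can loop over every configured model ID and log a warning whenever is_legacy returns True, so stale pins surface before a model is retired.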

02 · The Gemini 3.x lineup

Three tiers, one quality curve, one differentiating context window.

The Gemini family is tiered like Claude and GPT: flagship at the top for hard reasoning, a frontier-class Flash for the everyday default, a Flash-Lite for high-volume routine work. The difference that's held across generations: the whole family ships a 1M+ context window. The 3.x line also collapsed the old "Ultra" tier — Pro is now the single top SKU.

Tier · flagship

Gemini 3.1 Pro

The top reasoning SKU. Best for: complex problem-solving, deep research, code generation, agentic and "vibe" coding, long-context analysis. 1M+ token window (reportedly up to ~2M on Pro). Input pricing roughly doubles for prompts past 200k tokens.

Tier · default

Gemini 3.5 Flash

Google's pitch is "most intelligent model for sustained frontier performance on agentic and coding tasks" — frontier-class quality at Flash pricing. Launched at I/O 2026, it's positioned as outperforming Gemini 3.1 Pro on almost all benchmarks while running ~4× faster — the speed engine for real-world agentic loops, and the model behind Antigravity's Managed Agents. The right default for production agent volume. (Gemini 3 Flash is the preview sibling at an even lower price point.)

Tier · volume

Gemini 3.1 Flash-Lite

The cheapest Gemini 3 tier. Best for: routing, classification, summarisation, structured extraction at scale. Still carries the large context window — ideal for "scan a long document, answer one question" patterns where a heavier tier would be overkill.

Capability · cross-tier

All tiers ship

1M+ context, native multimodal (text/image/video/audio in, text/image out), function-calling, structured outputs, native Google Search grounding (paid feature), the Live API for bidirectional streaming voice and video.
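As a rough sketch of what switching on search grounding looks like, the snippet below builds (but does not send) a generateContent request with the Google Search tool enabled. The v1beta REST path and the google_search tool field are taken from Google's public Gemini API documentation and may have changed; the model ID is the one this leaf quotes, and the helper name build_grounded_request is made up for illustration. Verify all three against the current API reference before relying on them.

```python
import json
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_grounded_request(prompt: str, api_key: str,
                           model: str = "gemini-3.5-flash") -> urllib.request.Request:
    """Build (not send) a generateContent call with Search grounding enabled."""
    payload = {
        "contents": [{"parts": [{"text": prompt}]}],
        # An empty google_search tool asks the model to ground its answer in Search.
        "tools": [{"google_search": {}}],
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/models/{model}:generateContent",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the call. Keeping construction separate from sending makes the payload easy to inspect and unit-test without network access or a billed grounded query.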

The 1M context isn't just a number

Many models nominally support large context but degrade meaningfully past 32k or 64k. Gemini Pro is the rare model that genuinely uses its full window: you can pass an entire codebase, a 200-page document, or several hours of audio transcription and get coherent reasoning back. For "ask the company" agents over large internal corpora, Gemini's context advantage often beats the RAG-engineering complexity Claude or GPT would otherwise require. The trade-offs: latency grows with context (multiple seconds at the top of the window), and on Gemini 3.1 Pro input pricing roughly doubles once a prompt crosses 200k tokens.
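Before stuffing a corpus into one prompt, it is worth a quick pre-flight check on both limits mentioned above: the window itself and the 200k pricing step. The sketch below is an approximation only. The 4-characters-per-token ratio is a common rule of thumb, not Gemini's tokenizer, and the function names and the 8,192-token output reserve are illustrative assumptions.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # the "1M+" window this leaf cites
PRICE_STEP_TOKENS = 200_000        # where Pro input pricing is said to step up
CHARS_PER_TOKEN = 4                # rough heuristic (assumption), not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def plan_prompt(documents: list[str], reserve_for_output: int = 8_192) -> dict:
    """Report whether the documents fit the window and cross the pricing step."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return {
        "estimated_tokens": total,
        "fits_window": total + reserve_for_output <= CONTEXT_WINDOW_TOKENS,
        "crosses_price_step": total > PRICE_STEP_TOKENS,
    }
```

For anything near either threshold, replace the heuristic with a real token count from the provider before committing to a design.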

Gemini Omni — generation, not just understanding

The reasoning tiers above read media; Gemini Omni creates it. Google's pitch is "create anything from any input — starting with video": conversational video and image generation and editing, with consistency held across iterative edits, physics-aware motion (gravity, kinetic energy, fluid dynamics), and world knowledge baked in. It takes video, image, text, and audio as input. This is Gemini moving from understanding multimodal input to generating multimodal output in the same family. Honest caveat for builders: at launch it's surfaced through consumer and creative products — the Gemini app, Google Flow, and YouTube Shorts — not a confirmed Vertex/AI Studio API SKU. Treat it as a capability signal for now; check the model availability matrix before designing an agent around a programmatic Omni endpoint.

03 · vs Claude, GPT, Llama, Gemma

Where Gemini is the right pick — and where it isn't.

Honest cross-family positioning. Gemini's strengths sit on context length, multimodal range, search grounding, and voice; its weaknesses are around instruction-following consistency (relative to Claude) and ecosystem reach (relative to OpenAI). For specific use cases, Gemini is the only credible answer in 2026.

Family	Strengths	Watch out for
Gemini (Google)	1M+ context (the largest), native multimodal across all media, native Google Search grounding, voice via Live API, JHB Vertex region for SA residency	Less consistent on instruction-following than Claude; smaller community / ecosystem than OpenAI; Live API still maturing on TS/JS support
Claude (Anthropic)	Best instruction-following, adaptive thinking, MCP-native, top tier (Fable 5) leads on long-horizon agentic work, now 1M context, Bedrock `af-south-1` for SA residency	Closed; weaker multimodal range than Gemini; no native voice/video
GPT (OpenAI)	Largest API ecosystem, best function-calling reliability, broadest multimodal (vision/audio/image-gen), built-in reasoning, GPT-5.5 now ~1M context	"Confidently wrong" failure mode more than Claude; no SA-resident region as clean as Vertex JHB
Llama (Meta)	Open weights, runs locally, largest community fine-tune ecosystem	Frontier gap; smaller context than Gemini Pro; Meta licence has commercial restrictions
Gemma (Google)	Open weights, multimodal, frontier-lab safety tuning, runs locally via Ollama	Smaller fine-tune ecosystem than Llama; not as code-strong as Qwen-coder

When Gemini is the only credible answer

Three use-case shapes where the other frontier families simply can't compete: (1) genuine 1M+ context reasoning — codebase analysis, multi-document synthesis, video understanding; (2) bidirectional voice agents with vision — the Live API is the cleanest implementation of streaming multimodal interaction available; (3) agents that need fresh information without external RAG — Google Search grounding gives you up-to-date answers without standing up your own search infrastructure. For any of these three shapes, Gemini wins by default in 2026.

04 · Pricing reality

Tiered, with a context-length cost asymmetry.

Gemini stays the cheapest frontier family at the Flash tiers, and carries a context-length pricing step: on Gemini 3.1 Pro, input pricing roughly doubles once a prompt crosses 200k tokens. Always check ai.google.dev/pricing for current numbers; the rates below are as of mid-2026.

List rates, per million tokens (input / output):

Gemini 3.1 Pro — $2 / $12 for prompts up to 200k tokens; input roughly doubles above that. The flagship reasoning SKU — and notably cheaper than Claude Opus or GPT-5.5 at the top.
Gemini 3.5 Flash — $1.50 / $9. Frontier-class agentic/coding quality at Flash pricing; the right production default.
Gemini 3.1 Flash-Lite — $0.25 / $1.50. The most cost-effective Gemini 3 tier for routing and extraction at scale. (Legacy 2.5 Flash-Lite goes lower still at $0.10 / $0.40.)
Search grounding — billed per grounded query on top of model tokens. Cheap per query but adds up at volume.
Context caching — cached context cuts cost materially for repeated long-context (RAG) workloads.
Batch API — ~50% discount on input + output for non-urgent batched workloads.
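The list rates above translate into a small estimator. This is a sketch under stated assumptions, not a billing tool: "roughly doubles" is modelled as exactly 2x, the higher input rate is applied to the whole prompt once it crosses 200k tokens (the text does not say whether only the excess is charged more), output pricing is left unchanged above the step, and search grounding and caching are ignored. The function name estimate_cost_usd is illustrative.

```python
# (input, output) USD per million tokens, as quoted in this leaf for mid-2026.
RATES = {
    "gemini-3.1-pro": (2.00, 12.00),
    "gemini-3.5-flash": (1.50, 9.00),
    "gemini-3.1-flash-lite": (0.25, 1.50),
}
LONG_PROMPT_THRESHOLD = 200_000
LONG_PROMPT_INPUT_MULTIPLIER = 2.0  # assumption: "roughly doubles" taken as exactly 2x
BATCH_DISCOUNT = 0.5                # assumption: "~50%" taken as exactly 50%


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int,
                      batch: bool = False) -> float:
    """Estimate one request's cost from list rates (no grounding, no caching)."""
    input_rate, output_rate = RATES[model]
    # The pricing step is described for the Pro tier only.
    if model == "gemini-3.1-pro" and input_tokens > LONG_PROMPT_THRESHOLD:
        input_rate *= LONG_PROMPT_INPUT_MULTIPLIER
    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost
```

Under these assumptions a 100k-token Pro prompt with 10k output tokens comes to about $0.32, while a 500k-token prompt with the same output comes to about $2.12. The jump is the reason to keep prompts under the step where the task allows it.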

The Flash-default routing pattern

Same logic as the Claude and GPT leaves: don't run everything on Pro. Build a Flash-Lite-driven router that classifies incoming requests, dispatches the routine 60-80% to 3.5 Flash, and escalates the genuinely hard 5-15% to 3.1 Pro. Combined with Google's context caching for any repeated-context workload (common with 1M-token RAG), this pattern can cut Gemini costs by 60-80% versus running everything on Pro. At the list rates above, 3.5 Flash is only about 25% cheaper than 3.1 Pro per token, so most of that saving has to come from caching and from keeping prompts under the 200k pricing step rather than from routing alone. Across closed-frontier models in 2026, Gemini Flash plus context caching is often the most cost-effective path for long-context-heavy agents.
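The routing pattern can be sketched in a few lines. In this illustration the classifier is injected as a plain function so the dispatch logic can be tested offline; keyword_classifier is a hypothetical stand-in for the Flash-Lite classification call, and its marker words are invented for the example. The model IDs are the ones this leaf quotes.

```python
from typing import Callable

# Label -> model, following the tiers described above.
ROUTES = {
    "routine": "gemini-3.5-flash",
    "hard": "gemini-3.1-pro",
}


def route(request: str, classify: Callable[[str], str]) -> str:
    """Return the model ID that should serve this request."""
    label = classify(request)
    # Unknown labels fall back to the Flash default, not the expensive tier.
    return ROUTES.get(label, ROUTES["routine"])


def keyword_classifier(request: str) -> str:
    """Stand-in for a Flash-Lite classification call (hypothetical heuristic)."""
    hard_markers = ("prove", "refactor", "architecture", "multi-step")
    lowered = request.lower()
    return "hard" if any(marker in lowered for marker in hard_markers) else "routine"
```

In production the classify argument would wrap a Flash-Lite call that returns one of the two labels. Defaulting unknown labels to Flash keeps a misbehaving classifier from silently sending everything to Pro.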

05 · Decision guide

When Gemini is the right model. When it isn't.

Use Gemini when

You need genuine 1M+ context (codebase, multi-document, hours of transcription)
Native bidirectional voice + video is load-bearing; the Live API leads
You want native Google Search grounding without standing up RAG infra
Multimodal range matters — text + image + video + audio in one model, now with generation via Gemini Omni
You're already on GCP and want Vertex AI's managed integrations
POPIA / data-residency requires SA-resident inference and you want JHB africa-south1
Cost-effective long-context agents — Flash + context caching is hard to beat

Skip Gemini when

Instruction-following reliability matters more than reach — Claude often wins
Function-calling is the load-bearing capability — GPT wins
You need the largest API ecosystem / community — OpenAI wins
You want fully open / self-hostable models — Gemini is closed (try Gemma instead)
Cost-sensitive at extreme scale and the Live / search / context features aren't structural — open-weights via Ollama is cheaper
You distrust Google as a vendor or want to avoid Google's data practices

06 · South African context

Where Gemini lands in SA delivery work.

Enterprise · Vertex AI in `africa-south1`

Vertex AI's Johannesburg region (africa-south1) hosts the Gemini Flash tiers and (in most cases) Pro for SA-resident inference, with in-country processing to support POPIA compliance, IAM controls, and Cloud Logging audit trails. For SA banks, insurers, and telcos already on GCP, or evaluating it, this is the structurally cleanest path among closed frontier models. The honest constraint: not every Gemini variant lands in africa-south1 at launch. The newest flagship and preview models (e.g. the latest 3.x Pro) sometimes lag the US regions by weeks or months. Either plan for a US-East fallback, or accept the lag if residency is non-negotiable.
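The fallback-or-wait decision above can be made explicit in code so it is never taken by accident. The sketch below is illustrative only: the availability map is hypothetical data you would maintain from Google's model availability matrix, us-east1 stands in for the "US-East fallback", and pick_region is a made-up helper name.

```python
from typing import Optional

PREFERRED_REGION = "africa-south1"
FALLBACK_REGION = "us-east1"  # assumption: stands in for the "US-East fallback"


def pick_region(model: str, availability: dict[str, set[str]],
                residency_required: bool) -> Optional[str]:
    """Choose a serving region, or None if the model cannot be used yet."""
    regions = availability.get(model, set())
    if PREFERRED_REGION in regions:
        return PREFERRED_REGION
    if residency_required:
        # Residency is non-negotiable: accept the lag and wait for in-region launch.
        return None
    return FALLBACK_REGION if FALLBACK_REGION in regions else None
```

Returning None instead of silently falling back forces the caller to handle the "not yet available in-country" case, which is the safer default when POPIA residency is part of the contract.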

Studio · Google AI Studio direct

For SA studios without enterprise residency requirements, Google AI Studio is the simplest path. The free tier covers prototypes; usage-based billing scales to production. The Studio UI is genuinely good for prompt iteration, and better than OpenAI Platform for context-heavy workflows. The pragmatic SA studio path: prototype in AI Studio direct, ship pilots from there, and move to Vertex AI only if a client requires it.

Live API · voice agents with SA infrastructure

The Live API is the most credible answer for "build a voice agent in SA" in 2026: bidirectional streaming voice + video, native multilingual support (English, Afrikaans, and isiZulu all work meaningfully well), and low latency from africa-south1. For SA studios building voice agents for telcos, banks, or government, Gemini Live + Vertex JHB is structurally easier than the OpenAI Realtime API or Anthropic's separate audio endpoints, both of which lack a regional SA hosting story.

07 · Connections