
agents · Gemma · Skill Leaf

Gemini's research, in open weights you can run.

Gemma is Google's open-weights model family, built from the same research line that produces Gemini. Gemma 3 (the current generation, first released March 2025) spans five sizes: 270M, 1B, 4B, 12B, and 27B. The 4B, 12B, and 27B models accept image and text input and have a 128k-token context window; the 270M and 1B models are text-only with a 32k-token window. The licence is permissive and allows commercial use. Weights are available on Ollama, Hugging Face, and Vertex AI Model Garden. It is the local-first option for teams that want frontier-adjacent inference quality without the cloud dependency.

Gemma 3 · Mar 2025 | Open weights | 270M · 1B · 4B · 12B · 27B | 128k context (4B and up) | Multimodal vision-text (4B and up)

01 · What it is

Open weights from the team that makes Gemini.

Gemma is a family of open-weights models from Google DeepMind, distilled from the research and training infrastructure that produces Gemini. The first release was Gemma 1 (February 2024) at 2B and 7B sizes; Gemma 2 added 9B and 27B; Gemma 3 (March 2025) is the current generation and the one most teams should default to.

Gemma 3 ships in five sizes (270M, 1B, 4B, 12B, 27B). The 4B, 12B, and 27B models are multimodal (image and text in, text out) with a 128k-token context window; the 270M and 1B models are text-only with a 32k-token window. The licence is permissive enough for commercial deployments: Google's Gemma Terms of Use grants free use, including for commercial work, with restrictions on harmful applications. It is not a fully open-source licence (such as Apache 2.0), but it is practically permissive for almost every business case.

What "Gemma 4" looks like. As of mid-2026, Gemma 3 is the current line and Google has signalled the next generation in research papers but not shipped it. Treat any "Gemma 4" mention as forward-looking until Google confirms; this leaf will refresh when it does. Major versions have so far arrived less than 12 months apart (Gemma 1 in February 2024, Gemma 2 in June 2024, Gemma 3 in March 2025), so a follow-up generation is plausible in 2026.

Why "open weights from a frontier lab" matters

Most open-weights families are either research-lab releases (strong academic benchmarks but rough production deployment) or post-frontier releases (a lab's earlier-generation models, opened up after the frontier moves on). Gemma is closer to the latter: it is not Gemini 2.5 Pro, but it is produced by the team that ships Gemini, with the production-grade tokeniser, training-data curation, and safety tuning that come from a frontier-lab pipeline. Size for size, quality is competitive with Llama 3.x and meaningfully ahead of Mistral and Qwen on most multimodal benchmarks.

02 · The Gemma 3 line

Five sizes, one quality curve.

The Gemma 3 family covers a deliberate range, from edge devices (270M, 1B) to laptops (4B, 12B) to small servers (27B). The breakpoints map roughly to the hardware you would actually deploy on. The 4B, 12B, and 27B models share the same multimodal capability and 128k context window; the 270M and 1B models are text-only with a 32k window. Beyond that, the differences are parameter count, throughput, and quality on hard tasks.

Size	Hardware	Best at
Gemma 3 270M	Phones, embedded, Raspberry Pi 5	On-device assistants, classification, basic chat
Gemma 3 1B	16GB RAM laptops, edge devices	Lightweight RAG, summarisation, simple agents
Gemma 3 4B	32GB MacBook, mid-tier laptops	The default "good local model" — agents, chat, RAG
Gemma 3 12B	64GB MacBook Pro, single GPU 12GB+ VRAM	Where Gemma starts feeling like Claude Haiku
Gemma 3 27B	64GB+ MacBook, RTX 4090, single H100	Frontier-grade open-weights tasks; reasoning, code
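
The hardware column follows mostly from arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below is a back-of-envelope estimate only. It assumes the nominal parameter counts from the table, ignores the KV cache, activations, and runtime overhead, and the function name is made up for illustration.

```python
# Rough weight-only memory estimate: parameters x bits per weight.
# Assumptions: nominal parameter counts; KV cache, activations and
# runtime overhead are NOT included, so real usage will be higher.

GEMMA3_PARAMS_BILLIONS = {
    "270M": 0.27,
    "1B": 1.0,
    "4B": 4.0,
    "12B": 12.0,
    "27B": 27.0,
}


def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, params in GEMMA3_PARAMS_BILLIONS.items():
        row = ", ".join(
            f"{bits}-bit ~{weight_memory_gb(params, bits):.1f} GB"
            for bits in (16, 8, 4)
        )
        print(f"Gemma 3 {name}: {row}")
```

Read this way, a 4-bit 27B model needs on the order of 13.5 GB for weights alone, which is why it sits in the 64GB-laptop and single-GPU rows once context and overhead are added.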

The default pick for most workloads

Gemma 3 4B (gemma3:4b on Ollama) is the local-LLM default for 2026. It fits on a 32GB MacBook, runs at chat speed (15-30 tokens/sec on M3), handles multimodal input, and supports a 128k context. Quality is comparable to GPT-3.5 or Claude Haiku from a year or two ago, which is good enough for the vast majority of internal tooling and agent use cases. ollama run gemma3:4b is the canonical "try it now" command.
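
To make "handles multimodal input" concrete, here is a minimal sketch of a request body for Ollama's native chat endpoint, which takes base64-encoded images alongside the text of a message. The helper name is hypothetical, and the block only builds the payload; it assumes you would POST it as JSON to http://localhost:11434/api/chat on a machine where Ollama is running.

```python
import base64


def build_vision_chat_payload(
    image_bytes: bytes, question: str, model: str = "gemma3:4b"
) -> dict:
    """Build a JSON-serialisable body for Ollama's /api/chat endpoint.

    Hypothetical helper: images travel as base64 strings in the
    message's "images" list, next to the text prompt.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # ask for one complete response, not chunks
        "messages": [
            {"role": "user", "content": question, "images": [image_b64]}
        ],
    }
```

Typical use would be reading a file with open(path, "rb").read(), passing the bytes in, and sending the result with any HTTP client.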

03 · vs Llama, Qwen, Mistral, and Gemini Pro

When Gemma is the right open-weights pick — and when it isn't.

Gemma has direct open-weights competitors at every size point. The honest comparison: Llama 3.3 dominates the long tail of community fine-tunes; Qwen 2.5 leads on coding-specific work; Mistral Large 2 has stronger function-calling. Gemma 3 leads on multimodal vision and is the most consistently safety-tuned for general business use.

Family	Strengths	Watch out for
Gemma 3	Multimodal vision, safety tuning, 128k context, broad licence	Smaller community fine-tune ecosystem than Llama; not as code-strong as Qwen-coder
Llama 3.3	Largest community ecosystem, best fine-tunes, best instruction-following at 70B	Meta licence has commercial restrictions over 700M MAU; vision support uneven
Qwen 2.5	Best for code (Qwen Coder), strong multilingual	Chinese training emphasis means English-only benchmarks lag slightly
Mistral Large 2	Function-calling reliability, French/European multilingual	Smaller, less open ecosystem
Phi-4	Smallest credible reasoning model; runs on phones	Limited context, no multimodal
Gemini 2.5 Pro (closed)	1M+ context, frontier reasoning, native search grounding	Cloud-only, USD per-token billing, data leaves device

Gemma 3 vs Gemini 2.5 Pro — not the same model

It's tempting to read "Gemma 3 27B" as "smaller Gemini." It isn't. Gemini 2.5 Pro is a much larger, much more capable closed model with proprietary training data and post-training. Gemma 3 is a deliberate distillation that runs locally; the gap to Gemini 2.5 Pro on hard reasoning, long context, or specialist domains is real. Use Gemma 3 for tasks that don't need frontier capability; use Gemini 2.5 Pro (or Claude Opus, GPT-5) for tasks that do. Running both side by side under the same agent framework is increasingly the production pattern.
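
The side-by-side pattern can be as simple as a routing function in front of the two models. The sketch below is illustrative only: the Task fields, the routing rules, and the model identifiers are assumptions, not something any particular agent framework prescribes.

```python
from dataclasses import dataclass

# Assumed identifiers: an Ollama tag for the local model and a
# placeholder name for whichever cloud model you actually use.
LOCAL_MODEL = "gemma3:27b"
FRONTIER_MODEL = "gemini-2.5-pro"

# Gemma 3's context window on the larger sizes.
LOCAL_CONTEXT_LIMIT = 128_000


@dataclass
class Task:
    prompt: str
    needs_frontier_reasoning: bool = False  # set by the caller, hypothetical
    approx_input_tokens: int = 0


def pick_model(task: Task) -> str:
    """Send a task to the local model unless it clearly needs more."""
    if task.needs_frontier_reasoning:
        return FRONTIER_MODEL
    if task.approx_input_tokens > LOCAL_CONTEXT_LIMIT:
        return FRONTIER_MODEL
    return LOCAL_MODEL
```

Defaulting to the local model and escalating only on explicit signals keeps most traffic on hardware you control.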

04 · Running it

Three platforms cover almost every deployment.

Gemma is hosted in many places, but three cover almost every real workload: Ollama for laptop / single-machine / dev work, vLLM for production server deployments needing throughput, and Vertex AI Model Garden when you want a managed endpoint inside Google Cloud.

Local · dev

Ollama

ollama run gemma3:4b. The simplest path. OpenAI-compatible API at localhost:11434. Full quantisation options. Default for development.

Server · prod

vLLM

Production-grade serving with PagedAttention, continuous batching, OpenAI-compatible API. Optimal for high-throughput single-tenant deployments on a dedicated GPU.

Managed · GCP

Vertex AI Model Garden

One-click deploy to a Vertex endpoint with managed scaling, IAM, and audit logs. The path of least resistance for South African (SA) enterprises on GCP.

Discovery

Hugging Face

Where Google publishes the weights. Use for fine-tuning, evaluation, and model cards. Not typically the production serving path.
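
Because Ollama and vLLM both expose an OpenAI-compatible API, moving from the dev path to the server path can be mostly a matter of changing a base URL and a model name. The sketch below assumes default local ports (11434 for Ollama, 8000 for vLLM) and a Hugging Face model ID for vLLM; adjust both to your deployment. The BACKENDS table and helper names are hypothetical.

```python
# Assumed defaults; change hosts, ports and model names to match
# your own deployment.
BACKENDS = {
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "model": "gemma3:4b",
    },
    "vllm": {
        "base_url": "http://localhost:8000/v1",
        "model": "google/gemma-3-4b-it",
    },
}


def chat_completions_url(backend: str) -> str:
    """Full URL of the chat completions route for a named backend."""
    return BACKENDS[backend]["base_url"].rstrip("/") + "/chat/completions"


def chat_request(backend: str, prompt: str) -> tuple[str, dict]:
    """Return (url, json_body) for a single-turn chat request."""
    body = {
        "model": BACKENDS[backend]["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return chat_completions_url(backend), body
```

Only the lookup key changes between environments; the request body keeps the same shape.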

The "try Gemma 3 in five minutes" path:

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 3 4B
ollama run gemma3:4b

# Or via the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "Summarise the last 18 months of SA reserve bank policy in 3 bullets."}
    ]
  }'
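
The same call from Python, using only the standard library, might look like the sketch below. It assumes Ollama is running locally with gemma3:4b already pulled and that the response follows the OpenAI chat-completions shape; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "gemma3:4b") -> urllib.request.Request:
    """Prepare (but do not send) a single-turn chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gemma3:4b", timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text.

    Needs a running Ollama server; raises urllib.error.URLError otherwise.
    """
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server up, print(ask("Summarise the last 18 months of SA reserve bank policy in 3 bullets.")) mirrors the curl command above.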

05 · South African context

Open weights solve the cross-border problem.

POPIA + cross-border. A SA bank running customer-data summarisation on Claude or GPT-4 is making a Section 72 cross-border transfer. Same workload on a locally-hosted Gemma 3 is no transfer at all. For PII-bearing automation (loan summary, claim triage, complaint classification) where frontier-model quality isn't required, Gemma 3 27B running on a SA-resident GPU is genuinely the regulatory-clean answer.
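
One way to make that separation hard to bypass is a guard in the client that refuses to send PII-bearing requests to any host outside an allowlist of in-country endpoints. This is a sketch of the idea, not a compliance control: the hostnames are hypothetical, and deciding which requests contain PII is left to the caller.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the local machine plus an in-country GPU host.
PII_SAFE_HOSTS = frozenset(
    {"localhost", "127.0.0.1", "gpu01.jhb.internal.example"}
)


def endpoint_allowed(
    url: str,
    contains_pii: bool,
    allowed_hosts: frozenset = PII_SAFE_HOSTS,
) -> bool:
    """Allow any endpoint for non-PII work; PII only to allowlisted hosts."""
    if not contains_pii:
        return True
    host = urlparse(url).hostname  # lower-cased by urlparse, None if missing
    return host is not None and host in allowed_hosts
```

A client wrapper would call endpoint_allowed before every request and raise if it returns False.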

Cost vs Vertex AI. If your data residency story doesn't require local inference, Vertex AI Model Garden hosts Gemma in africa-south1 (Johannesburg). Costs are billed in USD and depend on the variant. For high-volume workloads, a dedicated Vertex AI endpoint can be more cost-effective than per-token API consumption, but it is usually more expensive than self-hosting on a rented GPU server in Europe (Hetzner, for example), provided you can tolerate the extra latency to South Africa.
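
The comparison between per-token billing and a fixed-cost endpoint comes down to a break-even volume. The sketch below shows the arithmetic; the function names are illustrative, and any prices you plug in should come from current provider pricing, since none are quoted here.

```python
def per_token_monthly_cost(
    tokens_per_month: float, usd_per_million_tokens: float
) -> float:
    """Monthly bill under pay-per-token pricing."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(
    fixed_monthly_usd: float, usd_per_million_tokens: float
) -> float:
    """Monthly token volume at which a fixed-cost endpoint or server
    costs the same as paying per token."""
    return fixed_monthly_usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder figures for illustration only, not real prices.
    fixed = 1000.0   # USD per month for a dedicated endpoint or server
    rate = 0.50      # USD per million tokens
    print(f"Break-even: {break_even_tokens(fixed, rate):,.0f} tokens/month")
```

Below the break-even volume, per-token billing is cheaper; above it, the fixed-cost option wins, before counting the engineering time that self-hosting takes.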

Load-shedding fit. Same as Ollama in general — a maxed MacBook running Gemma 3 4B on battery is genuinely productive during stage 6. The "your AI dev tooling works through outages" story is a real differentiator for SA teams.

06 · Connections

Where Gemma links in the tree.

agents

Agents hub

The sub-tree landing. Frameworks call models; Gemma is one of the strongest open-weights options for "I want frontier-adjacent quality without cloud dependency".

agents/ollama

Ollama

The default local serving path for Gemma. ollama run gemma3:4b is the canonical "try it" command.

agents/google-adk

Google ADK

Same Google research lineage. ADK + Gemma 3 is the all-Google open-weights stack, useful when you want the framework and model to share assumptions.

tech/google

Google Cloud

Vertex AI Model Garden hosts Gemma in africa-south1. The managed-endpoint path for SA enterprise on GCP.

agents/llama

Llama

The other dominant open-weights family. Larger fine-tune ecosystem; pick Llama if community variants matter, Gemma if frontier-lab safety tuning matters.

agents/qwen

Qwen

Alibaba's open-weights family; particularly strong for code work. A common alternative to Gemma at similar size points.

07 · Resources

Model card, weights, and serving paths.

Type	Resource	Link
Site	Gemma official site	ai.google.dev/gemma
Weights	Gemma 3 27B IT (Hugging Face)	huggingface.co/google/gemma-3-27b-it
Local serve	Gemma 3 on Ollama	ollama.com/library/gemma3
Managed serve	Gemma on Vertex AI Model Garden	cloud.google.com/.../use-gemma
Licence	Gemma Terms of Use	ai.google.dev/gemma/terms
Source	google-deepmind/gemma reference impl	github.com/google-deepmind/gemma