agents · Gemma · Skill Leaf

Gemini's research, in open weights you can run.

Gemma is Google's open-weights model family, built from the same research line that produces Gemini. Gemma 3 (the current generation, March 2025) ships in five sizes — 270M / 1B / 4B / 12B / 27B — with multimodal vision-text understanding and a 128k-token context window on the 4B-and-up sizes, plus a permissive licence allowing commercial use. Available on Ollama, Hugging Face, and Vertex AI Model Garden. The local-first option for teams that want near-frontier inference without the cloud dependency.

Gemma 3 · Mar 2025 · Open weights · 270M / 1B / 4B / 12B / 27B · 128k context · Multimodal vision-text

Open weights from the team that makes Gemini.

Gemma is a family of open-weights models from Google DeepMind, distilled from the research and training infrastructure that produces Gemini. The first release was Gemma 1 (February 2024) at 2B and 7B sizes; Gemma 2 added 9B and 27B; Gemma 3 (March 2025) is the current generation and the one most teams should default to.

Gemma 3 ships in five sizes (270M, 1B, 4B, 12B, 27B). The 4B, 12B, and 27B variants are multimodal (vision-text) with a 128k-token context window; the 270M and 1B are text-only with a shorter 32k window. The licence is permissive enough for commercial deployments — Google's Gemma Terms of Use grants free use including for commercial work, with restrictions on harmful applications. Not a fully open-source licence (e.g. Apache 2.0), but practically permissive for almost every business case.

What "Gemma 4" looks like. As of mid-2026, Gemma 3 is the current line and Google has signalled the next generation in research papers but not shipped it. Treat any "Gemma 4" mention as forward-looking until Google confirms; this leaf will refresh when it does. The cadence so far has been ~12 months between major versions, so a follow-up generation is plausible in 2026.

Why "open weights from a frontier lab" matters

Most open-weights families are either research-lab (academic-quality benchmarks but production-rough deployment) or post-frontier (the lab's earlier-generation models released after the frontier moves on). Gemma is closer to the latter — not the same as Gemini 2.5 Pro, but produced by the team that ships Gemini, with the production-grade tokeniser, training-data curation, and safety tuning that come from a frontier-lab pipeline. Size for size, quality is competitive with Llama 3.x and meaningfully ahead of Mistral / Qwen on most multimodal benchmarks.

Five sizes, one quality curve.

The Gemma 3 family covers a deliberate range — from edge devices (270M, 1B) to laptops (4B, 12B) to small servers (27B). The breakpoints map roughly to which hardware you'd actually deploy on. The 4B, 12B, and 27B sizes share the multimodal capability and 128k context window (270M and 1B are text-only); beyond that, the differences are parameter count, throughput, and quality on hard tasks.

| Size | Hardware | Best at |
| --- | --- | --- |
| Gemma 3 270M | Phones, embedded, Raspberry Pi 5 | On-device assistants, classification, basic chat |
| Gemma 3 1B | 16GB RAM laptops, edge devices | Lightweight RAG, summarisation, simple agents |
| Gemma 3 4B | 32GB MacBook, mid-tier laptops | The default "good local model": agents, chat, RAG |
| Gemma 3 12B | 64GB MacBook Pro, single GPU with 12GB+ VRAM | Where Gemma starts feeling like Claude Haiku |
| Gemma 3 27B | 64GB+ MacBook, RTX 4090, single H100 | Frontier-grade open-weights tasks: reasoning, code |
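The hardware breakpoints above can be sketched as a simple size picker. This is a hedged rule of thumb drawn from the table, not an official sizing guide; the exact RAM/VRAM thresholds are assumptions:

```python
def pick_gemma3(ram_gb: float, vram_gb: float = 0) -> str:
    """Rough Gemma 3 size picker from the hardware table (rule of thumb only)."""
    if vram_gb >= 24 or ram_gb >= 64:   # RTX 4090 / H100 / 64GB+ MacBook
        return "gemma3:27b"
    if vram_gb >= 12 or ram_gb >= 48:   # single mid-range GPU or big MacBook Pro
        return "gemma3:12b"
    if ram_gb >= 32:                    # the default "good local model" tier
        return "gemma3:4b"
    if ram_gb >= 16:                    # lightweight laptop / edge device
        return "gemma3:1b"
    return "gemma3:270m"                # phones, embedded, Raspberry Pi 5

print(pick_gemma3(32))        # a 32GB MacBook lands on gemma3:4b
print(pick_gemma3(16, 12))    # a 12GB-VRAM GPU unlocks gemma3:12b
```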

The default pick for most workloads

Gemma 3 4B (gemma3:4b in Ollama) is the local-LLM default for 2026. It fits on a 32GB MacBook, runs at chat speed (15-30 tokens/sec on an M3), handles multimodal input, and supports 128k context. Quality is comparable to GPT-3.5 / Claude Haiku from a year or two ago — good enough for the vast majority of internal tooling and agent use cases. ollama run gemma3:4b is the canonical "try it now" command.
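As a back-of-envelope check on why the 4B model fits comfortably on a 32GB machine: weight memory is roughly parameter count times bits per weight. The ~4.5 bits/weight figure below approximates common 4-bit quantisation and is an assumption, not a published spec; it also excludes KV cache and runtime overhead:

```python
def approx_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Very rough memory for model weights alone (no KV cache, no runtime).
    4.5 bits/weight approximates a Q4-style quantisation -- an assumption."""
    return params_billions * bits_per_weight / 8

# gemma3:4b at ~4.5 bits/weight needs on the order of 2-3 GB for weights,
# leaving ample headroom on a 32GB MacBook even with long-context KV cache.
print(approx_weights_gb(4))   # ~2.25
print(approx_weights_gb(27))  # ~15.2 -- why 27B wants 24GB VRAM or a big MacBook
```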

When Gemma is the right open-weights pick — and when it isn't.

Gemma has direct open-weights competitors at every size point. The honest comparison: Llama 3.3 dominates the long tail of community fine-tunes; Qwen 2.5 leads on coding-specific work; Mistral Large 2 has stronger function-calling. Gemma 3 leads on multimodal vision and is the most consistently safety-tuned for general business use.

| Family | Strengths | Watch out for |
| --- | --- | --- |
| Gemma 3 | Multimodal vision, safety tuning, 128k context, broad licence | Smaller community fine-tune ecosystem than Llama; not as code-strong as Qwen-coder |
| Llama 3.3 | Largest community ecosystem, best fine-tunes, best instruction-following at 70B | Meta licence has commercial restrictions over 700M MAU; vision support uneven |
| Qwen 2.5 | Best for code (Qwen Coder), strong multilingual | Chinese training emphasis means English-only benchmarks lag slightly |
| Mistral Large 2 | Function-calling reliability, French/European multilingual | Smaller, less open ecosystem |
| Phi-4 | Smallest credible reasoning model; runs on phones | Limited context, no multimodal |
| Gemini 2.5 Pro (closed) | 1M+ context, frontier reasoning, native search grounding | Cloud-only, USD per-token billing, data leaves device |

Gemma 3 vs Gemini 2.5 Pro — not the same model

It's tempting to read "Gemma 3 27B" as "smaller Gemini." It isn't. Gemini 2.5 Pro is a much larger, much more capable closed model with proprietary training data and post-training. Gemma 3 is a deliberate distillation that runs locally; the gap to Gemini 2.5 Pro on hard reasoning, long context, or specialist domains is real. Use Gemma 3 for tasks that don't need frontier quality; use Gemini 2.5 Pro (or Claude Opus, GPT-5) when they do. Running both side by side behind the same agent framework is increasingly the production pattern.
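The side-by-side pattern can be sketched as a tiny router that keeps routine work on local Gemma and escalates hard or very long tasks to a frontier endpoint. The model names, the gateway URL, and the length heuristic are all illustrative assumptions, not a prescribed setup:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    base_url: str  # both ends speak an OpenAI-compatible API

LOCAL = Route("gemma3:27b", "http://localhost:11434/v1")            # self-hosted Ollama
FRONTIER = Route("gemini-2.5-pro", "https://example-gateway/v1")    # hypothetical gateway

def route(task: str, needs_frontier: bool = False) -> Route:
    """Escalate explicitly-hard tasks, or very long inputs, to the frontier model.
    200k chars (~50k tokens) is an assumed comfort limit for the local model."""
    if needs_frontier or len(task) > 200_000:
        return FRONTIER
    return LOCAL

print(route("Classify this complaint").model)               # stays local
print(route("...", needs_frontier=True).model)              # escalates
```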

Three platforms cover almost every deployment.

Gemma is hosted in many places, but three cover almost every real workload: Ollama for laptop / single-machine / dev work, vLLM for production server deployments needing throughput, and Vertex AI Model Garden when you want a managed endpoint inside Google Cloud.

Local · dev

Ollama

ollama run gemma3:4b. The simplest path. OpenAI-compatible API at localhost:11434. Full quantisation options. Default for development.

Server · prod

vLLM

Production-grade serving with PagedAttention, continuous batching, OpenAI-compatible API. Optimal for high-throughput single-tenant deployments on a dedicated GPU.

Managed · GCP

Vertex AI Model Garden

One-click deploy to a Vertex endpoint with managed scaling, IAM, and audit logs. The path of least resistance for South African enterprises on GCP.

Discovery

Hugging Face

Where Google publishes the weights. Use for fine-tuning, evaluation, and model cards. Not typically the production serving path.

The "try Gemma 3 in five minutes" path:

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 3 4B
ollama run gemma3:4b

# Or via the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "Summarise the last 18 months of SA reserve bank policy in 3 bullets."}
    ]
  }'
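The same request from Python, using only the standard library against Ollama's OpenAI-compatible endpoint. This assumes Ollama is running locally with gemma3:4b already pulled:

```python
import json
from urllib import request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_ollama(prompt: str, model: str = "gemma3:4b") -> str:
    """POST to Ollama's OpenAI-compatible endpoint (requires Ollama running)."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_ollama("Summarise SA reserve bank policy in 3 bullets."))
```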

Open weights solve the cross-border problem.

POPIA + cross-border. A South African bank running customer-data summarisation on Claude or GPT-4 is making a Section 72 cross-border transfer. The same workload on a locally hosted Gemma 3 involves no transfer at all. For PII-bearing automation (loan summaries, claim triage, complaint classification) where frontier-model quality isn't required, Gemma 3 27B running on an SA-resident GPU is genuinely the regulatory-clean answer.
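The split can be enforced mechanically: records carrying personal identifiers stay on the local endpoint, everything else may use a cloud model. The detectors below (13-digit SA ID numbers, email addresses) are crude illustrative assumptions, not a POPIA compliance control:

```python
import re

# Crude detectors for SA personal identifiers -- illustrative only,
# not sufficient for actual POPIA compliance.
SA_ID = re.compile(r"\b\d{13}\b")                     # 13-digit SA ID number
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")    # email address

def endpoint_for(text: str) -> str:
    """PII-bearing text stays on the locally hosted Gemma endpoint;
    everything else may go to a cloud frontier model."""
    if SA_ID.search(text) or EMAIL.search(text):
        return "local-gemma3"
    return "cloud-frontier"

print(endpoint_for("ID 8001015009087 applied for a loan"))  # local-gemma3
print(endpoint_for("Summarise Q3 market trends"))           # cloud-frontier
```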

Cost vs Vertex AI. If your data-residency story doesn't require local inference, Vertex AI Model Garden hosts Gemma in africa-south1 (Johannesburg). Per-token costs are billed in USD and depend on the variant; for high-volume workloads, dedicated Vertex AI endpoints can beat per-token API consumption on cost, but typically still cost more than self-hosting on, say, Hetzner in Frankfurt if the workload can tolerate the extra latency.

Load-shedding fit. The story is the same as for Ollama generally — a maxed-out MacBook running Gemma 3 4B on battery is genuinely productive during stage 6. "Your AI dev tooling works through outages" is a real differentiator for SA teams.

Where Gemma links in the tree.

Model card, weights, and serving paths.