Gemma is Google's open-weights model family, built from the same research line that produces Gemini. Gemma 3 (the current generation, March 2025) ships in five sizes (270M / 1B / 4B / 12B / 27B), with multimodal vision-text understanding and a 128k-token context window on the 4B and larger sizes, under a permissive licence allowing commercial use. Available on Ollama, Hugging Face, and Vertex AI Model Garden. The local-first option for teams that want high-quality inference without the cloud dependency.
Gemma is a family of open-weights models from Google DeepMind, distilled from the research and training infrastructure that produces Gemini. The first release was Gemma 1 (February 2024) at 2B and 7B sizes; Gemma 2 added 9B and 27B; Gemma 3 (March 2025) is the current generation and the one most teams should default to.
Gemma 3 ships in five sizes (270M, 1B, 4B, 12B, 27B). The 4B, 12B, and 27B are multimodal (vision-text) with a 128k-token context window; the 270M and 1B are text-only with a shorter (32k) window. The licence is permissive enough for commercial deployments: Google's Gemma Terms of Use grant free use, including commercial work, with restrictions on harmful applications. It is not a fully open-source licence (e.g. Apache 2.0), but it is practically permissive for almost every business case.
What "Gemma 4" looks like. As of mid-2026, Gemma 3 is the current line and Google has signalled the next generation in research papers but not shipped it. Treat any "Gemma 4" mention as forward-looking until Google confirms; this leaf will refresh when it does. The cadence so far has been ~12 months between major versions, so a follow-up generation is plausible in 2026.
Most open-weights families are either research-lab (academic-quality benchmarks but production-rough deployment) or post-frontier (the lab's earlier-generation models released after the frontier moves on). Gemma is closer to the latter: not the same model as Gemini 2.5 Pro, but produced by the team that ships Gemini, with the production-grade tokeniser, training-data curation, and safety tuning that come from a frontier-lab pipeline. Size-for-size quality is competitive with Llama 3.x and meaningfully ahead of Mistral and Qwen on most multimodal benchmarks.
The Gemma 3 family covers a deliberate range: from edge devices (270M, 1B) to laptops (4B, 12B) to small servers (27B). The breakpoints map roughly to the hardware you'd actually deploy on. The 4B, 12B, and 27B share the multimodal capability and 128k context window; the 270M and 1B are text-only with shorter context. Beyond that, the differences are parameter count, throughput, and quality on hard tasks. A rough sizing sketch follows the table.
| Size | Hardware | Best at |
|---|---|---|
| Gemma 3 270M | Phones, embedded, Raspberry Pi 5 | On-device assistants, classification, basic chat |
| Gemma 3 1B | 16GB RAM laptops, edge devices | Lightweight RAG, summarisation, simple agents |
| Gemma 3 4B | 32GB MacBook, mid-tier laptops | The default "good local model" — agents, chat, RAG |
| Gemma 3 12B | 64GB MacBook Pro, single GPU 12GB+ VRAM | Where Gemma starts feeling like Claude Haiku |
| Gemma 3 27B | 64GB+ MacBook, RTX 4090, single H100 | Frontier-grade open-weights tasks; reasoning, code |
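The breakpoints above can be folded into a trivial sizing heuristic. A minimal sketch only: the RAM thresholds are rough assumptions drawn from the table, not official guidance, and GPU VRAM, quantisation, and context length all shift the real answer.

```python
# Rough sizing heuristic derived from the hardware table above.
# The RAM thresholds are assumptions, not official Google guidance;
# quantisation level and context length shift the real cut-offs.
def pick_gemma3_tag(ram_gb: int) -> str:
    if ram_gb >= 64:
        return "gemma3:27b"
    if ram_gb >= 48:
        return "gemma3:12b"
    if ram_gb >= 32:
        return "gemma3:4b"
    if ram_gb >= 16:
        return "gemma3:1b"
    return "gemma3:270m"
```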
`gemma3:4b` is the local-LLM default for 2026. It fits on a 32GB MacBook, runs at chat speed (15-30 tokens/sec on an M3), handles multimodal input, and supports 128k context. Quality is comparable to GPT-3.5 / Claude Haiku from a year or two ago: good enough for the vast majority of internal tooling and agent use cases. `ollama run gemma3:4b` is the canonical "try it now" command.
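Since the 4B takes image input, here is a minimal multimodal sketch using the `ollama` Python package (`pip install ollama`); `invoice.png` is a hypothetical local file, not something shipped with the model.

```python
# Minimal multimodal call via the ollama Python package.
# "invoice.png" is a hypothetical local image file.
import ollama

response = ollama.chat(
    model="gemma3:4b",
    messages=[
        {
            "role": "user",
            "content": "What is the total amount on this invoice?",
            "images": ["invoice.png"],  # file path; bytes are also accepted
        }
    ],
)
print(response["message"]["content"])
```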
Gemma has direct open-weights competitors at every size point. The honest comparison: Llama 3.3 dominates the long tail of community fine-tunes; Qwen 2.5 leads on coding-specific work; Mistral Large 2 has stronger function-calling. Gemma 3 leads on multimodal vision and is the most consistently safety-tuned for general business use.
| Family | Strengths | Watch out for |
|---|---|---|
| Gemma 3 | Multimodal vision, safety tuning, 128k context, broad licence | Smaller community fine-tune ecosystem than Llama; not as code-strong as Qwen-coder |
| Llama 3.3 | Largest community ecosystem, best fine-tunes, best instruction-following at 70B | Meta licence has commercial restrictions over 700M MAU; vision support uneven |
| Qwen 2.5 | Best for code (Qwen Coder), strong multilingual | Chinese training emphasis means English-only benchmarks lag slightly |
| Mistral Large 2 | Function-calling reliability, French/European multilingual | Smaller, less open ecosystem |
| Phi-4 | Smallest credible reasoning model; runs on phones | Limited context, no multimodal |
| Gemini 2.5 Pro (closed) | 1M+ context, frontier reasoning, native search grounding | Cloud-only, USD per-token billing, data leaves device |
It's tempting to read "Gemma 3 27B" as "smaller Gemini." It isn't. Gemini 2.5 Pro is a much larger, much more capable closed model with proprietary training data and post-training. Gemma 3 is a deliberate distillation that runs locally; the gap to Gemini 2.5 Pro on hard reasoning, long context, or specialist domains is real. Use Gemma 3 for tasks that don't need frontier; use Gemini 2.5 Pro (or Claude Opus, GPT-5) when they do. Both running side-by-side via the same agent framework is increasingly the production pattern.
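A minimal sketch of that side-by-side pattern, assuming both endpoints speak the OpenAI-compatible chat API. The `needs_frontier` heuristic, the `FRONTIER_API_KEY` variable, and the frontier model name are placeholders to swap for your own routing logic and provider.

```python
# Hypothetical router: cheap/private tasks go to local Gemma 3,
# hard reasoning goes to a frontier endpoint. Both speak the
# OpenAI-compatible chat API, so one client type covers both.
import os
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
frontier = OpenAI(api_key=os.environ["FRONTIER_API_KEY"])  # placeholder env var

def needs_frontier(task: str) -> bool:
    # Placeholder heuristic: route long or explicitly flagged tasks upstream.
    return len(task) > 4_000 or task.startswith("[hard]")

def complete(task: str) -> str:
    if needs_frontier(task):
        client, model = frontier, "gpt-5"  # placeholder frontier model name
    else:
        client, model = local, "gemma3:4b"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content
```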
Gemma is hosted in many places, but three cover almost every real workload: Ollama for laptop / single-machine / dev work, vLLM for production server deployments needing throughput, and Vertex AI Model Garden when you want a managed endpoint inside Google Cloud.
- **Ollama.** `ollama run gemma3:4b`. The simplest path. OpenAI-compatible API at `localhost:11434`. Full quantisation options. The default for development.
- **vLLM.** Production-grade serving with PagedAttention, continuous batching, and an OpenAI-compatible API. Optimal for high-throughput single-tenant deployments on a dedicated GPU; a short offline-inference sketch follows this list.
- **Vertex AI Model Garden.** One-click deploy to a Vertex endpoint with managed scaling, IAM, and audit logs. The path of least resistance for SA enterprise on GCP.
- **Hugging Face.** Where Google publishes the weights. Use it for fine-tuning, evaluation, and model cards. Not typically the production serving path.
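For quick evaluation outside a server, vLLM's offline API works too. A minimal sketch, assuming `pip install vllm` and a Hugging Face account that has accepted the Gemma licence; `google/gemma-3-4b-it` is the instruction-tuned 4B checkpoint.

```python
# Offline-inference sketch with vLLM. Assumes the Gemma licence has
# been accepted on Hugging Face and credentials are configured.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it")  # downloads weights from Hugging Face
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain PagedAttention in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```

For serving, `vllm serve google/gemma-3-4b-it` exposes the same style of OpenAI-compatible endpoint the Ollama examples use, so client code doesn't change.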
The "try Gemma 3 in five minutes" path:
```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 3 4B
ollama run gemma3:4b

# Or via the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "Summarise the last 18 months of SA reserve bank policy in 3 bullets."}
    ]
  }'
```
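The same request from Python, via the `openai` client pointed at the local endpoint. Ollama requires an `api_key` argument but ignores its value.

```python
# Same request via the openai Python client (pip install openai).
# The api_key is a required placeholder; Ollama does not check it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma3:4b",
    messages=[
        {"role": "user", "content": "Summarise the last 18 months of SA reserve bank policy in 3 bullets."}
    ],
)
print(resp.choices[0].message.content)
```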
POPIA + cross-border. An SA bank running customer-data summarisation on Claude or GPT-4 is making a Section 72 cross-border transfer. The same workload on a locally hosted Gemma 3 is no transfer at all. For PII-bearing automation (loan summaries, claim triage, complaint classification) where frontier-model quality isn't required, Gemma 3 27B running on an SA-resident GPU is genuinely the regulatory-clean answer.
Cost vs Vertex AI. If your data-residency story doesn't require local inference, Vertex AI Model Garden hosts Gemma in africa-south1 (Johannesburg). Per-token costs are billed in USD and depend on the variant; for high-volume workloads, a dedicated Vertex AI endpoint can beat per-token API consumption on cost, though it typically still costs more than self-hosting on Hetzner Frankfurt if the workload can tolerate the regional latency.
Load-shedding fit. The story is the same as for Ollama generally: a maxed-out MacBook running Gemma 3 4B on battery is genuinely productive during stage 6. "Your AI dev tooling works through outages" is a real differentiator for SA teams.
- `ollama run gemma3:4b` is the canonical "try it" command.
- Vertex AI Model Garden hosts Gemma in `africa-south1`: the managed-endpoint path for SA enterprise on GCP.