Gemma is Google's open-weights model family, built from the same research line that produces Gemini. Gemma 4 (current generation) ships in four sizes — E2B and E4B for ultra-mobile / edge / browser, a 31B dense model for server-grade single-host inference, and a 26B A4B Mixture-of-Experts for high-throughput reasoning. Multimodal across text, image, video, and audio (audio is native on E2B and E4B); 128k context on the small models, 256k on the medium. Native function-calling and configurable thinking modes. Permissive licence; runs on Ollama, Hugging Face, and Vertex AI Model Garden. The local-first option for teams that want frontier-quality inference without the cloud dependency.
Gemma is a family of open-weights models from Google DeepMind, distilled from the research and training infrastructure that produces Gemini. The first release was Gemma 1 (February 2024) at 2B and 7B sizes; Gemma 2 added 9B and 27B; Gemma 3 (March 2025) shipped at 1B / 4B / 12B / 27B with vision-text multimodality on the 4B and larger sizes, and a 270M model followed later in 2025; Gemma 4 is the current generation and the one most teams should default to.
Gemma 4 ships in four sizes: E2B and E4B (small models, named by effective parameter count, for the mobile / edge / browser class), a 31B dense model, and a 26B A4B Mixture-of-Experts. The family is multimodal across text, image (object detection, document / PDF parsing, multilingual OCR, handwriting, pointing), video (frame-sequence understanding), and audio (native on the E2B and E4B models). The context window is 128k tokens on the small models and 256k on the medium ones. Function-calling is built in, and all four sizes are designed as capable reasoners with configurable thinking modes. The licence is Google's Gemma Terms of Use: not Apache 2.0, but permissive enough for almost every commercial case.
Why the rename in the small range. The "E2B" / "E4B" naming replaces Gemma 3's 270M / 1B / 4B / 12B grading. The E stands for "effective" — the parameter count that determines runtime behaviour on a phone or browser, rather than the raw parameter count of the underlying model. The shift signals Gemma 4's deliberate design for ultra-mobile and edge deployment alongside the server-grade 31B and 26B-A4B options.
Most open-weights families are either research-lab (academic-quality benchmarks but production-rough deployment) or post-frontier (the lab's earlier-generation models released after the frontier moves on). Gemma is closer to the latter: it is not the same as Gemini Pro on the closed side, but it is produced by the team that ships Gemini, with the production-grade tokeniser, training data curation, and safety tuning that come from a frontier-lab pipeline. Gemma 4's additions (native audio on the small models, function-calling, configurable thinking modes, MoE at the 26B point) are the same operational features Gemini ships on the closed side, brought into open weights with about a generation of lag.
Gemma 4 splits into two ranges. The small models (E2B, E4B) are built for ultra-mobile, edge, and browser deployment: 128k context, native audio, and practical to run on-device. The medium models (31B dense, 26B A4B MoE) are server-grade with 256k context; the MoE option gives high throughput for advanced reasoning at a lower active-parameter cost than the dense 31B.
| Size | Hardware | Best at |
|---|---|---|
| Gemma 4 E2B | Phones, embedded, browser (Chrome), Pixel | On-device assistants, classification, basic chat. Audio in / out native. |
| Gemma 4 E4B | Laptops (16–32GB RAM), edge devices, mid-tier mobile | The default "good local model" — agents, chat, RAG, multimodal including audio |
| Gemma 4 31B (dense) | 64GB+ MacBook Pro, single GPU 24GB+ VRAM (RTX 4090, H100) | Frontier-grade open-weights tasks; reasoning, code, server-side multimodal |
| Gemma 4 26B A4B (MoE) | Same server class as 31B but lower active params per token | High-throughput inference; advanced reasoning at a fraction of the 31B compute cost per token |
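The hardware column for the two medium models follows from simple arithmetic on weight size. The sketch below is a rough estimate only: it assumes weight memory is parameter count times bits per weight, uses illustrative quantisation widths (16, 8, and 4 bits) that are not taken from any Gemma 4 release notes, and ignores the KV cache and runtime overhead. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    KV cache, activations, and runtime overhead are NOT included.
    """
    return params_billions * bits_per_weight / 8


# Total parameter counts come from the model names in the table above.
# For an MoE model, all experts must be resident in memory even though
# only a subset is active per token, so the total count is what matters here.
for name, params in [("31B dense", 31.0), ("26B A4B MoE", 26.0)]:
    for bits in (16, 8, 4):  # assumed quantisation widths, for illustration
        gb = weight_memory_gb(params, bits)
        print(f"{name:12s} {bits:2d}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions the 31B dense model needs roughly 15.5 GB for 4-bit weights, which is consistent with the 24GB+ VRAM guidance once cache and overhead are added. The MoE saves compute per token rather than memory: its full 26B parameters still have to fit.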
Gemma 4 E4B is the local-LLM default in 2026. It fits on a 16–32GB MacBook, runs at chat speed on Apple Silicon, handles full multimodal input including audio, supports 128k context, and has native function-calling for agent use. Quality is competitive with the previous generation of closed mid-tier models, which is good enough for the vast majority of internal tooling and agent use cases. `ollama run gemma4` is the canonical "try it now" command; use `ollama run gemma4:26b` for the MoE on a server.
Gemma has direct open-weights competitors at every size point. The honest comparison: Llama dominates the long tail of community fine-tunes; Qwen leads on coding-specific work; Mistral has stronger European-language depth. Gemma 4 leads on full multimodal coverage (text + image + video + audio) and on first-party safety tuning and function-calling for general business use.
| Family | Strengths | Watch out for |
|---|---|---|
| Gemma 4 | Full multimodal (text + image + video + audio), native function-calling, configurable thinking modes, 256k context on medium, broad licence, MoE option | Smaller community fine-tune ecosystem than Llama; not as code-strong as Qwen-coder |
| Llama 4 / 3.3 | Largest community ecosystem, best fine-tunes, best instruction-following at 70B+ | Meta licence has commercial restrictions over 700M MAU; vision support uneven |
| Qwen 2.5 / 3 | Best for code (Qwen Coder), strong multilingual | Chinese training emphasis means English-only benchmarks lag slightly |
| Mistral Large 2 | Function-calling reliability, French/European multilingual | Smaller, less open ecosystem |
| Phi-4 family | Small credible reasoning models; Phi-4-mini is the phone-class size | Limited context on the base model; multimodal only in the separate Phi-4-multimodal variant |
| Gemini 2.5 Pro (closed) | 1M+ context, frontier reasoning, native search grounding | Cloud-only, USD per-token billing, data leaves device |
It's tempting to read "Gemma 4 31B" as "smaller Gemini." It isn't. Gemini Pro on the closed side is a much larger, much more capable model with proprietary training data and post-training. Gemma 4 is a deliberate distillation that runs locally; the gap to closed Gemini on hard reasoning, long context, or specialist domains is real. Use Gemma 4 for tasks that don't need frontier; use closed Gemini (or Claude, GPT) when they do. Both running side-by-side via the same agent framework is increasingly the production pattern.
Gemma is hosted in many places, but three cover almost every real workload: Ollama for laptop / single-machine / dev work, vLLM for production server deployments needing throughput, and Vertex AI Model Garden when you want a managed endpoint inside Google Cloud.
Ollama. `ollama run gemma4` is the simplest path. It exposes an OpenAI-compatible API at localhost:11434 and offers a full range of quantisation options. The default for development.
vLLM. Production-grade serving with PagedAttention, continuous batching, and an OpenAI-compatible API. Best suited to high-throughput single-tenant deployments on a dedicated GPU.
Vertex AI Model Garden. One-click deploy to a Vertex endpoint with managed scaling, IAM, and audit logs. The path of least resistance for South African (SA) enterprises on GCP.
Hugging Face. Where Google publishes the weights. Use it for fine-tuning, evaluation, and model cards. Not typically the production serving path.
The "try Gemma 4 in five minutes" path:
```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 (E4B by default; tags for E2B, 31B dense, 26B MoE)
ollama run gemma4
# or the MoE on a server-class host:
ollama run gemma4:26b

# Or via the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [
      {"role": "user", "content": "Summarise the last 18 months of SA reserve bank policy in 3 bullets."}
    ]
  }'
```
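The same endpoint can be called from application code, and this is where function-calling comes in. The following is a minimal sketch using only the Python standard library. It assumes the server accepts the OpenAI-style `tools` field on `/v1/chat/completions`; the `get_repo_rate` tool, its parameters, and the helper names are invented for illustration and are not part of any real API.

```python
import json
import urllib.request

# Endpoint and model tag as used in the quick start; adjust for your host.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

# Hypothetical tool definition in the OpenAI "tools" schema.
GET_REPO_RATE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_repo_rate",
        "description": "Return the central bank repo rate on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date, e.g. 2025-01-30"}
            },
            "required": ["date"],
        },
    },
}


def build_chat_request(prompt, model="gemma4", tools=None):
    """Build the JSON payload for a single-turn chat completion."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if tools:
        payload["tools"] = tools
    return payload


def send_chat(payload, url=OLLAMA_URL, timeout=120):
    """POST the payload to the local server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_tool_calls(response):
    """Return the tool calls the model requested, or an empty list."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []
```

With a local server running, `extract_tool_calls(send_chat(build_chat_request("What was the repo rate on 2025-01-30?", tools=[GET_REPO_RATE_TOOL])))` would return any tool calls the model chose to make. The application is still responsible for executing the tool and sending the result back in a follow-up message.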
POPIA + cross-border. A South African bank running customer-data summarisation on Claude or GPT is making a cross-border transfer under Section 72 of POPIA. The same workload on a locally hosted Gemma 4 involves no transfer at all. For PII-bearing automation (loan summary, claim triage, complaint classification) where frontier-model quality isn't required, Gemma 4 31B (dense) or 26B A4B (MoE) running on an SA-resident GPU is the regulatory-clean answer.
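In an agent stack that runs both a local Gemma 4 and a closed cloud model, this rule can be enforced as a routing decision. The sketch below is one hypothetical policy, not legal guidance: the `Task` fields, the backend labels, and the choice to keep every PII-bearing task local are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical backend labels; map these to real endpoints in your framework.
LOCAL = "local-gemma4"
CLOUD = "cloud-frontier"


@dataclass(frozen=True)
class Task:
    name: str
    contains_pii: bool    # set by an upstream classifier or by workload type
    needs_frontier: bool  # hard reasoning, very long context, specialist domain


def choose_backend(task: Task) -> str:
    """Pick a backend so that PII-bearing work never leaves the local host."""
    if task.contains_pii:
        # Assumed policy: residency wins over quality. A real deployment
        # might instead redact the input or escalate to human review.
        return LOCAL
    return CLOUD if task.needs_frontier else LOCAL


for task in [
    Task("complaint classification", contains_pii=True, needs_frontier=False),
    Task("public policy research", contains_pii=False, needs_frontier=True),
    Task("internal changelog summary", contains_pii=False, needs_frontier=False),
]:
    print(f"{task.name}: {choose_backend(task)}")
```

The design choice worth noting is that the PII check comes first, so a task flagged as both sensitive and difficult stays local in this sketch rather than silently crossing the border.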
Cost vs Vertex AI. If data residency doesn't require local inference, Vertex AI Model Garden hosts Gemma in africa-south1 (Johannesburg). Per-token costs are billed in USD and depend on the variant. For high-volume workloads, a dedicated Vertex AI endpoint can be more cost-effective than per-token API consumption, though it is typically still more expensive than self-hosting on a rented GPU server in Europe (Hetzner in Germany, for example), which only works if the workload tolerates the extra round-trip latency from South Africa.
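The "high-volume" threshold is a break-even calculation between a fixed monthly cost and a per-token price. The sketch below shows the arithmetic only. The function names are hypothetical, and any prices you pass in must be your own: no real Vertex AI or hosting prices are assumed here.

```python
def breakeven_tokens_per_month(fixed_monthly_usd: float,
                               usd_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed-cost host matches per-token billing."""
    if usd_per_million_tokens <= 0:
        raise ValueError("per-token price must be positive")
    return fixed_monthly_usd / usd_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_month: float,
                   fixed_monthly_usd: float,
                   usd_per_million_tokens: float) -> str:
    """Return 'dedicated' if the fixed-cost host is cheaper at this volume."""
    per_token_cost = tokens_per_month / 1_000_000 * usd_per_million_tokens
    return "dedicated" if fixed_monthly_usd < per_token_cost else "per-token"


# Placeholder figures, chosen only to make the arithmetic easy to follow.
FIXED_USD = 2000.0       # hypothetical monthly cost of a dedicated endpoint
PRICE_PER_MILLION = 0.5  # hypothetical per-token price, USD per 1M tokens

print(breakeven_tokens_per_month(FIXED_USD, PRICE_PER_MILLION))
print(cheaper_option(10_000_000_000, FIXED_USD, PRICE_PER_MILLION))
```

With these placeholder numbers the break-even sits at 4 billion tokens per month, so a 10-billion-token workload favours the dedicated option. For SA teams the USD figures also carry exchange-rate exposure, which this sketch does not model.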
Load-shedding fit. Same as Ollama in general — a MacBook running Gemma 4 E4B on battery is genuinely productive during stage 6. The "AI dev tooling works through outages" story is a real differentiator for SA teams. And the native audio support on E2B / E4B means voice-driven local agents become practical without a cloud round-trip.