Ollama is the simplest way to run LLMs locally. ollama run gemma3:4b and you have a working model. MIT-licensed, llama.cpp under the hood, OpenAI-compatible API endpoint at http://localhost:11434. Pull from a registry of 200+ open models — Gemma, Llama, Qwen, Mistral, Phi, DeepSeek, and the long tail of fine-tuned community variants. Zero cloud dependency, POPIA-friendly by default, the inference layer that survives load-shedding.
Ollama is three things bundled into one CLI: a model registry (Docker-Hub-for-LLMs at ollama.com/library), a local runtime built on top of llama.cpp for efficient quantised inference on CPU and GPU, and an OpenAI-compatible HTTP server running on localhost:11434 that any agent framework can call as if it were OpenAI.
The core promise: "ollama run gemma3:4b and the model works." No CUDA setup, no Hugging Face authentication dance, no virtualenv configuration, no quantisation guessing. The CLI handles model pulls, GPU detection, KV cache management, context window sizing, and concurrent request scheduling. For a developer who just wants to test "does Llama 3.3 70B answer this query the way Claude does?" without standing up a Hugging Face Inference Endpoint, Ollama is the path of least resistance.
The runtime under Ollama is llama.cpp — Georgi Gerganov's open-source C++ inference engine. Ollama wraps llama.cpp in a Go server with model management, a registry, and an HTTP API. That's the whole architecture. No proprietary pieces; the underlying engine is the same one shipped in every other "run LLMs locally" project (LM Studio, Jan, GPT4All).
Ollama exposes /v1/chat/completions with OpenAI's exact request/response format. That means: any code written against the OpenAI SDK runs against Ollama by changing a single environment variable (OPENAI_BASE_URL=http://localhost:11434/v1). LangChain, LlamaIndex, OpenAI Agents SDK, ADK via LiteLLM, custom curl calls — all work unchanged. That's the structural advantage: Ollama isn't trying to win on its own API, it's trying to be the place OpenAI-compatible code lands when you don't want to pay OpenAI.
Three CLI verbs do most of the work. ollama pull downloads a model. ollama run starts it interactively. ollama serve exposes the HTTP API. The HTTP server runs in the background by default on macOS and Linux, so most users never explicitly call serve — the API is just always there at localhost:11434.
The day-to-day commands:
```bash
# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Or on macOS via Homebrew
brew install ollama

# Pull and run a model — combined
ollama run gemma3:4b
ollama run llama3.3:70b
ollama run qwen2.5-coder:32b

# List local models, sizes, last-used
ollama list

# Remove a model to free disk
ollama rm llama3.3:70b

# Show GPU / CPU usage of running models
ollama ps
```
The HTTP API is OpenAI-shaped. Every framework that already speaks OpenAI works without modification:
```bash
# curl — direct OpenAI-format request
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

```python
# Python — OpenAI SDK pointed at Ollama
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
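The same holds for frameworks that ship an OpenAI-compatible client. A minimal sketch with LangChain's ChatOpenAI, assuming the langchain-openai package is installed; the api_key value is a placeholder that Ollama ignores:

```python
# LangChain: the stock OpenAI chat client, pointed at the local Ollama endpoint
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # any non-empty string; Ollama does not check it
    model="gemma3:4b",
)

print(llm.invoke("Hello").content)
```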
Calling Ollama from ADK via LiteLLM (the model-agnostic layer ADK uses for non-Gemini models):
```python
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

local_model = LiteLlm(model="ollama/gemma3:4b")

agent = Agent(
    name="local",
    model=local_model,
    instruction="You are a helpful assistant.",
)
```
Ollama ships quantised weights for llama.cpp to run — 4-bit, 5-bit, and 8-bit variants instead of full 16-bit FP16. A 70B model that's 140GB at FP16 becomes ~40GB at Q4, and quality on most tasks is barely affected. Ollama defaults to sensible quantisation choices per model; you can override the quant level at pull time (ollama pull llama3.3:70b-q5_K_M) for finer control. The economics: you can run "frontier-grade" 70B models on a 64GB consumer machine. Five years ago this required a $30k server.
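A back-of-the-envelope sketch of where those figures come from; the ~4.5 bits per weight is an approximation for Q4-style quantisation, not an exact constant, and the KV cache plus context add a few GB on top:

```python
# Rough weight-memory footprint at different precisions (weights only)
PARAMS = 70e9                       # Llama 3.3 70B parameter count

fp16_gb = PARAMS * 16 / 8 / 1e9     # 16 bits per weight
q4_gb   = PARAMS * 4.5 / 8 / 1e9    # ~4.5 bits per weight for Q4-style quants

print(f"FP16: ~{fp16_gb:.0f} GB")   # ~140 GB: multi-GPU server territory
print(f"Q4:   ~{q4_gb:.0f} GB")     # ~39 GB: fits in a 64GB machine's memory
```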
The honest answer most "run local LLMs" guides skip: different model sizes need different hardware. Below is the rough fit as of mid-2026, assuming Q4 quantisation (Ollama's default) and reasonable context windows (8-32k).
| Hardware | Comfortable | Stretches |
|---|---|---|
| MacBook Air M2 / M3 (16GB) | Gemma3:1b, Phi-3, Llama 3.2:3b | Gemma3:4b is borderline |
| MacBook Pro M3/M4 (32GB) | Gemma3:4b/12b, Llama 3.3:8b, Qwen 2.5:7b | Gemma3:27b, Mistral Large fits but slow |
| MacBook Pro M3 Max (64GB) | Gemma3:27b, Qwen 2.5:32b, Llama 3.3:70b at Q4 | Llama 3.3:70b at Q5+ is slow but works |
| Hetzner GEX44 (Ryzen + 64GB, no GPU) | Up to 14B comfortably (CPU inference) | 27B+ runs but at chat speed, not stream |
| Single RTX 4090 (24GB VRAM) | Up to 32B at Q4 fully on GPU | 70B with offloading; faster than M3 Max |
| Dual RTX 4090 / single H100 | 70B at Q5+, 405B at Q4 with offload | The "small server" tier |
Unified memory architecture means Apple Silicon Macs run larger models than Linux + GPU systems with similar nominal RAM. A 64GB M3 Max comfortably runs Llama 70B at Q4; a 64GB Linux box with a 24GB GPU has to offload aggressively and is meaningfully slower. For local-LLM dev work in 2026, a maxed-out MacBook Pro is genuinely the most cost-effective hardware if you don't need extreme throughput.
Local inference is not always cheaper, faster, or better — it has its own trade-offs. The honest framing: cloud inference wins on absolute performance and frontier-model quality; local inference wins on cost predictability, data residency, and "the laptop works on the plane".
Local coding assistant. qwen2.5-coder:7b + IDE.

Local for development, cloud for production hero work, local fallback for outages. Build agents against an Ollama-hosted Gemma/Llama for the dev loop; deploy production agents pointing at Claude or Gemini APIs; configure a circuit-breaker that flips back to a local Ollama instance during cloud outages or load-shedding. The OpenAI-compatible API on Ollama makes this trivial — same code, different OPENAI_BASE_URL.
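A minimal sketch of that fallback using the OpenAI SDK on both sides; the model tags, the exceptions caught, and the function name are illustrative rather than a production circuit-breaker, and the cloud side could just as well be a Claude or Gemini client:

```python
# Cloud-first chat call with a local Ollama fallback: same code shape, different base_url
from openai import OpenAI, APIConnectionError, APITimeoutError

cloud = OpenAI()  # reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat(messages):
    try:
        # Normal path: hosted frontier model
        return cloud.chat.completions.create(model="gpt-4o", messages=messages)
    except (APIConnectionError, APITimeoutError):
        # Outage / load-shedding path: local open-weights model via Ollama
        return local.chat.completions.create(model="gemma3:4b", messages=messages)

reply = chat([{"role": "user", "content": "Summarise this incident report."}])
print(reply.choices[0].message.content)
```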
Load-shedding resilience. A laptop on battery with Ollama installed survives a stage 6 cycle just fine. Cloud LLMs need uplink + provider availability + the international peering being calm. For SA teams that need their tooling to work during a 4-hour outage, local inference is the difference between productive and idle.
FX cost predictability. Ollama is free. Hardware is a one-time purchase. There is no per-token billing in USD, no surprise quarterly revaluation, no quota throttling at month-end. For a SA team running heavy iterative-dev work (10k+ requests/day during a feature push), the math gets compelling fast: a $200/mo Claude Max heavy plan is roughly R3,800/mo, versus R0/mo in inference fees on a maxed-out MacBook. You amortise the laptop in 6-12 months.
POPIA + cross-border. Section 72 of POPIA restricts personal information transfer outside SA without consent / adequate protection. Cloud LLMs in the US are one such transfer. Ollama on a local machine is no transfer at all. For workloads on PII-bearing data, local-first inference removes the cross-border issue entirely. The honest constraint: you sacrifice frontier-model quality. But for many enterprise workflows (summarisation, classification, structured extraction), an open-weights 27B-70B model is good enough.
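As a concrete illustration of that "good enough" claim, a minimal on-device classification sketch; the model tag, labels, and ticket text are illustrative, and the personal information never leaves the machine:

```python
# Classify a PII-bearing support ticket entirely on-device via Ollama
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

ticket = "Customer Thandi Nkosi reports a double debit on her cheque account."

resp = client.chat.completions.create(
    model="gemma3:27b",  # illustrative: any capable local open-weights model
    messages=[
        {"role": "system",
         "content": "Classify the ticket as one of: billing, fraud, technical, other. "
                    "Reply with the label only."},
        {"role": "user", "content": ticket},
    ],
)
print(resp.choices[0].message.content)  # e.g. "billing"
```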
Hetzner SA region note. Hetzner has no SA region; their closest is Frankfurt. For SA-resident production deployments needing more inference throughput than a laptop, options narrow: GCP Johannesburg + Vertex AI hosting open-weights, AWS Cape Town + SageMaker, or a local hosting provider with GPU SKUs. None are as cheap as Ollama on local hardware, but all keep data SA-resident.
ollama run gemma3:4b is the canonical "try it now" command for Gemma 3. ADK reaches Ollama-hosted models through LiteLLM (LiteLlm("ollama/gemma3:4b")), the framework-to-local-model bridge.