Ollama is the simplest way to run LLMs locally. ollama run gemma3:4b and you have a working model. MIT-licensed, llama.cpp under the hood, OpenAI-compatible API endpoint at http://localhost:11434. Pull from a registry of 200+ open models — Gemma, Llama, Qwen, Mistral, Phi, DeepSeek, and the long tail of fine-tuned community variants. Zero cloud dependency, POPIA-friendly by default, the inference layer that survives load-shedding.
Ollama is three things bundled into one CLI: a model registry (Docker-Hub-for-LLMs at ollama.com/library), a local runtime built on top of llama.cpp for efficient quantised inference on CPU and GPU, and an OpenAI-compatible HTTP server running on localhost:11434 that any agent framework can call as if it were OpenAI.
The core promise: "ollama run gemma3:4b and the model works." No CUDA setup, no Hugging Face authentication dance, no virtualenv configuration, no quantisation guessing. The CLI handles model pulls, GPU detection, KV cache management, context window sizing, and concurrent request scheduling. For a developer who just wants to test "does Llama 3.3 70B answer this query the way Claude does?" without standing up a Hugging Face Inference Endpoint, Ollama is the path of least resistance.
The runtime under Ollama is llama.cpp — Georgi Gerganov's open-source C++ inference engine. Ollama wraps llama.cpp in a Go server with model management, a registry, and an HTTP API. That's the whole architecture. No proprietary pieces; the underlying engine is the same one shipped in every other "run LLMs locally" project (LM Studio, Jan, GPT4All).
Ollama exposes /v1/chat/completions with OpenAI's exact request/response format. That means: any code written against the OpenAI SDK runs against Ollama by changing a single environment variable (OPENAI_BASE_URL=http://localhost:11434/v1). LangChain, LlamaIndex, OpenAI Agents SDK, ADK via LiteLLM, custom curl calls — all work unchanged. That's the structural advantage: Ollama isn't trying to win on its own API, it's trying to be the place OpenAI-compatible code lands when you don't want to pay OpenAI.
Three CLI verbs do most of the work. ollama pull downloads a model. ollama run starts it interactively. ollama serve exposes the HTTP API. The HTTP server runs in the background by default on macOS and Linux, so most users never explicitly call serve — the API is just always there at localhost:11434.
The day-to-day commands:
```bash
# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Or on macOS via Homebrew
brew install ollama

# Pull and run a model — combined
ollama run gemma3:4b
ollama run llama3.3:70b
ollama run qwen2.5-coder:32b

# List local models, sizes, last-used
ollama list

# Remove a model to free disk
ollama rm llama3.3:70b

# Show GPU / CPU usage of running models
ollama ps
```
The HTTP API is OpenAI-shaped. Every framework that already speaks OpenAI works without modification:
```bash
# curl — direct OpenAI-format request
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

```python
# Python — OpenAI SDK pointed at Ollama
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
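The same holds for frameworks that ship an OpenAI-compatible client. A minimal sketch with LangChain's ChatOpenAI, assuming the langchain-openai package is installed; the api_key value is a placeholder that Ollama ignores:

```python
# LangChain: the stock OpenAI chat client, pointed at the local Ollama endpoint
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # any non-empty string; Ollama does not check it
    model="gemma3:4b",
)

print(llm.invoke("Hello").content)
```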
Calling Ollama from ADK via LiteLLM (the model-agnostic layer ADK uses for non-Gemini models):
```python
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

local_model = LiteLlm(model="ollama/gemma3:4b")

agent = Agent(
    name="local",
    model=local_model,
    instruction="You are a helpful assistant.",
)
```
Ollama ships quantised weights for llama.cpp to run — 4-bit, 5-bit, and 8-bit variants instead of full 16-bit FP16. A 70B model that's 140GB at FP16 becomes ~40GB at Q4, and quality on most tasks is barely affected. Ollama defaults to sensible quantisation choices per model; you can override the quant level at pull time (ollama pull llama3.3:70b-q5_K_M) for finer control. The economics: you can run "frontier-grade" 70B models on a 64GB consumer machine. Five years ago this required a $30k server.
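A back-of-the-envelope sketch of where those figures come from; the ~4.5 bits per weight is an approximation for Q4-style quantisation, not an exact constant, and the KV cache plus context add a few GB on top:

```python
# Rough weight-memory footprint at different precisions (weights only)
PARAMS = 70e9                       # Llama 3.3 70B parameter count

fp16_gb = PARAMS * 16 / 8 / 1e9     # 16 bits per weight
q4_gb   = PARAMS * 4.5 / 8 / 1e9    # ~4.5 bits per weight for Q4-style quants

print(f"FP16: ~{fp16_gb:.0f} GB")   # ~140 GB: multi-GPU server territory
print(f"Q4:   ~{q4_gb:.0f} GB")     # ~39 GB: fits in a 64GB machine's memory
```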
The honest answer most "run local LLMs" guides skip: different model sizes need different hardware. Below is the rough fit as of mid-2026, assuming Q4 quantisation (Ollama's default) and reasonable context windows (8-32k).
| Hardware | Comfortable | Stretches |
|---|---|---|
| MacBook Air M2 / M3 (16GB) | Gemma3:1b, Phi-3, Llama 3.2:3b | Gemma3:4b is borderline |
| MacBook Pro M3/M4 (32GB) | Gemma3:4b/12b, Llama 3.3:8b, Qwen 2.5:7b | Gemma3:27b, Mistral Large fits but slow |
| MacBook Pro M3 Max (64GB) | Gemma3:27b, Qwen 2.5:32b, Llama 3.3:70b at Q4 | Llama 3.3:70b at Q5+ is slow but works |
| Hetzner GEX44 (Ryzen + 64GB, no GPU) | Up to 14B comfortably (CPU inference) | 27B+ runs but at chat speed, not stream |
| Single RTX 4090 (24GB VRAM) | Up to 32B at Q4 fully on GPU | 70B with offloading; faster than M3 Max |
| Dual RTX 4090 / single H100 | 70B at Q5+, 405B at Q4 with offload | The "small server" tier |
Unified memory architecture means Apple Silicon Macs run larger models than Linux + GPU systems with similar nominal RAM. A 64GB M3 Max comfortably runs Llama 70B at Q4; a 64GB Linux box with a 24GB GPU has to offload aggressively and is meaningfully slower. For local-LLM dev work in 2026, a maxed-out MacBook Pro is genuinely the most cost-effective hardware if you don't need extreme throughput.
Local inference is not always cheaper, faster, or better — it has its own trade-offs. The honest framing: cloud inference wins on absolute performance and frontier-model quality; local inference wins on cost predictability, data residency, and "the laptop works on the plane".
Local coding assistant. qwen2.5-coder:7b + IDE.

Local for development, cloud for production hero work, local fallback for outages. Build agents against an Ollama-hosted Gemma/Llama for the dev loop; deploy production agents pointing at Claude or Gemini APIs; configure a circuit-breaker that flips back to a local Ollama instance during cloud outages or load-shedding. The OpenAI-compatible API on Ollama makes this trivial — same code, different OPENAI_BASE_URL.
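A minimal sketch of that fallback using the OpenAI SDK on both sides; the model tags, the exceptions caught, and the function name are illustrative rather than a production circuit-breaker, and the cloud side could just as well be a Claude or Gemini client:

```python
# Cloud-first chat call with a local Ollama fallback: same code shape, different base_url
from openai import OpenAI, APIConnectionError, APITimeoutError

cloud = OpenAI()  # reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat(messages):
    try:
        # Normal path: hosted frontier model
        return cloud.chat.completions.create(model="gpt-4o", messages=messages)
    except (APIConnectionError, APITimeoutError):
        # Outage / load-shedding path: local open-weights model via Ollama
        return local.chat.completions.create(model="gemma3:4b", messages=messages)

reply = chat([{"role": "user", "content": "Summarise this incident report."}])
print(reply.choices[0].message.content)
```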
Load-shedding resilience. A laptop on battery with Ollama installed survives a stage 6 cycle just fine. Cloud LLMs need uplink + provider availability + the international peering being calm. For SA teams that need their tooling to work during a 4-hour outage, local inference is the difference between productive and idle.
FX cost predictability. Ollama is free. Hardware is a one-time purchase. There is no per-token billing in USD, no surprise quarterly revaluation, no quota throttling at month-end. For a SA team running heavy iterative-dev work (10k+ requests/day during a feature push), the math gets compelling fast: a $200/mo Claude Max heavy plan is roughly R3,800/mo, versus R0/mo in inference fees on a maxed-out MacBook. You amortise the laptop in 6-12 months.
POPIA + cross-border. Section 72 of POPIA restricts personal information transfer outside SA without consent / adequate protection. Cloud LLMs in the US are one such transfer. Ollama on a local machine is no transfer at all. For workloads on PII-bearing data, local-first inference removes the cross-border issue entirely. The honest constraint: you sacrifice frontier-model quality. But for many enterprise workflows (summarisation, classification, structured extraction), an open-weights 27B-70B model is good enough.
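As a concrete illustration of that "good enough" claim, a minimal on-device classification sketch; the model tag, labels, and ticket text are illustrative, and the personal information never leaves the machine:

```python
# Classify a PII-bearing support ticket entirely on-device via Ollama
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

ticket = "Customer Thandi Nkosi reports a double debit on her cheque account."

resp = client.chat.completions.create(
    model="gemma3:27b",  # illustrative: any capable local open-weights model
    messages=[
        {"role": "system",
         "content": "Classify the ticket as one of: billing, fraud, technical, other. "
                    "Reply with the label only."},
        {"role": "user", "content": ticket},
    ],
)
print(resp.choices[0].message.content)  # e.g. "billing"
```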
Hetzner SA region note. Hetzner has no SA region; their closest is Frankfurt. For SA-resident production deployments needing more inference throughput than a laptop, options narrow: GCP Johannesburg + Vertex AI hosting open-weights, AWS Cape Town + SageMaker, or a local hosting provider with GPU SKUs. None are as cheap as Ollama on local hardware, but all keep data SA-resident.
ollama run gemma3:4b is the canonical "try it now" command for Gemma 3. ADK reaches Ollama-hosted models through LiteLLM (LiteLlm("ollama/gemma3:4b")), the framework-to-local-model bridge.