A frontier-grade AI model now does the everyday work — standard websites, first-draft decks, routine code — at roughly a sixth of what the premium models charge. So why isn’t everyone switching tomorrow? Because the model was never the expensive part. The scaffolding around it is. This is the decision a leader actually faces — and it’s about cost, lock-in and ownership, not parameters.
Picture all the work your people hand to AI as a curve. The fat middle is standard and repeatable — a first-pass website, a presentation outline, summarising a document, a common coding task. On that middle, the cheap model isn’t a compromise; it often does the job as well or better. The premium models earn their price only on the hard, unusual tail — the genuinely novel or ambiguous problems — which is a minority of real usage.
Most leaders have never measured which of their AI work is routine and which is genuinely hard. That single measurement is the whole decision — because if most of your volume sits in the routine middle, you are paying a premium price for work a cheaper model does just as well.
You don’t have to choose one model for everything. The pattern worth knowing by name is the barbell: keep the expensive premium model for the few high-stakes calls — planning, final review — and send the heavy, repetitive bulk to the cheap one. Most of the work, and most of the savings, sits on the cheap end. The catch: someone has to build the switch that decides which job goes where — and that’s talent, not software you buy.
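For the engineers reading over your shoulder, the "switch" can be pictured as a small routing function. The following is a minimal sketch, not a recommended implementation: the model labels, the task kinds and the `novel` flag are all hypothetical names made up for illustration, and a real router would need a far better way of judging what counts as hard.

```python
from dataclasses import dataclass

# Hypothetical model labels, not real product names.
PREMIUM = "premium-model"
CHEAP = "cheap-model"

# Assumed task kinds that count as high-stakes in the barbell pattern.
HIGH_STAKES_KINDS = {"planning", "final_review"}


@dataclass
class Task:
    kind: str            # e.g. "planning", "summarise", "routine_code"
    novel: bool = False  # assumed to be flagged by a person or an upstream check


def route(task: Task) -> str:
    """Return the model label a task should be sent to."""
    if task.kind in HIGH_STAKES_KINDS or task.novel:
        return PREMIUM
    return CHEAP


# Illustrative task list, not measured data.
tasks = [
    Task("summarise"),
    Task("routine_code"),
    Task("planning"),
    Task("first_draft_deck"),
    Task("final_review"),
    Task("routine_code", novel=True),
]
routes = [route(t) for t in tasks]
cheap_share = routes.count(CHEAP) / len(routes)
print(routes)
print(f"share sent to the cheap model: {cheap_share:.0%}")
```

The rule itself is trivial. The hard part, and the reason this is talent rather than software you buy, is deciding reliably which tasks deserve the `novel` flag.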
The tempting assumption is that a cheaper model is one configuration change away — swap the key, keep the savings. It isn’t. The model sits inside scaffolding — the instructions it follows, the memory it keeps, the tools it’s wired to (call it the harness) — and all of that is tuned to one model’s habits. Change the model and the scaffolding has to be re-built.
The assumption: point the system at a different model, keep everything else, bank the saving. A five-minute job.
That is like swapping the engine and expecting the car to drive identically.
The reality: the instructions, the memory and the tool connections all have to be re-tuned for how the new model behaves. That is real engineering time, before any saving lands.
The saving is real — so is the rebuild. Both go in the business case.
Why it matters to you: the token saving and the rebuild cost sit on opposite sides of the same ledger. Which one wins depends on how much you use AI. At high volume the saving dwarfs the rebuild; at low volume the rebuild can outrun it for a year. The number that decides it is your usage — so the honest first step is to know it.
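The ledger can be reduced to one payback figure: the one-off rebuild cost divided by the monthly token saving on the work you would actually move. Below is a rough sketch under stated assumptions. Every number in it is illustrative, not taken from any vendor's price list; the cheap price is simply set to a sixth of the premium one, and `routine_share` stands for the fraction of your volume that is routine.

```python
def payback_months(monthly_tokens_m: float,
                   premium_price_per_m: float,
                   cheap_price_per_m: float,
                   routine_share: float,
                   rebuild_cost: float) -> float:
    """Months until the token saving covers the one-off rebuild cost.

    monthly_tokens_m is usage in millions of tokens per month; prices are
    per million tokens, in the same currency as rebuild_cost.
    """
    saving_per_m = premium_price_per_m - cheap_price_per_m
    monthly_saving = monthly_tokens_m * routine_share * saving_per_m
    if monthly_saving <= 0:
        return float("inf")  # no saving, so the rebuild never pays back
    return rebuild_cost / monthly_saving


# Illustrative inputs only: premium at 12.0 per million tokens, cheap at a
# sixth of that, 80% of volume routine, rebuild costing 60,000.
high_volume = payback_months(5000, 12.0, 2.0, 0.8, 60_000)
low_volume = payback_months(500, 12.0, 2.0, 0.8, 60_000)
print(f"high volume: {high_volume:.1f} months to pay back")
print(f"low volume:  {low_volume:.1f} months to pay back")
```

With these made-up inputs the high-volume case pays back in well under a quarter, while the low-volume case takes more than a year. The formula is the same in both cases; only the usage figure changes.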
There’s a clean line through this decision, and it’s drawn at margin: whether AI is part of the cost of what you sell, or a tool that helps your staff work.
If AI tokens are the cost of the product you sell, every cent saved drops straight to margin. The rebuild pays for itself quickly, and your competitors are doing the same maths. This is who is actually migrating today.
If AI helps your staff work faster but isn’t what you sell, the rebuild cost can outrun the saving for now. Waiting until the case is clear is a legitimate decision, not a failure of nerve.
This isn’t a sales pitch for switching. For a lot of businesses, “not yet” is the right answer — and knowing why it’s not yet (the rebuild cost exceeds your saving at your current volume) is worth more than a rushed migration.
The premium vendors aren’t defending their prices by being smarter. They’re defending them by being everywhere your team already works — inside the chat tool, the shared documents, the everyday apps — quietly absorbing how your company actually operates.
Run your whole operation on a vendor’s out-of-the-box convenience and, bit by bit, your operating knowledge ends up living in their tool. You start renting your own know-how back from them — and a cheaper model can’t rescue you if you can’t take your context with you when you leave. The cure isn’t a different model. It’s owning the layer where your company’s knowledge sits, so any model can plug into it.
Access to top-tier AI has been switched off before — under regulatory and export pressure, sometimes to whole countries or sectors, sometimes overnight. So “can we keep operating if the tap gets turned off?” is no longer paranoid; it’s planning. The answer is a model you can download and run yourself (the open-weight kind) kept as a backstop — the thing you fall back to if your main supplier becomes unavailable.
This is spare-generator logic. You don’t run the business off the generator — you keep it so a power cut doesn’t stop you. A self-hostable AI model is the same: redundancy you control, not a rip-and-replace of your best tool.
It’s redundancy, not replacement. The backstop covers the everyday work and keeps you running; you still reach for the premium model on the hard problems when it’s available.
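In engineering terms the generator is a fallback path: try the main supplier first, and only switch to the self-hosted model when the supplier cannot be reached. This is a minimal sketch with stand-in functions. `ProviderUnavailable` and the three client functions are hypothetical names invented here, and real clients would call a hosted API or a locally served open-weight model.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised by a client when its provider cannot be reached."""


def complete_with_backstop(prompt: str,
                           primary: Callable[[str], str],
                           backstop: Callable[[str], str]) -> tuple[str, str]:
    """Try the primary model; fall back to the self-hosted one if it is down.

    Returns (answer, source) so callers can see which model responded.
    """
    try:
        return primary(prompt), "primary"
    except ProviderUnavailable:
        return backstop(prompt), "backstop"


# Stand-in clients for illustration only.
def premium_client_up(prompt: str) -> str:
    return f"[premium] {prompt}"


def premium_client_down(prompt: str) -> str:
    raise ProviderUnavailable("supplier switched off")


def self_hosted_client(prompt: str) -> str:
    return f"[self-hosted] {prompt}"


normal = complete_with_backstop("Summarise this contract",
                                premium_client_up, self_hosted_client)
outage = complete_with_backstop("Summarise this contract",
                                premium_client_down, self_hosted_client)
print(normal)
print(outage)
```

The premium model remains the default; the backstop only answers during an outage. Reporting which source replied makes it easy to see how often you are running on the generator.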
A backstop you control is the point, whoever built it. The argument is for holding a model you can run yourself, not for any one vendor. The cheap, capable open-weight models leading this shift today happen to come from Chinese labs, which carries its own trade-off: running the weights yourself on your own machines is one thing; sending your prompts to that vendor’s hosted service is another. Weigh those separately. The data-residency side of this, keeping customer information in South Africa, is covered in the Keep the Data Home briefing and is not repeated here.
Have we actually measured how much of our AI use is routine versus genuinely hard? Until we have, we’re guessing at what we could save.
Is our operating knowledge ours — in a place any model can use — or trapped inside one vendor’s tool?
The scarce thing isn’t the model; it’s the people who can build the switch that routes routine work to the cheap model and hard work to the premium one. Do we have them, or do we need a partner?
If our main AI vendor became unavailable tomorrow, could we keep operating? And is the saving worth the rebuild — for us, now?
For most mid-sized South African businesses, the move today is not “rip out the premium model tomorrow.” It’s the quieter, cheaper discipline: measure your task mix, own your context, build the routing muscle, and keep a backstop you control — so that when switching pays, or a vendor cuts you off, you’re ready either way.
And a reality check on self-hosting: running the cheap open-weight model yourself needs a serious, expensive GPU cluster — data-centre hardware, not a spare server in the cupboard. The weights are free; the machines aren’t. For many teams the near-term path is trusted hosted access with eyes open, not a data-centre build.
The deep version for your engineers: the open-weight flagship behind this briefing — architecture, the real benchmarks, the hardware bill, and where the routing seam lives.
The data-residency half of this decision: how to use global AI without your customers’ data leaving South Africa. The POPIA answer in plain language.
The cheap model is real, and it’s good. But the moat moved: from the model to the harness, the routing, and the context around it. Owning those, and keeping a backstop you control, is the actual decision.