What categories of tooling an agent platform actually needs to ship into production, what credible options exist in each, and how to choose between them. Reference knowledge — no claims about anyone's deployment. The point of this leaf is orientation: most teams get one or two categories right and discover the others under deadline pressure. This page maps them so the gaps surface before the deadline, not after.
Last reviewed: 2026-05-17 · Cadence: hot (quarterly) · Worked example: tools/example
An agent is software that calls external systems with permissions and leaves a trail. To do that in production, not in a demo, you need eleven categories of tooling, plus one that cross-cuts the rest. They are not all the same kind of decision: some picks are sticky (the tool protocol your whole stack speaks), some are reversible (the audit backend, if you stuck to a standard), some are cheap to buy, some are cheap to build. A team that maps all eleven at the start avoids the worst mistake: discovering at week six that there is no plan for “how a human approves before this sends.”
The categories below are the ones that show up in every credible agent stack as of mid-2026. The twelfth (cross-cutting), pre-built vendor MCP servers, applies across all of them — it is the universe of integrations agents can call without writing custom code.
Read this section to orient. The next section breaks each category into its credible options.
What hosts the agent loop: the model call, the tool-call parsing, the result feedback, the next call. Manages context window and (often) MCP plumbing. Multi-provider or first-party?
Where calls to the model actually go. Routes between providers (Claude, GPT, Gemini, Ollama, vLLM), handles fallback + caching + rate limits + cost tracking. The layer that buys you model portability. Edge gateway, unified API, or self-hosted proxy?
How the agent talks to external systems. Defines tool schemas, invocation, error handling, streaming. The choice composes (or doesn't) with everything downstream. Standardised or per-vendor?
Where agents and humans find tools to call. Catalogues of pre-built integrations with auth handled. Often the only way to avoid writing custom adapter code. Free + community, or paid + curated?
What each agent is allowed to do, expressed as policy a non-engineer can read. The load-bearing layer for any audit / compliance / regulated-environment story. Centralised policy engine or framework allowlists?
How a human (or another agent) reconstructs what happened. Replayable traces, prompt + result + token + cost capture, by-agent / by-tool / by-time queries. OSS self-hosted or SaaS managed?
Where agent-written code runs without taking down the host. Tradeoff is between cold start and isolation strength; hardware vs container vs language-level. Hardware isolation or container isolation?
The same operation triggered by humans (form) or agents (tool call). Durable functions with sleep, retry, fan-out, and "pause until event" as first-class primitives. First-class workflow primitive or DIY?
Who the agent is when it acts. Short-TTL scoped tokens, per-agent service accounts, sender-constraint proofs. Without this, "permissions" is just a name on a policy file. Edge service tokens or enterprise IdP?
What blocks until a human says yes. Mutating actions, financial transfers, message sends, irreversible state changes: all candidates. Cheap to add, expensive to retrofit. Built into the workflow engine or external UI?
What the agent remembers between calls and across sessions. Per-conversation, per-user, per-organisation; managed memory vs explicit state machines vs DIY. Per-session state or persistent memory?
The universe of packaged integrations agents can call without writing custom code. 500+ servers in the official MCP registry as of mid-2026, first-party + community mixed. First-party only, or accept community?
Each table has three columns: pick, maintainer, and trade-off / when to choose. Not exhaustive: just the picks that show up in production agent stacks in mid-2026. Skim the picks; the rationale for any single choice belongs in its own decision leaf.
The agent's host environment. Manages the model call ↔ tool call ↔ result feedback loop, often the MCP plumbing too. Pick the one closest to the model you are using; switch costs are real but bounded (~1 day per agent).
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Claude Agent SDK | Anthropic (first-party) | Best fit if Claude is the model. Native allowed_tools + PreToolUse hooks; MCP-aware out of the box. |
| LangGraph | LangChain | Best for stateful multi-agent orchestration. Multi-provider. v1.0 in late 2025; 90k+ stars. |
| Vercel AI SDK v6 | Vercel | TypeScript-first, lovely DX, ToolLoopAgent for production. Narrower scope than LangGraph. |
| CrewAI | community + commercial | Role-based, fast adoption (60%+ Fortune 500 by Jan 2026), coarser permission model. |
| OpenAI Agents SDK | OpenAI (first-party) | Best fit if OpenAI is the model. Locks the runtime to OpenAI. |
| Microsoft Agent Framework | Microsoft (unified AutoGen + SK) | Enterprise .NET / Azure shops. GA Q1 2026. |
| Pydantic AI | Pydantic | Type-first Python; small but credible. |
| Raw model API + DIY loop | — | Maximum control; you rebuild the loop, context compaction, MCP plumbing yourself. |
Where calls to the model actually go. The layer that decouples agent code from “which model does this call?” Without it, swapping from Claude to Gemini means refactoring; with it, it's a config change. Cheap to add early, expensive to retrofit once dozens of code paths call anthropic.messages.create (or equivalent) directly. The category that buys you model portability — open weights through frontier — without rewrites.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Cloudflare AI Gateway | Cloudflare | Edge-native, caching + rate limits + analytics built in. Cheapest if you're already on Cloudflare; fits the edge-first story. |
| OpenRouter | OpenRouter | Unified API across 100+ models including open-weights (Llama, Qwen, DeepSeek) and frontiers (Claude, GPT, Gemini). Pay-per-call. |
| LiteLLM | BerriAI (OSS) | Self-hostable OpenAI-compatible proxy that routes anywhere (Anthropic, OpenAI, Bedrock, Vertex, Ollama). Strong control-plane choice. |
| Vertex AI Model Garden | Google | Google's hosted catalogue. Strong for Gemini + Claude + Llama in one place; GCP-native auth. |
| Amazon Bedrock | AWS | AWS-native model gateway. Right if you're already on AWS; heavy otherwise. |
| Ollama / vLLM (direct) | OSS | For self-hosting open-weights models. Pair with LiteLLM upstream or call directly. |
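
To make the "config change, not refactor" point concrete, here is a minimal sketch of the idea behind a gateway: agent code calls one `complete` method, and the provider order lives in configuration. `ModelGateway`, `flaky_primary`, and `stub_secondary` are hypothetical names invented for illustration; none of the products in the table expose this exact interface, and a real gateway adds caching, rate limits, and cost tracking.

```python
from dataclasses import dataclass
from typing import Callable

# A provider is reduced to "prompt in, text out"; a real one would wrap an SDK or HTTP call.
Provider = Callable[[str], str]


@dataclass
class ModelGateway:
    """Hypothetical in-process gateway: try providers in configured order."""
    providers: dict[str, Provider]
    order: list[str]  # preference order, owned by config rather than agent code

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (provider_name, text) from the first provider that succeeds."""
        errors = []
        for name in self.order:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # a real gateway would catch narrower errors
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stand-in for a provider that is rate limited or down.
    raise TimeoutError("rate limited")


def stub_secondary(prompt: str) -> str:
    # Stand-in for a healthy fallback provider.
    return f"echo: {prompt}"


gateway = ModelGateway(
    providers={"primary": flaky_primary, "secondary": stub_secondary},
    order=["primary", "secondary"],
)
used, text = gateway.complete("hello")
```

Swapping models here means editing `order`; no call site changes.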
How the agent talks to external systems. This pick is sticky — everything downstream depends on it. Bias toward standards.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| MCP (Model Context Protocol) | Anthropic-origin, open spec | The vendor-neutral standard. 500+ servers in the official registry. Cross-runtime by design. |
| OpenAI function-calling | OpenAI | Works inside OpenAI tooling. Tool definitions tied to one provider. |
| Direct HTTP / vendor SDKs | per-vendor | No discovery, no composition; fine for one-off integrations, awful as the count grows. |
| GraphQL endpoints | per-team | Niche — useful only if the backend is already GraphQL-native. |
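
As an illustration of what "defines tool schemas" means in practice, the sketch below describes a tool with a name, a description, and a JSON Schema for its input, which is the general shape an MCP tool descriptor takes. The `send_message` tool and the `validate_args` helper are hypothetical, and the validator covers only required fields and basic types; a production stack would use a full JSON Schema validator.

```python
# Shape modelled on an MCP tool descriptor: name, description, inputSchema (JSON Schema).
SEND_MESSAGE_TOOL = {
    "name": "send_message",  # hypothetical tool, for illustration only
    "description": "Post a message to a channel.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["channel", "text"],
    },
}

# Only the JSON types this sketch needs; numbers are omitted on purpose.
_JSON_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    schema = tool["inputSchema"]
    problems = [f"missing: {key}" for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected: {key}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"wrong type: {key}")
    return problems
```

Because the schema is data, the same descriptor can drive discovery, validation, and documentation, which is why the protocol choice is sticky.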
Where the agent finds tools to call. The alternative to a registry is writing your own adapters — viable for 5 tools, unmanageable at 50.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Official MCP Registry | Anthropic-maintained | The canonical discovery surface. Free. Listings for first-party and community servers. |
| Composio | Composio (SaaS) | 500+ pre-built actions with OAuth handled. Paid. |
| Arcade.dev | Arcade (commercial) | MCP-first runtime with per-user scoped tokens. Smaller catalog than Composio. |
| Pipedream | Pipedream (SaaS) | 2000+ app integrations. Agent-callable via API. Workflow-flavoured rather than MCP-native. |
What each agent is allowed to do. Externalised from code so non-engineers (and auditors) can read it. Without a policy engine, you are doing RBAC in code — fine at small scale, awful past it.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Cerbos | Cerbos (OSS, Apache-2) | YAML policies, sub-ms decisions, self-hostable. Partner-readable. |
| Permit.io | Permit (SaaS) | Managed RBAC/ABAC with a UI. Policies live in their database. |
| OPA / Rego | CNCF (OSS) | Generalist policy engine. Rego is powerful but opaque; higher learning curve. |
| Framework allowlists | per-framework | Built into the SDK (e.g. Claude SDK's allowed_tools). Sufficient at small scale. |
| Aembit | Aembit (commercial) | Machine-identity-first. Newer entrant, narrower focus. |
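
The difference between "RBAC in code" and externalised policy can be sketched in a few lines. The agent names, tool names, and the `is_allowed` helper below are hypothetical; this is not how Cerbos, Permit.io, or OPA express policy, only the underlying idea of a default-deny lookup against policy held as data.

```python
# Policy as data: in practice this would be loaded from a reviewed YAML or JSON file.
POLICY = {
    "support-agent": {"allow": {"crm.read_contact", "slack.post_message"}},
    "billing-agent": {"allow": {"crm.read_contact", "payments.refund"}},
}


def is_allowed(agent: str, tool: str, policy: dict = POLICY) -> bool:
    """Default-deny: unknown agents and unlisted tools are refused."""
    return tool in policy.get(agent, {}).get("allow", set())
```

A reviewer can audit `POLICY` without reading any agent code, which is the property the dedicated engines in the table provide at scale.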
How you reconstruct what happened. Three things matter: that traces are replayable, that args and results are captured (not just metadata), and that the schema follows the OpenTelemetry GenAI conventions so the backend is swappable.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Langfuse | Langfuse (OSS, MIT; ClickHouse-acquired Jan 2026) | Self-hostable, replayable traces, free at scale. Most popular OSS pick. |
| LangSmith | LangChain (SaaS) | Deepest LangGraph integration; per-seat priced. |
| Helicone | Helicone (OSS + SaaS) | Proxy-based, simplest to deploy. Narrower than dedicated trace stores. |
| Braintrust | Braintrust (SaaS) | Eval-first; tracing is secondary. Strong for evaluation workflows. |
| Arize Phoenix | Arize (OSS) | ML-grade rigor; embedding analysis. Mixed LLM + traditional ML. |
| Pydantic Logfire | Pydantic (SaaS) | AI-specific UI; emerging. |
| OpenTelemetry GenAI | OTel community (standard) | The vendor-neutral semantic conventions underneath every backend. Not a backend itself. |
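
A rough sketch of an audit record that captures arguments and results, not just metadata, is shown below. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as best understood at the time of writing; verify them against the current spec, since the conventions are still evolving. The `agent.id`, `tool.args`, and `tool.result` keys and both helper functions are assumptions made for illustration, and a real deployment would emit spans through an OpenTelemetry SDK rather than plain dicts.

```python
import json
import time
import uuid


def tool_call_record(agent: str, tool: str, args: dict, result: str,
                     model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build one audit record holding everything needed to replay the call."""
    return {
        "trace_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "agent.id": agent,                               # hypothetical custom attribute
        "gen_ai.operation.name": "execute_tool",         # check against the current spec
        "gen_ai.tool.name": tool,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "tool.args": json.dumps(args, sort_keys=True),   # hypothetical: full args
        "tool.result": result,                           # hypothetical: full result
    }


def by_agent(records: list[dict], agent: str) -> list[dict]:
    """The by-agent query: every record a given agent produced."""
    return [r for r in records if r["agent.id"] == agent]


records = [
    tool_call_record("support-agent", "crm.read_contact", {"id": "c-1"}, "ok", "model-a", 120, 30),
    tool_call_record("billing-agent", "payments.refund", {"order": "o-9"}, "ok", "model-a", 200, 45),
]
```

Keeping the attribute names standard is what makes the backend swappable: the records stay the same when the store changes.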
Where agent-written code runs without taking down the host. Hardware isolation (Firecracker) is the credible pick for untrusted output; container isolation is faster but weaker.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| E2B | E2B (commercial) | Firecracker microVMs — hardware boundary. Sub-second cold starts. |
| Daytona | Daytona (commercial) | Docker-based, sub-90ms cold starts. Container-level isolation. |
| Modal Sandboxes | Modal (commercial) | GPU support, gVisor isolation. Higher cost; niche unless you need GPU. |
| Cloudflare Sandboxes | Cloudflare | Edge-native, emerging. Worth tracking, not yet the production pick. |
The same operation, callable by a human (HTML form) or by an agent (MCP tool). Durable functions with "pause until event" are the cleanest way to model this so the audit trail is identical for both callers.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Inngest | Inngest (OSS core + cloud) | Durable functions; "pause until event" primitive. Free tier covers pilot scale. |
| Trigger.dev v3 | Trigger.dev (OSS) | Similar to Inngest. TypeScript-first. |
| Temporal | Temporal (OSS) | Heavyweight enterprise option. Powerful, more infrastructure. |
| n8n | n8n (OSS + cloud) | Low-code workflows. Bidirectional MCP integration since April 2025. |
| Hookdeck | Hookdeck (SaaS) | Webhook infrastructure with an MCP server. Good when webhooks dominate. |
| DIY: one function, multiple entry points | — | Works at small scale; you re-implement durability when you grow. |
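
The "one function, multiple entry points" row can be sketched as follows: a form handler and a tool-call handler both funnel into a single implementation, so the audit entries differ only in who called. All names (`refund_order`, `handle_form_post`, `handle_tool_call`, `AUDIT_LOG`) are hypothetical, and the sketch deliberately omits durability (retries, sleep, pause until event), which is what the workflow engines above add.

```python
AUDIT_LOG: list[dict] = []


def refund_order(order_id: str, amount_cents: int, *, caller: str) -> dict:
    """The single implementation; every entry point goes through here."""
    result = {"order_id": order_id, "refunded_cents": amount_cents}
    AUDIT_LOG.append({
        "op": "refund_order",
        "caller": caller,
        "args": {"order_id": order_id, "amount_cents": amount_cents},
        "result": result,
    })
    return result


def handle_form_post(form: dict, user: str) -> dict:
    """Human entry point: form fields arrive as strings."""
    return refund_order(form["order_id"], int(form["amount_cents"]), caller=f"human:{user}")


def handle_tool_call(arguments: dict, agent: str) -> dict:
    """Agent entry point: arguments arrive from a tool call."""
    return refund_order(arguments["order_id"], int(arguments["amount_cents"]), caller=f"agent:{agent}")
```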
Who the agent is when it acts. Short-TTL scoped tokens per (agent, tool) pair, signed by the runtime. Without this layer, "permissions" is just a wishful name on a policy file.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Cloudflare Access service tokens | Cloudflare | Zero-trust, edge-native, no separate IdP. Cheapest path if you are already on Cloudflare. |
| Auth0 / Okta Agent IAM | Okta (SaaS) | Enterprise identity, broader scope. Heavier; right for regulated industries. |
| SPIFFE / SPIRE | CNCF (OSS) | Workload identity standard for K8s. Right if you are running on Kubernetes. |
| DPoP-bound JWTs | open spec (RFC 9449) | Sender-constrained tokens. Most secure; most implementation work. |
| Asgardeo / WSO2 | WSO2 (commercial) | Identity-as-a-service with agent-specific flows. Newer, niche adoption. |
What blocks until a human says yes. Cheap to add as a first-class workflow primitive on day one; expensive to retrofit after the audit log shows agents shipping money out the door without anyone approving it.
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Inngest pause-and-resume | Inngest | First-class workflow primitive. "Wait for approval keyed by run_id" in one line. |
| LangGraph interrupts | LangChain | Built into the agent loop. Right if LangGraph is the runtime. |
| AutoGen human handoff | Microsoft | Pattern within Microsoft's framework. |
| n8n approval nodes | n8n | Visual workflow with explicit approval steps. Lower code. |
| Slack-bot approval | per-team | Decisions happen where the team already talks. Lightweight, ergonomic. |
| Custom approvals UI + queue | — | DIY when frameworks don't fit. Most work; most control. |
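
The core of every option in the table is the same small state machine: a mutating action is parked under a `run_id` and nothing executes until a decision arrives. The in-memory `ApprovalGate` below is a hypothetical sketch of that shape, not the Inngest, LangGraph, or n8n API; a production version needs durable storage so pending approvals survive a restart.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Hypothetical gate: actions wait in `pending` until a human decides."""
    pending: dict[str, dict] = field(default_factory=dict)
    executed: list[dict] = field(default_factory=list)

    def request(self, run_id: str, action: dict) -> str:
        """Park the action; the agent's run blocks here."""
        self.pending[run_id] = action
        return "pending"

    def decide(self, run_id: str, approved: bool, approver: str) -> str:
        """Resolve one pending action; raises KeyError if unknown or already decided."""
        action = self.pending.pop(run_id)
        if approved:
            # Record who approved alongside the action, so the trail is complete.
            self.executed.append({**action, "run_id": run_id, "approved_by": approver})
            return "executed"
        return "rejected"
```

Popping the entry on decision means an approval cannot be replayed to run the same action twice.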
What the agent remembers between calls and across sessions. Three flavours: per-session state (the conversation), explicit state machines (the workflow), persistent memory (the agent's long-term knowledge of a user).
| Pick | Maintainer | Trade-off / when to choose |
|---|---|---|
| Anthropic Memory primitives | Anthropic (first-party) | Built into Claude Agent SDK. Lowest friction if Claude is the runtime. |
| Letta (MemGPT-derived) | Letta (OSS) | Persistent memory, conversation-centric. 15k stars. |
| mem0 | mem0 (commercial) | Managed memory service. Saves you running a vector DB. |
| LangGraph state machines | LangChain | Explicit per-workflow state. Right when your "memory" is really workflow state. |
| Vector DB + retrieval (DIY) | — | Maximum flexibility, you own the orchestration. Pick a DB (Postgres + pgvector, Pinecone, Qdrant) and build. |
The universe of packaged integrations. Cuts across all the other categories. The constraint that matters: first-party (vendor-maintained) servers are zero-effort to maintain; community servers can be richer, but you adopt them at your own risk and may inherit the patching.
| Server | Maintainer | What it brings |
|---|---|---|
| GitHub MCP | GitHub (first-party) | Repos, PRs, issues, code search. The legal-review and platform-engineering substrate. |
| HubSpot MCP | HubSpot (first-party, GA Apr 2026) | Contacts, deals, marketing email, campaign analytics. |
| Salesforce MCP | Salesforce (first-party) | Pipeline objects, flows, Apex actions. Enterprise CRM. |
| Google Workspace MCP | Google (first-party) | Gmail, Drive, Sheets, Calendar, Docs. The default office substrate. |
| BigQuery Remote MCP | Google (first-party) | Analytics queries. Read-only by default; write needs explicit scoping. |
| Slack MCP | Slack (first-party) | Read channels, post messages, list users. Posting is a mutating, externally visible action; gate it. |
| Notion MCP | Notion (first-party) | Pages, databases, comments. |
| Atlassian MCP | Atlassian (first-party) | Jira issues, Confluence pages. Engineering and compliance. |
| + ~500 more in the registry | mixed (first-party + community) | Check the official MCP registry. Production-pick rule: prefer first-party. |
Most picks pass any single heuristic. Run a candidate through all five and the picks narrow fast. These are biased toward shipping into production, not toward the cleverest architecture for a demo.
What does lock-in cost in 24 months? SaaS is faster to start; costs more long-term; the audit log lives on someone else's infrastructure. OSS is more setup, full data sovereignty. For agents handling sensitive data (financial, HR, legal, regulated), audit-log ownership often makes the OSS choice non-negotiable.
Rule of thumb: SaaS for the layers that are commodity (model gateway, registry); OSS for the layers that hold the audit (Langfuse over LangSmith), the policy (Cerbos over Permit.io), and any data plane.
What does the maintenance bill look like in a year? A vendor-maintained MCP server (GitHub's, HubSpot's, Salesforce's) is zero effort to maintain — the vendor ships fixes alongside their API changes. A community server can be richer but is someone's side project; you may inherit the patch responsibility.
Rule of thumb: For production, vendor-maintained wins unless you can afford to fork it and own the patches.
Who owns the audit log? In a multi-tenant SaaS, the log is in the provider's database. You can read it; you don't own it. For regulated environments (POPIA, GDPR, sector compliance) this is often disqualifying — the regulator wants the log under your control, not on someone else's terms of service.
Rule of thumb: Treat audit and policy as single-tenant by default. Treat catalogues and model traffic as multi-tenant if the vendor's terms hold.
At your scale, when does "good enough" beat "best in class"? At 5 agents and 12 tools, breadth (one credible pick per category, all working together) matters more than depth (the best possible audit tool). At 50 agents and 200 tools, depth in your weakest category becomes the constraint.
Rule of thumb: First production deploy — breadth. Second — deepen the layer that hurt most. Don't optimise before the pain shows up.
What does swapping cost if the pick was wrong? The tool protocol (MCP vs not) is sticky — the whole stack speaks it. The audit backend can swap in a weekend if you stuck to OpenTelemetry GenAI conventions. The runtime is one-day-per-agent. Some layers are cheap to change later; some aren't.
Rule of thumb: Spend more time on the sticky picks (protocol, identity, dual-use shape). Spend less on the reversible ones (audit backend, sandbox provider). Bias toward standards when sticky.
The categories above are not independent. The dual-use workflow primitive expects the audit layer to capture both callers identically; the permissions layer expects scoped identity tokens to bind decisions to agents; the runtime drives most of it. Two leaves go deeper.
Linked tersely. The landscape moves fast — verify the version against the date on this leaf.