A control plane shape for agent tool calls — described as designed, not as a running system. In this architecture, agents run on a multi-provider runtime that calls models through a model gateway (so the model itself stays a config choice — open weights through frontier, never tied to one provider). Each agent carries a tight allowed_tools list, and every call routes through one Cloudflare Worker that asks Cerbos for a decision, optionally pauses for human approval via Inngest, forwards to a vendor MCP server, and emits an OpenTelemetry trace to a self-hosted Langfuse. Mutating tools are modelled as Inngest functions so the same code path can serve a human form and an agent. Reference knowledge — this page documents the design so it can be adopted, adapted, or argued with.
A fractional-agent set-up — for illustration, agents named after C-suite roles like Grant (CFO), Penny (CMO), Leo (CLO), Grace (CHRO), Katharine (CRO) — needs real tools wired the right way: agents that do work, leave proof of work, and respect permissions. This page documents one such architecture, end to end, so it can be adapted or copied. The agent names below are illustrative; the architecture is independent of any specific roster.
About the descriptions below. Sections 02 to 08 describe the architecture as designed. They use the architectural present (“the gateway forwards…”, “the function pauses…”) for clarity, but read them as advisory — this is what the system would do in the design, not a claim about a currently-running deployment.
Three guarantees the architecture is designed to enforce:
Every tool call is keyed by a run_id: an OTel trace in Langfuse, a row in D1, an R2 object if any file was produced, and an approval record if it was a mutation. By construction, off-record action would not be possible.

One principalPolicy file per agent enumerates the exact (resource, action) pairs that agent is allowed to call. No wildcards. The files are partner-readable.

By design there is exactly one entry point for every tool call. Everything downstream (policy, approval, vendor MCP, audit, storage) sits behind that single Worker, which is what makes the audit trail complete by construction.
    # One Worker entry. Five downstreams. Same shape for every call.

    agent (multi-provider runtime         human (HTML form)
           + model gateway)
          |                                      |
          | MCP tool call                        | POST form
          v                                      v
    +-----------------------------------------------+
    |        mcp-gateway (Cloudflare Worker)        |
    +-----------------------------------------------+
          |
          +---> Cerbos PDP             # allow | deny | approval_required
          |     (cerbos.2nth.io, Fly.io)
          |
          +---> Inngest function       # if mutating: pause for approval
          |     (Inngest Cloud)          then call the vendor
          |
          +---> vendor MCP server      # GitHub / HubSpot / Slack / GWS / …
          |     (vendor-hosted)
          |
          +---> Langfuse               # OTel GenAI span, replayable
          |     (langfuse.2nth.io, Fly.io)
          |
          +---> D1 tool_calls row      # durable summary, indexed
          |     (Cloudflare D1)
          |
          +---> R2 runs/{run_id}/…     # artifact (file output, if any)
                (Cloudflare R2)
The Worker is intentionally thin. By design it doesn't invent a new protocol — it would forward MCP calls to vendor MCP servers with a per-agent scoped token. It doesn't invent an audit format — it would emit standard OpenTelemetry GenAI spans. It doesn't implement its own policy engine — it would ask Cerbos. Each downstream service is one a partner can independently swap or self-host. That is the point.
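As a rough illustration of how thin that routing could be, the sketch below reduces the gateway to one pure function: given a call and the decision returned by the policy engine, it says what the audit span would carry and which downstream fires. All type names, function names, and attribute keys here are hypothetical, chosen for the example rather than taken from Cerbos, Inngest, or the OpenTelemetry conventions.

```typescript
// Hypothetical shapes for illustration; not a real SDK surface.
type Decision = "allow" | "deny" | "approval_required";

interface ToolCall {
  runId: string;
  agentId: string;
  tool: string; // e.g. "bigquery.query"
  source: "agent" | "human";
}

interface GatewayPlan {
  // Attributes the audit span would carry (key names are assumptions).
  spanAttributes: Record<string, string>;
  // True only when the call can go straight to the vendor MCP server.
  forwardToVendor: boolean;
  // True when the call must wait for a human decision first.
  fireApprovalEvent: boolean;
}

// Every decision produces span attributes, so a denial is recorded
// just like an allowed call. Only the downstream flags differ.
function planToolCall(call: ToolCall, decision: Decision): GatewayPlan {
  return {
    spanAttributes: {
      run_id: call.runId,
      agent_id: call.agentId,
      tool: call.tool,
      source: call.source,
      decision,
    },
    forwardToVendor: decision === "allow",
    fireApprovalEvent: decision === "approval_required",
  };
}
```

Keeping the routing pure like this is one possible design choice: the Worker's I/O (policy lookup, forwarding, span export) can then be tested separately from the decision logic.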
The agents are designed to share this substrate. None of these are 2nth-built; all are off-the-shelf. The "why" for each pick matters more than the pick itself — if a partner reads the rationale they can substitute a different OSS option without breaking the architecture.
| Layer | Pick | Why this, not the alternative |
|---|---|---|
| Agent runtime | Multi-provider runtime member | LangGraph (Python), Vercel AI SDK (TS), or Pydantic AI — pick the one that switches between Claude, GPT, Gemini, Ollama, vLLM by config. See the runtime decision leaf for the trade-offs. |
| Model gateway | Cloudflare AI Gateway + LiteLLM | Edge caching + rate limits + analytics in front of a routing proxy that targets any provider. Keeps the model itself a config choice rather than a code commitment. |
| Tool protocol | MCP (Model Context Protocol) | 500+ servers in the official registry; vendor-maintained ones are the safest picks for a partner-facing showcase. |
| Permission policy | Cerbos PDP | YAML policies, <1 ms decisions, Apache-2 OSS, self-hostable. Permit.io rejected: SaaS. |
| Proof of work | Langfuse self-hosted | MIT, ClickHouse-backed, free at scale, replayable traces. LangSmith rejected: SaaS lock-in. |
| Trace contract | OpenTelemetry GenAI | Vendor-neutral semantic conventions. Future-proofs the audit log if we swap backends. |
| Dual-use | Inngest | One function = one webhook entry + one MCP tool entry; same code path. Trigger.dev viable but Inngest's pattern is cleaner for human/agent parity. |
| Sandboxed code | E2B | Firecracker microVMs — hardware boundary. Daytona is faster but Docker-based; for partner credibility, hardware isolation wins. |
| Identity per agent | Cloudflare Access service tokens + short-TTL capability JWTs | Cloudflare-native. No extra IdP. Adding SPIFFE/Auth0 would be over-engineering at this scale. |
All eight servers are first-party (each vendor maintains its own MCP server). That is the load-bearing constraint: a third-party MCP server can be richer or cheaper, but it may also be someone's hobby project, and a partner showcase cannot run on hobby projects.
| Server | Maintainer | Primary consumers |
|---|---|---|
| GitHub MCP | GitHub | Leo (legal review of repo content), all (code) |
| HubSpot MCP (mcp.hubspot.com, GA Apr 2026) | HubSpot | Penny, Katharine |
| Salesforce Hosted MCP | Salesforce | Katharine (enterprise-CRM lane) |
| Google Workspace MCP (Gmail / Drive / Sheets / Calendar / Docs) | Google | Grace, Penny, Katharine, all |
| BigQuery Remote MCP | Google | Grant; Katharine read-only |
| Slack MCP | Slack | Penny, Katharine, all (post-via-approval) |
| Notion MCP | Notion | All |
| Atlassian MCP (Jira / Confluence) | Atlassian | Leo (compliance issues), engineering issue tracking |
Each agent in this kind of set-up would get an allowlist of eight tools or fewer, drawn from the universal pool above plus a few domain-specific picks (e.g. Xero or SARS eFiling for a CFO agent, BambooHR-style HR tools for a CHRO agent). The enforceable version is one Cerbos principalPolicy YAML per agent, sitting at a stable path like cerbos/policies/<agent>.yaml. Allocation lives in the tools catalog for one illustrative roster.
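The enforceable policy itself would be Cerbos YAML, but the two rules stated here (at most eight tools, exact pairs with no wildcards) can be sketched as a small pre-commit style check. The following is a minimal sketch under assumptions: the `AgentAllowlist` shape, the function names, and the `xero.read_invoice` action are all hypothetical and are not part of Cerbos.

```typescript
// Hypothetical in-memory mirror of one agent's allowlist.
interface AgentAllowlist {
  agent: string;
  // resource (tool) name -> exact actions permitted on it
  tools: Record<string, string[]>;
}

const MAX_TOOLS = 8;

// Returns a list of problems; an empty list means the allowlist
// respects the "eight tools or fewer, no wildcards" rules.
function validateAllowlist(list: AgentAllowlist): string[] {
  const problems: string[] = [];
  const resources = Object.keys(list.tools);
  if (resources.length > MAX_TOOLS) {
    problems.push(`too many tools: ${resources.length} > ${MAX_TOOLS}`);
  }
  for (const resource of resources) {
    const actions = list.tools[resource] ?? [];
    for (const name of [resource, ...actions]) {
      if (name.indexOf("*") !== -1) {
        problems.push(`wildcard not allowed: ${resource} -> ${name}`);
      }
    }
  }
  return problems;
}

// Exact-match lookup: anything not enumerated is not allowed.
function isAllowed(list: AgentAllowlist, resource: string, action: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(list.tools, resource)) return false;
  const actions = list.tools[resource] ?? [];
  return actions.indexOf(action) !== -1;
}

// Illustrative CFO-style allowlist (action names are made up).
const grant: AgentAllowlist = {
  agent: "grant",
  tools: { bigquery: ["query"], xero: ["read_invoice"] },
};
```

With this allowlist, `isAllowed(grant, "bigquery", "query")` is true, while `isAllowed(grant, "slack", "delete_channel")` is false because the pair was never enumerated.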
In the design, every tool call produces all four artifacts. The run_id would be the universal join key — a partner pivoting from a single Langfuse trace could find the D1 row, the R2 file, and the approval record without any other identifier.
Replayable trace in Langfuse (self-hosted). Agent prompt, tool args, decisions, latencies, costs: everything an auditor needs to reconstruct what happened.
Durable summary in the D1 tool_calls table. Cheap to query, indexed by agent / tool / decision / time. Powers dashboards without paging the trace store.
When a tool produces a file (PDF, CSV, image, audio), it would land at runs/{run_id}/{filename}. The trace and the row both link to it.
Destructive actions block until a human decides. The decision is its own row in approvals with approver id, time, and note. A "yes" is as audited as a "no".
The schema — three tables, append-only by convention: agent_runs (one row per session), tool_calls (one row per call within a run), approvals (one row per approval decision). Indexes cover the three real query shapes: by-agent-by-time, by-decision-by-time, and pending-approvals.
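To make the join-key idea concrete, here is a minimal sketch of pivoting from a single run_id to its rows and its artifact prefix. The row shapes and field names are assumptions for illustration (the real columns are not specified here); only the table names and the `runs/{run_id}/` path come from the design above.

```typescript
// Hypothetical row shapes; field names are assumptions.
interface ToolCallRow {
  runId: string;
  tool: string;
  decision: "allow" | "deny" | "approval_required";
}

interface ApprovalRow {
  runId: string;
  approverId: string;
  approved: boolean;
  note: string;
}

interface ProofOfWork {
  runId: string;
  toolCalls: ToolCallRow[];
  approvals: ApprovalRow[];
  artifactPrefix: string; // where any R2 objects for this run would live
}

// Gathers everything recorded for one run using run_id as the only key.
function proofFor(
  runId: string,
  toolCalls: ToolCallRow[],
  approvals: ApprovalRow[],
): ProofOfWork {
  return {
    runId,
    toolCalls: toolCalls.filter((row) => row.runId === runId),
    approvals: approvals.filter((row) => row.runId === runId),
    artifactPrefix: `runs/${runId}/`,
  };
}
```

In a deployed system the two `filter` calls would be indexed queries against D1 rather than array scans; the point of the sketch is only that no identifier other than run_id is needed.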
In the design, a mutating action lives once, as an Inngest function. The function would be triggered by either an HTML form on know.2nth.ai/tools/… (humans) or an MCP tool call routed through the gateway (agents). Same code path. Same Langfuse trace shape. Only the source attribute differs.
How a call would flow through the system, using a marketing-email-send action (send-marketing-email) as the illustration:
Agent path. An agent (e.g. a CMO role) calls send_marketing_email via MCP. The gateway sees a destructive action, logs decision=approval_required, and fires an Inngest event with source: "agent". The function pauses on approval.decided keyed by run_id.
Approval step. A human approver opens the audit UI (Langfuse, behind a corporate access gateway), sees a pending card with the proposed copy + audience, clicks approve. The decision becomes its own D1 row and fires the matching event.
Send step. The function unpauses, calls the relevant vendor API (e.g. HubSpot Marketing Email) via the per-agent scoped token, writes the result back to the gateway, closes the tool_calls row.
Human path. A marketing operator opens an HTML form for the same action and submits it. The form POSTs to a Worker which fires the same Inngest event with source: "human". No pause — the form submission is the approval. Send step runs identically.
Audit parity. The Langfuse trace shape is identical across both paths. The single attribute source on the root span distinguishes them — everything else is the same.
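The two paths above can be sketched as one function with a single branch on `source`. This is a simplified, synchronous model under stated assumptions: the types, the `approvalFor` lookup, and the `send` callback are hypothetical stand-ins, and a "pending" return value stands in for the durable pause that an Inngest function would implement (for example with its `step.waitForEvent` primitive).

```typescript
type Source = "agent" | "human";
type ApprovalState = "approved" | "rejected" | "pending";

interface SendEmailEvent {
  runId: string;
  source: Source;
  subject: string;
  audience: string;
}

interface Outcome {
  status: "sent" | "rejected" | "pending";
  // Root span attributes; identical in shape on both paths.
  span: { name: string; run_id: string; source: Source };
}

// One code path for both callers. A human form submission counts as
// its own approval; an agent call needs a recorded decision first.
function sendMarketingEmail(
  event: SendEmailEvent,
  approvalFor: (runId: string) => ApprovalState,
  send: (event: SendEmailEvent) => void,
): Outcome {
  const span = { name: "send-marketing-email", run_id: event.runId, source: event.source };
  const approval: ApprovalState =
    event.source === "human" ? "approved" : approvalFor(event.runId);
  if (approval !== "approved") {
    return { status: approval, span };
  }
  send(event); // the send step is the same for both sources
  return { status: "sent", span };
}
```

Because the span is built before the branch, the only difference between an agent run and a human run in the returned attributes is the value of `source`, which is the parity property the design asks for.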
If a team adopts this architecture, the natural sequence is seven PRs. Each is one logical change; each builds on the last. Order matters — the audit and policy layer comes before any gateway code, so the first real tool call lands in a system that can see it and decide on it.
Create an agent-platform repo with the canonical folder layout (workers, functions, cerbos/policies, schema, fly). Set up CI with ASCII-only commit messages (Cloudflare Pages rejects non-ASCII with error 8000111). Land a README pointing at the architecture.
Deploy Langfuse and Cerbos on Fly.io (or equivalent). Write one example trace by hand. Load the per-agent Cerbos policies. Confirm a denied call lands in the audit log.
Implement workers/mcp-gateway: receive MCP call, check Cerbos, forward to a vendor MCP server, emit OTel span. End-to-end smoke test against the GitHub MCP server (read-only is safest).
Build one mutating action end-to-end (Inngest function + form Worker + MCP tool registration). Verify dual-use parity: agent and human produce identical traces, only the source attribute differs.
Replicate the pattern for each remaining agent in scope. Each gets at least one Inngest function for a mutating action.
Author one leaf per tool in the catalog (what it is, scopes, audit surface, sample agent + sample human call). Flip the matching "soon" cards on the catalog hub to "Live".
Stand up the auditor surface — typically Langfuse embedded behind a corporate access gateway — so human auditors and partners can replay any run by run_id.
Acceptance criteria for any deployment of this architecture. Each verification should produce a specific, observable outcome once the system is wired and running. If any fail, the design has been broken in implementation.
An agent (e.g. a CFO role) issues a BigQuery query via the gateway. The gateway logs a span with source=agent, the agent id, tool=bigquery.query, decision=allow. A row appears in D1. The trace replays in Langfuse with the exact prompt + response.
A user opens the matching HTML form and submits the same operation. The Inngest function fires. The Langfuse trace appears with source=human, identical schema otherwise.
An agent attempts an action not in its allowlist (e.g. a CFO agent trying slack.delete_channel). Cerbos returns DENY. The denial is logged with decision=deny, reason=policy_no_match. No mutation occurs.
An agent attempts a destructive action (e.g. hubspot.update_deal_stage). Inngest pauses. An approver opens the approvals surface, sees the pending action, approves. The action completes. The Langfuse trace has a child approval span with approver_id and latency.
The same code clones onto another team's Cloudflare account, with their own Cerbos policies and their own vendor MCP tokens. Agents work the same way. No origin-team credentials involved.
The Cloudflare Pages deployments API rejects non-ASCII commit messages (em-dashes, middle-dots, curly quotes, arrows) with error code 8000111. Any wrangler deploy step in CI should pass --commit-message="<ASCII string>" explicitly, with a SHA-based value. Without it, wrangler reads HEAD's raw git commit message, and any Unicode in it silently fails the deploy. This is load-bearing on day one.
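One way a CI script might produce that SHA-based value is sketched below. The function names and the `deploy <sha>` message format are assumptions for illustration; the only requirement taken from the text above is that the string passed to the flag contains nothing outside ASCII.

```typescript
// True when the text contains only printable ASCII (space through tilde).
function isAscii(text: string): boolean {
  return /^[\x20-\x7E]*$/.test(text);
}

// Builds a deploy message from the commit SHA instead of the raw
// commit message, so Unicode in HEAD's message can never reach the API.
function deployCommitMessage(sha: string): string {
  const message = `deploy ${sha.slice(0, 12)}`;
  if (!isAscii(message)) {
    throw new Error("commit message must be printable ASCII");
  }
  return message;
}
```

The returned string would then be passed as the value of the `--commit-message` flag in the deploy step. The `isAscii` helper can also be reused as a CI lint on incoming commit messages.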
Linked tersely. Verify the version against the date you read this leaf — the substrate moves fast, which is why the catalog cadence is hot (quarterly).