04 — Cross-cutting architecture

System-level architecture: control plane vs. data plane, multi-tenancy, deployment models, identity bootstrap, evidence pipeline, cost model, failure domains. The single architectural decision most likely to kill the company is named in section 7.

---

1. Control plane vs. data plane split

```mermaid
flowchart LR
  subgraph CP["ARX Control Plane (regional cells: us-east, us-west, eu-central, ap-southeast)"]
    direction TB
    CP_API["Tenant API\n(FastAPI)"]
    CP_PDP_DIST["PDP bundle\ndistributor (S3)"]
    CP_IDENTITY["SPIRE-fork\nidentity issuer"]
    CP_CRED["Credential broker\n(STS / OAuth-TE)"]
    CP_ORCH["Kill-switch\norchestrator (Temporal)"]
    CP_EVID["Evidence emitter\n+ chain anchor worker"]
    CP_DISC["Discovery brokers"]
    CP_DRL["Distributed revocation\nlist (Redis Cluster)"]
    CP_OTLP["OTLP receiver\n(OTel Collector)"]
    CP_STORE_PG[("Postgres / Supabase\n(per-tenant RLS)")]
    CP_STORE_CH[("ClickHouse\n(spans / 90d hot)")]
    CP_STORE_KAFKA[("Kafka / Redpanda\n(decision + audit\nstreams)")]
  end

  subgraph CUST["Customer environment (per-tenant)"]
    direction TB
    subgraph CUST_AGENTS["Agent runtimes"]
      A1["Salesforce\nAgentforce"]
      A2["Foundry agent"]
      A3["Bedrock\nAgentCore"]
      A4["LangChain agent\n(customer-managed)"]
      A5["MCP-speaking\nagent"]
    end
    PEP_SDK["In-process PEP\n(SDK)"]
    PEP_SC["Sidecar PEP\n(Envoy + Cedar)"]
    PEP_MCP["MCP gateway PEP"]
    PEP_NATIVE["Platform-native\naction hooks\n(Apex / Flow / Lambda)"]
    PEP_EGR["Egress proxy"]
    SEN["Shadow-agent\neBPF sensor"]
    AUD["Customer-controlled\naudit destination\n(S3 Object Lock)"]
    KMS["Customer KMS\n(BYOK)"]
  end

  subgraph TGTS["Downstream systems agents act on"]
    SF["Salesforce data"]
    M365["Microsoft 365 / Azure"]
    GW["Google Workspace / GCP"]
    AWS["AWS APIs"]
    SAAS["SaaS APIs (100+)"]
  end

  A1 --> PEP_NATIVE
  A2 --> PEP_EGR
  A3 --> PEP_NATIVE
  A4 --> PEP_SDK
  A5 --> PEP_MCP

  PEP_SDK -. fetches .-> CP_PDP_DIST
  PEP_SC -. fetches .-> CP_PDP_DIST
  PEP_MCP -. fetches .-> CP_PDP_DIST

  PEP_SDK -- OTLP --> CP_OTLP
  PEP_SC -- OTLP --> CP_OTLP
  PEP_MCP -- OTLP --> CP_OTLP
  PEP_EGR -- OTLP --> CP_OTLP
  PEP_NATIVE -- OTLP --> CP_OTLP

  CP_API --> CP_IDENTITY
  CP_API --> CP_CRED
  CP_API --> CP_ORCH
  CP_API --> CP_DRL
  CP_OTLP --> CP_STORE_CH
  CP_API --> CP_STORE_PG
  CP_API --> CP_STORE_KAFKA
  CP_EVID --> AUD
  CP_DISC -- Graph / Tooling API / etc --> A1
  CP_DISC -- Graph --> A2
  CP_DISC -- Bedrock APIs --> A3
  SEN -- ShadowSignal --> CP_DISC

  CP_CRED -. STS / OAuth tokens .-> TGTS
  PEP_SDK --> TGTS
  PEP_NATIVE --> TGTS
  PEP_EGR --> TGTS

  CP_API -- KMS-encrypt CMEK at rest --> KMS
```

What runs where

In the customer environment (data plane): every PEP that intercepts agent actions. The shadow-agent eBPF sensor. The customer's audit destination (their bucket). The customer's KMS keys.
In ARX's control plane: identity issuer, credential broker, kill-switch orchestrator, evidence emitter, discovery brokers, DRL, OTLP receiver, ClickHouse, Postgres, Kafka. The PDP bundle distributor (S3-backed; PDPs in the data plane fetch from it).
The deliberate boundary: the PDP itself runs in the data plane (in-process at every PEP). The control plane stores and distributes policy bundles; it doesn't make per-action decisions. This is what allows p99 < 5ms enforcement and what allows the platform to keep operating during a control-plane outage.
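A minimal sketch of that boundary, assuming a hypothetical `PolicyBundle` shape and a set-membership check standing in for real Cedar evaluation. The point it illustrates is only that the per-action path reads cached state and never touches the network, so a failed refresh leaves enforcement running on the last-known bundle.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PolicyBundle:
    """Hypothetical stand-in for a distributed policy bundle."""
    version: str
    fetched_at: float
    # (agent_id, action) pairs that are explicitly permitted.
    permits: frozenset = field(default_factory=frozenset)


class InProcessPEP:
    """Evaluates every action locally; the control plane is only a bundle source."""

    def __init__(self, fetch_bundle):
        self._fetch_bundle = fetch_bundle  # callable returning a PolicyBundle
        self._bundle = fetch_bundle()      # first fetch must succeed (fail-closed)

    def refresh(self) -> bool:
        """Try to pull a newer bundle; keep the cached one if the fetch fails."""
        try:
            self._bundle = self._fetch_bundle()
            return True
        except ConnectionError:
            return False

    def decide(self, agent_id: str, action: str) -> str:
        """Local lookup only: no network call on the per-action path."""
        allowed = (agent_id, action) in self._bundle.permits
        return "Permit" if allowed else "Forbid"
```

Anything not explicitly permitted is forbidden here, which is an assumption about default posture rather than something the section specifies.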

What never leaves the customer boundary

Prompt content and tool-call payloads, unless the customer explicitly opts in to ship them to ARX for behavioral analysis. Default is metadata-only (token counts, tool names, target endpoints, redacted classifiers).
Customer KMS keys. ARX uses customer-controlled CMEK to encrypt anything that does land in the control plane.
The audit chain itself lives in the customer's bucket. ARX's control plane holds operational state (what the chain head is, which blocks have been anchored), not the chain content.

What does cross into ARX's control plane

Policy decisions (PDP outcomes — Permit/Forbid + decision log).
Enforcement events (action attempted, decision rendered, latency).
Identity events (SVID issued, rotated, revoked).
Span metadata (OpenTelemetry GenAI conventions — model used, token counts, tool name, agent ID, trace ID). Not prompt or completion text.
Discovery results (what agents exist on what platforms; not what they do).
Cost-relevant fields (model, token counts).

This split is deliberately conservative — the security buyer's first question is "does our customer data leave our boundary" and the honest answer should be "no, only metadata about what your agents did."
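One way to make the metadata-only default hard to violate is an allowlist applied at the PEP before anything is exported. The sketch below assumes flat, hypothetical field names (`prompt`, `tool_args`, and so on); they are not the OpenTelemetry GenAI attribute names, and the opt-in flag is an illustrative parameter.

```python
# Hypothetical field names; only these survive the default export path.
ALLOWED_SPAN_FIELDS = frozenset({
    "trace_id", "agent_id", "model",
    "input_tokens", "output_tokens",
    "tool_name", "target_endpoint",
})


def to_control_plane_span(raw_span: dict, opt_in_payloads: bool = False) -> dict:
    """Return the subset of a span that may leave the customer boundary."""
    if opt_in_payloads:
        # Customer explicitly opted in to shipping content for behavioral analysis.
        return dict(raw_span)
    # Allowlist, not denylist: unknown fields are dropped by default.
    return {k: v for k, v in raw_span.items() if k in ALLOWED_SPAN_FIELDS}
```

An allowlist fails safe: a new field added by an SDK upgrade stays inside the boundary until someone deliberately lists it.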

---

2. Multi-tenancy model

Three deployment shapes, all on the same codebase:

| Tier | Tenancy | Customer base | Annual contract |
|---|---|---|---|
| Shared SaaS | Postgres RLS + Vault per-tenant transit + per-tenant signing keys + per-tenant ClickHouse database (shared cluster) | Mid-market through F1000 | $50K – $500K |
| Single-tenant SaaS (cell) | One regional cell per customer; dedicated Postgres + ClickHouse + Vault namespace; shared K8s control plane | F500 / regulated | $500K – $3M |
| Customer VPC | The control plane runs in the customer's AWS / Azure / GCP account; ARX manages it via cross-account IAM; customer holds keys + storage | Federal, defense, healthcare, financial-services hyper-regulated | $2M – $10M |

For v1 in 18 months: ship Shared SaaS + Single-tenant SaaS only. Customer VPC deployment is v2. Air-gapped deployment is off the roadmap; the only exception is a multi-million-dollar custom-engineering engagement in which a federal-systems-integrator partner does the integration.

The shared SaaS isolation primitives:

Postgres RLS for row-level isolation on every table that holds tenant data. Already in place via Supabase. Strict enforcement: every query passes the tenant's claim through; no service-role queries except for explicit cross-tenant operations (admin metrics, anonymized benchmarks).
Vault per-tenant transit for signing keys and tenant-encrypted secrets. The tenant's signing key never leaves Vault.
Per-tenant ClickHouse database — same cluster, distinct database, query-time tenant filter enforced by a thin proxy (not by the application code). ClickHouse RBAC is too coarse for "every query auto-scoped to tenant" — the proxy is required.
Per-tenant Kafka topic naming convention (tenant.{id}.audit.events, tenant.{id}.policy.decisions) with ACLs.
Per-tenant rate limits on the control plane API.
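The topic naming convention only isolates tenants if the tenant ID cannot smuggle in a separator. A small sketch, assuming a hypothetical tenant-ID format (lowercase alphanumerics and hyphens) and two helper names that are not part of any existing codebase:

```python
import re

# Assumed ID format: no dots, so an ID can never span two topic segments.
_TENANT_ID = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")
_STREAMS = ("audit.events", "policy.decisions")


def tenant_topic(tenant_id: str, stream: str) -> str:
    """Build a topic name following tenant.{id}.<stream>."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    if stream not in _STREAMS:
        raise ValueError(f"unknown stream: {stream!r}")
    return f"tenant.{tenant_id}.{stream}"


def can_access(tenant_id: str, topic: str) -> bool:
    """Prefix check of the kind a per-tenant ACL would encode."""
    # The trailing dot stops tenant "acme" from matching "tenant.acme2.*".
    return topic.startswith(f"tenant.{tenant_id}.")
```

The same idea applies to the ClickHouse proxy: scoping is derived from the authenticated tenant, never from a value the caller supplies in the query.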

What multi-tenancy does not mean

It does not mean per-tenant Kubernetes namespaces or per-tenant database servers for shared SaaS (the per-tenant ClickHouse databases above are logical databases on a shared cluster). The economics break. The cell deployment is the answer for customers that need that level of isolation.

---

3. Identity bootstrap (the chicken-and-egg problem)

How does an agent get its first identity without already having one?

For agents inside the four open-framework runtimes (LangChain / CrewAI / AutoGen / OpenAI Agents SDK)

The customer's CI pipeline registers a new agent definition with ARX during deployment, providing the customer's existing CI service-account OIDC token (GitHub Actions OIDC, GitLab OIDC, etc.).
ARX exchanges the CI token (RFC 8693) for an agent-bootstrap credential scoped to a single SVID issuance.
The deployed agent's process loads the bootstrap credential from a one-time-use mount (k8s secret, AWS Parameter Store SecureString, etc.).
On first run, the agent process calls POST /v1/identity/agents/bootstrap with the bootstrap credential + a workload-attestation payload (k8s service account, Kubernetes Pod identity token, AWS instance identity document, Azure managed identity, GCP workload identity federation).
ARX validates both and mints the long-lived agent identity (SVID), keyed on the workload-attestation claims.
From here on, the agent uses standard SPIFFE workload-identity flow.
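The steps above can be sketched as an in-memory simulation of the control-plane side. Everything here is illustrative: `BootstrapIssuer`, the claim names, and the SPIFFE path layout are assumptions, and real token signature checks and attestation verification are stubbed out. What it does show is the one-time-use property of the bootstrap credential.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Svid:
    spiffe_id: str
    agent_id: str


class BootstrapIssuer:
    """In-memory stand-in for the control-plane side of the bootstrap exchange."""

    def __init__(self, trust_domain: str):
        self._trust_domain = trust_domain
        self._pending: dict[str, str] = {}  # bootstrap credential -> agent_id

    def exchange_ci_token(self, ci_token_claims: dict, agent_id: str) -> str:
        """Trade an already-verified CI OIDC token for a one-shot credential."""
        # Assumption: signature and issuer checks happened before this call.
        if not ci_token_claims.get("sub"):
            raise PermissionError("CI token has no subject")
        credential = secrets.token_urlsafe(32)
        self._pending[credential] = agent_id
        return credential

    def bootstrap(self, credential: str, attestation: dict) -> Svid:
        """Consume the credential and mint an SVID keyed on the attestation."""
        # pop() makes the credential single-use, even if attestation then fails.
        agent_id = self._pending.pop(credential, None)
        if agent_id is None:
            raise PermissionError("unknown or already-used bootstrap credential")
        workload = attestation.get("workload")
        if not workload:
            raise PermissionError("missing workload attestation")
        return Svid(f"spiffe://{self._trust_domain}/agent/{workload}", agent_id)
```

Burning the credential before validating the attestation is a deliberate choice in this sketch: a failed attempt forces a fresh CI exchange rather than allowing retries against the same secret.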

For agents on the seven commercial platforms

The platform itself owns identity creation (Foundry creates an Entra service principal, Bedrock binds an IAM role, Salesforce assigns the agent to a User, etc.). ARX observes identity creation (via discovery, see C1.1) and binds an ARX-internal SVID to the platform-native identity for the purposes of policy enforcement, audit, and kill-switch routing. ARX does not issue the platform-native identity.

Initial human-owner binding

At agent registration, the human deploying the agent must be authenticated to ARX (via the customer's SSO). The agent's owner_human_user_id is stamped from that session. Re-attestation is quarterly by default, configurable. If the human leaves (signaled by SCIM deprovisioning from the customer's IdP), the agent goes orphaned → suspended → terminated on a configurable schedule.
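The owner lifecycle reduces to a function of elapsed time since SCIM deprovisioning. The sketch below assumes illustrative defaults (suspend after 7 days, terminate after 30, and 90 days as an approximation of "quarterly"); the section only says the schedule is configurable, so these numbers are placeholders.

```python
from datetime import date
from typing import Optional


def agent_state(today: date,
                owner_deprovisioned_on: Optional[date],
                suspend_after_days: int = 7,
                terminate_after_days: int = 30) -> str:
    """Map time since the owner left to orphaned / suspended / terminated."""
    if owner_deprovisioned_on is None:
        return "active"
    elapsed = (today - owner_deprovisioned_on).days
    if elapsed >= terminate_after_days:
        return "terminated"
    if elapsed >= suspend_after_days:
        return "suspended"
    return "orphaned"


def attestation_due(today: date, last_attested_on: date,
                    interval_days: int = 90) -> bool:
    """True once the owner's re-attestation interval has elapsed."""
    return (today - last_attested_on).days >= interval_days
```

What happens when re-attestation lapses while the owner is still employed is not specified above, so the sketch only reports that it is due.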

What this isn't

It isn't a new identity protocol. It's standard OAuth 2.1 + RFC 8693 token exchange + SPIFFE workload attestation + SCIM. The only original work is the binding registry (which agent belongs to which human) and the agent-bootstrap exchange (using CI's OIDC token to mint a one-time-use bootstrap credential).

---

4. Compliance evidence pipeline

```mermaid
flowchart LR
  PEP[PEP emits\nEnforcementEvent]
  AUD[Audit logger emits\nAuditEvent]
  IDP[Identity issuer emits\nIdentityEvent]
  MAN[Manifest changes\nemit ManifestEvent]

  TAG["Evidence tagger\n(static map per framework)"]

  KAF[("Kafka\nstream")]

  EMIT["Per-event evidence tagger\ntags: framework, control_id,\nsource_attribution_at_release_sha"]

  CHAIN["Merkle batcher\n(1 block / 5s / tenant)"]
  TSA["RFC 3161 timestamp\nauthority"]
  CUSTBKT["Customer-controlled\nS3 / Azure Blob / GCS"]

  PKG["Evidence package builder\n(quarterly + on-demand)"]
  VERIFY["arx-verify CLI\n(customer-side)"]
  AUDITOR["Auditor"]

  PEP --> KAF
  AUD --> KAF
  IDP --> KAF
  MAN --> KAF
  KAF --> TAG
  TAG --> EMIT
  EMIT --> CHAIN
  CHAIN --> TSA
  CHAIN --> CUSTBKT
  EMIT --> PKG
  PKG --> AUDITOR
  CUSTBKT --> VERIFY
  VERIFY --> AUDITOR
```

The pipeline is read-once, derive-many. Events emit once; the evidence tagger annotates with the framework controls each event satisfies; the chain anchor batches events into Merkle blocks and writes them into the customer's bucket; the package builder reads from the tagger's index plus walks the chain for inclusion proofs.

The auditor doesn't trust ARX. They run arx-verify against the customer's bucket, validate the chain end-to-end, validate the RFC 3161 timestamps against an independent timestamp authority, and accept or reject. ARX's role is to provide the inputs to that verification, not to assert correctness.
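To make "walks the chain for inclusion proofs" concrete, here is a minimal Merkle sketch using only `hashlib`. The hashing scheme (SHA-256, `0x00`/`0x01` domain prefixes, duplicating the last node on odd levels) is an assumption for illustration and is not a description of the arx-verify block format; RFC 3161 timestamp validation is out of scope here.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_level(leaves: list[bytes]) -> list[bytes]:
    if not leaves:
        raise ValueError("a block needs at least one event")
    return [_h(b"\x00" + leaf) for leaf in leaves]  # 0x00 marks a leaf hash


def _parent_level(level: list[bytes]) -> list[bytes]:
    # 0x01 marks an interior node, so a leaf can never pose as a node.
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    level = _leaf_level(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the odd node
        level = _parent_level(level)
    return level[0]


def inclusion_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the bool is True if the sibling is on the left."""
    if not 0 <= index < len(leaves):
        raise IndexError("leaf index out of range")
    level, proof = _leaf_level(leaves), []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _parent_level(level)
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """What a verifier recomputes: only the leaf, the proof, and the anchored root."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return node == root
```

The verifier needs nothing from ARX beyond what is already in the customer's bucket, which is the property the auditor relies on. A production format would want a padding rule that avoids the duplicate-node ambiguity.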

---

5. Cost model — pick one and defend it

Recommendation: per-governed-action pricing, with a base platform fee.

Annual price = Platform base + (per-tenant per-action) × (action volume) − ramp discounts at volume

Platform base: $250K (Shared SaaS), $1.5M (Single-tenant cell). Includes up to N agents, X actions/month.
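Per-action overage: $0.0001 to $0.001 per governed action (10⁻⁴ to 10⁻³ dollars per decision). Worked example: 10K agents × 100 actions/day × 365 days = 365M actions/year; at $0.0005 per action that is about $182K in overage, before the volume discount and ignoring the included allowance.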
Volume discount: at >100M actions/year, blended rate drops by 50%.
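The pricing formula is simple enough to check in a few lines. This sketch assumes the volume discount halves the rate on all billable actions once annual volume passes the threshold (the text does not say whether it applies marginally), and it treats the included allowance as a parameter because N and X are not fixed above.

```python
def annual_price(platform_base: float,
                 actions_per_year: int,
                 included_actions: int = 0,
                 per_action_rate: float = 0.0005,
                 discount_threshold: int = 100_000_000,
                 discount: float = 0.5) -> float:
    """Platform base plus per-action overage, with the volume discount applied."""
    billable = max(0, actions_per_year - included_actions)
    # Assumption: crossing the threshold discounts every billable action.
    rate = per_action_rate
    if actions_per_year > discount_threshold:
        rate *= 1 - discount
    return platform_base + billable * rate


actions = 10_000 * 100 * 365  # agents x actions/day x days
list_overage = annual_price(0, actions, discount=0.0)  # overage at list rate
shared_saas_total = annual_price(250_000, actions)     # base + discounted overage
```

At list rate the overage comes to roughly $182.5K; with the 50% volume discount it drops to roughly $91K on top of the $250K Shared SaaS base.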

Why not per-agent: per-agent encourages customers to under-deploy agents to control cost — wrong incentive for a governance product. The product gets better the more agents are governed.

Why not per-seat: irrelevant to the value delivered. Seats are humans; agents aren't humans.

Why not per-trace: trace volume is too noisy (some agents emit hundreds of spans per action; others emit a handful). Hard to forecast.

Per-governed-action is the right unit because it's the unit of value: every action either passed through or was blocked. Customers pay for governance applied. A CFO can model the spend from current agent action volumes. The price anchors against the alternative cost: an unmediated agent action that breaks something costs N × the per-action governance fee.

The hidden cost the customer cares about: infrastructure passthrough. ClickHouse + Kafka + Postgres at 100M actions/year is real money. Either bundle into the per-action fee (simple, customer doesn't care) or break it out (transparent, more work for the customer's procurement). Recommend bundle.

---

6. Failure domains

The platform's reliability story comes down to which components, when down, take customer-visible operations down with them.

```mermaid
flowchart TB
  subgraph FAILCLOSED["FAIL-CLOSED: outage stops governed actions"]
    F1["Identity issuer (C2.1)"]
    F2["Credential broker (C2.2)"]
    F3["Distributed revocation list (C6.2)"]
    F4["PDP bundle distributor first-fetch path (C3.3)"]
  end

  subgraph FAILDEGRADED["FAIL-DEGRADED: actions continue; freshness lags"]
    D1["Discovery broker (C1.1)"]
    D2["Trace ingest / OTLP (C5.1)"]
    D3["Evidence emitter back-fill (C4.1)"]
    D4["Cost attribution (C5.2)"]
    D5["DRL push freshness (C6.2)"]
  end

  subgraph FAILSILENT["FAIL-SILENT: non-customer-impacting"]
    S1["Vendor questionnaire renderer (C4.4)"]
    S2["Engagement Canvas (consultant tool)"]
    S3["Workforce dashboard"]
    S4["Quarterly evidence package generator"]
  end

  subgraph DATAPLANE["DATA PLANE LOCAL: survives full control-plane outage for ≤4h"]
    P1["In-process PDP w/ last-known bundle"]
    P2["DRL bloom filter snapshot in bundle"]
    P3["Local audit-event buffer (replay on recovery)"]
  end
```

The reliability contract

Fail-closed components are 99.99%-SLO targets (52 minutes of downtime / year max). These are the "the company gets sued if these are down for an hour" components.
Fail-degraded components are 99.9% SLO (8.7 hours / year max). The platform keeps governing; some queries lag.
Fail-silent components are 99% SLO (87 hours / year max). Internal-facing or async work.
The data-plane local survival window is 4 hours — the maximum time the customer's agents can keep operating with the control plane fully unreachable, on the strength of locally-cached PDP bundles + DRL bloom filters + local audit buffers. Beyond 4 hours, configurable: continue with last-known data (risky for fast-changing policies) or freeze.
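The downtime budgets follow directly from the SLO percentages, and the survival window is a two-branch decision. A short sketch, with `outage_mode` and its mode names being hypothetical labels for the two configurable behaviors described above:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def downtime_budget_hours(slo_percent: float) -> float:
    """Maximum yearly downtime, in hours, that still meets the SLO."""
    return HOURS_PER_YEAR * (100 - slo_percent) / 100


def outage_mode(seconds_unreachable: float,
                survival_window_s: float = 4 * 3600,
                beyond_window: str = "freeze") -> str:
    """What a data-plane PEP does while the control plane is unreachable."""
    if beyond_window not in ("freeze", "continue-stale"):
        raise ValueError(f"unknown mode: {beyond_window!r}")
    if seconds_unreachable <= survival_window_s:
        return "enforce-cached"  # last-known bundle + DRL snapshot + local buffer
    return beyond_window
```

For reference, 99.99% works out to about 52.6 minutes per year, 99.9% to about 8.76 hours, and 99% to about 87.6 hours.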

The customer's incident-review story when ARX has an outage

Honest framing for the day this happens: "ARX's control plane was unreachable for X minutes. During that window, your in-process PDPs continued to enforce on the last bundle they had cached, the DRL bloom filter blocked actions for known-revoked agents, and audit events buffered locally. Newly-issued tokens couldn't be minted, so any agent process that started in that window failed to bootstrap. Your existing in-flight agents continued to operate within their last-known policy. Audit events backfilled to the chain on recovery, anchored with a single timestamp range covering the outage." That is a defensible incident review. Don't ship a platform where the answer is "and your agents stopped."

---

7. The single architectural decision that, if wrong, kills the company

The PDP must run in-process at every PEP.

If the PDP is a remote call across the network for every agent action, the platform's latency story becomes "we add 30-100ms to every action your agents take" — which is enough that the customer will route their highest-volume agent traffic around ARX. The remote-PDP architecture is also impossibly expensive at 10K-agent scale (every action is a network round-trip + a Cedar evaluation; that's a hot path with serious capacity demands).

The in-process PDP design depends on:

A policy language with a fast in-process evaluator (Cedar, in milliseconds; not a custom DSL or remote service).
A bundle format that's safe to distribute and small enough to hot-load (Cedar bundles are a few hundred kilobytes for typical policy sets; not a problem).
Customers willing to install the PEP in their agent runtimes, sidecars, MCP gateways, native action hooks. This is the operational cost the customer pays for the architecture.

If the third dependency is wrong, that is, if customers refuse to install in-process PEPs and demand a fully remote enforcement model, then either (a) the platform absorbs the latency hit and loses the high-volume customers, or (b) the platform reduces enforcement scope to what can survive on a 50-100ms per-action budget (only high-risk actions get checked; the rest pass through with logging only). Both are bad.

This is the bet. It's defensible — the entire service-mesh ecosystem (Istio, Linkerd, Consul Connect) succeeded on this exact pattern. Customers in 2026 are operationally sophisticated enough to install sidecars and SDKs. But it's the bet that has to work.

The single secondary risk: customer resistance to running the eBPF shadow-agent sensor. Without it, shadow discovery degrades to log-and-egress-inference — which finds maybe half the shadow agents instead of nearly all. Plan for the half-coverage scenario; sell the full-coverage scenario to customers who'll deploy the sensor.

---

Honest pushback (per the workflow)

The architecture above carries one assumption that is structurally hard to validate before a customer is in production: that the PDP's in-process Cedar evaluator stays under the latency budget at every PEP topology. It will work in the SDK and sidecar paths. It probably won't work natively in the Salesforce Apex sandbox (Apex can't execute arbitrary native code; the PDP for Apex has to be a remote call to a regional ARX node). It will work in the egress proxy. It will work in the MCP gateway. It will partially work in the platform-native action hooks (depends per platform on whether they allow in-process libraries).

So the realistic latency story is: p99 < 5ms for in-process PEPs, p99 < 30ms for the platforms where in-process is impossible, p99 < 60ms for egress proxies that have to call out to a regional ARX node. The single remote-call scenario is acceptable because Apex is calling out for a single permit/deny and the rest of its action runs locally. Don't over-promise. The "<5ms p99" line gets walked back to "<5ms in-process; 30ms remote-call" the moment a sophisticated buyer asks.
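Since the latency assumption is hard to validate before production, it helps to state the budgets as data and check measured samples against them. The sketch below uses the three p99 figures from the paragraph above; the topology keys and the nearest-rank percentile method are assumptions, not an existing ARX test harness.

```python
# p99 budgets in milliseconds, taken from the latency story above.
P99_BUDGET_MS = {
    "in_process": 5.0,     # SDK, sidecar, MCP gateway
    "remote_call": 30.0,   # platforms where in-process is impossible (e.g. Apex)
    "egress_remote": 60.0, # egress proxies calling a regional ARX node
}


def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of decision latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def within_budget(topology: str, samples_ms: list[float]) -> bool:
    """True if the measured p99 is strictly under the topology's budget."""
    return p99(samples_ms) < P99_BUDGET_MS[topology]
```

Running a check like this per PEP topology during a pilot gives the buyer measured numbers instead of a single headline figure.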