01 — Component decomposition


For each of the six capabilities, the minimum viable subsystem, the data structures and APIs, the protocols/standards depended on, the runtime topology, and the failure-mode behavior. Then the single technical assumption most likely to be wrong.

The naming convention used below: C{cap}.{component}, e.g. C1.1 is component 1 of capability 1 (the discovery broker).

---

Capability 1 — Discovery (sanctioned + shadow agents in <24h, all platforms)

C1.1 — Discovery broker

Minimum viable subsystem: a per-platform poller + event-receiver pool that asks each agent fabric "list your agents and their owners" through whichever interface it exposes. For platforms without a list-agents API, a fallback that reads platform-side audit logs and infers agent existence from observed action signatures.
Data structures: DiscoveredAgent { id, platform, platform_native_id, owner_human_user_id_or_unknown, declared_capabilities, observed_actions_30d, first_seen, last_seen, source: "list_api" | "audit_inference" | "egress_inference" | "scim_inferred" | "mcp_handshake", confidence: 0.0–1.0 }. Stored content-addressed; identity grounded in (platform, platform_native_id).
APIs: POST /v1/discovery/scan (per-platform, async, returns scan_id), GET /v1/discovery/agents (search/filter), GET /v1/discovery/scan/{id} (status), POST /v1/discovery/sources (configure where to look). Webhook ingress at POST /v1/discovery/ingest for platforms that push (Foundry tenant audit stream, Salesforce Event Monitoring).
Protocols / primitives depended on: SCIM 2.0 read for human-owner inference where the platform synced user identities (RFC 7643 for the schema, RFC 7644 for the protocol). Microsoft Graph for Foundry/Copilot Studio. Salesforce Event Monitoring API. Google Cloud Asset Inventory + Vertex/Gemini admin APIs. AWS Bedrock GetAgent + ListAgents. ServiceNow Now Platform AI Agent table queries. UiPath Orchestrator API. watsonx Orchestrate REST. For open frameworks: file-system + container-image scanning (docker history, pip freeze, requirements.txt static analysis) + outbound-traffic egress capture.
Runtime topology: control-plane workers (Celery) + per-tenant scheduler. No data-plane component for the API-driven side. For shadow discovery, an optional eBPF egress sensor as a customer-side daemonset (k8s) or VM agent (see C1.2).
Failure mode if down: discovery falls behind. Existing-agent inventory is not affected. New shadow agents go undetected for the duration. Fail-degraded, not fail-closed (no enforcement is gated by discovery being up).
Latency / freshness target: new sanctioned agent discovered within 60 seconds of the platform-side event where the platform pushes events, and within one poll interval (≤10 minutes) otherwise. New shadow agent discovered within 24 hours of first egress.
Single technical assumption most likely to be wrong: that the seven commercial fabrics will expose enough metadata via admin APIs to enumerate agents reliably. Salesforce and ServiceNow probably will (they want partner-program ecosystems). Microsoft Foundry, AWS Bedrock AgentCore, Google Gemini Enterprise are the wagers — they may expose enumeration only to first-party SIEM hooks or charge for it. Falsification: within the first 60 days, attempt a no-special-access enumeration against each of the seven; if more than two return "must use our SIEM connector" or "need a partner agreement", the product needs a different discovery posture for those platforms.
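A minimal sketch of the DiscoveredAgent record described above, assuming ISO 8601 UTC timestamps that sort lexicographically and a canonical-JSON SHA-256 as the content address. The `merge` helper and its "highest confidence wins" rule are illustrative assumptions, not something the design specifies.

```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json


@dataclass(frozen=True)
class DiscoveredAgent:
    platform: str
    platform_native_id: str
    source: str            # "list_api", "audit_inference", "egress_inference", ...
    confidence: float      # 0.0 to 1.0
    first_seen: str        # ISO 8601 UTC (assumed to sort lexicographically)
    last_seen: str
    owner_human_user_id_or_unknown: str = "unknown"
    declared_capabilities: tuple = ()

    @property
    def identity_key(self):
        # Identity is grounded in (platform, platform_native_id).
        return (self.platform, self.platform_native_id)

    def content_address(self) -> str:
        # Hash of the canonical JSON form; any attribute change yields a new address.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge(existing: DiscoveredAgent, observed: DiscoveredAgent) -> DiscoveredAgent:
    """Combine two observations of the same agent (hypothetical merge rule)."""
    if existing.identity_key != observed.identity_key:
        raise ValueError("observations describe different agents")
    best = observed if observed.confidence >= existing.confidence else existing
    return replace(
        best,
        first_seen=min(existing.first_seen, observed.first_seen),
        last_seen=max(existing.last_seen, observed.last_seen),
    )


# Example: an agent first inferred from audit logs, later confirmed by a list API.
inferred = DiscoveredAgent("bedrock", "AGENT123", "audit_inference", 0.4,
                           "2026-01-01T00:00:00Z", "2026-01-02T00:00:00Z")
listed = DiscoveredAgent("bedrock", "AGENT123", "list_api", 1.0,
                         "2026-01-03T00:00:00Z", "2026-01-03T00:00:00Z",
                         owner_human_user_id_or_unknown="u-42")
merged = merge(inferred, listed)
```

The merged record keeps the earliest first_seen from the inferred sighting while adopting the owner and source of the higher-confidence observation.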

C1.2 — Shadow-agent egress sensor (customer-side)

Minimum viable subsystem: an optional eBPF program (cilium/tetragon-style) attached to the kubelet socket + node network namespace that observes outbound TLS handshakes to known LLM-vendor SNI hostnames (api.openai.com, api.anthropic.com, generativelanguage.googleapis.com, bedrock-runtime.*.amazonaws.com, *.foundry.azure.com, etc.) and to known agent-fabric egress patterns. Emits a ShadowSignal to the control plane carrying the workload identity, the target hostname, a request count, and a request-signature hash.
Data structures: ShadowSignal { tenant_id, observation_window, source_workload_id, sni_hostname, dest_ip, request_count, hash_of_first_request_line, observed_at }.
Protocols: eBPF (libbpf), Kubernetes admission webhooks for daemonset rollout, OpenTelemetry exporter to control plane.
Runtime topology: customer-side data-plane component. Required to live in the customer environment for this to have any teeth — running the sensor in the control plane requires backhauling all egress traffic, which no security buyer will accept.
Failure mode: shadow detection lags. Sanctioned-agent governance is not affected.
Single assumption most likely to be wrong: that customers will install a privileged eBPF daemonset specifically for ARX. They may not. Falsification: ask three design-partner CISOs in pre-sales whether they'd install it. If two of three say "only via our existing Cilium / Falco / Tetragon deployment", we ship an integration with those instead and abandon a first-party sensor.
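The control-plane side of the sensor can be sketched as a hostname matcher plus an aggregator. This is a sketch only: it assumes the sensor has already extracted (workload, SNI hostname) pairs, uses shell-style wildcards for the hostname patterns listed above, and fills only a subset of the ShadowSignal fields. The function names are hypothetical.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Patterns taken from the hostnames listed in C1.2; "*" matches any run of characters.
LLM_VENDOR_SNI_PATTERNS = (
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
    "bedrock-runtime.*.amazonaws.com",
    "*.foundry.azure.com",
)


def _normalize(hostname: str) -> str:
    # SNI hostnames are case-insensitive; drop a trailing root dot if present.
    return hostname.lower().rstrip(".")


def is_llm_vendor_sni(hostname: str) -> bool:
    host = _normalize(hostname)
    return any(fnmatchcase(host, pattern) for pattern in LLM_VENDOR_SNI_PATTERNS)


def aggregate_shadow_signals(observations, tenant_id, observation_window):
    """Group matching (workload_id, sni_hostname) observations into ShadowSignal dicts."""
    counts = Counter(
        (workload_id, _normalize(host))
        for workload_id, host in observations
        if is_llm_vendor_sni(host)
    )
    return [
        {
            "tenant_id": tenant_id,
            "observation_window": observation_window,
            "source_workload_id": workload_id,
            "sni_hostname": host,
            "request_count": count,
        }
        for (workload_id, host), count in sorted(counts.items())
    ]


observations = [
    ("ns/app-a", "api.openai.com"),
    ("ns/app-a", "API.OpenAI.com"),
    ("ns/app-b", "bedrock-runtime.us-east-1.amazonaws.com"),
    ("ns/app-b", "example.com"),
]
signals = aggregate_shadow_signals(observations, "tenant-a", "2026-01-01T00:00Z/PT5M")
```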

C1.3 — Inventory store + change feed

MVS: a Postgres table + Kafka topic. Agents and their attributes are versioned (slowly-changing-dimension type 2). A new owner, new permission scope, or new platform-native config emits an inventory.change event consumed by policy/identity/observability subsystems.
APIs: internal only. Consumed by C2 (identity), C3 (policy), C5 (observability).
Failure mode: the change feed is the loose-coupling spine. If Kafka is down, downstream subsystems read directly from Postgres at degraded freshness.
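The type-2 versioning and change feed can be illustrated with an in-memory stand-in for the Postgres table and Kafka topic. The class and field names (`InventoryStore`, `valid_from`, `valid_to`) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentVersion:
    agent_id: str
    attributes: dict
    valid_from: str
    valid_to: Optional[str] = None   # None marks the current version


class InventoryStore:
    def __init__(self):
        self.rows = []          # stand-in for the versioned Postgres table
        self.change_feed = []   # stand-in for the Kafka topic

    def current(self, agent_id):
        for row in reversed(self.rows):
            if row.agent_id == agent_id and row.valid_to is None:
                return row
        return None

    def upsert(self, agent_id, attributes, now) -> bool:
        """Close the current version and open a new one if anything changed."""
        cur = self.current(agent_id)
        if cur is not None and cur.attributes == attributes:
            return False                      # no change, no event
        old = cur.attributes if cur is not None else {}
        changed = sorted(k for k in set(old) | set(attributes)
                         if old.get(k) != attributes.get(k))
        if cur is not None:
            cur.valid_to = now                # type-2: never overwrite history
        self.rows.append(AgentVersion(agent_id, dict(attributes), now))
        self.change_feed.append({"type": "inventory.change", "agent_id": agent_id,
                                 "changed": changed, "at": now})
        return True


store = InventoryStore()
created = store.upsert("a1", {"owner": "alice", "scope": "crm.read"}, "2026-01-01T00:00:00Z")
unchanged = store.upsert("a1", {"owner": "alice", "scope": "crm.read"}, "2026-01-02T00:00:00Z")
reowned = store.upsert("a1", {"owner": "bob", "scope": "crm.read"}, "2026-01-03T00:00:00Z")
```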

---

Capability 2 — Identity (cryptographic per-agent identity, bound to human owner, ZSP scoped credentials)

C2.1 — Identity issuer (NHI authority)

MVS: a workload-identity issuer that mints JWT-SVID tokens (SPIFFE) per agent identity, one per (agent, target audience, scope) tuple, with TTL ≤ 60 minutes. SVID spiffe_id = spiffe://<tenant>.arx/<platform>/<agent_id>. Subject claims include the human owner's verified identity (OIDC sub from the customer's IdP) bound at issuance time, signed with the per-tenant signing key.
Data structures: AgentIdentity { spiffe_id, tenant_id, platform, platform_native_id, human_owner_oidc_sub, manifest_hash, public_key_thumbprint, created_at, last_rotated_at, status }. Signing keys per tenant in HashiCorp Vault Transit (envelope-encrypted with a tenant KEK).
APIs: POST /v1/identity/agents (issue), POST /v1/identity/agents/{id}/rotate, POST /v1/identity/agents/{id}/revoke, POST /v1/identity/token (mint a scoped JWT-SVID for an audience), GET /.well-known/jwks.json per tenant. Conform to OAuth 2.0 Token Exchange (RFC 8693) — agents present an upstream identity (their MCP runtime token, their AWS-Bedrock IAM role, their Foundry managed identity) and ARX exchanges it for a scoped target-system credential.
Protocols / primitives: SPIFFE/SPIRE workload identity. JWT-SVID format. RFC 8693 Token Exchange. RFC 7519 JWT. OAuth 2.1 + FAPI 2.0 hardening profile (mTLS, sender-constrained tokens via DPoP RFC 9449). Fallback where a target system requires mTLS rather than bearer tokens: short-lived X.509-SVID certificates.
Runtime topology: control-plane API + per-tenant Vault Transit signing. No data-plane involvement except the SDK/sidecar that fetches+caches a JWT-SVID for the duration of a single agent action.
Failure mode if down: fail-closed. If the identity issuer is unreachable, agents cannot mint new tokens; cached tokens drain to expiry and then enforcement gates close. The single most consequential failure mode in the platform — an extended outage halts every governed agent. Mitigation: tenant-local SPIRE replica that can issue offline against a pre-seeded trust bundle for a bounded window (≤4h).
Single assumption most likely to be wrong: that downstream commercial platforms will accept an externally-issued JWT-SVID as a valid identity claim for their agents. They almost certainly won't for v1 — Salesforce wants a Salesforce identity, Foundry wants an Entra managed identity. The realistic posture is that the SPIFFE identity is ARX-internal (used for policy, audit, kill-switch routing) and ARX maintains a mapping table to each platform's native identity. Calling this "cryptographic per-agent identity" requires this nuance up front. Falsification: survey the seven platforms' identity-acceptance posture in Q1; the result will determine whether Capability 2 is an issuer or a binder.
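A sketch of the claim set for one JWT-SVID under the rules above (SPIFFE ID as subject, single audience, TTL capped at 60 minutes). Signing is deliberately left out, since it would go through the per-tenant key in Vault Transit. The `scope` and `owner_sub` claim names are assumptions for illustration; the design does not name them.

```python
import time

MAX_TTL_SECONDS = 60 * 60   # TTL ≤ 60 minutes, per C2.1


def spiffe_id(tenant: str, platform: str, agent_id: str) -> str:
    return f"spiffe://{tenant}.arx/{platform}/{agent_id}"


def build_jwt_svid_claims(tenant, platform, agent_id, audience, scope,
                          owner_oidc_sub, ttl_seconds=900, now=None):
    """Return the unsigned claim set for one (agent, audience, scope) tuple."""
    if not 0 < ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError("ttl_seconds must be in (0, 3600]")
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "sub": spiffe_id(tenant, platform, agent_id),   # JWT-SVID subject is the SPIFFE ID
        "aud": [audience],
        "iat": issued_at,
        "exp": issued_at + ttl_seconds,
        "scope": scope,                 # hypothetical claim name
        "owner_sub": owner_oidc_sub,    # hypothetical claim name for the bound human owner
    }


claims = build_jwt_svid_claims("acme", "salesforce", "agent-7", "https://crm.example",
                               "crm.read", "oidc|alice", ttl_seconds=900, now=1_700_000_000)
```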

C2.2 — Credential broker (zero-standing-privilege)

MVS: for every (agent, target_system) pair, mint a short-lived target-system credential just-in-time at action invocation. AWS: sts:AssumeRole with session policy. GCP: iam.serviceAccounts.signJwt for federation. Salesforce: per-call OAuth 2.0 named-credential exchange. Microsoft: managed-identity federated credential or on-behalf-of flow. Generic: OAuth 2.0 client_credentials scoped per-call with aud = target API, no refresh token persisted.
Data structures: CredentialBinding { agent_id, target_platform, scope_template, issuer_role, max_ttl_seconds, last_issued_at }. The credential itself is never persisted post-issuance.
APIs: POST /v1/credentials/issue (server-to-server only, called by the policy engine after PERMIT verdict) — returns { token, expires_at, audience, scope_actually_granted }. POST /v1/credentials/revoke cascades to every active binding.
Protocols: STS-style flows per cloud, RFC 8693 token exchange where supported, OAuth 2.0 client credentials with audience binding (RFC 8707 resource indicators).
Runtime topology: control plane. The mint happens in the policy-engine's same request path (≤50ms budget for the credential mint or the action latency target dies).
Failure mode if down: fail-closed. Same outage profile as C2.1.
Single assumption most likely to be wrong: that the policy engine's per-action credential mint stays under 50ms p99. STS calls into AWS regularly take 100-300ms. Mitigation is hot-path credential caching with TTL ≤ 60s and revocation invalidation. Falsification: measure end-to-end with a 10-platform synthetic load by end of Q2; if cached p99 > 80ms, the credential model needs revisiting.

C2.3 — Owner-binding registry

MVS: mapping table (agent_id → owner_human_user_id, owner_email, owner_team, ownership_attested_at, attestation_method). Owner is the human personally accountable for the agent's behavior. Re-attestation cadence configurable (default quarterly).
APIs: POST /v1/identity/agents/{id}/owner (assign / re-attest), GET /v1/identity/agents/orphans (no live owner — typical when a human leaves the company).
Failure mode: an orphaned agent is automatically suspended after a configurable grace period. This is the only place in the system where an HR signal (departure) propagates to NHI termination automatically.

---

Capability 3 — Runtime policy enforcement (action-level, across 11 platforms)

C3.1 — Policy decision point (PDP)

MVS: an in-process Cedar policy evaluator (AWS-open-sourced; deterministic; sub-ms typical evaluation) loaded from a per-tenant policy bundle. Bundle contains entity schema (Agent, Action, Target, Context) + Cedar policies + entity store deltas. Bundle hot-reloads from object storage on a 30-second poll, with version pinning. Per-tenant bundle distinct from platform-default bundle; combined at evaluation time.
Data structures: the bundle = (schema.cedarschema, policy_set.cedar, entities.json, bundle_version, tenant_id, signed_at, signature). Decision = Decision { effect: Permit|Forbid, determining_policies, errors, evaluation_ms }.
APIs: internal only — Decision evaluate(principal, action, resource, context). Decision logs streamed to Kafka topic agent.policy.decisions (every decision, not every audit row).
Protocols: Cedar policy language (open-source AWS spec). Bundle distribution is a glorified S3-versioned-object pull with conditional-GET. Decision logging follows OpenTelemetry log SDK.
Runtime topology: decentralized PDP. Every place a governed action is intercepted (the SDK sidecar, the MCP gateway, the platform-side enforcer) embeds the Cedar evaluator in-process. No cross-network call per decision. This is how p99 < 5ms gets achieved.
Failure mode if down: the PDP itself can't be down — it's in-process. The bundle distributor can fail; the PDP keeps using its last-known bundle. If a never-bootstrapped sidecar can't reach the bundle distributor, fail-closed (deny all). Configurable per tenant.
Latency budget: p99 < 5ms for evaluation. p99 < 50ms for bundle hot-reload (background, not on the hot path).
Single assumption most likely to be wrong: that all 11 enforcement points can run an in-process Cedar evaluator. They can't — Salesforce Apex is a sandbox, Foundry agents may be running inside Microsoft's runtime, Bedrock AgentCore may not allow third-party libraries. For those, the PDP becomes a remote call to a regional ARX enforcement node — adds 10-30ms but is unavoidable. Falsification: for each of the 7 commercial platforms, by Q2 confirm whether ARX can run code in-process on their action path. The three that say "no" determine the architecture.
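The bundle failure behavior (keep the last-known bundle on distributor failure, deny everything if never bootstrapped) can be sketched as follows. The rule matcher here is a toy stand-in and not Cedar; it only mirrors Cedar's two core semantics, default deny and forbid-overrides-permit. Signature verification and version pinning are omitted, and all names are hypothetical.

```python
class BundleHolder:
    def __init__(self, fetch_bundle, fail_closed_until_bootstrapped=True):
        self._fetch_bundle = fetch_bundle
        self._fail_closed = fail_closed_until_bootstrapped
        self._bundle = None   # last-known-good bundle

    def refresh(self) -> bool:
        """Poll the distributor; on any failure keep the last-known bundle."""
        try:
            candidate = self._fetch_bundle()
        except Exception:
            return False
        if self._bundle is None or candidate["bundle_version"] > self._bundle["bundle_version"]:
            self._bundle = candidate
            return True
        return False

    def evaluate(self, principal, action) -> str:
        if self._bundle is None:
            # Never bootstrapped: deny all unless the tenant opted out.
            return "Forbid" if self._fail_closed else "Permit"
        matched = [rule["effect"] for rule in self._bundle["rules"]
                   if rule["principal"] in ("*", principal)
                   and rule["action"] in ("*", action)]
        if "forbid" in matched:
            return "Forbid"                                   # forbid overrides permit
        return "Permit" if "permit" in matched else "Forbid"  # default deny


# Example: distributor down, then serving v1, then down again.
_responses = [
    ConnectionError("distributor unreachable"),
    {"bundle_version": 1, "rules": [
        {"effect": "permit", "principal": "agent-1", "action": "*"},
        {"effect": "forbid", "principal": "*", "action": "crm.delete"},
    ]},
    ConnectionError("distributor unreachable"),
]


def _fake_fetch():
    item = _responses.pop(0)
    if isinstance(item, Exception):
        raise item
    return item


holder = BundleHolder(_fake_fetch)
holder.refresh()
before_bootstrap = holder.evaluate("agent-1", "crm.read")
holder.refresh()
after_load = holder.evaluate("agent-1", "crm.read")
holder.refresh()
during_outage = holder.evaluate("agent-1", "crm.read")
overridden = holder.evaluate("agent-1", "crm.delete")
unknown_principal = holder.evaluate("agent-2", "crm.read")
```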

C3.2 — Policy enforcement point (PEP) — five flavors

For different platforms, the PEP topology differs:

| PEP topology | Used when | Mechanism |
|---|---|---|
| In-process SDK | OpenAI Agents SDK, LangChain/LangGraph, CrewAI, AutoGen | Customer adds an ARX guardrail/middleware that wraps tool calls + LLM calls. PDP runs in the same process. |
| MCP gateway | Any MCP-speaking client | A network proxy that speaks MCP protocol; intercepts every tools/call and tools/list; queries PDP; rewrites or rejects. Intercepts over the standard MCP transport (stdio or SSE/HTTP). |
| Sidecar | Customer-self-managed runtimes (the customer's Python agent in their k8s) | A sidecar container in the same pod that the agent's outbound traffic is routed through (envoy proxy + ext_authz to PDP). |
| Platform-native action hook | Salesforce Agentforce (Apex managed package), ServiceNow AI Agents (Flow Designer step), UiPath (Studio activity), watsonx Orchestrate (skill wrapper) | Native package/extension installed on the platform that calls back to the PDP. |
| Egress proxy | Foundry, Bedrock AgentCore, Gemini Enterprise (where ARX cannot install in-process) | The customer's network egress for that platform routes through an ARX regional egress node. Limited fidelity (can't see prompt, can see tool call args + LLM API call destinations + structured tool calls). Best-available where in-process is impossible. |

Data structures: every PEP emits EnforcementEvent { event_id, agent_id, action_id, principal_spiffe_id, action_name, target, context_hash, decision, decision_ms, pep_topology, pep_version, occurred_at }.
APIs (PEP → control plane): POST /v1/enforcement/log (batch; OTLP).
Failure mode: PEPs that embed the PDP (SDK, sidecar, MCP gateway) can be configured fail-closed or fail-open per (tenant, action_class). Egress proxies fail-closed by definition (if the proxy is down, the egress can't get out). Native action hooks fail per platform's behavior (typically fail-open in Salesforce, fail-closed in ServiceNow when configured as required).
Single assumption most likely to be wrong: that customers will deploy a sidecar for every governed agent's runtime. Some will; many won't because of the operational burden in production envs. The MCP-gateway flavor is the more realistic universal entry point because MCP adoption is increasing fast and the gateway pattern is operationally invisible to the customer.
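A sketch of the per-(tenant, action_class) failure-mode switch for a PEP, emitting a reduced EnforcementEvent. The `FAIL_MODES` table, the `enforce` function, and the `fail_mode_applied` field are assumptions for illustration; the PDP is passed in as a plain callable.

```python
import time
import uuid

# Hypothetical per-(tenant, action_class) configuration; anything absent is fail-closed.
FAIL_MODES = {("tenant-a", "read_only"): "open"}
DEFAULT_FAIL_MODE = "closed"


def enforce(pdp, tenant_id, agent_id, action_name, action_class, target,
            pep_topology="in_process_sdk"):
    """Ask the PDP for a decision; if it errors, apply the configured fail mode."""
    started = time.perf_counter()
    fail_mode_applied = None
    try:
        decision = pdp(agent_id, action_name, target)
    except Exception:
        fail_mode_applied = FAIL_MODES.get((tenant_id, action_class), DEFAULT_FAIL_MODE)
        decision = "Permit" if fail_mode_applied == "open" else "Forbid"
    event = {
        "event_id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "action_name": action_name,
        "target": target,
        "decision": decision,
        "decision_ms": (time.perf_counter() - started) * 1000.0,
        "pep_topology": pep_topology,
        "fail_mode_applied": fail_mode_applied,   # hypothetical extra field
    }
    return decision == "Permit", event


def _unreachable_pdp(agent_id, action_name, target):
    raise ConnectionError("PDP unavailable")


allowed_closed, event_closed = enforce(_unreachable_pdp, "tenant-a", "agent-1",
                                       "payments.refund", "payments", "stripe")
allowed_open, event_open = enforce(_unreachable_pdp, "tenant-a", "agent-1",
                                   "search.query", "read_only", "kb")
allowed_normal, event_normal = enforce(lambda a, n, t: "Permit", "tenant-a", "agent-1",
                                       "search.query", "read_only", "kb")
```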

C3.3 — Policy authoring + bundle distribution

MVS: a Cedar policy editor in the ARX UI + a CLI for arx policy push. Bundles validated with cedar validate before publish. Each publish creates a new immutable bundle version stored in S3 with object lock; the PDPs poll a tenant-scoped manifest pointing at the latest version.
APIs: POST /v1/policies/bundles (upload), POST /v1/policies/bundles/{id}/promote (mark as live for a tenant), GET /v1/policies/bundles/{id}/diff/{prev_id} (compare).
Failure mode: an authoring outage doesn't stop enforcement; it stops policy changes.

---

Capability 4 — Compliance evidence (continuous, mapped to 6+ frameworks)

C4.1 — Evidence emitter

MVS: every EnforcementEvent, every AuditEvent, every IdentityEvent, every ManifestChange is tagged at emission time with the framework controls it serves as evidence for. Mapping is a static (event_type, condition) → control_id[] table maintained per framework. Tags written into the same event at write time so evidence retrieval is a query, not a join.
Data structures: EvidenceTag { event_id, framework: "soc2_2017"|"iso_42001_2023"|"eu_ai_act"|"nist_ai_rmf_1_0"|"owasp_nhi_top10_2025"|"owasp_agentic_top10_2025", control_id, source_attribution: "{file_path}:{line_range}@{release_sha}" }.
APIs: internal, but exposed read-only as GET /v1/compliance/controls/{framework}/{control_id}/evidence (paginated event list).
Source-attribution requirement: every (event_type, control) mapping carries the source-line-range of the code path that produced the event, hash-pinned to the ARX release SHA. Built at CI time by a script that walks the codebase looking for # control: CC6.1 annotations + emits a generated evidence_map.json artifact alongside each release.
Failure mode: if the emitter is down, events still write — they just don't get evidence-tagged. A back-fill job re-tags events when the emitter recovers (the mapping is idempotent and event-IDs are stable).
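The CI-time annotation walk could be sketched like this. It assumes annotations take the form `# control: ID[, ID...]` on a single line, scans only `*.py` files, and records the annotation's own line as a one-line range. The function names and the output layout of evidence_map.json are illustrative, not the actual script.

```python
import json
import re
from pathlib import Path

CONTROL_RE = re.compile(
    r"#\s*control:\s*(?P<ids>[A-Za-z0-9_.\-]+(?:\s*,\s*[A-Za-z0-9_.\-]+)*)"
)


def scan_text(text, file_path, release_sha):
    """Return one mapping entry per control ID annotated in the given source text."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = CONTROL_RE.search(line)
        if match is None:
            continue
        for control_id in re.split(r"\s*,\s*", match.group("ids")):
            entries.append({
                "control_id": control_id,
                # Format from C4.1: "{file_path}:{line_range}@{release_sha}"
                "source_attribution": f"{file_path}:{lineno}-{lineno}@{release_sha}",
            })
    return entries


def build_evidence_map(root, release_sha):
    """Walk a source tree and collect every annotation into one list."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root).as_posix()
        entries.extend(scan_text(path.read_text(encoding="utf-8"), relative, release_sha))
    return entries


def write_evidence_map(root, release_sha, out_path="evidence_map.json"):
    payload = {"release_sha": release_sha, "mappings": build_evidence_map(root, release_sha)}
    Path(out_path).write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")


_sample = "def revoke(agent_id):\n    emit('identity.revoked')  # control: CC6.1, CC6.3\n"
sample_entries = scan_text(_sample, "app/identity.py", "abc1234")
```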

C4.2 — Audit chain (tamper-evident, customer-anchored)

MVS: audit events batched into Merkle blocks every N seconds (target: 1 block/5s/tenant). Each block: { block_id, prev_block_hash, merkle_root, event_count, time_window, tenant_id, signed_at, rfc3161_timestamp_token }. Block written to customer-controlled S3 / Azure Blob / GCS via cross-account role assumption. Customer holds an arx-verify CLI that walks the chain in their bucket, verifies Merkle inclusion proofs, validates RFC 3161 timestamps against the configured timestamp authority, and emits a single chain-integrity status — all without any API call back to ARX.
Data structures: AuditBlock as above. Per-event AuditEvent { event_id, tenant_id, agent_id, principal_spiffe_id, action, target, decision, occurred_at, redacted_fields[], hash_of_full_payload, evidence_tags[] }. PII/secret payloads stored encrypted with the tenant's CMEK; only a hash appears in the chain.
Protocols: RFC 3161 Time-Stamp Protocol (TSP) — independent timestamp authority (e.g., DigiCert TSA, FreeTSA). Cross-account S3 role assumption. AWS S3 Object Lock or Azure immutable blob storage on the customer side (configured by the customer).
APIs: POST /v1/audit/destinations (configure customer bucket + role), GET /v1/audit/chain/status (current block height + last-anchored timestamp), the standalone arx-verify CLI shipped via PyPI.
Failure mode: if the cross-account write fails, blocks queue locally and replay when the destination becomes writable. If the customer destination is permanently unreachable (mis-configured role), the chain head is still locally durable but un-anchored — surface this on the Board View.
Single assumption most likely to be wrong: that customers will accept a 5-second batching latency on audit anchoring. CISOs may want sub-second; auditors don't care. Real risk is that some auditors require a specific timestamp authority (e.g., a public-CA-rooted TSA) and the platform must let the customer configure which.
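The block chaining can be illustrated with a small hash-chain and Merkle-root sketch. It is a simplification under stated assumptions: a plain binary Merkle tree with leaf/node domain separation and an odd node carried up unchanged, full root recomputation instead of per-event inclusion proofs, and no RFC 3161 timestamp handling. This is not the arx-verify implementation.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def merkle_root(payloads) -> bytes:
    """Root of a simple binary Merkle tree over event payloads (bytes)."""
    if not payloads:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(b"\x00" + p).digest() for p in payloads]      # leaf hashes
    while len(level) > 1:
        nxt = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()  # inner nodes
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])          # odd node is carried up unchanged
        level = nxt
    return level[0]


def block_hash(block) -> str:
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_block(block_id, tenant_id, prev_block_hash, payloads):
    return {"block_id": block_id, "tenant_id": tenant_id,
            "prev_block_hash": prev_block_hash,
            "merkle_root": merkle_root(payloads).hex(),
            "event_count": len(payloads)}


def verify_chain(blocks, payloads_per_block) -> bool:
    """Check hash links and recompute each block's Merkle root from its events."""
    if len(blocks) != len(payloads_per_block):
        return False
    prev = GENESIS_HASH
    for block, payloads in zip(blocks, payloads_per_block):
        if block["prev_block_hash"] != prev:
            return False
        if block["event_count"] != len(payloads):
            return False
        if block["merkle_root"] != merkle_root(payloads).hex():
            return False
        prev = block_hash(block)
    return True


events_1 = [b"evt-1", b"evt-2", b"evt-3"]
events_2 = [b"evt-4"]
b1 = make_block(1, "tenant-a", GENESIS_HASH, events_1)
b2 = make_block(2, "tenant-a", block_hash(b1), events_2)
chain_ok = verify_chain([b1, b2], [events_1, events_2])
tampered_ok = verify_chain([b1, b2], [[b"evt-1", b"evt-X", b"evt-3"], events_2])
```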

C4.3 — Evidence package builder

MVS: quarterly (and on-demand) job that compiles, per-framework, a package = (evidence event lists) + (control mapping with source attribution) + (auditor-verifiable Merkle proofs for inclusion of each event in the chain) + (a one-page exec summary) + (the arx-verify CLI binary + script). PDF for human consumption + structured JSON-LD for ingestion by GRC platforms.
APIs: POST /v1/compliance/packages (build), GET /v1/compliance/packages/{id} (download). Pre-configured framework support: SOC 2 (TSC 2017 + 2022 update), ISO/IEC 42001:2023 AI management systems, EU AI Act (Article 9 risk management, Article 10 data governance, Article 11 technical documentation, Article 12 record-keeping, Article 14 human oversight, Article 17 quality management, Article 26 deployer obligations), NIST AI RMF 1.0 + GenAI Profile NIST AI 600-1, OWASP NHI Top 10 (2025), OWASP Agentic AI Threats and Mitigations (Feb 2025), CSA AI Controls Matrix.
Failure mode: a stale package is OK (the chain is the source of truth). A failed package build is a customer-facing alert.

C4.4 — Vendor-questionnaire renderer

MVS: a templated renderer for SIG (Shared Assessments), CAIQ (CSA Cloud Controls Matrix v4), HECVAT, Vanta-questionnaire-export, OpenQ. Pulls live from C4.1's evidence store; produces customer-shareable PDFs/spreadsheets with hash-pinned answers.
Failure mode: non-critical; this is a productivity feature for the customer's TPRM team.

---

Capability 5 — Unified observability + cost (one queryable trace per agent action with cost, audit, behavior correlated)

C5.1 — Trace ingest (OpenTelemetry GenAI semantic conventions)

MVS: an OTLP/HTTP receiver that accepts traces, metrics, and logs. Spans must conform to OpenTelemetry GenAI semantic conventions (the gen_ai.* attribute namespace — gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.tool.name, gen_ai.agent.id, etc.). For platforms that don't emit GenAI conventions natively (most of them, today), the PEP (C3.2) emits the spans on their behalf at intercept time.
Data structures: standard OTLP. The trace key is (tenant_id, agent_id, action_id); spans within an action share trace_id. Multi-agent flows (A2A delegations) propagate trace_id across agent boundaries via W3C Trace Context.
APIs: POST /v1/otlp/v1/traces, POST /v1/otlp/v1/logs, POST /v1/otlp/v1/metrics. PromQL/LogQL-compatible read API for query.
Storage: ClickHouse for span store (column-oriented, fast aggregations on gen_ai.usage.*). 90-day hot, 7-year cold (parquet on S3). Cardinality discipline mandatory — gen_ai.prompt/gen_ai.completion payloads stored separately, not as span attributes.
Failure mode: ingest queue (Kafka) buffers up to 24h; longer outage drops spans (but not enforcement events, which are on a different durability tier).

C5.2 — Cost attribution

MVS: every span carrying gen_ai.usage.input_tokens + gen_ai.usage.output_tokens joined against a per-tenant model_prices table → derived gen_ai.usage.cost_usd. For tool calls without token counts (action-level), cost = (tool_call_unit_price × N) where unit prices come from a per-target table customers configure (Salesforce API call quota, AWS API GB-out, etc.). Rolled up by (tenant, agent, cohort, function, day).
Data structures: model_prices { provider, model_id, input_per_1k_tokens, output_per_1k_tokens, effective_from }. tool_prices { tool_uri, unit_cost, effective_from }.
APIs: GET /v1/usage/agents/{id}, GET /v1/usage/cohorts/{id}, GET /v1/usage/org/summary?group_by=agent|function|day.
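Failure mode: cost lags are tolerable. Pricing is event-time: each span is costed with the price row in effect at the span's timestamp, so historical spans are recomputed only when a price row covering their timestamp is added or corrected.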
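Event-time pricing against the model_prices table can be sketched as a lookup of the latest price row whose effective_from is not after the span's timestamp. The sketch assumes ISO 8601 UTC strings in one consistent format, Decimal prices, and a hypothetical `timestamp` key on the span.

```python
from bisect import bisect_right
from decimal import Decimal


class PriceTable:
    def __init__(self, rows):
        # rows: (provider, model_id, input_per_1k_tokens, output_per_1k_tokens, effective_from)
        self._history = {}
        for provider, model_id, inp, out, effective_from in rows:
            self._history.setdefault((provider, model_id), []).append(
                (effective_from, Decimal(inp), Decimal(out)))
        for history in self._history.values():
            history.sort()

    def price_at(self, provider, model_id, at):
        """Latest price row with effective_from <= at."""
        history = self._history.get((provider, model_id), [])
        index = bisect_right([row[0] for row in history], at) - 1
        if index < 0:
            raise LookupError(f"no price for {provider}/{model_id} at {at}")
        _, inp, out = history[index]
        return inp, out


def span_cost_usd(prices, span) -> Decimal:
    inp, out = prices.price_at(span["gen_ai.system"], span["gen_ai.request.model"],
                               span["timestamp"])
    return (Decimal(span["gen_ai.usage.input_tokens"]) / 1000 * inp
            + Decimal(span["gen_ai.usage.output_tokens"]) / 1000 * out)


prices = PriceTable([
    ("vendor-x", "model-1", "0.50", "1.50", "2026-01-01T00:00:00Z"),
    ("vendor-x", "model-1", "0.25", "0.75", "2026-03-01T00:00:00Z"),
])
span = {"gen_ai.system": "vendor-x", "gen_ai.request.model": "model-1",
        "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 1000}
cost_before = span_cost_usd(prices, {**span, "timestamp": "2026-02-15T10:00:00Z"})
cost_after = span_cost_usd(prices, {**span, "timestamp": "2026-03-10T10:00:00Z"})
```

The same span shape costs 2.50 under the January price row and 1.25 after the March price change; a span from February keeps its February price.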

C5.3 — Trace ↔ audit ↔ enforcement correlation

MVS: every AuditEvent, every EnforcementEvent, and every OTLP span share (trace_id, span_id, agent_id, action_id). The unified-query API joins across all three stores so a single GET /v1/traces/{trace_id} returns the full picture: prompts (redacted as policy dictates) → tool calls → enforcement decisions → downstream API calls → cost. This is the "single queryable trace" deliverable.
APIs: GET /v1/traces/{trace_id} (full join), POST /v1/traces/search (across stores).
Failure mode: if the join fails (one store's row missing), return what's available with partial: true. Never hide enforcement decisions even if the corresponding span is missing.
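A sketch of the partial-join behavior: rows from the three stores are matched on trace_id, and the response is flagged partial when any store lacks a row for an action_id seen elsewhere. The stores are plain lists here, and the `missing` field is an assumption added for illustration.

```python
def get_trace(trace_id, spans, audit_events, enforcement_events):
    """Join the three stores on trace_id; never drop enforcement decisions."""
    def rows(store):
        return [row for row in store if row["trace_id"] == trace_id]

    found = {
        "spans": rows(spans),
        "audit_events": rows(audit_events),
        "enforcement_events": rows(enforcement_events),
    }
    all_action_ids = {row["action_id"] for store_rows in found.values() for row in store_rows}
    missing = {}
    for name, store_rows in found.items():
        absent = sorted(all_action_ids - {row["action_id"] for row in store_rows})
        if absent:
            missing[name] = absent
    return {"trace_id": trace_id, **found, "partial": bool(missing), "missing": missing}


spans = [{"trace_id": "t1", "action_id": "a1", "name": "tool.call"}]
audit_events = [{"trace_id": "t1", "action_id": "a1"}, {"trace_id": "t1", "action_id": "a2"}]
enforcement_events = [
    {"trace_id": "t1", "action_id": "a1", "decision": "Permit"},
    {"trace_id": "t1", "action_id": "a2", "decision": "Forbid"},
    {"trace_id": "t2", "action_id": "a9", "decision": "Permit"},
]
trace = get_trace("t1", spans, audit_events, enforcement_events)
```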

C5.4 — Behavioral telemetry (for drift detection input)

MVS: spans + audit events fed into the drift detector (already implemented in app/core/drift_detector.py). Each drift event becomes a first-class entity emitted on the same correlation keys.

---

Capability 6 — Instant kill switch (propagation across every system the agent has touched, in seconds)

C6.1 — Kill-switch orchestrator

MVS: a single POST /v1/agents/{id}/terminate (or POST /v1/cohorts/{id}/terminate, or POST /v1/org/{tenant_id}/emergency_stop) wraps a saga across:

1. Set agent status terminated in inventory (atomic; future PEP queries see it as denied).
2. Push a terminated:{spiffe_id} revocation entry into the distributed revocation list (DRL, see C6.2). PDPs at every PEP read the DRL on every decision; new decisions deny within ≤ DRL-TTL seconds.
3. Cascade-revoke every CredentialBinding (delete the row + actively call the issuer revoke endpoint where supported). AWS has no API to revoke an individual STS session, so the available lever is a deny policy on the issuing role conditioned on aws:TokenIssueTime; iam:DeleteAccessKey applies only to long-lived access keys. OAuth-issued tokens are revoked per RFC 7009.
4. Issue platform-specific kill commands where the platform supports them: Foundry agent disable via Microsoft Graph, Salesforce Agentforce flow deactivation via Tooling API, Bedrock agent prepare-with-disabled-actions, ServiceNow flow pause, UiPath Orchestrator job-stop, watsonx skill disable.
5. Drain in-flight runs (signal Docker executor for self-managed; for foreign platforms, best-effort cancel via platform API).
6. Emit agent.terminated event into the audit chain (anchored to customer S3).
7. Generate exit-attestation PDF (what the agent did, what it had access to, when terminated, by whom), written to a customer-pinned location.

Data structures: Termination { id, scope: agent|cohort|org, initiated_by, initiated_at, steps: TerminationStep[], status: in_flight|complete|partial_failure, attestation_url }.
APIs: POST /v1/agents/{id}/terminate, POST /v1/cohorts/{id}/terminate, POST /v1/org/{tenant_id}/emergency_stop (admin + quorum required). Webhook event agent.terminated.completed.
Latency target: p95 ≤ 30 seconds end-to-end for a single-agent termination including DRL propagation. Org-wide emergency-stop ≤ 60 seconds.
Failure mode: partial-failure is the normal state, not the exception. If Salesforce's Tooling API is down when termination fires, that step is queued + retried with exponential backoff for up to 24h. The agent's enforcement-side denial (#1 + #2) is unaffected — even if the platform-side kill (#4) is delayed, ARX-mediated actions are blocked immediately. Critically, steps #1 and #2 are fail-closed and cannot be skipped; everything else is best-effort.

C6.2 — Distributed revocation list (DRL)

MVS: a Redis Cluster (per region) holding a sorted set of (spiffe_id → revocation_timestamp). Every PEP fetches deltas every 1s (push via Redis Streams to subscribed PEPs, fallback to pull). PDP evaluation includes a single Redis lookup (≤2ms) per decision. DRL entry persists for max(token_ttl, 30 days) — long enough that any token issued before revocation is dead.
Data structures: RevocationEntry { spiffe_id, revoked_at, reason_code, ttl_until }. Plus a bloom filter baked into every bundle (C3.3) for offline PDPs.
Protocols: Redis 7+ streams. OAuth 2.0 Token Introspection (RFC 7662) applied to JWT-SVIDs as a fallback for clients that prefer pull.
Failure mode: fail-closed. If a PEP can't reach the DRL within its 1s freshness window, it falls back to the bloom filter shipped in its bundle for a configurable retention window; past that window, the PEP can be configured to deny all new actions until the DRL is reachable.
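The three-tier revocation check (fresh DRL, bundled bloom filter, deny-all) can be sketched as below. The bloom filter is a minimal SHA-256-based one written for illustration, the 300-second bloom window is an arbitrary placeholder for the configurable value, and `revoked_ids` stands in for the PEP's locally synced DRL view.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = 0   # bit array stored as one integer

    def _positions(self, item):
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item) -> bool:
        # No false negatives; false positives are possible and result in a deny.
        return all((self.bits >> position) & 1 for position in self._positions(item))


def revocation_verdict(spiffe_id, revoked_ids, bloom, seconds_since_sync,
                       freshness_window=1.0, bloom_window=300.0, deny_when_stale=True):
    """Return "deny" or "continue" for the revocation stage of a decision."""
    if seconds_since_sync <= freshness_window:
        return "deny" if spiffe_id in revoked_ids else "continue"
    if seconds_since_sync <= bloom_window:
        # DRL unreachable: combine the stale local view with the bundled bloom filter.
        if spiffe_id in revoked_ids or bloom.might_contain(spiffe_id):
            return "deny"
        return "continue"
    return "deny" if deny_when_stale else "continue"


revoked = "spiffe://acme.arx/salesforce/agent-7"
healthy = "spiffe://acme.arx/salesforce/agent-8"
bundle_bloom = BloomFilter()
bundle_bloom.add(revoked)
empty_bloom = BloomFilter()
```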

C6.3 — Exit attestation generator

MVS: at termination, a deterministic report bundling: declared intent at termination time, last 90 days of action counts, last credential-binding state, all approval decisions, all drift events, the final audit-chain block ID. Signed by the platform and the requesting human; written to the customer's audit destination as a .tar.gz.

---

Cross-cutting observations

The fail-closed components are the ones that determine the platform's reliability story: Identity issuer (C2.1), Credential broker (C2.2), DRL (C6.2). All three are on the synchronous hot path of every governed action. If any of them is unreachable for a tenant for more than a few minutes, that tenant's agents stop. This is the right tradeoff, since security platforms should fail closed, but it raises the bar on the SRE story to "five nines for a small set of components" rather than "three nines for everything."
Cedar over OPA/Rego for the PDP because Cedar was designed for application-authorization-style entities (Principal, Action, Resource, Context) which is the right shape for agent governance. OPA/Rego is more flexible but slower and harder to audit. This is a non-default choice and should be defended explicitly to anyone who asks.
OpenTelemetry GenAI semantic conventions are immature as of May 2026 (still in active spec evolution). Pinning to a specific spec version and lobbying for spec stability via the OTel SIG is part of the work, not just adopting the standard.
MCP and A2A are emerging interop standards — MCP for tool calls, A2A for agent-to-agent. Both are rapidly evolving. The architecture should not bet on protocol stability beyond 6 months.