72-Hour Deployment Runbook — Manifests to Live
arxsec-site / docs/deployment/72-hour-runbook.md
Audience: ARX engagement team and customer's CHRO operations team running a coordinated go-live. Read end-to-end before deployment day.
One sentence: Atlas has produced the manifest set, the executive team (CHRO/CFO/CISO/CEO) has signed off, and we have 72 hours to take 10,000–20,000 agents from manifest YAMLs to live in the customer's production workforce console.
---
Prerequisites
This runbook assumes the following are already in place. If any prerequisite is missing, do not start the clock — fix the gap and re-baseline.
| # | Prerequisite | Owner | Verification |
|---|---|---|---|
| 1 | Atlas deployed in customer's K8s cluster | Customer SRE | `kubectl -n arx-atlas get pods` shows `Running` |
| 2 | NetworkPolicy verified by customer's auditor | Customer Security | `kubectl get networkpolicy atlas-egress-lock -o yaml` matches `docs/atlas/network-policy.yaml` |
| 3 | Atlas's manifest set produced and reviewed | CHRO + Atlas | The manifest set is in customer's S3 bucket, signed |
| 4 | Executive sign-off recorded | CEO + CHRO + CFO + CISO | Sign-off captured in customer's S3 audit bucket |
| 5 | ARX SaaS connector inventory matches customer's stack | ARX engagement team | All connectors named in manifests are deployed |
| 6 | HRIS (Workday) sync configured | Customer IT | Workday connector authenticates and lists employees |
| 7 | Per-manager queue infrastructure provisioned | ARX | Each customer manager has a workforce-console queue |
| 8 | Audit chain S3 bucket + KMS key configured | Customer Security | Witness signature lands every 5 min |
Hard stop: if Atlas's manifest set cannot be verified against the customer's signed approval, do not proceed. Re-cycle through Atlas's audit capability to identify the discrepancy.
---
The 72-hour timeline
```
T-0           T+24          T+48          T+72
│             │             │             │
├── PROVISION ──── VALIDATE ──── GO LIVE ─┤
│             │             │             │
├── all       ├── shadow    ├── cohort    │
│   agents    │   mode +    │   waves     │
│   created   │   pre-flight│   to live   │
│             │   checks    │             │
```
Three windows. Each window has named milestones, named owners, and exit criteria that must pass before the next window starts.
---
Window 1 — PROVISION (T+0 to T+24)
Goal: all 10K–20K agents instantiated, credentials issued, managers bound, ready for validation.
T+0:00 to T+0:30 — Hand-off and intake
- [ ] T+0:00 — Customer's CHRO ops team and ARX engagement team
meet on shared comms (Slack channel #arx-deploy-day).
- [ ] T+0:05 — Customer signs the deployment go-ahead;
sign-off lands in audit chain.
- [ ] T+0:15 — ARX engagement engineer pulls the signed
manifest set from customer's S3 bucket.
- [ ] T+0:25 — ARX runs `arxctl validate-manifests` against every manifest in the set. Any failure → stop the clock, escalate.
Exit criteria: manifest set verified, sign-off recorded, no validation failures.
T+0:30 to T+4:00 — Bulk instantiation
- [ ] T+0:30 — `POST /v1/agents/cohorts/from-manifests` with the full manifest set. Response is a single `cohort_set_id` and a per-cohort job array.
- [ ] T+1:00 — Pre-flight validator (N3) runs in parallel:
checks each manifest against connector availability, HRIS manager resolution, S3 audit bucket reachability, and runtime budget plausibility.
- [ ] T+2:00 — First batch (~2,000 agents) created. Check the
Roster page in workforce console — agents appear with status=provisioning.
- [ ] T+3:00 — Bulk HRIS sync (N4) completes. Each agent's
reporting_manager_email resolves against Workday's employee→manager mapping in a single snapshot.
- [ ] T+4:00 — All agents created. Status transitions to
provisioned (shadow).
Exit criteria: 100% of agents have status=provisioned, every manifest validated, every manager bound.
Failure mode: if pre-flight validator returns >2% failures, halt provisioning and triage. Common causes: (a) connector credential expired during the sign-off window, (b) HRIS sync stale, (c) manifest references a connector not yet deployed.
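The halt rule above can be sketched as a small gate. This is a minimal illustration, not the N3 validator itself: `PreflightResult`, `failure_rate`, and `should_halt` are hypothetical names, and the record shape is an assumption. Only the 2% threshold comes from this runbook.

```python
from __future__ import annotations

from dataclasses import dataclass

# Threshold from the runbook: more than 2% failures halts provisioning.
HALT_THRESHOLD = 0.02


@dataclass(frozen=True)
class PreflightResult:
    """One manifest's pre-flight outcome (hypothetical shape)."""
    manifest_id: str
    passed: bool
    reason: str = ""  # e.g. "credential_expired", "hris_stale", "connector_missing"


def failure_rate(results: list[PreflightResult]) -> float:
    """Fraction of manifests that failed pre-flight; 0.0 for an empty set."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if not r.passed)
    return failed / len(results)


def should_halt(results: list[PreflightResult],
                threshold: float = HALT_THRESHOLD) -> bool:
    """True when the failure rate is strictly above the threshold."""
    return failure_rate(results) > threshold
```

The comparison is strict, so exactly 2% failures does not halt; that matches the ">2%" wording above.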
T+4:00 to T+12:00 — Per-agent credential issuance
- [ ] T+4:00 — Credential vault begins minting per-agent root credentials. Throughput target: ~1,500 credentials/hour, which covers ~10K credentials in ~7h. A deployment at the 20K end needs ~2,500 credentials/hour to finish inside this 8-hour window.
- [ ] T+8:00 — First sample of agents (1% from each cohort) executes a no-op smoke test against its declared connector. The test verifies that each credential is valid and that the intercept API returns verdict=ALLOW.
- [ ] T+12:00 — All agents have credentials issued and
verified.
Exit criteria: every agent's credential is in the vault and returns verdict=ALLOW on a no-op call.
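A quick way to check whether a given agent count fits the credential-issuance window is to run the arithmetic directly. This is a sketch: the function names are hypothetical, and the 8-hour window is read off the T+4:00 to T+12:00 heading.

```python
import math

WINDOW_HOURS = 8  # T+4:00 to T+12:00


def hours_to_mint(agent_count: int, credentials_per_hour: int) -> float:
    """Hours needed to mint one credential per agent at a steady rate."""
    return agent_count / credentials_per_hour


def required_rate(agent_count: int, window_hours: float = WINDOW_HOURS) -> int:
    """Minimum whole credentials/hour needed to finish inside the window."""
    return math.ceil(agent_count / window_hours)


def fits_window(agent_count: int, credentials_per_hour: int,
                window_hours: float = WINDOW_HOURS) -> bool:
    """True when minting completes within the window."""
    return hours_to_mint(agent_count, credentials_per_hour) <= window_hours
```

At 1,500 credentials/hour, 10,000 agents take about 6.7 hours and fit. 20,000 agents would take about 13.3 hours, so the upper end of the range needs roughly 2,500 credentials/hour.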
Failure mode: if any cohort fails its smoke test, isolate the cohort, do not advance it past Window 1. Other cohorts continue.
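The sampling and isolation steps above could look roughly like this. The helper names and the dict-of-lists data shape are assumptions for illustration. Any verdict other than `ALLOW` is treated as a failure, since the runbook names no other verdict values.

```python
from __future__ import annotations

import math
import random


def smoke_sample(cohorts: dict[str, list[str]], fraction: float = 0.01,
                 seed: int | None = None) -> dict[str, list[str]]:
    """Pick ~1% of agent IDs from each cohort, at least one per non-empty cohort."""
    rng = random.Random(seed)
    sample: dict[str, list[str]] = {}
    for cohort_id, agent_ids in cohorts.items():
        if not agent_ids:
            sample[cohort_id] = []
            continue
        k = max(1, math.ceil(len(agent_ids) * fraction))
        sample[cohort_id] = rng.sample(agent_ids, k)
    return sample


def partition_cohorts(verdicts: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Split cohorts into (advance, isolate).

    A cohort advances only if it has at least one verdict and all are ALLOW.
    """
    advance: list[str] = []
    isolate: list[str] = []
    for cohort_id, cohort_verdicts in verdicts.items():
        ok = bool(cohort_verdicts) and all(v == "ALLOW" for v in cohort_verdicts)
        (advance if ok else isolate).append(cohort_id)
    return advance, isolate
```

Isolated cohorts stay in Window 1 while the rest continue, which keeps one bad connector credential from stalling the whole deployment.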
T+12:00 to T+24:00 — Audit chain bootstrap
- [ ] T+12:00 — First witness signature lands in customer's S3
audit bucket from the provisioning events (10K+ entries).
- [ ] T+18:00 — Customer's auditor runs `arxctl verify-chain` against the customer's S3 bucket. Verifies hash chain integrity from infrastructure the customer controls.
- [ ] T+24:00 — Audit chain steady-state: witness signatures
every 5 minutes, no integrity failures.
Exit criteria: audit chain integrity verified by customer's own tooling. Window 1 complete.
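For readers unfamiliar with what a hash chain integrity check involves, here is a generic sketch of the idea. It is not the `arxctl verify-chain` implementation and does not reflect the actual ARX audit entry format. The entry fields (`prev_hash`, `payload`, `hash`), the genesis value, and the use of SHA-256 over canonical JSON are all assumptions.

```python
from __future__ import annotations

import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": entry_hash(prev, payload)})


def first_broken_index(chain: list[dict]) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        if entry["prev_hash"] != prev:
            return i
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return i
        prev = entry["hash"]
    return None
```

Because each hash covers the previous one, editing any earlier entry invalidates that entry and breaks the link to everything after it. That property is what lets the customer verify integrity from their own infrastructure.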
---
Window 2 — VALIDATE (T+24 to T+48)
Goal: every cohort runs in shadow mode against its declared connectors. No real writes commit. Telemetry validates expected behavior.
T+24:00 to T+30:00 — Shadow-mode activation
- [ ] T+24:00 — All cohorts move from `provisioned (shadow)` to `running (shadow)`. Shadow-mode toggle (N2) is configured at cohort level: every intercept call returns the verdict the production runtime would return, but the connector layer returns dry-run output instead of committing writes.
- [ ] T+25:00 — Each cohort executes against its
examples/request.json payload (smoke test). Results visible in workforce console under "Cohort Telemetry."
- [ ] T+30:00 — First 6 hours of shadow telemetry recorded.
Drift baseline set.
Exit criteria: every cohort produces shadow-mode results matching the manifest's declared output schema.
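The shadow-mode behavior described above (same verdict as production, dry-run output instead of a committed write) can be illustrated with a small dispatcher. This is a sketch under stated assumptions: `execute`, `CallResult`, and the three callables are hypothetical stand-ins for the intercept API and connector layer, not their real interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CallResult:
    verdict: str      # what the policy decided, identical in both modes
    committed: bool   # True only when a real write happened
    output: dict      # connector output (dry-run or real)


def execute(action: dict, *, shadow: bool,
            policy: Callable[[dict], str],
            commit: Callable[[dict], dict],
            dry_run: Callable[[dict], dict]) -> CallResult:
    """Evaluate the policy, then route to dry-run or commit based on the toggle."""
    verdict = policy(action)  # evaluated the same way in shadow and live
    if verdict != "ALLOW":
        return CallResult(verdict, False, {})
    if shadow:
        # Shadow mode: produce output for telemetry, never touch the target system.
        return CallResult(verdict, False, dry_run(action))
    return CallResult(verdict, True, commit(action))
```

Keeping the policy call outside the `shadow` branch is the point of the design: the verdicts recorded during Window 2 are the ones production would have produced.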
T+30:00 to T+42:00 — Manager queue dry-run
- [ ] T+30:00 — Trigger one synthetic escalation per manager
queue (every customer manager who has at least one bound agent gets one synthetic approval request).
- [ ] T+34:00 — Track manager response rate. Target: 80% of
managers respond within 4 hours.
- [ ] T+38:00 — Auto-escalation kicks in for non-responding
managers (escalate to skip-level after 4h SLA).
- [ ] T+42:00 — Every manager queue has at least one decision
recorded. Slack/Teams notification adapter verified.
Exit criteria: every manager queue has a recorded decision on a synthetic escalation. Auto-escalation has fired at least once to verify the SLA path.
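The response-rate target and the SLA-based escalation in this dry-run can be checked with a few lines. This is a sketch: the function names and the `responses` mapping (manager email to response time, or `None` if no response yet) are assumptions. The 4-hour SLA and 80% target come from the checklist above.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Optional

SLA = timedelta(hours=4)
TARGET_RESPONSE_RATE = 0.80


def response_rate(sent_at: datetime,
                  responses: dict[str, Optional[datetime]]) -> float:
    """Fraction of managers who responded within the SLA."""
    if not responses:
        return 0.0
    on_time = sum(1 for r in responses.values()
                  if r is not None and r - sent_at <= SLA)
    return on_time / len(responses)


def needs_escalation(sent_at: datetime,
                     responses: dict[str, Optional[datetime]],
                     now: datetime) -> list[str]:
    """Managers with no response once the SLA has elapsed (escalate to skip-level)."""
    if now - sent_at < SLA:
        return []
    return sorted(m for m, r in responses.items() if r is None)
```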
T+42:00 to T+48:00 — Pre-go-live sign-off
- [ ] T+42:00 — Customer's CHRO reviews the cohort telemetry
dashboard. Anomalies flagged for cohort-level go/no-go decision.
- [ ] T+44:00 — Customer's CISO reviews the audit chain. Spot
checks 1% of entries to verify content matches expected shadow-mode behavior.
- [ ] T+46:00 — Customer's CFO reviews cost telemetry against
retainer-forecast. Dollar-denominated cohort projections compare cleanly to budget.
- [ ] T+48:00 — Pre-go-live sign-off recorded. Each
executive signs at the cohort-grouping level (e.g. "approve Engineering IC cohort group go-live").
Exit criteria: all three executives have signed at the cohort-grouping level. Window 2 complete.
Failure mode: if any cohort fails review, hold that cohort in shadow mode beyond T+48. Other cohorts proceed to Window 3 on schedule.
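One way to picture the hand-off into Window 3 is a planner that holds unapproved cohorts in shadow and assigns the rest to a wave by shape. This is a sketch: the cohort record fields (`id`, `shape`, `approved`) and the function name are hypothetical. The wave start hours follow the Window 3 schedule in this runbook.

```python
from __future__ import annotations

# Wave start hours (T+48, T+54, T+60), ordered from lowest to highest blast radius.
WAVE_START_HOURS = {"research": 48, "production": 54, "coordination": 60}


def plan_waves(cohorts: list[dict]) -> tuple[dict[int, list[str]], list[str]]:
    """Return (schedule, held).

    schedule maps a wave start hour to the approved cohort IDs going live then.
    held lists cohorts that stay in shadow mode beyond T+48.
    """
    schedule: dict[int, list[str]] = {h: [] for h in sorted(WAVE_START_HOURS.values())}
    held: list[str] = []
    for cohort in cohorts:
        if not cohort["approved"]:
            held.append(cohort["id"])
            continue
        schedule[WAVE_START_HOURS[cohort["shape"]]].append(cohort["id"])
    return schedule, held
```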
---
Window 3 — GO LIVE (T+48 to T+72)
Goal: cohorts move from shadow to live in waves. Each wave ramps over a defined window, monitored, with rollback at the ready.
T+48:00 to T+54:00 — Wave 1 (Research-shape cohorts)
Research-shape agents go first. They are read-only by design, so the live transition has the lowest blast radius.
- [ ] T+48:00 — Operator flips Research cohorts from shadow
to live in workforce console UI ("Cohort Go-Live" page, added in Phase 8).
- [ ] T+48:30 — Live telemetry begins. Compare observed call
volume against shadow-mode baseline. Variance >25% → pause.
- [ ] T+50:00 — Sample 0.5% of executions for human review
(random sample by ARX engagement engineer + customer's CHRO ops team).
- [ ] T+54:00 — Research wave fully live and stable.
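The pause rule in this wave (variance above 25% against the shadow baseline) can be expressed as a simple check. This sketch assumes "variance" means relative deviation of observed call volume from the baseline, in either direction. The function names are hypothetical.

```python
def relative_deviation(observed: float, baseline: float) -> float:
    """Absolute deviation from the shadow-mode baseline, as a fraction of it."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline


def should_pause(observed: float, baseline: float, limit: float = 0.25) -> bool:
    """True when live call volume deviates from baseline by more than the limit."""
    return relative_deviation(observed, baseline) > limit
```

Treating a drop as seriously as a spike is a deliberate choice here: a cohort that goes quiet after the flip may be failing silently.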
T+54:00 to T+60:00 — Wave 2 (Production-shape cohorts)
Production agents create artifacts (PRs, decks, drafts) but writes flow through manager approval. Each write is audit-chain-recorded.
- [ ] T+54:00 — Operator flips Production cohorts to live.
- [ ] T+54:30 — Manager queues observe the first real
escalations. Auto-escalation SLA monitored.
- [ ] T+56:00 — First 100 manager-approved writes commit to
customer systems. Audit chain entries verified.
- [ ] T+60:00 — Production wave fully live and stable.
T+60:00 to T+66:00 — Wave 3 (Coordination-shape cohorts)
Coordination agents make state changes across systems (route tickets, schedule meetings, assign owners). Multi-system writes give this wave the highest blast radius.
- [ ] T+60:00 — Operator flips Coordination cohorts to live.
- [ ] T+60:30 — Sample 1% of state changes for human review.
- [ ] T+62:00 — First multi-system orchestration completes
(e.g. Salesforce + Slack + calendar coordinated by one sales-ic-coordination-handoff-broker).
- [ ] T+66:00 — Coordination wave fully live and stable.
T+66:00 to T+72:00 — Operational handoff
- [ ] T+66:00 — ARX engagement team and customer's CHRO ops
team run the handoff session. Customer's ops team takes day-to-day operational ownership.
- [ ] T+68:00 — Customer's first manager-self-service action
(a real manager, not a designated test user, makes a real approval decision in the queue).
- [ ] T+70:00 — Customer's CFO confirms the deliverable
schedule's first quarter is in motion. Dollar-denominated productivity-gain measurement begins for the upcoming true-up.
- [ ] T+72:00 — Deployment closed. The CEO is briefed on the
cohort-by-cohort go-live status and the open watch items.
Exit criteria: customer's CHRO ops team has accepted day-to-day ownership. Audit chain stable. Manager queues active. First real-user actions recorded.
---
Rollback procedure
At any point during Windows 2 or 3, if a cohort exhibits unexpected behavior, the operator can roll back:
| Failure | Rollback action | Recovery time |
|---|---|---|
| Cohort write rate >2× expected | Flip back to shadow mode | <60s |
| Connector authenticating as wrong identity | Halt cohort, re-issue credentials | ~30 min |
| Audit chain integrity failure | Halt all cohorts, escalate to customer Security + ARX engineer | until resolved |
| Manifest validator detects drift in live | Halt the affected cohort, regenerate manifest via Atlas | ~2 hours |
| Customer-side connector outage | Pause cohorts that depend on it; non-affected cohorts continue | until connector restored |
The Termination capability (one-button revoke + halt + exit attestation) is always available and works at the cohort level or the per-agent level.
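The rollback table can be encoded as a lookup so that an on-call operator's tooling resolves a failure to an action and a scope. This is an illustrative sketch only: the enum, the `ROLLBACK` mapping, and the helper names are hypothetical, and the connector-outage row is simplified to the single reporting cohort rather than every dependent cohort.

```python
from __future__ import annotations

from enum import Enum


class Failure(Enum):
    WRITE_RATE_HIGH = "cohort write rate >2x expected"
    WRONG_IDENTITY = "connector authenticating as wrong identity"
    AUDIT_CHAIN_BROKEN = "audit chain integrity failure"
    MANIFEST_DRIFT = "manifest validator detects drift in live"
    CONNECTOR_OUTAGE = "customer-side connector outage"


# (action, scope) per the rollback table; scope is "cohort" or "all".
ROLLBACK = {
    Failure.WRITE_RATE_HIGH: ("flip back to shadow mode", "cohort"),
    Failure.WRONG_IDENTITY: ("halt cohort, re-issue credentials", "cohort"),
    Failure.AUDIT_CHAIN_BROKEN: ("halt all cohorts, escalate", "all"),
    Failure.MANIFEST_DRIFT: ("halt cohort, regenerate manifest via Atlas", "cohort"),
    Failure.CONNECTOR_OUTAGE: ("pause dependent cohorts", "cohort"),
}


def write_rate_exceeded(observed: float, expected: float, factor: float = 2.0) -> bool:
    """True when the observed write rate is more than `factor` times expected."""
    return observed > factor * expected


def affected_cohorts(failure: Failure, failing_cohort: str,
                     all_cohorts: list[str]) -> list[str]:
    """Cohorts the rollback action applies to, based on the table's scope."""
    _, scope = ROLLBACK[failure]
    return list(all_cohorts) if scope == "all" else [failing_cohort]
```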
---
Roles and ownership
| Role | Owner | Responsibility |
|---|---|---|
| Deployment Lead | ARX engagement engineer | Owns the clock. Calls go/no-go on each window. |
| Customer Lead | Customer CHRO | Owns the customer-side decisions. Signs at cohort-group level. |
| Security Lead | Customer CISO | Audit chain validation, NetworkPolicy verification, security incident response. |
| Operations Lead | Customer CHRO ops team | Takes day-to-day operational ownership at T+66. |
| Atlas Operator | Customer chief-of-staff | Operates Atlas (manifest regen, drift response). |
| ARX Standby | ARX SRE | On-call for any ARX-side incident throughout the 72 hours. |
| Connector Standby | Customer IT | On-call for connector / HRIS / S3 incidents. |
Hard rule: if any of the above roles are unavailable for the duration, do not start the clock. Reschedule.
---
Timeline budget summary
| Window | Duration | Owners | Output |
|---|---|---|---|
| 1. Provision | 24h | ARX + customer IT | All agents created, credentials minted, managers bound, audit chain bootstrapped |
| 2. Validate | 24h | ARX + customer CHRO/CISO/CFO | Shadow telemetry recorded, manager queue dry-runs, exec sign-off |
| 3. Go-Live | 24h | ARX + customer ops team | Cohorts live in waves, ops handoff, first real-user actions |
Each window has 4–6 hours of internal slack. Use that for unplanned issues, not for stretching task durations.
---
Communication cadence
- Hourly: ARX engagement engineer posts status in the `#arx-deploy-day` channel: current window, current milestone, on-track / behind / blocked.
- Every 6 hours: Deployment Lead and Customer Lead sync on shared
call (15 min, status + decisions).
- At each window boundary (T+24, T+48, T+72): All roles sync
(30 min, gate review).
---
What we're NOT doing in 72 hours
- Production-grade fine-tuning of agents. Each cohort is shipped
as it came from Atlas. Tuning happens in subsequent quarters via the customer's own engineering or ARX engagement team.
- The complete 84 × 3 cell coverage. Most customers deploy
~25–49 stock templates plus 100–200 customer-built variants. The full matrix fills over months/quarters, not 72 hours.
- Atlas implementation. Atlas is already deployed before this
runbook starts (per Prerequisite 1).
- Connector deep engineering. Any customer-specific homegrown
system requires its own connector slice, scheduled separately.
---
Pre-deployment checklist (1-week countdown)
| When | Activity |
|---|---|
| T-7d | Mock-deploy in customer's staging environment |
| T-5d | Atlas's manifest set frozen |
| T-3d | Executive sign-off captured |
| T-1d | All roles confirmed available for the 72-hour window |
| T-0 | Run this runbook |
---
Post-deployment
- T+72 to T+96: ARX engagement engineer monitors cohort
telemetry, flags any drift to customer's Operations Lead.
- T+96 to T+30d: First quarterly true-up window opens.
Productivity-gain measurement begins per the engagement letter.
- T+30d: First manager-skill-up review. Customer's managers
who responded slowly in Window 2 get a brief refresher on the approval-queue UX.
---
Versioning
- Runbook version: 1.0.0 (Phase 1 / Commit 5).
- Updated when the engineering capabilities backing the runbook
ship: bulk-instantiation (Phase 2), shadow-mode toggle (Phase 7), cohort go-live UI (Phase 8). Each phase landing updates the corresponding section of this runbook.
This runbook is the choreography. The engineering capabilities in Phases 2–8 are what make the choreography executable in 72 hours rather than 12 weeks.