Adversarial-Prompt Regression Suite (scaffold)

arxsec-app / tests/llm/red_team/README.md

This directory holds the adversarial-prompt regression suite for ARX's first-party AI surface: the LLM router (app/llm/) and related code. Per docs/governance/ai-risk-policy.md §5.1, mandatory practice 5, every first-party AI feature must have an eval suite registered here (or in app/llm/eval/); per mandatory practice 6, a pre-deploy gate must run those suites.

This is a scaffold. The corpus and the runner shape are real and exercised by test_red_team_corpus.py. What is intentionally NOT here yet:

  • A large corpus of vendor-targeted jailbreaks. Foundation-model vendors maintain those internally; ARX's corpus focuses on router-specific failure modes (scope escape, service-internal attribution, failover-induced disclosure, malformed-response handling) plus a small example set of vendor-targeted cases to document the pattern.
  • Live runs against real foundation-model providers. CI runs the corpus against mocked providers (deterministic, free). A separate scheduled job, which lives outside this directory, runs the same corpus against live providers periodically.
  • CI gating. The pytest entrypoint runs today; the deploy-gate hook is a follow-up.
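
To illustrate why mocked providers make CI runs deterministic, here is a minimal sketch of a canned-response provider. The class name MockProvider and its complete() method are hypothetical and are not ARX's real provider interface; the actual mocks are whatever test_red_team_corpus.py sets up.

```python
import hashlib


class MockProvider:
    """Hypothetical stand-in for a foundation-model provider (assumed interface)."""

    def __init__(self, canned):
        # canned maps an exact prompt string to the response returned for it.
        self._canned = dict(canned)
        self.calls = []  # recorded prompts, handy for assertions in tests

    def complete(self, prompt):
        self.calls.append(prompt)
        if prompt in self._canned:
            return self._canned[prompt]
        # Unknown prompts get a stable, content-derived reply so reruns match.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        return f"[mock-refusal:{digest}]"


provider = MockProvider({"ping": "pong"})
first = provider.complete("ignore previous instructions")
second = provider.complete("ignore previous instructions")
```

Because the reply depends only on the prompt text, the same corpus produces the same results on every run, with no network calls and no provider cost.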

Corpus structure

corpus.json is the source of truth. Each case has:

  • id: stable unique identifier.
  • category: one of:
      • jailbreak: try to bypass the model's safety policy.
      • injection: try to make the model take an action the user did not intend.
      • scope-bypass: try to cause the router to call a tier or provider the caller is not authorized for.
      • attribution: try to invoke the router in a way that produces no audit row (the canonical example: service-internal call without agent_id, see router.py lines 206–210).
      • crashy: known-malformed inputs that should fail gracefully rather than crash the worker.
  • severity: BLOCKER (CI-blocking) or WARN (alert-only).
  • description: what this case is actually testing.
  • input: the request payload or scenario.
  • expectation: human-readable description of "pass."
  • assertion: machine-readable predicate (currently a small DSL; see _eval_case in test_red_team_corpus.py).
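
As a rough illustration of the field list above, the sketch below builds one case as a Python dict and checks it structurally. The field names, categories, and severities come from this README; the example id, input payload, expectation text, and the placeholder assertion value are invented for illustration, and the real assertion DSL is whatever _eval_case accepts.

```python
REQUIRED_FIELDS = ("id", "category", "severity", "description",
                   "input", "expectation", "assertion")
CATEGORIES = {"jailbreak", "injection", "scope-bypass", "attribution", "crashy"}
SEVERITIES = {"BLOCKER", "WARN"}


def validate_case(case):
    """Return a list of structural problems; an empty list means the case is well formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in case]
    if "category" in case and case["category"] not in CATEGORIES:
        problems.append(f"unknown category: {case['category']!r}")
    if "severity" in case and case["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {case['severity']!r}")
    return problems


example_case = {
    "id": "example-attribution-001",  # hypothetical id
    "category": "attribution",
    "severity": "BLOCKER",
    "description": "Service-internal call made without agent_id.",
    "input": {"agent_id": None, "prompt": "summarize this ticket"},  # illustrative payload
    "expectation": "Illustrative: the call does not complete without an audit row.",
    "assertion": {"placeholder": True},  # NOT the real DSL; see _eval_case
}

problems = validate_case(example_case)
```

A check like this only catches shape errors (missing keys, misspelled categories); whether a case passes is still decided by the runner.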

Adding a case

  1. Open corpus.json and add a new case object to the cases array.
  2. If the assertion DSL cannot express the check, add a dedicated regression test for the case.
  3. If the case is for a class not yet covered, document it in this README first.
  4. Run pytest tests/llm/red_team/ locally.
  5. Submit the change. The case becomes part of the pre-deploy gate once the CI wiring lands.
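
Step 1 can also be scripted. The helper below is a hypothetical sketch, not part of the suite: it assumes corpus.json is a JSON object whose "cases" key holds the array, and it refuses to add a case whose id already exists, since ids are meant to be unique.

```python
import json
from pathlib import Path


def add_case(corpus_path, new_case):
    """Append new_case to the "cases" array and return the new case count."""
    path = Path(corpus_path)
    corpus = json.loads(path.read_text(encoding="utf-8"))
    cases = corpus.setdefault("cases", [])  # assumed top-level layout
    if any(existing.get("id") == new_case["id"] for existing in cases):
        raise ValueError(f"duplicate case id: {new_case['id']}")
    cases.append(new_case)
    # Stable indentation keeps diffs small when the file is reviewed.
    path.write_text(json.dumps(corpus, indent=2) + "\n", encoding="utf-8")
    return len(cases)
```

Editing corpus.json by hand works just as well; the point of the sketch is the duplicate-id check, which is easy to miss in a growing file.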

Tracker rows

This scaffold partially closes:

  • SAF.4 (adversarial-prompt regression suite): the directory and runner exist; real corpus growth is incremental.
  • SAF.5 (pre-release eval gate): the pytest entrypoint runs today; the CI gate is a follow-up.

A row moves to Met when:

  • The corpus covers every threat class enumerated in docs/security/threat-model-llm-router.md (T-1 through T-8) with at least one case each, and
  • The CI pipeline blocks merges when any BLOCKER case fails.
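
The two conditions above could be checked mechanically along the following lines. This is a sketch under assumptions: the "threat" key on each case is hypothetical (the corpus fields listed in this README do not include a threat-class mapping), and the function names are invented. Only the T-1 through T-8 range and the BLOCKER/WARN semantics come from this document.

```python
REQUIRED_THREATS = [f"T-{n}" for n in range(1, 9)]  # T-1 through T-8


def missing_threats(cases):
    """Return the threat classes that no case claims to cover."""
    covered = {case.get("threat") for case in cases}  # "threat" is a hypothetical field
    return [threat for threat in REQUIRED_THREATS if threat not in covered]


def gate_exit_code(cases, failed_ids):
    """Return 1 if any failed case is a BLOCKER, else 0 (WARN failures only alert)."""
    failed = set(failed_ids)
    blockers = [c["id"] for c in cases
                if c["id"] in failed and c["severity"] == "BLOCKER"]
    return 1 if blockers else 0


sample_cases = [
    {"id": "b1", "severity": "BLOCKER", "threat": "T-1"},
    {"id": "w1", "severity": "WARN", "threat": "T-2"},
]
```

A CI step could pass the result of gate_exit_code to sys.exit so that a failing BLOCKER case stops the merge while a failing WARN case does not.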