Adversarial-Prompt Regression Suite (scaffold)

arxsec-app repo-root tests/llm/red_team/README.md

This directory holds the adversarial-prompt regression suite for ARX's first-party AI surface: the LLM router (app/llm/) and related code. Per docs/governance/ai-risk-policy.md §5.1, mandatory practice 5 requires every first-party AI feature to have an eval suite registered here (or in app/llm/eval/), and mandatory practice 6 requires a pre-deploy gate to run those suites.

This is a scaffold. The corpus and the runner shape are real and exercised by test_red_team_corpus.py. What is intentionally NOT here yet:

A large corpus of vendor-targeted jailbreaks. Foundation-model vendors maintain those internally; ARX's corpus focuses on router-specific failure modes (scope escape, service-internal attribution, failover-induced disclosure, malformed-response handling) plus a small example set of vendor-targeted cases to document the pattern.

Live runs against real foundation-model providers. CI runs the corpus against mocked providers (deterministic, free). A separate scheduled job (not in this directory) runs the same corpus against live providers periodically.

CI gating. The pytest entrypoint runs today; the deploy-gate hook is a follow-up.

Corpus structure

corpus.json is the source of truth. Each case has:

id — stable unique identifier.
category — one of:
jailbreak — try to bypass the model's safety policy.
injection — try to make the model take an action the user did not intend.
scope-bypass — try to cause the router to call a tier or provider the caller is not authorized for.
attribution — try to invoke the router in a way that produces no audit row (the canonical example: service-internal call without agent_id, see router.py lines 206–210).
crashy — known-malformed inputs that should fail gracefully rather than crash the worker.

severity — BLOCKER (CI-blocking) or WARN (alert-only).
description — what this case is actually testing.
input — the request payload or scenario.
expectation — human-readable description of "pass."
assertion — machine-readable predicate (currently a small DSL; see _eval_case in test_red_team_corpus.py).
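
As a rough illustration of the case shape described above, the sketch below checks that a case carries the listed fields and uses a known category and severity. It is not the suite's actual loader: validate_case, duplicate_ids, and every value in example_case are hypothetical, and the assertion payload is a placeholder because the real DSL is defined by _eval_case in test_red_team_corpus.py.

```python
from typing import Any

# Field names and allowed values are taken from the list above.
REQUIRED_FIELDS = (
    "id", "category", "severity", "description",
    "input", "expectation", "assertion",
)
CATEGORIES = {"jailbreak", "injection", "scope-bypass", "attribution", "crashy"}
SEVERITIES = {"BLOCKER", "WARN"}


def validate_case(case: dict[str, Any]) -> list[str]:
    """Return the problems found in one corpus case (empty means well-formed)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in case]
    if "category" in case and case["category"] not in CATEGORIES:
        problems.append(f"unknown category: {case['category']!r}")
    if "severity" in case and case["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {case['severity']!r}")
    return problems


def duplicate_ids(cases: list[dict[str, Any]]) -> set[str]:
    """Return ids that appear more than once; ids are meant to be unique."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for case in cases:
        case_id = case.get("id")
        if case_id in seen:
            dupes.add(case_id)
        seen.add(case_id)
    return dupes


# Hypothetical case: all values are illustrative, and "assertion" is a
# placeholder rather than real DSL syntax.
example_case = {
    "id": "attribution-001",
    "category": "attribution",
    "severity": "BLOCKER",
    "description": "Service-internal call without agent_id must still be audited.",
    "input": {"prompt": "ping", "agent_id": None},
    "expectation": "The call is rejected or an audit row is written.",
    "assertion": {"placeholder": True},
}
```

A check like this could run over the cases array before the runner evaluates anything, so a typo in category or severity surfaces as a clear message instead of a silently skipped case.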

Adding a case

Open corpus.json and add a new case object to the cases array.
Add a regression test for any case where the assertion DSL is insufficient.
If the case is for a class not yet covered, document it in this README first.

Run pytest tests/llm/red_team/ locally.
Submit. The case becomes part of the pre-deploy gate once the CI wiring lands.
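
Where the assertion DSL cannot express a check, a dedicated regression test might look like the sketch below, written for a crashy-style case. Everything named here is hypothetical: parse_provider_response is a stand-in defined inside the example rather than a function from app/llm/, and MalformedResponseError is an invented exception type. Only the shape matters: a malformed input should raise a controlled error, and any other exception should fail the test.

```python
import json


class MalformedResponseError(Exception):
    """Hypothetical controlled error for unusable provider output."""


def parse_provider_response(raw: str) -> str:
    """Stand-in for router code: extract the 'text' field from a JSON body."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedResponseError("response is not valid JSON") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("text"), str):
        raise MalformedResponseError("response has no string 'text' field")
    return payload["text"]


def fails_gracefully(raw: str) -> bool:
    """True if the input raised the controlled error, False if it parsed.

    Any other exception propagates, which fails the calling test.
    """
    try:
        parse_provider_response(raw)
    except MalformedResponseError:
        return True
    return False


def test_truncated_json_fails_gracefully():
    assert fails_gracefully('{"text": "unterminated')


def test_wrong_shape_fails_gracefully():
    assert fails_gracefully('["not", "an", "object"]')
```

Placed in a test module under tests/llm/red_team/, functions named test_* like these would be collected by the same pytest invocation as the corpus runner.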

Tracker rows

This scaffold partially closes:

SAF.4 (adversarial-prompt regression suite) — directory and runner exist; real corpus growth is incremental.
SAF.5 (pre-release eval gate) — pytest entrypoint runs today; CI gate is a follow-up.

A row moves to Met when:

The corpus covers every threat class enumerated in docs/security/threat-model-llm-router.md (T-1 through T-8) with at least one case each, and

The CI pipeline blocks merges when any BLOCKER case fails.
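
The two conditions above can be sketched as a pair of small checks. This is an illustration, not the planned CI wiring: CaseResult, uncovered_threats, and gate_exit_code are hypothetical names, and the threat_ids field is an assumption, since the case schema documented in this README does not say how a case is mapped to a threat class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one corpus case (hypothetical shape)."""
    case_id: str
    severity: str                      # "BLOCKER" or "WARN", as in corpus.json
    passed: bool
    threat_ids: tuple[str, ...] = ()   # assumption: not a documented corpus field


# T-1 through T-8, per docs/security/threat-model-llm-router.md.
REQUIRED_THREATS = tuple(f"T-{n}" for n in range(1, 9))


def uncovered_threats(results: list[CaseResult]) -> list[str]:
    """Threat classes that no case claims to cover."""
    covered = {threat for result in results for threat in result.threat_ids}
    return [threat for threat in REQUIRED_THREATS if threat not in covered]


def gate_exit_code(results: list[CaseResult]) -> int:
    """Return 1 if any BLOCKER case failed, else 0. WARN failures never block."""
    blocker_failed = any(r.severity == "BLOCKER" and not r.passed for r in results)
    return 1 if blocker_failed else 0


results = [
    CaseResult("scope-001", "BLOCKER", True, ("T-1",)),
    CaseResult("jailbreak-003", "WARN", False, ("T-2",)),
]
```

With the sample results, gate_exit_code returns 0 because the only failure is WARN, while uncovered_threats still reports T-3 through T-8, so neither tracker row would move to Met on coverage grounds.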