Status: Proposed | Date: 2026-06-12 | Deciders: PatrickJS | Index: Design decisions | Depends on: ADR-0001 (agent steps)
Proposed design. Nothing here is a shipped behavior claim; claims and tests land with the implementation per AGENTS.md.
AGENTS.md mandates a review discipline that is currently entirely manual: before a tranche is complete, a second agent or fresh session runs with the objective “find where the implementation betrays README.md and docs/, and prove it empirically with a scratch pipeline.” The reviewer falsifies; it does not confirm. This discipline exists because self-verification failed here before — checks passed while promises broke.
Today this works through prompt copy-paste and goal-directory receipts (goals/*/state.yaml records a worker/judge pattern). Nothing in the pipeline knows the review happened, what it examined, or what it found. The claims registry (tests/claims.json) makes claim→test existence checkable mechanically; the reviewer owns sufficiency — whether the test actually exercises the promise.
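To illustrate the split between existence and sufficiency, here is a minimal sketch of the mechanical half. The `Claim` fields, the sample claims, and the `claimsWithoutTests` helper are all hypothetical; the real tests/claims.json schema may differ. The check can only say that a test file exists, not that the test exercises the promise.

```typescript
// Hypothetical registry entry: the real tests/claims.json schema may differ.
interface Claim {
  id: string;      // stable identifier for the promise
  promise: string; // the sentence from README.md or docs/ being claimed
  test: string;    // path of the test said to cover it
}

// Returns the ids of claims whose test file is missing.
// `exists` is injected so the check stays pure and easy to test.
function claimsWithoutTests(
  claims: Claim[],
  exists: (path: string) => boolean,
): string[] {
  return claims.filter((c) => !exists(c.test)).map((c) => c.id);
}

// Made-up sample data, for illustration only.
const knownFiles = new Set(["tests/cache.test.ts"]);
const missing = claimsWithoutTests(
  [
    { id: "cache-hit", promise: "Unchanged inputs skip the task", test: "tests/cache.test.ts" },
    { id: "retry", promise: "Failed tasks retry once", test: "tests/retry.test.ts" },
  ],
  (path) => knownFiles.has(path),
);
console.log(missing); // ids of claims with no test file on disk
```

A passing existence check says nothing about whether tests/cache.test.ts would fail if the promise broke; that judgment stays with the reviewer.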
Forces: ADR-0001 provides policy-bounded agent tasks with transcripts; source composition already provides scratch checkouts (.async/sources warm clones) for many-repo runs; an adversarial reviewer needs read access plus the ability to run things in a scratch copy, but must never gain write access to the tree under review; a falsification objective is only credible if the reviewer’s failures block something.
Productize the discipline as a documented pattern — an agent() task with a falsification prompt, a scratch source checkout, and a structured receipt — shipped first as an example, promoted to a primitive only if the pattern stabilizes.
A review job, not a new core concept. A pipeline declares a review job whose task is agent() with:

- a scratch checkout of the repo at the candidate commit (reusing source machinery),
- a command policy granting read tools plus the project’s own verification commands, inside the scratch copy only,
- the claims registry as declared input.

Each examined claim gets a verdict: upheld with the exercising command, or falsified with a reproduction.

The receipt lives at .async/runs/<run-id>/review.json: claims examined, verdicts, reproduction commands, transcript reference.

A falsified verdict fails the task, and therefore the job, so review findings block exactly like test failures.

Option A: the documented pattern above, shipped first as an example

| Dimension | Assessment |
|---|---|
| Complexity | Low — composes ADR-0001 + existing sources |
| Rigidity | Low — prompt and policy iterate per repo |
| Enforceability | Receipt schema checkable; pattern itself opt-in |
| Risk | Reviewer quality varies; pattern may stay bespoke |
Pros: examples are exercised by release:check, so the pattern stays runnable; the project learns what the receipt schema should be before freezing it; zero new core surface.
Cons: “documented pattern” is weaker than a primitive — drift across adopters; receipt schema informal until promoted.
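The receipt and the blocking rule in this pattern could be sketched as follows. This is a guess at a shape, not a schema: the field names, the `commit` field, and the `reviewPassed` helper are assumptions, and the ADR deliberately leaves the real review.json shape informal until real receipts exist.

```typescript
// Hypothetical receipt shape; the real review.json schema is not yet fixed.
type Verdict =
  | { claim: string; status: "upheld"; command: string }          // command that exercised the claim
  | { claim: string; status: "falsified"; reproduction: string }; // command that reproduces the break

interface ReviewReceipt {
  commit: string;      // candidate commit under review (assumed field)
  verdicts: Verdict[]; // one entry per claim examined
  transcript: string;  // reference to the agent transcript
}

// The blocking rule: a single falsified verdict fails the task.
function reviewPassed(receipt: ReviewReceipt): boolean {
  return receipt.verdicts.every((v) => v.status === "upheld");
}

// Made-up sample receipt, for illustration only.
const receipt: ReviewReceipt = {
  commit: "<candidate-commit>",
  verdicts: [
    { claim: "cache-hit", status: "upheld", command: "<verification command>" },
    { claim: "retry", status: "falsified", reproduction: "<reproduction command>" },
  ],
  transcript: "<transcript reference>",
};

// A non-zero exit is what would make the job fail like a failing test.
const exitCode = reviewPassed(receipt) ? 0 : 1;
console.log(exitCode);
```

Under this rule a receipt with no verdicts passes, so a real check would probably also require that some claims were examined.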
Option B: review primitive in core now

| Dimension | Assessment |
|---|---|
| Complexity | Medium-high — new job kind, receipt schema in record shape |
| Rigidity | High — schema freezes at 1.0 |
| Enforceability | Strong — schemaVersion-ed receipts, standard CLI |
| Risk | Freezing a shape designed from one repo’s experience |
Pros: receipts become portable evidence across projects; tooling (badges, dashboards) gets a stable target.
Cons: this repo is the only known user of the discipline; designing the frozen schema from n=1 is how surfaces end up wrong at 1.0.
Option C: keep the review discipline manual (status quo)

| Dimension | Assessment |
|---|---|
| Complexity | Zero |
| Rigidity | None |
| Enforceability | None — discipline lives in AGENTS.md prose |
| Risk | The known one: skipped or shallow reviews leave no trace |
Pros: maximum reviewer freedom; no machinery.
Cons: the process is unfalsifiable. Nothing records whether the adversarial pass happened or what it covered, so the discipline’s own standard (“prove it empirically”) fails when applied to itself.
C fails the repo’s own bar: a review discipline whose execution leaves no evidence is exactly the kind of claim AGENTS.md distrusts. B is premature standardization — the receipt schema worth freezing is the one that survives contact with real reviews, and there has been exactly one project’s worth of those. A is the falsifiable middle: the pattern ships as a runnable example (so it cannot rot silently), receipts accumulate, and promotion to primitive happens with schema evidence in hand.
The sharpest open question in A is reviewer independence. A reviewer agent configured by the same pipeline.ts it reviews could be steered by a malicious or sloppy change (weakened prompt, narrowed policy). Mitigation in the pattern: the review job’s prompt and policy live in a file the review task declares as input, so tampering dirties the review and is visible in the diff — imperfect, honest about being so.
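The mitigation amounts to comparing the review task's declared inputs between the base and the candidate commit. A minimal sketch, assuming file contents are already available as plain strings; the `Snapshot` type, the `changedReviewInputs` helper, and the review/ paths are hypothetical and not part of the pipeline API.

```typescript
// Hypothetical: file contents keyed by path, as read at a given commit.
type Snapshot = Record<string, string>;

// Returns the declared review inputs whose content differs between the
// base commit and the candidate commit (added or removed files included).
function changedReviewInputs(
  declared: string[],
  base: Snapshot,
  candidate: Snapshot,
): string[] {
  return declared.filter((path) => base[path] !== candidate[path]);
}

// Made-up sample data: the candidate commit weakens the prompt.
const declaredInputs = ["review/prompt.md", "review/policy.json"];
const baseFiles: Snapshot = {
  "review/prompt.md": "Find where the implementation betrays README.md and docs/.",
  "review/policy.json": '{"write": false}',
};
const candidateFiles: Snapshot = {
  "review/prompt.md": "Confirm the implementation matches README.md.",
  "review/policy.json": '{"write": false}',
};

const tampered = changedReviewInputs(declaredInputs, baseFiles, candidateFiles);
console.log(tampered); // declared inputs a human should inspect in the diff
```

This only surfaces the change. It cannot tell a legitimate prompt improvement from a weakening, which is why the mitigation is described as imperfect.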
- examples/adversarial-review/ exercised by the examples test, with a mocked agent (command.mock) proving the receipt path and failure propagation without a model in CI.
- review.json shape; collect real receipts from this repo’s own tranches.