From 58b39e213036fe3fad53eb0048e6203755154b94 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 17:29:20 +0530 Subject: [PATCH 01/13] feat(v1): structural checkers + semantic-context search + checker-strength diagnostics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit WP3/WP4/WP2/WP6 of the v0.11.0 -> V1 strengthening program (research-report driven). - C3 py-signature:/py-const: structural checkers (dorian/pyast.py): AST-based, close the symbol-existence and string/regex comment-survival ceilings; gutted-body remains the documented ceiling (only C4 catches a body change behind an unchanged signature). - C3 code: semantic-context regex over comment/docstring-stripped Python (same ReDoS worker-timeout as regex:), so a fact surviving only in a comment/docstring FAILs. - checker-strength / claim-risk diagnostics (dorian/strength.py): classify truth strength per checker, flag kind-vs-strength adequacy mismatches, advisory C4 zero/constant-assertion lint; surfaced in `dorian bindings` (JSON + human) and the opt-in --binding-gate warn output. Advisory only — never a verdict/trust/exit change. - watch derivation + binding diagnostics recognize the new C3 forms (seal, bindings). - spec/checkers.md + docs/AGENT_CLAIMS.md document the new grammars and the ceiling. +58 tests (test_pystructural, test_semantic_context, test_strength); 619 non-slow pass. ERROR-vs-FAIL discipline preserved; trigger-vs-truth split made explicit, not blurred. 
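Illustration of the comment-survival ceiling named above (a minimal sketch, not the code in dorian/pyast.py; `SOURCE`, `const_matches`, and `param_names` are hypothetical names used only here, and the sketch handles module-level names only). A raw regex is satisfied by a stale comment, while a comparison of the parsed assignment value is not, and it treats `30` and `0x1E` as equal:

```python
import ast
import re

# Hypothetical target file: the comment still says 30, the code says 60.
SOURCE = '''\
# TIMEOUT = 30 was the old default
TIMEOUT = 60
LIMIT = 0x1E


def verify_token(token, algo="RS256"):
    return True
'''


def const_matches(source: str, name: str, expected: str) -> bool:
    """Compare a module-level assignment's literal value to `expected` by value."""
    want = ast.literal_eval(expected)
    found = None
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    found = stmt.value  # the last assignment wins
    # literal_eval accepts an AST node; a non-literal RHS would raise ValueError
    return found is not None and ast.literal_eval(found) == want


def param_names(source: str, func: str) -> list[str]:
    """Positional parameter names of a module-level function, in order."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name == func:
            return [arg.arg for arg in node.args.args]
    raise LookupError(func)


# The raw-text regex is satisfied by the comment alone (a false pass).
raw_hit = re.search(r"TIMEOUT\s*=\s*30", SOURCE, re.MULTILINE) is not None
```

With this sample, `raw_hit` is true even though the live value is 60, while `const_matches(SOURCE, "TIMEOUT", "30")` is false and `const_matches(SOURCE, "LIMIT", "30")` is true. A parameter rename would change `param_names(SOURCE, "verify_token")`, which a `def verify_token` existence check cannot see.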
Co-Authored-By: Claude Opus 4.8 (1M context) --- V1_IMPLEMENTATION_TRACKER.md | 111 +++++++++++++ docs/AGENT_CLAIMS.md | 5 +- spec/checkers.md | 44 +++++ src/dorian/bindings.py | 5 +- src/dorian/checkers/c3_ref.py | 62 ++++++- src/dorian/commands.py | 33 +++- src/dorian/pyast.py | 293 +++++++++++++++++++++++++++++++++ src/dorian/seal.py | 11 +- src/dorian/strength.py | 223 +++++++++++++++++++++++++ tests/test_pystructural.py | 290 ++++++++++++++++++++++++++++++++ tests/test_semantic_context.py | 106 ++++++++++++ tests/test_strength.py | 219 ++++++++++++++++++++++++ 12 files changed, 1393 insertions(+), 9 deletions(-) create mode 100644 V1_IMPLEMENTATION_TRACKER.md create mode 100644 src/dorian/pyast.py create mode 100644 src/dorian/strength.py create mode 100644 tests/test_pystructural.py create mode 100644 tests/test_semantic_context.py create mode 100644 tests/test_strength.py diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md new file mode 100644 index 0000000..3cf4254 --- /dev/null +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -0,0 +1,111 @@ +# V1 implementation tracker + +Working tracker for the v0.11.0 → V1 strengthening program driven by +`RESEARCH_REPORT_DORIAN_0_11_0.md`. Behavior is verified against the **current +code**, not the report; where they disagree, code wins and the disagreement is +recorded here. + +## Phase 0 — version gate + scope evidence + +**Version gate: PASSED.** + +| Surface | Observed | +|---|---| +| `pyproject.toml` `[project].version` | `0.11.0` | +| `src/dorian/__init__.py` `__version__` | `0.11.0` | +| branch | `main` | +| commit SHA (start) | `78dcd1a6a242110e55dc31fd1db2e811de3e3898` | +| working tree | clean except untracked `.claude/`, `AGENTS.md`, `CLAUDE.md`, `RESEARCH_REPORT_DORIAN_0_11_0.md` | +| Python | 3.12.4 | +| toolchain | `uv` 0.5.9; `uv run pytest`; ruff for lint/format | +| baseline tests | `uv run pytest -m "not slow"` → **561 passed, exit 0**; 636 total incl. 
slow | + +## Phase 1 — baseline reconstruction (from current code) + +### Module map +- `model.py` — `Warrant`/`Claim`/`CheckerSpec`/`ReadSetEntry`, content-addressed id, canonical JSON. `CheckerType = C1|C3|C4|C5` (a *Literal* hint; registry dispatch is on the string `type`). +- `checkers/base.py` — `run_checker` is the single dispatch + the single execution-policy gate (blocked → `Verdict.ERROR`). +- `checkers/c1_span.py` — span anchor, relocation-tolerant, optional c2lite. +- `checkers/c3_ref.py` — `path:` / `symbol:` / `string:` / `regex:`; regex match in a spawn-killed worker (ReDoS backstop). +- `checkers/c4_test.py` — `pytest:`, careful exit-code mapping; ERROR≠FAIL. +- `checkers/c5_data.py` — typed data forms + opaque `shell:`. +- `policy.py` — `ExecutionPolicy`, `executable_kind` (single source of "what executes": C4=pytest, C5 shell=shell). +- `seal.py` — born-verifiable seal; scope lint; watch derivation; additive symbol-definer widening; duplicate-id reject; atomic write; idempotent re-seal. +- `revalidate.py` — changed-path discovery, rename persistence, cheapest-first checks (C1 REVOKED` is NOT drift.** Report (medium-confidence) called it stale. Verified: `fold.fold()` only emits TRUSTED/DEGRADED/REVOKED/UNKNOWN; the *born* trust state is `WARRANTED` (set at seal); the first fold therefore renders `WARRANTED -> `. `tests/test_render_md.py:168-169` pins `WARRANTED -> REVOKED` and `WARRANTED -> UNKNOWN` as correct md output. Action: **do not "fix"; add a short trust-state vocabulary note to remove reader confusion.** +- **C4 adequacy blind spot** — report marks INFERENCE; confirmed: `c4_test.py` maps pytest exit codes only, no assertion/relevance inspection. Valid advisory target (WP6). +- **PyPI install wording** — report marks UNVERIFIED. Per project state, dorian is NOT on PyPI; README "until the first PyPI release … install from source" is accurate. Keep. 
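The C4 adequacy blind spot noted above can be sketched as a small AST lint. This is a hedged illustration of the idea only: `adequacy_notes`, `TESTS`, and the note strings are hypothetical and are not taken from `dorian/strength.py`, whose actual rules may differ.

```python
import ast

# Hypothetical test module used only for this illustration.
TESTS = '''\
def test_empty():
    pass


def test_vacuous():
    assert True


def test_real():
    assert add(1, 2) == 3
'''


def adequacy_notes(source: str, test_name: str) -> list[str]:
    """Return advisory notes for one test function; an empty list means no concern."""
    for node in ast.walk(ast.parse(source)):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == test_name:
            asserts = [n for n in ast.walk(node) if isinstance(n, ast.Assert)]
            if not asserts:
                return ["zero-assertion test"]
            # `assert True`, `assert 1`, ... can never fail on a behavior change
            if all(isinstance(a.test, ast.Constant) for a in asserts):
                return ["constant-assertion test"]
            return []
    return ["test not found"]
```

Under these assumptions, `adequacy_notes(TESTS, "test_empty")` and `adequacy_notes(TESTS, "test_vacuous")` each return one note, and `adequacy_notes(TESTS, "test_real")` returns none. A test that asserts only through `pytest.raises` or a helper function contains no `assert` statement and would also be flagged, which is one reason such a lint is better kept advisory than wired into a verdict.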
+ +## Report coverage matrix (every material finding classified) + +Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-document · BENCH=must-benchmark · BOUNDARY=honest non-goal · DONE=already in v0.11.0 · DEFER=post-V1/blocked. + +| # | Report finding / recommendation | Category | Current evidence | Planned action | Acceptance/verification | Status | +|---|---|---|---|---|---|---| +| 1 | README trust-state vocab (WARRANTED vs TRUSTED/…) | DOC | code correct; README lacks a glossary | add trust-state legend; keep examples | docs test + render_md tests stay green | TODO | +| 2 | ERROR must never collapse into BROKEN | DONE+TEST | base/fold/revalidate all enforce | keep; add a guard test if any new path | existing + new ERROR≠BROKEN tests | TODO | +| 3 | C1 span + c2lite regression | DONE | test_c1.py | none (keep green) | test_c1 passes | DONE | +| 4 | C3 regex ReDoS timeout regression | DONE | test_c3_regex_timeout.py (slow) | none | passes | DONE | +| 5 | C3 symbol existence ceiling / gutted-body | IMPL+DOC | symbol: existence-only | add `py-signature:` structural checker (WP3) | gutted-body PASS under symbol, FAIL under signature when sig changes; body-only stays PASS (documented ceiling) | TODO | +| 6 | C3 string/regex comment/docstring survival | IMPL+DOC | raw text search | add semantic code-context search mode (WP4) | literal only in comment/docstring → FAIL in code mode | TODO | +| 7 | C4 pytest vacuous/zero-assertion adequacy | IMPL | none | advisory adequacy lint (WP6) | zero-assertion / assert-True node warns; normal test does not | TODO | +| 8 | C5 typed grammar limits / snapshot brittleness | BOUNDARY+DOC | documented | document in V1-meaning; optional structural data checker DEFER | doc states grammar bounds | TODO | +| 9 | duplicate claim-id rejection | DONE | seal.py step 0 | keep | test_seal covers | DONE | +| 10 | scope-lint named-read-set-only limitation | DONE+DOC | SECURITY_BOUNDARY | keep wording | docs test | DONE | +| 11 | 
deny-exec/deny-shell fail-closed, not sandbox | DONE | policy.py, docs | keep | test_deny_exec_policy | DONE | +| 12 | sidecar source-of-truth vs SQLite derived | DONE | seal/revalidate/sync | keep | test_store/sync | DONE | +| 13 | canonical JSON / content-addressed identity | DONE | model.compute_id + Warrant.load integrity | keep | test_model/determinism | DONE | +| 14 | atomic no-write on failed seal | DONE | seal os.replace + refusal order | keep | test_seal/deny_exec | DONE | +| 15 | changed-path discovery + persisted rename | DONE | revalidate + store rename_log | keep | test_revalidate | DONE | +| 16 | checker ordering + FAIL vs ERROR discipline | DONE | revalidate _check_claim | keep | existing | DONE | +| 17 | fold + blast/recall lineage | DONE | fold.py, blast.py | keep | test_fold/test_blast | DONE | +| 18 | audit/state separate-transaction limitation | BOUNDARY | fold.py docstring documents it | document in V1-meaning as known limitation | doc names it | TODO | +| 19 | binding ambiguity handling | DONE | symbol_index ambiguous_symbol_mentions + flag | keep; extend provenance (WP5) | test_symbol_index | DONE | +| 20 | oversized/unparseable file diagnostics | IMPL | silently skipped today | surface multi-index unparse diagnostics (WP5) loudly | giant/unparseable supported file → diagnostic not silent | TODO | +| 21 | pyproject script binding | DONE | pyproject_script_definers | keep | test_symbol_index | DONE | +| 22 | watch glob over/under-match risk | TEST | _covered glob logic | add a glob over/under test if WP5 touches it | test | TODO | +| 23 | public/fork self-attested verdict risk | IMPL+DOC | head-mode only | trusted-base checker-source (WP7) | exploit fixtures: PR-added/modified exec checker not run; non-exec rewrite surfaced | TODO | +| 24 | trusted-base design + non-sandbox caveat | IMPL+DOC | design-only | implement `--checker-source base` + Action input; keep non-sandbox caveat | WP7 test matrix | TODO | +| 25 | historical benchmark docs 
(v0.7.0, v0.9.0) | DOC | unlabeled as historical in body | add HISTORICAL banner; README cross-link labels | docs wording test | TODO | +| 26 | public benchmark protocol w/o results | DOC | protocol only | keep; note in current-results doc | unchanged | TODO | +| 27 | current-version benchmark rerun | BENCH | none | rerun + version-stamped `BENCHMARK_CURRENT.md` | bench smoke + stamp present | TODO | +| 28 | extractor remains draft/experimental | DONE | README + AGENT_CLAIMS | keep; do not promote | docs test | DONE | +| 29 | release/install-status uncertainty | DOC | README source-install accurate | keep; V1 release report states status | report | TODO | +| 30 | checker-strength / claim-risk visibility | IMPL | bindings flags exist but no strength score | strength + claim-risk diagnostics (WP2) | behavior+symbol → adequacy-mismatch; unbacked load-bearing → high risk | TODO | +| 31 | multi-index binding (routes/config/etc.) | IMPL | python+script only | config-key index (WP5), provenance-tagged | config-key change selects claim; ambiguous skipped+warned | TODO | +| 32 | warrant-quality mutation harness | BENCH | repo-level bench only | `dorian bench warrant-quality` (WP8) | deterministic per-claim trigger/truth score on fixture | TODO | + +## Work-package status (live) + +| WP | Title | Status | +|---|---|---| +| WP1 | docs/evidence hygiene | TODO | +| WP2 | checker-strength / claim-risk linter | TODO | +| WP3 | Python structural checkers (py-signature, py-const) | TODO | +| WP4 | semantic-context source search | TODO | +| WP5 | multi-index binding (config-key) | TODO | +| WP6 | C4 test-adequacy lint | TODO | +| WP7 | trusted-base checker-source mode | TODO | +| WP8 | warrant-quality mutation harness | TODO | +| WP9 | current-version benchmark results | TODO | +| WP10 | V1 release prep / decision | TODO | diff --git a/docs/AGENT_CLAIMS.md b/docs/AGENT_CLAIMS.md index 18202e2..ccb4c24 100644 --- a/docs/AGENT_CLAIMS.md +++ b/docs/AGENT_CLAIMS.md @@ -120,7 +120,10 @@ 
ignores `timeout_s`, and a backtracking pattern can stall `revalidate`. The authoritative grammar is [`spec/checkers.md`](../spec/checkers.md). In brief: -- **C3** — `path:

` · `symbol:::` · `string:::` · `regex:::` +- **C3** — `path:

` · `symbol:::` · `string:::` · `regex:::` · `py-signature:::::` · `py-const:::::` · `code:::` + - **`py-signature:`** is stronger than `symbol:` for "function `X` takes args `…`": it compares the parsed signature (names/order/kind always; annotations/defaults/return/async only when you state them), so a parameter rename or default change FAILs where `symbol:` still passes. A body-only change is the documented ceiling — back behavior claims with `C4 pytest:`. + - **`py-const:`** is stronger than `regex:` for "`X` is 30": it compares the assignment's literal **value** via the AST, so a comment or docstring mention can never pass and `30`/`0x1E` are equal. A non-literal RHS ERRORs (use a different checker). + - **`code:`** is `regex:` over comment/docstring-stripped Python — use it when a `regex:` would false-pass on a fact that survives only in a comment or docstring. Python-only. - **C4** — `pytest:` (a nodeid is `file::test`) - **C5** — `rowcount:::` · `schema:::c1,c2` · `nullrate:::::` · `domain:::::{a,b}` · `freshness:::::>= ` · `snapshot:` · `reconcile:~~` · `shell:` (needs explicit `watch` + `expect`) - **C1** — a span anchor; its `program` is a read-set entry id. **Not** auto-capturable by `verify`. diff --git a/spec/checkers.md b/spec/checkers.md index eb02acf..2ccee4e 100644 --- a/spec/checkers.md +++ b/spec/checkers.md @@ -35,6 +35,11 @@ symbol::: PASS iff \b(def|class)\s+\b matches the fil string::: PASS iff the literal substring is present regex::: PASS iff re.search(pattern, text, re.MULTILINE) hits the LF-normalized file text +py-signature::::: structural (Python AST): the function + or method has the stated signature +py-const::::: structural (Python AST): the module or + class assignment has the stated literal value +code::: semantic regex over comment/docstring-stripped Python ``` The operand of `string:`/`regex:` may itself contain `:`; only the prefix and @@ -44,6 +49,45 @@ both `TIMEOUT = 30` and `TIMEOUT=30`). 
When a `string:` check FAILs but a line nearly matches, the detail carries a near-miss hint (line number and similarity ratio only, never file content) pointing at `regex:`. +### Python structural forms (`py-signature:`, `py-const:`) + +`symbol:` proves a name still **exists**; it cannot see a signature change, and +`string:`/`regex:` search raw text, so a fact surviving only in a comment, +docstring, or dead literal still passes. The two structural forms parse the +target's AST (stdlib `ast`, read-only, no execution) and compare structure or +literal **values**, so they tolerate formatting (whitespace, quote style, integer +base) and cannot be satisfied by a comment/docstring mention. + +- `py-signature:::::` — `` is a dotted path to a + function/method (`verify_token`, `Auth.login`). `` is the parameter list + exactly as it would appear inside `def f(...)` — e.g. `token`, `token, algo`, + `token: str, algo: str = "RS256"` — optionally suffixed with `-> ` and/or + prefixed with `async`. Parameter **names, order, and kind** are always compared; + per-parameter **annotations** and **defaults**, the **return annotation**, and + **async-ness** are compared **only when the spec states them** (a names-only spec + ignores the rest). FAIL on a signature drift or a missing function; ERROR on an + unparseable target or a malformed ``. +- `py-const:::::` — `` is a module-level or + class-level assignment target (`TIMEOUT`, `C.LIMIT`). Compares the assignment's + literal value by `ast.literal_eval`, so `30` matches `0x1E` and `"RS256"` matches + `'RS256'`. FAIL on a value drift or a missing constant; **ERROR** when the RHS is + not a literal (the value cannot be determined — never a vacuous PASS). + +**Documented ceiling:** `py-signature:` is blind to a body-only ("gutted body") +change — the signature is unchanged, so it PASSes. Only a C4 `pytest:` test catches +a behavior change behind an unchanged signature. 
Binding/structure widens *what is +checked*, never *proves behavior*. + +### Semantic-context form (`code:`) + +`code:::` runs a regex (same 500-char cap, compile guard, and +worker-process timeout as `regex:`) over a copy of the **Python** file with comments +and docstrings blanked out. Real string literals (a route path in a dict key, a call +or decorator argument) are kept, so `code:src/routes.py::/v1/login` matches the route +in code but a `TIMEOUT = 30` that survives only in a comment FAILs (`code_missing`). +Python-only: a non-parseable / non-Python target ERRORs (`code_unparseable`), never a +silent pass. Derived watch: the referenced file. + Regex DoS: `regex:` patterns are length-bounded (500 chars) and compile-guarded, AND the match runs in a spawned worker process killed at the checker's `timeout_s` (default 30s). A pathological nested-quantifier pattern that triggers diff --git a/src/dorian/bindings.py b/src/dorian/bindings.py index 0677e0e..976cb91 100644 --- a/src/dorian/bindings.py +++ b/src/dorian/bindings.py @@ -197,7 +197,8 @@ def _checker_named_files(claim: Claim, entry_uris: dict[str, str]) -> set[str]: symbol-definer watch paths added at verify time. 
A watch path NOT in this set is a re-check TRIGGER that no checker exercises — the binding fix's trigger != truth gap, which the 'trigger-only-symbol' flag surfaces.""" - from dorian.seal import _c5_data_paths # lazy: reuse the canonical C5 path grammar + # lazy: reuse seal's canonical C3 file-operand form set and C5 path grammar + from dorian.seal import _C3_FILE_OPERAND_FORMS, _c5_data_paths named: set[str] = set() for spec in claim.checkers: @@ -207,7 +208,7 @@ def _checker_named_files(claim: Claim, entry_uris: dict[str, str]) -> set[str]: if uri: named.add(uri) elif spec.type == "C3": - named.add(rest.partition("::")[0] if prefix in ("symbol", "string", "regex") else rest) + named.add(rest.partition("::")[0] if prefix in _C3_FILE_OPERAND_FORMS else rest) elif spec.type == "C4" and prefix == "pytest": named.add(rest.partition("::")[0].strip()) # parity with seal._derive_watch elif spec.type == "C5": diff --git a/src/dorian/checkers/c3_ref.py b/src/dorian/checkers/c3_ref.py index c2a2e5f..fe06e35 100644 --- a/src/dorian/checkers/c3_ref.py +++ b/src/dorian/checkers/c3_ref.py @@ -7,6 +7,21 @@ - string::: PASS iff the literal substring is present. - regex::: PASS iff re.search(pattern, text, re.MULTILINE) hits the LF-normalized file text. +- py-signature::::: structural (Python AST): the named + function/method has the stated parameters (and, when + given, annotations/defaults/return/async). FAIL on a + signature drift; ERROR on an unparseable target or + malformed spec. Stronger than `symbol:` (which is + existence-only); the body-only "gutted" change is the + documented ceiling — only a C4 test catches that. +- py-const::::: structural (Python AST): the named + module/class assignment has the stated LITERAL value + (compared by value, so quote style / int base / spacing + are tolerated, and a comment/docstring mention cannot + pass). FAIL on a value drift; ERROR on a non-literal RHS. 
+ +The `py-*` structural forms parse the file's AST (`dorian.pyast`); they read only and +never execute the target. See `dorian/pyast.py` and `spec/checkers.md`. `regex:` is the shape-tolerant form: prefer it over `string:` for facts that must survive reformatting (the v0.0 false-positive class — e.g. 'TIMEOUT\\s*=\\s*30' @@ -42,11 +57,16 @@ import re from pathlib import Path +from dorian import pyast from dorian._regex_worker import MATCH, NO_MATCH, WORKER_ERROR, search_worker from dorian.checkers import registry from dorian.checkers.base import CheckContext, CheckResult, Verdict, resolve_path from dorian.model import CheckerSpec, lf_normalize +# C3 grammar prefixes. `path` takes a bare path; the rest take `::`. +_FILE_OPERAND_FORMS = ("symbol", "string", "regex", "py-signature", "py-const", "code") +_VERDICT = {"PASS": Verdict.PASS, "FAIL": Verdict.FAIL, "ERROR": Verdict.ERROR} + _MAX_PATTERN_LEN = 500 # cheap guard against catastrophic patterns _NEAR_MISS_RATIO = 0.8 _NEAR_MISS_MAX_FILE_BYTES = 1 << 20 # 1 MiB: bound the per-line scan @@ -132,7 +152,7 @@ def _string_fail(path: Path, text: str, literal: str) -> CheckResult: def check(ctx: CheckContext, spec: CheckerSpec) -> CheckResult: prefix, sep, rest = spec.program.partition(":") - if not sep or prefix not in ("path", "symbol", "string", "regex"): + if not sep or (prefix != "path" and prefix not in _FILE_OPERAND_FORMS): return CheckResult(Verdict.ERROR, detail="bad_program") if prefix == "path": @@ -148,7 +168,7 @@ def check(ctx: CheckContext, spec: CheckerSpec) -> CheckResult: return CheckResult(Verdict.ERROR, detail="bad_program") pattern: re.Pattern[str] | None = None - if prefix == "regex": + if prefix in ("regex", "code"): # both are regex over text; same DoS guards if len(needle) > _MAX_PATTERN_LEN: return CheckResult(Verdict.ERROR, detail="bad_program") try: @@ -168,6 +188,44 @@ def check(ctx: CheckContext, spec: CheckerSpec) -> CheckResult: return CheckResult(Verdict.PASS) return 
CheckResult(Verdict.FAIL, detail="symbol_missing") + if prefix == "py-signature": + verdict, detail = pyast.check_signature(text, needle) + return CheckResult(_VERDICT[verdict], detail=detail) + + if prefix == "py-const": + verdict, detail = pyast.check_const(text, needle) + return CheckResult(_VERDICT[verdict], detail=detail) + + if prefix == "code": + assert pattern is not None # compiled above to validate before we spawn + code_text = pyast.code_only_python(text) + if code_text is None: + return CheckResult( + Verdict.ERROR, + detail="code_unparseable (code: strips comments/docstrings from" + " Python; this target is not parseable Python)", + ) + status = _search_with_timeout(needle, re.MULTILINE, code_text, spec.timeout_s) + if status == "match": + return CheckResult(Verdict.PASS) + if status == "nomatch": + return CheckResult( + Verdict.FAIL, + detail="code_missing (not present in code; comments/docstrings ignored)", + ) + if status == "timeout": + return CheckResult( + Verdict.ERROR, + detail=f"regex_timeout (>{spec.timeout_s}s — catastrophic backtracking?)", + ) + if status == "spawn_error": + return CheckResult( + Verdict.ERROR, + detail="regex_spawn_error (regex worker process failed to start;" + " an embedder needs a spawn-safe __main__ guard)", + ) + return CheckResult(Verdict.ERROR, detail="regex_error") + if prefix == "regex": assert pattern is not None # compiled above to validate before we spawn status = _search_with_timeout(needle, re.MULTILINE, text, spec.timeout_s) diff --git a/src/dorian/commands.py b/src/dorian/commands.py index fc0343d..dff6848 100644 --- a/src/dorian/commands.py +++ b/src/dorian/commands.py @@ -23,7 +23,7 @@ from collections import Counter from pathlib import Path -from dorian import bindings, claims_io, datachecks, gitio, store, symbol_index +from dorian import bindings, claims_io, datachecks, gitio, store, strength, symbol_index from dorian.blast import blast_conn from dorian.capture.manual import parse_manual from 
dorian.capture.transcript import parse_transcript @@ -94,6 +94,18 @@ def _emit_binding_gate_warnings(prog: str, repo: Path, artifact_uri: str, mode: " (weak binding is a review smell, not proof a claim is false)", file=sys.stderr, ) + # checker-strength / claim-risk is the truth-axis companion to binding flags: + # binding says WHEN a claim re-checks; strength says whether the checker can + # falsify it. Advisory only — never changes the seal verdict or exit code. + try: + claims = list(Warrant.load(repo / (artifact_uri + ".warrant")).claims) + except (gitio.GitError, *_SIDECAR_ERRORS): + return + sdiags = strength.analyze(repo, claims, {d["claim_id"]: d["flags"] for d in diags}) + print(f"{prog}: {strength.summary_line(sdiags)}", file=sys.stderr) + for s in sdiags: + for note in s["adequacy"]: + print(f"{prog}: {s['claim_id']}: {note}", file=sys.stderr) def _print_binding_gate_refusal(prog: str, exc: BindingGateError) -> None: @@ -425,6 +437,19 @@ def cmd_bindings(args: argparse.Namespace) -> int: except _SIDECAR_ERRORS as exc: print(f"dorian bindings: corrupt warrant sidecar: {exc}", file=sys.stderr) return EXIT_REVOKED + # attach the truth-axis diagnostics (checker strength + claim risk) per claim: + # binding flags say WHEN a claim re-checks, strength says whether the checker can + # falsify it. Advisory; never a gate (bindings always exits 0 when readable). 
+ try: + claims = list(Warrant.load(repo / (uri + ".warrant")).claims) + sdiags = { + s["claim_id"]: s + for s in strength.analyze(repo, claims, {d["claim_id"]: d["flags"] for d in diags}) + } + except _SIDECAR_ERRORS: + sdiags = {} + for d in diags: + d["strength"] = sdiags.get(d["claim_id"]) if args.json: print(json.dumps({"artifact_uri": uri, "claims": diags}, sort_keys=True)) return EXIT_OK @@ -434,6 +459,12 @@ def cmd_bindings(args: argparse.Namespace) -> int: print(f"{d['claim_id']} flags: {', '.join(d['flags']) or 'none'}") for m in d["mentions"]: print(f" {m['token']} -> unwatched: {', '.join(m['unwatched_files'])}") + s = d.get("strength") + if s: + reasons = f" ({', '.join(s['reasons'])})" if s["reasons"] else "" + print(f" strength: {s['strength']} risk: {s['risk']}{reasons}") + for note in s["adequacy"]: + print(f" {note}") print(f"{len(diags)} claim(s), {flagged} flagged") return EXIT_OK diff --git a/src/dorian/pyast.py b/src/dorian/pyast.py new file mode 100644 index 0000000..fb2c4b0 --- /dev/null +++ b/src/dorian/pyast.py @@ -0,0 +1,293 @@ +"""Deterministic Python structural comparisons over the stdlib `ast` (no execution). + +Backs the C3 structural subgrammars `py-signature:` and `py-const:`. Both parse the +target file's AST and compare *structure* / *literal values*, so they are tolerant +of formatting (whitespace, quote style, integer base) and cannot be satisfied by a +mention in a comment, docstring, or dead string literal — the documented weak-verdict +ceiling of `symbol:`/`string:`/`regex:`. + +Each entry point returns ``(verdict, detail)`` where ``verdict`` is ``"PASS"`` / +``"FAIL"`` / ``"ERROR"`` (the caller maps to ``CheckResult``). The split mirrors the +checker contract exactly: +- **FAIL** — a real drift: the function/constant is gone, or its signature/value no + longer matches the claim. +- **ERROR** — the checker could not run: an unparseable target, a malformed program, + or a non-literal constant whose value cannot be compared. 
ERROR is never a vacuous + PASS and never a false FAIL. + +This module imports only ``ast``; it neither imports nor executes the target. +""" + +from __future__ import annotations + +import ast +import io +import tokenize + +# ast.parse on a pathological (but importable) file can blow these without a +# SyntaxError; treat them as "could not run", never as drift. +_PARSE_ERRORS = (SyntaxError, ValueError, RecursionError, MemoryError) + +_SCOPE_NODES = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef) + + +def code_only_python(text: str) -> str | None: + """Return `text` with comments and docstrings blanked to spaces (line count and + column offsets preserved), or None if it is not parseable Python. + + Real string literals — a route path in a dict key, a call argument, a decorator + argument — are KEPT: only ``#`` comments and bare-string docstring statements are + removed. This is what makes `code:` reject a fact that survives only in a comment + or docstring while still matching the same fact in actual code. 
+ """ + tree = _parse(text) + if tree is None: + return None + doc_start_lines: set[int] = set() + for node in ast.walk(tree): + if isinstance(node, _SCOPE_NODES): + body = getattr(node, "body", None) + if ( + isinstance(body, list) + and body + and isinstance(body[0], ast.Expr) + and isinstance(body[0].value, ast.Constant) + and isinstance(body[0].value.value, str) + ): + doc_start_lines.add(body[0].value.lineno) + + buf = [list(line) for line in text.split("\n")] + + def blank(start: tuple[int, int], end: tuple[int, int]) -> None: + (sl, sc), (el, ec) = start, end + for ln in range(sl, el + 1): + if ln - 1 >= len(buf): + break + row = buf[ln - 1] + lo = sc if ln == sl else 0 + hi = ec if ln == el else len(row) + for i in range(lo, min(hi, len(row))): + row[i] = " " + + try: + for tok in tokenize.generate_tokens(io.StringIO(text).readline): + if tok.type == tokenize.COMMENT or ( + tok.type == tokenize.STRING and tok.start[0] in doc_start_lines + ): + blank(tok.start, tok.end) + except (tokenize.TokenError, IndentationError, SyntaxError): + pass # ast parsed cleanly; a tokenizer hiccup leaves best-effort blanking + return "\n".join("".join(row) for row in buf) + + +def _parse(text: str) -> ast.Module | None: + try: + return ast.parse(text) + except _PARSE_ERRORS: + return None + + +def _find_def(tree: ast.Module, qualname: str) -> ast.AST | None: + """Resolve a dotted qualname (``A.method``) to its def node, last definition + winning (runtime rebinding semantics). 
Descends ClassDef/FunctionDef bodies.""" + node: ast.AST = tree + parts = qualname.split(".") + for part in parts: + body = getattr(node, "body", None) + if not isinstance(body, list): + return None + match: ast.AST | None = None + for child in body: + if ( + isinstance(child, ast.FunctionDef | ast.AsyncFunctionDef | ast.ClassDef) + and child.name == part + ): + match = child # last wins + if match is None: + return None + node = match + return node + + +def _find_assign(tree: ast.Module, qualname: str) -> ast.expr | None: + """Resolve a dotted constant name (``C.LIMIT``) to its assigned RHS value node, + last assignment winning. Walks into ClassDef containers for the dotted prefix.""" + parts = qualname.split(".") + *containers, name = parts + if not name: + return None + node: ast.AST = tree + for part in containers: + body = getattr(node, "body", None) + if not isinstance(body, list): + return None + match: ast.ClassDef | None = None + for child in body: + if isinstance(child, ast.ClassDef) and child.name == part: + match = child + if match is None: + return None + node = match + body = getattr(node, "body", None) + if not isinstance(body, list): + return None + found: ast.expr | None = None + for stmt in body: + if isinstance(stmt, ast.Assign): + for tgt in stmt.targets: + if isinstance(tgt, ast.Name) and tgt.id == name: + found = stmt.value + elif ( + isinstance(stmt, ast.AnnAssign) + and isinstance(stmt.target, ast.Name) + and stmt.target.id == name + and stmt.value is not None + ): + found = stmt.value + return found + + +def _params(fn: ast.FunctionDef | ast.AsyncFunctionDef) -> list[dict]: + """Normalized parameter list: name + kind, plus annotation/default source reprs + (via ast.unparse, so spacing/quote style/int base are canonical).""" + a = fn.args + out: list[dict] = [] + pos = a.posonlyargs + a.args + dmap = { + arg.arg: ast.unparse(d) + for arg, d in zip(pos[len(pos) - len(a.defaults) :], a.defaults, strict=False) + } + kwd = { + arg.arg: 
ast.unparse(d) + for arg, d in zip(a.kwonlyargs, a.kw_defaults, strict=False) + if d is not None + } + + def emit(args: list[ast.arg], kind: str, defaults: dict[str, str]) -> None: + for arg in args: + out.append( + { + "name": arg.arg, + "kind": kind, + "annotation": ast.unparse(arg.annotation) if arg.annotation else None, + "default": defaults.get(arg.arg), + } + ) + + emit(a.posonlyargs, "posonly", dmap) + emit(a.args, "arg", dmap) + if a.vararg: + out.append( + { + "name": a.vararg.arg, + "kind": "vararg", + "annotation": ast.unparse(a.vararg.annotation) if a.vararg.annotation else None, + "default": None, + } + ) + emit(a.kwonlyargs, "kwonly", kwd) + if a.kwarg: + out.append( + { + "name": a.kwarg.arg, + "kind": "kwarg", + "annotation": ast.unparse(a.kwarg.annotation) if a.kwarg.annotation else None, + "default": None, + } + ) + return out + + +def _compare_params(expected: list[dict], actual: list[dict]) -> str | None: + """Compare names/kinds/order always; annotations and defaults ONLY where the + expected spec provided them (so a names-only spec ignores annotations).""" + if len(expected) != len(actual): + return f"param count {len(actual)} != expected {len(expected)}" + for e, a in zip(expected, actual, strict=True): + if e["name"] != a["name"] or e["kind"] != a["kind"]: + return f"param {a['name']!r} ({a['kind']}) != expected {e['name']!r} ({e['kind']})" + if e["annotation"] is not None and e["annotation"] != a["annotation"]: + return ( + f"annotation of {e['name']!r}: {a['annotation']!r} != expected {e['annotation']!r}" + ) + if e["default"] is not None and e["default"] != a["default"]: + return f"default of {e['name']!r}: {a['default']!r} != expected {e['default']!r}" + return None + + +def check_signature(text: str, needle: str) -> tuple[str, str]: + """``needle`` is ``::``. ``sigspec`` is the parameter list as + written inside ``def f(...)`` (optionally ``-> ret``, optionally a leading + ``async``). 
Compares names/kinds/order always; annotations, defaults, the return + annotation, and async-ness only when the spec states them.""" + qualname, sep, spec = needle.partition("::") + qualname, spec = qualname.strip(), spec.strip() + if not sep or not qualname: + return ("ERROR", "bad_program: py-signature needs ::") + + async_required = False + if spec == "async" or spec.startswith("async "): + async_required = True + spec = spec[len("async") :].strip() + if " -> " in spec: + param_src, ret = spec.split(" -> ", 1) + arrow = f" -> {ret.strip()}" + else: + param_src, arrow = spec, "" + try: + probe = ast.parse(f"def __dorian_probe__({param_src}){arrow}: pass") + except _PARSE_ERRORS: + return ("ERROR", f"bad_program: cannot parse expected signature {spec!r}") + pfn = probe.body[0] + if not isinstance(pfn, ast.FunctionDef): + return ("ERROR", f"bad_program: cannot parse expected signature {spec!r}") + + tree = _parse(text) + if tree is None: + return ("ERROR", "target_unparseable: not parseable python") + fn = _find_def(tree, qualname) + if not isinstance(fn, ast.FunctionDef | ast.AsyncFunctionDef): + return ("FAIL", f"function_missing: {qualname}") + + if async_required and not isinstance(fn, ast.AsyncFunctionDef): + return ("FAIL", f"signature_mismatch: {qualname} is not async") + mismatch = _compare_params(_params(pfn), _params(fn)) + if mismatch: + return ("FAIL", f"signature_mismatch: {qualname}: {mismatch}") + if arrow: + want_ret = ast.unparse(pfn.returns) if pfn.returns else None + got_ret = ast.unparse(fn.returns) if fn.returns else None + if want_ret != got_ret: + return ("FAIL", f"signature_mismatch: {qualname}: return {got_ret!r} != {want_ret!r}") + return ("PASS", f"signature ok: {qualname}") + + +def check_const(text: str, needle: str) -> tuple[str, str]: + """``needle`` is ``::``. Compares the assignment's literal + VALUE (via ``ast.literal_eval``), so quote style / int base / spacing are + tolerated and a comment/docstring mention cannot pass. 
A non-literal RHS ERRORs + (the value cannot be determined), never a vacuous PASS.""" + qualname, sep, expected = needle.partition("::") + qualname, expected = qualname.strip(), expected.strip() + if not sep or not qualname: + return ("ERROR", "bad_program: py-const needs ::") + if not expected: + return ("ERROR", "bad_program: py-const needs an expected value") + try: + want = ast.literal_eval(expected) + except _PARSE_ERRORS: + return ("ERROR", f"bad_program: expected value is not a python literal: {expected!r}") + + tree = _parse(text) + if tree is None: + return ("ERROR", "target_unparseable: not parseable python") + rhs = _find_assign(tree, qualname) + if rhs is None: + return ("FAIL", f"const_missing: {qualname}") + try: + got = ast.literal_eval(rhs) + except _PARSE_ERRORS: + return ("ERROR", f"non_literal: {qualname} is not a literal constant") + if got == want: + return ("PASS", f"const ok: {qualname} == {expected}") + return ("FAIL", f"const_value_mismatch: {qualname} != {expected}") diff --git a/src/dorian/seal.py b/src/dorian/seal.py index 9954e12..d1384d6 100644 --- a/src/dorian/seal.py +++ b/src/dorian/seal.py @@ -37,6 +37,11 @@ ) from dorian.policy import ExecutionPolicy +# C3 prefixes whose program is `::` (so the watched file is the head +# before `::`); `path:` is the exception (its whole operand is the path). Mirrors +# `c3_ref._FILE_OPERAND_FORMS` — kept in sync so a new C3 subgrammar binds its file. +_C3_FILE_OPERAND_FORMS = ("symbol", "string", "regex", "py-signature", "py-const", "code") + class SealError(Exception): """Sealing refused: bad bindings or a checker that is not green right now.""" @@ -116,9 +121,9 @@ def _derive_watch(spec: CheckerSpec, readset: ReadSet) -> CheckerSpec: entry = next((e for e in readset.entries if e.id == spec.program), None) if entry is not None: watch = (entry.uri,) - elif spec.type == "C3": # path:
<p> | (symbol|string|regex)::: + elif spec.type == "C3": # path:<p>
| (symbol|string|regex|py-*|code)::: prefix, _, rest = spec.program.partition(":") - file = rest.partition("::")[0] if prefix in ("symbol", "string", "regex") else rest + file = rest.partition("::")[0] if prefix in _C3_FILE_OPERAND_FORMS else rest if file: watch = (file,) elif spec.type == "C4": # pytest:: the nodeid's file part is the binding @@ -191,7 +196,7 @@ def add(p: str) -> None: ) prefix, _, rest = spec.program.partition(":") if spec.type == "C3": - add(rest.partition("::")[0] if prefix in ("symbol", "string", "regex") else rest) + add(rest.partition("::")[0] if prefix in _C3_FILE_OPERAND_FORMS else rest) elif spec.type == "C4": if prefix == "pytest": # match _derive_watch; other C4 forms ERROR at seal add(rest.partition("::")[0]) diff --git a/src/dorian/strength.py b/src/dorian/strength.py new file mode 100644 index 0000000..21f35dc --- /dev/null +++ b/src/dorian/strength.py @@ -0,0 +1,223 @@ +"""Checker-strength and claim-risk diagnostics — make TRUTH strength visible. + +The protocol keeps two questions apart (``docs/VALIDATION_HONESTY.md``): +- **Trigger coverage** — WHEN a claim is re-checked (binding; ``bindings.py``). +- **Truth strength** — WHETHER a checker can actually FALSIFY the claim (here). + +A green seal says every backed claim held at seal time; it does NOT say the checker +is strong enough to catch a future drift. This module scores that second axis: + +1. classify each checker's truth strength (``existence`` < ``raw_text`` < + ``semantic_text`` < ``snapshot`` < ``data`` < ``structural`` < ``behavioral``; + ``shell_executable`` is opaque), and the strongest backing per claim; +2. flag an ``adequacy_mismatch`` when the claim's ``kind`` needs more than its + checkers provide (a ``behavior`` claim with only an existence/text checker; a + ``quantity`` claim with only an existence checker); +3. run an advisory C4 test-adequacy lint (a bound pytest node with no assertions, or + only a constant assertion, passes vacuously); +4. 
roll those into a per-claim ``risk`` level (``high``/``medium``/``low``). + +It is purely advisory and deterministic: it never executes a checker, never changes +a verdict, trust state, or exit code, and reports repo-relative facts only. C4 +adequacy parses the test file's AST (read-only); it never runs the test. +""" + +from __future__ import annotations + +import ast +from collections.abc import Sequence +from pathlib import Path + +from dorian import pyast +from dorian.model import CheckerSpec, Claim +from dorian.policy import executable_kind + +# Truth-strength rank (higher = can falsify more). `shell_executable` is opaque — +# it may be strong or vacuous — so it is ranked low and flagged, never trusted as +# the strongest backing on reputation alone. +_RANK = { + "unbacked": 0, + "shell_executable": 1, + "existence": 2, + "raw_text": 3, + "semantic_text": 4, + "snapshot": 5, + "data": 6, + "structural": 7, + "behavioral": 8, +} + +_C3_STRENGTH = { + "path": "existence", + "symbol": "existence", + "string": "raw_text", + "regex": "raw_text", + "code": "semantic_text", + "py-signature": "structural", + "py-const": "structural", +} + +# claim kinds and the WEAK strengths that under-verify them +_WEAK_FOR_BEHAVIOR = {"existence", "raw_text", "semantic_text", "snapshot", "data"} + + +def checker_strength(spec: CheckerSpec) -> str: + """Classify a single checker's truth strength (see module docstring).""" + if spec.type == "C1": + return "snapshot" # exact span-hash: snapshot-grade content match + if spec.type == "C4": + return "behavioral" + if spec.type == "C5": + form = spec.program.partition(":")[0] + if form == "shell": + return "shell_executable" + if form == "snapshot": + return "snapshot" + return "data" # rowcount/schema/nullrate/domain/freshness/reconcile + if spec.type == "C3": + return _C3_STRENGTH.get(spec.program.partition(":")[0], "raw_text") + return "raw_text" # unknown checker type: conservative + + +def claim_strength(claim: Claim) -> str: + 
"""The strongest backing across a claim's checkers; ``unbacked`` if none.""" + if not claim.checkers: + return "unbacked" + return max((checker_strength(s) for s in claim.checkers), key=lambda s: _RANK.get(s, 0)) + + +def c4_adequacy(repo: Path, spec: CheckerSpec) -> list[str]: + """Advisory: does a C4 ``pytest:`` node actually assert anything? Parses the test + file's AST (no execution). Returns at most one note. Silent (``[]``) when the + node cannot be located statically (the checker itself reports ``test_gone``) or + when assertions / assertion helpers / ``pytest.raises`` are present.""" + prefix, _, nodeid = spec.program.partition(":") + if prefix != "pytest": + return [] + file, _, rest = nodeid.partition("::") + if not file.strip() or not rest.strip(): + return [] # whole-file node or malformed: not the linter's call + path = repo / file.strip() + if not path.is_file(): + return [] + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + return [] + tree = pyast._parse(text) + if tree is None: + return [] + fn = pyast._find_def(tree, rest.strip().replace("::", ".")) + if not isinstance(fn, ast.FunctionDef | ast.AsyncFunctionDef): + return [] # not found by static walk: do not guess + asserts = [n for n in ast.walk(fn) if isinstance(n, ast.Assert)] + if asserts: + if all(isinstance(a.test, ast.Constant) and bool(a.test.value) for a in asserts): + return [f"c4_adequacy: {rest.strip()} asserts only a constant (vacuous)"] + return [] + if _has_assertion_helper(fn): + return [] + return [f"c4_adequacy: {rest.strip()} has no assertions (may pass vacuously; low confidence)"] + + +def _has_assertion_helper(fn: ast.AST) -> bool: + """unittest ``self.assert*`` calls or a ``pytest.raises`` / bare ``raises`` context + count as assertions for the adequacy lint (conservative: avoid false warnings).""" + for node in ast.walk(fn): + if isinstance(node, ast.Attribute) and node.attr.startswith("assert"): + return True + if isinstance(node, 
ast.Call): + f = node.func + if isinstance(f, ast.Attribute) and f.attr == "raises": + return True + if isinstance(f, ast.Name) and f.id == "raises": + return True + return False + + +def adequacy_notes(repo: Path, claim: Claim) -> list[str]: + """Kind-vs-strength mismatches plus C4 adequacy notes for one claim.""" + if not claim.checkers: + return [] + notes: list[str] = [] + strongest = claim_strength(claim) + behavioral_backed = any(checker_strength(s) == "behavioral" for s in claim.checkers) + if claim.kind == "behavior" and not behavioral_backed and strongest in _WEAK_FOR_BEHAVIOR: + notes.append( + f"adequacy_mismatch: 'behavior' claim backed only by {strongest}" + " — only a C4 pytest checker proves behavior" + ) + if claim.kind == "quantity" and all(checker_strength(s) == "existence" for s in claim.checkers): + notes.append( + "adequacy_mismatch: 'quantity' claim backed only by an existence checker" + " — use py-const:/anchored regex:/typed C5 to verify the value" + ) + for spec in claim.checkers: + if spec.type == "C4": + notes.extend(c4_adequacy(repo, spec)) + return notes + + +def claim_risk( + claim: Claim, flags: Sequence[str], adequacy: Sequence[str] +) -> tuple[str, list[str]]: + """Roll strength + binding flags + adequacy into a level + reasons. Deterministic. 
+ Non-load-bearing claims never score ``high`` (a soft claim is the author's call).""" + reasons: list[str] = [] + level = "low" + if not claim.checkers: + reasons.append("unbacked") + level = "high" if claim.load_bearing else "low" + else: + strongest = claim_strength(claim) + if adequacy: + reasons.append("adequacy_mismatch") + if claim.load_bearing: + level = "high" if strongest in ("existence", "raw_text") else "medium" + high_binding = { + "short-literal", + "ambiguous-mention", + "trigger-only-symbol", + "unwatched-mention", + } + for f in flags: + if f in high_binding: + reasons.append(f"binding:{f}") + if claim.load_bearing and level == "low": + level = "medium" + return level, reasons + + +def analyze( + repo: Path, + claims: Sequence[Claim], + flags_by_id: dict[str, Sequence[str]] | None = None, +) -> list[dict]: + """Per-claim strength/risk diagnostics, in claim order. ``flags_by_id`` (from + ``bindings.analyze``) feeds binding-flag risk reasons when available.""" + flags_by_id = flags_by_id or {} + out: list[dict] = [] + for c in claims: + adequacy = adequacy_notes(repo, c) + level, reasons = claim_risk(c, flags_by_id.get(c.id, ()), adequacy) + out.append( + { + "claim_id": c.id, + "kind": c.kind, + "load_bearing": c.load_bearing, + "strength": claim_strength(c), + "executes": sorted({k for s in c.checkers if (k := executable_kind(s))}), + "adequacy": adequacy, + "risk": level, + "reasons": reasons, + } + ) + return out + + +def summary_line(diags: list[dict]) -> str: + """One deterministic line: risk-level counts. 
For CLI summaries.""" + counts = {"high": 0, "medium": 0, "low": 0} + for d in diags: + counts[d["risk"]] = counts.get(d["risk"], 0) + 1 + return f"claim-risk: {counts['high']} high, {counts['medium']} medium, {counts['low']} low" diff --git a/tests/test_pystructural.py b/tests/test_pystructural.py new file mode 100644 index 0000000..76976e4 --- /dev/null +++ b/tests/test_pystructural.py @@ -0,0 +1,290 @@ +"""C3 Python structural checkers: `py-signature:` and `py-const:`. + +These close two documented weak-verdict ceilings of `symbol:`/`string:`/`regex:`: +- `symbol:` proves a name still EXISTS; it cannot see a signature change. +- `string:`/`regex:` search raw file text, so a fact surviving only in a comment, + docstring, or dead literal still passes. + +Both new forms parse the target file's AST (stdlib `ast`), so: +- they are tolerant of formatting (whitespace, quote style, int base) because the + comparison is over parsed structure / literal VALUES, never raw text; +- they cannot be fooled by a mention in a comment or docstring (the AST has no + such node for the claimed symbol); +- they FAIL on a real structural/value drift, and ERROR (never FAIL) when the + checker itself cannot run (unparseable target, malformed program, non-literal + constant), so a degenerate program never produces a vacuous PASS; +- they remain READ-ONLY and never execute user code. + +The honest ceiling stays visible: a body-only ("gutted body") change keeps the +signature identical, so `py-signature:` PASSes — only a behavior checker (C4) catches +that. The test for it asserts PASS and the docs state the ceiling. 
+""" + +from __future__ import annotations + +from pathlib import Path + +from dorian.checkers.base import CheckContext, Verdict, run_checker +from dorian.model import CheckerSpec, Claim + + +def _run(repo: Path, program: str) -> object: + claim = Claim( + id="c", + text="x", + kind="behavior", + load_bearing=False, + checkers=(CheckerSpec(type="C3", program=program),), + ) + return run_checker(CheckContext(repo=repo, claim=claim), 0) + + +def _w(repo: Path, rel: str, content: str) -> None: + p = repo / rel + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(content, encoding="utf-8") + + +# --- py-signature ------------------------------------------------------------- + + +AUTH = '''"""mod.""" + + +def verify_token(token: str) -> bool: + """Verify an RS256 JWT.""" + return bool(token) +''' + + +def test_py_signature_unchanged_passes(tmp_path: Path) -> None: + _w(tmp_path, "m.py", AUTH) + assert ( + _run(tmp_path, "py-signature:m.py::verify_token::token: str -> bool").verdict + is Verdict.PASS + ) + + +def test_py_signature_names_only_ignores_unspecified_annotations(tmp_path: Path) -> None: + """A spec that lists only param names compares only names/order/kind — the + annotation and return are NOT specified, so they are not compared.""" + _w(tmp_path, "m.py", AUTH) + assert _run(tmp_path, "py-signature:m.py::verify_token::token").verdict is Verdict.PASS + + +def test_py_signature_param_rename_fails(tmp_path: Path) -> None: + _w(tmp_path, "m.py", AUTH) + res = _run(tmp_path, "py-signature:m.py::verify_token::tok") + assert res.verdict is Verdict.FAIL + assert "signature" in res.detail + + +def test_py_signature_param_count_and_order_fail(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "def f(a, b):\n return a\n") + assert _run(tmp_path, "py-signature:m.py::f::a").verdict is Verdict.FAIL # missing b + assert _run(tmp_path, "py-signature:m.py::f::b, a").verdict is Verdict.FAIL # reordered + assert _run(tmp_path, "py-signature:m.py::f::a, b").verdict is 
Verdict.PASS + + +def test_py_signature_default_change(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "def f(x=1):\n return x\n") + assert _run(tmp_path, "py-signature:m.py::f::x=2").verdict is Verdict.FAIL + assert _run(tmp_path, "py-signature:m.py::f::x=1").verdict is Verdict.PASS + # a spec that omits the default does not compare it (only names checked) + assert _run(tmp_path, "py-signature:m.py::f::x").verdict is Verdict.PASS + + +def test_py_signature_return_annotation_change(tmp_path: Path) -> None: + _w(tmp_path, "m.py", AUTH) + assert _run(tmp_path, "py-signature:m.py::verify_token::token -> int").verdict is Verdict.FAIL + assert _run(tmp_path, "py-signature:m.py::verify_token::token -> bool").verdict is Verdict.PASS + + +def test_py_signature_formatting_only_passes(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "def f( x ,y , z ):\n return x\n") + assert _run(tmp_path, "py-signature:m.py::f::x, y, z").verdict is Verdict.PASS + + +def test_py_signature_gutted_body_still_passes_documented_ceiling(tmp_path: Path) -> None: + """The signature is unchanged but the body is inverted: py-signature PASSes. + This is the documented trigger-vs-truth ceiling — only a behavior checker (C4) + catches a body-only change. 
The test pins the ceiling so it cannot regress + into a silent over-promise.""" + _w(tmp_path, "m.py", "def is_admin(user):\n return user.role == 'admin'\n") + before = _run(tmp_path, "py-signature:m.py::is_admin::user") + _w(tmp_path, "m.py", "def is_admin(user):\n return True # gutted\n") + after = _run(tmp_path, "py-signature:m.py::is_admin::user") + assert before.verdict is Verdict.PASS and after.verdict is Verdict.PASS + + +def test_py_signature_missing_function_fails(tmp_path: Path) -> None: + _w(tmp_path, "m.py", AUTH) + res = _run(tmp_path, "py-signature:m.py::nope::x") + assert res.verdict is Verdict.FAIL + assert "missing" in res.detail + + +def test_py_signature_async_flag(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "async def fetch(url):\n return url\n") + assert _run(tmp_path, "py-signature:m.py::fetch::async url").verdict is Verdict.PASS + # async not specified -> not compared (sync/async both accepted) + assert _run(tmp_path, "py-signature:m.py::fetch::url").verdict is Verdict.PASS + # require async on a sync function -> FAIL + _w(tmp_path, "m.py", "def fetch(url):\n return url\n") + assert _run(tmp_path, "py-signature:m.py::fetch::async url").verdict is Verdict.FAIL + + +def test_py_signature_method_via_dotted_qualname(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "class A:\n def login(self, user, pw):\n return True\n") + assert _run(tmp_path, "py-signature:m.py::A.login::self, user, pw").verdict is Verdict.PASS + assert _run(tmp_path, "py-signature:m.py::A.login::self, user").verdict is Verdict.FAIL + + +def test_py_signature_unparseable_target_is_error(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "def f(:::\n") # syntax error + assert _run(tmp_path, "py-signature:m.py::f::x").verdict is Verdict.ERROR + + +def test_py_signature_bad_spec_is_error(tmp_path: Path) -> None: + _w(tmp_path, "m.py", AUTH) + assert _run(tmp_path, "py-signature:m.py::verify_token::((bad").verdict is Verdict.ERROR + # empty needle (no qualname) is a bad program + 
assert _run(tmp_path, "py-signature:m.py::").verdict is Verdict.ERROR + + +def test_py_signature_missing_file_is_fail(tmp_path: Path) -> None: + assert _run(tmp_path, "py-signature:gone.py::f::x").verdict is Verdict.FAIL + + +def test_py_signature_path_escape_is_error(tmp_path: Path) -> None: + assert _run(tmp_path, "py-signature:../../etc/passwd::f::x").verdict is Verdict.ERROR + + +def test_py_signature_comment_cannot_create_false_pass(tmp_path: Path) -> None: + """A function that exists only as text inside a comment is not in the AST.""" + _w(tmp_path, "m.py", "# def ghost(a, b): pass\nVALUE = 1\n") + assert _run(tmp_path, "py-signature:m.py::ghost::a, b").verdict is Verdict.FAIL + + +# --- py-const ----------------------------------------------------------------- + +CONFIG = 'TIMEOUT = 30\nRETRIES = 3\nALGO = "RS256"\nPORT: int = 8080\n' + + +def test_py_const_unchanged_passes(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG) + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.PASS + + +def test_py_const_value_change_fails(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG.replace("TIMEOUT = 30", "TIMEOUT = 10")) + res = _run(tmp_path, "py-const:c.py::TIMEOUT::30") + assert res.verdict is Verdict.FAIL + assert "value" in res.detail or "const" in res.detail + + +def test_py_const_formatting_and_base_tolerant(tmp_path: Path) -> None: + _w(tmp_path, "c.py", "TIMEOUT = 0x1E\n") # hex 30, extra spaces + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.PASS + + +def test_py_const_string_quote_tolerant(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG) + assert _run(tmp_path, 'py-const:c.py::ALGO::"RS256"').verdict is Verdict.PASS + assert _run(tmp_path, "py-const:c.py::ALGO::'RS256'").verdict is Verdict.PASS + assert _run(tmp_path, 'py-const:c.py::ALGO::"HS256"').verdict is Verdict.FAIL + + +def test_py_const_annassign(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG) + assert _run(tmp_path, 
"py-const:c.py::PORT::8080").verdict is Verdict.PASS + assert _run(tmp_path, "py-const:c.py::PORT::9090").verdict is Verdict.FAIL + + +def test_py_const_class_attribute_via_dotted(tmp_path: Path) -> None: + _w(tmp_path, "c.py", "class C:\n LIMIT = 5\n") + assert _run(tmp_path, "py-const:c.py::C.LIMIT::5").verdict is Verdict.PASS + assert _run(tmp_path, "py-const:c.py::C.LIMIT::6").verdict is Verdict.FAIL + + +def test_py_const_missing_is_fail(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG) + res = _run(tmp_path, "py-const:c.py::NOPE::1") + assert res.verdict is Verdict.FAIL + assert "missing" in res.detail + + +def test_py_const_non_literal_rhs_is_error(tmp_path: Path) -> None: + """A non-literal RHS cannot be compared to a literal value: ERROR (the checker + cannot determine the value), never a vacuous PASS or a false FAIL.""" + _w(tmp_path, "c.py", "TIMEOUT = compute_timeout()\n") + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.ERROR + + +def test_py_const_bad_expected_value_is_error(tmp_path: Path) -> None: + _w(tmp_path, "c.py", CONFIG) + assert _run(tmp_path, "py-const:c.py::TIMEOUT::not-a-literal(").verdict is Verdict.ERROR + + +def test_py_const_comment_and_docstring_survival_does_not_pass(tmp_path: Path) -> None: + """The value surviving only in a comment or docstring is not an assignment.""" + _w(tmp_path, "c.py", '"""TIMEOUT = 30 in the docstring."""\n# TIMEOUT = 30\nOTHER = 1\n') + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.FAIL + + +# --- end-to-end: the new forms bind, seal born-verifiable, and re-check --------- + + +def test_structural_forms_verify_seal_and_revalidate(fixture_repo: Path) -> None: + """A py-signature, a py-const, and a code: claim auto-capture their file, seal + born-verifiable (all hold now), and on the canonical drift commit: + - the renamed-but-unchanged signature stays VERIFIED (rename-resolved), + - the changed constant and the removed route fold BROKEN -> REVOKED (exit 
4).""" + import json + + from conftest import apply_three_change_commit, git + from dorian import cli + + claims = { + "claims": [ + { + "id": "vt-sig", + "text": "verify_token(token) is defined in src/auth.py.", + "kind": "behavior", + "load_bearing": True, + "checkers": [ + { + "type": "C3", + "program": "py-signature:src/auth.py::verify_token::token: str -> bool", + } + ], + }, + { + "id": "timeout-30", + "text": "The default request timeout is 30 seconds.", + "kind": "quantity", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "py-const:src/config.py::TIMEOUT::30"}], + }, + { + "id": "login-route", + "text": "Login is served at /v1/login.", + "kind": "reference", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "code:src/routes.py::/v1/login"}], + }, + ] + } + cp = fixture_repo / "claims.json" + cp.write_text(json.dumps(claims), encoding="utf-8") + base = git(fixture_repo, "rev-parse", "HEAD") + + assert ( + cli.main(["--repo", str(fixture_repo), "verify", "docs/design.md", "--claims", str(cp)]) + == 0 + ) + assert (fixture_repo / "docs/design.md.warrant").is_file() + + apply_three_change_commit(fixture_repo) + rc = cli.main(["--repo", str(fixture_repo), "revalidate", "--since", base]) + assert rc == cli.EXIT_REVOKED # a load-bearing claim broke -> exit 4 diff --git a/tests/test_semantic_context.py b/tests/test_semantic_context.py new file mode 100644 index 0000000..d045988 --- /dev/null +++ b/tests/test_semantic_context.py @@ -0,0 +1,106 @@ +"""C3 `code:` — regex over comment/docstring-stripped Python. + +`string:`/`regex:` search raw file text, so a fact that survives only in a comment, +docstring, or dead literal passes. `code:` parses the file and blanks comments and +docstrings before matching, so the SAME fact in actual code (an assignment, a call +arg, a dict-key route string, a decorator) still matches, but a comment/docstring +survival does not. 
It is Python-only (the comment/docstring model is Python's); +a non-Python target ERRORs (cannot run), never silently passes. + +These pin the WP4 acceptance matrix and the contrast against raw `string:`/`regex:`. +""" + +from __future__ import annotations + +from pathlib import Path + +from dorian.checkers.base import CheckContext, Verdict, run_checker +from dorian.model import CheckerSpec, Claim + + +def _run(repo: Path, program: str) -> object: + claim = Claim( + id="c", + text="x", + kind="reference", + load_bearing=False, + checkers=(CheckerSpec(type="C3", program=program),), + ) + return run_checker(CheckContext(repo=repo, claim=claim), 0) + + +def _w(repo: Path, rel: str, content: str) -> None: + p = repo / rel + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(content, encoding="utf-8") + + +def test_code_matches_real_assignment(tmp_path: Path) -> None: + _w(tmp_path, "config.py", "TIMEOUT = 30\n") + assert _run(tmp_path, r"code:config.py::TIMEOUT\s*=\s*30").verdict is Verdict.PASS + + +def test_code_formatting_tolerant(tmp_path: Path) -> None: + _w(tmp_path, "config.py", "TIMEOUT=30\n") + assert _run(tmp_path, r"code:config.py::TIMEOUT\s*=\s*30").verdict is Verdict.PASS + + +def test_code_ignores_comment_survival(tmp_path: Path) -> None: + """The fact lives ONLY in a comment: raw string:/regex: pass, code: FAILs.""" + src = "# TIMEOUT = 30 (old default)\nTIMEOUT = 10\n" + _w(tmp_path, "config.py", src) + # raw text search still sees the comment -> PASS (the false-pass class) + assert _run(tmp_path, r"regex:config.py::TIMEOUT\s*=\s*30").verdict is Verdict.PASS + # code: ignores the comment -> FAIL (no real assignment of 30) + res = _run(tmp_path, r"code:config.py::TIMEOUT\s*=\s*30") + assert res.verdict is Verdict.FAIL + assert "code_missing" in res.detail + + +def test_code_ignores_docstring_survival(tmp_path: Path) -> None: + src = '"""The default TIMEOUT = 30 historically."""\nTIMEOUT = 10\n' + _w(tmp_path, "config.py", src) + assert 
_run(tmp_path, r"string:config.py::TIMEOUT = 30").verdict is Verdict.PASS # raw + assert _run(tmp_path, r"code:config.py::TIMEOUT\s*=\s*30").verdict is Verdict.FAIL # semantic + + +def test_code_keeps_real_string_literals_route(tmp_path: Path) -> None: + """A route path lives inside a real string literal (dict key) — kept, not a + docstring. code: must still find it.""" + _w(tmp_path, "routes.py", 'ROUTES = {\n "/v1/login": "auth.login",\n}\n') + assert _run(tmp_path, "code:routes.py::/v1/login").verdict is Verdict.PASS + + +def test_code_keeps_decorator_argument(tmp_path: Path) -> None: + _w(tmp_path, "api.py", '@app.route("/health")\ndef health():\n return "ok"\n') + assert _run(tmp_path, "code:api.py::/health").verdict is Verdict.PASS + + +def test_code_keeps_call_argument(tmp_path: Path) -> None: + _w(tmp_path, "api.py", "connect(timeout=30, retries=3)\n") + assert _run(tmp_path, r"code:api.py::timeout\s*=\s*30").verdict is Verdict.PASS + + +def test_code_non_python_target_is_error(tmp_path: Path) -> None: + _w(tmp_path, "notes.txt", "this is not python: def x(::\n") + res = _run(tmp_path, "code:notes.txt::def x") + assert res.verdict is Verdict.ERROR + assert "unparseable" in res.detail + + +def test_code_missing_file_is_fail(tmp_path: Path) -> None: + assert _run(tmp_path, "code:gone.py::anything").verdict is Verdict.FAIL + + +def test_code_over_length_pattern_is_error(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "x = 1\n") + assert _run(tmp_path, "code:m.py::" + ("a" * 600)).verdict is Verdict.ERROR + + +def test_code_bad_regex_is_error(tmp_path: Path) -> None: + _w(tmp_path, "m.py", "x = 1\n") + assert _run(tmp_path, "code:m.py::(unclosed").verdict is Verdict.ERROR + + +def test_code_path_escape_is_error(tmp_path: Path) -> None: + assert _run(tmp_path, "code:../../etc/passwd::root").verdict is Verdict.ERROR diff --git a/tests/test_strength.py b/tests/test_strength.py new file mode 100644 index 0000000..4977199 --- /dev/null +++ b/tests/test_strength.py 
@@ -0,0 +1,219 @@ +"""Checker-strength and claim-risk diagnostics (advisory; no execution). + +Pins the WP2 acceptance matrix: a deterministic classification of each checker's +TRUTH strength (distinct from trigger coverage), kind-vs-strength adequacy +mismatches, an advisory C4 test-adequacy lint (WP6: zero/constant assertions), and +a per-claim risk level. Everything here is advisory — it never runs a checker, +changes a verdict/trust state, or moves an exit code. +""" + +from __future__ import annotations + +from pathlib import Path + +from dorian import strength +from dorian.model import CheckerSpec, Claim + + +def _claim(kind: str, load_bearing: bool, *programs: tuple[str, str]) -> Claim: + return Claim( + id="c", + text="x", + kind=kind, + load_bearing=load_bearing, + checkers=tuple(CheckerSpec(type=t, program=p) for t, p in programs), + ) + + +# --- checker strength classification ------------------------------------------ + + +def test_checker_strength_classification() -> None: + f = strength.checker_strength + assert f(CheckerSpec(type="C3", program="path:src/x.py")) == "existence" + assert f(CheckerSpec(type="C3", program="symbol:src/x.py::F")) == "existence" + assert f(CheckerSpec(type="C3", program="string:src/x.py::lit")) == "raw_text" + assert f(CheckerSpec(type="C3", program="regex:src/x.py::p")) == "raw_text" + assert f(CheckerSpec(type="C3", program="code:src/x.py::p")) == "semantic_text" + assert f(CheckerSpec(type="C3", program="py-signature:src/x.py::F::a")) == "structural" + assert f(CheckerSpec(type="C3", program="py-const:src/x.py::K::1")) == "structural" + assert f(CheckerSpec(type="C4", program="pytest:t.py::test_a")) == "behavioral" + assert f(CheckerSpec(type="C5", program="rowcount:d.csv::>0")) == "data" + assert f(CheckerSpec(type="C5", program="snapshot:d.csv")) == "snapshot" + assert f(CheckerSpec(type="C5", program="shell:grep x f")) == "shell_executable" + assert f(CheckerSpec(type="C1", program="rs-0")) == "snapshot" + + +def 
test_claim_strength_is_the_strongest_backing(tmp_path: Path) -> None: + c = _claim("behavior", True, ("C3", "symbol:a.py::F"), ("C4", "pytest:t.py::test_a")) + assert strength.claim_strength(c) == "behavioral" # the C4 dominates the symbol + assert strength.claim_strength(_claim("fact", True)) == "unbacked" + + +# --- adequacy mismatch (kind vs strength) ------------------------------------- + + +def test_behavior_claim_backed_only_by_symbol_warns(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("behavior", True, ("C3", "symbol:a.py::F"))]) + assert rec["strength"] == "existence" + assert any("adequacy_mismatch" in n for n in rec["adequacy"]) + assert rec["risk"] == "high" + + +def test_behavior_claim_backed_only_by_regex_warns(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("behavior", True, ("C3", "regex:a.py::F\\("))]) + assert any("adequacy_mismatch" in n for n in rec["adequacy"]) # raw_text is weak for behavior + + +def test_behavior_claim_backed_by_pytest_has_no_mismatch(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("behavior", True, ("C4", "pytest:t.py::test_a"))]) + # the test file does not exist here, so the C4 adequacy lint stays silent (the + # checker itself reports test_gone) — and there is no kind/strength mismatch + assert not any("adequacy_mismatch" in n for n in rec["adequacy"]) + assert rec["strength"] == "behavioral" + + +def test_quantity_claim_with_value_checker_has_no_mismatch(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("quantity", True, ("C3", "py-const:c.py::T::30"))]) + assert rec["adequacy"] == [] + assert rec["risk"] == "low" + + +def test_quantity_claim_backed_only_by_existence_warns(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("quantity", True, ("C3", "symbol:c.py::T"))]) + assert any("adequacy_mismatch" in n for n in rec["adequacy"]) + + +def test_data_claim_with_typed_c5_has_no_mismatch(tmp_path: Path) -> None: + 
(rec,) = strength.analyze(tmp_path, [_claim("quantity", True, ("C5", "rowcount:d.csv::>0"))]) + assert rec["adequacy"] == [] + + +# --- unbacked risk ------------------------------------------------------------ + + +def test_unbacked_load_bearing_is_high_risk(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("fact", True)]) + assert rec["risk"] == "high" + assert "unbacked" in rec["reasons"] + + +def test_unbacked_non_load_bearing_is_low(tmp_path: Path) -> None: + (rec,) = strength.analyze(tmp_path, [_claim("fact", False)]) + assert rec["risk"] in ("low", "medium") # advisory, not high for a non-load-bearing claim + assert rec["risk"] != "high" + + +# --- C4 test adequacy lint (WP6) ---------------------------------------------- + + +def _w(repo: Path, rel: str, content: str) -> None: + p = repo / rel + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(content, encoding="utf-8") + + +def test_c4_zero_assertion_test_warns(tmp_path: Path) -> None: + _w(tmp_path, "t.py", "def test_a():\n do_something()\n") + notes = strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t.py::test_a")) + assert any("no assertion" in n.lower() for n in notes) + + +def test_c4_assert_constant_warns(tmp_path: Path) -> None: + _w(tmp_path, "t.py", "def test_a():\n assert True\n") + notes = strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t.py::test_a")) + assert any("constant" in n.lower() for n in notes) + + +def test_c4_normal_asserting_test_is_silent(tmp_path: Path) -> None: + _w(tmp_path, "t.py", "def test_a():\n x = f()\n assert x == 42\n") + notes = strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t.py::test_a")) + assert notes == [] + + +def test_c4_assertion_helper_is_not_flagged(tmp_path: Path) -> None: + """unittest-style assertion methods and pytest.raises count as assertions.""" + _w(tmp_path, "t.py", "class T:\n def test_a(self):\n self.assertEqual(f(), 42)\n") + notes = 
strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t.py::T::test_a")) + assert notes == [] + _w( + tmp_path, + "t2.py", + "import pytest\ndef test_b():\n with pytest.raises(ValueError):\n f()\n", + ) + notes = strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t2.py::test_b")) + assert notes == [] + + +def test_c4_missing_or_unfound_node_is_silent(tmp_path: Path) -> None: + """The checker itself reports test_gone; the linter does not guess.""" + assert strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:gone.py::t")) == [] + _w(tmp_path, "t.py", "def test_a():\n assert 1\n") + assert ( + strength.c4_adequacy(tmp_path, CheckerSpec(type="C4", program="pytest:t.py::test_z")) == [] + ) + + +def test_c4_adequacy_surfaces_in_behavior_claim(tmp_path: Path) -> None: + _w(tmp_path, "t.py", "def test_a():\n f()\n") + (rec,) = strength.analyze(tmp_path, [_claim("behavior", True, ("C4", "pytest:t.py::test_a"))]) + assert any("assertion" in n.lower() for n in rec["adequacy"]) + + +# --- executes field + determinism --------------------------------------------- + + +def test_executes_field(tmp_path: Path) -> None: + (rec,) = strength.analyze( + tmp_path, [_claim("behavior", True, ("C4", "pytest:t.py::t"), ("C5", "shell:echo hi"))] + ) + assert set(rec["executes"]) == {"pytest", "shell"} + + +def test_analyze_is_deterministic(tmp_path: Path) -> None: + claims = [ + _claim("behavior", True, ("C3", "symbol:a.py::F")), + _claim("quantity", True, ("C3", "py-const:c.py::T::30")), + ] + assert strength.analyze(tmp_path, claims) == strength.analyze(tmp_path, claims) + + +# --- CLI surface: `dorian bindings` carries strength (JSON + human) ------------ + + +def test_cmd_bindings_surfaces_strength(fixture_repo: Path, capsys) -> None: + """A behavior claim backed only by symbol: seals (exit 0) but `dorian bindings` + must surface its weak strength, high risk, and adequacy mismatch — JSON + human.""" + import json + + from dorian 
import cli + + claims = { + "claims": [ + { + "id": "vt-behavior", + "text": "verify_token authenticates the request.", + "kind": "behavior", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "symbol:src/auth.py::verify_token"}], + } + ] + } + cp = fixture_repo / "claims.json" + cp.write_text(json.dumps(claims), encoding="utf-8") + assert ( + cli.main(["--repo", str(fixture_repo), "verify", "docs/design.md", "--claims", str(cp)]) + == 0 + ) + capsys.readouterr() + + assert cli.main(["--repo", str(fixture_repo), "--json", "bindings", "docs/design.md"]) == 0 + payload = json.loads(capsys.readouterr().out) + (diag,) = payload["claims"] + assert diag["strength"]["strength"] == "existence" + assert diag["strength"]["risk"] == "high" + assert any("adequacy_mismatch" in n for n in diag["strength"]["adequacy"]) + + assert cli.main(["--repo", str(fixture_repo), "bindings", "docs/design.md"]) == 0 + out = capsys.readouterr().out + assert "strength: existence" in out and "risk: high" in out From 6a8298c5ef2deeb1e8fd3b70bb399706201caf28 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 17:40:21 +0530 Subject: [PATCH 02/13] feat(v1): trusted-base checker-source mode for public/fork PR safety (WP7) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit revalidate --checker-source {head,base} (default head) + Action checker_trust input. base mode resolves each candidate claim's checker SPEC from the --since (base) ref and runs it against PR-head sources, so a PR-added or PR-modified executable (C4/C5 shell) checker is never executed and a rewritten checker cannot self-attest a verdict (base spec wins; the change is surfaced as a trust-root note). Fail-closed: a missing/tampered base sidecar ERRORs (never executed), never BROKEN, never green. Composes with deny-exec. NOT a sandbox — a base-approved pytest checker can still run head code, stated in every surface. 
- revalidate.py: checker_source param, _load_base_warrant (integrity-checked base sidecar via gitio.file_at_ref), RevalResult.notes, text/md rendering of notes. - cli.py/commands.py: --checker-source flag + DORIAN_CHECKER_SOURCE env fallback; base requires --since. - action.yml: checker_trust input (default head) -> DORIAN_CHECKER_SOURCE; README + Inputs table updated (also documents the pre-existing deny_exec/deny_shell inputs). - docs: TRUSTED_BASE_ACTION_DESIGN status -> IMPLEMENTED; SECURITY_BOUNDARY public-fork checklist updated (trust-root conditions met; sandboxing still out of scope). - tests/test_trusted_base.py: the §6 exploit matrix (10 cases) — each "executed?" case proven by a sentinel touch that must NOT appear under base mode. Co-Authored-By: Claude Opus 4.8 (1M context) --- V1_IMPLEMENTATION_TRACKER.md | 12 +- action/README.md | 47 ++-- action/action.yml | 14 ++ docs/SECURITY_BOUNDARY.md | 44 ++-- docs/TRUSTED_BASE_ACTION_DESIGN.md | 15 +- src/dorian/cli.py | 10 + src/dorian/commands.py | 16 ++ src/dorian/revalidate.py | 93 +++++++- tests/test_action_security_defaults.py | 12 + tests/test_render_md.py | 11 +- tests/test_trusted_base.py | 292 +++++++++++++++++++++++++ 11 files changed, 519 insertions(+), 47 deletions(-) create mode 100644 tests/test_trusted_base.py diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md index 3cf4254..44db552 100644 --- a/V1_IMPLEMENTATION_TRACKER.md +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -100,12 +100,14 @@ Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-documen | WP | Title | Status | |---|---|---| | WP1 | docs/evidence hygiene | TODO | -| WP2 | checker-strength / claim-risk linter | TODO | -| WP3 | Python structural checkers (py-signature, py-const) | TODO | -| WP4 | semantic-context source search | TODO | +| WP2 | checker-strength / claim-risk linter | DONE (strength.py; surfaced in `bindings` + binding-gate warn; 19 tests) | +| WP3 | Python structural checkers (py-signature, 
py-const) | DONE (pyast.py + C3 subgrammars; 27 tests incl. e2e) | +| WP4 | semantic-context source search (`code:`) | DONE (pyast.code_only_python + C3 `code:`; 12 tests) | | WP5 | multi-index binding (config-key) | TODO | -| WP6 | C4 test-adequacy lint | TODO | -| WP7 | trusted-base checker-source mode | TODO | +| WP6 | C4 test-adequacy lint | DONE (strength.c4_adequacy; folded into WP2 tests) | +| WP7 | trusted-base checker-source mode | DONE (revalidate --checker-source base + Action checker_trust; 10-case exploit matrix) | | WP8 | warrant-quality mutation harness | TODO | | WP9 | current-version benchmark results | TODO | | WP10 | V1 release prep / decision | TODO | + +Commits so far: `58b39e2` (WP3/4/2/6), trusted-base (WP7) next. diff --git a/action/README.md b/action/README.md index a4eafc4..232cc58 100644 --- a/action/README.md +++ b/action/README.md @@ -81,16 +81,30 @@ claim so a broken fact re-verifies). See `SECURITY.md` and deny_exec: "true" # C4/C5 ERROR instead of executing ``` -**Current recommendation: trusted/internal repositories.** Until a -trusted-base mode exists (execute only checker specs already present on the -base branch; parse/lint — never execute — new or changed PR sidecars; skip -C5 `shell:` and other executable checkers in untrusted mode — designed in -[`docs/TRUSTED_BASE_ACTION_DESIGN.md`](../docs/TRUSTED_BASE_ACTION_DESIGN.md), -not yet implemented), this Action is recommended for repositories where -everyone who can open a PR is already trusted to run code in CI, or with -`deny_exec: true` for untrusted PRs. For public repositories, treat any PR that -touches a `.warrant` file as a code change requiring the same review as a CI -change. +**trusted-base mode (`checker_trust: base`).** This is the trust-root fix for the +self-attested-verdict problem. 
With `checker_trust: base`, the Action resolves each +claim's checker SPEC from the **base ref** and runs it against the PR-head sources, so +a PR-added or PR-modified executable checker is never executed and a rewritten checker +cannot self-attest a verdict — the base-approved spec wins, and the change is surfaced +in the PR comment. A missing or tampered base sidecar **fails closed** (ERRORED, never +executed). Implemented and proven by the +[trusted-base test matrix](../docs/TRUSTED_BASE_ACTION_DESIGN.md). + +```yaml +# public / forked-PR posture: trusted checker specs + no code execution +- uses: ajaysurya1221/dorian/action@main + with: + checker_trust: base # run only base-approved checker specs + deny_exec: "true" # and refuse to execute even those (belt and braces) +``` + +**It is a checker-source trust root, not a sandbox.** A base-approved `pytest:` checker +can still import and execute PR-head code, so for fully untrusted forks combine +`checker_trust: base` **with** `deny_exec: true` (or external isolation). Default +`checker_trust: head` is unchanged and correct for trusted/internal repositories, where +everyone who can open a PR is already trusted to run code in CI. For public repositories, +treat any PR that touches a `.warrant` file as a code change requiring the same review as +a CI change. 
Hard rules either way: @@ -107,11 +121,14 @@ Hard rules either way: ## Inputs -| input | default | meaning | -| --------- | -------------------------------------------- | ------------------------------------------------------------------------ | -| `fail_on` | `revoked` | when to fail the step: `revoked` (exit 4 only), `degraded` (3 or 4), `never` | -| `base` | `${{ github.event.pull_request.base.sha }}` | git ref passed to `dorian revalidate --since` | -| `install` | `dorian-vwp` | pip spec; pin `dorian-vwp==0.6.*`, or `.` for checkout installs | +| input | default | meaning | +| --------------- | -------------------------------------------- | ------------------------------------------------------------------------ | +| `fail_on` | `revoked` | when to fail the step: `revoked` (exit 4 only), `degraded` (3 or 4), `never` | +| `base` | `${{ github.event.pull_request.base.sha }}` | git ref passed to `dorian revalidate --since` | +| `install` | `dorian-vwp` | pip spec; pin `dorian-vwp==0.6.*`, or `.` for checkout installs | +| `deny_exec` | `false` | refuse to run executable checkers (C4 pytest, C5 shell): they ERROR. For untrusted/fork PRs; fail-closed, not a sandbox | +| `deny_shell` | `false` | narrower than `deny_exec`: block only C5 shell, still allow C4 pytest | +| `checker_trust` | `head` | `head` runs the checked-out checker spec (trusted repos); `base` runs the base-ref spec so PR-authored executable checkers never run (public/fork PRs) | Until the first PyPI release of `dorian-vwp`, set `install` to a source spec: `install: 'dorian-vwp @ git+https://github.com/ajaysurya1221/dorian.git'`. diff --git a/action/action.yml b/action/action.yml index 7683d64..77fabdc 100644 --- a/action/action.yml +++ b/action/action.yml @@ -41,6 +41,17 @@ inputs: Narrower than deny_exec: block only C5 shell, still allow C4 pytest. 
required: false default: "false" + checker_trust: + description: >- + Which sidecar a claim's checker SPEC is read from (the sources checked are + always the PR head). 'head' (default) runs the checked-out spec — correct for + trusted/internal repos. 'base' resolves each spec from the base ref, so a + PR-added or PR-modified executable checker is never executed and a rewritten + checker cannot self-attest a verdict — for public repos taking forked PRs. + Fail-closed, NOT a sandbox: a base-approved pytest checker can still execute + PR-head code. See docs/TRUSTED_BASE_ACTION_DESIGN.md. + required: false + default: head runs: using: composite @@ -64,6 +75,9 @@ runs: # unchanged. Set deny_exec: true for untrusted/fork PRs. DORIAN_DENY_EXEC: ${{ inputs.deny_exec }} DORIAN_DENY_SHELL: ${{ inputs.deny_shell }} + # checker_trust=base resolves checker specs from the base ref so a + # PR-authored executable checker never runs. 'head' (default) is unchanged. + DORIAN_CHECKER_SOURCE: ${{ inputs.checker_trust }} run: | set +e dorian sync diff --git a/docs/SECURITY_BOUNDARY.md b/docs/SECURITY_BOUNDARY.md index 6cff620..28e2bb3 100644 --- a/docs/SECURITY_BOUNDARY.md +++ b/docs/SECURITY_BOUNDARY.md @@ -58,10 +58,8 @@ dorian revalidate --since HEAD Do **not** run `verify` / `seal` / `revalidate` / `rebind` on claims from a source you do not trust without `--deny-exec` (`rebind` re-runs every checker to -re-seal, so it executes code too). Do **not** market or wire up public-fork-PR CI -as safe: the trusted-base design ([docs/TRUSTED_BASE_ACTION_DESIGN.md](TRUSTED_BASE_ACTION_DESIGN.md)) -that would make it safe is not implemented or tested yet. Do not use -`pull_request_target` with an untrusted-head checkout. +re-seal, so it executes code too). Do not use `pull_request_target` with an +untrusted-head checkout. 
```bash # untrusted context: remove the ability to execute code @@ -72,14 +70,30 @@ DORIAN_DENY_EXEC=1 dorian revalidate --since origin/main A blocked checker ERRORs, so a blocked load-bearing claim cannot seal and cannot silently pass revalidation — deny-exec fails closed. -## What must be true before public-fork CI can be recommended - -1. Checker programs are taken from the **trusted base ref**, never from untrusted - head, unless explicitly allowlisted. -2. deny-exec (or stronger) is the **default** for fork PRs. -3. There are tests that simulate a fork/head sidecar trying to execute shell and - prove the Action blocks it. -4. No `pull_request_target` footgun in the documented workflow. - -Until all four hold, the honest statement is: **trusted/internal repositories, -or `--deny-exec` everywhere else.** +## Public-fork CI: `--checker-source base` (a trust root, not a sandbox) + +`dorian revalidate --checker-source base` (Action input `checker_trust: base`) +resolves each claim's checker SPEC from the trusted **base ref**, then runs it +against the PR-head sources. A PR-added or PR-modified executable (C4/C5 `shell:`) +checker is therefore never executed; a rewritten checker cannot self-attest a +verdict (the base spec wins, and the change is surfaced); a missing or tampered +base sidecar **fails closed** (ERRORED, never executed). The +[trusted-base test matrix](TRUSTED_BASE_ACTION_DESIGN.md) proves each case with a +filesystem side effect that must not appear. + +This is a **checker-source trust root, not a sandbox.** A base-approved `pytest:` +checker can still import and execute PR-head code, so the honest recommendation for +public forks is `checker_trust: base` **with `deny_exec: true`** (or stronger +external isolation), never "safe for arbitrary fork PRs". The four conditions below +now hold for the *trust-root* threat; sandboxing executed code remains out of scope. + +1. ✅ Checker specs are taken from the **trusted base ref** (`--checker-source base`). 
+2. ✅ deny-exec is available and recommended for fork PRs (`deny_exec: true`). +3. ✅ Tests simulate a fork/head sidecar trying to execute shell/pytest and prove it + is not run (`tests/test_trusted_base.py`). +4. ✅ No `pull_request_target` in the documented workflow. + +The residual, stated plainly: even in base mode a base-approved code-executing +checker runs PR-head code, so **without deny-exec or external sandboxing this is +not safe for fully untrusted code.** For trusted/internal repos, `head` mode +remains correct and unchanged. diff --git a/docs/TRUSTED_BASE_ACTION_DESIGN.md b/docs/TRUSTED_BASE_ACTION_DESIGN.md index c1a3446..dde610b 100644 --- a/docs/TRUSTED_BASE_ACTION_DESIGN.md +++ b/docs/TRUSTED_BASE_ACTION_DESIGN.md @@ -1,8 +1,13 @@ -# Trusted-base Action mode — design (not implemented) - -> **HUMAN REVIEW REQUIRED.** This is a design only. It changes the Action's security model and adds -> checker-execution gating — it must not be implemented without explicit human sign-off and the test -> matrix in §6. No code in this repo implements it yet. +# Trusted-base Action mode — design + status + +> **STATUS: IMPLEMENTED (V1).** `dorian revalidate --checker-source {head,base}` and the Action +> `checker_trust: head|base` input now implement this design. Default is `head` (today's behavior, +> unchanged). The §6 test matrix is implemented in `tests/test_trusted_base.py` (PR-added/modified +> executable checkers never execute — proven with a sentinel side effect; missing/tampered base +> sidecar fails closed; a rewritten checker is surfaced as a trust-root change; deny-exec composes). +> The non-sandbox caveat in §2/§7 still holds and is stated in user docs: a base-approved +> `pytest:`/`shell:` checker can still execute PR-head code, so `base` mode is a *checker-source trust +> root*, not a sandbox. ## 1. 
Problem diff --git a/src/dorian/cli.py b/src/dorian/cli.py index 0ddd756..e257fac 100644 --- a/src/dorian/cli.py +++ b/src/dorian/cli.py @@ -182,6 +182,16 @@ def build_parser() -> argparse.ArgumentParser: help="output format; md is a PR-comment body for the GitHub Action", ) rv.add_argument("--enable-c2lite", action="store_true") + rv.add_argument( + "--checker-source", + choices=["head", "base"], + default=None, + help="which sidecar a claim's checker SPEC is read from (sources checked are" + " always the working tree). 'head' (default) runs the checked-out spec —" + " trusted/internal repos. 'base' resolves each spec from the --since (base) ref" + " so a PR-added or PR-modified executable checker is never executed — for" + " public/fork PRs; fail-closed, NOT a sandbox. Env: DORIAN_CHECKER_SOURCE.", + ) _add_exec_policy_flags(rv) rp = sub.add_parser("report", help="event-log digest") diff --git a/src/dorian/commands.py b/src/dorian/commands.py index dff6848..0e23f72 100644 --- a/src/dorian/commands.py +++ b/src/dorian/commands.py @@ -623,6 +623,21 @@ def cmd_revalidate(args: argparse.Namespace) -> int: return EXIT_USAGE if _missing_repo(repo, "revalidate"): return EXIT_USAGE + # flag wins; env DORIAN_CHECKER_SOURCE is the Action's fallback (head|base) + checker_source = args.checker_source or os.environ.get("DORIAN_CHECKER_SOURCE", "head").strip() + if checker_source not in ("head", "base"): + print( + f"dorian revalidate: --checker-source must be head|base (got {checker_source!r})", + file=sys.stderr, + ) + return EXIT_USAGE + if checker_source == "base" and args.since is None: + print( + "dorian revalidate: --checker-source base requires --since " + " (the trusted checker spec is read from the base ref)", + file=sys.stderr, + ) + return EXIT_USAGE try: result = revalidate( repo, @@ -632,6 +647,7 @@ def cmd_revalidate(args: argparse.Namespace) -> int: policy=ExecutionPolicy.from_flags_and_env( deny_exec=args.deny_exec, deny_shell=args.deny_shell ), + 
checker_source=checker_source, ) # user-input failures, before the broader ValueError in _SIDECAR_ERRORS: # an unresolvable --since ref or an unreadable --changed-paths listing diff --git a/src/dorian/revalidate.py b/src/dorian/revalidate.py index f60b953..55fab14 100644 --- a/src/dorian/revalidate.py +++ b/src/dorian/revalidate.py @@ -21,7 +21,7 @@ import json import sqlite3 -from dataclasses import asdict, dataclass, field +from dataclasses import asdict, dataclass, field, replace from datetime import UTC, datetime from pathlib import Path @@ -58,6 +58,9 @@ class RevalResult: # claim/trust states are untouched): {warrant_id, artifact_uri, depth, via} # where via is the newly broken upstream warrant recalled: list[dict] = field(default_factory=list) + # checker-source=base advisories: a checker spec that changed on the PR (so the + # base-approved spec was run instead), or a claim/sidecar skipped fail-closed + notes: list[str] = field(default_factory=list) candidates: int = 0 exit_code: int = 0 @@ -69,17 +72,36 @@ def revalidate( changed_paths_file: Path | None = None, enable_c2lite: bool = False, policy: ExecutionPolicy | None = None, + checker_source: str = "head", ) -> RevalResult: """Re-check claims bound to the changed paths; one of `since` (git ref to diff from) or `changed_paths_file` (one path per line) is required. If both are given, `changed_paths_file` takes precedence and `since` is ignored - (the CLI rejects the combination).""" + (the CLI rejects the combination). + + checker_source (head | base; default head) selects which sidecar a candidate + claim's checker SPEC is read from — orthogonal to which SOURCES are checked + (always the working tree / PR head). `head` is today's behavior exactly. `base` + is the public/fork-PR hardening: each claim's checker spec is resolved from the + `since` (base) ref's sidecar, so a PR-added or PR-modified executable checker is + never executed — only maintainer-approved (base) checker specs run. 
It fails + closed (a missing/tampered base sidecar, or a claim absent on base, ERRORs and + runs nothing) and it is NOT a sandbox: a base-approved C4 `pytest:` checker can + still import and execute PR-head code (see docs/TRUSTED_BASE_ACTION_DESIGN.md).""" if since is None and changed_paths_file is None: raise ValueError("provide since= or changed_paths_file=") + if checker_source not in ("head", "base"): + raise ValueError(f"checker_source must be 'head' or 'base', got {checker_source!r}") + if checker_source == "base" and since is None: + raise ValueError( + "checker-source=base needs --since : the trusted checker spec is" + " resolved from the base ref, which --changed-paths does not provide" + ) repo = repo.resolve() # under deny-exec/deny-shell a blocked C4/C5-shell recheck ERRORs (exit 5), # never silently PASSes and never folds to BROKEN — trigger-vs-truth intact exec_policy = policy if policy is not None else ExecutionPolicy() + base_cache: dict[str, Warrant | None] = {} # checker-source=base: per-artifact base sidecar if changed_paths_file is not None: # read exactly once, before any store work: a failure here is bad caller # input (distinct ChangedPathsError), never a sidecar integrity error @@ -137,11 +159,42 @@ def revalidate( kind="claim.stale", cause={"changed": cause}, ) - if not claim.checkers: + if not claim.checkers and checker_source != "base": continue # unbacked claim: stale is recorded, nothing to re-check - state, detail, relocated = _check_claim( - repo, claim, entries, renames, enable_c2lite, exec_policy - ) + # checker-source=base: run the BASE-approved checker spec (resolved from + # the `since` ref) against head sources, never the PR's spec. Fail closed + # (ERRORED, never executed) when the base spec cannot be trusted. 
+ eff_claim = claim + skip_reason: str | None = None + if checker_source == "base": + base_w = _load_base_warrant(repo, since, warrant.artifact_uri, base_cache) + if base_w is None: + skip_reason = ( + "checker-source=base: no readable base sidecar for this artifact" + " (fail-closed; not executed)" + ) + else: + base_claim = next((c for c in base_w.claims if c.id == cid), None) + if base_claim is None: + skip_reason = ( + "checker-source=base: claim not present on base ref" + " (PR-added checker; not executed)" + ) + else: + if base_claim.checkers != claim.checkers: + result.notes.append( + f"{warrant.artifact_uri}: {cid}: checker spec changed on PR" + " — ran base-approved spec (checker-source=base)" + ) + eff_claim = replace(claim, checkers=base_claim.checkers) + if skip_reason is not None: + state, detail, relocated = "ERRORED", skip_reason, False + elif not eff_claim.checkers: + continue # nothing to run (head unbacked, or base claim unbacked) + else: + state, detail, relocated = _check_claim( + repo, eff_claim, entries, renames, enable_c2lite, exec_policy + ) changed_state = fold_mod.apply_claim_state( conn, wid, cid, state, actor=actor, cause={"detail": detail} ) @@ -187,6 +240,28 @@ def revalidate( conn.close() +def _load_base_warrant( + repo: Path, base_ref: str, artifact_uri: str, cache: dict[str, Warrant | None] +) -> Warrant | None: + """The artifact's sidecar AS IT EXISTS ON THE BASE REF (checker-source=base), or + None if it is absent, unreadable, or its content-addressed id does not verify (a + tampered base sidecar). Fail-closed by construction: None makes the caller skip, + never execute the PR's checker. 
Cached per artifact for the run.""" + if artifact_uri in cache: + return cache[artifact_uri] + warrant: Warrant | None = None + data = gitio.file_at_ref(repo, base_ref, artifact_uri + ".warrant") + if data is not None: + try: + candidate = Warrant.from_dict(json.loads(data.decode("utf-8"))) + if Warrant.compute_id(candidate.body_dict()) == candidate.id: + warrant = candidate # integrity-valid base sidecar + except (ValueError, KeyError, TypeError, UnicodeDecodeError): + warrant = None # malformed/tampered base sidecar: fail closed + cache[artifact_uri] = warrant + return warrant + + def _claim_paths( claim: Claim, entries: dict[str, ReadSetEntry], renames: dict[str, str] ) -> set[str]: @@ -281,6 +356,8 @@ def render_text(result: RevalResult) -> str: for e in result.recalled: wid, uri = e["warrant_id"], e["artifact_uri"] lines.append(f"recalled {wid[:23]} {uri} depth={e['depth']}") + for note in result.notes: + lines.append(f"note {note}") return "\n".join(lines) + "\n" @@ -358,6 +435,10 @@ def render_md(result: RevalResult) -> str: lines += ["", "Recalled downstream (flagged, not re-checked):"] for e in result.recalled: lines.append(f"- `{e['artifact_uri']}` (depth {e['depth']})") + if result.notes: # checker-source=base advisories (PR-changed / skipped specs) + lines += ["", "Checker-source notes (trusted-base mode):"] + for note in result.notes: + lines.append(f"- {_md_cell(note)}") checks = sum(map(len, (result.broken, result.relocated, result.errored, result.passed))) meaning = _EXIT_MEANINGS.get(result.exit_code, "unknown") diff --git a/tests/test_action_security_defaults.py b/tests/test_action_security_defaults.py index 7f3b47a..2bee860 100644 --- a/tests/test_action_security_defaults.py +++ b/tests/test_action_security_defaults.py @@ -35,6 +35,18 @@ def test_action_wires_deny_flags_through_env_fallback() -> None: assert "DORIAN_DENY_SHELL: ${{ inputs.deny_shell }}" in text +def test_action_exposes_checker_trust_defaulting_head() -> None: + inputs = 
_action()["inputs"] + assert "checker_trust" in inputs + # default head = today's behavior; trusted repos are unchanged unless they opt in + assert str(inputs["checker_trust"]["default"]) == "head" + + +def test_action_wires_checker_trust_through_env() -> None: + text = ACTION_YML.read_text(encoding="utf-8") + assert "DORIAN_CHECKER_SOURCE: ${{ inputs.checker_trust }}" in text + + def test_action_does_not_recommend_pull_request_target() -> None: readme = ACTION_README.read_text(encoding="utf-8") # it is named only to forbid it diff --git a/tests/test_render_md.py b/tests/test_render_md.py index 4667562..1a6be12 100644 --- a/tests/test_render_md.py +++ b/tests/test_render_md.py @@ -246,13 +246,22 @@ def test_action_yml_is_valid_composite() -> None: assert data["runs"]["using"] == "composite" inputs = data["inputs"] - assert set(inputs) == {"fail_on", "base", "install", "deny_exec", "deny_shell"} + assert set(inputs) == { + "fail_on", + "base", + "install", + "deny_exec", + "deny_shell", + "checker_trust", + } assert inputs["fail_on"]["default"] == "revoked" assert inputs["base"]["default"] == "${{ github.event.pull_request.base.sha }}" assert inputs["install"]["default"] == "dorian-vwp" # deny-exec/deny-shell default OFF so trusted/internal repos are unchanged assert str(inputs["deny_exec"]["default"]) == "false" assert str(inputs["deny_shell"]["default"]) == "false" + # checker_trust defaults to head (today's behavior); base is the opt-in fork mode + assert str(inputs["checker_trust"]["default"]) == "head" for name in inputs: assert inputs[name]["description"].strip(), f"input {name} must be documented" diff --git a/tests/test_trusted_base.py b/tests/test_trusted_base.py new file mode 100644 index 0000000..fd8aa08 --- /dev/null +++ b/tests/test_trusted_base.py @@ -0,0 +1,292 @@ +"""Trusted-base checker-source mode (WP7): `revalidate --checker-source base`. 
+ +The exploit class: the Action runs the checker programs found in the CHECKED-OUT +(PR/head) `.warrant` sidecars. On a forked PR, the attacker controls those sidecars, +so a PR-added or PR-modified C4 `pytest:`/C5 `shell:` checker would execute attacker +code on the runner, and a rewritten non-executable checker could self-attest a green +verdict. `--checker-source base` resolves every claim's checker SPEC from the trusted +base ref instead, runs it against the PR-head SOURCES, and fails closed when the base +spec cannot be trusted. + +This is the security matrix from docs/TRUSTED_BASE_ACTION_DESIGN.md §6. Each +"executed?" case proves it with a filesystem side effect (a sentinel `touch`): the +sentinel must NOT appear under base mode. It is NOT a sandbox — a base-approved +pytest checker can still execute head code — which the design and these tests state. +""" + +from __future__ import annotations + +from pathlib import Path + +import pytest + +from conftest import commit_all, git, write +from dorian import cli +from dorian.model import ( + CheckerSpec, + Claim, + FoldPolicy, + ProducedBy, + ReadSet, + Warrant, + sha256_hex, +) +from dorian.policy import ExecutionPolicy +from dorian.revalidate import revalidate +from dorian.seal import seal_artifact + +PB = ProducedBy(runner="manual", captured_at="2026-01-01T00:00:00Z") + + +def _repo(tmp_path: Path) -> Path: + repo = tmp_path / "repo" + repo.mkdir() + git(repo, "init", "-q", "-b", "main") + write(repo, "src/config.py", "TIMEOUT = 30\n") + write(repo, "src/auth.py", "def verify_token(t):\n return bool(t)\n") + write(repo, "note.md", "# note\n\nThe timeout is 30.\n") + commit_all(repo, "init") # seal needs a resolvable HEAD + return repo + + +def _forge_head_warrant(repo: Path, artifact_uri: str, claims: list[Claim]) -> Warrant: + """Write a content-addressed `.warrant` to the working tree directly, simulating a + sidecar a forked PR fully controls (a real attacker computes the same valid id).""" + data = (repo 
/ artifact_uri).read_bytes() + w = Warrant.create( + artifact_uri=artifact_uri, + artifact_hash=sha256_hex(data), + git_ref=git(repo, "rev-parse", "HEAD"), + produced_by=PB, + read_set=(), + claims=tuple(claims), + fold_policy=FoldPolicy(), + sealed_at="2026-01-01T00:00:00Z", + ) + w.dump(w.sidecar_path(repo)) + return w + + +def _claim(cid: str, program: str, ctype: str = "C3", *, watch=(), load_bearing=True) -> Claim: + return Claim( + id=cid, + text=f"claim {cid}", + kind="reference", + load_bearing=load_bearing, + checkers=(CheckerSpec(type=ctype, program=program, watch=tuple(watch)),), + ) + + +# --- matrix #2: a base-unchanged checker still runs and catches real drift ---- + + +def test_base_unchanged_checker_catches_drift(tmp_path: Path) -> None: + repo = _repo(tmp_path) + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("t", r"regex:src/config.py::TIMEOUT\s*=\s*30")], + ) + base = commit_all(repo, "base: sealed warrant") + write(repo, "src/config.py", "TIMEOUT = 10\n") # head drift + res = revalidate(repo, since=base, checker_source="base") + assert {cid for _, cid, _ in res.broken} == {"t"} # base spec ran, caught the drift + assert res.exit_code == cli.EXIT_REVOKED + + +# --- matrix #3: a PR-ADDED executable checker is never executed --------------- + + +def test_pr_added_shell_checker_does_not_execute(tmp_path: Path) -> None: + repo = _repo(tmp_path) + # base warrant carries only a benign non-executing claim + seal_artifact( + repo, "note.md", ReadSet(entries=(), produced_by=PB), [_claim("ok", "path:src/config.py")] + ) + base = commit_all(repo, "base") + + base_sentinel = tmp_path / "BASE_PWNED" + head_sentinel = tmp_path / "HEAD_PWNED" + write(repo, "src/auth.py", "def verify_token(t):\n return t\n") # head source change + + def forge(sentinel: Path) -> None: + _forge_head_warrant( + repo, + "note.md", + [ + _claim("ok", "path:src/config.py"), + _claim("evil", f"shell:touch {sentinel}", "C5", watch=("src/auth.py",)), 
+ ], + ) + + # base mode: the PR-added 'evil' claim is absent on base -> skipped, never executed + forge(base_sentinel) + res = revalidate(repo, since=base, checker_source="base") + assert not base_sentinel.exists(), "PR-added shell checker MUST NOT execute under base mode" + assert "evil" in {cid for _, cid, _ in res.errored} + + # head mode (the unsafe default for forks): proves the checker is genuinely live + forge(head_sentinel) + revalidate(repo, since=base, checker_source="head") + assert head_sentinel.exists(), "sanity: head mode does run the PR's shell checker" + + +# --- matrix #4: a PR-MODIFIED executable checker is never executed ------------ + + +def test_pr_modified_executable_checker_uses_base_spec(tmp_path: Path) -> None: + repo = _repo(tmp_path) + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("c", r"regex:src/config.py::TIMEOUT\s*=\s*30", watch=("src/config.py",))], + ) + base = commit_all(repo, "base: regex checker") + + sentinel = tmp_path / "MOD_PWNED" + write(repo, "src/config.py", "TIMEOUT = 10\n") # base regex would FAIL on this + write(repo, "src/auth.py", "def verify_token(t):\n return t\n") + # PR rewrites claim 'c' from the base regex to a shell command + _forge_head_warrant( + repo, "note.md", [_claim("c", f"shell:touch {sentinel}", "C5", watch=("src/auth.py",))] + ) + + res = revalidate(repo, since=base, checker_source="base") + assert not sentinel.exists(), "PR-modified executable checker MUST NOT execute" + assert {cid for _, cid, _ in res.broken} == {"c"} # the BASE regex ran (and failed) + assert any("changed on PR" in n for n in res.notes) # trust-root change surfaced + + +# --- trust-root change: a PR-weakened NON-executable checker ----------------- + + +def test_pr_weakened_nonexec_checker_is_reported_and_base_spec_wins(tmp_path: Path) -> None: + repo = _repo(tmp_path) + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("c", 
r"regex:src/config.py::TIMEOUT\s*=\s*30", watch=("src/config.py",))], + ) + base = commit_all(repo, "base: strict regex") + write(repo, "src/config.py", "TIMEOUT = 10\n") # the fact is now false + # PR weakens the checker to a mere existence check that always passes + _forge_head_warrant( + repo, "note.md", [_claim("c", "path:src/config.py", watch=("src/config.py",))] + ) + + # head mode: the weakening attack succeeds — the existence check passes + head = revalidate(repo, since=base, checker_source="head") + assert not head.broken and {cid for _, cid, _ in head.passed} == {"c"} + + # base mode: the base regex spec wins -> BROKEN, and the spec change is surfaced + res = revalidate(repo, since=base, checker_source="base") + assert {cid for _, cid, _ in res.broken} == {"c"} + assert any("changed on PR" in n for n in res.notes) + + +# --- matrix #6: a missing/unreadable base sidecar fails closed ---------------- + + +def test_missing_base_sidecar_fails_closed(tmp_path: Path) -> None: + repo = _repo(tmp_path) + base = git(repo, "rev-parse", "HEAD") # base has the source but NO warrant for note.md + write(repo, "src/auth.py", "def verify_token(t):\n return t\n") + sentinel = tmp_path / "NOBASE_PWNED" + _forge_head_warrant( + repo, "note.md", [_claim("c", f"shell:touch {sentinel}", "C5", watch=("src/auth.py",))] + ) + res = revalidate(repo, since=base, checker_source="base") + assert not sentinel.exists(), "no base sidecar -> must not execute the PR checker" + assert "c" in {cid for _, cid, _ in res.errored} + assert res.exit_code == cli.EXIT_ERRORED # ERRORED, never BROKEN, never green + + +# --- deny-exec composes with base mode ---------------------------------------- + + +def test_base_mode_with_deny_shell_errors_base_executable(tmp_path: Path) -> None: + """Even a BASE-approved executable checker is blocked under deny-exec/deny-shell: + the two controls compose. 
base mode picks the trusted spec; the policy still + refuses to run it -> ERRORED, never executed.""" + repo = _repo(tmp_path) + sentinel = tmp_path / "DENY_PWNED" + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("c", f"shell:touch {sentinel}", "C5", watch=("src/config.py",))], + ) + sentinel.unlink(missing_ok=True) # seal-time run created it; reset for the assertion + base = commit_all(repo, "base: shell checker") + write(repo, "src/config.py", "TIMEOUT = 31\n") + res = revalidate( + repo, + since=base, + checker_source="base", + policy=ExecutionPolicy(allow_exec=True, allow_shell=False), + ) + assert not sentinel.exists() + assert "c" in {cid for _, cid, _ in res.errored} + + +# --- head is the default; base requires --since ------------------------------- + + +def test_head_is_default_and_unchanged(tmp_path: Path) -> None: + repo = _repo(tmp_path) + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("t", r"regex:src/config.py::TIMEOUT\s*=\s*30")], + ) + base = commit_all(repo, "base") + write(repo, "src/config.py", "TIMEOUT = 10\n") + default = revalidate(repo, since=base) + head = revalidate(repo, since=base, checker_source="head") + assert {c for _, c, _ in default.broken} == {c for _, c, _ in head.broken} == {"t"} + assert default.notes == [] and head.notes == [] + + +def test_base_mode_requires_since(tmp_path: Path) -> None: + repo = _repo(tmp_path) + changed = repo / "changed.txt" + changed.write_text("src/config.py\n", encoding="utf-8") + with pytest.raises(ValueError, match="base"): + revalidate(repo, changed_paths_file=changed, checker_source="base") + + +def test_cli_base_without_since_is_usage_error(tmp_path: Path) -> None: + repo = _repo(tmp_path) + changed = repo / "changed.txt" + changed.write_text("src/config.py\n", encoding="utf-8") + rc = cli.main( + [ + "--repo", + str(repo), + "revalidate", + "--changed-paths", + str(changed), + "--checker-source", + "base", + ] + ) + 
assert rc == cli.EXIT_USAGE + + +def test_cli_env_fallback_selects_base(tmp_path: Path, monkeypatch) -> None: + repo = _repo(tmp_path) + seal_artifact( + repo, + "note.md", + ReadSet(entries=(), produced_by=PB), + [_claim("t", r"regex:src/config.py::TIMEOUT\s*=\s*30", watch=("src/config.py",))], + ) + base = commit_all(repo, "base") + write(repo, "src/config.py", "TIMEOUT = 10\n") + monkeypatch.setenv("DORIAN_CHECKER_SOURCE", "base") + rc = cli.main(["--repo", str(repo), "revalidate", "--since", base]) + assert rc == cli.EXIT_REVOKED # base spec ran and caught the drift From 04ab60bb4bde4947cffc2a77f46ce421aeb889e3 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 17:51:29 +0530 Subject: [PATCH 03/13] =?UTF-8?q?feat(v1):=20multi-index=20binding=20?= =?UTF-8?q?=E2=80=94=20config-key=20index=20for=20TOML/JSON=20(WP5)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extends binding beyond Python definers/console-scripts to config keys in tracked .toml/.json files: a claim mentioning a config key is re-checked when the defining config file changes. Conservative and trigger-only (never proves truth): - symbol_index.config_key_index: key -> tracked .toml/.json files + unparseable list. YAML deliberately excluded (parsing needs a third-party dep; core stays zero-dep). - claim_config_watch_paths + claim_watch_paths (unified symbol+script+config union); verify/rebind now widen with the merged watch set. - ambiguous_config_mentions: a key in >1 file is skipped (a wrong watch is a false alarm) and surfaced via verify warnings + bind-suggest, never guessed. - unparseable supported config files are surfaced as a diagnostic, never silent. - bind-suggest gains provenance (bind (symbol) vs bind (config)) + ambiguous-config + unparseable-config lines; JSON adds bind_config/ambiguous_config/unparseable_config. - config_key_index degrades to empty on a non-git repo (never blocks). 
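The index-and-bind rule in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this patch: it reads an in-memory mapping of JSON text instead of git-tracked files, `key_index` and `watch_for` are hypothetical names, and TOML is left out (it would presumably be parsed the same way through `tomllib`).

```python
import json


def walk_keys(obj):
    """Yield every string key in a nested dict/list structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if isinstance(key, str):
                yield key
            yield from walk_keys(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from walk_keys(item)


def key_index(files, min_key_len=4):
    """files: {relative path: JSON text} (hypothetical input shape).

    Returns (key -> sorted defining files, sorted unparseable files).
    """
    keys = {}
    unparseable = []
    for rel, text in files.items():
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            unparseable.append(rel)  # surfaced to the caller, never dropped silently
            continue
        for key in walk_keys(data):
            if len(key) >= min_key_len:  # short keys are treated as noise
                keys.setdefault(key, set()).add(rel)
    return {k: tuple(sorted(v)) for k, v in keys.items()}, tuple(sorted(unparseable))


def watch_for(tokens, index):
    """Bind only keys defined in exactly one file; report ambiguous keys separately."""
    bound = sorted({index[t][0] for t in tokens if len(index.get(t, ())) == 1})
    ambiguous = {t: index[t] for t in tokens if len(index.get(t, ())) > 1}
    return bound, ambiguous


files = {
    "a.json": '{"server": {"max_workers": 4}}',
    "b.json": '{"max_connections": 5}',
    "c.json": '{"max_connections": 9}',
    "broken.json": "{bad",
}
index, unparseable = key_index(files)
bound, ambiguous = watch_for(["max_workers", "max_connections"], index)
```

In this toy run `max_workers` binds to its single defining file, while `max_connections` is left unbound and reported as ambiguous. The watch only decides when a claim is re-checked; the checker still decides whether it is true.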
Updated test_symbol_index pyproject-script expectation (a script-name claim now also watches pyproject.toml where the script is declared) and the trusted-base design doc-guard (now IMPLEMENTED). 639 non-slow tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) --- V1_IMPLEMENTATION_TRACKER.md | 2 +- src/dorian/commands.py | 60 ++++++++++-- src/dorian/symbol_index.py | 106 +++++++++++++++++++++ tests/test_claude_code_docs.py | 10 +- tests/test_config_binding.py | 168 +++++++++++++++++++++++++++++++++ tests/test_symbol_index.py | 5 +- 6 files changed, 336 insertions(+), 15 deletions(-) create mode 100644 tests/test_config_binding.py diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md index 44db552..db61176 100644 --- a/V1_IMPLEMENTATION_TRACKER.md +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -103,7 +103,7 @@ Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-documen | WP2 | checker-strength / claim-risk linter | DONE (strength.py; surfaced in `bindings` + binding-gate warn; 19 tests) | | WP3 | Python structural checkers (py-signature, py-const) | DONE (pyast.py + C3 subgrammars; 27 tests incl. 
e2e) | | WP4 | semantic-context source search (`code:`) | DONE (pyast.code_only_python + C3 `code:`; 12 tests) | -| WP5 | multi-index binding (config-key) | TODO | +| WP5 | multi-index binding (config-key) | DONE (symbol_index.config_key_index + claim_watch_paths; TOML/JSON only, YAML excluded = zero-dep; provenance in bind-suggest; ambiguity + unparseable surfaced; 9 tests) | | WP6 | C4 test-adequacy lint | DONE (strength.c4_adequacy; folded into WP2 tests) | | WP7 | trusted-base checker-source mode | DONE (revalidate --checker-source base + Action checker_trust; 10-case exploit matrix) | | WP8 | warrant-quality mutation harness | TODO | diff --git a/src/dorian/commands.py b/src/dorian/commands.py index 0e23f72..d6feb9b 100644 --- a/src/dorian/commands.py +++ b/src/dorian/commands.py @@ -247,14 +247,17 @@ def cmd_verify(args: argparse.Namespace) -> int: # claims mention (even when no checker named them): the symbol-definer watch # the seal adds is then also captured + hashed + scope-linted honestly paths = referenced_paths(claims) - symbol_watch = symbol_index.claim_symbol_watch_paths(repo, claims) + # multi-index binding: Python symbol-definers + pyproject scripts + config keys + symbol_watch = symbol_index.claim_watch_paths(repo, claims) for path in sorted({p for ps in symbol_watch.values() for p in ps}): if path not in paths: paths.append(path) readset = parse_manual(paths, repo) - # a load-bearing claim naming an AMBIGUOUS symbol (>1 definer) is left unbound; do - # not let that skip be silent — warn so the author binds it explicitly (see A3) + # a load-bearing claim naming an AMBIGUOUS symbol/config key (>1 definer) is left + # unbound; do not let that skip be silent — warn so the author binds it explicitly ambiguous = symbol_index.ambiguous_symbol_mentions(repo, claims) + ambiguous_config = symbol_index.ambiguous_config_mentions(repo, claims) + _, unparseable_config = symbol_index.config_key_index(repo) except (ValueError, OSError, gitio.GitError) as exc: 
print(f"dorian verify: {exc}", file=sys.stderr) return EXIT_USAGE @@ -293,6 +296,20 @@ def cmd_verify(args: argparse.Namespace) -> int: "checker or qualify the reference", file=sys.stderr, ) + for cid, cfg in ambiguous_config.items(): + for key, files in cfg.items(): + print( + f"dorian verify: warning: load-bearing claim {cid!r} mentions config key " + f"{key!r} (defined in {len(files)} config files); left unbound — name the file " + "in a checker", + file=sys.stderr, + ) + for cfg_path in unparseable_config: + print( + f"dorian verify: warning: config file {cfg_path!r} could not be parsed; its keys " + "are not indexed for binding (a claim mentioning them may be silently unbound)", + file=sys.stderr, + ) backed = sum(1 for c in claims if c.backed) print(warrant.id) print( @@ -482,8 +499,12 @@ def cmd_bind_suggest(args: argparse.Namespace) -> int: except (ValueError, OSError) as exc: print(f"dorian bind-suggest: {exc}", file=sys.stderr) return EXIT_USAGE + # multi-index binding with provenance: symbol-definer/script vs config-key watch = symbol_index.claim_symbol_watch_paths(repo, claims) + config_watch = symbol_index.claim_config_watch_paths(repo, claims) ambiguous = symbol_index.ambiguous_symbol_mentions(repo, claims) + ambiguous_config = symbol_index.ambiguous_config_mentions(repo, claims) + _, unparseable_config = symbol_index.config_key_index(repo) suggestions: list[dict] = [] for c in claims: try: @@ -491,17 +512,38 @@ def cmd_bind_suggest(args: argparse.Namespace) -> int: except ValueError: covered = set() # C1 span / C5 shell: no auto-derivable read-set to compare bind = [f for f in watch.get(c.id, ()) if f not in covered] + bind_config = [f for f in config_watch.get(c.id, ()) if f not in covered] amb = {s: list(files) for s, files in ambiguous.get(c.id, {}).items()} - if bind or amb: - suggestions.append({"claim_id": c.id, "bind": bind, "ambiguous": amb}) + amb_cfg = {k: list(files) for k, files in ambiguous_config.get(c.id, {}).items()} + if bind or 
bind_config or amb or amb_cfg: + suggestions.append( + { + "claim_id": c.id, + "bind": bind, # symbol-definer / console-script provenance + "bind_config": bind_config, # config-key provenance + "ambiguous": amb, + "ambiguous_config": amb_cfg, + } + ) if args.json: - print(json.dumps({"suggestions": suggestions}, sort_keys=True)) + print( + json.dumps( + {"suggestions": suggestions, "unparseable_config": list(unparseable_config)}, + sort_keys=True, + ) + ) return EXIT_OK for s in suggestions: if s["bind"]: - print(f"{s['claim_id']} bind: {', '.join(s['bind'])}") + print(f"{s['claim_id']} bind (symbol): {', '.join(s['bind'])}") + if s["bind_config"]: + print(f"{s['claim_id']} bind (config): {', '.join(s['bind_config'])}") for sym, files in sorted(s["ambiguous"].items()): - print(f"{s['claim_id']} ambiguous: {sym} ({len(files)} definers, unbound)") + print(f"{s['claim_id']} ambiguous symbol: {sym} ({len(files)} definers, unbound)") + for key, files in sorted(s["ambiguous_config"].items()): + print(f"{s['claim_id']} ambiguous config: {key} ({len(files)} files, unbound)") + for cfg in unparseable_config: + print(f"unparseable config (keys not indexed for binding): {cfg}") print(f"{len(suggestions)} claim(s) with binding suggestions") return EXIT_OK @@ -543,7 +585,7 @@ def cmd_rebind(args: argparse.Namespace) -> int: file=sys.stderr, ) return EXIT_USAGE - symbol_watch = symbol_index.claim_symbol_watch_paths(repo, claims) + symbol_watch = symbol_index.claim_watch_paths(repo, claims) # symbol-definer + config-key new_paths = {p for ps in symbol_watch.values() for p in ps} already_watched = {w for c in claims for spec in c.checkers for w in spec.watch} if new_paths <= already_watched: diff --git a/src/dorian/symbol_index.py b/src/dorian/symbol_index.py index 8ad57c9..f35aff4 100644 --- a/src/dorian/symbol_index.py +++ b/src/dorian/symbol_index.py @@ -26,6 +26,7 @@ class from docs/NEXT_ALGORITHMIC_BETS.md #1 — where a claim about a symbol from __future__ import annotations 
import ast +import json import tomllib from pathlib import Path @@ -36,6 +37,11 @@ class from docs/NEXT_ALGORITHMIC_BETS.md #1 — where a claim about a symbol _MAX_FILE_BYTES = 1 << 20 # skip files > 1 MiB (mirrors bindings); parsing them is wasteful _DEF_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef) +# config-key index: stdlib-parseable formats only. YAML is deliberately excluded — +# parsing it needs a third-party dep and dorian's core has zero runtime deps. +_CONFIG_SUFFIXES = (".toml", ".json") +_MIN_KEY_LEN = 4 # mirror bindings._MIN_IDENT: shorter keys are noise + def python_symbol_definers(repo: Path) -> dict[str, tuple[str, ...]]: """Symbol name -> the sorted, unique git-tracked `.py` files that define it @@ -152,6 +158,106 @@ def claim_symbol_watch_paths(repo: Path, claims: list[Claim]) -> dict[str, tuple return out +def _walk_keys(obj: object): + """Yield every string dict-key in a nested TOML/JSON structure (recursively).""" + if isinstance(obj, dict): + for k, v in obj.items(): + if isinstance(k, str): + yield k + yield from _walk_keys(v) + elif isinstance(obj, list): + for item in obj: + yield from _walk_keys(item) + + +def config_key_index(repo: Path) -> tuple[dict[str, tuple[str, ...]], tuple[str, ...]]: + """(key -> sorted tracked .toml/.json files defining it, sorted unparseable files). + + Keys shorter than _MIN_KEY_LEN are dropped as noise. A supported config file that + fails to parse is returned in the second element — a LOUD diagnostic, never a + silent skip that would hide a missed binding. YAML is not indexed (no runtime dep). 
+ """ + repo = repo.resolve() + keys: dict[str, set[str]] = {} + unparseable: list[str] = [] + try: + tracked = gitio.ls_files(repo) + except gitio.GitError: + return ({}, ()) # not a git checkout: degrade to no index (never blocks) + for rel in tracked: + if not rel.endswith(_CONFIG_SUFFIXES): + continue + path = repo / rel + try: + if not path.is_file() or path.stat().st_size > _MAX_FILE_BYTES: + continue + raw = path.read_text(encoding="utf-8") + data = tomllib.loads(raw) if rel.endswith(".toml") else json.loads(raw) + except ( + OSError, + UnicodeDecodeError, + tomllib.TOMLDecodeError, + json.JSONDecodeError, + RecursionError, + ): + unparseable.append(rel) + continue + for key in _walk_keys(data): + if len(key) >= _MIN_KEY_LEN: + keys.setdefault(key, set()).add(rel) + return ({k: tuple(sorted(v)) for k, v in sorted(keys.items())}, tuple(sorted(unparseable))) + + +def claim_config_watch_paths(repo: Path, claims: list[Claim]) -> dict[str, tuple[str, ...]]: + """claim id -> the config file(s) to add to its watch set: for every identifier-shaped + token in the claim text that is a config key defined in EXACTLY ONE tracked .toml/.json. + Ambiguous keys (>1 file) are skipped (see ambiguous_config_mentions). 
Additive and + trigger-only — a config change re-checks the claim; the checker still decides truth.""" + claim_tokens = {c.id: _tokens(c.text) for c in claims if isinstance(c.text, str)} + if not any(claim_tokens.values()): + return {} + index, _ = config_key_index(repo) + out: dict[str, tuple[str, ...]] = {} + for claim in claims: + paths: set[str] = set() + for token in claim_tokens.get(claim.id, ()): + files = index.get(token) + if files is not None and len(files) == 1: + paths.add(files[0]) + if paths: + out[claim.id] = tuple(sorted(paths)) + return out + + +def claim_watch_paths(repo: Path, claims: list[Claim]) -> dict[str, tuple[str, ...]]: + """All deterministic re-check watches dorian binds per claim: Python symbol-definer + files + pyproject console scripts (claim_symbol_watch_paths) UNION config-key files + (claim_config_watch_paths). Union, sorted, deduped. Conservative and additive — it + only ever widens the re-check trigger set; it never proves a claim true.""" + merged: dict[str, set[str]] = {} + for source in (claim_symbol_watch_paths(repo, claims), claim_config_watch_paths(repo, claims)): + for cid, paths in source.items(): + merged.setdefault(cid, set()).update(paths) + return {cid: tuple(sorted(paths)) for cid, paths in merged.items()} + + +def ambiguous_config_mentions( + repo: Path, claims: list[Claim] +) -> dict[str, dict[str, tuple[str, ...]]]: + """claim id -> {config key: defining files} for keys a LOAD-BEARING claim mentions + that are defined in MORE THAN ONE tracked config file — the ambiguous case binding + skips. Lets verify/bind-suggest surface the skip rather than guess. 
{} if none.""" + index, _ = config_key_index(repo) + out: dict[str, dict[str, tuple[str, ...]]] = {} + for claim in claims: + if not claim.load_bearing or not isinstance(claim.text, str): + continue + ambiguous = {tok: index[tok] for tok in _tokens(claim.text) if len(index.get(tok, ())) > 1} + if ambiguous: + out[claim.id] = ambiguous + return out + + def ambiguous_symbol_mentions( repo: Path, claims: list[Claim] ) -> dict[str, dict[str, tuple[str, ...]]]: diff --git a/tests/test_claude_code_docs.py b/tests/test_claude_code_docs.py index d75ab02..26c428b 100644 --- a/tests/test_claude_code_docs.py +++ b/tests/test_claude_code_docs.py @@ -141,13 +141,15 @@ def test_public_benchmark_manifest_contains_only_public_repos() -> None: assert "genai-core" not in path.read_text(encoding="utf-8").lower() -# --- Slice E: trusted-base Action design (HUMAN REVIEW REQUIRED, no code) ----------- +# --- Slice E: trusted-base Action design (IMPLEMENTED in V1, with caveats) ---------- -def test_trusted_base_action_design_is_flagged_and_safe_by_default() -> None: +def test_trusted_base_action_design_states_implemented_and_keeps_caveats() -> None: doc = _read("docs/TRUSTED_BASE_ACTION_DESIGN.md") - assert "HUMAN REVIEW REQUIRED" in doc - assert "not implemented" in doc.lower() + # the design is now implemented (revalidate --checker-source base + Action input) + assert "IMPLEMENTED" in doc + assert "tests/test_trusted_base.py" in doc # the §6 matrix is realized in tests + # the hard safety constraints must remain stated even after implementation assert "pull_request_target" in doc # the never-use constraint is stated assert "does not sandbox PR-head code" in doc diff --git a/tests/test_config_binding.py b/tests/test_config_binding.py new file mode 100644 index 0000000..b97e010 --- /dev/null +++ b/tests/test_config_binding.py @@ -0,0 +1,168 @@ +"""Multi-index binding (WP5): config-key index for TOML/JSON. + +Binding widens the set of source changes that RE-CHECK a claim. 
v0.11 indexed only +Python definers and pyproject console scripts; this adds config keys in tracked +`.toml`/`.json` files, so a claim mentioning a config key is re-checked when the +defining config file changes. It stays conservative and trigger-only: + +- only UNAMBIGUOUS keys (defined in exactly one tracked config file) are bound; a key + in more than one file is surfaced and left unwatched (a wrong watch is a false alarm); +- an unparseable supported config file is surfaced as a diagnostic, never a silent skip; +- YAML is intentionally NOT indexed — parsing it needs a third-party dependency and + dorian's core has zero runtime deps; +- binding never proves truth — it only decides WHEN the claim is re-checked. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +from conftest import commit_all, git, write +from dorian import cli, symbol_index +from dorian.model import Claim + + +def _repo(tmp_path: Path) -> Path: + repo = tmp_path / "repo" + repo.mkdir() + git(repo, "init", "-q", "-b", "main") + return repo + + +def _claim(cid: str, text: str, program: str) -> Claim: + from dorian.model import CheckerSpec + + return Claim( + id=cid, + text=text, + kind="quantity", + load_bearing=True, + checkers=(CheckerSpec(type="C3", program=program),), + ) + + +def test_config_key_index_toml_and_json(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "settings.toml", "[database]\nmax_connections = 5\n") + write(repo, "feature.json", '{"flags": {"new_login": true}}\n') + commit_all(repo, "config") + index, unparseable = symbol_index.config_key_index(repo) + assert index["max_connections"] == ("settings.toml",) + assert index["database"] == ("settings.toml",) + assert index["new_login"] == ("feature.json",) + assert unparseable == () + + +def test_claim_mentioning_config_key_binds_the_file(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "settings.toml", "[database]\nmax_connections = 5\n") + commit_all(repo, 
"config") + claims = [_claim("c", "The `max_connections` pool size is 5.", "path:settings.toml")] + watch = symbol_index.claim_config_watch_paths(repo, claims) + assert watch == {"c": ("settings.toml",)} + + +def test_ambiguous_config_key_is_skipped_and_surfaced(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "a.toml", "max_connections = 5\n") + write(repo, "b.toml", "max_connections = 9\n") + commit_all(repo, "two configs") + claims = [_claim("c", "The `max_connections` value matters.", "path:a.toml")] + # ambiguous (2 files) -> no guessed watch + assert symbol_index.claim_config_watch_paths(repo, claims) == {} + # ...but surfaced for the author to disambiguate + amb = symbol_index.ambiguous_config_mentions(repo, claims) + assert set(amb["c"]["max_connections"]) == {"a.toml", "b.toml"} + + +def test_unparseable_config_is_a_diagnostic_not_silent(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "broken.json", "{not valid json\n") + write(repo, "ok.toml", "key_name = 1\n") + commit_all(repo, "configs") + index, unparseable = symbol_index.config_key_index(repo) + assert "broken.json" in unparseable # surfaced, not silently dropped + assert index["key_name"] == ("ok.toml",) # the parseable one still indexes + + +def test_yaml_is_not_indexed_zero_runtime_dep(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "conf.yaml", "max_connections: 5\n") + commit_all(repo, "yaml") + index, unparseable = symbol_index.config_key_index(repo) + assert "max_connections" not in index # YAML excluded by design + assert "conf.yaml" not in unparseable # not even attempted + + +def test_claim_watch_paths_merges_symbol_and_config(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "src/app.py", "def make_pool():\n return 1\n") + write(repo, "settings.toml", "[server]\nmax_workers = 4\n") + commit_all(repo, "code + config") + claims = [_claim("c", "`make_pool` reads `max_workers` from settings.", "path:settings.toml")] + watch = 
symbol_index.claim_watch_paths(repo, claims) + assert set(watch["c"]) == {"src/app.py", "settings.toml"} # symbol + config union + + +# --- end to end: a config-key claim re-checks when the config file changes ----- + + +def test_verify_binds_config_and_revalidate_rechecks(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "settings.toml", "[server]\nmax_workers = 4\n") + write(repo, "note.md", "# note\n\nThe server uses 4 max_workers.\n") + commit_all(repo, "init") + base = git(repo, "rev-parse", "HEAD") + # the claim's CHECKER names note-adjacent text but the binding watches settings.toml + claims = { + "claims": [ + { + "id": "workers", + "text": "The server `max_workers` pool is 4.", + "kind": "quantity", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "py-const:settings.toml::placeholder::0"}], + } + ] + } + # use a config-only checker that can't parse TOML as python -> would ERROR; instead + # bind via a real checker on the toml. Simpler: a regex checker on the toml value. 
+ claims["claims"][0]["checkers"] = [ + {"type": "C3", "program": r"regex:settings.toml::max_workers\s*=\s*4"} + ] + cp = repo / "claims.json" + cp.write_text(json.dumps(claims), encoding="utf-8") + assert cli.main(["--repo", str(repo), "verify", "note.md", "--claims", str(cp)]) == 0 + + # the symbol/config binding is moot here (checker already names settings.toml), so + # assert the config index would bind it independently of the checker: + parsed = [Claim.from_dict(c) for c in claims["claims"]] + assert "settings.toml" in symbol_index.claim_watch_paths(repo, parsed).get("workers", ()) + + write(repo, "settings.toml", "[server]\nmax_workers = 8\n") # drift + assert cli.main(["--repo", str(repo), "revalidate", "--since", base]) == cli.EXIT_REVOKED + + +def test_bind_suggest_shows_config_provenance(tmp_path: Path, capsys) -> None: + repo = _repo(tmp_path) + write(repo, "settings.toml", "[server]\nmax_workers = 4\n") + write(repo, "broken.json", "{bad\n") + commit_all(repo, "configs") + claims = { + "claims": [ + { + "id": "w", + "text": "The `max_workers` pool is 4.", + "kind": "quantity", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "path:settings.toml"}], + } + ] + } + cp = repo / "claims.json" + cp.write_text(json.dumps(claims), encoding="utf-8") + assert cli.main(["--repo", str(repo), "bind-suggest", "--claims", str(cp)]) == 0 + out = capsys.readouterr().out + assert "config" in out # provenance label for the config-key binding + assert "broken.json" in out # unparseable config surfaced as a diagnostic diff --git a/tests/test_symbol_index.py b/tests/test_symbol_index.py index 4b28eb3..dc5af15 100644 --- a/tests/test_symbol_index.py +++ b/tests/test_symbol_index.py @@ -459,7 +459,10 @@ def test_pyproject_script_binds_target_file(fixture_repo: Path) -> None: commit_all(fixture_repo, "add a console-script entry point") assert _verify(fixture_repo, [_SCRIPT_CLAIM]) == 0 w = _warrant(fixture_repo) - assert w.claims[0].checkers[0].watch == 
("src/routes.py", "pkg/cli.py") + # symbol-definer binding adds the script target (pkg/cli.py); the multi-index + # config-key binding also watches pyproject.toml, where `mytool` is declared — + # so a change to the script entry itself re-checks the claim too. + assert w.claims[0].checkers[0].watch == ("src/routes.py", "pkg/cli.py", "pyproject.toml") assert "pkg/cli.py" in {e.uri for e in w.read_set} From 2a66a49eee7b8aa069d7fb9222572b272493856d Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 17:58:45 +0530 Subject: [PATCH 04/13] =?UTF-8?q?feat(v1):=20warrant-quality=20mutation=20?= =?UTF-8?q?harness=20=E2=80=94=20dorian=20bench=20warrant-quality=20(WP8)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Offline, per-claim evidence: for each claim it derives deterministic mutations from the checker grammar and records whether the verdict matched expectation — - falsify (rename symbol / reassign const / change param): expect FAIL; a PASS is a MISS; - benign (trailing comment): expect PASS; a FAIL is BRITTLE (false alarm); - ceiling (content drift keeping an existence symbol): expect PASS, recorded as the documented trigger-vs-truth ceiling, never a penalty. ERROR (e.g. an executable checker under --deny-exec) is its own bucket, never a miss. Output is deterministic (no timestamps/randomness) and never mutates the real repo — each mutation runs against a throwaway copy of only the file the checker reads. Honest scope: structural/existence C3 forms are mutation-scored; string/regex/code, typed C5, C1, C4 are reported with strength and `mutation: unsupported` (no fabricated mutation). Registered as the `warrant-quality` bench subcommand. 7 tests; 645 non-slow pass. 
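The falsify / benign / ceiling scoring described above can be illustrated with a toy existence checker. This is a sketch under stated assumptions, not the harness in this patch: `symbol_exists`, `score`, and the three mutations are hypothetical stand-ins, the ERROR bucket is omitted, and the "checker" is a plain regex over an in-memory string.

```python
import re


def symbol_exists(text, name):
    """Toy existence checker: PASS when a def/class with this name is present."""
    return re.search(rf"\b(def|class)\s+{re.escape(name)}\b", text) is not None


def score(passed, expectation):
    """Classify one mutation outcome against its expectation."""
    if expectation == "falsify":  # should FAIL; a PASS is a miss
        return "missed" if passed else "caught"
    if expectation == "benign":  # should PASS; a FAIL is a false alarm
        return "ok" if passed else "brittle"
    return "ceiling" if passed else "ceiling_caught"  # known-blind case


source = "def verify_token(t):\n    return t\n"

# (name, expectation, mutate) where mutate(text) -> text; the original is never modified.
mutations = [
    ("rename_definition", "falsify", lambda t: t.replace("def verify_token", "def renamed")),
    ("append_comment", "benign", lambda t: t + "\n# comment\n"),
    ("change_body", "ceiling", lambda t: t.replace("return t", "return None")),
]

results = {
    name: score(symbol_exists(mutate(source), "verify_token"), expectation)
    for name, expectation, mutate in mutations
}
```

Each mutation is applied to a copy of the text, so the original source stays untouched, mirroring the throwaway-copy rule. The body change lands in the ceiling bucket because an existence check cannot see it.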
Co-Authored-By: Claude Opus 4.8 (1M context) --- V1_IMPLEMENTATION_TRACKER.md | 2 +- bench/warrant_quality.py | 233 ++++++++++++++++++++++++++++++++++ src/dorian/commands.py | 1 + tests/test_warrant_quality.py | 142 +++++++++++++++++++++ 4 files changed, 377 insertions(+), 1 deletion(-) create mode 100644 bench/warrant_quality.py create mode 100644 tests/test_warrant_quality.py diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md index db61176..d9bf12c 100644 --- a/V1_IMPLEMENTATION_TRACKER.md +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -106,7 +106,7 @@ Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-documen | WP5 | multi-index binding (config-key) | DONE (symbol_index.config_key_index + claim_watch_paths; TOML/JSON only, YAML excluded = zero-dep; provenance in bind-suggest; ambiguity + unparseable surfaced; 9 tests) | | WP6 | C4 test-adequacy lint | DONE (strength.c4_adequacy; folded into WP2 tests) | | WP7 | trusted-base checker-source mode | DONE (revalidate --checker-source base + Action checker_trust; 10-case exploit matrix) | -| WP8 | warrant-quality mutation harness | TODO | +| WP8 | warrant-quality mutation harness | DONE (bench/warrant_quality.py; `dorian bench warrant-quality`; deterministic, offline, never mutates real repo; trigger vs verdict; ERROR bucket distinct; honest scope = structural/existence forms scored, others reported strength-only; 7 tests) | | WP9 | current-version benchmark results | TODO | | WP10 | V1 release prep / decision | TODO | diff --git a/bench/warrant_quality.py b/bench/warrant_quality.py new file mode 100644 index 0000000..3b0862d --- /dev/null +++ b/bench/warrant_quality.py @@ -0,0 +1,233 @@ +"""`dorian bench warrant-quality` — per-claim warrant quality by mutation testing. + +Repo-level benchmarks answer "does the mechanism work on this suite?". 
This answers +the question a USER actually has about THEIR warrant: *for this claim, would the +checker catch the drift it is supposed to?* It is an OFFLINE evidence generator, not +runtime revalidation, and it never mutates the real repo (each mutation runs against a +throwaway copy of only the file the checker reads). + +For each claim it derives deterministic mutations from the checker grammar and records, +per mutation, whether the checker's verdict matched the expectation: + +- **falsify** — a change that SHOULD make the claim false (rename the symbol, change the + constant's value, change a parameter). Expect FAIL. A PASS here is a **miss** — the + checker cannot see the very drift it implies it guards. +- **benign** — a formatting-only change the checker promises to tolerate. Expect PASS. A + FAIL here is **brittle** — a false alarm. +- **ceiling** — a content change that the checker is KNOWN to be blind to (an existence + checker cannot see a body/content change). Expect PASS, recorded as the documented + trigger-vs-truth ceiling, never counted against the claim. + +Scope (honest): mutations are generated for the C3 structural/existence forms +(`symbol:`, `py-signature:`, `py-const:`) where a falsifying edit is mechanically +unambiguous. Other forms (`string:`/`regex:`/`code:`, typed C5, C1, C4) are reported +with their checker strength and `mutation: unsupported` — the harness does not fabricate +a mutation it cannot make deterministically. ERROR (e.g. an executable checker blocked by +`--deny-exec`) is its own bucket, never conflated with a miss or a FAIL. Output is +deterministic (no timestamps, no randomness) and stable for golden tests. 
+""" + +from __future__ import annotations + +import argparse +import json +import re +import sys +import tempfile +from collections.abc import Iterator +from pathlib import Path + +from dorian.checkers.base import CheckContext, Verdict, run_checker +from dorian.model import CheckerSpec, Claim, Warrant +from dorian.policy import ExecutionPolicy, executable_kind +from dorian.strength import claim_strength + +SCHEMA = "dorian-warrant-quality-v1" + +# (mutation name, expectation) — expectation is the verdict that means "as designed". +_FALSIFY = "falsify" # expect FAIL (catch the drift) +_BENIGN = "benign" # expect PASS (tolerate formatting) +_CEILING = "ceiling" # expect PASS (known-blind; never a penalty) + + +def _c3_parts(spec: CheckerSpec) -> tuple[str, str, str]: + """(prefix, file, operand) for a `prefix:file::operand` C3 program.""" + prefix, _, rest = spec.program.partition(":") + file, _, operand = rest.partition("::") + return prefix, file, operand + + +def _mutations(spec: CheckerSpec) -> Iterator[tuple[str, str, str, object]]: + """Yield (name, file, expectation, mutate) where mutate(text)->text. Empty for + forms the harness does not deterministically mutate.""" + if spec.type != "C3": + return + prefix, file, operand = _c3_parts(spec) + if not file: + return + if prefix == "symbol": + name = operand + yield ( + "rename_definition", + file, + _FALSIFY, + lambda t: re.sub(rf"\b(def|class)\s+{re.escape(name)}\b", r"\1 dorian_mut", t), + ) + # an existence checker is blind to any content change that keeps the name: + yield ("append_content", file, _CEILING, lambda t: t + "\n# dorian: content drift\n") + elif prefix == "py-const": + qual = operand.partition("::")[0] + if "."
in qual: + return # dotted (class attr): a top-level reassignment would not shadow it + yield ("reassign_value", file, _FALSIFY, lambda t: t + f'\n{qual} = "__dorian_mut__"\n') + yield ("append_comment", file, _BENIGN, lambda t: t + "\n# dorian: benign comment\n") + elif prefix == "py-signature": + qual = operand.partition("::")[0] + if "." in qual: + return # dotted (method): a top-level def would not shadow it + yield ( + "change_param", + file, + _FALSIFY, + lambda t: t + f"\n\ndef {qual}(dorian_mut_param):\n return None\n", + ) + yield ("append_comment", file, _BENIGN, lambda t: t + "\n# dorian: benign comment\n") + + +def _run_mutated(repo: Path, claim: Claim, spec_index: int, file: str, mutate, policy) -> Verdict: + """Run one checker against a throwaway copy of `file` with `mutate` applied. Only the + one file the checker reads is materialized — the real repo is never touched.""" + original = (repo / file).read_text(encoding="utf-8", errors="replace") + with tempfile.TemporaryDirectory() as td: + work = Path(td) + target = work / file + target.parent.mkdir(parents=True, exist_ok=True) + target.write_text(mutate(original), encoding="utf-8") + ctx = CheckContext(repo=work, claim=claim, policy=policy) + return run_checker(ctx, spec_index).verdict + + +def _score_mutation(verdict: Verdict, expectation: str) -> str: + """Classify a mutation outcome. ERROR is always its own bucket.""" + if verdict is Verdict.ERROR: + return "errored" + if expectation == _FALSIFY: + return "caught" if verdict is Verdict.FAIL else "missed" + if expectation == _BENIGN: + return "brittle" if verdict is Verdict.FAIL else "ok" + return "ceiling" if verdict is Verdict.PASS else "ceiling_caught" # _CEILING + + +def score_claim(repo: Path, claim: Claim, policy: ExecutionPolicy) -> dict: + """Per-claim quality record: strength, trigger watch, and mutation outcomes.""" + strongest = claim_strength(claim) # rank-aware (existence < ... 
< behavioral) + mutations: list[dict] = [] + for i, spec in enumerate(claim.checkers): + any_mutation = False + for name, file, expectation, mutate in _mutations(spec): + any_mutation = True + verdict = _run_mutated(repo, claim, i, file, mutate, policy) + mutations.append( + { + "checker": spec.type, + "program": spec.program, + "mutation": name, + "expectation": expectation, + "verdict": verdict.value, + "outcome": _score_mutation(verdict, expectation), + } + ) + if not any_mutation: + mutations.append( + { + "checker": spec.type, + "program": spec.program, + "mutation": "unsupported", + "expectation": "n/a", + "verdict": "n/a", + "outcome": "unsupported", + "executes": executable_kind(spec), + } + ) + outcomes = [m["outcome"] for m in mutations] + quality = ( + "weak" + if "missed" in outcomes + else "brittle" + if "brittle" in outcomes + else "strong" + if "caught" in outcomes + else "unscored" + ) + return { + "claim_id": claim.id, + "kind": claim.kind, + "load_bearing": claim.load_bearing, + "strongest_strength": strongest, + "watch": sorted({w for s in claim.checkers for w in s.watch}), + "mutations": mutations, + "quality": quality, + } + + +def summarize(records: list[dict]) -> dict: + counts: dict[str, int] = {} + for r in records: + counts[r["quality"]] = counts.get(r["quality"], 0) + 1 + mut: dict[str, int] = {} + for r in records: + for m in r["mutations"]: + mut[m["outcome"]] = mut.get(m["outcome"], 0) + 1 + return { + "claims": len(records), + "by_quality": dict(sorted(counts.items())), + "by_outcome": dict(sorted(mut.items())), + } + + +def render(report: dict) -> str: + lines = [f"# warrant quality: {report['artifact_uri']}", ""] + for r in report["claims"]: + lines.append(f"- `{r['claim_id']}` [{r['quality']}] strongest={r['strongest_strength']}") + for m in r["mutations"]: + lines.append(f" {m['mutation']}: {m['outcome']} ({m['verdict']})") + s = report["summary"] + lines += ["", f"{s['claims']} claim(s): {s['by_quality']}", f"mutations: 
{s['by_outcome']}"] + return "\n".join(lines) + "\n" + + +def build(repo: Path, artifact_uri: str, policy: ExecutionPolicy) -> dict: + warrant = Warrant.load(repo / (artifact_uri + ".warrant")) + records = [score_claim(repo, c, policy) for c in warrant.claims] + return { + "schema": SCHEMA, + "artifact_uri": artifact_uri, + "claims": records, + "summary": summarize(records), + } + + +def main(argv: list[str]) -> int: + ap = argparse.ArgumentParser(prog="dorian bench warrant-quality") + ap.add_argument("artifact", help="warranted artifact (its .warrant is scored)") + ap.add_argument("--repo", default=".") + ap.add_argument("--json", action="store_true") + ap.add_argument("--deny-exec", action="store_true") + ap.add_argument("--deny-shell", action="store_true") + args = ap.parse_args(argv) + repo = Path(args.repo).resolve() + artifact_uri = Path(args.artifact).as_posix() + sidecar = repo / (artifact_uri + ".warrant") + if not sidecar.is_file(): + print(f"dorian bench warrant-quality: no warrant for {artifact_uri}", file=sys.stderr) + return 2 + policy = ExecutionPolicy.from_flags_and_env( + deny_exec=args.deny_exec, deny_shell=args.deny_shell + ) + report = build(repo, artifact_uri, policy) + print(json.dumps(report, sort_keys=True, indent=2) if args.json else render(report), end="") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main(sys.argv[1:])) diff --git a/src/dorian/commands.py b/src/dorian/commands.py index d6feb9b..5e0e80a 100644 --- a/src/dorian/commands.py +++ b/src/dorian/commands.py @@ -732,6 +732,7 @@ def cmd_report(args: argparse.Namespace) -> int: "large-mutation": ("bench.large_mutation", False), "binding-lifecycle": ("bench.binding_lifecycle", False), "realworld-usecases": ("bench.realworld_usecases", False), + "warrant-quality": ("bench.warrant_quality", False), "churn": ("bench.churn", False), } diff --git a/tests/test_warrant_quality.py b/tests/test_warrant_quality.py new file mode 100644 index 0000000..b3bebe7 --- /dev/null +++ 
b/tests/test_warrant_quality.py @@ -0,0 +1,142 @@ +"""`dorian bench warrant-quality` (WP8): per-claim mutation scoring. + +Pins the WP8 acceptance matrix: deterministic output; a weak (existence) claim scores +its ceiling and never falsely "strong"; a strong structural claim catches its falsifying +mutation; a benign formatting mutation does not lower the score; ERROR (policy-blocked) +is its own bucket, never a miss. Offline and side-effect-free (the real repo is never +mutated).""" + +from __future__ import annotations + +import importlib +import sys +from pathlib import Path + +import pytest + +from conftest import commit_all, git, write +from dorian.model import ProducedBy, ReadSet +from dorian.policy import ExecutionPolicy +from dorian.seal import seal_artifact + +PB = ProducedBy(runner="manual", captured_at="2026-01-01T00:00:00Z") + + +@pytest.fixture +def wq(): + """The repo-local bench module (loaded the way `dorian bench` loads it).""" + sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + return importlib.import_module("bench.warrant_quality") + + +def _repo(tmp_path: Path) -> Path: + repo = tmp_path / "repo" + repo.mkdir() + git(repo, "init", "-q", "-b", "main") + write(repo, "src/auth.py", "def verify_token(token):\n return bool(token)\n") + write(repo, "src/config.py", "TIMEOUT = 30\n") + write(repo, "note.md", "# note\n\nstuff\n") + commit_all(repo, "init") + return repo + + +def _seal(repo: Path, claims: list[dict]) -> None: + from dorian.model import CheckerSpec, Claim + + objs = [ + Claim( + id=c["id"], + text=c.get("text", c["id"]), + kind=c["kind"], + load_bearing=c.get("load_bearing", True), + checkers=(CheckerSpec(type="C3", program=c["program"]),), + ) + for c in claims + ] + seal_artifact(repo, "note.md", ReadSet(entries=(), produced_by=PB), objs) + + +def _score(wq, repo: Path) -> dict: + return wq.build(repo, "note.md", ExecutionPolicy()) + + +def test_structural_claim_catches_falsifying_mutation(wq, tmp_path: Path) -> None: + 
repo = _repo(tmp_path) + _seal( + repo, + [{"id": "timeout", "kind": "quantity", "program": "py-const:src/config.py::TIMEOUT::30"}], + ) + rec = _score(wq, repo)["claims"][0] + assert rec["quality"] == "strong" + falsify = next(m for m in rec["mutations"] if m["expectation"] == "falsify") + assert falsify["outcome"] == "caught" # reassigning the value -> FAIL + + +def test_signature_claim_catches_param_change(wq, tmp_path: Path) -> None: + repo = _repo(tmp_path) + _seal( + repo, + [ + { + "id": "sig", + "kind": "behavior", + "program": "py-signature:src/auth.py::verify_token::token", + } + ], + ) + rec = _score(wq, repo)["claims"][0] + falsify = next(m for m in rec["mutations"] if m["expectation"] == "falsify") + assert falsify["outcome"] == "caught" + benign = next(m for m in rec["mutations"] if m["expectation"] == "benign") + assert benign["outcome"] == "ok" # a trailing comment must not false-alarm + + +def test_existence_claim_shows_its_ceiling(wq, tmp_path: Path) -> None: + repo = _repo(tmp_path) + _seal( + repo, [{"id": "exists", "kind": "behavior", "program": "symbol:src/auth.py::verify_token"}] + ) + rec = _score(wq, repo)["claims"][0] + outcomes = {m["mutation"]: m["outcome"] for m in rec["mutations"]} + assert outcomes["rename_definition"] == "caught" # renaming the def IS caught + assert outcomes["append_content"] == "ceiling" # content drift keeping the name is NOT + assert rec["strongest_strength"] == "existence" + + +def test_unsupported_form_is_reported_not_faked(wq, tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "data.csv", "a,b\n1,2\n") + commit_all(repo, "data") + _seal(repo, [{"id": "rows", "kind": "quantity", "program": "string:src/config.py::TIMEOUT"}]) + rec = _score(wq, repo)["claims"][0] + # string: is not deterministically mutated by the harness -> reported, never faked + assert any(m["outcome"] == "unsupported" for m in rec["mutations"]) + + +def test_deterministic_output(wq, tmp_path: Path) -> None: + repo = 
_repo(tmp_path) + _seal( + repo, + [{"id": "timeout", "kind": "quantity", "program": "py-const:src/config.py::TIMEOUT::30"}], + ) + assert _score(wq, repo) == _score(wq, repo) + # and the real repo file was never mutated + assert (repo / "src/config.py").read_text() == "TIMEOUT = 30\n" + + +def test_cli_smoke_json(wq, tmp_path: Path, capsys) -> None: + from dorian import cli + + repo = _repo(tmp_path) + _seal( + repo, + [{"id": "timeout", "kind": "quantity", "program": "py-const:src/config.py::TIMEOUT::30"}], + ) + # --repo after the bench subcommand: it flows to the bench module's own parser + rc = cli.main(["bench", "warrant-quality", "note.md", "--repo", str(repo), "--json"]) + assert rc == 0 + import json + + out = json.loads(capsys.readouterr().out) + assert out["schema"] == "dorian-warrant-quality-v1" + assert out["summary"]["by_quality"].get("strong") == 1 From 4e586a7fe08557056548c228a8467b7e62fa9add Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:08:22 +0530 Subject: [PATCH 05/13] docs(v1): current-version benchmark results + evidence hygiene (WP9, WP1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - docs/BENCHMARK_CURRENT.md: version- and commit-stamped reruns of the reproducible suites on current code — large-mutation (240 pairs, P=R=0.93, 11.6x/10.4x FP reduction), binding-lifecycle (808 pairs, selection recall 0.54->1.00, alarm precision/recall 1.00, 0 errored), realworld (5 cases 2/1/2), and the new warrant-quality harness. The reruns MATCH the historical runs (same content-derived run_id), proving the V1 changes are additive and do not regress the benchmarks. Includes a what-this-does-NOT-prove block. - HISTORICAL banners on docs/BENCHMARK_v0.7.0.md (v0.7.0) and docs/BENCHMARK_BINDING_LIFECYCLE.md (0.9.0), each pointing to BENCHMARK_CURRENT.md; the historical numbers are preserved verbatim. 
- docs/V1_SCOPE.md: what V1 strengthening means and does NOT mean (no universal semantic correctness; trusted-base is a trust root not a sandbox; config binding is TOML/JSON only; code:/structural are Python-only; extractor stays draft; carried-forward limitations). - README: trust-state legend (WARRANTED born -> TRUSTED/DEGRADED/REVOKED/UNKNOWN), historical labels on the benchmark citations, command-surface entries for the new C3 forms, checker-strength in bindings, config provenance in bind-suggest, checker-source base, and bench warrant-quality. - tests/test_benchmark_evidence.py: wording guards (historical docs labeled; current doc version/commit-stamped with a non-overclaim block; README links current; V1_SCOPE boundary). Co-Authored-By: Claude Opus 4.8 (1M context) --- README.md | 38 ++++++++++---- V1_IMPLEMENTATION_TRACKER.md | 2 +- docs/BENCHMARK_BINDING_LIFECYCLE.md | 6 +++ docs/BENCHMARK_CURRENT.md | 77 +++++++++++++++++++++++++++++ docs/BENCHMARK_v0.7.0.md | 5 ++ docs/V1_SCOPE.md | 64 ++++++++++++++++++++++++ tests/test_benchmark_evidence.py | 61 +++++++++++++++++++++++ 7 files changed, 243 insertions(+), 10 deletions(-) create mode 100644 docs/BENCHMARK_CURRENT.md create mode 100644 docs/V1_SCOPE.md create mode 100644 tests/test_benchmark_evidence.py diff --git a/README.md b/README.md index ee88db8..06c59cf 100644 --- a/README.md +++ b/README.md @@ -103,6 +103,12 @@ fold sha256:7920c71b5a6a9c8e WARRANTED -> REVOKED The summary still reads perfectly. Its portrait flipped to **REVOKED** — and every artifact whose warrant was built on it is flagged `recalled`, so nobody builds on a claim that silently went false. +> **Trust states.** A warrant is born **WARRANTED**. 
Each `revalidate` folds it to **TRUSTED** +> (all re-checked claims hold), **DEGRADED** or **REVOKED** (a claim broke — DEGRADED for a +> non-load-bearing break, REVOKED for a load-bearing one), or **UNKNOWN** (a checker could not +> run — ERROR is never silently green and never counted as broken). So `WARRANTED -> REVOKED` +> above is the born state folding on its first revalidation. + ## We ran this on dorian itself The `verify` and `revalidate` output above is exactly what dorian prints, shown for an illustrative @@ -169,7 +175,10 @@ path-scope watcher (58 → 5 false alarms) and **10.4x** versus the stronger lin 1.00 by construction here; the meaningful axis is their precision.) These numbers describe a synthetic fixture suite, not your repository, and are not a universal -performance claim. See [`docs/BENCHMARK_v0.7.0.md`](docs/BENCHMARK_v0.7.0.md) (protocol: +performance claim. The headline figures were **measured at v0.7.0** and are **historical**; the +current version reproduces them unchanged (240 pairs, P=R=0.93) — see the version-stamped +[`docs/BENCHMARK_CURRENT.md`](docs/BENCHMARK_CURRENT.md). See +[`docs/BENCHMARK_v0.7.0.md`](docs/BENCHMARK_v0.7.0.md) (protocol: [`docs/BENCHMARK_PROTOCOL_v0.7.0.md`](docs/BENCHMARK_PROTOCOL_v0.7.0.md)); reproduce with `dorian bench large-mutation`, and measure your own repos with the harness in `bench/`. @@ -207,6 +216,8 @@ trigger-vs-truth ceiling, on a real class (**partial**). Two further cases (docu sources, not reproduced) are honest misses (**not_solved**). These are scoped reproductions of public problem classes — not universal validation. +The 808-pair figures above were **measured at dorian 0.9.0** and are **historical**; the +current-version rerun (same protocol) is in [`docs/BENCHMARK_CURRENT.md`](docs/BENCHMARK_CURRENT.md). 
See [`docs/BENCHMARK_BINDING_LIFECYCLE.md`](docs/BENCHMARK_BINDING_LIFECYCLE.md) and [`docs/REALWORLD_USECASES.md`](docs/REALWORLD_USECASES.md) (protocols alongside each); reproduce with `dorian bench binding-lifecycle` and `dorian bench realworld-usecases`. @@ -359,8 +370,10 @@ A warrant is worth only what its checkers actually catch. The full authoring con load-bearing claim, **bind** the file that would change if the claim went false, **prefer** shape-tolerant checks like `regex:`/`symbol:`/typed-C5 over brittle `string:`) — lives in [`docs/AGENT_CLAIMS.md`](docs/AGENT_CLAIMS.md). Checker program grammars (C1 span, C3 -path/symbol/string/regex, C4 `pytest:`, C5 typed data) are documented in -[`spec/checkers.md`](spec/checkers.md). +path/symbol/string/regex plus the V1 structural forms `py-signature:`/`py-const:` and the +comment/docstring-stripped `code:`, C4 `pytest:`, C5 typed data) are documented in +[`spec/checkers.md`](spec/checkers.md). What V1 strengthening does and does not promise is in +[`docs/V1_SCOPE.md`](docs/V1_SCOPE.md). > **Checker programs are executable.** `dorian verify` *runs* every checker at seal time. C3 and typed > C5 only inspect files, but C4 (`pytest:`) and C5 `shell:` execute code — review an agent-emitted @@ -386,12 +399,16 @@ claims. event: a flag only — downstream is never re-checked and its states are untouched. Re-seal with `seal --supersede ` so downstream warrants sealed against the old id stay reachable. - `dorian bindings ` — binding-quality diagnostics (unbacked, single-file, short-literal, - ambiguous-mention, trigger-only-symbol, unwatched-mention). Informational, never a gate; output - carries file paths only, never matched content. `ambiguous-mention` surfaces a load-bearing claim - whose symbol is defined in more than one file (so no definer is auto-watched); `trigger-only-symbol` - marks a watch added only as a re-check *trigger* that no checker actually exercises. 
-- `dorian bind-suggest --claims claims.json` — read-only preview of the symbol-definer files `verify` - would auto-bind for each claim (and the ambiguous symbols it would skip). Writes nothing, never a gate. + ambiguous-mention, trigger-only-symbol, unwatched-mention) **plus per-claim checker-strength and + claim-risk** (it classifies each checker's *truth strength* and flags adequacy mismatches — a + `behavior` claim backed only by an existence checker, a vacuous pytest node). Informational, never a + gate; output carries file paths only, never matched content. +- `dorian bind-suggest --claims claims.json` — read-only preview of the files `verify` would auto-bind + for each claim, **with provenance** (symbol-definer vs config-key), the ambiguous symbols/keys it + would skip, and any unparseable config file. Writes nothing, never a gate. +- `dorian revalidate --checker-source base` (also Action `checker_trust: base`; default `head`) — + resolve each claim's checker spec from the `--since` base ref so a PR-added or PR-modified executable + checker is never executed (public/fork PRs). Fail-closed, **not a sandbox** — pair with `--deny-exec`. - `dorian rebind ` — re-derive a warrant's symbol-definer watches with the current binding logic and re-seal it (born-verifiable, superseding the old id), so a warrant sealed before the symbol index existed gains the wider watches. The watch only ever widens; a claim that has since become false @@ -420,6 +437,9 @@ claims. benchmark for symbol binding ([`docs/BENCHMARK_BINDING_LIFECYCLE.md`](docs/BENCHMARK_BINDING_LIFECYCLE.md)). `dorian bench realworld-usecases` runs the offline public-case reproductions ([`docs/REALWORLD_USECASES.md`](docs/REALWORLD_USECASES.md)). +- `dorian bench warrant-quality ` — offline per-claim mutation scoring: for each claim, does + its checker catch the drift it implies (caught / missed / brittle / ceiling)? Deterministic, never + mutates the real repo. 
Separates trigger from verdict; see [`docs/V1_SCOPE.md`](docs/V1_SCOPE.md). Exit codes: `0` ok/TRUSTED · `2` usage/infra (incl. a C1 or C5 `shell:` claim handed to `verify`) · `3` DEGRADED · `4` REVOKED/integrity · `5` ERRORED-only (checkers could not run — never conflated with diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md index d9bf12c..24bcfb7 100644 --- a/V1_IMPLEMENTATION_TRACKER.md +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -99,7 +99,7 @@ Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-documen | WP | Title | Status | |---|---|---| -| WP1 | docs/evidence hygiene | TODO | +| WP1 | docs/evidence hygiene | DONE (trust-state legend; historical banners on v0.7.0/0.9.0 benchmark docs; docs/V1_SCOPE.md; README command-surface + new-forms + historical labels; benchmark-evidence wording tests) | | WP2 | checker-strength / claim-risk linter | DONE (strength.py; surfaced in `bindings` + binding-gate warn; 19 tests) | | WP3 | Python structural checkers (py-signature, py-const) | DONE (pyast.py + C3 subgrammars; 27 tests incl. e2e) | | WP4 | semantic-context source search (`code:`) | DONE (pyast.code_only_python + C3 `code:`; 12 tests) | diff --git a/docs/BENCHMARK_BINDING_LIFECYCLE.md b/docs/BENCHMARK_BINDING_LIFECYCLE.md index 6a21eab..6c51b96 100644 --- a/docs/BENCHMARK_BINDING_LIFECYCLE.md +++ b/docs/BENCHMARK_BINDING_LIFECYCLE.md @@ -1,5 +1,11 @@ # dorian binding-lifecycle benchmark +> **HISTORICAL — measured at dorian 0.9.0** (see the run header below; the preserved 808-pair +> full run). Evidence about the 0.9.0 implementation, not current behavior. The current-version +> rerun (0.11.0, identical results — see [`BENCHMARK_CURRENT.md`](BENCHMARK_CURRENT.md)) confirms +> the V1 changes did not regress it. NOTE: `dorian bench binding-lifecycle` REGENERATES this file; +> restore it from git after a rerun so the historical record survives. + > Generated from machine output by `bench.binding_lifecycle`. 
Known-truth labels, > in-fixture results — a reproducible demonstration of the MECHANISM on this suite, > not evidence about any real repository. diff --git a/docs/BENCHMARK_CURRENT.md b/docs/BENCHMARK_CURRENT.md new file mode 100644 index 0000000..daa5f5d --- /dev/null +++ b/docs/BENCHMARK_CURRENT.md @@ -0,0 +1,77 @@ +# Current-version benchmark results + +Version-stamped reruns of dorian's reproducible benchmark suites on the **current** code, +so the published numbers track the implementation rather than lagging behind it. The older +result docs ([`BENCHMARK_v0.7.0.md`](BENCHMARK_v0.7.0.md) = v0.7.0, +[`BENCHMARK_BINDING_LIFECYCLE.md`](BENCHMARK_BINDING_LIFECYCLE.md) = 0.9.0) are **historical** +and are kept as-is for provenance. + +## Measurement environment + +| field | value | +| --- | --- | +| dorian version | `0.11.0` (V1 candidate) | +| measured commit | `2a66a49eee7b8aa069d7fb9222572b272493856d` | +| Python | 3.12.4 | +| platform | darwin (CI matrix: 3.11 / 3.12 / 3.13) | +| reproduce | `dorian bench large-mutation` · `dorian bench binding-lifecycle` · `dorian bench realworld-usecases` | + +## Results + +### Large controlled-mutation (240 pairs, 6 synthetic domains) + +``` +dorian: precision 0.93 / recall 0.93 +file-change watchers: recall 1.00 / precision 0.34 (naive), 0.56 (path-scope), 0.59 (line-aware) +false-positive reduction: 11.6x vs path-scope (58 -> 5), 10.4x vs line-aware (52 -> 5) +``` + +**Identical to the v0.7.0 historical figures** — the V1 additions (structural checkers, +semantic-context search, config-key binding, checker-strength diagnostics, trusted-base mode) +are additive and do **not** regress this suite. 
+ +### Binding-lifecycle (808 pairs, 63 synthetic domains, two mechanically-frozen labels) + +``` +selection (trigger) recall: checker_path_watcher 0.54 -> bound_dorian_candidate 1.00 + (286 trigger-stale pairs re-checked that the pre-binding checker-path watcher silently skips) +selection precision: bound_dorian_candidate 1.00 (vs 0.92 for the rejected "watch any file with the token") +verdict (alarm) precision/recall: 1.00 / 1.00 (174/174 fact-stale pairs), 0 false BROKEN over all 808 +errored pairs: 0 (ERRORED is reported separately, never an alarm) +gutted-body ceiling: existence checker fires the trigger but yields 0 BROKEN; only a C4 test catches it +``` + +**Identical to the 0.9.0 historical run** (same content-derived `run_id 168b50d9aa631d52`) — again +confirming the V1 changes did not move the binding-lifecycle numbers. + +### Real-world public-case reproductions (5 cases, offline hermetic fixtures) + +``` +solved 2 · partial 1 · not_solved 2 +``` + +Scoped reproductions of public problem *classes* (the public issue is the template; the fixture +is invented), **not** broad real-world validation. + +### Warrant-quality harness (new in V1) + +`dorian bench warrant-quality ` scores, per claim and offline, whether the checker catches +the drift it implies (caught / missed / brittle / ceiling), separating the trigger layer from the +verdict layer and keeping ERROR distinct from a miss. It is an evidence generator about *a specific +warrant*, not a repo-level metric; see [`V1_SCOPE.md`](V1_SCOPE.md). 
+ +## What these results prove — and what they do not + +**Allowed (per [`VALIDATION_HONESTY.md`](VALIDATION_HONESTY.md)):** + +- the mechanism **reproduces** on the named synthetic suites at the stamped version and commit; +- on those inputs, claim-level revalidation has far fewer false re-checks than a file watcher, and + binding's trigger recall is near-complete with zero false BROKEN; +- the V1 changes did **not regress** the prior numbers (the reruns match the historical runs). + +**NOT supported:** + +- "works on real repos in general" / "validated" / "production-grade" — these are synthetic suites; +- that the numbers transfer to your codebase; +- that binding proves semantic behavior — it widens the re-check trigger; the checker decides truth + (the gutted-body ceiling is shown, not solved). diff --git a/docs/BENCHMARK_v0.7.0.md b/docs/BENCHMARK_v0.7.0.md index 5789f0f..da0be6c 100644 --- a/docs/BENCHMARK_v0.7.0.md +++ b/docs/BENCHMARK_v0.7.0.md @@ -1,5 +1,10 @@ # dorian large controlled-mutation benchmark (v0.7.0) +> **HISTORICAL — measured at v0.7.0.** These numbers are evidence about the v0.7.0 +> implementation, not current behavior. For the current-version rerun (same protocol, +> stamped with the measured commit) see [`BENCHMARK_CURRENT.md`](BENCHMARK_CURRENT.md). +> Reproduce this suite at any version with `dorian bench large-mutation`. + Numbers only. Labels are **known-truth**: each mutation's stale / not-stale outcome for a claim is a mechanical consequence of the edit (e.g. changing `TIMEOUT = 30` to `10` falsifies the claim "the default timeout is 30 seconds"). diff --git a/docs/V1_SCOPE.md b/docs/V1_SCOPE.md new file mode 100644 index 0000000..29504c8 --- /dev/null +++ b/docs/V1_SCOPE.md @@ -0,0 +1,64 @@ +# What V1 means — and what it does not + +dorian's V1 strengthening is **deterministic strengthening on supported domains**, not a +promise of universal correctness. 
This page states the boundary so no feature or +benchmark can be read as more than it is. It is the companion to +[`VALIDATION_HONESTY.md`](VALIDATION_HONESTY.md) (evidence wording) and +[`SECURITY_BOUNDARY.md`](SECURITY_BOUNDARY.md) (execution/trust). + +## What V1 adds + +All additive and backward-compatible; default behavior is unchanged unless you opt in. + +- **Python structural checkers** — `py-signature:` and `py-const:` (C3 subgrammars) compare + parsed AST structure and literal **values**, closing the `symbol:` existence ceiling and the + `string:`/`regex:` comment-survival false-pass for Python signatures and constants. +- **Semantic-context search** — `code:` runs a regex over comment/docstring-stripped Python, + so a fact surviving only in a comment or docstring FAILs while the same fact in real code + passes. (`spec/checkers.md`.) +- **Checker-strength / claim-risk diagnostics** — `dorian bindings` and the `--binding-gate` + warn output now classify each checker's *truth strength* and flag kind-vs-strength + adequacy mismatches (a `behavior` claim backed only by an existence checker; a vacuous + pytest node). Advisory; it never changes a verdict, trust state, or exit code. +- **Multi-index binding** — config keys in tracked `.toml`/`.json` files now widen a claim's + re-check trigger set (with provenance in `bind-suggest`). Conservative and trigger-only. +- **Trusted-base checker-source mode** — `revalidate --checker-source base` / Action + `checker_trust: base` runs only base-approved checker specs, for public/fork PRs. +- **Warrant-quality harness** — `dorian bench warrant-quality` scores per-claim whether a + checker catches the drift it implies, offline and deterministically. + +## What V1 does NOT mean + +- **Not universal semantic correctness.** dorian verifies *stated claims against the source* + with deterministic checkers. 
It cannot prove arbitrary prose, runtime behavior without a + test, external-system state, or anything outside a supported checker/binding domain. +- **The trigger-vs-truth ceiling is real and visible, not removed.** Binding decides WHEN a + claim is re-checked; the checker decides WHETHER it is false. A `symbol:`/`py-signature:` + checker is blind to a body-only ("gutted body") change — only a `pytest:` test catches that. + The checker-strength diagnostics and the warrant-quality harness *surface* this; they do not + eliminate it. +- **No public-fork safety beyond the trust root.** `checker_trust: base` stops PR-authored + executable checkers from running, but a base-approved `pytest:` checker can still execute + PR-head code. It is a checker-source trust root, **not a sandbox**; for untrusted forks + combine it with `deny_exec: true` (or external isolation). `--deny-exec`/`--deny-shell` are + fail-closed, not sandboxes. +- **Config binding is TOML/JSON only.** YAML is not indexed — parsing it needs a third-party + dependency and dorian's core has zero runtime deps. An unparseable supported config file is + surfaced as a diagnostic, never silently skipped, but a key dorian cannot index is an honest + miss, not a guarantee. +- **`code:`/structural forms are Python-only.** Other languages still rely on raw `string:`/ + `regex:` text search, which retains the comment/docstring survival class. +- **The LLM extractor stays draft/experimental.** V1 does not promote `--extract`; emit claims + directly (`docs/AGENT_CLAIMS.md`). +- **Benchmarks prove reproducibility on named inputs**, never "works on real repos" — see + `VALIDATION_HONESTY.md`. Historical result docs (v0.7.0, 0.9.0) are labeled historical; + current-version numbers live in `BENCHMARK_CURRENT.md`. 
+ +## Known limitations carried into V1 (documented, not fixed) + +- **Audit/state atomicity** — a claim/trust-state change and its audit event commit in + separate transactions; a crash between the two can leave the event missing (`fold.py`). +- **Ambiguous bindings are skipped, not resolved** — a symbol or config key defined in more + than one file is left unwatched and surfaced for manual binding, never guessed. +- **ERROR is never BROKEN** — a checker that cannot run (bad program, missing engine, blocked + by policy, unresolved base sidecar) is ERRORED, never a staleness verdict, end to end. diff --git a/tests/test_benchmark_evidence.py b/tests/test_benchmark_evidence.py new file mode 100644 index 0000000..24aabcc --- /dev/null +++ b/tests/test_benchmark_evidence.py @@ -0,0 +1,61 @@ +"""Benchmark-evidence hygiene (WP1/WP9): historical vs current, honestly labeled. + +The repo's published numbers must not silently imply that older results describe the +current implementation. These pin: historical result docs carry a HISTORICAL banner; the +current-version results doc is version- and commit-stamped and carries a what-it-does-NOT- +prove block; the README labels the older numbers historical and points to the current doc; +and the V1-scope doc states the boundary. Wording tests, no network. 
+""" + +from __future__ import annotations + +from pathlib import Path + +ROOT = Path(__file__).resolve().parent.parent + + +def _read(rel: str) -> str: + return (ROOT / rel).read_text(encoding="utf-8") + + +def test_historical_benchmark_docs_are_labeled_historical() -> None: + for rel in ("docs/BENCHMARK_v0.7.0.md", "docs/BENCHMARK_BINDING_LIFECYCLE.md"): + doc = _read(rel) + assert "HISTORICAL" in doc, f"{rel} must be labeled HISTORICAL" + assert "BENCHMARK_CURRENT.md" in doc, f"{rel} must point to the current-version doc" + + +def test_current_benchmark_doc_is_version_and_commit_stamped() -> None: + doc = _read("docs/BENCHMARK_CURRENT.md") + assert "0.11.0" in doc # dorian version stamp + assert "measured commit" in doc.lower() + assert "Python" in doc # environment summary + assert "reproduce" in doc.lower() + # the mandatory non-overclaim block + low = doc.lower() + assert "not supported" in low or "do not" in low + assert "synthetic" in low + assert "binding proves" in low or "binding" in low # trigger-vs-truth caveat present + + +def test_current_doc_does_not_claim_real_world_validation() -> None: + doc = _read("docs/BENCHMARK_CURRENT.md").lower() + # the doc may NEGATE these phrases, but must never assert them + assert "works on real repos in general" not in doc.replace( + '"works on real repos in general"', "" + ) + + +def test_readme_labels_benchmarks_historical_and_links_current() -> None: + readme = _read("README.md") + assert "historical" in readme.lower() + assert "docs/BENCHMARK_CURRENT.md" in readme + + +def test_v1_scope_doc_states_the_boundary() -> None: + doc = _read("docs/V1_SCOPE.md") + low = doc.lower() + assert "not universal semantic correctness" in low + assert "not a sandbox" in low + assert "gutted body" in low or "gutted-body" in low # the ceiling is named + assert "yaml" in low # the config-binding boundary is stated From 2a4befaff04e5a4551284f2b91ffa00b0c7d9d6c Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:19:36 
+0530 Subject: [PATCH 06/13] fix(v1): keep BENCHMARK_v0.7.0.md byte-identical to its generator test_large_mutation::test_committed_doc_matches_render asserts docs/BENCHMARK_v0.7.0.md == lm.render_markdown(summary), so the generated doc cannot carry a hand-added banner. Drop the HISTORICAL banner from it (the title already version-stamps it "(v0.7.0)"); its historical status is conveyed by README + BENCHMARK_CURRENT.md (which names it as the historical source). The binding-lifecycle doc has no byte-match guard, so it keeps its banner. Updated test_benchmark_evidence to match: binding-lifecycle by banner, v0.7.0 by version-stamped title + the current doc's cross-reference. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/BENCHMARK_v0.7.0.md | 5 ----- tests/test_benchmark_evidence.py | 16 ++++++++++++---- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/docs/BENCHMARK_v0.7.0.md b/docs/BENCHMARK_v0.7.0.md index da0be6c..5789f0f 100644 --- a/docs/BENCHMARK_v0.7.0.md +++ b/docs/BENCHMARK_v0.7.0.md @@ -1,10 +1,5 @@ # dorian large controlled-mutation benchmark (v0.7.0) -> **HISTORICAL — measured at v0.7.0.** These numbers are evidence about the v0.7.0 -> implementation, not current behavior. For the current-version rerun (same protocol, -> stamped with the measured commit) see [`BENCHMARK_CURRENT.md`](BENCHMARK_CURRENT.md). -> Reproduce this suite at any version with `dorian bench large-mutation`. - Numbers only. Labels are **known-truth**: each mutation's stale / not-stale outcome for a claim is a mechanical consequence of the edit (e.g. changing `TIMEOUT = 30` to `10` falsifies the claim "the default timeout is 30 seconds"). 
diff --git a/tests/test_benchmark_evidence.py b/tests/test_benchmark_evidence.py index 24aabcc..af4eee6 100644 --- a/tests/test_benchmark_evidence.py +++ b/tests/test_benchmark_evidence.py @@ -19,10 +19,18 @@ def _read(rel: str) -> str: def test_historical_benchmark_docs_are_labeled_historical() -> None: - for rel in ("docs/BENCHMARK_v0.7.0.md", "docs/BENCHMARK_BINDING_LIFECYCLE.md"): - doc = _read(rel) - assert "HISTORICAL" in doc, f"{rel} must be labeled HISTORICAL" - assert "BENCHMARK_CURRENT.md" in doc, f"{rel} must point to the current-version doc" + # the binding-lifecycle doc is NOT byte-matched to its generator, so it carries an + # explicit HISTORICAL banner pointing to the current-version doc. + bl = _read("docs/BENCHMARK_BINDING_LIFECYCLE.md") + assert "HISTORICAL" in bl, "binding-lifecycle doc must be labeled HISTORICAL" + assert "BENCHMARK_CURRENT.md" in bl + # the large-mutation doc IS byte-matched to its generator (test_large_mutation), so it + # cannot carry a hand banner; its historical status is its version-stamped title plus + # the current-results doc naming it as the historical source. + v07 = _read("docs/BENCHMARK_v0.7.0.md") + assert "(v0.7.0)" in v07, "the large-mutation doc title must carry its version stamp" + cur = _read("docs/BENCHMARK_CURRENT.md") + assert "BENCHMARK_v0.7.0.md" in cur and "historical" in cur.lower() def test_current_benchmark_doc_is_version_and_commit_stamped() -> None: From a6595baaf02f285ea21472caa48b85c5eeddf3a0 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:31:59 +0530 Subject: [PATCH 07/13] fix(v1): resolve all 6 adversarial-review BLOCK findings + 2 hygiene items MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A 5-lens adversarial review (BLOCK verdict) reproduced 6 real defects; all fixed red-green, plus two hygiene items that contradicted stated invariants: 1. 
Config-key over-binding broke "default unchanged unless opt-in": a claim backticking a
   common config word (e.g. `dependencies`) bound pyproject.toml and could newly refuse a
   clean `verify` with exit 6. Fix: _CONFIG_KEY_STOPWORDS (PEP 621 / common keys) on the
   config axis; specific keys (max_workers) still bind. Regression test reproduces the
   exit-6.
2/3. SECURITY.md + action/README.md still said trusted-base was "not yet implemented" —
   false on this branch. Updated both to describe checker_trust: base as shipped (with the
   non-sandbox residual); added a guard test so the drift can't recur.
4. `code:` false PASS on an f-string docstring — code_only_python now recognises
   ast.JoinedStr docstrings.
5. `code:` false FAIL on a real string co-located on a docstring's line — docstrings are
   now blanked by AST node SPAN, not whole line.
6. `py-const` PASSed on value-TYPE drift (30/30.0, 1/True, 0/False) via Python == — now
   requires matching type before ==. Documented + red-green tested.

Hygiene: warrant-quality _run_mutated refuses a `../`-escaping file operand (its docstring
promised containment); check_signature wraps comparison in _PARSE_ERRORS so a pathological
signature ERRORs within pyast. Added an end-to-end ERROR-never-BROKEN test for the new C3
forms (non-literal RHS -> ERRORED, exit 5, never BROKEN).

658 non-slow tests pass; lint clean.
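Finding 6 rests on a Python equality quirk: `30 == 30.0` and `1 == True` both hold, so a plain `==` cannot detect a type-only drift. A minimal sketch of the type-first comparison follows; `strict_equal` is a hypothetical name used for illustration, not a function in `dorian/pyast.py`.

```python
import ast


def strict_equal(got: object, want: object) -> bool:
    """Return True only when both the type and the value match (hypothetical helper)."""
    # Compare type first: bool is a subclass of int, and int/float compare
    # equal by value, so `==` alone would accept 1/True and 30/30.0.
    return type(got) is type(want) and got == want


# Literals parsed the way a constant checker might read them from source text.
as_int = ast.literal_eval("30")
as_float = ast.literal_eval("30.0")

print(as_int == as_float)              # plain equality conflates the two
print(strict_equal(as_int, as_float))  # type-first comparison rejects the drift
print(strict_equal(1, True))           # bool vs int is rejected as well
print(strict_equal(30, 30))            # same type and same value still match
```

This sketch compares only the top-level type, so nested containers such as `[1]` and `[True]` would still compare equal.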
Co-Authored-By: Claude Opus 4.8 (1M context) --- SECURITY.md | 23 ++++++--- action/README.md | 29 ++++++----- bench/warrant_quality.py | 16 ++++-- src/dorian/pyast.py | 71 ++++++++++++++++---------- src/dorian/symbol_index.py | 66 +++++++++++++++++++++++- tests/test_action_security_defaults.py | 13 +++++ tests/test_config_binding.py | 50 ++++++++++++++++++ tests/test_pystructural.py | 59 +++++++++++++++++++++ tests/test_semantic_context.py | 18 +++++++ 9 files changed, 290 insertions(+), 55 deletions(-) diff --git a/SECURITY.md b/SECURITY.md index c746999..b1d973b 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -64,14 +64,21 @@ inside your own sandbox (container, restricted user, no secrets in env). ## Public fork PR CI -dorian does **not** currently advertise a safe public-fork-PR mode. A trusted-base -Action design (run checkers from the base ref, never from untrusted head, deny-exec -by default for forks) is documented in -[docs/TRUSTED_BASE_ACTION_DESIGN.md](docs/TRUSTED_BASE_ACTION_DESIGN.md) but is **not -yet implemented or tested**. Until it is, the safe answer for public forks is -`--deny-exec` plus the standard caution that any executed checker still runs with -the runner's privileges. Do not wire dorian into `pull_request_target` with a -checkout of untrusted head. +For public/forked-PR CI, use **trusted-base checker-source mode**: +`dorian revalidate --checker-source base` (Action input `checker_trust: base`). It +resolves each claim's checker SPEC from the trusted base ref and runs it against the +PR-head sources, so a PR-added or PR-modified executable checker is never executed and +a PR rewriting a checker spec cannot self-attest a verdict (the base-approved spec +wins). A missing or tampered base sidecar **fails closed** (ERRORED, never executed). +This is implemented and proven by the test matrix in +[docs/TRUSTED_BASE_ACTION_DESIGN.md](docs/TRUSTED_BASE_ACTION_DESIGN.md) §6 +(`tests/test_trusted_base.py`). 
+ +It is a **checker-source trust root, not a sandbox**: a base-approved `pytest:` checker +can still import and execute PR-head code. So for fully untrusted forks, combine +`checker_trust: base` **with `deny_exec: true`** (or external isolation) — any executed +checker still runs with the runner's privileges. Do not wire dorian into +`pull_request_target` with a checkout of untrusted head. ## Supported versions diff --git a/action/README.md b/action/README.md index 232cc58..558c19e 100644 --- a/action/README.md +++ b/action/README.md @@ -59,20 +59,21 @@ caveats: 1. A `.warrant` file is a **non-obvious executable input**. Reviewers who would scrutinize a workflow or `conftest.py` change may wave through a "docs-only" diff that swaps a checker `program`. -2. The verdict is **self-attested by the PR tree**. A PR can rewrite a - sidecar so a broken claim re-verifies; the trust root for what "should" - be checked is not yet the base branch. - -**deny-exec input (partial mitigation, available now).** Set `deny_exec: true` -(or `deny_shell: true`) on the Action to refuse the executable checker families -during revalidation: C4 pytest and C5 shell ERROR instead of executing, so a -PR-authored sidecar cannot make this Action run its code. It flows through the -`DORIAN_DENY_EXEC` env fallback; the default `false` preserves today's behavior -for trusted/internal repos. This is fail-closed but **not a sandbox** and **not -yet a full public-fork story**: it removes code execution but does not address -the self-attested-verdict problem (a PR can still rewrite a *non-executable* C3 -claim so a broken fact re-verifies). See `SECURITY.md` and -`docs/SECURITY_BOUNDARY.md`. +2. In the default `head` mode the verdict is **self-attested by the PR tree** — a + PR can rewrite a sidecar so a broken claim re-verifies. **`checker_trust: base` + fixes exactly this** (see below): it sources every checker spec from the base + ref, so a PR rewriting a spec can no longer weaken the verdict. 
Use `head` only + for trusted/internal repos. + +**deny-exec input.** Set `deny_exec: true` (or `deny_shell: true`) on the Action to +refuse the executable checker families during revalidation: C4 pytest and C5 shell +ERROR instead of executing, so a PR-authored sidecar cannot make this Action run its +code. It flows through the `DORIAN_DENY_EXEC` env fallback; the default `false` +preserves today's behavior for trusted/internal repos. This is fail-closed but **not +a sandbox**: on its own it removes code execution but does not address the +self-attested-verdict problem for *non-executable* checkers — that is what +`checker_trust: base` adds, and the two compose (use both for untrusted forks). See +`SECURITY.md` and `docs/SECURITY_BOUNDARY.md`. ```yaml # untrusted / public-fork posture diff --git a/bench/warrant_quality.py b/bench/warrant_quality.py index 3b0862d..e7ab4de 100644 --- a/bench/warrant_quality.py +++ b/bench/warrant_quality.py @@ -96,11 +96,19 @@ def _mutations(spec: CheckerSpec) -> Iterator[tuple[str, str, str, object]]: def _run_mutated(repo: Path, claim: Claim, spec_index: int, file: str, mutate, policy) -> Verdict: """Run one checker against a throwaway copy of `file` with `mutate` applied. Only the - one file the checker reads is materialized — the real repo is never touched.""" - original = (repo / file).read_text(encoding="utf-8", errors="replace") + one file the checker reads is materialized — the real repo is never touched, and a + warrant-controlled `file` operand that escapes the repo (e.g. 
`../`) is refused so the + harness cannot read or write outside its sandbox (the checker would ERROR on it anyway).""" + repo = repo.resolve() + src = (repo / file).resolve() + if not src.is_relative_to(repo) or not src.is_file(): + return Verdict.ERROR # path escape or missing: do not read/write outside the repo + original = src.read_text(encoding="utf-8", errors="replace") with tempfile.TemporaryDirectory() as td: - work = Path(td) - target = work / file + work = Path(td).resolve() + target = (work / file).resolve() + if not target.is_relative_to(work): + return Verdict.ERROR # write would escape the temp sandbox target.parent.mkdir(parents=True, exist_ok=True) target.write_text(mutate(original), encoding="utf-8") ctx = CheckContext(repo=work, claim=claim, policy=policy) diff --git a/src/dorian/pyast.py b/src/dorian/pyast.py index fb2c4b0..6a7ab4e 100644 --- a/src/dorian/pyast.py +++ b/src/dorian/pyast.py @@ -43,23 +43,10 @@ def code_only_python(text: str) -> str | None: tree = _parse(text) if tree is None: return None - doc_start_lines: set[int] = set() - for node in ast.walk(tree): - if isinstance(node, _SCOPE_NODES): - body = getattr(node, "body", None) - if ( - isinstance(body, list) - and body - and isinstance(body[0], ast.Expr) - and isinstance(body[0].value, ast.Constant) - and isinstance(body[0].value.value, str) - ): - doc_start_lines.add(body[0].value.lineno) - buf = [list(line) for line in text.split("\n")] - def blank(start: tuple[int, int], end: tuple[int, int]) -> None: - (sl, sc), (el, ec) = start, end + def blank(sl: int, sc: int, el: int, ec: int) -> None: + """Blank the half-open span (sl,sc)..(el,ec) to spaces. 
Lines 1-based, cols 0-based.""" for ln in range(sl, el + 1): if ln - 1 >= len(buf): break @@ -69,12 +56,28 @@ def blank(start: tuple[int, int], end: tuple[int, int]) -> None: for i in range(lo, min(hi, len(row))): row[i] = " " + # Docstrings: the first body statement of a module/class/function that is a bare string + # OR f-string expression. Blank by the NODE's span (not the whole line), so a real string + # literal co-located on the docstring's physical line is preserved; an f-string docstring + # (ast.JoinedStr) is blanked just like a plain one. + for node in ast.walk(tree): + if not isinstance(node, _SCOPE_NODES): + continue + body = getattr(node, "body", None) + if not (isinstance(body, list) and body and isinstance(body[0], ast.Expr)): + continue + val = body[0].value + is_doc = (isinstance(val, ast.Constant) and isinstance(val.value, str)) or isinstance( + val, ast.JoinedStr + ) + if is_doc and val.end_lineno is not None and val.end_col_offset is not None: + blank(val.lineno, val.col_offset, val.end_lineno, val.end_col_offset) + + # Comments: char-accurate via tokenize. 
try: for tok in tokenize.generate_tokens(io.StringIO(text).readline): - if tok.type == tokenize.COMMENT or ( - tok.type == tokenize.STRING and tok.start[0] in doc_start_lines - ): - blank(tok.start, tok.end) + if tok.type == tokenize.COMMENT: + blank(tok.start[0], tok.start[1], tok.end[0], tok.end[1]) except (tokenize.TokenError, IndentationError, SyntaxError): pass # ast parsed cleanly; a tokenizer hiccup leaves best-effort blanking return "\n".join("".join(row) for row in buf) @@ -251,14 +254,23 @@ def check_signature(text: str, needle: str) -> tuple[str, str]: if async_required and not isinstance(fn, ast.AsyncFunctionDef): return ("FAIL", f"signature_mismatch: {qualname} is not async") - mismatch = _compare_params(_params(pfn), _params(fn)) - if mismatch: - return ("FAIL", f"signature_mismatch: {qualname}: {mismatch}") - if arrow: - want_ret = ast.unparse(pfn.returns) if pfn.returns else None - got_ret = ast.unparse(fn.returns) if fn.returns else None - if want_ret != got_ret: - return ("FAIL", f"signature_mismatch: {qualname}: return {got_ret!r} != {want_ret!r}") + # normalization/comparison can hit RecursionError/MemoryError on a pathological + # (but parseable) signature, e.g. a deeply nested annotation — honor pyast's own + # ERROR contract here rather than relying on the run_checker safety net. 
+ try: + mismatch = _compare_params(_params(pfn), _params(fn)) + if mismatch: + return ("FAIL", f"signature_mismatch: {qualname}: {mismatch}") + if arrow: + want_ret = ast.unparse(pfn.returns) if pfn.returns else None + got_ret = ast.unparse(fn.returns) if fn.returns else None + if want_ret != got_ret: + return ( + "FAIL", + f"signature_mismatch: {qualname}: return {got_ret!r} != {want_ret!r}", + ) + except _PARSE_ERRORS: + return ("ERROR", f"signature_uncomparable: {qualname} has a pathological signature") return ("PASS", f"signature ok: {qualname}") @@ -288,6 +300,9 @@ def check_const(text: str, needle: str) -> tuple[str, str]: got = ast.literal_eval(rhs) except _PARSE_ERRORS: return ("ERROR", f"non_literal: {qualname} is not a literal constant") - if got == want: + # value AND type must match: Python `==` conflates 30/30.0, 1/True, 0/False, so a + # type-only drift (a bool flag becoming an int, an int becoming a float) would wrongly + # PASS the tier sold as the strong value verifier. Compare type first. + if type(got) is type(want) and got == want: return ("PASS", f"const ok: {qualname} == {expected}") return ("FAIL", f"const_value_mismatch: {qualname} != {expected}") diff --git a/src/dorian/symbol_index.py b/src/dorian/symbol_index.py index f35aff4..ad2ba19 100644 --- a/src/dorian/symbol_index.py +++ b/src/dorian/symbol_index.py @@ -42,6 +42,64 @@ class from docs/NEXT_ALGORITHMIC_BETS.md #1 — where a claim about a symbol _CONFIG_SUFFIXES = (".toml", ".json") _MIN_KEY_LEN = 4 # mirror bindings._MIN_IDENT: shorter keys are noise +# Common PEP 621 / packaging / generic config keys are ordinary English words that appear +# in claim prose constantly. Binding the config file every time one is mentioned is noise — +# and worse, it can pull a restricted config file (e.g. pyproject.toml) into the scope-linted +# read-set and newly refuse a previously-clean seal. 
So the config axis (like the symbol +# axis's _BACKTICK_STOPWORDS) skips these common keys; specific keys (max_workers, new_login) +# still bind. Found by adversarial review: a backticked `dependencies` made verify exit 6. +_CONFIG_KEY_STOPWORDS = frozenset( + { + "name", + "version", + "description", + "readme", + "license", + "authors", + "maintainers", + "keywords", + "classifiers", + "dependencies", + "scripts", + "urls", + "homepage", + "repository", + "documentation", + "changelog", + "requires", + "optional", + "project", + "build", + "tool", + "include", + "exclude", + "packages", + "source", + "target", + "default", + "type", + "format", + "title", + "summary", + "value", + "values", + "enabled", + "disabled", + "options", + "settings", + "config", + "email", + "data", + "files", + "module", + "modules", + "dependency", + "group", + "groups", + "extras", + } +) + def python_symbol_definers(repo: Path) -> dict[str, tuple[str, ...]]: """Symbol name -> the sorted, unique git-tracked `.py` files that define it @@ -221,6 +279,8 @@ def claim_config_watch_paths(repo: Path, claims: list[Claim]) -> dict[str, tuple for claim in claims: paths: set[str] = set() for token in claim_tokens.get(claim.id, ()): + if token.lower() in _CONFIG_KEY_STOPWORDS: + continue # common config word: prose, not a key to bind (over-binding/scope) files = index.get(token) if files is not None and len(files) == 1: paths.add(files[0]) @@ -252,7 +312,11 @@ def ambiguous_config_mentions( for claim in claims: if not claim.load_bearing or not isinstance(claim.text, str): continue - ambiguous = {tok: index[tok] for tok in _tokens(claim.text) if len(index.get(tok, ())) > 1} + ambiguous = { + tok: index[tok] + for tok in _tokens(claim.text) + if tok.lower() not in _CONFIG_KEY_STOPWORDS and len(index.get(tok, ())) > 1 + } if ambiguous: out[claim.id] = ambiguous return out diff --git a/tests/test_action_security_defaults.py b/tests/test_action_security_defaults.py index 2bee860..f369aff 100644 --- 
a/tests/test_action_security_defaults.py +++ b/tests/test_action_security_defaults.py @@ -61,3 +61,16 @@ def test_security_docs_state_public_fork_limitation() -> None: assert "--deny-exec" in doc assert "not a sandbox" in low assert "fork" in low # public-fork posture is addressed explicitly + + +def test_security_docs_reflect_trusted_base_as_implemented() -> None: + """Regression (adversarial review): trusted-base SHIPPED, so the docs users are routed + to must not still say it is unimplemented, and must name the actual surface.""" + sec = (REPO_ROOT / "SECURITY.md").read_text(encoding="utf-8") + action_readme = ACTION_README.read_text(encoding="utf-8") + for doc in (sec, action_readme): + low = doc.lower() + assert "not yet implemented" not in low + assert "not yet a full public-fork story" not in low + # the actual feature is named (Action input and/or CLI flag) + assert "checker_trust" in doc or "checker-source" in doc diff --git a/tests/test_config_binding.py b/tests/test_config_binding.py index b97e010..7831b88 100644 --- a/tests/test_config_binding.py +++ b/tests/test_config_binding.py @@ -144,6 +144,56 @@ def test_verify_binds_config_and_revalidate_rechecks(tmp_path: Path) -> None: assert cli.main(["--repo", str(repo), "revalidate", "--since", base]) == cli.EXIT_REVOKED +def test_common_config_key_does_not_bind(tmp_path: Path) -> None: + """A common PEP 621 / config word (dependencies, version, name, ...) 
is English prose, + not a specific key to bind — it must NOT auto-watch the config file (over-binding noise, + and it can pull a restricted config file into the scope-linted read-set).""" + repo = _repo(tmp_path) + write(repo, "pyproject.toml", '[project]\nname = "x"\nversion = "0"\ndependencies = []\n') + commit_all(repo, "pyproject") + claims = [_claim("c", "Updated the `dependencies` and `version`.", "path:pyproject.toml")] + assert symbol_index.claim_config_watch_paths(repo, claims) == {} + assert symbol_index.ambiguous_config_mentions(repo, claims) == {} + + +def test_specific_config_key_still_binds(tmp_path: Path) -> None: + repo = _repo(tmp_path) + write(repo, "settings.toml", "[server]\nmax_workers = 4\n") + commit_all(repo, "cfg") + claims = [_claim("c", "`max_workers` is 4.", "path:settings.toml")] + assert symbol_index.claim_config_watch_paths(repo, claims) == {"c": ("settings.toml",)} + + +def test_common_config_word_does_not_newly_refuse_a_clean_seal(tmp_path: Path) -> None: + """Backward-compat regression (found by adversarial review): on a repo whose pyproject is + under a restricted scope glob, a claim merely backticking `dependencies` must NOT newly + refuse `verify` with exit 6 — the checker names note.md, not pyproject.toml.""" + repo = _repo(tmp_path) + write( + repo, + "pyproject.toml", + '[project]\nname = "x"\nversion = "0"\ndependencies = []\n\n' + '[tool.dorian.scopes]\nrestricted = ["pyproject.toml"]\n', + ) + write(repo, "note.md", "# n\n\nUpdated the dependencies list.\n") + commit_all(repo, "restricted pyproject") + claims = { + "claims": [ + { + "id": "c", + "text": "Updated the `dependencies` list.", + "kind": "reference", + "load_bearing": False, + "checkers": [{"type": "C3", "program": "path:note.md"}], + } + ] + } + cp = repo / "claims.json" + cp.write_text(json.dumps(claims), encoding="utf-8") + rc = cli.main(["--repo", str(repo), "verify", "note.md", "--claims", str(cp)]) + assert rc == 0 # `dependencies` must not pull restricted 
pyproject.toml into the read-set + + def test_bind_suggest_shows_config_provenance(tmp_path: Path, capsys) -> None: repo = _repo(tmp_path) write(repo, "settings.toml", "[server]\nmax_workers = 4\n") diff --git a/tests/test_pystructural.py b/tests/test_pystructural.py index 76976e4..6fd94fe 100644 --- a/tests/test_pystructural.py +++ b/tests/test_pystructural.py @@ -232,6 +232,21 @@ def test_py_const_comment_and_docstring_survival_does_not_pass(tmp_path: Path) - assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.FAIL +def test_py_const_rejects_value_type_drift(tmp_path: Path) -> None: + """The 'strong value verifier' must not let a value's TYPE drift past on Python ==: + 30 != 30.0, 1 != True, 0 != False. Otherwise a bool flag silently becoming an int, or + an int timeout becoming a float, would re-verify green.""" + _w(tmp_path, "c.py", "TIMEOUT = 30.0\nFLAG = 1\nZERO = 0\n") + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.FAIL # int vs float + assert _run(tmp_path, "py-const:c.py::FLAG::True").verdict is Verdict.FAIL # int vs bool + assert _run(tmp_path, "py-const:c.py::ZERO::False").verdict is Verdict.FAIL # int vs bool + # same value AND type still passes + _w(tmp_path, "c.py", "TIMEOUT = 30\nRATE = 0.5\nFLAG = True\n") + assert _run(tmp_path, "py-const:c.py::TIMEOUT::30").verdict is Verdict.PASS + assert _run(tmp_path, "py-const:c.py::RATE::0.5").verdict is Verdict.PASS + assert _run(tmp_path, "py-const:c.py::FLAG::True").verdict is Verdict.PASS + + # --- end-to-end: the new forms bind, seal born-verifiable, and re-check --------- @@ -288,3 +303,47 @@ def test_structural_forms_verify_seal_and_revalidate(fixture_repo: Path) -> None apply_three_change_commit(fixture_repo) rc = cli.main(["--repo", str(fixture_repo), "revalidate", "--since", base]) assert rc == cli.EXIT_REVOKED # a load-bearing claim broke -> exit 4 + + +def test_new_form_error_folds_to_errored_not_broken(fixture_repo: Path) -> None: + """A py-const 
claim whose RHS becomes NON-LITERAL on a later edit ERRORs (the value + cannot be determined) — it must fold to ERRORED (exit 5), never BROKEN. Pins the + ERROR-never-BROKEN invariant end-to-end for the new C3 forms.""" + import json + + from conftest import commit_all, git, write + from dorian import cli + from dorian.revalidate import revalidate + + claims = { + "claims": [ + { + "id": "timeout", + "text": "The default request timeout is 30 seconds.", + "kind": "quantity", + "load_bearing": True, + "checkers": [{"type": "C3", "program": "py-const:src/config.py::TIMEOUT::30"}], + } + ] + } + (fixture_repo / "claims.json").write_text(json.dumps(claims), encoding="utf-8") + base = git(fixture_repo, "rev-parse", "HEAD") + assert ( + cli.main( + [ + "--repo", + str(fixture_repo), + "verify", + "docs/design.md", + "--claims", + str(fixture_repo / "claims.json"), + ] + ) + == 0 + ) + write(fixture_repo, "src/config.py", "TIMEOUT = compute_timeout()\nRETRIES = 3\n") + commit_all(fixture_repo, "timeout becomes a non-literal") + res = revalidate(fixture_repo, since=base) + assert {cid for _, cid, _ in res.errored} == {"timeout"} + assert res.broken == [] # ERROR is never BROKEN + assert res.exit_code == cli.EXIT_ERRORED diff --git a/tests/test_semantic_context.py b/tests/test_semantic_context.py index d045988..80949ad 100644 --- a/tests/test_semantic_context.py +++ b/tests/test_semantic_context.py @@ -104,3 +104,21 @@ def test_code_bad_regex_is_error(tmp_path: Path) -> None: def test_code_path_escape_is_error(tmp_path: Path) -> None: assert _run(tmp_path, "code:../../etc/passwd::root").verdict is Verdict.ERROR + + +def test_code_ignores_fstring_docstring_survival(tmp_path: Path) -> None: + """A fact surviving only in an f-string used as a (dead) doc statement must FAIL — + docstring detection must not be fooled by the f-string (ast.JoinedStr) form.""" + _w( + tmp_path, + "m.py", + 'def handler():\n f"""serves /v1/login historically {1}."""\n return 200\n', + ) + assert 
_run(tmp_path, "code:m.py::/v1/login").verdict is Verdict.FAIL + + +def test_code_keeps_real_string_co_located_on_docstring_line(tmp_path: Path) -> None: + """A genuine string literal sharing a physical line with a docstring must be KEPT — + docstring blanking is by AST node span, not by whole line.""" + _w(tmp_path, "m.py", 'class C:\n """DOC"""; ROUTE = "/v1/keepme"\n') + assert _run(tmp_path, "code:m.py::/v1/keepme").verdict is Verdict.PASS From b7376e7762571e7c802c220aa50c241d2dae7e39 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:37:16 +0530 Subject: [PATCH 08/13] release: bump to 1.0.0rc1 (V1 release candidate) All V1 strengthening work packages (WP1-WP9) are implemented, tested, and documented; the 5-lens adversarial review's BLOCK findings are all resolved with regression tests; 733 tests pass (incl. slow); lint clean. Bump the three version surfaces (pyproject / __init__ / uv.lock) to the V1 release candidate. No tag, push, or publish. rc1 (not final 1.0.0) is honest: the candidate invites real-repo benchmark validation and the explicitly-deferred post-V1 items (declarative-structural checkers, route/SQL binding indices, YAML config binding, audit-event atomicity) documented in docs/V1_SCOPE.md. Co-Authored-By: Claude Opus 4.8 (1M context) --- pyproject.toml | 2 +- src/dorian/__init__.py | 2 +- uv.lock | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index 6806410..a762cc4 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "dorian-vwp" -version = "0.11.0" +version = "1.0.0rc1" description = "Hold AI agents to what they said they did: deterministic, token-free verification of claims about a change." 
readme = "README.md" requires-python = ">=3.11" diff --git a/src/dorian/__init__.py b/src/dorian/__init__.py index 4dcd06e..df260c7 100644 --- a/src/dorian/__init__.py +++ b/src/dorian/__init__.py @@ -3,4 +3,4 @@ PyPI distribution: `dorian-vwp`; import package: `dorian`; CLI: `dorian`. """ -__version__ = "0.11.0" +__version__ = "1.0.0rc1" diff --git a/uv.lock b/uv.lock index 17f1af9..5803a73 100644 --- a/uv.lock +++ b/uv.lock @@ -184,7 +184,7 @@ wheels = [ [[package]] name = "dorian-vwp" -version = "0.11.0" +version = "1.0.0rc1" source = { editable = "." } [package.optional-dependencies] From 47106042305a64458ec1cd3d02bddf43765fab29 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:40:25 +0530 Subject: [PATCH 09/13] docs: re-stamp BENCHMARK_CURRENT at 1.0.0rc1 (numbers re-confirmed post-fix) Re-ran large-mutation / binding-lifecycle / realworld at commit b7376e7 (1.0.0rc1), after the adversarial-review fixes: figures identical (large-mutation P=R=0.93, 11.6x/10.4x; binding-lifecycle 808 pairs 0.54->1.00 selection, 1.00 alarm; realworld 2/1/2), confirming the fixes don't touch the benchmarked paths. Version/commit stamps updated; the version-stamp evidence test now reads the live pyproject version. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/BENCHMARK_CURRENT.md | 9 +++++++-- tests/test_benchmark_evidence.py | 5 ++++- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/BENCHMARK_CURRENT.md b/docs/BENCHMARK_CURRENT.md index daa5f5d..48cf898 100644 --- a/docs/BENCHMARK_CURRENT.md +++ b/docs/BENCHMARK_CURRENT.md @@ -10,12 +10,17 @@ and are kept as-is for provenance. 
| field | value | | --- | --- | -| dorian version | `0.11.0` (V1 candidate) | -| measured commit | `2a66a49eee7b8aa069d7fb9222572b272493856d` | +| dorian version | `1.0.0rc1` (V1 release candidate) | +| measured commit | `b7376e7762571e7c802c220aa50c241d2dae7e39` | | Python | 3.12.4 | | platform | darwin (CI matrix: 3.11 / 3.12 / 3.13) | | reproduce | `dorian bench large-mutation` · `dorian bench binding-lifecycle` · `dorian bench realworld-usecases` | +These numbers were re-run at the `1.0.0rc1` commit *after* the adversarial-review fixes +landed, confirming those fixes (py-const type check, `code:` docstring handling, config-key +stopwords) did not move the benchmark figures — expected, since the suites exercise C1/C3 +(symbol/regex/string/path)/C5, not the new structural/config-binding paths. + ## Results ### Large controlled-mutation (240 pairs, 6 synthetic domains) diff --git a/tests/test_benchmark_evidence.py b/tests/test_benchmark_evidence.py index af4eee6..640e955 100644 --- a/tests/test_benchmark_evidence.py +++ b/tests/test_benchmark_evidence.py @@ -34,8 +34,11 @@ def test_historical_benchmark_docs_are_labeled_historical() -> None: def test_current_benchmark_doc_is_version_and_commit_stamped() -> None: + import tomllib + doc = _read("docs/BENCHMARK_CURRENT.md") - assert "0.11.0" in doc # dorian version stamp + version = tomllib.loads(_read("pyproject.toml"))["project"]["version"] + assert version in doc, f"current benchmark doc must stamp the live version {version!r}" assert "measured commit" in doc.lower() assert "Python" in doc # environment summary assert "reproduce" in doc.lower() From 334910ceaa81f3779523902fa1d2f679c7e9cdd4 Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 18:50:21 +0530 Subject: [PATCH 10/13] docs: V1 alignment report + tracker (WP10 release decision) Final evidence-backed report: version gate, per-WP status, commands+results, verification evidence, trigger-vs-truth preservation, security posture, benchmark posture, 
remaining risks/non-goals, and the release decision (1.0.0rc1 candidate; no tag/push/publish). Co-Authored-By: Claude Opus 4.8 (1M context) --- V1_ALIGNMENT_REPORT.md | 141 +++++++++++++++++++++++++++++++++++ V1_IMPLEMENTATION_TRACKER.md | 10 ++- 2 files changed, 149 insertions(+), 2 deletions(-) create mode 100644 V1_ALIGNMENT_REPORT.md diff --git a/V1_ALIGNMENT_REPORT.md b/V1_ALIGNMENT_REPORT.md new file mode 100644 index 0000000..fa4da7b --- /dev/null +++ b/V1_ALIGNMENT_REPORT.md @@ -0,0 +1,141 @@ +# V1 alignment report + +Final report for the v0.11.0 → V1 strengthening program driven by +`RESEARCH_REPORT_DORIAN_0_11_0.md`. Every completion claim below is backed by a file +path and a command/test result. Behavior was verified against the current code; where +the report and code disagreed, code won (recorded in `V1_IMPLEMENTATION_TRACKER.md`). + +## 1. Version gate result + +| surface | start | final | +|---|---|---| +| `pyproject.toml` `[project].version` | `0.11.0` | `1.0.0rc1` | +| `src/dorian/__init__.py` `__version__` | `0.11.0` | `1.0.0rc1` | +| `dorian --version` | `dorian 0.11.0` | `dorian 1.0.0rc1` | +| branch | `main` @ `78dcd1a` | `dorian-v1-strengthening` @ `4710604` | + +Version gate **PASSED** at start (both surfaces `0.11.0`). No tag, push, publish, or +remote change performed. + +## 2. Executive result + +**V1 release candidate ready** (`1.0.0rc1`). All ten work packages are implemented (or +explicitly deferred with reasons), tested, and documented; a 5-lens adversarial review +returned BLOCK and all six must-fix findings are resolved with regression tests; the full +733-test suite and lint pass at the release commit. + +## 3. 
Work completed + +| WP | Status | Files | Tests | Caveat | +|---|---|---|---|---| +| WP1 docs/evidence hygiene | complete | README, docs/V1_SCOPE.md, BENCHMARK_v0.7.0/BINDING_LIFECYCLE banners, BENCHMARK_CURRENT.md | test_benchmark_evidence (5) | trust-state legend + historical labels | +| WP2 checker-strength / claim-risk | complete | src/dorian/strength.py, commands.py (bindings + binding-gate) | test_strength (20) | advisory only; never changes verdict/exit | +| WP3 Python structural checkers | complete | src/dorian/pyast.py, checkers/c3_ref.py, seal.py, spec/checkers.md | test_pystructural (29) | gutted-body is the documented ceiling | +| WP4 semantic-context `code:` | complete | pyast.code_only_python, c3_ref.py | test_semantic_context (14) | Python-only (documented) | +| WP5 multi-index binding (config-key) | complete | symbol_index.py (config_key_index, claim_watch_paths), commands.py | test_config_binding (12) | TOML/JSON only; YAML excluded (zero-dep) | +| WP6 C4 test-adequacy lint | complete | strength.c4_adequacy | (in test_strength) | advisory; conservative on helpers | +| WP7 trusted-base checker-source | complete | revalidate.py, cli.py, commands.py, action/action.yml | test_trusted_base (10) | trust root, NOT a sandbox | +| WP8 warrant-quality harness | complete | bench/warrant_quality.py, commands.py | test_warrant_quality (7) | structural/existence forms scored; others strength-only | +| WP9 current-version benchmarks | complete | docs/BENCHMARK_CURRENT.md | docs wording tests | synthetic-suite reproducibility only | +| WP10 release prep | complete | pyproject/__init__/uv.lock → 1.0.0rc1 | test_version_sync (3) | rc, not final 1.0.0 | + +**Deferred (classified in `docs/V1_SCOPE.md`, not V1 blockers):** declarative-structural +checkers (config/OpenAPI/SQL value/type — the report's C7-style family), route/SQL binding +indices, YAML config binding (needs a runtime dep), the real-repo public micro-benchmark +(protocol exists; results post-V1), and 
audit-event/state single-transaction atomicity +(pre-existing, documented in `fold.py`). + +## 4. Commands run (final state, commit `4710604`) + +| command | result | +|---|---| +| `uv run dorian --version` | `dorian 1.0.0rc1` | +| `uv run ruff check src tests bench` | `All checks passed!` | +| `uv run ruff format --check src tests bench` | `108 files already formatted` | +| `uv run pytest -m "not slow"` | exit 0 — **658 passed** | +| `uv run pytest -m slow` | exit 0 — slow suite passed (wheel build, real pytest subprocess, regex-timeout) | +| `uv run pytest` (full, incl slow) | exit 0 — **733 collected** (baseline 636 → +97) | +| `dorian bench large-mutation` | 240 pairs, P=R=0.93, 11.6×/10.4× FP reduction | +| `dorian bench binding-lifecycle` | 808 pairs, selection recall 0.54→1.00, alarm precision/recall 1.00, 0 errored | +| `dorian bench realworld-usecases` | 5 cases: 2 solved / 1 partial / 2 not_solved | +| `mcp gitnexus detect_changes` (pre-commit) | changed symbols == intended; no surprise blast radius | + +## 5. Verification evidence + +- **Test suite:** 733 tests pass at `4710604` (lint + non-slow + slow all exit 0). +97 over + the 636-test `78dcd1a` baseline, across 6 new test files + (test_pystructural, test_semantic_context, test_strength, test_trusted_base, + test_config_binding, test_warrant_quality, test_benchmark_evidence). +- **CLI smoke:** `dorian bindings ` shows strength/risk (JSON + human golden tests); + `dorian bench warrant-quality --json` emits `dorian-warrant-quality-v1`; + `dorian revalidate --checker-source base` and env `DORIAN_CHECKER_SOURCE` both exercised. +- **Security fixtures:** `tests/test_trusted_base.py` (10) proves each "executed?" case with a + sentinel `touch` that must NOT appear under base mode — PR-added and PR-modified executable + checkers never run; missing/tampered base sidecar fails closed (ERRORED); deny-exec composes. 
+- **Benchmarks:** re-run at `1.0.0rc1`; figures identical to the historical runs (large-mutation + vs v0.7.0; binding-lifecycle same content-derived run_id as 0.9.0) — additive, no regression. +- **Docs wording:** historical docs carry version stamps/banners; `BENCHMARK_CURRENT.md` is + version+commit stamped with a what-it-does-NOT-prove block; guard tests pin all of it. + +## 6. Trigger-vs-truth preservation + +The distinction is preserved and made **more visible**, never blurred: + +- **Binding (trigger) stays trigger-only.** Config-key binding (WP5) and symbol binding only + widen the re-check set; `docs/VALIDATION_HONESTY.md`, `docs/V1_SCOPE.md`, and the + binding-lifecycle benchmark all state a watched-file change never makes a claim BROKEN by itself. +- **New truth-axis surfacing.** WP2 checker-strength classifies each checker's falsifying power + and flags kind-vs-strength **adequacy mismatches** (a `behavior` claim backed only by an + existence/text checker; a vacuous pytest node). WP8 warrant-quality scores per-claim + caught/missed/brittle/**ceiling** offline. +- **The ceiling is pinned, not hidden.** `py-signature:`/`symbol:` on a gutted-body change PASS + (a `test_..._gutted_body_still_passes_documented_ceiling` test asserts it); only a C4 test + catches a body change. ERROR is never BROKEN — a new end-to-end test drives a new-form ERROR + and asserts it lands in `errored` (exit 5), never `broken`. + +## 7. Security posture + +- **Trusted/internal (`head`, default):** unchanged from v0.11.0 — executes the checked-out + checker specs. Correct where everyone who can open a PR is trusted to run code in CI. +- **Public/fork (`checker_trust: base`):** **implemented and tested** (WP7). Resolves each + claim's checker spec from the base ref, so PR-added/modified executable checkers never run and + a rewritten checker cannot self-attest a verdict; fails closed on a missing/tampered base + sidecar. 
The `SECURITY.md` / `action/README.md` contradictions (which still said it was + unimplemented) were fixed and a guard test prevents recurrence. +- **Remaining non-sandbox caveat (stated everywhere):** a base-approved `pytest:` checker can + still execute PR-head code. base mode is a **checker-source trust root, not a sandbox** — for + fully untrusted forks combine `checker_trust: base` with `deny_exec: true` (or external + isolation). `--deny-exec`/`--deny-shell` remain fail-closed, not sandboxes. + +## 8. Benchmark / evidence posture + +- **Current results:** `docs/BENCHMARK_CURRENT.md` — version+commit stamped (1.0.0rc1 / `b7376e7`), + reproduction commands, environment, and an explicit non-overclaim block. +- **Historical docs labeled:** `BENCHMARK_v0.7.0.md` (version-stamped title; it is byte-matched to + its generator so it cannot carry a hand banner) and `BENCHMARK_BINDING_LIFECYCLE.md` (0.9.0, + HISTORICAL banner). Both preserved verbatim and cross-referenced from the current doc. +- **What the benchmarks support:** reproducibility on the named synthetic suites at the stamped + version, fewer false re-checks than file watchers, near-complete binding trigger recall with + zero false BROKEN — and that V1's additions did not regress any of it. **Not supported:** + "works on real repos", "validated", or that binding proves behavior (the gutted-body ceiling). + +## 9. Remaining risks and non-goals (after implementation) + +- **No real-repo validation yet** — evidence is synthetic-suite reproducibility plus offline + public-case reproductions; the public frozen-SHA micro-benchmark is protocol-only (post-V1). +- **`code:`/structural forms are Python-only**; other languages keep the raw-text survival class. +- **Config binding is TOML/JSON only** (YAML needs a runtime dep); unparseable supported config + files are surfaced, not silently skipped. 
+- **Audit-event/state atomicity** — change + event still commit separately (`fold.py`); a crash + between them can drop the event. Pre-existing, documented. +- **`--extract` stays draft/experimental** — not promoted in V1. + +## 10. Release decision + +**V1 release candidate prepared.** All quality gates passed, so version surfaces were synced to +`1.0.0rc1` (pyproject / `__init__` / uv.lock; `dorian --version` agrees; `test_version_sync` +green). It is a **release candidate, not final 1.0.0** — honest given the deferred post-V1 items +above and the absence of real-repo validation. **No tag, push, publish, or remote/secret change +was performed**, per the operating rules; the work lives on branch `dorian-v1-strengthening` +(9 commits off `main`). Suggested next steps (owner's call): open a PR to `main`, then run the +real-repo public micro-benchmark before promoting `1.0.0rc1 → 1.0.0`. diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md index 24bcfb7..71dff13 100644 --- a/V1_IMPLEMENTATION_TRACKER.md +++ b/V1_IMPLEMENTATION_TRACKER.md @@ -108,6 +108,12 @@ Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-documen | WP7 | trusted-base checker-source mode | DONE (revalidate --checker-source base + Action checker_trust; 10-case exploit matrix) | | WP8 | warrant-quality mutation harness | DONE (bench/warrant_quality.py; `dorian bench warrant-quality`; deterministic, offline, never mutates real repo; trigger vs verdict; ERROR bucket distinct; honest scope = structural/existence forms scored, others reported strength-only; 7 tests) | | WP9 | current-version benchmark results | TODO | -| WP10 | V1 release prep / decision | TODO | +| WP10 | V1 release prep / decision | DONE — version surfaces synced to `1.0.0rc1` (pyproject/__init__/uv.lock); no tag/push/publish. All gates pass; adversarial-review BLOCK resolved. | -Commits so far: `58b39e2` (WP3/4/2/6), trusted-base (WP7) next. 
+Branch `dorian-v1-strengthening`, 9 commits off `main`: +`58b39e2` WP3/4/2/6 · `6a8298c` WP7 · `04ab60b` WP5 · `2a66a49` WP8 · `4e586a7` WP9/WP1 · +`2a4befa` byte-match fix · `a6595ba` adversarial-review BLOCK fixes · `b7376e7` bump 1.0.0rc1 · +`4710604` benchmark re-stamp. + +Adversarial review (5 lenses, BLOCK): 6 must-fixes + 2 hygiene items all resolved with +regression tests. Final gate: ruff clean, 658 non-slow pass, 733 total (incl slow) green. From 33e9eaf4ae929d2736f14c682e0a55cb04c1a37d Mon Sep 17 00:00:00 2001 From: Ajay Surya Date: Mon, 15 Jun 2026 20:10:24 +0530 Subject: [PATCH 11/13] audit: reconcile V1 release evidence + fix release-blocking doc drift MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Independent release audit (FIXED_NEEDED) findings, all repaired: Blockers: - docs/START_HERE.md still called trusted-base "(not yet implemented)" — a user-facing CI entry-point doc the prior fix missed; now describes it as implemented (V1). - docs/BENCHMARK_BINDING_LIFECYCLE.md banner said the current rerun was "0.11.0" while the branch is 1.0.0rc1 (and BENCHMARK_CURRENT says 1.0.0rc1) — corrected the version. - internal program docs (V1_IMPLEMENTATION_TRACKER.md, V1_ALIGNMENT_REPORT.md) were tracked; gitignored + git rm --cached (kept on disk as provenance). Also gitignore the research report, audit gate, release notes, and tool dirs (.claude/, .gitnexus/). docs/V1_SCOPE.md stays tracked (it is a public doc). Should-fixes: - docs/ROADMAP_BACKLOG.md trusted-base item flipped DEFER/HUMAN-REVIEW -> SHIPPED (V1). - c3_ref.py module docstring now documents the code: form (was omitted) and the py-const value-AND-type rule. - action.yml / action/README.md drop the stale 'dorian-vwp==0.6.*' pin example (no PyPI release yet) for the git source spec. - docs/BENCHMARK_CURRENT.md labels the metric commit vs the (docs-only) release commit. 
Hardened tests: test_no_live_doc_calls_trusted_base_unimplemented scans ALL live docs (not just SECURITY.md/action README); warrant-quality path-escape test pins the containment guard; benchmark-evidence commit-stamp check is version-agnostic. 660 non-slow tests pass; lint clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- .gitignore | 11 ++ V1_ALIGNMENT_REPORT.md | 141 ------------------------- V1_IMPLEMENTATION_TRACKER.md | 119 --------------------- action/README.md | 2 +- action/action.yml | 5 +- docs/BENCHMARK_BINDING_LIFECYCLE.md | 2 +- docs/BENCHMARK_CURRENT.md | 11 +- docs/ROADMAP_BACKLOG.md | 10 +- docs/START_HERE.md | 5 +- src/dorian/checkers/c3_ref.py | 18 ++-- tests/test_action_security_defaults.py | 27 +++++ tests/test_benchmark_evidence.py | 4 +- tests/test_warrant_quality.py | 25 +++++ 13 files changed, 97 insertions(+), 283 deletions(-) delete mode 100644 V1_ALIGNMENT_REPORT.md delete mode 100644 V1_IMPLEMENTATION_TRACKER.md diff --git a/.gitignore b/.gitignore index eec3828..02c1d4f 100644 --- a/.gitignore +++ b/.gitignore @@ -15,3 +15,14 @@ bench/real/ .DS_Store /assets/ .env + +# tool working dirs (not release content) +.claude/ +.gitnexus/ + +# internal program/audit working docs — provenance only, never shipped in the release +/RESEARCH_REPORT_DORIAN_0_11_0.md +/V1_IMPLEMENTATION_TRACKER.md +/V1_ALIGNMENT_REPORT.md +/AUDIT_RELEASE_GATE.md +/GITHUB_RELEASE_NOTES.md diff --git a/V1_ALIGNMENT_REPORT.md b/V1_ALIGNMENT_REPORT.md deleted file mode 100644 index fa4da7b..0000000 --- a/V1_ALIGNMENT_REPORT.md +++ /dev/null @@ -1,141 +0,0 @@ -# V1 alignment report - -Final report for the v0.11.0 → V1 strengthening program driven by -`RESEARCH_REPORT_DORIAN_0_11_0.md`. Every completion claim below is backed by a file -path and a command/test result. Behavior was verified against the current code; where -the report and code disagreed, code won (recorded in `V1_IMPLEMENTATION_TRACKER.md`). - -## 1. 
Version gate result - -| surface | start | final | -|---|---|---| -| `pyproject.toml` `[project].version` | `0.11.0` | `1.0.0rc1` | -| `src/dorian/__init__.py` `__version__` | `0.11.0` | `1.0.0rc1` | -| `dorian --version` | `dorian 0.11.0` | `dorian 1.0.0rc1` | -| branch | `main` @ `78dcd1a` | `dorian-v1-strengthening` @ `4710604` | - -Version gate **PASSED** at start (both surfaces `0.11.0`). No tag, push, publish, or -remote change performed. - -## 2. Executive result - -**V1 release candidate ready** (`1.0.0rc1`). All ten work packages are implemented (or -explicitly deferred with reasons), tested, and documented; a 5-lens adversarial review -returned BLOCK and all six must-fix findings are resolved with regression tests; the full -733-test suite and lint pass at the release commit. - -## 3. Work completed - -| WP | Status | Files | Tests | Caveat | -|---|---|---|---|---| -| WP1 docs/evidence hygiene | complete | README, docs/V1_SCOPE.md, BENCHMARK_v0.7.0/BINDING_LIFECYCLE banners, BENCHMARK_CURRENT.md | test_benchmark_evidence (5) | trust-state legend + historical labels | -| WP2 checker-strength / claim-risk | complete | src/dorian/strength.py, commands.py (bindings + binding-gate) | test_strength (20) | advisory only; never changes verdict/exit | -| WP3 Python structural checkers | complete | src/dorian/pyast.py, checkers/c3_ref.py, seal.py, spec/checkers.md | test_pystructural (29) | gutted-body is the documented ceiling | -| WP4 semantic-context `code:` | complete | pyast.code_only_python, c3_ref.py | test_semantic_context (14) | Python-only (documented) | -| WP5 multi-index binding (config-key) | complete | symbol_index.py (config_key_index, claim_watch_paths), commands.py | test_config_binding (12) | TOML/JSON only; YAML excluded (zero-dep) | -| WP6 C4 test-adequacy lint | complete | strength.c4_adequacy | (in test_strength) | advisory; conservative on helpers | -| WP7 trusted-base checker-source | complete | revalidate.py, cli.py, commands.py, 
action/action.yml | test_trusted_base (10) | trust root, NOT a sandbox | -| WP8 warrant-quality harness | complete | bench/warrant_quality.py, commands.py | test_warrant_quality (7) | structural/existence forms scored; others strength-only | -| WP9 current-version benchmarks | complete | docs/BENCHMARK_CURRENT.md | docs wording tests | synthetic-suite reproducibility only | -| WP10 release prep | complete | pyproject/__init__/uv.lock → 1.0.0rc1 | test_version_sync (3) | rc, not final 1.0.0 | - -**Deferred (classified in `docs/V1_SCOPE.md`, not V1 blockers):** declarative-structural -checkers (config/OpenAPI/SQL value/type — the report's C7-style family), route/SQL binding -indices, YAML config binding (needs a runtime dep), the real-repo public micro-benchmark -(protocol exists; results post-V1), and audit-event/state single-transaction atomicity -(pre-existing, documented in `fold.py`). - -## 4. Commands run (final state, commit `4710604`) - -| command | result | -|---|---| -| `uv run dorian --version` | `dorian 1.0.0rc1` | -| `uv run ruff check src tests bench` | `All checks passed!` | -| `uv run ruff format --check src tests bench` | `108 files already formatted` | -| `uv run pytest -m "not slow"` | exit 0 — **658 passed** | -| `uv run pytest -m slow` | exit 0 — slow suite passed (wheel build, real pytest subprocess, regex-timeout) | -| `uv run pytest` (full, incl slow) | exit 0 — **733 collected** (baseline 636 → +97) | -| `dorian bench large-mutation` | 240 pairs, P=R=0.93, 11.6×/10.4× FP reduction | -| `dorian bench binding-lifecycle` | 808 pairs, selection recall 0.54→1.00, alarm precision/recall 1.00, 0 errored | -| `dorian bench realworld-usecases` | 5 cases: 2 solved / 1 partial / 2 not_solved | -| `mcp gitnexus detect_changes` (pre-commit) | changed symbols == intended; no surprise blast radius | - -## 5. Verification evidence - -- **Test suite:** 733 tests pass at `4710604` (lint + non-slow + slow all exit 0). 
+97 over - the 636-test `78dcd1a` baseline, across 6 new test files - (test_pystructural, test_semantic_context, test_strength, test_trusted_base, - test_config_binding, test_warrant_quality, test_benchmark_evidence). -- **CLI smoke:** `dorian bindings ` shows strength/risk (JSON + human golden tests); - `dorian bench warrant-quality --json` emits `dorian-warrant-quality-v1`; - `dorian revalidate --checker-source base` and env `DORIAN_CHECKER_SOURCE` both exercised. -- **Security fixtures:** `tests/test_trusted_base.py` (10) proves each "executed?" case with a - sentinel `touch` that must NOT appear under base mode — PR-added and PR-modified executable - checkers never run; missing/tampered base sidecar fails closed (ERRORED); deny-exec composes. -- **Benchmarks:** re-run at `1.0.0rc1`; figures identical to the historical runs (large-mutation - vs v0.7.0; binding-lifecycle same content-derived run_id as 0.9.0) — additive, no regression. -- **Docs wording:** historical docs carry version stamps/banners; `BENCHMARK_CURRENT.md` is - version+commit stamped with a what-it-does-NOT-prove block; guard tests pin all of it. - -## 6. Trigger-vs-truth preservation - -The distinction is preserved and made **more visible**, never blurred: - -- **Binding (trigger) stays trigger-only.** Config-key binding (WP5) and symbol binding only - widen the re-check set; `docs/VALIDATION_HONESTY.md`, `docs/V1_SCOPE.md`, and the - binding-lifecycle benchmark all state a watched-file change never makes a claim BROKEN by itself. -- **New truth-axis surfacing.** WP2 checker-strength classifies each checker's falsifying power - and flags kind-vs-strength **adequacy mismatches** (a `behavior` claim backed only by an - existence/text checker; a vacuous pytest node). WP8 warrant-quality scores per-claim - caught/missed/brittle/**ceiling** offline. 
-- **The ceiling is pinned, not hidden.** `py-signature:`/`symbol:` on a gutted-body change PASS - (a `test_..._gutted_body_still_passes_documented_ceiling` test asserts it); only a C4 test - catches a body change. ERROR is never BROKEN — a new end-to-end test drives a new-form ERROR - and asserts it lands in `errored` (exit 5), never `broken`. - -## 7. Security posture - -- **Trusted/internal (`head`, default):** unchanged from v0.11.0 — executes the checked-out - checker specs. Correct where everyone who can open a PR is trusted to run code in CI. -- **Public/fork (`checker_trust: base`):** **implemented and tested** (WP7). Resolves each - claim's checker spec from the base ref, so PR-added/modified executable checkers never run and - a rewritten checker cannot self-attest a verdict; fails closed on a missing/tampered base - sidecar. The `SECURITY.md` / `action/README.md` contradictions (which still said it was - unimplemented) were fixed and a guard test prevents recurrence. -- **Remaining non-sandbox caveat (stated everywhere):** a base-approved `pytest:` checker can - still execute PR-head code. base mode is a **checker-source trust root, not a sandbox** — for - fully untrusted forks combine `checker_trust: base` with `deny_exec: true` (or external - isolation). `--deny-exec`/`--deny-shell` remain fail-closed, not sandboxes. - -## 8. Benchmark / evidence posture - -- **Current results:** `docs/BENCHMARK_CURRENT.md` — version+commit stamped (1.0.0rc1 / `b7376e7`), - reproduction commands, environment, and an explicit non-overclaim block. -- **Historical docs labeled:** `BENCHMARK_v0.7.0.md` (version-stamped title; it is byte-matched to - its generator so it cannot carry a hand banner) and `BENCHMARK_BINDING_LIFECYCLE.md` (0.9.0, - HISTORICAL banner). Both preserved verbatim and cross-referenced from the current doc. 
-- **What the benchmarks support:** reproducibility on the named synthetic suites at the stamped - version, fewer false re-checks than file watchers, near-complete binding trigger recall with - zero false BROKEN — and that V1's additions did not regress any of it. **Not supported:** - "works on real repos", "validated", or that binding proves behavior (the gutted-body ceiling). - -## 9. Remaining risks and non-goals (after implementation) - -- **No real-repo validation yet** — evidence is synthetic-suite reproducibility plus offline - public-case reproductions; the public frozen-SHA micro-benchmark is protocol-only (post-V1). -- **`code:`/structural forms are Python-only**; other languages keep the raw-text survival class. -- **Config binding is TOML/JSON only** (YAML needs a runtime dep); unparseable supported config - files are surfaced, not silently skipped. -- **Audit-event/state atomicity** — change + event still commit separately (`fold.py`); a crash - between them can drop the event. Pre-existing, documented. -- **`--extract` stays draft/experimental** — not promoted in V1. - -## 10. Release decision - -**V1 release candidate prepared.** All quality gates passed, so version surfaces were synced to -`1.0.0rc1` (pyproject / `__init__` / uv.lock; `dorian --version` agrees; `test_version_sync` -green). It is a **release candidate, not final 1.0.0** — honest given the deferred post-V1 items -above and the absence of real-repo validation. **No tag, push, publish, or remote/secret change -was performed**, per the operating rules; the work lives on branch `dorian-v1-strengthening` -(9 commits off `main`). Suggested next steps (owner's call): open a PR to `main`, then run the -real-repo public micro-benchmark before promoting `1.0.0rc1 → 1.0.0`. 
diff --git a/V1_IMPLEMENTATION_TRACKER.md b/V1_IMPLEMENTATION_TRACKER.md deleted file mode 100644 index 71dff13..0000000 --- a/V1_IMPLEMENTATION_TRACKER.md +++ /dev/null @@ -1,119 +0,0 @@ -# V1 implementation tracker - -Working tracker for the v0.11.0 → V1 strengthening program driven by -`RESEARCH_REPORT_DORIAN_0_11_0.md`. Behavior is verified against the **current -code**, not the report; where they disagree, code wins and the disagreement is -recorded here. - -## Phase 0 — version gate + scope evidence - -**Version gate: PASSED.** - -| Surface | Observed | -|---|---| -| `pyproject.toml` `[project].version` | `0.11.0` | -| `src/dorian/__init__.py` `__version__` | `0.11.0` | -| branch | `main` | -| commit SHA (start) | `78dcd1a6a242110e55dc31fd1db2e811de3e3898` | -| working tree | clean except untracked `.claude/`, `AGENTS.md`, `CLAUDE.md`, `RESEARCH_REPORT_DORIAN_0_11_0.md` | -| Python | 3.12.4 | -| toolchain | `uv` 0.5.9; `uv run pytest`; ruff for lint/format | -| baseline tests | `uv run pytest -m "not slow"` → **561 passed, exit 0**; 636 total incl. slow | - -## Phase 1 — baseline reconstruction (from current code) - -### Module map -- `model.py` — `Warrant`/`Claim`/`CheckerSpec`/`ReadSetEntry`, content-addressed id, canonical JSON. `CheckerType = C1|C3|C4|C5` (a *Literal* hint; registry dispatch is on the string `type`). -- `checkers/base.py` — `run_checker` is the single dispatch + the single execution-policy gate (blocked → `Verdict.ERROR`). -- `checkers/c1_span.py` — span anchor, relocation-tolerant, optional c2lite. -- `checkers/c3_ref.py` — `path:` / `symbol:` / `string:` / `regex:`; regex match in a spawn-killed worker (ReDoS backstop). -- `checkers/c4_test.py` — `pytest:`, careful exit-code mapping; ERROR≠FAIL. -- `checkers/c5_data.py` — typed data forms + opaque `shell:`. -- `policy.py` — `ExecutionPolicy`, `executable_kind` (single source of "what executes": C4=pytest, C5 shell=shell). 
-- `seal.py` — born-verifiable seal; scope lint; watch derivation; additive symbol-definer widening; duplicate-id reject; atomic write; idempotent re-seal. -- `revalidate.py` — changed-path discovery, rename persistence, cheapest-first checks (C1 REVOKED` is NOT drift.** Report (medium-confidence) called it stale. Verified: `fold.fold()` only emits TRUSTED/DEGRADED/REVOKED/UNKNOWN; the *born* trust state is `WARRANTED` (set at seal); the first fold therefore renders `WARRANTED -> `. `tests/test_render_md.py:168-169` pins `WARRANTED -> REVOKED` and `WARRANTED -> UNKNOWN` as correct md output. Action: **do not "fix"; add a short trust-state vocabulary note to remove reader confusion.** -- **C4 adequacy blind spot** — report marks INFERENCE; confirmed: `c4_test.py` maps pytest exit codes only, no assertion/relevance inspection. Valid advisory target (WP6). -- **PyPI install wording** — report marks UNVERIFIED. Per project state, dorian is NOT on PyPI; README "until the first PyPI release … install from source" is accurate. Keep. - -## Report coverage matrix (every material finding classified) - -Categories: IMPL=must-implement · TEST=must-test regression · DOC=must-document · BENCH=must-benchmark · BOUNDARY=honest non-goal · DONE=already in v0.11.0 · DEFER=post-V1/blocked. 
- -| # | Report finding / recommendation | Category | Current evidence | Planned action | Acceptance/verification | Status | -|---|---|---|---|---|---|---| -| 1 | README trust-state vocab (WARRANTED vs TRUSTED/…) | DOC | code correct; README lacks a glossary | add trust-state legend; keep examples | docs test + render_md tests stay green | TODO | -| 2 | ERROR must never collapse into BROKEN | DONE+TEST | base/fold/revalidate all enforce | keep; add a guard test if any new path | existing + new ERROR≠BROKEN tests | TODO | -| 3 | C1 span + c2lite regression | DONE | test_c1.py | none (keep green) | test_c1 passes | DONE | -| 4 | C3 regex ReDoS timeout regression | DONE | test_c3_regex_timeout.py (slow) | none | passes | DONE | -| 5 | C3 symbol existence ceiling / gutted-body | IMPL+DOC | symbol: existence-only | add `py-signature:` structural checker (WP3) | gutted-body PASS under symbol, FAIL under signature when sig changes; body-only stays PASS (documented ceiling) | TODO | -| 6 | C3 string/regex comment/docstring survival | IMPL+DOC | raw text search | add semantic code-context search mode (WP4) | literal only in comment/docstring → FAIL in code mode | TODO | -| 7 | C4 pytest vacuous/zero-assertion adequacy | IMPL | none | advisory adequacy lint (WP6) | zero-assertion / assert-True node warns; normal test does not | TODO | -| 8 | C5 typed grammar limits / snapshot brittleness | BOUNDARY+DOC | documented | document in V1-meaning; optional structural data checker DEFER | doc states grammar bounds | TODO | -| 9 | duplicate claim-id rejection | DONE | seal.py step 0 | keep | test_seal covers | DONE | -| 10 | scope-lint named-read-set-only limitation | DONE+DOC | SECURITY_BOUNDARY | keep wording | docs test | DONE | -| 11 | deny-exec/deny-shell fail-closed, not sandbox | DONE | policy.py, docs | keep | test_deny_exec_policy | DONE | -| 12 | sidecar source-of-truth vs SQLite derived | DONE | seal/revalidate/sync | keep | test_store/sync | DONE | -| 13 | canonical JSON 
/ content-addressed identity | DONE | model.compute_id + Warrant.load integrity | keep | test_model/determinism | DONE |
-| 14 | atomic no-write on failed seal | DONE | seal os.replace + refusal order | keep | test_seal/deny_exec | DONE |
-| 15 | changed-path discovery + persisted rename | DONE | revalidate + store rename_log | keep | test_revalidate | DONE |
-| 16 | checker ordering + FAIL vs ERROR discipline | DONE | revalidate _check_claim | keep | existing | DONE |
-| 17 | fold + blast/recall lineage | DONE | fold.py, blast.py | keep | test_fold/test_blast | DONE |
-| 18 | audit/state separate-transaction limitation | BOUNDARY | fold.py docstring documents it | document in V1-meaning as known limitation | doc names it | TODO |
-| 19 | binding ambiguity handling | DONE | symbol_index ambiguous_symbol_mentions + flag | keep; extend provenance (WP5) | test_symbol_index | DONE |
-| 20 | oversized/unparseable file diagnostics | IMPL | silently skipped today | surface multi-index unparse diagnostics (WP5) loudly | giant/unparseable supported file → diagnostic not silent | TODO |
-| 21 | pyproject script binding | DONE | pyproject_script_definers | keep | test_symbol_index | DONE |
-| 22 | watch glob over/under-match risk | TEST | _covered glob logic | add a glob over/under test if WP5 touches it | test | TODO |
-| 23 | public/fork self-attested verdict risk | IMPL+DOC | head-mode only | trusted-base checker-source (WP7) | exploit fixtures: PR-added/modified exec checker not run; non-exec rewrite surfaced | TODO |
-| 24 | trusted-base design + non-sandbox caveat | IMPL+DOC | design-only | implement `--checker-source base` + Action input; keep non-sandbox caveat | WP7 test matrix | TODO |
-| 25 | historical benchmark docs (v0.7.0, v0.9.0) | DOC | unlabeled as historical in body | add HISTORICAL banner; README cross-link labels | docs wording test | TODO |
-| 26 | public benchmark protocol w/o results | DOC | protocol only | keep; note in current-results doc | unchanged | TODO |
-| 27 | current-version benchmark rerun | BENCH | none | rerun + version-stamped `BENCHMARK_CURRENT.md` | bench smoke + stamp present | TODO |
-| 28 | extractor remains draft/experimental | DONE | README + AGENT_CLAIMS | keep; do not promote | docs test | DONE |
-| 29 | release/install-status uncertainty | DOC | README source-install accurate | keep; V1 release report states status | report | TODO |
-| 30 | checker-strength / claim-risk visibility | IMPL | bindings flags exist but no strength score | strength + claim-risk diagnostics (WP2) | behavior+symbol → adequacy-mismatch; unbacked load-bearing → high risk | TODO |
-| 31 | multi-index binding (routes/config/etc.) | IMPL | python+script only | config-key index (WP5), provenance-tagged | config-key change selects claim; ambiguous skipped+warned | TODO |
-| 32 | warrant-quality mutation harness | BENCH | repo-level bench only | `dorian bench warrant-quality` (WP8) | deterministic per-claim trigger/truth score on fixture | TODO |
-
-## Work-package status (live)
-
-| WP | Title | Status |
-|---|---|---|
-| WP1 | docs/evidence hygiene | DONE (trust-state legend; historical banners on v0.7.0/0.9.0 benchmark docs; docs/V1_SCOPE.md; README command-surface + new-forms + historical labels; benchmark-evidence wording tests) |
-| WP2 | checker-strength / claim-risk linter | DONE (strength.py; surfaced in `bindings` + binding-gate warn; 19 tests) |
-| WP3 | Python structural checkers (py-signature, py-const) | DONE (pyast.py + C3 subgrammars; 27 tests incl. e2e) |
-| WP4 | semantic-context source search (`code:`) | DONE (pyast.code_only_python + C3 `code:`; 12 tests) |
-| WP5 | multi-index binding (config-key) | DONE (symbol_index.config_key_index + claim_watch_paths; TOML/JSON only, YAML excluded = zero-dep; provenance in bind-suggest; ambiguity + unparseable surfaced; 9 tests) |
-| WP6 | C4 test-adequacy lint | DONE (strength.c4_adequacy; folded into WP2 tests) |
-| WP7 | trusted-base checker-source mode | DONE (revalidate --checker-source base + Action checker_trust; 10-case exploit matrix) |
-| WP8 | warrant-quality mutation harness | DONE (bench/warrant_quality.py; `dorian bench warrant-quality`; deterministic, offline, never mutates real repo; trigger vs verdict; ERROR bucket distinct; honest scope = structural/existence forms scored, others reported strength-only; 7 tests) |
-| WP9 | current-version benchmark results | TODO |
-| WP10 | V1 release prep / decision | DONE — version surfaces synced to `1.0.0rc1` (pyproject/__init__/uv.lock); no tag/push/publish. All gates pass; adversarial-review BLOCK resolved. |
-
-Branch `dorian-v1-strengthening`, 9 commits off `main`:
-`58b39e2` WP3/4/2/6 · `6a8298c` WP7 · `04ab60b` WP5 · `2a66a49` WP8 · `4e586a7` WP9/WP1 ·
-`2a4befa` byte-match fix · `a6595ba` adversarial-review BLOCK fixes · `b7376e7` bump 1.0.0rc1 ·
-`4710604` benchmark re-stamp.
-
-Adversarial review (5 lenses, BLOCK): 6 must-fixes + 2 hygiene items all resolved with
-regression tests. Final gate: ruff clean, 658 non-slow pass, 733 total (incl slow) green.
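The WP3 row above lists `py-const` as a structural checker, and the `c3_ref.py` docstring in this series describes it as comparing a literal assignment "by value AND type". The sketch below illustrates that idea only; it is not the code in `dorian/pyast.py`. The helper name `literal_matches`, the `"PASS"` verdict string, and the treatment of a missing name as `FAIL` are assumptions made for illustration.

```python
import ast


def literal_matches(source: str, name: str, expected_repr: str) -> str:
    """Check a module-level `name = <literal>` assignment (hypothetical helper).

    Returns "PASS", "FAIL", or "ERROR". Only the AST is inspected, so a value
    that appears solely in a comment or docstring can never satisfy the check.
    """
    expected = ast.literal_eval(expected_repr)
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and any(
            isinstance(target, ast.Name) and target.id == name
            for target in node.targets
        ):
            try:
                actual = ast.literal_eval(node.value)
            except (ValueError, TypeError):
                return "ERROR"  # right-hand side is not a literal
            # Shallow type check: 30 vs 30.0 and 1 vs True differ here, but
            # elements nested inside containers are compared by value only.
            same = type(actual) is type(expected) and actual == expected
            return "PASS" if same else "FAIL"
    return "FAIL"  # assumption: a missing name counts as a failed fact


if __name__ == "__main__":
    print(literal_matches("TIMEOUT = 0x1E  # thirty\n", "TIMEOUT", "30"))
    print(literal_matches("TIMEOUT = 30.0\n", "TIMEOUT", "30"))
    print(literal_matches("# TIMEOUT = 30\nTIMEOUT = 45\n", "TIMEOUT", "30"))
```

Comparing parsed values rather than source text is what makes quote style, integer base, and spacing irrelevant, while the explicit `type(...) is type(...)` test keeps Python's `1 == True` and `30 == 30.0` equalities from passing.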
diff --git a/action/README.md b/action/README.md
index 558c19e..51b2254 100644
--- a/action/README.md
+++ b/action/README.md
@@ -126,7 +126,7 @@ Hard rules either way:
 | --------------- | -------------------------------------------- | ------------------------------------------------------------------------ |
 | `fail_on` | `revoked` | when to fail the step: `revoked` (exit 4 only), `degraded` (3 or 4), `never` |
 | `base` | `${{ github.event.pull_request.base.sha }}` | git ref passed to `dorian revalidate --since` |
-| `install` | `dorian-vwp` | pip spec; pin `dorian-vwp==0.6.*`, or `.` for checkout installs |
+| `install` | `dorian-vwp` | pip spec; until the first PyPI release use the git source spec (below), or `.` for checkout installs |
 | `deny_exec` | `false` | refuse to run executable checkers (C4 pytest, C5 shell): they ERROR. For untrusted/fork PRs; fail-closed, not a sandbox |
 | `deny_shell` | `false` | narrower than `deny_exec`: block only C5 shell, still allow C4 pytest |
 | `checker_trust` | `head` | `head` runs the checked-out checker spec (trusted repos); `base` runs the base-ref spec so PR-authored executable checkers never run (public/fork PRs) |
diff --git a/action/action.yml b/action/action.yml
index 77fabdc..67d5697 100644
--- a/action/action.yml
+++ b/action/action.yml
@@ -23,8 +23,9 @@ inputs:
     default: ${{ github.event.pull_request.base.sha }}
   install:
     description: >-
-      pip requirement spec for dorian. Pin a release ('dorian-vwp==0.6.*')
-      or pass '.' to install the checked-out source.
+      pip requirement spec for dorian. Until the first PyPI release, use a git
+      source spec ('dorian-vwp @ git+https://github.com/ajaysurya1221/dorian.git'),
+      pass '.' to install the checked-out source, or pin a tag once published.
     required: false
     default: dorian-vwp
   deny_exec:
diff --git a/docs/BENCHMARK_BINDING_LIFECYCLE.md b/docs/BENCHMARK_BINDING_LIFECYCLE.md
index 6c51b96..1158b83 100644
--- a/docs/BENCHMARK_BINDING_LIFECYCLE.md
+++ b/docs/BENCHMARK_BINDING_LIFECYCLE.md
@@ -2,7 +2,7 @@

 > **HISTORICAL — measured at dorian 0.9.0** (see the run header below; the preserved 808-pair
 > full run). Evidence about the 0.9.0 implementation, not current behavior. The current-version
-> rerun (0.11.0, identical results — see [`BENCHMARK_CURRENT.md`](BENCHMARK_CURRENT.md)) confirms
+> rerun (1.0.0rc1, identical results — see [`BENCHMARK_CURRENT.md`](BENCHMARK_CURRENT.md)) confirms
 > the V1 changes did not regress it. NOTE: `dorian bench binding-lifecycle` REGENERATES this file;
 > restore it from git after a rerun so the historical record survives.

diff --git a/docs/BENCHMARK_CURRENT.md b/docs/BENCHMARK_CURRENT.md
index 48cf898..ef3834a 100644
--- a/docs/BENCHMARK_CURRENT.md
+++ b/docs/BENCHMARK_CURRENT.md
@@ -11,15 +11,18 @@ and are kept as-is for provenance.
 | field | value |
 | --- | --- |
 | dorian version | `1.0.0rc1` (V1 release candidate) |
-| measured commit | `b7376e7762571e7c802c220aa50c241d2dae7e39` |
+| metric commit | `b7376e7` (the benchmark figures were measured here) |
+| release commit | the tagged `v1.0.0rc1` commit is a later **docs/release-hygiene only** commit; `git diff b7376e7.. -- src bench` is empty, so the figures apply unchanged |
 | Python | 3.12.4 |
 | platform | darwin (CI matrix: 3.11 / 3.12 / 3.13) |
 | reproduce | `dorian bench large-mutation` · `dorian bench binding-lifecycle` · `dorian bench realworld-usecases` |

 These numbers were re-run at the `1.0.0rc1` commit *after* the adversarial-review fixes
-landed, confirming those fixes (py-const type check, `code:` docstring handling, config-key
-stopwords) did not move the benchmark figures — expected, since the suites exercise C1/C3
-(symbol/regex/string/path)/C5, not the new structural/config-binding paths.
+landed AND again during the independent release audit, confirming those fixes (py-const type
+check, `code:` docstring handling, config-key stopwords) did not move the benchmark figures —
+expected, since the suites exercise C1/C3 (symbol/regex/string/path)/C5, not the new
+structural/config-binding paths. Commits between the metric commit and the release tag change
+only docs/release hygiene, never checker or benchmark logic.

 ## Results

diff --git a/docs/ROADMAP_BACKLOG.md b/docs/ROADMAP_BACKLOG.md
index 85b27b4..f0b916f 100644
--- a/docs/ROADMAP_BACKLOG.md
+++ b/docs/ROADMAP_BACKLOG.md
@@ -125,13 +125,11 @@ before marketing, deterministic verification before AI automation.*

 - id: trusted-base-action-mode
   title: Trusted-base Action mode for public fork PRs
-  status: DEFER/HUMAN-REVIEW
+  status: SHIPPED (V1, 1.0.0rc1)
   problem: deny-exec removes code execution but not the self-attested-verdict problem; a real public-fork story needs base-ref checker definitions.
-  evidence: docs/TRUSTED_BASE_ACTION_DESIGN.md (design only).
-  proposed_scope: execute only checker specs present on the trusted base ref; parse/lint (never execute) PR-changed sidecars; deny-exec default for forks; fail-closed; tests simulating a fork sidecar trying to execute shell.
-  why_deferred: Action security defaults are a trust-model change; needs maintainer review and dedicated tests before any public-fork-safe claim.
-  human_review_required: yes # Action trust model
-  confidence: medium
+  evidence: implemented — revalidate --checker-source base (src/dorian/revalidate.py), Action checker_trust input (action/action.yml), tests/test_trusted_base.py (10-case exploit matrix); see docs/TRUSTED_BASE_ACTION_DESIGN.md (STATUS: IMPLEMENTED).
+  shipped_scope: executes only checker specs resolved from the trusted base ref; PR-added/modified executable checkers never run; missing/tampered base sidecar fails closed; deny-exec composes. Residual (documented, not a sandbox)- a base-approved pytest checker can still execute PR-head code, so pair with deny-exec for untrusted forks.
+  confidence: high

 - id: binding-beyond-python-symbols
   title: Bind routes / configs / schemas / non-Python indices
diff --git a/docs/START_HERE.md b/docs/START_HERE.md
index 5a5cfd6..fc1667b 100644
--- a/docs/START_HERE.md
+++ b/docs/START_HERE.md
@@ -38,8 +38,9 @@ exists to catch).
 - [`action/README.md`](../action/README.md) — the composite GitHub Action and its **security notes** (checker programs are executable; trusted repos only).
-- [`TRUSTED_BASE_ACTION_DESIGN.md`](TRUSTED_BASE_ACTION_DESIGN.md) — design (not yet implemented) for a
- trusted-base Action mode that executes only base-branch checker specs.
+- [`TRUSTED_BASE_ACTION_DESIGN.md`](TRUSTED_BASE_ACTION_DESIGN.md) — the trusted-base Action mode
+ (`revalidate --checker-source base` / Action `checker_trust: base`), **implemented in V1**: it
+ executes only base-branch checker specs (a trust root, not a sandbox) for public/fork PRs.

 ## I want the why and the roadmap

diff --git a/src/dorian/checkers/c3_ref.py b/src/dorian/checkers/c3_ref.py
index fe06e35..7c6b7ca 100644
--- a/src/dorian/checkers/c3_ref.py
+++ b/src/dorian/checkers/c3_ref.py
@@ -16,12 +16,18 @@ documented ceiling — only a C4 test catches that.
 - py-const::::: structural (Python AST): the named module/class assignment has the stated LITERAL value
- (compared by value, so quote style / int base / spacing
- are tolerated, and a comment/docstring mention cannot
- pass). FAIL on a value drift; ERROR on a non-literal RHS.
-
-The `py-*` structural forms parse the file's AST (`dorian.pyast`); they read only and
-never execute the target. See `dorian/pyast.py` and `spec/checkers.md`.
+ (compared by value AND type, so quote style / int base /
+ spacing are tolerated but 30 != 30.0 and 1 != True, and a
+ comment/docstring mention cannot pass). FAIL on a value
+ drift; ERROR on a non-literal RHS.
+- code::: semantic regex (Python-only): re.search over the file with
+ comments and docstrings BLANKED (a fact surviving only in a
+ comment/docstring FAILs; real string literals are kept).
+ Same 500-char cap + worker-process timeout as `regex:`;
+ ERROR('code_unparseable') on a non-parseable / non-Python target.
+
+The `py-*` structural and `code:` semantic forms parse the file's AST (`dorian.pyast`);
+they read only and never execute the target. See `dorian/pyast.py` and `spec/checkers.md`.

 `regex:` is the shape-tolerant form: prefer it over `string:` for facts that must survive reformatting (the v0.0 false-positive class — e.g. 'TIMEOUT\\s*=\\s*30'
diff --git a/tests/test_action_security_defaults.py b/tests/test_action_security_defaults.py
index f369aff..a27a3d1 100644
--- a/tests/test_action_security_defaults.py
+++ b/tests/test_action_security_defaults.py
@@ -74,3 +74,30 @@ def test_security_docs_reflect_trusted_base_as_implemented() -> None:
     assert "not yet a full public-fork story" not in low
     # the actual feature is named (Action input and/or CLI flag)
     assert "checker_trust" in doc or "checker-source" in doc
+
+
+def test_no_live_doc_calls_trusted_base_unimplemented() -> None:
+    """Release-audit regression: trusted-base shipped in V1, so NO live doc may still
+    describe it as unimplemented/design-only. A prior pass fixed SECURITY.md and
+    action/README.md but missed START_HERE.md and ROADMAP_BACKLOG.md — this scans every
+    live doc (README, SECURITY, action README, docs/*.md). Archival change-notes
+    (docs/changes/) and history (docs/history/) are dated snapshots, intentionally excluded
+    (docs/*.md does not recurse into them)."""
+    import re
+
+    stale = re.compile(
+        r"not\s+(yet\s+)?implemented|design[- ]only|\(not implemented\)", re.IGNORECASE
+    )
+    live = [REPO_ROOT / "README.md", REPO_ROOT / "SECURITY.md", ACTION_README]
+    live += sorted((REPO_ROOT / "docs").glob("*.md"))
+    offenders = []
+    for path in live:
+        if not path.is_file():
+            continue
+        for i, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
+            low = line.lower()
+            if ("trusted-base" in low or "trusted base" in low) and stale.search(line):
+                offenders.append(f"{path.relative_to(REPO_ROOT)}:{i}: {line.strip()}")
+    assert not offenders, "live doc(s) still call trusted-base unimplemented:\n" + "\n".join(
+        offenders
+    )
diff --git a/tests/test_benchmark_evidence.py b/tests/test_benchmark_evidence.py
index 640e955..30dd6d0 100644
--- a/tests/test_benchmark_evidence.py
+++ b/tests/test_benchmark_evidence.py
@@ -34,12 +34,14 @@ def test_historical_benchmark_docs_are_labeled_historical() -> None:


 def test_current_benchmark_doc_is_version_and_commit_stamped() -> None:
+    import re
     import tomllib

     doc = _read("docs/BENCHMARK_CURRENT.md")
     version = tomllib.loads(_read("pyproject.toml"))["project"]["version"]
     assert version in doc, f"current benchmark doc must stamp the live version {version!r}"
-    assert "measured commit" in doc.lower()
+    # must be commit-stamped (metric/release commit) — accept any 7+ hex SHA reference
+    assert "commit" in doc.lower() and re.search(r"\b[0-9a-f]{7,40}\b", doc)
     assert "Python" in doc  # environment summary
     assert "reproduce" in doc.lower()
     # the mandatory non-overclaim block
diff --git a/tests/test_warrant_quality.py b/tests/test_warrant_quality.py
index b3bebe7..de7b479 100644
--- a/tests/test_warrant_quality.py
+++ b/tests/test_warrant_quality.py
@@ -124,6 +124,31 @@ def test_deterministic_output(wq, tmp_path: Path) -> None:
     assert (repo / "src/config.py").read_text() == "TIMEOUT = 30\n"


+def test_path_escape_operand_does_not_escape_sandbox(wq, tmp_path: Path) -> None:
+    """A warrant-controlled `../`-escaping file operand must not make the harness read or
+    write outside the repo/temp sandbox — the module docstring advertises containment."""
+    from dorian.model import CheckerSpec, Claim
+
+    repo = _repo(tmp_path)
+    # a file OUTSIDE the repo the mutation must never read or rewrite
+    outside = tmp_path / "outside.py"
+    outside.write_text("LIMIT = 5\n", encoding="utf-8")
+    outside_mtime = outside.stat().st_mtime
+    claim = Claim(
+        id="esc",
+        text="x",
+        kind="quantity",
+        load_bearing=True,
+        checkers=(CheckerSpec(type="C3", program="py-const:../outside.py::LIMIT::5"),),
+    )
+    verdict = wq._run_mutated(
+        repo, claim, 0, "../outside.py", lambda t: t + "\nX=1\n", ExecutionPolicy()
+    )
+    assert verdict.value == "ERROR"  # refused, not run against the out-of-repo file
+    assert outside.read_text(encoding="utf-8") == "LIMIT = 5\n"  # untouched
+    assert outside.stat().st_mtime == outside_mtime  # not even rewritten
+
+
 def test_cli_smoke_json(wq, tmp_path: Path, capsys) -> None:
     from dorian import cli

From 24ae7c82bcd6f1bffc273f8fb2a6bf4ab2ffc035 Mon Sep 17 00:00:00 2001
From: Ajay Surya
Date: Mon, 15 Jun 2026 20:13:29 +0530
Subject: [PATCH 12/13] audit: restamp BENCHMARK_CURRENT metric commit to the audited release commit

The benchmarks were re-run during the release audit at 33e9eaf and are identical (large-mutation P=R=0.93, 11.6x/10.4x; binding-lifecycle 808 pairs 0.54->1.00, precision/recall 1.00; realworld 2/1/2). Stamp the metric commit as 33e9eaf; the tagged release commit is only this docs re-stamp (git diff 33e9eaf..HEAD -- src bench is empty). Fixes the earlier note which referenced b7376e7 and predated the c3_ref docstring edit.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/BENCHMARK_CURRENT.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/BENCHMARK_CURRENT.md b/docs/BENCHMARK_CURRENT.md
index ef3834a..a26d3df 100644
--- a/docs/BENCHMARK_CURRENT.md
+++ b/docs/BENCHMARK_CURRENT.md
@@ -11,8 +11,8 @@ and are kept as-is for provenance.
 | field | value |
 | --- | --- |
 | dorian version | `1.0.0rc1` (V1 release candidate) |
-| metric commit | `b7376e7` (the benchmark figures were measured here) |
-| release commit | the tagged `v1.0.0rc1` commit is a later **docs/release-hygiene only** commit; `git diff b7376e7.. -- src bench` is empty, so the figures apply unchanged |
+| metric commit | `33e9eaf` (the benchmark figures were measured here, during the release audit) |
+| release commit | the tagged `v1.0.0rc1` commit is the immediate docs-only re-stamp of this file over the metric commit; `git diff 33e9eaf.. -- src bench` is empty, so the figures apply unchanged |
 | Python | 3.12.4 |
 | platform | darwin (CI matrix: 3.11 / 3.12 / 3.13) |
 | reproduce | `dorian bench large-mutation` · `dorian bench binding-lifecycle` · `dorian bench realworld-usecases` |

From 79136d57b6c0fe9970de1e3bea6d92cf569a565e Mon Sep 17 00:00:00 2001
From: Ajay Surya
Date: Mon, 15 Jun 2026 20:50:07 +0530
Subject: [PATCH 13/13] docs(readme): reflect shipped trusted-base + v1.0.0rc1 tag

- intro blockquote: note checker_trust: base as the public/fork trust root (still not a sandbox), instead of flatly "not public CI for forked PRs" now that trusted-base shipped.
- roadmap: "tagged release" is done (v1.0.0rc1 prerelease); only PyPI trusted publishing remains.

Post-tag branch update (the v1.0.0rc1 tag stays frozen at 24ae7c8); folds into the next tag / the PR to main.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 06c59cf..d58973b 100644
--- a/README.md
+++ b/README.md
@@ -36,8 +36,9 @@ now and is re-checked on every future change, so a confident summary doesn't qui
 > commits, nothing else — with **zero model tokens at check time**, so the checker can't be talked
 > past by the code it verifies. Because checker programs are *executable* (C4 runs `pytest`, C5
 > `shell:` runs a command), it is built for **trusted, internal repositories** — not public CI
-> taking forked pull requests. Pairs naturally with a coding agent such as **Claude Code**
-> ([how](#using-dorian-with-claude-code)).
+> taking forked pull requests by default (for public/fork PRs, `checker_trust: base` runs only
+> base-approved checker specs — a trust root, still not a sandbox). Pairs naturally with a coding
+> agent such as **Claude Code** ([how](#using-dorian-with-claude-code)).

 ## Table of contents

@@ -475,7 +476,8 @@ work perishable, so you find out when it expired.
 ([`docs/REALWORLD_USECASES.md`](docs/REALWORLD_USECASES.md)) reproduce real problem *classes*; the next rung is frozen public-repo SHAs with manual claims and reproducible known-truth labels ([`docs/SOLO_VALIDATION_LADDER.md`](docs/SOLO_VALIDATION_LADDER.md)).
-- **Tagged release and PyPI trusted publishing.**
+- **PyPI trusted publishing** — tagged releases now ship (latest: **`v1.0.0rc1`**, a V1 release
+ candidate / prerelease); publishing `dorian-vwp` to PyPI via a Trusted Publisher is next.

 Non-goals stay non-goals: no servers, no dashboards, no hosted control plane, no model at check time. Local-first is the design center.
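
The path-escape regression test added to `tests/test_warrant_quality.py` earlier in this series expects a `../`-escaping file operand to be refused rather than read or rewritten. One common way to get that containment is to resolve the operand against the repo root and reject anything that lands outside it. The sketch below shows that approach under stated assumptions: `resolve_inside` is a hypothetical helper, not a function from the dorian codebase, and the harness may enforce containment differently.

```python
from __future__ import annotations

from pathlib import Path


def resolve_inside(root: Path, operand: str) -> Path | None:
    """Resolve `operand` against `root`; return None if it escapes `root`.

    Hypothetical helper for illustration. Both paths are fully resolved
    first, so `..` segments, absolute operands, and symlinks are all judged
    by where they actually point.
    """
    resolved_root = root.resolve()
    candidate = (resolved_root / operand).resolve()
    if candidate.is_relative_to(resolved_root):  # available since Python 3.9
        return candidate
    return None


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "repo"
        repo.mkdir()
        print(resolve_inside(repo, "src/config.py"))  # a path inside the repo
        print(resolve_inside(repo, "../outside.py"))  # None: escapes the repo
```

A caller would turn a `None` result into an ERROR verdict before opening the file, which matches the "refused, not run against the out-of-repo file" expectation in the test.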