dorian V1 strengthening (1.0.0rc1) #6
…ength diagnostics
WP3/WP4/WP2/WP6 of the v0.11.0 -> V1 strengthening program (research-report driven).
- C3 py-signature:/py-const: structural checkers (dorian/pyast.py): AST-based, close the symbol-existence and string/regex comment-survival ceilings; gutted-body remains the documented ceiling (only C4 catches a body change behind an unchanged signature).
- C3 code: semantic-context regex over comment/docstring-stripped Python (same ReDoS worker-timeout as regex:), so a fact surviving only in a comment/docstring FAILs.
- checker-strength / claim-risk diagnostics (dorian/strength.py): classify truth strength per checker, flag kind-vs-strength adequacy mismatches, advisory C4 zero/constant-assertion lint; surfaced in `dorian bindings` (JSON + human) and the opt-in --binding-gate warn output. Advisory only: never a verdict/trust/exit change.
- watch derivation + binding diagnostics recognize the new C3 forms (seal, bindings).
- spec/checkers.md + docs/AGENT_CLAIMS.md document the new grammars and the ceiling.
+58 tests (test_pystructural, test_semantic_context, test_strength); 619 non-slow pass. ERROR-vs-FAIL discipline preserved; trigger-vs-truth split made explicit, not blurred.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
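The `code:` form in this commit matches a regex only against Python whose comments and docstrings have been removed. The sketch below illustrates that idea under stated assumptions: ASCII source, and the helper names `strip_comments_and_docstrings` and `code_matches` are hypothetical, not dorian's actual API.

```python
import ast
import io
import re
import tokenize


def strip_comments_and_docstrings(source: str) -> str:
    """Blank out comments and docstrings, keeping line/column layout (illustrative)."""
    spans = []  # (start_line, start_col, end_line, end_col); lines are 1-based

    # The tokenizer distinguishes a real comment from a '#' inside a string.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            spans.append((*tok.start, *tok.end))

    # A docstring is a bare string constant opening a module, class, or function body.
    owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, owners) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                doc = first.value
                spans.append((doc.lineno, doc.col_offset,
                              doc.end_lineno, doc.end_col_offset))

    # Overwrite each span with spaces so other text on the same line survives.
    rows = [list(line) for line in source.splitlines(keepends=True)]
    for start_line, start_col, end_line, end_col in spans:
        for lineno in range(start_line, end_line + 1):
            row = rows[lineno - 1]
            lo = start_col if lineno == start_line else 0
            hi = end_col if lineno == end_line else len(row)
            for i in range(lo, min(hi, len(row))):
                if row[i] not in "\r\n":
                    row[i] = " "
    return "".join("".join(row) for row in rows)


def code_matches(pattern: str, source: str) -> bool:
    """True only if the pattern matches real code, not a comment or docstring."""
    return re.search(pattern, strip_comments_and_docstrings(source)) is not None
```

A fact that survives only in a comment or docstring then stops matching, while the same regex over the raw text would still match. This sketch omits the worker timeout that the commit mentions for ReDoS protection.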
…(WP7)
revalidate --checker-source {head,base} (default head) + Action checker_trust input.
base mode resolves each candidate claim's checker SPEC from the --since (base) ref and
runs it against PR-head sources, so a PR-added or PR-modified executable (C4/C5 shell)
checker is never executed and a rewritten checker cannot self-attest a verdict (base
spec wins; the change is surfaced as a trust-root note). Fail-closed: a missing/tampered
base sidecar ERRORs (never executed), never BROKEN, never green. Composes with
deny-exec. NOT a sandbox — a base-approved pytest checker can still run head code, stated
in every surface.
- revalidate.py: checker_source param, _load_base_warrant (integrity-checked base
sidecar via gitio.file_at_ref), RevalResult.notes, text/md rendering of notes.
- cli.py/commands.py: --checker-source flag + DORIAN_CHECKER_SOURCE env fallback;
base requires --since.
- action.yml: checker_trust input (default head) -> DORIAN_CHECKER_SOURCE; README + Inputs
table updated (also documents the pre-existing deny_exec/deny_shell inputs).
- docs: TRUSTED_BASE_ACTION_DESIGN status -> IMPLEMENTED; SECURITY_BOUNDARY public-fork
checklist updated (trust-root conditions met; sandboxing still out of scope).
- tests/test_trusted_base.py: the §6 exploit matrix (10 cases) — each "executed?" case
proven by a sentinel touch that must NOT appear under base mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
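The base-mode rule in this commit (base spec wins, a missing or tampered base sidecar ERRORs, nothing from the PR head is trusted) can be sketched as a small pure function. This is an illustration only: `Resolution`, `resolve_checker_spec`, and the injected `load_base_spec` callable are hypothetical stand-ins, not the real `revalidate.py` or `_load_base_warrant` signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Resolution:
    status: str                     # "RUN" or "ERRORED"
    spec: Optional[str] = None      # the checker spec that would be executed
    notes: list = field(default_factory=list)


def resolve_checker_spec(
    claim_id: str,
    head_spec: str,
    load_base_spec: Callable[[str], str],
) -> Resolution:
    """Pick the spec to run under base mode; never fall back to the head spec."""
    try:
        # Assumed contract: raises FileNotFoundError if the base sidecar is
        # missing and ValueError if its integrity check fails.
        base_spec = load_base_spec(claim_id)
    except (FileNotFoundError, ValueError) as exc:
        # Fail closed: report an error and execute nothing.
        return Resolution("ERRORED", notes=[f"{claim_id}: base sidecar unavailable ({exc})"])

    notes = []
    if base_spec != head_spec:
        # The PR changed the checker; surface it, but run the base version.
        notes.append(f"{claim_id}: checker changed in PR; running base spec")
    return Resolution("RUN", spec=base_spec, notes=notes)
```

Because the function has no branch that returns `head_spec`, a checker added or rewritten in the PR cannot be selected. As the commit states, this is a trust root and not a sandbox: a base-approved checker may still run code from the PR head.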
Extends binding beyond Python definers/console-scripts to config keys in tracked .toml/.json files: a claim mentioning a config key is re-checked when the defining config file changes. Conservative and trigger-only (never proves truth):
- symbol_index.config_key_index: key -> tracked .toml/.json files + unparseable list. YAML deliberately excluded (parsing needs a third-party dep; core stays zero-dep).
- claim_config_watch_paths + claim_watch_paths (unified symbol+script+config union); verify/rebind now widen with the merged watch set.
- ambiguous_config_mentions: a key in >1 file is skipped (a wrong watch is a false alarm) and surfaced via verify warnings + bind-suggest, never guessed.
- unparseable supported config files are surfaced as a diagnostic, never silent.
- bind-suggest gains provenance (bind (symbol) vs bind (config)) + ambiguous-config + unparseable-config lines; JSON adds bind_config/ambiguous_config/unparseable_config.
- config_key_index degrades to empty on a non-git repo (never blocks).
Updated test_symbol_index pyproject-script expectation (a script-name claim now also watches pyproject.toml where the script is declared) and the trusted-base design doc-guard (now IMPLEMENTED). 639 non-slow tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lity (WP8)
Offline, per-claim evidence: for each claim it derives deterministic mutations from the checker grammar and records whether the verdict matched expectation:
- falsify (rename symbol / reassign const / change param): expect FAIL; a PASS is a MISS;
- benign (trailing comment): expect PASS; a FAIL is BRITTLE (false alarm);
- ceiling (content drift keeping an existence symbol): expect PASS, recorded as the documented trigger-vs-truth ceiling, never a penalty.
ERROR (e.g. an executable checker under --deny-exec) is its own bucket, never a miss. Output is deterministic (no timestamps/randomness) and never mutates the real repo: each mutation runs against a throwaway copy of only the file the checker reads. Honest scope: structural/existence C3 forms are mutation-scored; string/regex/code, typed C5, C1, C4 are reported with strength and `mutation: unsupported` (no fabricated mutation). Registered as the `warrant-quality` bench subcommand. 7 tests; 645 non-slow pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
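The falsify/benign/ceiling scoring can be illustrated with a toy existence checker run over in-memory mutations. This is a sketch, not the harness itself: `existence_verdict`, `classify`, `MUTATIONS`, and `score` are hypothetical names, and the labels `CAUGHT`, `OK`, and `UNEXPECTED` are assumptions added for illustration (only MISS, BRITTLE, the ceiling bucket, and ERROR come from the commit message).

```python
import ast


def existence_verdict(source: str, name: str) -> str:
    """Toy existence checker: PASS if a def/class called `name` exists."""
    tree = ast.parse(source)
    found = any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and node.name == name
        for node in ast.walk(tree)
    )
    return "PASS" if found else "FAIL"


def classify(kind: str, verdict: str) -> str:
    """Compare a verdict on mutated source against the expectation for its kind."""
    if verdict == "ERROR":
        return "ERROR"  # its own bucket, never counted as a miss
    if kind == "falsify":
        return "CAUGHT" if verdict == "FAIL" else "MISS"
    if kind == "benign":
        return "OK" if verdict == "PASS" else "BRITTLE"
    if kind == "ceiling":
        return "CEILING" if verdict == "PASS" else "UNEXPECTED"
    raise ValueError(f"unknown mutation kind: {kind}")


# Deterministic mutations: no randomness, no timestamps, no file writes.
MUTATIONS = {
    "falsify": lambda src, name: src.replace(f"def {name}(", f"def {name}_renamed("),
    "benign": lambda src, name: src + "# trailing comment\n",
    "ceiling": lambda src, name: src.replace("return 1", "return 2"),
}


def score(source: str, name: str) -> dict:
    """Apply each mutation to a copy of the source and classify the verdict."""
    return {
        kind: classify(kind, existence_verdict(mutate(source, name), name))
        for kind, mutate in MUTATIONS.items()
    }
```

The ceiling row shows the trigger-vs-truth limit: an existence checker still passes after the function body changes, and the harness records that as a documented ceiling rather than a failure of the checker.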
…WP1)
- docs/BENCHMARK_CURRENT.md: version- and commit-stamped reruns of the reproducible suites on current code: large-mutation (240 pairs, P=R=0.93, 11.6x/10.4x FP reduction), binding-lifecycle (808 pairs, selection recall 0.54->1.00, alarm precision/recall 1.00, 0 errored), realworld (5 cases 2/1/2), and the new warrant-quality harness. The reruns MATCH the historical runs (same content-derived run_id), proving the V1 changes are additive and do not regress the benchmarks. Includes a what-this-does-NOT-prove block.
- HISTORICAL banners on docs/BENCHMARK_v0.7.0.md (v0.7.0) and docs/BENCHMARK_BINDING_LIFECYCLE.md (0.9.0), each pointing to BENCHMARK_CURRENT.md; the historical numbers are preserved verbatim.
- docs/V1_SCOPE.md: what V1 strengthening means and does NOT mean (no universal semantic correctness; trusted-base is a trust root not a sandbox; config binding is TOML/JSON only; code:/structural are Python-only; extractor stays draft; carried-forward limitations).
- README: trust-state legend (WARRANTED born -> TRUSTED/DEGRADED/REVOKED/UNKNOWN), historical labels on the benchmark citations, command-surface entries for the new C3 forms, checker-strength in bindings, config provenance in bind-suggest, checker-source base, and bench warrant-quality.
- tests/test_benchmark_evidence.py: wording guards (historical docs labeled; current doc version/commit-stamped with a non-overclaim block; README links current; V1_SCOPE boundary).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_large_mutation::test_committed_doc_matches_render asserts docs/BENCHMARK_v0.7.0.md == lm.render_markdown(summary), so the generated doc cannot carry a hand-added banner. Drop the HISTORICAL banner from it (the title already version-stamps it "(v0.7.0)"); its historical status is conveyed by README + BENCHMARK_CURRENT.md (which names it as the historical source). The binding-lifecycle doc has no byte-match guard, so it keeps its banner. Updated test_benchmark_evidence to match: binding-lifecycle by banner, v0.7.0 by version-stamped title + the current doc's cross-reference. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…items
A 5-lens adversarial review (BLOCK verdict) reproduced 6 real defects; all fixed red-green, plus two hygiene items that contradicted stated invariants:
1. Config-key over-binding broke "default unchanged unless opt-in": a claim backticking a common config word (e.g. `dependencies`) bound pyproject.toml and could newly refuse a clean `verify` with exit 6. Fix: _CONFIG_KEY_STOPWORDS (PEP 621 / common keys) on the config axis; specific keys (max_workers) still bind. Regression test reproduces the exit-6.
2/3. SECURITY.md + action/README.md still said trusted-base was "not yet implemented", which is false on this branch. Updated both to describe checker_trust: base as shipped (with the non-sandbox residual); added a guard test so the drift can't recur.
4. `code:` false PASS on an f-string docstring: code_only_python now recognises ast.JoinedStr docstrings.
5. `code:` false FAIL on a real string co-located on a docstring's line: docstrings are now blanked by AST node SPAN, not whole line.
6. `py-const` PASSed on value-TYPE drift (30/30.0, 1/True, 0/False) via Python ==; it now requires matching type before ==. Documented + red-green tested.
Hygiene: warrant-quality _run_mutated refuses a `../`-escaping file operand (its docstring promised containment); check_signature wraps comparison in _PARSE_ERRORS so a pathological signature ERRORs within pyast.
Added an end-to-end ERROR-never-BROKEN test for the new C3 forms (non-literal RHS -> ERRORED, exit 5, never BROKEN).
658 non-slow tests pass; lint clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
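Defect 6 comes from Python's own equality rules: `30 == 30.0` and `1 == True` are both true, so a value-only comparison cannot see type drift. The sketch below shows a type-then-value check on a module-level constant. `check_const` and `read_constant` are hypothetical names, not the real `pyast.py` functions, and returning FAIL for a constant that is absent is an assumption.

```python
import ast


def read_constant(source: str, name: str):
    """Return the literal value assigned to a module-level `name`."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    # literal_eval accepts an AST node and rejects non-literals.
                    return ast.literal_eval(node.value)
    raise LookupError(name)


def check_const(source: str, name: str, expected) -> str:
    """PASS only when both the type and the value match the expectation."""
    try:
        actual = read_constant(source, name)
    except LookupError:
        return "FAIL"   # assumption: a missing constant is a plain failure
    except (ValueError, TypeError, SyntaxError):
        return "ERROR"  # non-literal right-hand side or unparseable source
    # Compare types first, so 30 vs 30.0 or 1 vs True cannot slip through ==.
    if type(actual) is not type(expected):
        return "FAIL"
    return "PASS" if actual == expected else "FAIL"
```

Only the top-level type is compared here, so drift nested inside a container (for example `[1]` versus `[True]`) would still pass this sketch.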
All V1 strengthening work packages (WP1-WP9) are implemented, tested, and documented; the 5-lens adversarial review's BLOCK findings are all resolved with regression tests; 733 tests pass (incl. slow); lint clean. Bump the three version surfaces (pyproject / __init__ / uv.lock) to the V1 release candidate. No tag, push, or publish. rc1 (not final 1.0.0) is honest: the candidate invites real-repo benchmark validation and the explicitly-deferred post-V1 items (declarative-structural checkers, route/SQL binding indices, YAML config binding, audit-event atomicity) documented in docs/V1_SCOPE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st-fix) Re-ran large-mutation / binding-lifecycle / realworld at commit b7376e7 (1.0.0rc1), after the adversarial-review fixes: figures identical (large-mutation P=R=0.93, 11.6x/10.4x; binding-lifecycle 808 pairs 0.54->1.00 selection, 1.00 alarm; realworld 2/1/2), confirming the fixes don't touch the benchmarked paths. Version/commit stamps updated; the version-stamp evidence test now reads the live pyproject version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final evidence-backed report: version gate, per-WP status, commands+results, verification evidence, trigger-vs-truth preservation, security posture, benchmark posture, remaining risks/non-goals, and the release decision (1.0.0rc1 candidate; no tag/push/publish). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent release audit (FIXED_NEEDED) findings, all repaired:
Blockers:
- docs/START_HERE.md still called trusted-base "(not yet implemented)", a user-facing CI entry-point doc the prior fix missed; now describes it as implemented (V1).
- docs/BENCHMARK_BINDING_LIFECYCLE.md banner said the current rerun was "0.11.0" while the branch is 1.0.0rc1 (and BENCHMARK_CURRENT says 1.0.0rc1); corrected the version.
- internal program docs (V1_IMPLEMENTATION_TRACKER.md, V1_ALIGNMENT_REPORT.md) were tracked; gitignored + git rm --cached (kept on disk as provenance). Also gitignore the research report, audit gate, release notes, and tool dirs (.claude/, .gitnexus/). docs/V1_SCOPE.md stays tracked (it is a public doc).
Should-fixes:
- docs/ROADMAP_BACKLOG.md trusted-base item flipped DEFER/HUMAN-REVIEW -> SHIPPED (V1).
- c3_ref.py module docstring now documents the code: form (was omitted) and the py-const value-AND-type rule.
- action.yml / action/README.md drop the stale 'dorian-vwp==0.6.*' pin example (no PyPI release yet) for the git source spec.
- docs/BENCHMARK_CURRENT.md labels the metric commit vs the (docs-only) release commit.
Hardened tests: test_no_live_doc_calls_trusted_base_unimplemented scans ALL live docs (not just SECURITY.md/action README); warrant-quality path-escape test pins the containment guard; benchmark-evidence commit-stamp check is version-agnostic.
660 non-slow tests pass; lint clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… commit The benchmarks were re-run during the release audit at 33e9eaf and are identical (large-mutation P=R=0.93, 11.6x/10.4x; binding-lifecycle 808 pairs 0.54->1.00, precision/recall 1.00; realworld 2/1/2). Stamp the metric commit as 33e9eaf; the tagged release commit is only this docs re-stamp (git diff 33e9eaf..HEAD -- src bench is empty). Fixes the earlier note which referenced b7376e7 and predated the c3_ref docstring edit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- intro blockquote: note checker_trust: base as the public/fork trust root (still not a sandbox), instead of flatly "not public CI for forked PRs" now that trusted-base shipped.
- roadmap: "tagged release" is done (v1.0.0rc1 prerelease); only PyPI trusted publishing remains.
Post-tag branch update (the v1.0.0rc1 tag stays frozen at 24ae7c8); folds into the next tag / the PR to main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Caution: Review failed. Pull request was closed or merged during review.
📝 Walkthrough
This PR releases dorian v1.0.0rc1, adding five V1 features: Python-AST structural C3 checker forms (…
Sequence Diagram(s)
sequenceDiagram
participant PR as PR CI (Action)
participant cmd as commands.cmd_revalidate
participant reval as revalidate()
participant base as _load_base_warrant()
participant git as git base ref
participant checker as _check_claim()
PR->>cmd: checker_trust=base, since=base_sha
cmd->>reval: checker_source="base", since=base_sha
loop per artifact claim
reval->>base: artifact_uri, base_sha
base->>git: read .warrant sidecar at base_sha
git-->>base: raw bytes or FileNotFoundError
base-->>reval: Warrant | None
alt base warrant missing or tampered
reval-->>reval: claim → ERRORED, no execution
else PR added/modified executable checker
reval-->>reval: record trust-root note, use base spec
reval->>checker: base-approved CheckerSpec
checker-->>reval: PASS | FAIL | ERROR
else spec unchanged
reval->>checker: base spec
checker-->>reval: PASS | FAIL | ERROR
end
end
reval-->>cmd: RevalResult(notes=[...])
cmd-->>PR: exit code + sticky PR comment
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Summary
V1 strengthening program (from v0.11.0), released as the prerelease tag `v1.0.0rc1` and independently release-audited (FIXED_PASS). All changes are additive and backward-compatible.
- `py-signature:` / `py-const:` (AST; close the `symbol:` existence and `string:`/`regex:` comment-survival ceilings; `py-const:` compares value and type).
- `code:` (regex over comment/docstring-stripped Python).
- Checker-strength diagnostics in `dorian bindings` (+ C4 adequacy lint); advisory.
- Config-key binding for `.toml`/`.json` (zero runtime deps; YAML excluded).
- `revalidate --checker-source base` / Action `checker_trust: base`: base-approved checker specs only for public/fork PRs (a trust root, not a sandbox).
- `dorian bench warrant-quality`: offline per-claim mutation scoring.
- `V1_SCOPE.md`, version-stamped `BENCHMARK_CURRENT.md`, historical benchmark labels.
Two adversarial reviews (implementation BLOCK → 6 must-fixes; independent release audit FIXED_PASS → 2 doc-drift blockers + should-fixes) all resolved with regression tests.
Invariants preserved: ERROR ≠ BROKEN; checkers read-only (except C4/C5-shell); binding is trigger-only; zero runtime dependencies. Release candidate, not final 1.0.0 (real-repo public micro-benchmark + RC caveats remain post-V1; see `docs/V1_SCOPE.md`).
Test Plan
`uv run pytest` → 735 passed (incl. slow); ruff clean; `uv build` + clean-venv install → dorian 1.0.0rc1; benchmarks reproduce unchanged; trusted-base exploit matrix (10 cases) passes.