Update evals dataset to tag draft documents #462
…s, 1 LLM-non-det improvement)
Pull request overview
This PR introduces a document_type tag (Final / Draft / Proposal) to eval manifest entries and updates the date-extraction eval to only run on Final documents, since adoption dates aren’t meaningful for drafts/proposals. It also updates stored dev/held-out eval result artifacts to reflect the new filtered dataset and reruns.
Changes:
- Filter `evals/test_run_date_extraction_evals.py` cases to `document_type == "Final"` (defaulting missing `document_type` to `Final`).
- Add `document_type` to dev + held-out solar manifests.
- Refresh checked-in dev/held-out result JSON and dev breakdown CSV after rerunning evals with the new filtering.
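The filtering rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `evals/test_run_date_extraction_evals.py`: the helper name `select_final_cases`, the `name` key, and the sample entries are hypothetical, and the only behavior taken from the PR is that cases are kept when `document_type` is `Final` and that a missing `document_type` is treated as `Final`.

```python
from typing import Any

FINAL = "Final"


def select_final_cases(cases: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only cases tagged Final; an untagged case counts as Final."""
    # dict.get supplies the default when the key is absent, so entries
    # without a document_type tag are still included in the eval run.
    return [case for case in cases if case.get("document_type", FINAL) == FINAL]


# Hypothetical manifest entries, for illustration only.
cases = [
    {"name": "case_a", "document_type": "Final"},
    {"name": "case_b", "document_type": "Draft"},
    {"name": "case_c", "document_type": "Proposal"},
    {"name": "case_d"},  # no tag: treated as Final
]

kept = select_final_cases(cases)
print([case["name"] for case in kept])
```

Defaulting to `Final` keeps untagged entries in the eval run, which is the behavior the bullet describes; a stricter variant could instead raise on a missing tag so that unclassified documents are caught early.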
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| evals/test_run_date_extraction_evals.py | Filters date extraction eval cases to Final documents only. |
| evals/data/dev/solar/manifest.json5 | Adds document_type annotations per dev-case classification. |
| evals/data/held-out/solar/manifest.json5 | Adds document_type annotations (all Final). |
| evals/results/dev/date_extraction_evals.json | Updates dev aggregate metrics after filtering to Final only. |
| evals/results/dev/date_extraction_evals_breakdown.csv | Updates per-case breakdown after filtering/rerun. |
| evals/results/held_out/date_extraction_evals.json | Updates held-out aggregate metrics after rerun. |
ppinchuk left a comment
I don't think you should hold this PR, but I would recommend changing the "document_type" key to "document_publish_status" or "document_draft_status" or even just "document_status" (or whatever else you like) in order to avoid confusion with actual document types such as pdf, text, doc, etc.
Good point. I changed it.
Adds a `document_type` field (Final / Draft / Proposal) at the top level of each manifest entry. The date extraction eval now runs only on `Final` documents: date extraction is only meaningful for enacted ordinances, since drafts and proposals have no adoption date by definition.

Dev: 13/48 tagged Draft/Proposal (per prior failure analysis in `unified-query-failure-analysis_v2.md`); the filter reduces the run to 35 cases. No real verdict regressions on the kept Final cases: every previously-passing Final case still passes; one (Greene, TN) improved via LLM non-determinism.

Held-out: every case classified by reading the document text (in parallel subagents). All 22 are Final; one PDF (`Carroll_County_Indiana.pdf`) was replaced because the file on disk contained Carroll County, Maryland content instead of the Indiana ordinance the manifest pointed to. Held-out accuracy/verdicts unchanged after the swap; only the OCR overhead from the now-correctly-scanned PDF shows up as a small time/token bump in the baseline.