Skip to content

feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9)#50

Merged
ulmentflam merged 4 commits into
mainfrom
nightly/prune-admin-011742Z
May 25, 2026
Merged

feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9)#50
ulmentflam merged 4 commits into
mainfrom
nightly/prune-admin-011742Z

Conversation

@ulmentflam
Copy link
Copy Markdown
Owner

@ulmentflam ulmentflam commented May 23, 2026

Summary

First task of rfc-corpus-growth-controls.md: introduce corpus_forge/admin/prune.py with a prune_dataset(...) entry point that scores chunks on a weighted rubric and selects the bottom N% as prune candidates. Default is dry-run; deletion only happens with explicit apply=True.

  • prune_dataset(backend, *, dataset, percentile=10, apply=False, ...) -> PruneReport — scoring + selection + (gated) deletion.
  • PruneReport carries the per-chunk score breakdowns, source-grouped summary, and a duplicate_density_available flag so callers can warn when the MinHash backend is absent.
  • Reuses the existing curation sub-score primitives in corpus_forge.curation.selector. Adds two new prune-tuned signals (duplicate_density, feedback_drag) inline in prune.py.
  • Backend-agnostic apply path: prefers a delete_chunks_by_ids hook, falls back to bulk ANY(%s) (Postgres) or chunked IN (?, …) (SQLite, 500-id batches). Dispatch keys on _paramstyle == "pyformat" with a class-name fallback for test doubles.
  • Unknown-dataset safety: raises ValueError before any candidate walk when a named dataset doesn't resolve — silent corpus-wide deletion under apply=True would be too sharp an edge.

Explicitly out of scope (separate RFC tasks)

  • corpus-forge prune CLI verb.
  • GrowthConfig Pydantic block (prune_percentile_default, etc.).
  • score_for_pruning(chunk, ...) extension on the selector.
  • Per-source caps (max_rows, max_bytes).
  • estimate sync pre-flight.
  • tests/integration/test_prune_e2e.py (postponed to the CLI-verb task where there's an end-to-end surface).

Test plan

  • tests/unit/test_prune_scorer.py — 22 tests pass.
    • Dry-run default never touches the delete path.
    • Score ordering (worst rows surface first).
    • Percentile bounds (validated 0–100 inclusive; ValueError on out-of-range).
    • feedback_drag flips on rejected feedback / negative rating.
    • duplicate_density silently zeros when MinHash unimportable; flag promoted to PruneReport.
    • apply=True calls the delete_chunks_by_ids hook when present.
    • Postgres bulk ANY(%s) branch covered via _paramstyle="pyformat" test double.
    • SQLite chunked IN (?, ?, …) branch covered via 7-id pool with batch size monkey-patched to 3 (3 batches, last batch = 1).
    • Unknown-dataset → ValueError before walk.
  • No regressions in tests/unit/test_curation_selector.py / test_selector_learned_signal.py (103 tests pass combined).
  • ruff check clean.
  • pyrefly check clean (pre-commit hook).
  • Manual smoke against a live Postgres dataset — deferred to CLI-verb task.

Disclosures

Detail in .nightly/runs/2026-05-23T01-15-27Z/tasks/0001-prune-admin-module/uncertainty.md:

  • chunk_feedback vs corpus.feedback schema discrepancy — dual-probed, both shapes tolerated.
  • Postgres dispatch heuristic (_paramstyle probe + class-name fallback).
  • Tie-break behavior on equal prune_score — Python's stable sort preserves backend yield order; not contract-tested.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added dataset pruning functionality with configurable scoring rubric and optional batch deletion of low-quality chunks.
    • Enhanced ingestion progress reporting with per-source task tracking and improved runtime metrics.
  • Bug Fixes

    • Improved error classification during data ingestion to better identify and report embedder-related failures.
  • Documentation

    • Updated changelog with details on pruning module capabilities.
  • Tests

    • Added comprehensive test coverage for pruning and ingestion workflows.

Review Change Stack

ulmentflam and others added 4 commits May 22, 2026 12:53
Two issues surfaced during a real ingest run:

1. The per-source progress bar was unbounded (``⠼ Ingest 0:00:01``,
   no percentage, no live ETA), even though the up-front
   ``_log_ingest_eta`` walk had already counted every file. The
   ETA was logged once at startup and then disappeared during the
   actual work — the user couldn't tell what fraction was done.

2. Ollama embedder 5xx with ``"failed to encode response: json:
   unsupported value: NaN"`` was logged as ``"Extractor failed on
   <file>"``. That mis-attributes a model quirk (NaN in the
   embedding vector) to file content, sending users hunting for
   "broken" files instead of revisiting the embedder choice.

Changes:

- Rename ``_log_ingest_eta`` → ``_plan_ingest``. Same walk + same
  ETA log, but now also returns a ``dict[id(source_config), int]``
  with per-source file counts. The previous "walk twice" cost
  (planner + ``source.scan()``) for filesystem-rooted sources is
  preserved on the planner side and consumed on the ingest side.
- ``ingest_once`` threads each source's total into ``make_progress``.
  Rich's bounded-mode columns (``BarColumn``, ``MofNCompleteColumn``,
  ``TimeRemainingColumn``) auto-render a live ETA from total +
  advance, so users now see ``X/Y`` + remaining time. Sources
  without a resolvable filesystem root (API-only plugins) keep
  the previous unbounded behaviour.
- New ``_classify_and_log_ingest_error`` classifies per-document
  failures: ``"unsupported value: NaN"`` → WARNING with a hint to
  swap embedders or filter empty chunks; embedder 5xx → WARNING;
  everything else keeps the historic ``extract_logger.info(
  "Extractor failed on X: …")`` taxonomy so existing grep
  patterns / dashboards still match.

Tests: 3 new (planner returns per-source counts; planner returns
``{}`` when no filesystem roots; NaN message logs at WARNING with
actionable hint; generic failure keeps the extractor-failed
wording). 173 estimate/ingest/runtime_profile tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on the per-source progress bar from the previous commit by
adding a persistent *global* bar that tracks total work across every
source in the run. The user asked for "a global estimation of total
time remaining on ALL items as a progress bar" — this is that.

Layout while ingest runs::

    Ingest (all sources) ━━━━━━━━ 4,321/250,003 0:01:30 1d 11h 30m
      filesystem         ━━━━━━━━     4,321/51,001 0:01:30 0:08:13

The top bar persists for the whole run; the indented per-source bar
appears when each source starts and is removed when it finishes, so
exactly one source task is visible at any time underneath the
global view.

Why one ``Progress``, not two
-----------------------------
Rich's ``Live`` (which ``Progress`` wraps) is single-active per
console — two nested ``make_progress`` blocks would race for the
screen. The fix is to keep a single ``Progress`` instance for the
duration of ``ingest_once`` and use ``add_task`` / ``remove_task``
to slot each source's child task in and out underneath the
persistent global task. Total = ``sum(_plan_ingest(config).values())``
or ``None`` when no source contributed a count (API-only configs
keep the previous indeterminate bar).

Other tidying
-------------
- Hoisted the ``import time`` that lived inside ``ingest_one`` to
  the top of the module — it's used both there and in the new
  per-source timing block, and the in-loop ``import time as _time``
  was needless ceremony now that ``ingest.py`` already imports
  stdlib at the top.
- Replaced the old single ``"Scan complete: N documents"`` log with
  a paired ``"Ingest (X) started"`` / ``"Ingest (X) complete: N in
  Ts (rate R/s)"`` so the rotating log still has the bookend pair
  that ``make_progress``'s ``logger=`` was previously providing per
  source (the outer ``make_progress`` now logs once for the whole
  run, not per source).

All 173 estimate + ingest + runtime_profile + MCP-estimate tests
pass. Pre-commit + pre-push hook gates clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer-flagged regression: the source + global progress bars only
advanced on successful ``ingest_one`` calls because the
``progress.update(..., advance=1)`` calls lived AFTER the
``try/except`` block and the ``except`` branch ended with
``continue``. Per-file failures (Ollama-NaN 5xx, extractor crashes,
…) silently dropped advances on the floor.

With the new global bar this is doubly bad: planner totals come
from ``estimate_sync`` which counts every file regardless of
whether ingest will succeed, so a single failure leaves the global
bar permanently stranded below 100% even after every source has
been processed.

Fix: move both ``progress.update`` calls into a ``finally:`` block
on the per-file ``try`` so they fire regardless of outcome. Dropped
the now-redundant ``continue`` from the ``except`` arm — there was
no code after the try/finally inside the loop body for it to skip.

Test: source-text inspection of ``ingest_once`` verifying both
advance calls live inside the ``finally`` block. End-to-end Progress
mocking would require significant fake-backend plumbing; the
structural check is precise enough to lock the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First task of `.planning/rfcs/rfc-corpus-growth-controls.md`. New module
`corpus_forge/admin/prune.py` exposes `prune_dataset(...)` returning a
`PruneReport` of the bottom-percentile chunks. Dry-run by default.

- Reuses sub-score primitives from `corpus_forge.curation.selector`
  (confidence_deficit, missing_metadata, freshness) and adds two
  prune-tuned signals: `duplicate_density` (skipped when the MinHash
  backend from `rfc-nlp-data-quality-signals.md` is absent) and
  `feedback_drag` (dual-probes `corpus.chunk_feedback` and
  `corpus.feedback` for shape-tolerance).
- Final rubric weights: 0.20/0.20/0.15/0.25/0.20 (sums to 1.0).
- Backend dispatch via capability probe (`_paramstyle == "pyformat"`),
  with a class-name fallback for test doubles.
- Apply path: prefers a `delete_chunks_by_ids` hook, else bulk
  `ANY(%s)` for Postgres, else chunked `IN (?, …)` for SQLite (batched
  at 500 ids).
- Unknown-dataset safety: raises `ValueError` before walking when a
  named dataset doesn't resolve — silent corpus-wide deletion under
  `apply=True` would be too sharp an edge.

Followups in separate RFC tasks (intentionally out of scope here):
- `corpus-forge prune` CLI verb.
- `GrowthConfig` Pydantic block (`prune_percentile_default`, etc.).
- `score_for_pruning(chunk, ...)` extension on the selector.
- Per-source caps (`max_rows`, `max_bytes`).
- `estimate sync` pre-flight.

22/22 prune-scorer tests pass; no regressions in curation suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

Walkthrough

This PR introduces dataset pruning as a new admin feature and improves ingest orchestration. The pruning module adds configurable candidate selection via a five-part rubric (including new MinHash-based duplicate detection and feedback-drag scoring) with Postgres/SQLite deletion dispatch. Ingest gains file-count planning, smarter error classification for embedder failures, and unified progress tracking via Rich.

Changes

Dataset Pruning Admin Feature

Layer / File(s) Summary
Prune Data Contracts & Module Setup
CHANGELOG.md, corpus_forge/admin/__init__.py, corpus_forge/admin/prune.py (lines 1–100, 107–149, 590–594)
PruneCandidate and PruneReport dataclasses capture per-candidate scores and report-level metadata; module exports these symbols and defines the five-part rubric weights, candidate pool defaults, and percentile bounds.
Candidate Scoring Subsystem
corpus_forge/admin/prune.py (lines 156–202, 209–302, 309–333)
MinHash-backed duplicate_density scorer with graceful degradation to 0.0; feedback-drag scorer that flags rejected/negative feedback; and _prune_reason generator that picks the top weighted contributor for human-readable explanations.
Deletion Dispatch & Backend Detection
corpus_forge/admin/prune.py (lines 340–423)
Backend-shape detection routes Postgres-like backends (via _paramstyle or class name) to bulk DELETE ... ANY(%s), SQLite-like backends to chunked IN (...) batches, with proper placeholder syntax and optional schema prefix.
Main Prune Flow & Orchestration
corpus_forge/admin/prune.py (lines 430–588)
prune_dataset validates percentile bounds, resolves dataset identity (safety check prevents cross-dataset pruning), loads candidate pool, scores each with the rubric, selects worst-first by percentile, groups by source URI stem, and conditionally deletes when apply=True.
Comprehensive Test Suite
tests/unit/test_prune_scorer.py
Validates dry-run/apply modes, percentile selection and bounds checking, score monotonicity and ranking, feedback-drag flipping and degradation, duplicate-density availability probing, apply-path deletion with Postgres/SQLite dispatch, report grouping by source stem, dataclass immutability, dataset safety refusal, and per-source file-count wiring.

Ingest Orchestration Improvements

Layer / File(s) Summary
Timing Instrumentation
corpus_forge/ingest.py (lines 5, 309–318)
time.perf_counter() now used directly; per-document chunking and DB-write elapsed times are measured and recorded for runtime profiling.
Planning & Error Classification Helpers
corpus_forge/ingest.py (lines 533–599, 635–643)
_plan_ingest walks filesystem-rooted sources upfront, returns per-source file counts, logs totals, and degrades gracefully when no filesystem roots exist; _classify_and_log_ingest_error detects embedder NaN/5xx issues (logged at WARNING with model-selection hint) vs default extractor failures.
Progress Reporting Refactor
corpus_forge/ingest.py (lines 663–674, 699–813)
Single Rich Progress instance replaces per-source make_progress contexts; global "Total" task sized to planned file counts with dynamically added/removed per-source child tasks; progress updates advance both tasks on every iteration inside finally: block (success or failure).
Ingest Helper Test Coverage
tests/unit/test_ingest_helpers.py (lines 73–146, 148–191, 192–242)
_plan_ingest returns file-count mapping for filesystem sources and empty dict for API-only configs; _classify_and_log_ingest_error logs NaN/500 at WARNING with hint vs generic failures; regression test confirms progress advances on both global and source tasks inside finally: block.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A pruner hops through chunks so deep,
Scores the worst and makes them reap,
While ingest counts and progresses fair,
With Postgres dispatch and SQLite care!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 47.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main addition: a prune_dataset() scoring module with reference to the RFC task tracking.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch nightly/prune-admin-011742Z

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/unit/test_ingest_helpers.py (2)

176-189: ⚡ Quick win

Consider verifying the log level for generic failures.

The NaN test thoroughly validates both the log level (WARNING) and message content, but this test only checks that "Extractor failed" appears in the message without asserting the expected log level. Adding a level assertion would make the test more robust and symmetric with the NaN test.

🔍 Suggested enhancement
         with caplog.at_level(logging.INFO, logger="corpus_forge.ingest.extract"):
             _classify_and_log_ingest_error(_Raw(), ValueError("PDF parse error"))
         records = [r for r in caplog.records if "Extractor failed" in r.getMessage()]
         assert records, "expected an extractor-failure line on a generic exception"
+        assert records[0].levelno == logging.ERROR  # or INFO/WARNING depending on expected behavior
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_ingest_helpers.py` around lines 176 - 189, Add an assertion
that the generic failure is logged at the expected WARNING level: in
test_generic_failure_keeps_extractor_taxonomy, after selecting records that
contain "Extractor failed" (using _classify_and_log_ingest_error on _Raw()),
assert that at least one matching record has record.levelno == logging.WARNING
(or record.levelname == "WARNING") so the test verifies both message content and
log level for generic failures; reference the test function
test_generic_failure_keeps_extractor_taxonomy and the
_classify_and_log_ingest_error call to locate where to add the assertion.

204-242: 💤 Low value

Consider more resilient pattern matching for the source inspection.

The test uses hardcoded indentation patterns ("\n for " with exactly 12 spaces) and exact parameter-order matching. While the docstring acknowledges the tradeoff of source inspection vs. full integration testing, the pattern matching could be made more resilient to formatting changes without significantly increasing complexity.

💡 More flexible alternatives

Using regex patterns would survive auto-formatting and minor refactoring:

import re

# Instead of exact indentation:
next_for = re.search(r'\n\s+for\s+', tail)
next_func = re.search(r'\ndef\s+', tail)

# Instead of exact parameter order:
assert re.search(r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1', finally_block)
assert re.search(r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1', finally_block)

This maintains the contract check while tolerating formatting and keyword-argument reordering.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_ingest_helpers.py` around lines 204 - 242, The test
test_loop_body_uses_finally_for_progress_update uses brittle exact-string
searches to locate the finally block and the two progress.update calls; update
it to use regex-based searches so minor formatting/arg-order changes don't break
the check: use re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or
equivalent) to find the finally block boundaries in the inspected source of
ingest.ingest_once, and assert the progress.update occurrences with patterns
like r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and
r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing
and keyword-argument reordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@corpus_forge/admin/prune.py`:
- Around line 263-271: The SQL fallback hard-codes the "corpus." schema (e.g. in
the execute calls inside the try/except where logger.debug is used and where
_coerce_to_dict_list wraps execute), which breaks non-default-schema Postgres
setups; remove the "corpus." prefix from the table names (use unqualified
chunk_feedback and feedback and keep the WHERE entity_type = 'chunk' clause) so
the connection's search_path/default schema is respected; apply the same change
for the other occurrence referenced (around the code at lines noted, e.g. the
later execute calls at ~395-397) and keep function names execute and
_coerce_to_dict_list and the surrounding try/except/logger.debug logic
unchanged.
- Around line 481-494: The code currently swallows exceptions from
backend.find_dataset_id_by_name and continues to a broad candidate walk; change
it to hard-fail on resolution errors and use the resolved id for the candidate
walk. Specifically, in the block that uses resolver = getattr(backend,
"find_dataset_id_by_name", None) and calls resolved = resolver(dataset), remove
the catch-then-default behavior that sets resolved = "<unknown>" and instead
re-raise or raise a ValueError when resolver raises; after a successful call, if
resolved is None raise ValueError(f"dataset {dataset!r} not found") and then
assign dataset = resolved so that _iter_curation_candidates(backend,
dataset=dataset, ...) uses the resolved id.

In `@corpus_forge/ingest.py`:
- Around line 711-717: The global progress task (global_task) is created with
total=global_total which only counts sources present in per_source_totals, but
the code always advances global_task for every processed item; only increment
global_task when the current source has a known planned total (i.e.,
source_total is not None and the source exists in per_source_totals). Update the
places that call progress.advance(global_task, ...) (including where
source_total is read/used) to guard the advance with a check like "if
source_total is not None" (or "if source_name in per_source_totals") so
unplanned/API-only sources don't change the global bar; apply the same
conditional to the other similar advance calls referenced (the blocks around
make_progress, global_task, and loops that process per-source items).

---

Nitpick comments:
In `@tests/unit/test_ingest_helpers.py`:
- Around line 176-189: Add an assertion that the generic failure is logged at
the expected WARNING level: in test_generic_failure_keeps_extractor_taxonomy,
after selecting records that contain "Extractor failed" (using
_classify_and_log_ingest_error on _Raw()), assert that at least one matching
record has record.levelno == logging.WARNING (or record.levelname == "WARNING")
so the test verifies both message content and log level for generic failures;
reference the test function test_generic_failure_keeps_extractor_taxonomy and
the _classify_and_log_ingest_error call to locate where to add the assertion.
- Around line 204-242: The test test_loop_body_uses_finally_for_progress_update
uses brittle exact-string searches to locate the finally block and the two
progress.update calls; update it to use regex-based searches so minor
formatting/arg-order changes don't break the check: use
re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or equivalent) to find the
finally block boundaries in the inspected source of ingest.ingest_once, and
assert the progress.update occurrences with patterns like
r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and
r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing
and keyword-argument reordering.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 214d9b1c-4aec-43f1-ada9-13b180b607a0

📥 Commits

Reviewing files that changed from the base of the PR and between c0e82aa and 2d4df08.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • corpus_forge/admin/__init__.py
  • corpus_forge/admin/prune.py
  • corpus_forge/ingest.py
  • tests/unit/test_ingest_helpers.py
  • tests/unit/test_prune_scorer.py

Comment on lines +263 to +271
execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback")
)
except Exception as exc:
logger.debug("corpus.chunk_feedback probe failed (%r) — trying corpus.feedback", exc)
try:
rows_typed = _coerce_to_dict_list(
execute(
"SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback "
"WHERE entity_type = 'chunk'"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-coded corpus. schema in SQL fallbacks.

The fallback SQL pins table names to corpus.*, which breaks when Postgres uses a non-default schema and weakens backend portability. Prefer unqualified table names and rely on backend search path / connection default schema.

🛠️ Suggested portability fix
-            execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback")
+            execute("SELECT chunk_id, kind, rating FROM chunk_feedback")
...
-                    "SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback "
+                    "SELECT entity_id AS chunk_id, kind, rating FROM feedback "
                     "WHERE entity_type = 'chunk'"
...
-                "DELETE FROM corpus.chunks WHERE id = ANY(%s)",
+                "DELETE FROM chunks WHERE id = ANY(%s)",

Also applies to: 395-397

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@corpus_forge/admin/prune.py` around lines 263 - 271, The SQL fallback
hard-codes the "corpus." schema (e.g. in the execute calls inside the try/except
where logger.debug is used and where _coerce_to_dict_list wraps execute), which
breaks non-default-schema Postgres setups; remove the "corpus." prefix from the
table names (use unqualified chunk_feedback and feedback and keep the WHERE
entity_type = 'chunk' clause) so the connection's search_path/default schema is
respected; apply the same change for the other occurrence referenced (around the
code at lines noted, e.g. the later execute calls at ~395-397) and keep function
names execute and _coerce_to_dict_list and the surrounding
try/except/logger.debug logic unchanged.

Comment on lines +481 to +494
if dataset is not None:
resolver: Any = getattr(backend, "find_dataset_id_by_name", None)
if callable(resolver):
try:
resolved = resolver(dataset)
except Exception: # pragma: no cover — defensive
logger.exception("find_dataset_id_by_name(%r) raised", dataset)
resolved = "<unknown>"
if resolved is None:
raise ValueError(f"dataset {dataset!r} not found")

candidates: list[_Candidate] = list(
_iter_curation_candidates(backend, dataset=dataset, limit=candidate_pool)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fail closed for dataset resolution before candidate walk.

A named dataset prune currently continues even when find_dataset_id_by_name raises, which can degrade into an unintended broad candidate walk. This path should hard-fail before scoring/deletion.

🔒 Suggested safety fix
     if dataset is not None:
         resolver: Any = getattr(backend, "find_dataset_id_by_name", None)
-        if callable(resolver):
-            try:
-                resolved = resolver(dataset)
-            except Exception:  # pragma: no cover — defensive
-                logger.exception("find_dataset_id_by_name(%r) raised", dataset)
-                resolved = "<unknown>"
-            if resolved is None:
-                raise ValueError(f"dataset {dataset!r} not found")
+        if not callable(resolver):
+            raise ValueError(
+                "dataset-scoped prune requires backend.find_dataset_id_by_name"
+            )
+        try:
+            resolved = resolver(dataset)
+        except Exception as exc:  # pragma: no cover — defensive
+            logger.exception("find_dataset_id_by_name(%r) raised", dataset)
+            raise RuntimeError(f"failed to resolve dataset {dataset!r}") from exc
+        if resolved is None:
+            raise ValueError(f"dataset {dataset!r} not found")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@corpus_forge/admin/prune.py` around lines 481 - 494, The code currently
swallows exceptions from backend.find_dataset_id_by_name and continues to a
broad candidate walk; change it to hard-fail on resolution errors and use the
resolved id for the candidate walk. Specifically, in the block that uses
resolver = getattr(backend, "find_dataset_id_by_name", None) and calls resolved
= resolver(dataset), remove the catch-then-default behavior that sets resolved =
"<unknown>" and instead re-raise or raise a ValueError when resolver raises;
after a successful call, if resolved is None raise ValueError(f"dataset
{dataset!r} not found") and then assign dataset = resolved so that
_iter_curation_candidates(backend, dataset=dataset, ...) uses the resolved id.

Comment thread corpus_forge/ingest.py
Comment on lines +711 to +717
global_total = sum(per_source_totals.values()) or None
with make_progress(
"Ingest (all sources)",
total=global_total,
logger=scan_logger,
) as progress:
global_task = progress.add_task("Total", total=global_total)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't advance the global task for unplanned sources.

global_total only includes sources present in per_source_totals, but Line 795 advances that task for every item. If any source has source_total is None (API-only source, planner miss, planner failure), the "Total" bar can reach 100% too early or overshoot its declared total.

💡 Suggested fix
-    global_total = sum(per_source_totals.values()) or None
+    all_source_ids = {
+        id(source_config)
+        for dataset in config.datasets
+        for source_config in dataset.sources
+    }
+    planned_source_ids = set(per_source_totals)
+    global_total = (
+        sum(per_source_totals.values()) if planned_source_ids == all_source_ids else None
+    ) or None
...
-                        progress.update(global_task, advance=1)
+                        if global_total is None or source_total is not None:
+                            progress.update(global_task, advance=1)

Also applies to: 757-760, 785-795

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@corpus_forge/ingest.py` around lines 711 - 717, The global progress task
(global_task) is created with total=global_total which only counts sources
present in per_source_totals, but the code always advances global_task for every
processed item; only increment global_task when the current source has a known
planned total (i.e., source_total is not None and the source exists in
per_source_totals). Update the places that call progress.advance(global_task,
...) (including where source_total is read/used) to guard the advance with a
check like "if source_total is not None" (or "if source_name in
per_source_totals") so unplanned/API-only sources don't change the global bar;
apply the same conditional to the other similar advance calls referenced (the
blocks around make_progress, global_task, and loops that process per-source
items).

@ulmentflam ulmentflam merged commit 6e906e8 into main May 25, 2026
14 checks passed
ulmentflam added a commit that referenced this pull request May 25, 2026
Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the
prune-tuned weighted combination from `corpus_forge/admin/prune.py`
into a new public function on `corpus_forge.curation.selector` so
future callers (CLI verb, MCP surface, EDA dashboards) can score
candidates the same way without re-implementing the rubric.

- `corpus_forge.curation.selector.score_for_pruning(candidate, *,
  sub_scores, weights=None) -> tuple[float, dict[str, float]]`
- Default weights exposed as the public alias `PRUNE_WEIGHTS` (the
  internal `_PRUNE_WEIGHTS` stays private for the module's own use).
- Validates: weight keys (extra/missing), weight sum (within 1e-6 of
  1.0), sub_score completeness. Raises `ValueError` on any mismatch.
- `corpus_forge/admin/prune.py` refactored: drops its local
  `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new
  function. All 22 tests in `tests/unit/test_prune_scorer.py` pass
  UNMODIFIED (the regression net for the refactor).

12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`;
no regressions in curation suites.

Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will
retarget to `main` automatically when PR #50 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ulmentflam added a commit that referenced this pull request May 25, 2026
…ls (#51)

Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the
prune-tuned weighted combination from `corpus_forge/admin/prune.py`
into a new public function on `corpus_forge.curation.selector` so
future callers (CLI verb, MCP surface, EDA dashboards) can score
candidates the same way without re-implementing the rubric.

- `corpus_forge.curation.selector.score_for_pruning(candidate, *,
  sub_scores, weights=None) -> tuple[float, dict[str, float]]`
- Default weights exposed as the public alias `PRUNE_WEIGHTS` (the
  internal `_PRUNE_WEIGHTS` stays private for the module's own use).
- Validates: weight keys (extra/missing), weight sum (within 1e-6 of
  1.0), sub_score completeness. Raises `ValueError` on any mismatch.
- `corpus_forge/admin/prune.py` refactored: drops its local
  `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new
  function. All 22 tests in `tests/unit/test_prune_scorer.py` pass
  UNMODIFIED (the regression net for the refactor).

12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`;
no regressions in curation suites.

Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will
retarget to `main` automatically when PR #50 merges.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ulmentflam ulmentflam deleted the nightly/prune-admin-011742Z branch May 26, 2026 15:57
ulmentflam added a commit that referenced this pull request May 26, 2026
Cuts a feature release on top of v0.1.0b9 — 14 commits since the
b9 tag, including new CLI surface, three RFC foundation tasks,
and the bisecting OpenAI embedder recovery.

Highlights
----------

**RFC `rfc-corpus-growth-controls` (P1) — four landed tasks:**

- New `corpus-forge prune --dataset NAME` CLI verb (#52). Dry-run
  default; `--apply` deletes the bottom-percentile chunks scored by
  `score_for_pruning()`. `--dry-run-json` emits the plan for CI / agent
  inspection.
- `prune_dataset()` module + scoring rubric (#50) — Postgres / SQLite
  dispatch via `_is_postgres_like` capability probe; named-but-unknown
  datasets raise `ValueError` before any candidate walk (safety guard
  under `apply=True`). 22 unit tests pin the rubric.
- Public `score_for_pruning()` extracted onto
  `corpus_forge.curation.selector` (#51); default weights exposed as
  `PRUNE_WEIGHTS`.
- Per-source cap enforcement in `ingest` (#53) — evicts lowest-scoring
  chunks after each source completes a scan; attribution via URI
  prefix matching with a one-line WARNING fall-through for non-unique
  scheme plugins.
- `[growth]` config block (#37) + `DatasetSourceConfig.max_rows` /
  `max_bytes` storage.

**RFC `rfc-eval-framework-expansion` (P1) — two foundation tasks:**

- `corpus_forge.eval._schema.EvalOutput` shared output envelope (#38).
  Pydantic v2 with `extra='forbid'` so future evaluators can't widen
  the envelope.
- `[eval_regression]` config block + `EvalRegressionConfig` (#41)
  for the future `eval regression --baseline` verb's tolerance gating.

**RFC `rfc-nlp-data-quality-signals` (P1) — one foundation task:**

- `corpus_forge.quality.HeuristicQualityEnricher` (#40) — pure-Python
  composite quality scorer (token-rate, punctuation, repetition,
  shouting → weighted geometric mean).

**RFC `rfc-developer-ux-verbs` (P3) — one checkbox closed:**

- `corpus-forge logs tail --level <name>` minimum-severity filter (#39).

**Embedder reliability (the bug that triggered all this):**

- `OpenAIEmbedder.encode` bisects on failure (#49). On a 5xx or
  NaN-laced response the batch is recursively halved until the
  offending chunk is isolated, logged at WARNING (orig_idx / chars /
  sha256, no PII preview), and skipped. `max_retries=0` on the SDK
  client so failures fast-fail (~1s instead of ~7s). Non-recoverable
  4xx errors (auth, missing model, 400/422) re-raise immediately
  instead of being bisected — bisection can't fix a wrong config.
  Row-count mismatches also bisect. `backfill_embedder` exits its
  inner loop on an all-skipped batch instead of refetching the same
  `chunks_missing_embedding` rows forever. Unblocks the maintainer's
  Ollama-served `qwen3-embedding:8b` instance which intermittently
  returned NaN-laced 500s for ~3% of chunks.
- `embedder_drift` doctor check + `corpus-forge embedder gc` CLI (#48).
  Catches the silent embedder-rename bug — renaming an embedder in
  `config.toml` left an orphan `embeddings_<name>` table + catalog
  row behind, invisible to `embedder list`. The maintainer's instance
  had a 209 MB orphan on top of a 342 MB real-data DB before this
  landed.

**macOS install hardening (#47):**

- `corpus_forge.macos_tcc` module — iCloud Drive + TCC integration.
- `icloud_access` doctor check.
- TCC handshake in `corpus-forge setup` and
  `service install --launchd`.
- `FilesystemSource.parse` proactively `brctl download`s evicted
  iCloud placeholders.

Hygiene
-------

CHANGELOG cleanup: 9 entries from PRs that landed AFTER v0.1.0b9 was
tagged were misfiled into the `[0.1.0b8]` `### Added` section (they
got dropped there by automated CHANGELOG patchers running on stale
checkouts). Moved them up under the new `[0.1.0b10]` section so the
b8 release notes reflect what actually shipped in b8.

Verified
--------

- `uv run pytest tests/unit/test_phase_ci3_*.py` — 69 passed, 1
  skipped (full wheel install round-trip behind `CI3_FULL_INSTALL=1`).
- All six version-pin sites updated: `pyproject.toml`,
  `corpus_forge/__init__.py`, `uv.lock`, and two `test_phase_ci3_*`
  literal-string asserts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant