feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9) by ulmentflam · Pull Request #50 · ulmentflam/corpus-forge

ulmentflam · 2026-05-23T01:51:04Z

Summary

First task of rfc-corpus-growth-controls.md: introduce corpus_forge/admin/prune.py with a prune_dataset(...) entry point that scores chunks on a weighted rubric and selects the bottom N% as prune candidates. Default is dry-run; deletion only happens with explicit apply=True.

prune_dataset(backend, *, dataset, percentile=10, apply=False, ...) -> PruneReport — scoring + selection + (gated) deletion.
PruneReport carries the per-chunk score breakdowns, source-grouped summary, and a duplicate_density_available flag so callers can warn when the MinHash backend is absent.
Reuses the existing curation sub-score primitives in corpus_forge.curation.selector. Adds two new prune-tuned signals (duplicate_density, feedback_drag) inline in prune.py.
Backend-agnostic apply path: prefers a delete_chunks_by_ids hook, falls back to bulk ANY(%s) (Postgres) or chunked IN (?, …) (SQLite, 500-id batches). Dispatch keys on _paramstyle == "pyformat" with a class-name fallback for test doubles.
Unknown-dataset safety: raises ValueError before any candidate walk when a named dataset doesn't resolve — silent corpus-wide deletion under apply=True would be too sharp an edge.

Explicitly out of scope (separate RFC tasks)

corpus-forge prune CLI verb.
GrowthConfig Pydantic block (prune_percentile_default, etc.).
score_for_pruning(chunk, ...) extension on the selector.
Per-source caps (max_rows, max_bytes).
estimate sync pre-flight.
tests/integration/test_prune_e2e.py (postponed to the CLI-verb task where there's an end-to-end surface).

Test plan

tests/unit/test_prune_scorer.py — 22 tests pass.
- Dry-run default never touches the delete path.
- Score ordering (worst rows surface first).
- Percentile bounds (validated 0–100 inclusive; ValueError on out-of-range).
- feedback_drag flips on rejected feedback / negative rating.
- duplicate_density silently zeros when MinHash unimportable; flag promoted to PruneReport.
- apply=True calls the delete_chunks_by_ids hook when present.
- Postgres bulk ANY(%s) branch covered via _paramstyle="pyformat" test double.
- SQLite chunked IN (?, ?, …) branch covered via 7-id pool with batch size monkey-patched to 3 (3 batches, last batch = 1).
- Unknown-dataset → ValueError before walk.
No regressions in tests/unit/test_curation_selector.py / test_selector_learned_signal.py (103 tests pass combined).
ruff check clean.
pyrefly check clean (pre-commit hook).
Manual smoke against a live Postgres dataset — deferred to CLI-verb task.

Disclosures

Detail in .nightly/runs/2026-05-23T01-15-27Z/tasks/0001-prune-admin-module/uncertainty.md:

chunk_feedback vs corpus.feedback schema discrepancy — dual-probed, both shapes tolerated.
Postgres dispatch heuristic (_paramstyle probe + class-name fallback).
Tie-break behavior on equal prune_score — Python's stable sort preserves backend yield order; not contract-tested.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added dataset pruning functionality with configurable scoring rubric and optional batch deletion of low-quality chunks.
- Enhanced ingestion progress reporting with per-source task tracking and improved runtime metrics.
Bug Fixes
- Improved error classification during data ingestion to better identify and report embedder-related failures.
Documentation
- Updated changelog with details on pruning module capabilities.
Tests
- Added comprehensive test coverage for pruning and ingestion workflows.

Two issues surfaced during a real ingest run: 1. The per-source progress bar was unbounded (``⠼ Ingest 0:00:01``, no percentage, no live ETA), even though the up-front ``_log_ingest_eta`` walk had already counted every file. The ETA was logged once at startup and then disappeared during the actual work — the user couldn't tell what fraction was done. 2. Ollama embedder 5xx with ``"failed to encode response: json: unsupported value: NaN"`` was logged as ``"Extractor failed on <file>"``. That mis-attributes a model quirk (NaN in the embedding vector) to file content, sending users hunting for "broken" files instead of revisiting the embedder choice. Changes: - Rename ``_log_ingest_eta`` → ``_plan_ingest``. Same walk + same ETA log, but now also returns a ``dict[id(source_config), int]`` with per-source file counts. The previous "walk twice" cost (planner + ``source.scan()``) for filesystem-rooted sources is preserved on the planner side and consumed on the ingest side. - ``ingest_once`` threads each source's total into ``make_progress``. Rich's bounded-mode columns (``BarColumn``, ``MofNCompleteColumn``, ``TimeRemainingColumn``) auto-render a live ETA from total + advance, so users now see ``X/Y`` + remaining time. Sources without a resolvable filesystem root (API-only plugins) keep the previous unbounded behaviour. - New ``_classify_and_log_ingest_error`` classifies per-document failures: ``"unsupported value: NaN"`` → WARNING with a hint to swap embedders or filter empty chunks; embedder 5xx → WARNING; everything else keeps the historic ``extract_logger.info( "Extractor failed on X: …")`` taxonomy so existing grep patterns / dashboards still match. Tests: 3 new (planner returns per-source counts; planner returns ``{}`` when no filesystem roots; NaN message logs at WARNING with actionable hint; generic failure keeps the extractor-failed wording). 173 estimate/ingest/runtime_profile tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds on the per-source progress bar from the previous commit by adding a persistent *global* bar that tracks total work across every source in the run. The user asked for "a global estimation of total time remaining on ALL items as a progress bar" — this is that. Layout while ingest runs:: Ingest (all sources) ━━━━━━━━ 4,321/250,003 0:01:30 1d 11h 30m filesystem ━━━━━━━━ 4,321/51,001 0:01:30 0:08:13 The top bar persists for the whole run; the indented per-source bar appears when each source starts and is removed when it finishes, so exactly one source task is visible at any time underneath the global view. Why one ``Progress``, not two ----------------------------- Rich's ``Live`` (which ``Progress`` wraps) is single-active per console — two nested ``make_progress`` blocks would race for the screen. The fix is to keep a single ``Progress`` instance for the duration of ``ingest_once`` and use ``add_task`` / ``remove_task`` to slot each source's child task in and out underneath the persistent global task. Total = ``sum(_plan_ingest(config).values())`` or ``None`` when no source contributed a count (API-only configs keep the previous indeterminate bar). Other tidying ------------- - Hoisted the ``import time`` that lived inside ``ingest_one`` to the top of the module — it's used both there and in the new per-source timing block, and the in-loop ``import time as _time`` was needless ceremony now that ``ingest.py`` already imports stdlib at the top. - Replaced the old single ``"Scan complete: N documents"`` log with a paired ``"Ingest (X) started"`` / ``"Ingest (X) complete: N in Ts (rate R/s)"`` so the rotating log still has the bookend pair that ``make_progress``'s ``logger=`` was previously providing per source (the outer ``make_progress`` now logs once for the whole run, not per source). All 173 estimate + ingest + runtime_profile + MCP-estimate tests pass. Pre-commit + pre-push hook gates clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer-flagged regression: the source + global progress bars only advanced on successful ``ingest_one`` calls because the ``progress.update(..., advance=1)`` calls lived AFTER the ``try/except`` block and the ``except`` branch ended with ``continue``. Per-file failures (Ollama-NaN 5xx, extractor crashes, …) silently dropped advances on the floor. With the new global bar this is doubly bad: planner totals come from ``estimate_sync`` which counts every file regardless of whether ingest will succeed, so a single failure leaves the global bar permanently stranded below 100% even after every source has been processed. Fix: move both ``progress.update`` calls into a ``finally:`` block on the per-file ``try`` so they fire regardless of outcome. Dropped the now-redundant ``continue`` from the ``except`` arm — there was no code after the try/finally inside the loop body for it to skip. Test: source-text inspection of ``ingest_once`` verifying both advance calls live inside the ``finally`` block. End-to-end Progress mocking would require significant fake-backend plumbing; the structural check is precise enough to lock the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First task of `.planning/rfcs/rfc-corpus-growth-controls.md`. New module `corpus_forge/admin/prune.py` exposes `prune_dataset(...)` returning a `PruneReport` of the bottom-percentile chunks. Dry-run by default. - Reuses sub-score primitives from `corpus_forge.curation.selector` (confidence_deficit, missing_metadata, freshness) and adds two prune-tuned signals: `duplicate_density` (skipped when the MinHash backend from `rfc-nlp-data-quality-signals.md` is absent) and `feedback_drag` (dual-probes `corpus.chunk_feedback` and `corpus.feedback` for shape-tolerance). - Final rubric weights: 0.20/0.20/0.15/0.25/0.20 (sums to 1.0). - Backend dispatch via capability probe (`_paramstyle == "pyformat"`), with a class-name fallback for test doubles. - Apply path: prefers a `delete_chunks_by_ids` hook, else bulk `ANY(%s)` for Postgres, else chunked `IN (?, …)` for SQLite (batched at 500 ids). - Unknown-dataset safety: raises `ValueError` before walking when a named dataset doesn't resolve — silent corpus-wide deletion under `apply=True` would be too sharp an edge. Followups in separate RFC tasks (intentionally out of scope here): - `corpus-forge prune` CLI verb. - `GrowthConfig` Pydantic block (`prune_percentile_default`, etc.). - `score_for_pruning(chunk, ...)` extension on the selector. - Per-source caps (`max_rows`, `max_bytes`). - `estimate sync` pre-flight. 22/22 prune-scorer tests pass; no regressions in curation suites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-23T01:51:10Z

📝 Walkthrough

Walkthrough

This PR introduces dataset pruning as a new admin feature and improves ingest orchestration. The pruning module adds configurable candidate selection via a five-part rubric (including new MinHash-based duplicate detection and feedback-drag scoring) with Postgres/SQLite deletion dispatch. Ingest gains file-count planning, smarter error classification for embedder failures, and unified progress tracking via Rich.

Changes

Dataset Pruning Admin Feature

Layer / File(s)	Summary
Prune Data Contracts & Module Setup `CHANGELOG.md`, `corpus_forge/admin/__init__.py`, `corpus_forge/admin/prune.py` (lines 1–100, 107–149, 590–594)	`PruneCandidate` and `PruneReport` dataclasses capture per-candidate scores and report-level metadata; module exports these symbols and defines the five-part rubric weights, candidate pool defaults, and percentile bounds.
Candidate Scoring Subsystem `corpus_forge/admin/prune.py` (lines 156–202, 209–302, 309–333)	MinHash-backed `duplicate_density` scorer with graceful degradation to 0.0; feedback-drag scorer that flags rejected/negative feedback; and `_prune_reason` generator that picks the top weighted contributor for human-readable explanations.
Deletion Dispatch & Backend Detection `corpus_forge/admin/prune.py` (lines 340–423)	Backend-shape detection routes Postgres-like backends (via `_paramstyle` or class name) to bulk `DELETE ... ANY(%s)`, SQLite-like backends to chunked `IN (...)` batches, with proper placeholder syntax and optional schema prefix.
Main Prune Flow & Orchestration `corpus_forge/admin/prune.py` (lines 430–588)	`prune_dataset` validates percentile bounds, resolves dataset identity (safety check prevents cross-dataset pruning), loads candidate pool, scores each with the rubric, selects worst-first by percentile, groups by source URI stem, and conditionally deletes when `apply=True`.
Comprehensive Test Suite `tests/unit/test_prune_scorer.py`	Validates dry-run/apply modes, percentile selection and bounds checking, score monotonicity and ranking, feedback-drag flipping and degradation, duplicate-density availability probing, apply-path deletion with Postgres/SQLite dispatch, report grouping by source stem, dataclass immutability, dataset safety refusal, and per-source file-count wiring.

Ingest Orchestration Improvements

Layer / File(s)	Summary
Timing Instrumentation `corpus_forge/ingest.py` (lines 5, 309–318)	`time.perf_counter()` now used directly; per-document chunking and DB-write elapsed times are measured and recorded for runtime profiling.
Planning & Error Classification Helpers `corpus_forge/ingest.py` (lines 533–599, 635–643)	`_plan_ingest` walks filesystem-rooted sources upfront, returns per-source file counts, logs totals, and degrades gracefully when no filesystem roots exist; `_classify_and_log_ingest_error` detects embedder NaN/5xx issues (logged at `WARNING` with model-selection hint) vs default extractor failures.
Progress Reporting Refactor `corpus_forge/ingest.py` (lines 663–674, 699–813)	Single Rich `Progress` instance replaces per-source `make_progress` contexts; global "Total" task sized to planned file counts with dynamically added/removed per-source child tasks; progress updates advance both tasks on every iteration inside `finally:` block (success or failure).
Ingest Helper Test Coverage `tests/unit/test_ingest_helpers.py` (lines 73–146, 148–191, 192–242)	`_plan_ingest` returns file-count mapping for filesystem sources and empty dict for API-only configs; `_classify_and_log_ingest_error` logs NaN/500 at `WARNING` with hint vs generic failures; regression test confirms progress advances on both global and source tasks inside `finally:` block.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A pruner hops through chunks so deep,
Scores the worst and makes them reap,
While ingest counts and progresses fair,
With Postgres dispatch and SQLite care!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.46% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main addition: a prune_dataset() scoring module with reference to the RFC task tracking.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch nightly/prune-admin-011742Z

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/unit/test_ingest_helpers.py (2)
176-189: ⚡ Quick win

Consider verifying the log level for generic failures.

The NaN test thoroughly validates both the log level (WARNING) and message content, but this test only checks that "Extractor failed" appears in the message without asserting the expected log level. Adding a level assertion would make the test more robust and symmetric with the NaN test.
🔍 Suggested enhancement
         with caplog.at_level(logging.INFO, logger="corpus_forge.ingest.extract"):
             _classify_and_log_ingest_error(_Raw(), ValueError("PDF parse error"))
         records = [r for r in caplog.records if "Extractor failed" in r.getMessage()]
         assert records, "expected an extractor-failure line on a generic exception"
+        assert records[0].levelno == logging.ERROR  # or INFO/WARNING depending on expected behavior
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_ingest_helpers.py` around lines 176 - 189, Add an assertion
that the generic failure is logged at the expected WARNING level: in
test_generic_failure_keeps_extractor_taxonomy, after selecting records that
contain "Extractor failed" (using _classify_and_log_ingest_error on _Raw()),
assert that at least one matching record has record.levelno == logging.WARNING
(or record.levelname == "WARNING") so the test verifies both message content and
log level for generic failures; reference the test function
test_generic_failure_keeps_extractor_taxonomy and the
_classify_and_log_ingest_error call to locate where to add the assertion.
204-242: 💤 Low value

Consider more resilient pattern matching for the source inspection.

The test uses hardcoded indentation patterns ("\n for " with exactly 12 spaces) and exact parameter-order matching. While the docstring acknowledges the tradeoff of source inspection vs. full integration testing, the pattern matching could be made more resilient to formatting changes without significantly increasing complexity.
💡 More flexible alternatives

Using regex patterns would survive auto-formatting and minor refactoring:
import re

# Instead of exact indentation:
next_for = re.search(r'\n\s+for\s+', tail)
next_func = re.search(r'\ndef\s+', tail)

# Instead of exact parameter order:
assert re.search(r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1', finally_block)
assert re.search(r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1', finally_block)
This maintains the contract check while tolerating formatting and keyword-argument reordering.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_ingest_helpers.py` around lines 204 - 242, The test
test_loop_body_uses_finally_for_progress_update uses brittle exact-string
searches to locate the finally block and the two progress.update calls; update
it to use regex-based searches so minor formatting/arg-order changes don't break
the check: use re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or
equivalent) to find the finally block boundaries in the inspected source of
ingest.ingest_once, and assert the progress.update occurrences with patterns
like r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and
r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing
and keyword-argument reordering.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@corpus_forge/admin/prune.py`:
- Around line 263-271: The SQL fallback hard-codes the "corpus." schema (e.g. in
the execute calls inside the try/except where logger.debug is used and where
_coerce_to_dict_list wraps execute), which breaks non-default-schema Postgres
setups; remove the "corpus." prefix from the table names (use unqualified
chunk_feedback and feedback and keep the WHERE entity_type = 'chunk' clause) so
the connection's search_path/default schema is respected; apply the same change
for the other occurrence referenced (around the code at lines noted, e.g. the
later execute calls at ~395-397) and keep function names execute and
_coerce_to_dict_list and the surrounding try/except/logger.debug logic
unchanged.
- Around line 481-494: The code currently swallows exceptions from
backend.find_dataset_id_by_name and continues to a broad candidate walk; change
it to hard-fail on resolution errors and use the resolved id for the candidate
walk. Specifically, in the block that uses resolver = getattr(backend,
"find_dataset_id_by_name", None) and calls resolved = resolver(dataset), remove
the catch-then-default behavior that sets resolved = "<unknown>" and instead
re-raise or raise a ValueError when resolver raises; after a successful call, if
resolved is None raise ValueError(f"dataset {dataset!r} not found") and then
assign dataset = resolved so that _iter_curation_candidates(backend,
dataset=dataset, ...) uses the resolved id.

In `@corpus_forge/ingest.py`:
- Around line 711-717: The global progress task (global_task) is created with
total=global_total which only counts sources present in per_source_totals, but
the code always advances global_task for every processed item; only increment
global_task when the current source has a known planned total (i.e.,
source_total is not None and the source exists in per_source_totals). Update the
places that call progress.advance(global_task, ...) (including where
source_total is read/used) to guard the advance with a check like "if
source_total is not None" (or "if source_name in per_source_totals") so
unplanned/API-only sources don't change the global bar; apply the same
conditional to the other similar advance calls referenced (the blocks around
make_progress, global_task, and loops that process per-source items).

---

Nitpick comments:
In `@tests/unit/test_ingest_helpers.py`:
- Around line 176-189: Add an assertion that the generic failure is logged at
the expected WARNING level: in test_generic_failure_keeps_extractor_taxonomy,
after selecting records that contain "Extractor failed" (using
_classify_and_log_ingest_error on _Raw()), assert that at least one matching
record has record.levelno == logging.WARNING (or record.levelname == "WARNING")
so the test verifies both message content and log level for generic failures;
reference the test function test_generic_failure_keeps_extractor_taxonomy and
the _classify_and_log_ingest_error call to locate where to add the assertion.
- Around line 204-242: The test test_loop_body_uses_finally_for_progress_update
uses brittle exact-string searches to locate the finally block and the two
progress.update calls; update it to use regex-based searches so minor
formatting/arg-order changes don't break the check: use
re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or equivalent) to find the
finally block boundaries in the inspected source of ingest.ingest_once, and
assert the progress.update occurrences with patterns like
r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and
r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing
and keyword-argument reordering.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 214d9b1c-4aec-43f1-ada9-13b180b607a0

📥 Commits

Reviewing files that changed from the base of the PR and between c0e82aa and 2d4df08.

📒 Files selected for processing (6)

CHANGELOG.md
corpus_forge/admin/__init__.py
corpus_forge/admin/prune.py
corpus_forge/ingest.py
tests/unit/test_ingest_helpers.py
tests/unit/test_prune_scorer.py

coderabbitai · 2026-05-25T17:11:47Z

+            execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback")
+        )
+    except Exception as exc:
+        logger.debug("corpus.chunk_feedback probe failed (%r) — trying corpus.feedback", exc)
+        try:
+            rows_typed = _coerce_to_dict_list(
+                execute(
+                    "SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback "
+                    "WHERE entity_type = 'chunk'"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-coded corpus. schema in SQL fallbacks.

The fallback SQL pins table names to corpus.*, which breaks when Postgres uses a non-default schema and weakens backend portability. Prefer unqualified table names and rely on backend search path / connection default schema.

🛠️ Suggested portability fix

- execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback") + execute("SELECT chunk_id, kind, rating FROM chunk_feedback") ... - "SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback " + "SELECT entity_id AS chunk_id, kind, rating FROM feedback " "WHERE entity_type = 'chunk'" ... - "DELETE FROM corpus.chunks WHERE id = ANY(%s)", + "DELETE FROM chunks WHERE id = ANY(%s)",

Also applies to: 395-397

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@corpus_forge/admin/prune.py` around lines 263 - 271, The SQL fallback hard-codes the "corpus." schema (e.g. in the execute calls inside the try/except where logger.debug is used and where _coerce_to_dict_list wraps execute), which breaks non-default-schema Postgres setups; remove the "corpus." prefix from the table names (use unqualified chunk_feedback and feedback and keep the WHERE entity_type = 'chunk' clause) so the connection's search_path/default schema is respected; apply the same change for the other occurrence referenced (around the code at lines noted, e.g. the later execute calls at ~395-397) and keep function names execute and _coerce_to_dict_list and the surrounding try/except/logger.debug logic unchanged.

coderabbitai · 2026-05-25T17:11:47Z

+    if dataset is not None:
+        resolver: Any = getattr(backend, "find_dataset_id_by_name", None)
+        if callable(resolver):
+            try:
+                resolved = resolver(dataset)
+            except Exception:  # pragma: no cover — defensive
+                logger.exception("find_dataset_id_by_name(%r) raised", dataset)
+                resolved = "<unknown>"
+            if resolved is None:
+                raise ValueError(f"dataset {dataset!r} not found")
+
+    candidates: list[_Candidate] = list(
+        _iter_curation_candidates(backend, dataset=dataset, limit=candidate_pool)
+    )


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fail closed for dataset resolution before candidate walk.

A named dataset prune currently continues even when find_dataset_id_by_name raises, which can degrade into an unintended broad candidate walk. This path should hard-fail before scoring/deletion.

🔒 Suggested safety fix

if dataset is not None: resolver: Any = getattr(backend, "find_dataset_id_by_name", None) - if callable(resolver): - try: - resolved = resolver(dataset) - except Exception: # pragma: no cover — defensive - logger.exception("find_dataset_id_by_name(%r) raised", dataset) - resolved = "<unknown>" - if resolved is None: - raise ValueError(f"dataset {dataset!r} not found") + if not callable(resolver): + raise ValueError( + "dataset-scoped prune requires backend.find_dataset_id_by_name" + ) + try: + resolved = resolver(dataset) + except Exception as exc: # pragma: no cover — defensive + logger.exception("find_dataset_id_by_name(%r) raised", dataset) + raise RuntimeError(f"failed to resolve dataset {dataset!r}") from exc + if resolved is None: + raise ValueError(f"dataset {dataset!r} not found")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@corpus_forge/admin/prune.py` around lines 481 - 494, The code currently swallows exceptions from backend.find_dataset_id_by_name and continues to a broad candidate walk; change it to hard-fail on resolution errors and use the resolved id for the candidate walk. Specifically, in the block that uses resolver = getattr(backend, "find_dataset_id_by_name", None) and calls resolved = resolver(dataset), remove the catch-then-default behavior that sets resolved = "<unknown>" and instead re-raise or raise a ValueError when resolver raises; after a successful call, if resolved is None raise ValueError(f"dataset {dataset!r} not found") and then assign dataset = resolved so that _iter_curation_candidates(backend, dataset=dataset, ...) uses the resolved id.

coderabbitai · 2026-05-25T17:11:47Z

+    global_total = sum(per_source_totals.values()) or None
+    with make_progress(
+        "Ingest (all sources)",
+        total=global_total,
+        logger=scan_logger,
+    ) as progress:
+        global_task = progress.add_task("Total", total=global_total)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't advance the global task for unplanned sources.

global_total only includes sources present in per_source_totals, but Line 795 advances that task for every item. If any source has source_total is None (API-only source, planner miss, planner failure), the "Total" bar can reach 100% too early or overshoot its declared total.

💡 Suggested fix

- global_total = sum(per_source_totals.values()) or None + all_source_ids = { + id(source_config) + for dataset in config.datasets + for source_config in dataset.sources + } + planned_source_ids = set(per_source_totals) + global_total = ( + sum(per_source_totals.values()) if planned_source_ids == all_source_ids else None + ) or None ... - progress.update(global_task, advance=1) + if global_total is None or source_total is not None: + progress.update(global_task, advance=1)

Also applies to: 757-760, 785-795

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@corpus_forge/ingest.py` around lines 711 - 717, The global progress task (global_task) is created with total=global_total which only counts sources present in per_source_totals, but the code always advances global_task for every processed item; only increment global_task when the current source has a known planned total (i.e., source_total is not None and the source exists in per_source_totals). Update the places that call progress.advance(global_task, ...) (including where source_total is read/used) to guard the advance with a check like "if source_total is not None" (or "if source_name in per_source_totals") so unplanned/API-only sources don't change the global bar; apply the same conditional to the other similar advance calls referenced (the blocks around make_progress, global_task, and loops that process per-source items).

Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric. - `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]` - Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use). - Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch. - `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function. All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites. Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ls (#51) Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric. - `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]` - Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use). - Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch. - `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function. All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites. Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cuts a feature release on top of v0.1.0b9 — 14 commits since the b9 tag, including new CLI surface, three RFC foundation tasks, and the bisecting OpenAI embedder recovery. Highlights ---------- **RFC `rfc-corpus-growth-controls` (P1) — four landed tasks:** - New `corpus-forge prune --dataset NAME` CLI verb (#52). Dry-run default; `--apply` deletes the bottom-percentile chunks scored by `score_for_pruning()`. `--dry-run-json` emits the plan for CI / agent inspection. - `prune_dataset()` module + scoring rubric (#50) — Postgres / SQLite dispatch via `_is_postgres_like` capability probe; named-but-unknown datasets raise `ValueError` before any candidate walk (safety guard under `apply=True`). 22 unit tests pin the rubric. - Public `score_for_pruning()` extracted onto `corpus_forge.curation.selector` (#51); default weights exposed as `PRUNE_WEIGHTS`. - Per-source cap enforcement in `ingest` (#53) — evicts lowest-scoring chunks after each source completes a scan; attribution via URI prefix matching with a one-line WARNING fall-through for non-unique scheme plugins. - `[growth]` config block (#37) + `DatasetSourceConfig.max_rows` / `max_bytes` storage. **RFC `rfc-eval-framework-expansion` (P1) — two foundation tasks:** - `corpus_forge.eval._schema.EvalOutput` shared output envelope (#38). Pydantic v2 with `extra='forbid'` so future evaluators can't widen the envelope. - `[eval_regression]` config block + `EvalRegressionConfig` (#41) for the future `eval regression --baseline` verb's tolerance gating. **RFC `rfc-nlp-data-quality-signals` (P1) — one foundation task:** - `corpus_forge.quality.HeuristicQualityEnricher` (#40) — pure-Python composite quality scorer (token-rate, punctuation, repetition, shouting → weighted geometric mean). **RFC `rfc-developer-ux-verbs` (P3) — one checkbox closed:** - `corpus-forge logs tail --level <name>` minimum-severity filter (#39). **Embedder reliability (the bug that triggered all this):** - `OpenAIEmbedder.encode` bisects on failure (#49). On a 5xx or NaN-laced response the batch is recursively halved until the offending chunk is isolated, logged at WARNING (orig_idx / chars / sha256, no PII preview), and skipped. `max_retries=0` on the SDK client so failures fast-fail (~1s instead of ~7s). Non-recoverable 4xx errors (auth, missing model, 400/422) re-raise immediately instead of being bisected — bisection can't fix a wrong config. Row-count mismatches also bisect. `backfill_embedder` exits its inner loop on an all-skipped batch instead of refetching the same `chunks_missing_embedding` rows forever. Unblocks the maintainer's Ollama-served `qwen3-embedding:8b` instance which intermittently returned NaN-laced 500s for ~3% of chunks. - `embedder_drift` doctor check + `corpus-forge embedder gc` CLI (#48). Catches the silent embedder-rename bug — renaming an embedder in `config.toml` left an orphan `embeddings_<name>` table + catalog row behind, invisible to `embedder list`. The maintainer's instance had a 209 MB orphan on top of a 342 MB real-data DB before this landed. **macOS install hardening (#47):** - `corpus_forge.macos_tcc` module — iCloud Drive + TCC integration. - `icloud_access` doctor check. - TCC handshake in `corpus-forge setup` and `service install --launchd`. - `FilesystemSource.parse` proactively `brctl download`s evicted iCloud placeholders. Hygiene ------- CHANGELOG cleanup: 9 entries from PRs that landed AFTER v0.1.0b9 was tagged were misfiled into the `[0.1.0b8]` `### Added` section (they got dropped there by automated CHANGELOG patchers running on stale checkouts). Moved them up under the new `[0.1.0b10]` section so the b8 release notes reflect what actually shipped in b8. Verified -------- - `uv run pytest tests/unit/test_phase_ci3_*.py` — 69 passed, 1 skipped (full wheel install round-trip behind `CI3_FULL_INSTALL=1`). - All six version-pin sites updated: `pyproject.toml`, `corpus_forge/__init__.py`, `uv.lock`, and two `test_phase_ci3_*` literal-string asserts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ulmentflam and others added 4 commits May 22, 2026 12:53

ulmentflam marked this pull request as ready for review May 25, 2026 17:02

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

ulmentflam merged commit 6e906e8 into main May 25, 2026
14 checks passed

ulmentflam mentioned this pull request May 25, 2026

fix(admin/prune): dynamic-import minhash module to satisfy pyrefly #59

Merged

3 tasks

ulmentflam deleted the nightly/prune-admin-011742Z branch May 26, 2026 15:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9)#50

feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9)#50
ulmentflam merged 4 commits into
mainfrom
nightly/prune-admin-011742Z

ulmentflam commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 23, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ulmentflam commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Explicitly out of scope (separate RFC tasks)

Test plan

Disclosures

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ulmentflam commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading