feat(admin): prune_dataset() scoring module — rfc-corpus-growth-controls (task 1/9)#50
Conversation
Two issues surfaced during a real ingest run:
1. The per-source progress bar was unbounded (``⠼ Ingest 0:00:01``,
no percentage, no live ETA), even though the up-front
``_log_ingest_eta`` walk had already counted every file. The
ETA was logged once at startup and then disappeared during the
actual work — the user couldn't tell what fraction was done.
2. Ollama embedder 5xx with ``"failed to encode response: json:
unsupported value: NaN"`` was logged as ``"Extractor failed on
<file>"``. That mis-attributes a model quirk (NaN in the
embedding vector) to file content, sending users hunting for
"broken" files instead of revisiting the embedder choice.
Changes:
- Rename ``_log_ingest_eta`` → ``_plan_ingest``. Same walk + same
ETA log, but now also returns a ``dict[id(source_config), int]``
with per-source file counts. The previous "walk twice" cost
(planner + ``source.scan()``) for filesystem-rooted sources is
preserved on the planner side and consumed on the ingest side.
- ``ingest_once`` threads each source's total into ``make_progress``.
Rich's bounded-mode columns (``BarColumn``, ``MofNCompleteColumn``,
``TimeRemainingColumn``) auto-render a live ETA from total +
advance, so users now see ``X/Y`` + remaining time. Sources
without a resolvable filesystem root (API-only plugins) keep
the previous unbounded behaviour.
- New ``_classify_and_log_ingest_error`` classifies per-document
failures: ``"unsupported value: NaN"`` → WARNING with a hint to
swap embedders or filter empty chunks; embedder 5xx → WARNING;
everything else keeps the historic ``extract_logger.info(
"Extractor failed on X: …")`` taxonomy so existing grep
patterns / dashboards still match.
Tests: 3 new (planner returns per-source counts; planner returns
``{}`` when no filesystem roots; NaN message logs at WARNING with
actionable hint; generic failure keeps the extractor-failed
wording). 173 estimate/ingest/runtime_profile tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on the per-source progress bar from the previous commit by
adding a persistent *global* bar that tracks total work across every
source in the run. The user asked for "a global estimation of total
time remaining on ALL items as a progress bar" — this is that.
Layout while ingest runs::
Ingest (all sources) ━━━━━━━━ 4,321/250,003 0:01:30 1d 11h 30m
filesystem ━━━━━━━━ 4,321/51,001 0:01:30 0:08:13
The top bar persists for the whole run; the indented per-source bar
appears when each source starts and is removed when it finishes, so
exactly one source task is visible at any time underneath the
global view.
Why one ``Progress``, not two
-----------------------------
Rich's ``Live`` (which ``Progress`` wraps) is single-active per
console — two nested ``make_progress`` blocks would race for the
screen. The fix is to keep a single ``Progress`` instance for the
duration of ``ingest_once`` and use ``add_task`` / ``remove_task``
to slot each source's child task in and out underneath the
persistent global task. Total = ``sum(_plan_ingest(config).values())``
or ``None`` when no source contributed a count (API-only configs
keep the previous indeterminate bar).
Other tidying
-------------
- Hoisted the ``import time`` that lived inside ``ingest_one`` to
the top of the module — it's used both there and in the new
per-source timing block, and the in-loop ``import time as _time``
was needless ceremony now that ``ingest.py`` already imports
stdlib at the top.
- Replaced the old single ``"Scan complete: N documents"`` log with
a paired ``"Ingest (X) started"`` / ``"Ingest (X) complete: N in
Ts (rate R/s)"`` so the rotating log still has the bookend pair
that ``make_progress``'s ``logger=`` was previously providing per
source (the outer ``make_progress`` now logs once for the whole
run, not per source).
All 173 estimate + ingest + runtime_profile + MCP-estimate tests
pass. Pre-commit + pre-push hook gates clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer-flagged regression: the source + global progress bars only advanced on successful ``ingest_one`` calls because the ``progress.update(..., advance=1)`` calls lived AFTER the ``try/except`` block and the ``except`` branch ended with ``continue``. Per-file failures (Ollama-NaN 5xx, extractor crashes, …) silently dropped advances on the floor. With the new global bar this is doubly bad: planner totals come from ``estimate_sync`` which counts every file regardless of whether ingest will succeed, so a single failure leaves the global bar permanently stranded below 100% even after every source has been processed. Fix: move both ``progress.update`` calls into a ``finally:`` block on the per-file ``try`` so they fire regardless of outcome. Dropped the now-redundant ``continue`` from the ``except`` arm — there was no code after the try/finally inside the loop body for it to skip. Test: source-text inspection of ``ingest_once`` verifying both advance calls live inside the ``finally`` block. End-to-end Progress mocking would require significant fake-backend plumbing; the structural check is precise enough to lock the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First task of `.planning/rfcs/rfc-corpus-growth-controls.md`. New module `corpus_forge/admin/prune.py` exposes `prune_dataset(...)` returning a `PruneReport` of the bottom-percentile chunks. Dry-run by default. - Reuses sub-score primitives from `corpus_forge.curation.selector` (confidence_deficit, missing_metadata, freshness) and adds two prune-tuned signals: `duplicate_density` (skipped when the MinHash backend from `rfc-nlp-data-quality-signals.md` is absent) and `feedback_drag` (dual-probes `corpus.chunk_feedback` and `corpus.feedback` for shape-tolerance). - Final rubric weights: 0.20/0.20/0.15/0.25/0.20 (sums to 1.0). - Backend dispatch via capability probe (`_paramstyle == "pyformat"`), with a class-name fallback for test doubles. - Apply path: prefers a `delete_chunks_by_ids` hook, else bulk `ANY(%s)` for Postgres, else chunked `IN (?, …)` for SQLite (batched at 500 ids). - Unknown-dataset safety: raises `ValueError` before walking when a named dataset doesn't resolve — silent corpus-wide deletion under `apply=True` would be too sharp an edge. Followups in separate RFC tasks (intentionally out of scope here): - `corpus-forge prune` CLI verb. - `GrowthConfig` Pydantic block (`prune_percentile_default`, etc.). - `score_for_pruning(chunk, ...)` extension on the selector. - Per-source caps (`max_rows`, `max_bytes`). - `estimate sync` pre-flight. 22/22 prune-scorer tests pass; no regressions in curation suites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces dataset pruning as a new admin feature and improves ingest orchestration. The pruning module adds configurable candidate selection via a five-part rubric (including new MinHash-based duplicate detection and feedback-drag scoring) with Postgres/SQLite deletion dispatch. Ingest gains file-count planning, smarter error classification for embedder failures, and unified progress tracking via Rich. ChangesDataset Pruning Admin Feature
Ingest Orchestration Improvements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unit/test_ingest_helpers.py (2)
176-189: ⚡ Quick winConsider verifying the log level for generic failures.
The NaN test thoroughly validates both the log level (WARNING) and message content, but this test only checks that "Extractor failed" appears in the message without asserting the expected log level. Adding a level assertion would make the test more robust and symmetric with the NaN test.
🔍 Suggested enhancement
with caplog.at_level(logging.INFO, logger="corpus_forge.ingest.extract"): _classify_and_log_ingest_error(_Raw(), ValueError("PDF parse error")) records = [r for r in caplog.records if "Extractor failed" in r.getMessage()] assert records, "expected an extractor-failure line on a generic exception" + assert records[0].levelno == logging.ERROR # or INFO/WARNING depending on expected behavior🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_ingest_helpers.py` around lines 176 - 189, Add an assertion that the generic failure is logged at the expected WARNING level: in test_generic_failure_keeps_extractor_taxonomy, after selecting records that contain "Extractor failed" (using _classify_and_log_ingest_error on _Raw()), assert that at least one matching record has record.levelno == logging.WARNING (or record.levelname == "WARNING") so the test verifies both message content and log level for generic failures; reference the test function test_generic_failure_keeps_extractor_taxonomy and the _classify_and_log_ingest_error call to locate where to add the assertion.
204-242: 💤 Low valueConsider more resilient pattern matching for the source inspection.
The test uses hardcoded indentation patterns (
"\n for "with exactly 12 spaces) and exact parameter-order matching. While the docstring acknowledges the tradeoff of source inspection vs. full integration testing, the pattern matching could be made more resilient to formatting changes without significantly increasing complexity.💡 More flexible alternatives
Using regex patterns would survive auto-formatting and minor refactoring:
import re # Instead of exact indentation: next_for = re.search(r'\n\s+for\s+', tail) next_func = re.search(r'\ndef\s+', tail) # Instead of exact parameter order: assert re.search(r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1', finally_block) assert re.search(r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1', finally_block)This maintains the contract check while tolerating formatting and keyword-argument reordering.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_ingest_helpers.py` around lines 204 - 242, The test test_loop_body_uses_finally_for_progress_update uses brittle exact-string searches to locate the finally block and the two progress.update calls; update it to use regex-based searches so minor formatting/arg-order changes don't break the check: use re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or equivalent) to find the finally block boundaries in the inspected source of ingest.ingest_once, and assert the progress.update occurrences with patterns like r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing and keyword-argument reordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@corpus_forge/admin/prune.py`:
- Around line 263-271: The SQL fallback hard-codes the "corpus." schema (e.g. in
the execute calls inside the try/except where logger.debug is used and where
_coerce_to_dict_list wraps execute), which breaks non-default-schema Postgres
setups; remove the "corpus." prefix from the table names (use unqualified
chunk_feedback and feedback and keep the WHERE entity_type = 'chunk' clause) so
the connection's search_path/default schema is respected; apply the same change
for the other occurrence referenced (around the code at lines noted, e.g. the
later execute calls at ~395-397) and keep function names execute and
_coerce_to_dict_list and the surrounding try/except/logger.debug logic
unchanged.
- Around line 481-494: The code currently swallows exceptions from
backend.find_dataset_id_by_name and continues to a broad candidate walk; change
it to hard-fail on resolution errors and use the resolved id for the candidate
walk. Specifically, in the block that uses resolver = getattr(backend,
"find_dataset_id_by_name", None) and calls resolved = resolver(dataset), remove
the catch-then-default behavior that sets resolved = "<unknown>" and instead
re-raise or raise a ValueError when resolver raises; after a successful call, if
resolved is None raise ValueError(f"dataset {dataset!r} not found") and then
assign dataset = resolved so that _iter_curation_candidates(backend,
dataset=dataset, ...) uses the resolved id.
In `@corpus_forge/ingest.py`:
- Around line 711-717: The global progress task (global_task) is created with
total=global_total which only counts sources present in per_source_totals, but
the code always advances global_task for every processed item; only increment
global_task when the current source has a known planned total (i.e.,
source_total is not None and the source exists in per_source_totals). Update the
places that call progress.advance(global_task, ...) (including where
source_total is read/used) to guard the advance with a check like "if
source_total is not None" (or "if source_name in per_source_totals") so
unplanned/API-only sources don't change the global bar; apply the same
conditional to the other similar advance calls referenced (the blocks around
make_progress, global_task, and loops that process per-source items).
---
Nitpick comments:
In `@tests/unit/test_ingest_helpers.py`:
- Around line 176-189: Add an assertion that the generic failure is logged at
the expected WARNING level: in test_generic_failure_keeps_extractor_taxonomy,
after selecting records that contain "Extractor failed" (using
_classify_and_log_ingest_error on _Raw()), assert that at least one matching
record has record.levelno == logging.WARNING (or record.levelname == "WARNING")
so the test verifies both message content and log level for generic failures;
reference the test function test_generic_failure_keeps_extractor_taxonomy and
the _classify_and_log_ingest_error call to locate where to add the assertion.
- Around line 204-242: The test test_loop_body_uses_finally_for_progress_update
uses brittle exact-string searches to locate the finally block and the two
progress.update calls; update it to use regex-based searches so minor
formatting/arg-order changes don't break the check: use
re.search(r'\n\s+for\s+') and re.search(r'\ndef\s+') (or equivalent) to find the
finally block boundaries in the inspected source of ingest.ingest_once, and
assert the progress.update occurrences with patterns like
r'progress\.update\([^)]*source_task[^)]*advance\s*=\s*1' and
r'progress\.update\([^)]*global_task[^)]*advance\s*=\s*1' to tolerate spacing
and keyword-argument reordering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 214d9b1c-4aec-43f1-ada9-13b180b607a0
📒 Files selected for processing (6)
CHANGELOG.mdcorpus_forge/admin/__init__.pycorpus_forge/admin/prune.pycorpus_forge/ingest.pytests/unit/test_ingest_helpers.pytests/unit/test_prune_scorer.py
| execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback") | ||
| ) | ||
| except Exception as exc: | ||
| logger.debug("corpus.chunk_feedback probe failed (%r) — trying corpus.feedback", exc) | ||
| try: | ||
| rows_typed = _coerce_to_dict_list( | ||
| execute( | ||
| "SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback " | ||
| "WHERE entity_type = 'chunk'" |
There was a problem hiding this comment.
Avoid hard-coded corpus. schema in SQL fallbacks.
The fallback SQL pins table names to corpus.*, which breaks when Postgres uses a non-default schema and weakens backend portability. Prefer unqualified table names and rely on backend search path / connection default schema.
🛠️ Suggested portability fix
- execute("SELECT chunk_id, kind, rating FROM corpus.chunk_feedback")
+ execute("SELECT chunk_id, kind, rating FROM chunk_feedback")
...
- "SELECT entity_id AS chunk_id, kind, rating FROM corpus.feedback "
+ "SELECT entity_id AS chunk_id, kind, rating FROM feedback "
"WHERE entity_type = 'chunk'"
...
- "DELETE FROM corpus.chunks WHERE id = ANY(%s)",
+ "DELETE FROM chunks WHERE id = ANY(%s)",Also applies to: 395-397
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@corpus_forge/admin/prune.py` around lines 263 - 271, The SQL fallback
hard-codes the "corpus." schema (e.g. in the execute calls inside the try/except
where logger.debug is used and where _coerce_to_dict_list wraps execute), which
breaks non-default-schema Postgres setups; remove the "corpus." prefix from the
table names (use unqualified chunk_feedback and feedback and keep the WHERE
entity_type = 'chunk' clause) so the connection's search_path/default schema is
respected; apply the same change for the other occurrence referenced (around the
code at lines noted, e.g. the later execute calls at ~395-397) and keep function
names execute and _coerce_to_dict_list and the surrounding
try/except/logger.debug logic unchanged.
| if dataset is not None: | ||
| resolver: Any = getattr(backend, "find_dataset_id_by_name", None) | ||
| if callable(resolver): | ||
| try: | ||
| resolved = resolver(dataset) | ||
| except Exception: # pragma: no cover — defensive | ||
| logger.exception("find_dataset_id_by_name(%r) raised", dataset) | ||
| resolved = "<unknown>" | ||
| if resolved is None: | ||
| raise ValueError(f"dataset {dataset!r} not found") | ||
|
|
||
| candidates: list[_Candidate] = list( | ||
| _iter_curation_candidates(backend, dataset=dataset, limit=candidate_pool) | ||
| ) |
There was a problem hiding this comment.
Fail closed for dataset resolution before candidate walk.
A named dataset prune currently continues even when find_dataset_id_by_name raises, which can degrade into an unintended broad candidate walk. This path should hard-fail before scoring/deletion.
🔒 Suggested safety fix
if dataset is not None:
resolver: Any = getattr(backend, "find_dataset_id_by_name", None)
- if callable(resolver):
- try:
- resolved = resolver(dataset)
- except Exception: # pragma: no cover — defensive
- logger.exception("find_dataset_id_by_name(%r) raised", dataset)
- resolved = "<unknown>"
- if resolved is None:
- raise ValueError(f"dataset {dataset!r} not found")
+ if not callable(resolver):
+ raise ValueError(
+ "dataset-scoped prune requires backend.find_dataset_id_by_name"
+ )
+ try:
+ resolved = resolver(dataset)
+ except Exception as exc: # pragma: no cover — defensive
+ logger.exception("find_dataset_id_by_name(%r) raised", dataset)
+ raise RuntimeError(f"failed to resolve dataset {dataset!r}") from exc
+ if resolved is None:
+ raise ValueError(f"dataset {dataset!r} not found")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@corpus_forge/admin/prune.py` around lines 481 - 494, The code currently
swallows exceptions from backend.find_dataset_id_by_name and continues to a
broad candidate walk; change it to hard-fail on resolution errors and use the
resolved id for the candidate walk. Specifically, in the block that uses
resolver = getattr(backend, "find_dataset_id_by_name", None) and calls resolved
= resolver(dataset), remove the catch-then-default behavior that sets resolved =
"<unknown>" and instead re-raise or raise a ValueError when resolver raises;
after a successful call, if resolved is None raise ValueError(f"dataset
{dataset!r} not found") and then assign dataset = resolved so that
_iter_curation_candidates(backend, dataset=dataset, ...) uses the resolved id.
| global_total = sum(per_source_totals.values()) or None | ||
| with make_progress( | ||
| "Ingest (all sources)", | ||
| total=global_total, | ||
| logger=scan_logger, | ||
| ) as progress: | ||
| global_task = progress.add_task("Total", total=global_total) |
There was a problem hiding this comment.
Don't advance the global task for unplanned sources.
global_total only includes sources present in per_source_totals, but Line 795 advances that task for every item. If any source has source_total is None (API-only source, planner miss, planner failure), the "Total" bar can reach 100% too early or overshoot its declared total.
💡 Suggested fix
- global_total = sum(per_source_totals.values()) or None
+ all_source_ids = {
+ id(source_config)
+ for dataset in config.datasets
+ for source_config in dataset.sources
+ }
+ planned_source_ids = set(per_source_totals)
+ global_total = (
+ sum(per_source_totals.values()) if planned_source_ids == all_source_ids else None
+ ) or None
...
- progress.update(global_task, advance=1)
+ if global_total is None or source_total is not None:
+ progress.update(global_task, advance=1)Also applies to: 757-760, 785-795
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@corpus_forge/ingest.py` around lines 711 - 717, The global progress task
(global_task) is created with total=global_total which only counts sources
present in per_source_totals, but the code always advances global_task for every
processed item; only increment global_task when the current source has a known
planned total (i.e., source_total is not None and the source exists in
per_source_totals). Update the places that call progress.advance(global_task,
...) (including where source_total is read/used) to guard the advance with a
check like "if source_total is not None" (or "if source_name in
per_source_totals") so unplanned/API-only sources don't change the global bar;
apply the same conditional to the other similar advance calls referenced (the
blocks around make_progress, global_task, and loops that process per-source
items).
Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric. - `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]` - Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use). - Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch. - `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function. All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites. Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ls (#51) Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric. - `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]` - Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use). - Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch. - `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function. All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites. Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts a feature release on top of v0.1.0b9 — 14 commits since the b9 tag, including new CLI surface, three RFC foundation tasks, and the bisecting OpenAI embedder recovery. Highlights ---------- **RFC `rfc-corpus-growth-controls` (P1) — four landed tasks:** - New `corpus-forge prune --dataset NAME` CLI verb (#52). Dry-run default; `--apply` deletes the bottom-percentile chunks scored by `score_for_pruning()`. `--dry-run-json` emits the plan for CI / agent inspection. - `prune_dataset()` module + scoring rubric (#50) — Postgres / SQLite dispatch via `_is_postgres_like` capability probe; named-but-unknown datasets raise `ValueError` before any candidate walk (safety guard under `apply=True`). 22 unit tests pin the rubric. - Public `score_for_pruning()` extracted onto `corpus_forge.curation.selector` (#51); default weights exposed as `PRUNE_WEIGHTS`. - Per-source cap enforcement in `ingest` (#53) — evicts lowest-scoring chunks after each source completes a scan; attribution via URI prefix matching with a one-line WARNING fall-through for non-unique scheme plugins. - `[growth]` config block (#37) + `DatasetSourceConfig.max_rows` / `max_bytes` storage. **RFC `rfc-eval-framework-expansion` (P1) — two foundation tasks:** - `corpus_forge.eval._schema.EvalOutput` shared output envelope (#38). Pydantic v2 with `extra='forbid'` so future evaluators can't widen the envelope. - `[eval_regression]` config block + `EvalRegressionConfig` (#41) for the future `eval regression --baseline` verb's tolerance gating. **RFC `rfc-nlp-data-quality-signals` (P1) — one foundation task:** - `corpus_forge.quality.HeuristicQualityEnricher` (#40) — pure-Python composite quality scorer (token-rate, punctuation, repetition, shouting → weighted geometric mean). **RFC `rfc-developer-ux-verbs` (P3) — one checkbox closed:** - `corpus-forge logs tail --level <name>` minimum-severity filter (#39). **Embedder reliability (the bug that triggered all this):** - `OpenAIEmbedder.encode` bisects on failure (#49). On a 5xx or NaN-laced response the batch is recursively halved until the offending chunk is isolated, logged at WARNING (orig_idx / chars / sha256, no PII preview), and skipped. `max_retries=0` on the SDK client so failures fast-fail (~1s instead of ~7s). Non-recoverable 4xx errors (auth, missing model, 400/422) re-raise immediately instead of being bisected — bisection can't fix a wrong config. Row-count mismatches also bisect. `backfill_embedder` exits its inner loop on an all-skipped batch instead of refetching the same `chunks_missing_embedding` rows forever. Unblocks the maintainer's Ollama-served `qwen3-embedding:8b` instance which intermittently returned NaN-laced 500s for ~3% of chunks. - `embedder_drift` doctor check + `corpus-forge embedder gc` CLI (#48). Catches the silent embedder-rename bug — renaming an embedder in `config.toml` left an orphan `embeddings_<name>` table + catalog row behind, invisible to `embedder list`. The maintainer's instance had a 209 MB orphan on top of a 342 MB real-data DB before this landed. **macOS install hardening (#47):** - `corpus_forge.macos_tcc` module — iCloud Drive + TCC integration. - `icloud_access` doctor check. - TCC handshake in `corpus-forge setup` and `service install --launchd`. - `FilesystemSource.parse` proactively `brctl download`s evicted iCloud placeholders. Hygiene ------- CHANGELOG cleanup: 9 entries from PRs that landed AFTER v0.1.0b9 was tagged were misfiled into the `[0.1.0b8]` `### Added` section (they got dropped there by automated CHANGELOG patchers running on stale checkouts). Moved them up under the new `[0.1.0b10]` section so the b8 release notes reflect what actually shipped in b8. Verified -------- - `uv run pytest tests/unit/test_phase_ci3_*.py` — 69 passed, 1 skipped (full wheel install round-trip behind `CI3_FULL_INSTALL=1`). - All six version-pin sites updated: `pyproject.toml`, `corpus_forge/__init__.py`, `uv.lock`, and two `test_phase_ci3_*` literal-string asserts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
First task of
rfc-corpus-growth-controls.md: introducecorpus_forge/admin/prune.pywith aprune_dataset(...)entry point that scores chunks on a weighted rubric and selects the bottom N% as prune candidates. Default is dry-run; deletion only happens with explicitapply=True.prune_dataset(backend, *, dataset, percentile=10, apply=False, ...) -> PruneReport— scoring + selection + (gated) deletion.PruneReportcarries the per-chunk score breakdowns, source-grouped summary, and aduplicate_density_availableflag so callers can warn when the MinHash backend is absent.corpus_forge.curation.selector. Adds two new prune-tuned signals (duplicate_density,feedback_drag) inline inprune.py.delete_chunks_by_idshook, falls back to bulkANY(%s)(Postgres) or chunkedIN (?, …)(SQLite, 500-id batches). Dispatch keys on_paramstyle == "pyformat"with a class-name fallback for test doubles.ValueErrorbefore any candidate walk when a named dataset doesn't resolve — silent corpus-wide deletion underapply=Truewould be too sharp an edge.Explicitly out of scope (separate RFC tasks)
corpus-forge pruneCLI verb.GrowthConfigPydantic block (prune_percentile_default, etc.).score_for_pruning(chunk, ...)extension on the selector.max_rows,max_bytes).estimate syncpre-flight.tests/integration/test_prune_e2e.py(postponed to the CLI-verb task where there's an end-to-end surface).Test plan
tests/unit/test_prune_scorer.py— 22 tests pass.ValueErroron out-of-range).feedback_dragflips on rejected feedback / negative rating.duplicate_densitysilently zeros when MinHash unimportable; flag promoted toPruneReport.apply=Truecalls thedelete_chunks_by_idshook when present.ANY(%s)branch covered via_paramstyle="pyformat"test double.IN (?, ?, …)branch covered via 7-id pool with batch size monkey-patched to 3 (3 batches, last batch = 1).ValueErrorbefore walk.tests/unit/test_curation_selector.py/test_selector_learned_signal.py(103 tests pass combined).ruff checkclean.pyrefly checkclean (pre-commit hook).Disclosures
Detail in
.nightly/runs/2026-05-23T01-15-27Z/tasks/0001-prune-admin-module/uncertainty.md:chunk_feedbackvscorpus.feedbackschema discrepancy — dual-probed, both shapes tolerated._paramstyleprobe + class-name fallback).prune_score— Python's stable sort preserves backend yield order; not contract-tested.🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Tests