Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -62,6 +62,7 @@ version numbers (so `0.1.0b1` is the first beta of the `0.1.0` line).

### Added

- feat(admin): new `prune_dataset()` module + scoring rubric (`corpus_forge/admin/prune.py`) — first step of `rfc-corpus-growth-controls`. CLI verb + GrowthConfig block land in follow-up RFC tasks. Postgres / SQLite dispatch goes through a small `_is_postgres_like` capability probe (`_paramstyle == "pyformat"` first, class-name `"postgres"` substring as fallback) so we don't lean on a single brittle name check; SQLite branch chunks the IN-list at `_SQLITE_BATCH_SIZE = 500` ids. `PruneReport.duplicate_density_available` exposes whether the MinHash quality signal ran (promoted off the head candidate's `sub_scores` so every element of `selected` is now shape-uniform). Named-but-unknown datasets raise `ValueError` before any candidate walk — critical safety guard under `apply=True` so a typo'd name can never delete from the wrong scope. 22 unit tests in `tests/unit/test_prune_scorer.py` (up from 17 in the initial round) lock the rubric, the dispatch heuristics, both delete paths, and the unknown-dataset refusal.
- `tests/unit/test_cli_human_friendly.py` — first two tests against
the human-friendly CLI testable properties: (1) doctor's
`_check_config_present` pins the `corpus-forge setup` recovery
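The dispatch and batching behaviour described in the `prune_dataset()` changelog entry can be sketched roughly as follows. This is a minimal illustration, not the code in `corpus_forge/admin/prune.py`: the function names `is_postgres_like`, `chunked`, and `delete_ids_sqlite`, the `id` column, and the table name are all hypothetical. Only the `_paramstyle == "pyformat"` check, the `"postgres"` class-name fallback, and the batch size of 500 come from the entry itself.

```python
import sqlite3
from typing import Iterator, Sequence

# Mirrors the `_SQLITE_BATCH_SIZE = 500` value named in the changelog entry.
SQLITE_BATCH_SIZE = 500


def is_postgres_like(store: object) -> bool:
    """Hypothetical capability probe: paramstyle first, class name as fallback."""
    if getattr(store, "_paramstyle", None) == "pyformat":
        return True
    return "postgres" in type(store).__name__.lower()


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_ids_sqlite(conn: sqlite3.Connection, table: str, ids: Sequence[int]) -> int:
    """Delete rows by id in batches so each IN-list stays small.

    `table` is interpolated into the SQL, so it must be a trusted identifier
    (an assumption of this sketch); the ids themselves are bound as parameters.
    """
    deleted = 0
    for batch in chunked(ids, SQLITE_BATCH_SIZE):
        placeholders = ",".join("?" for _ in batch)
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE id IN ({placeholders})", list(batch)
        )
        deleted += cursor.rowcount
    conn.commit()
    return deleted
```

Batching keeps each statement under SQLite's bound-parameter limit, which depends on how the library was compiled, so a fixed small batch is a conservative choice.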
8 changes: 7 additions & 1 deletion corpus_forge/admin/__init__.py
@@ -24,4 +24,10 @@

from __future__ import annotations

-__all__: list[str] = []
+from corpus_forge.admin.prune import PruneCandidate, PruneReport, prune_dataset
+
+__all__: list[str] = [
+    "PruneCandidate",
+    "PruneReport",
+    "prune_dataset",
+]