feat(ingest): per-source cap enforcement — rfc-corpus-growth-controls (task 4/9)#53
Draft
ulmentflam wants to merge 4 commits into
Closes the `DatasetSourceConfig max_rows / max_bytes` checkbox of RFC `rfc-corpus-growth-controls` (P1). Storage-only this PR: the eviction runtime in `ingest_once` lands in a follow-up.
Change
------
`corpus_forge/config.py`: two new optional fields on `DatasetSourceConfig`:
- `max_rows: int | None = Field(default=None, gt=0)`: per-source row cap.
- `max_bytes: int | None = Field(default=None, gt=0)`: per-source byte cap.
Default `None` on both = "uncapped." When set, the future eviction loop will enforce these AFTER each batch insert and evict the lowest-scoring rows from that source (LRU + score). The eviction honours `GrowthConfig.per_source_cap_default_rows` from PR #37 as a fallback when `max_rows is None` but the global default is non-zero; that fallback wiring belongs in the eviction-loop PR, not this one.
Plain ints (no human-readable string parsing). Future-compat: the `_parse_bytes` helper on PR #37's branch can be adopted here once both PRs merge if reviewers want `"10G"`-style strings on per-source caps too. For now the global cap accepts strings and the per-source caps take raw ints.
Tests
-----
`tests/unit/test_dataset_source_caps.py`: 11 tests across five classes:
- `TestDefaults`: both fields default to `None`.
- `TestMaxRowsBounds`: positive int accepted, zero / negative rejected.
- `TestMaxBytesBounds`: same.
- `TestCoexistence`: independently or both set.
- `TestBackwardsCompat`: existing configs without these fields still validate.
Verified
--------
- `uv run pytest tests/unit/test_dataset_source_caps.py tests/unit/test_config.py -q` → 17/17 in 0.17s.
- `uv run ruff check`, `uv run ruff format --check`, `uv run pyrefly check corpus_forge/config.py` → all clean.
The pre-existing `Lint + format + typecheck` CI failure on `main` (PR #31 fixes) will still fail this PR's lint job until #31 lands and this rebases. Same blast pattern as the other open Nightly PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
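The cap semantics in this commit message can be sketched without any dependencies. This is only an illustration under stated assumptions: `SourceCaps` is a plain dataclass standing in for the Pydantic model (the real fields use `Field(default=None, gt=0)`), and `effective_max_rows` is a hypothetical helper showing the fallback to `GrowthConfig.per_source_cap_default_rows` that the message defers to the eviction-loop PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCaps:
    """Stand-in for the two new DatasetSourceConfig fields (not the real model)."""

    max_rows: int | None = None   # None means "uncapped"
    max_bytes: int | None = None  # None means "uncapped"

    def __post_init__(self) -> None:
        # Mirror the gt=0 constraint: zero and negative caps are rejected.
        for name in ("max_rows", "max_bytes"):
            value = getattr(self, name)
            if value is not None and value <= 0:
                raise ValueError(f"{name} must be > 0, got {value}")


def effective_max_rows(caps: SourceCaps, global_default_rows: int) -> int | None:
    """Hypothetical fallback: use the global default only when it is non-zero."""
    if caps.max_rows is not None:
        return caps.max_rows
    return global_default_rows if global_default_rows > 0 else None
```

A per-source value always wins; the global default applies only when the source leaves `max_rows` unset, and a zero default keeps the source uncapped.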
Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric.
- `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]`
- Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use).
- Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch.
- `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function.
All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites.
Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
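The three validations listed in this commit message can be illustrated with a small weighted-sum sketch. Everything here is an assumption made for illustration: the component names and values in `EXAMPLE_WEIGHTS` are invented (the real `PRUNE_WEIGHTS` are not shown in this PR), the `candidate` argument is omitted, and the second tuple element is assumed to hold per-component contributions.

```python
from __future__ import annotations

# Hypothetical component names and weights; not the project's PRUNE_WEIGHTS.
EXAMPLE_WEIGHTS: dict[str, float] = {"quality": 0.5, "recency": 0.3, "uniqueness": 0.2}


def score_for_pruning_sketch(
    sub_scores: dict[str, float],
    weights: dict[str, float] | None = None,
) -> tuple[float, dict[str, float]]:
    """Weighted sum of sub-scores, guarded by the three checks from the message."""
    weights = EXAMPLE_WEIGHTS if weights is None else weights

    # 1. Weight keys must match the rubric exactly (no extra, no missing).
    mismatched = set(weights) ^ set(EXAMPLE_WEIGHTS)
    if mismatched:
        raise ValueError(f"unexpected or missing weight keys: {sorted(mismatched)}")

    # 2. Weights must sum to 1.0 within 1e-6.
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")

    # 3. Every weighted component needs a sub-score.
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub_scores: {sorted(missing)}")

    # Assumption: the returned dict is the per-component contribution.
    contributions = {name: weights[name] * sub_scores[name] for name in weights}
    return sum(contributions.values()), contributions
```

Centralising the checks in one function is what lets `prune.py` and any later caller share a rubric without each re-validating its own weights.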
…st-caps-022000Z # Conflicts: # CHANGELOG.md
Fourth task of `.planning/rfcs/rfc-corpus-growth-controls.md`. When
a `DatasetSourceConfig` has `max_rows` or `max_bytes` set,
`ingest_once` now calls `enforce_source_caps()` once per source per
ingest cycle. If the source is over its cap, the lowest-scoring
chunks (per `score_for_pruning(...)`) are evicted until back under
the cap. Cap-check failures are logged at WARNING level but never
break ingest.
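The failure isolation described in this paragraph could look roughly like the sketch below. `run_cap_checks`, its `enforce` callable, and the logger name are hypothetical stand-ins; the actual wiring inside `ingest_once` is not shown in this message.

```python
from __future__ import annotations

import logging
from typing import Callable, Iterable

logger = logging.getLogger("corpus_forge.ingest.sketch")  # hypothetical logger name


def run_cap_checks(
    sources: Iterable[str],
    enforce: Callable[[str], object],
) -> list[str]:
    """Call enforce() once per source; return the sources whose check failed."""
    failed: list[str] = []
    for source in sources:
        try:
            enforce(source)
        except Exception:
            # A broken cap check is logged and skipped; the loop carries on.
            logger.warning("cap check failed for source %r", source, exc_info=True)
            failed.append(source)
    return failed
```

Catching broadly here is deliberate in the sketch: the cap check is housekeeping, so one failing source should not stop the remaining sources from being checked.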
- New module `corpus_forge/admin/source_caps.py`:
  - `derive_source_uri_prefix(source_config) -> str | None`:
    per-plugin URI prefix table covering markdown_vault, claude_code
    (+ claude-code-history), opencode, gemini_cli, codex_cli,
    chatgpt_export, jsonl_chat, zotero (with user_id/group_id), and
    filesystem. Unknown plugins → None (skipped with WARNING).
  - `count_source_rows(backend, dataset_id, prefix)`:
    paramstyle-aware COUNT + SUM(LENGTH) across documents and
    conversations, correctly scoped to dataset_id and
    schema-prefix-aware (`corpus.chunks` for Postgres, bare `chunks`
    for SQLite).
  - `enforce_source_caps(backend, dataset_id, source_config)`:
    returns a `CapEnforcementReport` with structured outcome.
  - `_evict_lowest_scoring(...)`: resolves dataset_id → dataset_name
    via `find_dataset_name_by_id` (probe) or `_execute` (fallback),
    then passes the name to `_iter_curation_candidates(...)` so
    cross-dataset URI collisions cannot delete from the wrong
    dataset. Reuses prune.py's `_delete_chunks` dispatch.
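A minimal sketch of the dataset-scoped counting idea, using SQLite from the standard library. The table layout (`chunks` with `dataset_id`, `source_uri`, `content` columns) and the function name are assumptions made for illustration, not the project's schema; only the two `claude_code` URI schemes come from this PR. Note that SQLite's `LENGTH` counts characters for TEXT values, so real byte accounting may differ.

```python
from __future__ import annotations

import sqlite3

# Only the claude_code schemes are named in this PR; other plugins are left
# out here, and an unknown plugin maps to no prefixes at all.
KNOWN_SCHEMES: dict[str, tuple[str, ...]] = {
    "claude_code": ("claude-code://", "claude-code-history://"),
}


def count_rows_and_bytes(
    conn: sqlite3.Connection, dataset_id: int, prefixes: tuple[str, ...]
) -> tuple[int, int]:
    """Sum row counts and content lengths over every prefix, for one dataset."""
    rows = total_bytes = 0
    for prefix in prefixes:
        # Escape LIKE wildcards so the prefix is matched literally.
        escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        n, size = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(LENGTH(content)), 0) "
            "FROM chunks WHERE dataset_id = ? AND source_uri LIKE ? ESCAPE '\\'",
            (dataset_id, escaped + "%"),
        ).fetchone()
        rows += n
        total_bytes += size
    return rows, total_bytes
```

Filtering on `dataset_id` in the same query as the prefix match is what keeps two datasets with identical URI schemes from being counted together.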
21 unit tests cover URI-prefix derivation, no-op branches, both
eviction branches (rows + bytes), claude_code dual-scheme summation,
and a dataset-scope regression that locks the cross-dataset deletion
safety guard.
Stacks on PR #51 (`score_for_pruning`) and PR #42
(`DatasetSourceConfig.max_rows/max_bytes`), merged into the base
branch as the merge commit on top of the existing prune stack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
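The eviction rule in this commit message (drop the lowest-scoring chunks until the source is back under its cap) can be sketched as a pure planning function. The `Chunk` type, the `plan_eviction` name, and the choice of which outcome label to report when both caps are exceeded are assumptions made for illustration; the outcome strings themselves are the ones listed in the PR summary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    size_bytes: int
    score: float  # higher means more worth keeping


def plan_eviction(
    chunks: list[Chunk], max_rows: int | None, max_bytes: int | None
) -> tuple[str, list[int]]:
    """Return (outcome, chunk ids to evict) so the rest fits under both caps."""
    if max_rows is None and max_bytes is None:
        return "no_cap", []

    rows = len(chunks)
    total = sum(c.size_bytes for c in chunks)
    over_rows = max_rows is not None and rows > max_rows
    over_bytes = max_bytes is not None and total > max_bytes
    if not (over_rows or over_bytes):
        return "under_cap", []

    evicted: list[int] = []
    # Walk from the lowest score upwards, stopping once both caps are satisfied.
    for chunk in sorted(chunks, key=lambda c: c.score):
        if (max_rows is None or rows <= max_rows) and (
            max_bytes is None or total <= max_bytes
        ):
            break
        evicted.append(chunk.chunk_id)
        rows -= 1
        total -= chunk.size_bytes

    # Assumption: the label names the row cap whenever it was exceeded.
    return ("evicted_max_rows" if over_rows else "evicted_max_bytes"), evicted
```

Separating the plan from the delete keeps the scoring logic testable without a database; the deletion itself would go through the existing prune dispatch.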
Summary
Fourth task of `rfc-corpus-growth-controls.md`. When a `DatasetSourceConfig` has `max_rows` or `max_bytes` set, `ingest_once` now enforces the caps by evicting the lowest-scoring chunks (per `score_for_pruning(...)`) until the source is back under both caps.
- New module `corpus_forge/admin/source_caps.py`:
  - `derive_source_uri_prefix(source_config) -> str | None`: per-plugin URI prefix table covering `markdown_vault`, `claude_code` (+ `claude-code-history`), `opencode`, `gemini_cli`, `codex_cli`, `chatgpt_export`, `jsonl_chat`, `zotero` (with `user_id`/`group_id`), and `filesystem`. Unknown plugins → `None` (skipped with one-line WARNING).
  - `count_source_rows(backend, dataset_id, prefix)`: paramstyle-aware COUNT + `SUM(LENGTH)` across documents and conversations, correctly scoped to `dataset_id` and schema-prefix-aware (`corpus.chunks` for Postgres, bare `chunks` for SQLite).
  - `enforce_source_caps(backend, dataset_id, source_config)`: returns a `CapEnforcementReport` with structured outcome (`no_cap` / `no_prefix` / `under_cap` / `evicted_max_rows` / `evicted_max_bytes`).
  - `_evict_lowest_scoring(...)`: resolves `dataset_id → dataset_name` via `find_dataset_name_by_id` (probe) or `_execute` (fallback), then passes the name to `_iter_curation_candidates(...)` so cross-dataset URI collisions cannot delete from the wrong dataset. Reuses `prune.py`'s `_delete_chunks` dispatch.
- `corpus_forge/ingest.py`: cap enforcement runs once per source per ingest cycle (post-scan), not per-document. Failure-isolated try/except: cap-check errors emit a WARNING but never break ingest.
Stacking
This PR's base is `nightly/score-for-pruning-015212Z` (PR #51). The merge commit imports PR #42's `DatasetSourceConfig.max_rows/max_bytes` fields, which the cap enforcement reads. When PR #51 and PR #42 merge, this PR's diff converges to just `source_caps.py` + the `ingest.py` wiring + the test file.
Dependency chain: #50 (prune module) → #51 (score_for_pruning) ← #42 (config caps) → #53 (this PR).
Test plan
- `tests/unit/test_source_caps_enforcement.py`: 21 new tests, all pass.
- No-op branches: `no_cap`, `no_prefix`, `under_cap`.
- Eviction branches: `max_rows`, `max_bytes`, both caps stricter-wins.
- `claude_code` dual scheme (`claude-code://` + `claude-code-history://`): both counted under one umbrella.
- Dataset-scope regression: datasets sharing the `claude-code://` prefix; enforcing the cap on dataset 1 evicts only dataset-1 chunks. (Caught by the reviewer; fixed before this PR landed.)
- No regressions in `tests/unit/test_prune_scorer.py` (22), `tests/unit/test_selector_score_for_pruning.py` (12), or `tests/unit/test_dataset_source_caps.py` (12).
- `ruff check` + `pyrefly typecheck` clean (pre-commit hooks pass).
Disclosures (`uncertainty.md`)
- Multiple `claude_code` sources in the same dataset (e.g. personal + work) share the `claude-code://` scheme and would be conflated. Rare in practice; documented.
- `_DEFAULT_SCORING_POOL=10_000`. Caps above this would see a truncated scoring window. Fine for typical configs.
🤖 Generated with Claude Code