feat(ingest): per-source cap enforcement — rfc-corpus-growth-controls (task 4/9) by ulmentflam · Pull Request #53 · ulmentflam/corpus-forge

ulmentflam · 2026-05-23T18:04:33Z

Summary

Fourth task of rfc-corpus-growth-controls.md. When a DatasetSourceConfig has max_rows or max_bytes set, ingest_once now enforces the caps by evicting the lowest-scoring chunks (per score_for_pruning(...)) until the source is back under both caps.

New module corpus_forge/admin/source_caps.py:
- derive_source_uri_prefix(source_config) -> str | None — per-plugin URI prefix table covering markdown_vault, claude_code (+ claude-code-history), opencode, gemini_cli, codex_cli, chatgpt_export, jsonl_chat, zotero (with user_id / group_id), and filesystem. Unknown plugins → None (skipped with one-line WARNING).
- count_source_rows(backend, dataset_id, prefix) — paramstyle-aware COUNT + SUM(LENGTH) across documents and conversations, correctly scoped to dataset_id and schema-prefix-aware (corpus.chunks for Postgres, bare chunks for SQLite).
- enforce_source_caps(backend, dataset_id, source_config) — returns a CapEnforcementReport with structured outcome (no_cap / no_prefix / under_cap / evicted_max_rows / evicted_max_bytes).
- _evict_lowest_scoring(...) — resolves dataset_id → dataset_name via find_dataset_name_by_id (probe) or _execute (fallback), then passes the name to _iter_curation_candidates(...) so cross-dataset URI collisions cannot delete from the wrong dataset. Reuses prune.py's _delete_chunks dispatch.
corpus_forge/ingest.py: cap enforcement runs once per source per ingest cycle (post-scan), not per-document. Failure-isolated try/except: cap-check errors emit a WARNING but never break ingest.
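The dataset-scoped counting described for count_source_rows can be sketched as follows. This is a minimal illustration and not the module's actual code: it assumes a single hypothetical SQLite `chunks(dataset_id, uri, text)` table (the real function spans documents and conversations and also handles the Postgres `corpus.chunks` schema), and it takes a list of prefixes so that both claude_code schemes are summed together.

```python
import sqlite3


def count_source_rows(
    conn: sqlite3.Connection, dataset_id: int, prefixes: list[str]
) -> tuple[int, int]:
    """Return (row_count, total_length) for one dataset's chunks under the prefixes."""
    if not prefixes:
        return 0, 0
    # substr(...) = ? avoids LIKE wildcard escaping for prefixes containing '_' or '%'.
    prefix_clause = " OR ".join("substr(uri, 1, ?) = ?" for _ in prefixes)
    params: list[object] = [dataset_id]
    for prefix in prefixes:
        params.extend([len(prefix), prefix])
    # The dataset filter is AND-ed with the whole prefix group, so rows from
    # another dataset that share a URI scheme are never counted.
    # Note: LENGTH on a TEXT column counts characters in SQLite, not bytes.
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(LENGTH(text)), 0) FROM chunks "
        f"WHERE dataset_id = ? AND ({prefix_clause})",
        params,
    ).fetchone()
    return int(row[0]), int(row[1])


def demo_count() -> tuple[int, int]:
    """Build a throwaway in-memory table (hypothetical schema) and count dataset 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (id INTEGER PRIMARY KEY, dataset_id INTEGER, uri TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunks (dataset_id, uri, text) VALUES (?, ?, ?)",
        [
            (1, "claude-code://a", "abcd"),
            (1, "claude-code-history://b", "ef"),
            (2, "claude-code://c", "ignored"),  # same scheme, different dataset
            (1, "other://d", "other source"),   # same dataset, different source
        ],
    )
    return count_source_rows(conn, 1, ["claude-code://", "claude-code-history://"])
```

Here `demo_count()` counts only the two dataset-1 rows under the claude_code schemes; the dataset-2 row that shares the `claude-code://` scheme is excluded.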

Stacking

This PR's base is nightly/score-for-pruning-015212Z (PR #51). The merge commit imports PR #42's DatasetSourceConfig.max_rows/max_bytes fields, which the cap enforcement reads. Once PR #51 and PR #42 merge, this PR's diff converges to just source_caps.py, the ingest.py wiring, and the test file.

Dependency chain: #50 (prune module) → #51 (score_for_pruning) → #53 (this PR); #42 (config caps) also feeds #53 through the merge commit.

Test plan

tests/unit/test_source_caps_enforcement.py — 21 new tests, all pass.
- URI-prefix derivation for every supported plugin + the unknown-plugin branch.
- No-op branches: no_cap, no_prefix, under_cap.
- Eviction branches: max_rows, max_bytes, both caps stricter-wins.
- claude_code's two URI schemes (claude-code:// + claude-code-history://) both counted under one umbrella.
- Dataset-scope regression: two datasets share the claude-code:// prefix; enforcing cap on dataset 1 evicts only dataset-1 chunks. (Caught by the reviewer; fixed before this PR landed.)
No regressions in tests/unit/test_prune_scorer.py (22), tests/unit/test_selector_score_for_pruning.py (12), or tests/unit/test_dataset_source_caps.py (12).
ruff check + pyrefly typecheck clean (pre-commit hooks pass).

Disclosures (`uncertainty.md`)

Per-source-once vs per-document enforcement — plan said "every insert"; implementer chose "once per source per ingest cycle" to avoid scoring every candidate per file. A chatty source can briefly exceed cap by the whole batch before eviction fires at end-of-scan. Acceptable for the RFC's soft-cap framing.
URI-scheme conflation across plugin instances — two claude_code sources in the same dataset (e.g. personal + work) share the claude-code:// scheme and would be conflated. Rare in practice; documented.
Candidate pool capped at 10,000 — _DEFAULT_SCORING_POOL=10_000. Caps above this would see a truncated scoring window. Fine for typical configs.
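The interaction between the two caps and the scoring pool can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: `Chunk`, `select_evictions`, and the way the pool is drawn (simply the first `pool_size` chunks) are hypothetical, and the `score` field stands in for whatever `score_for_pruning(...)` returns.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    size_bytes: int
    score: float  # stand-in for the score_for_pruning(...) result


def select_evictions(
    chunks: list[Chunk],
    max_rows: int | None = None,
    max_bytes: int | None = None,
    pool_size: int = 10_000,
) -> list[int]:
    """Pick lowest-scoring chunk ids to evict until both caps hold."""
    rows = len(chunks)
    total = sum(c.size_bytes for c in chunks)
    # Only the first pool_size chunks are scored and eligible for eviction
    # (assumed pool selection; the real candidate query may differ).
    pool = sorted(chunks[:pool_size], key=lambda c: c.score)
    evicted: list[int] = []
    for chunk in pool:
        over_rows = max_rows is not None and rows > max_rows
        over_bytes = max_bytes is not None and total > max_bytes
        if not (over_rows or over_bytes):
            break  # under both caps, so the stricter cap has been satisfied
        evicted.append(chunk.chunk_id)
        rows -= 1
        total -= chunk.size_bytes
    return evicted
```

With a pool smaller than the overage, the loop runs out of candidates before the source is back under cap, which is the truncated scoring window the third disclosure describes.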

🤖 Generated with Claude Code

Closes the `DatasetSourceConfig max_rows / max_bytes` checkbox of RFC `rfc-corpus-growth-controls` (P1). Storage-only this PR: the eviction runtime in `ingest_once` lands in a follow-up.

Change
------

`corpus_forge/config.py`: two new optional fields on `DatasetSourceConfig`:
- `max_rows: int | None = Field(default=None, gt=0)`: per-source row cap.
- `max_bytes: int | None = Field(default=None, gt=0)`: per-source byte cap.

Default `None` on both = "uncapped." When set, the future eviction loop will enforce these AFTER each batch insert and evict the lowest-scoring rows from that source (LRU + score). The eviction honours `GrowthConfig.per_source_cap_default_rows` from PR #37 as a fallback when `max_rows is None` but the global default is non-zero; that fallback wiring belongs in the eviction-loop PR, not this one.

Plain ints (no human-readable string parsing). Future-compat: the `_parse_bytes` helper on PR #37's branch can be adopted here once both PRs merge if reviewers want `"10G"`-style strings on per-source caps too. For now the global cap accepts strings and the per-source caps take raw ints.

Tests
-----

`tests/unit/test_dataset_source_caps.py`: 11 tests across five classes:
- `TestDefaults`: both fields default to `None`.
- `TestMaxRowsBounds`: positive int accepted, zero / negative rejected.
- `TestMaxBytesBounds`: same.
- `TestCoexistence`: independently or both set.
- `TestBackwardsCompat`: existing configs without these fields still validate.

Verified
--------

- `uv run pytest tests/unit/test_dataset_source_caps.py tests/unit/test_config.py -q` → 17/17 in 0.17s.
- `uv run ruff check`, `uv run ruff format --check`, `uv run pyrefly check corpus_forge/config.py` → all clean.

The pre-existing `Lint + format + typecheck` CI failure on `main` (PR #31 fixes) will still fail this PR's lint job until #31 lands and this rebases. Same blast pattern as the other open Nightly PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
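The fallback rule described in that commit message (per-source `max_rows` first, then the global default when it is non-zero) could look roughly like the sketch below. The helper name `effective_max_rows` is hypothetical; the commit only says the wiring belongs in the eviction-loop PR, so this shows the stated rule and not the shipped code.

```python
from __future__ import annotations


def effective_max_rows(source_max_rows: int | None, default_rows: int) -> int | None:
    """Resolve the row cap for one source.

    source_max_rows: the per-source DatasetSourceConfig.max_rows value (None = unset).
    default_rows: stand-in for GrowthConfig.per_source_cap_default_rows, where 0
        is taken to mean "no global default" (an assumption based on the
        "non-zero" wording in the commit message).
    """
    if source_max_rows is not None:
        return source_max_rows  # explicit per-source cap always wins
    if default_rows > 0:
        return default_rows  # fall back to the global default
    return None  # uncapped
```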

Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the prune-tuned weighted combination from `corpus_forge/admin/prune.py` into a new public function on `corpus_forge.curation.selector` so future callers (CLI verb, MCP surface, EDA dashboards) can score candidates the same way without re-implementing the rubric.

- `corpus_forge.curation.selector.score_for_pruning(candidate, *, sub_scores, weights=None) -> tuple[float, dict[str, float]]`
- Default weights exposed as the public alias `PRUNE_WEIGHTS` (the internal `_PRUNE_WEIGHTS` stays private for the module's own use).
- Validates: weight keys (extra/missing), weight sum (within 1e-6 of 1.0), sub_score completeness. Raises `ValueError` on any mismatch.
- `corpus_forge/admin/prune.py` refactored: drops its local `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new function.

All 22 tests in `tests/unit/test_prune_scorer.py` pass UNMODIFIED (the regression net for the refactor). 12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`; no regressions in curation suites.

Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will retarget to `main` automatically when PR #50 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
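A rough sketch of the validation and weighted sum that commit message describes is shown below. The signature follows the one quoted above, but the weight names and values are hypothetical placeholders (the real `PRUNE_WEIGHTS` keys are not listed here), the returned dict is assumed to hold per-component contributions, and `candidate` is accepted but unused in this simplified version.

```python
from __future__ import annotations

# Hypothetical weight names and values, for illustration only.
PRUNE_WEIGHTS: dict[str, float] = {"staleness": 0.5, "redundancy": 0.3, "low_quality": 0.2}


def score_for_pruning(
    candidate: object,
    *,
    sub_scores: dict[str, float],
    weights: dict[str, float] | None = None,
) -> tuple[float, dict[str, float]]:
    """Return (total_score, per-component contributions) or raise ValueError."""
    weights = PRUNE_WEIGHTS if weights is None else weights
    # Weight keys must match the rubric exactly: no extras, none missing.
    if set(weights) != set(PRUNE_WEIGHTS):
        raise ValueError(f"weight keys must be exactly {sorted(PRUNE_WEIGHTS)}")
    # Weights must sum to 1.0 within the stated tolerance.
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0 (within 1e-6)")
    # Every weighted component needs a sub-score.
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub_scores: {sorted(missing)}")
    contributions = {name: weights[name] * sub_scores[name] for name in weights}
    return sum(contributions.values()), contributions
```

Returning the contributions alongside the total lets a caller explain why a candidate ranked where it did without recomputing the rubric.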

…st-caps-022000Z # Conflicts: # CHANGELOG.md

Fourth task of `.planning/rfcs/rfc-corpus-growth-controls.md`. When a `DatasetSourceConfig` has `max_rows` or `max_bytes` set, `ingest_once` now calls `enforce_source_caps()` once per source per ingest cycle. If the source is over its cap, the lowest-scoring chunks (per `score_for_pruning(...)`) are evicted until back under the cap. Cap-check failures are logged at WARNING level but never break ingest.

- New module `corpus_forge/admin/source_caps.py`:
  - `derive_source_uri_prefix(source_config) -> str | None`: per-plugin URI prefix table covering markdown_vault, claude_code (+ claude-code-history), opencode, gemini_cli, codex_cli, chatgpt_export, jsonl_chat, zotero (with user_id/group_id), and filesystem. Unknown plugins → None (skipped with WARNING).
  - `count_source_rows(backend, dataset_id, prefix)`: paramstyle-aware COUNT + SUM(LENGTH) across documents and conversations, correctly scoped to dataset_id and schema-prefix-aware (`corpus.chunks` for Postgres, bare `chunks` for SQLite).
  - `enforce_source_caps(backend, dataset_id, source_config)`: returns a `CapEnforcementReport` with structured outcome.
  - `_evict_lowest_scoring(...)`: resolves dataset_id → dataset_name via `find_dataset_name_by_id` (probe) or `_execute` (fallback), then passes the name to `_iter_curation_candidates(...)` so cross-dataset URI collisions cannot delete from the wrong dataset. Reuses prune.py's `_delete_chunks` dispatch.

21 unit tests cover URI-prefix derivation, no-op branches, both eviction branches (rows + bytes), claude_code dual-scheme summation, and a dataset-scope regression that locks the cross-dataset deletion safety guard.

Stacks on PR #51 (`score_for_pruning`) and PR #42 (`DatasetSourceConfig.max_rows/max_bytes`), merged into the base branch as the merge commit on top of the existing prune stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
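The "once per source, never break ingest" behaviour from that commit message can be sketched as a small post-scan loop. The helper `run_cap_checks`, the logger name, and the `enforce` callable are hypothetical stand-ins; the actual wiring inside `ingest_once` is not shown in this PR description.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("corpus_forge.ingest")  # hypothetical logger name


def run_cap_checks(
    sources: Iterable[str], enforce: Callable[[str], str]
) -> dict[str, Optional[str]]:
    """Call enforce(source) once per source after the scan; log and continue on failure."""
    outcomes: dict[str, Optional[str]] = {}
    for source in sources:
        try:
            outcomes[source] = enforce(source)
        except Exception as exc:  # deliberately broad: a cap-check error must not abort ingest
            logger.warning("cap check failed for source %s: %s", source, exc)
            outcomes[source] = None
    return outcomes
```

Catching broadly here trades precision for isolation: a bug in cap enforcement degrades to a logged warning and an uncapped cycle instead of a failed ingest.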

coderabbitai · 2026-05-23T18:04:39Z

Important

Review skipped

Draft detected.


ulmentflam and others added 4 commits May 22, 2026 09:42

Merge branch 'nightly/source-caps-20260522T132535Z' into nightly/inge…

1bc092e

…st-caps-022000Z # Conflicts: # CHANGELOG.md

ulmentflam mentioned this pull request May 25, 2026

feat(config): per-source max_rows / max_bytes caps (RFC P1) #42

Closed

4 tasks

ulmentflam changed the base branch from nightly/score-for-pruning-015212Z to nightly/prune-admin-011742Z May 25, 2026 01:55

ulmentflam wants to merge 4 commits into nightly/prune-admin-011742Z from nightly/ingest-caps-022000Z
