feat(ingest): per-source cap enforcement — rfc-corpus-growth-controls (task 4/9) #53

Draft
ulmentflam wants to merge 4 commits into nightly/prune-admin-011742Z from nightly/ingest-caps-022000Z

Conversation

@ulmentflam
Owner

Summary

Fourth task of rfc-corpus-growth-controls.md. When a DatasetSourceConfig has max_rows or max_bytes set, ingest_once now enforces the caps by evicting the lowest-scoring chunks (per score_for_pruning(...)) until the source is back under both caps.

  • New module corpus_forge/admin/source_caps.py:

    • derive_source_uri_prefix(source_config) -> str | None — per-plugin URI prefix table covering markdown_vault, claude_code (+ claude-code-history), opencode, gemini_cli, codex_cli, chatgpt_export, jsonl_chat, zotero (with user_id / group_id), and filesystem. Unknown plugins → None (skipped with one-line WARNING).
    • count_source_rows(backend, dataset_id, prefix) — paramstyle-aware COUNT + SUM(LENGTH) across documents and conversations, correctly scoped to dataset_id and schema-prefix-aware (corpus.chunks for Postgres, bare chunks for SQLite).
    • enforce_source_caps(backend, dataset_id, source_config) — returns a CapEnforcementReport with structured outcome (no_cap / no_prefix / under_cap / evicted_max_rows / evicted_max_bytes).
    • _evict_lowest_scoring(...): resolves dataset_id → dataset_name via find_dataset_name_by_id (probe) or _execute (fallback), then passes the name to _iter_curation_candidates(...) so cross-dataset URI collisions cannot delete from the wrong dataset. Reuses prune.py's _delete_chunks dispatch.
  • corpus_forge/ingest.py: cap enforcement runs once per source per ingest cycle (post-scan), not per-document. Failure-isolated try/except: cap-check errors emit a WARNING but never break ingest.
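
The eviction rule described above can be sketched in memory as follows. This is a minimal illustration, not the code in source_caps.py: `Chunk`, `evict_until_under_caps`, and the convention that a higher score means "more worth keeping" are assumptions made for the example, and the real implementation works against the storage backend rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a stored chunk belonging to one source."""
    chunk_id: int
    size_bytes: int
    score: float  # assumption: higher means more worth keeping


def evict_until_under_caps(chunks, max_rows=None, max_bytes=None):
    """Return (kept, evicted), dropping the lowest-scoring chunks first."""
    if max_rows is None and max_bytes is None:
        return list(chunks), []  # nothing configured: the no_cap case

    # Best chunks first, so eviction candidates sit at the end of the list.
    kept = sorted(chunks, key=lambda c: c.score, reverse=True)
    total_bytes = sum(c.size_bytes for c in kept)
    evicted = []

    def over_cap():
        too_many_rows = max_rows is not None and len(kept) > max_rows
        too_many_bytes = max_bytes is not None and total_bytes > max_bytes
        return too_many_rows or too_many_bytes

    # Keep evicting until BOTH caps are satisfied, so the stricter cap wins.
    while kept and over_cap():
        victim = kept.pop()
        total_bytes -= victim.size_bytes
        evicted.append(victim)
    return kept, evicted
```

Checking both caps in one loop condition is what makes the stricter cap win when both are set: eviction only stops once neither limit is exceeded.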

Stacking

This PR's base is nightly/score-for-pruning-015212Z (PR #51). The merge commit imports PR #42's DatasetSourceConfig.max_rows/max_bytes fields, which the cap-enforcement reads. When PR #51 and PR #42 merge, this PR's diff converges to just source_caps.py + the ingest.py wiring + the test file.

Dependency chain: #50 (prune module) → #51 (score_for_pruning) → #53 (this PR), with #42 (config caps) → #53 as the second parent.

Test plan

  • tests/unit/test_source_caps_enforcement.py — 21 new tests, all pass.
    • URI-prefix derivation for every supported plugin + the unknown-plugin branch.
    • No-op branches: no_cap, no_prefix, under_cap.
    • Eviction branches: max_rows, max_bytes, both caps stricter-wins.
    • claude_code's two URI schemes (claude-code:// + claude-code-history://) both counted under one umbrella.
    • Dataset-scope regression: two datasets share the claude-code:// prefix; enforcing cap on dataset 1 evicts only dataset-1 chunks. (Caught by the reviewer; fixed before this PR landed.)
  • No regressions in tests/unit/test_prune_scorer.py (22), tests/unit/test_selector_score_for_pruning.py (12), or tests/unit/test_dataset_source_caps.py (12).
  • ruff check + pyrefly typecheck clean (pre-commit hooks pass).
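
The dual-scheme and dataset-scope cases above can be reproduced with a small SQLite sketch. The flattened `chunks(dataset_id, source_uri, text)` table and the helper `count_rows_for_prefixes` are hypothetical; the real count_source_rows reads across documents and conversations and handles paramstyle and schema prefixes, none of which is modelled here.

```python
import sqlite3


def count_rows_for_prefixes(conn, dataset_id, prefixes):
    """Return (row_count, total_length) for one dataset across URI prefixes."""
    rows = 0
    total_length = 0
    for prefix in prefixes:
        # substr() gives a literal prefix match, so "%" or "_" inside a URI
        # cannot act as LIKE wildcards. LENGTH() on TEXT counts characters.
        n, length = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(LENGTH(text)), 0) FROM chunks "
            "WHERE dataset_id = ? AND substr(source_uri, 1, ?) = ?",
            (dataset_id, len(prefix), prefix),
        ).fetchone()
        rows += n
        total_length += length
    return rows, total_length


# Hypothetical schema and data: two datasets sharing the claude-code:// scheme.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (dataset_id INTEGER, source_uri TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        (1, "claude-code://a", "abcd"),
        (1, "claude-code-history://b", "ef"),
        (2, "claude-code://c", "xyz"),  # same scheme, different dataset
    ],
)
claude_prefixes = ["claude-code://", "claude-code-history://"]
counts = count_rows_for_prefixes(conn, 1, claude_prefixes)
```

Summing per prefix is only safe when no prefix is itself a prefix of another, which holds for the two claude_code schemes; overlapping prefixes would be double-counted by this sketch.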

Disclosures (uncertainty.md)

  • Per-source-once vs per-document enforcement — plan said "every insert"; implementer chose "once per source per ingest cycle" to avoid scoring every candidate per file. A chatty source can briefly exceed cap by the whole batch before eviction fires at end-of-scan. Acceptable for the RFC's soft-cap framing.
  • URI-scheme conflation across plugin instances — two claude_code sources in the same dataset (e.g. personal + work) share the claude-code:// scheme and would be conflated. Rare in practice; documented.
  • Candidate pool capped at 10,000 (_DEFAULT_SCORING_POOL=10_000). Caps above this would see a truncated scoring window. Fine for typical configs.
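
To put a number on the first disclosure, here is a back-of-the-envelope sketch of one ingest cycle under end-of-scan enforcement. `simulate_cycle` is a hypothetical helper, and it assumes the whole batch is inserted before the single cap check runs.

```python
def simulate_cycle(rows_before, batch_size, max_rows):
    """Return (peak_rows, rows_after) for one scan followed by one cap check."""
    peak_rows = rows_before + batch_size   # nothing is evicted during the scan
    rows_after = min(peak_rows, max_rows)  # single eviction pass at end of scan
    return peak_rows, rows_after


# A source already at its 1,000-row cap that ingests 250 new rows.
peak, after = simulate_cycle(rows_before=1000, batch_size=250, max_rows=1000)
```

Under these assumptions the source briefly holds 1,250 rows, then settles back to 1,000 once the eviction pass completes.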

🤖 Generated with Claude Code

ulmentflam and others added 4 commits May 22, 2026 09:42
Closes the `DatasetSourceConfig max_rows / max_bytes` checkbox of
RFC `rfc-corpus-growth-controls` (P1). Storage-only this PR — the
eviction runtime in `ingest_once` lands in a follow-up.

Change
------

`corpus_forge/config.py`: two new optional fields on
`DatasetSourceConfig`:

- `max_rows: int | None = Field(default=None, gt=0)` — per-source
  row cap.
- `max_bytes: int | None = Field(default=None, gt=0)` — per-source
  byte cap.

Default `None` on both = "uncapped." When set, the future eviction
loop will enforce these AFTER each batch insert and evict the
lowest-scoring rows from that source (LRU + score). The eviction
honours `GrowthConfig.per_source_cap_default_rows` from PR #37 as a
fallback when `max_rows is None` but the global default is non-zero;
that fallback wiring belongs in the eviction-loop PR, not this one.

Plain ints (no human-readable string parsing). Future-compat: the
`_parse_bytes` helper on PR #37's branch can be adopted here once
both PRs merge if reviewers want `"10G"`-style strings on per-source
caps too — for now the global cap accepts strings, the per-source
caps take raw ints.

Tests
-----

`tests/unit/test_dataset_source_caps.py` — 11 tests across five
classes:

- `TestDefaults` — both fields default to `None`.
- `TestMaxRowsBounds` — positive int accepted, zero / negative
  rejected.
- `TestMaxBytesBounds` — same.
- `TestCoexistence` — independently or both set.
- `TestBackwardsCompat` — existing configs without these fields
  still validate.

Verified
--------

- `uv run pytest tests/unit/test_dataset_source_caps.py
  tests/unit/test_config.py -q` → 17/17 in 0.17s.
- `uv run ruff check`, `uv run ruff format --check`,
  `uv run pyrefly check corpus_forge/config.py` → all clean.

The pre-existing `Lint + format + typecheck` CI failure on `main`
(PR #31 fixes) will still fail this PR's lint job until #31 lands
and this rebases. Same blast pattern as the other open Nightly PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second task of `.planning/rfcs/rfc-corpus-growth-controls.md`. Move the
prune-tuned weighted combination from `corpus_forge/admin/prune.py`
into a new public function on `corpus_forge.curation.selector` so
future callers (CLI verb, MCP surface, EDA dashboards) can score
candidates the same way without re-implementing the rubric.

- `corpus_forge.curation.selector.score_for_pruning(candidate, *,
  sub_scores, weights=None) -> tuple[float, dict[str, float]]`
- Default weights exposed as the public alias `PRUNE_WEIGHTS` (the
  internal `_PRUNE_WEIGHTS` stays private for the module's own use).
- Validates: weight keys (extra/missing), weight sum (within 1e-6 of
  1.0), sub_score completeness. Raises `ValueError` on any mismatch.
- `corpus_forge/admin/prune.py` refactored: drops its local
  `_PRUNE_WEIGHTS` and inline weighted-sum, delegates to the new
  function. All 22 tests in `tests/unit/test_prune_scorer.py` pass
  UNMODIFIED (the regression net for the refactor).
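
A minimal sketch of the validation rules listed above, under loudly stated assumptions: the weight keys in `EXAMPLE_WEIGHTS` are invented for illustration (the real `PRUNE_WEIGHTS` contents are not shown in this message), `weighted_score` is a hypothetical name, and returning per-key contributions as the second tuple element is a guess at what the `dict[str, float]` holds.

```python
import math

# Hypothetical weights; NOT the real PRUNE_WEIGHTS.
EXAMPLE_WEIGHTS = {"recency": 0.5, "quality": 0.3, "usage": 0.2}


def weighted_score(sub_scores, weights=None):
    """Return (total, per-key contributions) after validating the inputs."""
    weights = EXAMPLE_WEIGHTS if weights is None else weights

    # Custom weights must use exactly the default key set (no extra, no missing).
    if set(weights) != set(EXAMPLE_WEIGHTS):
        raise ValueError(f"unexpected weight keys: {sorted(weights)}")

    # Weights must sum to 1.0 within a 1e-6 tolerance.
    if not math.isclose(sum(weights.values()), 1.0, rel_tol=0.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1.0")

    # Every weighted component needs a sub-score.
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")

    contributions = {key: weights[key] * sub_scores[key] for key in weights}
    return sum(contributions.values()), contributions


total, parts = weighted_score({"recency": 1.0, "quality": 0.5, "usage": 0.0})
```

With these example weights the total works out to 0.5 × 1.0 + 0.3 × 0.5 + 0.2 × 0.0 = 0.65, up to floating-point rounding.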

12 new unit tests in `tests/unit/test_selector_score_for_pruning.py`;
no regressions in curation suites.

Stacked on `nightly/prune-admin-011742Z` (PR #50). GitHub will
retarget to `main` automatically when PR #50 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fourth task of `.planning/rfcs/rfc-corpus-growth-controls.md`. When
a `DatasetSourceConfig` has `max_rows` or `max_bytes` set,
`ingest_once` now calls `enforce_source_caps()` once per source per
ingest cycle. If the source is over its cap, the lowest-scoring
chunks (per `score_for_pruning(...)`) are evicted until back under
the cap. Cap-check failures are logged at WARNING level but never
break ingest.

- New module `corpus_forge/admin/source_caps.py`:
  - `derive_source_uri_prefix(source_config) -> str | None` —
    per-plugin URI prefix table covering markdown_vault, claude_code
    (+ claude-code-history), opencode, gemini_cli, codex_cli,
    chatgpt_export, jsonl_chat, zotero (with user_id/group_id), and
    filesystem. Unknown plugins → None (skipped with WARNING).
  - `count_source_rows(backend, dataset_id, prefix)` — paramstyle-
    aware COUNT + SUM(LENGTH) across documents and conversations,
    correctly scoped to dataset_id and schema-prefix-aware
    (`corpus.chunks` for Postgres, bare `chunks` for SQLite).
  - `enforce_source_caps(backend, dataset_id, source_config)` —
    returns a `CapEnforcementReport` with structured outcome.
  - `_evict_lowest_scoring(...)` — resolves dataset_id → dataset_name
    via `find_dataset_name_by_id` (probe) or `_execute` (fallback),
    then passes the name to `_iter_curation_candidates(...)` so
    cross-dataset URI collisions cannot delete from the wrong
    dataset. Reuses prune.py's `_delete_chunks` dispatch.

21 unit tests cover URI-prefix derivation, no-op branches, both
eviction branches (rows + bytes), claude_code dual-scheme summation,
and a dataset-scope regression that locks the cross-dataset deletion
safety guard.

Stacks on PR #51 (`score_for_pruning`) and PR #42
(`DatasetSourceConfig.max_rows/max_bytes`), merged into the base
branch as the merge commit on top of the existing prune stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 262cd69b-7514-4a73-a992-58843e97f43b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ulmentflam ulmentflam changed the base branch from nightly/score-for-pruning-015212Z to nightly/prune-admin-011742Z May 25, 2026 01:55