[recipes] Synthesis capture — Query-as-Ingest pattern#212

Draft
alanshurafa wants to merge 7 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/synthesis-capture
@alanshurafa

Collaborator

Depends on

This is opened as a draft. Flip to ready-for-review once its dependencies land on main.

What this adds

`recipes/synthesis-capture/` — Karpathy-inspired Query-as-Ingest pattern:

  • `capture_synthesis` MCP tool handler
  • `POST /synthesis` REST endpoint
  • Anti-loop rules: must have 3+ sources, at least one primary parent, sources cannot themselves be `source_type='synthesis'`
  • Reserved metadata keys are stamped AFTER the caller merge (prevents spoofing)
  • Provenance mirrored into `metadata.provenance` for stock-RPC compatibility
  • Input caps on `content` (50KB), `question` (2000 chars), `source_thought_ids` (50 items), `topics`/`tags` (20 items each, 200 chars per item)
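
The anti-loop rules listed above could be checked roughly as follows. This is a minimal sketch, not the recipe's actual handler code: the `SourceThought` shape, the `checkAntiLoop` name, and the use of `source_type === "primary"` to identify a primary parent are all assumptions made for illustration. Only the `'synthesis'` value and the 3+ source threshold come from this PR.

```typescript
// Hypothetical row shape: only the fields this check needs.
interface SourceThought {
  id: string;
  source_type: string; // "synthesis" comes from the PR; "primary" is an assumed label
}

const MIN_SOURCES = 3;

// Returns a list of rule violations; an empty list means the capture may proceed.
function checkAntiLoop(sources: SourceThought[]): string[] {
  const violations: string[] = [];

  if (sources.length < MIN_SOURCES) {
    violations.push(`need at least ${MIN_SOURCES} sources, got ${sources.length}`);
  }

  // No synthesis-of-synthesis: a source may not itself be a synthesis.
  const synthesized = sources.filter((s) => s.source_type === "synthesis");
  if (synthesized.length > 0) {
    const ids = synthesized.map((s) => s.id).join(", ");
    violations.push(`sources must not be syntheses: ${ids}`);
  }

  if (!sources.some((s) => s.source_type === "primary")) {
    violations.push("need at least one primary parent");
  }

  return violations;
}

const sources: SourceThought[] = [
  { id: "t1", source_type: "primary" },
  { id: "t2", source_type: "primary" },
  { id: "t3", source_type: "synthesis" },
];
console.log(checkAntiLoop(sources)); // one violation: t3 is itself a synthesis
```

Returning every violation at once, rather than failing on the first, lets a caller fix a rejected capture in a single retry.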

Known limitation (documented)

Stock `search_thoughts` / `list_thoughts` MCP tools don't expose thought IDs, so a model following the README's "search → synthesize → capture" flow can't populate `source_thought_ids` without another tool. Until a base update lands, users supply IDs manually or through a custom read tool that returns them.

Review history

2 fix rounds + 2 Codex verify rounds + 1 Claude review. Final Codex clean.

See `recipes/synthesis-capture/README.md` + `DEPENDENCIES.md`.

Adds capture_synthesis MCP tool + POST /synthesis REST endpoint so complex query results can be captured back as new thoughts with full provenance. Enforces anti-loop (no synthesis-of-synthesis), primary-parent, and 3+ source constraints.
…erge

Previously the REST handler set source: rest_synthesis and then called
Object.assign(mergedMetadata, body.metadata), letting a caller overwrite
the reserved source identifier (and by extension spoof the write channel
for downstream filtering/reporting recipes).

Restructure the merge so body.metadata is applied first, handler-controlled
fields (question/topics/tags) next, and reserved provenance keys
(source, source_type, derivation_layer, derivation_method, derived_from)
stamped last via explicit key-assign. Caller-supplied metadata can no
longer spoof provenance fields.
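
A minimal sketch of the merge order described in this commit message. The `buildMetadata` function, its parameter types, and the sample values for `derivation_layer` and `derivation_method` are assumptions for illustration. Only the key names and the `rest_synthesis` source value come from the commit message.

```typescript
type Metadata = Record<string, unknown>;

interface HandlerFields {
  question: string;
  topics: string[];
  tags: string[];
}

// Key names come from the PR; the value types are assumptions.
interface ReservedProvenance {
  source: string;
  source_type: string;
  derivation_layer: string;
  derivation_method: string;
  derived_from: string[];
}

function buildMetadata(
  caller: Metadata,
  handler: HandlerFields,
  reserved: ReservedProvenance
): Metadata {
  const merged: Metadata = {};

  // 1. Caller-supplied metadata first, so everything below can override it.
  Object.assign(merged, caller);

  // 2. Handler-controlled fields next.
  merged["question"] = handler.question;
  merged["topics"] = handler.topics;
  merged["tags"] = handler.tags;

  // 3. Reserved provenance keys last, one explicit assignment per key.
  merged["source"] = reserved.source;
  merged["source_type"] = reserved.source_type;
  merged["derivation_layer"] = reserved.derivation_layer;
  merged["derivation_method"] = reserved.derivation_method;
  merged["derived_from"] = reserved.derived_from;

  return merged;
}

const stamped = buildMetadata(
  { source: "trusted_import", project: "alpha" }, // caller tries to spoof `source`
  { question: "What changed?", topics: ["recipes"], tags: [] },
  {
    source: "rest_synthesis",
    source_type: "synthesis",
    derivation_layer: "derived",      // placeholder value
    derivation_method: "synthesis",   // placeholder value
    derived_from: ["t1", "t2", "t3"],
  }
);
console.log(stamped["source"]); // "rest_synthesis"
```

Non-reserved caller keys such as `project` survive the merge; only the reserved keys are overwritten.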
Add upper bounds on user/LLM-controlled input to prevent accidental or
adversarial floods into Postgres and the embedding API:

- content:         max 50KB (~10k words)
- source_thought_ids: max 50 items (keeps .in() query plan sane)
- question:        max 2000 chars
- topics/tags:     max 20 entries, 200 chars each

MCP path enforces via Zod max() / array.max(); REST path mirrors the same
limits imperatively (returns 413 Payload Too Large) since it has no Zod
schema. The caps are documented inline on both paths; adjust both sides
together if longer syntheses are needed.
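
The imperative REST-side checks might look like the sketch below. The `SynthesisBody` shape and the function names are hypothetical, and reading 50KB as `50 * 1024` UTF-8 bytes is an assumption. The numeric caps and the 413 status come from the commit message.

```typescript
// Caps from the commit message. Treating 50KB as 50 * 1024 bytes is an assumption.
const LIMITS = {
  contentBytes: 50 * 1024,
  sourceThoughtIds: 50,
  questionChars: 2000,
  listItems: 20,
  listItemChars: 200,
};

// Hypothetical request body shape for POST /synthesis.
interface SynthesisBody {
  content: string;
  source_thought_ids: string[];
  question?: string;
  topics?: string[];
  tags?: string[];
}

// UTF-8 byte length, computed by hand so the sketch needs no runtime globals.
// Assumes well-formed surrogate pairs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // a surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

// Returns a message for the first exceeded cap, or null when the body fits.
function findCapViolation(body: SynthesisBody): string | null {
  if (utf8ByteLength(body.content) > LIMITS.contentBytes) return "content exceeds 50KB";
  if (body.source_thought_ids.length > LIMITS.sourceThoughtIds) {
    return "source_thought_ids exceeds 50 items";
  }
  if ((body.question ?? "").length > LIMITS.questionChars) return "question exceeds 2000 chars";

  const lists: [string, string[]][] = [
    ["topics", body.topics ?? []],
    ["tags", body.tags ?? []],
  ];
  for (const [field, items] of lists) {
    if (items.length > LIMITS.listItems) return `${field} exceeds 20 entries`;
    if (items.some((item) => item.length > LIMITS.listItemChars)) {
      return `${field} has an entry over 200 chars`;
    }
  }
  return null;
}

// A REST handler would map any violation to 413 Payload Too Large.
function capStatus(body: SynthesisBody): number {
  return findCapViolation(body) === null ? 200 : 413;
}
```

Keeping the numbers in one `LIMITS` object is one way to make the "adjust both sides together" step harder to forget, since a Zod schema on the MCP path could read the same constants.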
…bility

The stock upsert_thought RPC on origin/main only persists
p_payload.metadata and silently drops top-level keys like source_type,
derivation_layer, derivation_method, and derived_from. Until the sibling
provenance-chains recipe lands with an updated RPC, every synthesis
written against the stock install loses its provenance fields — and
the anti-loop safety check becomes vacuous because no synthesis row
ever gets tagged source_type='synthesis' in a queryable column.

Fix: mirror the same four provenance fields into metadata.provenance.*
in both handlers. On the patched RPC, top-level fields still populate
the dedicated columns (no regression). On the stock RPC, the metadata
mirror is the ONLY durable copy — callers and future readers can fall
back to thoughts.metadata->'provenance' to reconstruct chains.
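
A sketch of the mirror on the write side and the fallback on the read side. The `ThoughtRow` shape, both function names, and the sample values are assumptions. Only the four field names and the `metadata.provenance` location come from the commit message.

```typescript
interface Provenance {
  source_type: string;
  derivation_layer: string;
  derivation_method: string;
  derived_from: string[];
}

type MetadataWithProvenance = { provenance?: Provenance; [key: string]: unknown };

// Hypothetical row shape: the dedicated columns stay null on a stock install.
interface ThoughtRow {
  source_type: string | null;
  derivation_layer: string | null;
  derivation_method: string | null;
  derived_from: string[] | null;
  metadata: MetadataWithProvenance;
}

// Write side: send the fields top-level AND mirrored under metadata.provenance.
function buildPayload(content: string, metadata: Record<string, unknown>, p: Provenance) {
  const mirrored: MetadataWithProvenance = { ...metadata, provenance: { ...p } };
  return { content, ...p, metadata: mirrored };
}

// Read side: prefer the dedicated columns, fall back to the metadata mirror.
function readProvenance(row: ThoughtRow): Provenance | null {
  if (
    row.source_type !== null &&
    row.derivation_layer !== null &&
    row.derivation_method !== null &&
    row.derived_from !== null
  ) {
    return {
      source_type: row.source_type,
      derivation_layer: row.derivation_layer,
      derivation_method: row.derivation_method,
      derived_from: row.derived_from,
    };
  }
  return row.metadata.provenance ?? null;
}

const payload = buildPayload("Summary of three notes", { question: "What changed?" }, {
  source_type: "synthesis",
  derivation_layer: "derived",    // placeholder value
  derivation_method: "synthesis", // placeholder value
  derived_from: ["t1", "t2", "t3"],
});

// Stand-in for a stock install: only payload.metadata survives the write.
const stockRow: ThoughtRow = {
  source_type: null,
  derivation_layer: null,
  derivation_method: null,
  derived_from: null,
  metadata: payload.metadata,
};
console.log(readProvenance(stockRow)?.source_type); // "synthesis"
```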

Also aligns MCP-side embedding soft-fail with REST: both paths now
return a success result with a warning message when the embedding
patch write fails. Previously MCP returned isError: true, which would
trigger spurious retries of an already-successful capture.
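
The soft-fail behavior could be sketched as below. The `captureWithSoftFail` wrapper and its two callback parameters are hypothetical stand-ins for the recipe's real write calls. The result shape follows the MCP tool-result convention of text content plus an optional `isError` flag.

```typescript
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function describe(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

async function captureWithSoftFail(
  saveThought: () => Promise<string>,          // hypothetical: resolves to the new thought ID
  saveEmbedding: (id: string) => Promise<void> // hypothetical: the embedding patch write
): Promise<ToolResult> {
  let id: string;
  try {
    id = await saveThought();
  } catch (err) {
    // Nothing was written, so a hard error (and a retry) is appropriate.
    return {
      content: [{ type: "text", text: `Capture failed: ${describe(err)}` }],
      isError: true,
    };
  }

  try {
    await saveEmbedding(id);
  } catch (err) {
    // The thought is already stored: report success with a warning so the
    // caller does not retry and create a duplicate capture.
    return {
      content: [
        { type: "text", text: `Captured ${id}. Warning: embedding not stored (${describe(err)}).` },
      ],
    };
  }

  return { content: [{ type: "text", text: `Captured ${id}.` }] };
}

captureWithSoftFail(
  () => Promise.resolve("t9"),
  () => Promise.reject(new Error("rate limited"))
).then((result) => console.log(result.isError === true, result.content[0].text));
```

Only a failure of the primary write sets `isError`; a failed embedding write is downgraded to a warning inside a successful result.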
Add DEPENDENCIES.md covering two unresolved couplings:

1. Sibling recipe provenance-chains is unmerged. Explains the stock
   upsert_thought RPC dropping top-level keys, the metadata.provenance.*
   mirror that keeps this recipe workable in the meantime, and the exact
   cleanup path (TODO(synthesis-capture) markers) once the patched RPC
   lands.

2. Stock search_thoughts / list_thoughts don't expose row IDs, so the
   advertised MCP flow (search, synthesize, capture) can't actually be
   completed by a model on its own. Document workarounds (manual ID
   injection, custom read tool) until a base update lands.

Also add a Known Limitations section to README.md summarizing both
blockers plus the new input caps and embedding soft-fail semantics,
linking out to DEPENDENCIES.md for detail.
@github-actions (bot) added the `recipe` label (Contribution: step-by-step recipe) on Apr 19, 2026
@alanshurafa added the `area: recipes` (Review area: recipes) and `alan-reviewed` (Reviewed by Alan Shurafa in Community Reviewer role) labels on May 20, 2026