Skip to content

fix(query): re-probe Parquet schema when cache changes underneath a running engine#422

Merged
wesm merged 4 commits into
kenn-io:mainfrom
Lazare-42:fix/parquet-reprobe-on-cache-change
Jun 27, 2026
Merged

fix(query): re-probe Parquet schema when cache changes underneath a running engine#422
wesm merged 4 commits into
kenn-io:mainfrom
Lazare-42:fix/parquet-reprobe-on-cache-change

Conversation

@Lazare-42

Copy link
Copy Markdown
Contributor

What

The DuckDB query engine probes each Parquet table's optional columns once at construction and trusts that snapshot for the rest of the process lifetime. A long-running process (the serve daemon, the MCP server) goes stale when build-cache/sync rewrites the analytics cache with a different column set.

When that happens, the cached "column present" verdict puts a now-absent column into a SELECT * REPLACE (... AS message_type) list, which DuckDB rejects:

Binder Error: Column "message_type" in REPLACE list not found in FROM clause

Every Search/Aggregate call then fails until the process is restarted.

Why

Mixed-version writers (or full rebuilds with an older builder) can drop optional columns from the cache. The engine assumed the schema was immutable for its lifetime; it isn't for a server whose cache is rebuilt out of band.

How

  • Fingerprint the cache (per-probed-table file count + size + mtime).
  • When the fingerprint changes, re-probe optional columns and re-register the SQL views on demand (cheap no-op on the common unchanged path).
  • Guard optionalCols with an RWMutex since refresh can now race with reads.
  • Regression test rebuilds the cache underneath a live engine and asserts the query recovers instead of crashing.

🤖 Generated with Claude Code

The DuckDB engine probed each Parquet table's optional columns once at
construction and trusted that snapshot for the process lifetime. A
long-running mcp-http server therefore went stale when build-cache/sync
rewrote the analytics cache with a different column set: the cached
"column present" verdict put a now-absent column into a SELECT * REPLACE
list, which DuckDB rejects with

  Binder Error: Column "message_type" in REPLACE list not found in FROM clause

crashing every search_messages and aggregate call until restart.

Fingerprint the cache (per-table file count + size + mtime) and re-probe
optional columns (and re-register views) on demand when it changes.
Guard optionalCols with a RWMutex since refresh now happens concurrently
with reads. Add a regression test that rebuilds the cache underneath a
live engine and asserts the query recovers instead of crashing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (393f632)

Stale DuckDB search cache remains a Medium correctness issue; no security findings reported.

Medium

  • internal/query/duckdb.go:2446 - SearchFastWithStats serves cached temp tables keyed only by search conditions. After ensureFreshOptionalCols detects an analytics cache rebuild, repeating the same search can return stale rows, counts, and stats from pre-rebuild Parquet data. Refresh/check the cache fingerprint before the cache-hit branch and include it in the search cache key, or drop/re-materialize the search cache whenever the Parquet fingerprint changes.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 2m33s), codex_security (codex/security, done, 40s) | Total: 3m19s

@wesm

wesm commented Jun 26, 2026

Copy link
Copy Markdown
Member

looking

SearchFastWithStats kept a materialized temp table alive for pagination and keyed it only by search predicates. When build-cache or sync rewrote the Parquet analytics files under a long-running engine, the optional-column probe could refresh while the search path still served rows, counts, and stats from the pre-rebuild temp table.\n\nInclude the observed Parquet cache fingerprint in the search cache key and refresh that fingerprint before considering a cache hit, so repeated searches rematerialize against current analytics data after an out-of-band cache rebuild. The regression rewrites Parquet beneath a live engine and repeats the same fast search to prove counts, stats, and row ordering come from the rebuilt cache.\n\nGenerated with Codex (GPT-5)\nCo-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (1dd0bb7)

Summary verdict: Changes need fixes before merge due to stale cache invalidation risks.

Medium

  • internal/query/duckdb.go:177NewDuckDBEngine records the cache fingerprint after probing Parquet schemas. If the cache is rebuilt between the probe and assignment, optionalCols can describe old files while cacheFP matches new files, causing ensureFreshOptionalCols to treat stale schema data as fresh.

    • Fix: Capture the fingerprint before probing, or compare fingerprints before and after the probe and retry until stable before storing optionalCols and cacheFP.
  • internal/query/duckdb.go:287 — The fingerprint used to invalidate SearchFastWithStats cache only includes messages, participants, conversations, and sources. Results and cached stats also depend on message_recipients, labels, message_labels, and attachments, so rebuilding only those files can leave cached rows or stats stale.

    • Fix: Include all RequiredParquetDirs in the fingerprint, with the messages directory using the partitioned glob.

Panel: ci_default_security | Synthesis: codex, 12s | Members: codex_default (codex/default, done, 3m56s), codex_security (codex/security, done, 1m4s) | Total: 5m12s

Search cache invalidation now depends on the same Parquet directory set required for a complete analytics cache. Without the auxiliary tables in the fingerprint, attachment, label, recipient, or relationship rebuilds could leave SearchFastWithStats serving stale cached stats or rows for repeated searches.\n\nSchema probing now also records optional columns only from a stable fingerprint window. If build-cache rewrites files while a long-running engine is probing schemas, the probe retries instead of storing columns from one cache generation with the fingerprint from another.\n\nGenerated with Codex (GPT-5)\nCo-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 27, 2026

Copy link
Copy Markdown

roborev: Combined Review (e673436)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 4m23s), codex_security (codex/security, done, 12s) | Total: 4m35s

CI runs the custom testify-helper-check analyzer as part of lint-ci, while the local lint target only runs golangci-lint. The cache drift regression tests crossed the direct-package-call threshold, so the CI-only analyzer rejected them even though the tests themselves passed.\n\nUse local assert and require helpers in the assertion-heavy tests so the drift coverage follows the repository's testify style rule.\n\nGenerated with Codex (GPT-5)\nCo-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 27, 2026

Copy link
Copy Markdown

roborev: Combined Review (d52c39e)

No Medium, High, or Critical findings were reported.

Both reviews are clean at the requested severity threshold.


Panel: ci_default_security | Synthesis: codex, 5s | Members: codex_default (codex/default, done, 6m21s), codex_security (codex/security, done, 14s) | Total: 6m40s

@wesm wesm merged commit a120738 into kenn-io:main Jun 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants