Fixes for "semcode-index --lore" #18

Open
chucklever wants to merge 4 commits into facebookexperimental:main from chucklever:main
@chucklever
Contributor

Address several recently introduced inefficiencies and a few long-standing bugs in the "--lore" command line option.

I'm not certain if I've completely worked out the CLA issues. Let me know.

chucklever and others added 4 commits February 11, 2026 16:32
Lore indexing required scanning the email table to determine which
commits had already been processed. The lore table contains parsed
email records, not a direct mapping of indexed commits, making
duplicate detection both slow and unreliable.

A dedicated lore_indexed_commits table now tracks processed git
commit SHAs. After successful insertion of lore emails, commit
SHAs are recorded in this table. Subsequent runs load the full
table into a HashSet to skip already-processed commits, avoiding
redundant downloads and parsing of mailing list archives. The
table contains only short SHA strings, so reading it entirely
into memory is inexpensive. The table has a single git_commit_sha
column and integrates into schema initialization and repair.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
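
The skip logic described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names are hypothetical, and a plain slice of strings stands in for the rows of the lore_indexed_commits table.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for reading the `lore_indexed_commits` table:
/// each row is assumed to hold one `git_commit_sha` string.
fn load_indexed_commits(rows: &[&str]) -> HashSet<String> {
    rows.iter().map(|sha| sha.to_string()).collect()
}

/// Return only the commits that are not yet recorded, preserving input order.
fn commits_to_process<'a>(all: &[&'a str], indexed: &HashSet<String>) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|sha| !indexed.contains(*sha))
        .collect()
}

/// Record processed commits. Per the commit message, this should happen
/// only after the corresponding lore emails were inserted successfully.
fn record_indexed(indexed: &mut HashSet<String>, processed: &[&str]) {
    indexed.extend(processed.iter().map(|sha| sha.to_string()));
}
```

Because membership tests on a HashSet are constant time on average, the per-commit check stays cheap even when the table grows, which is the property the commit relies on when it loads the whole table into memory.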
The --lore refresh path uses buffer_unordered() to index
up to four archives concurrently.  Each archive pipeline
spawns its own set of database inserter tasks, all sharing
the same LanceDB connection and its underlying DataFusion
memory pool.

With large-row archives such as oe-kbuild-all, concurrent
merge_insert operations from separate pipelines exhaust
the memory pool simultaneously.  Neither pipeline can
make progress because each holds a portion of the pool
while waiting for more, producing a resource deadlock
visible as two frozen progress bars with unchanging
"Inserted N emails" counts.

Replace buffer_unordered() with a sequential loop,
matching the approach already used by the --lore <args>
initial-clone path.  The git fetch for each archive still
runs inline, so the only cost of serializing is that fetch
latency no longer overlaps across archives; the database
insertion -- which dominates wall-clock time -- no longer
contends for the shared memory pool.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
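
The stall described in this commit can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the Pool type and both functions are hypothetical and do not mirror DataFusion's memory pool API. It only shows why pipelines that each hold part of a fixed pool while waiting for more can all stop, and why running them one at a time avoids that.

```rust
/// Toy model of a shared, fixed-size memory pool (hypothetical; not the
/// DataFusion API).
struct Pool {
    capacity: usize,
    used: usize,
}

impl Pool {
    /// Grant the reservation only if it fits; otherwise the caller must wait.
    fn try_reserve(&mut self, bytes: usize) -> bool {
        if self.used + bytes <= self.capacity {
            self.used += bytes;
            true
        } else {
            false
        }
    }

    fn release(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

/// Two pipelines interleave: each takes one reservation, then each asks for
/// a second while still holding the first. Returns true when both stall.
fn interleaved_stalls(capacity: usize, step: usize) -> bool {
    let mut pool = Pool { capacity, used: 0 };
    let a_first = pool.try_reserve(step);
    let b_first = pool.try_reserve(step);
    let a_second = pool.try_reserve(step);
    let b_second = pool.try_reserve(step);
    a_first && b_first && !a_second && !b_second
}

/// Pipelines run one after another: each reserves twice, finishes, and
/// releases everything before the next one starts.
fn sequential_completes(capacity: usize, step: usize, pipelines: usize) -> bool {
    let mut pool = Pool { capacity, used: 0 };
    for _ in 0..pipelines {
        if !(pool.try_reserve(step) && pool.try_reserve(step)) {
            return false;
        }
        pool.release(2 * step);
    }
    true
}
```

With a pool of 100 units and reservations of 40, the interleaved case stalls while four sequential pipelines all complete, since each one has the whole pool to itself.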
insert_lore_emails() feeds an entire pipeline batch (up to
1024 emails) into a single LanceDB merge_insert call. Each
lore email carries full headers and body text, so the
resulting RecordBatch is far larger than a typical
code-analysis batch. LanceDB merge_insert uses DataFusion's
RepartitionExec internally, and the oversized batch exhausts
the DataFusion memory pool -- particularly when two inserter
tasks submit concurrently.

The failure manifests as:

  Resources exhausted: Failed to allocate additional
  11.6 MB for RepartitionExec[0] with 11.9 MB already
  allocated for this reservation

Split the deduplicated email indices into chunks of 128 and
issue a separate merge_insert per chunk to bound peak memory
per operation. When a chunk still fails (e.g. a single
email is large enough to exhaust the pool on its own), fall
back to inserting each email in the chunk individually so
that only genuinely uninsertable messages are skipped.

Signed-off-by: Chuck Lever <cel@kernel.org>
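
The chunk-then-fallback strategy from this commit might look like the sketch below. It is illustrative only: insert_chunked is a hypothetical name, and the insert closure stands in for a single merge_insert call, whose real signature is not shown in this PR description.

```rust
/// Insert `items` in chunks of `chunk_size`. When a whole chunk fails,
/// retry its items one at a time so that only items that fail on their
/// own are skipped. Returns the number of skipped items.
///
/// `insert` is a hypothetical stand-in for one merge_insert call.
/// `chunk_size` must be non-zero, because `slice::chunks` panics on 0.
fn insert_chunked<T, F>(items: &[T], chunk_size: usize, mut insert: F) -> usize
where
    F: FnMut(&[T]) -> Result<(), String>,
{
    let mut skipped = 0;
    for chunk in items.chunks(chunk_size) {
        if insert(chunk).is_ok() {
            continue;
        }
        // Chunk-level failure: fall back to single-item inserts.
        for item in chunk {
            if insert(std::slice::from_ref(item)).is_err() {
                skipped += 1;
            }
        }
    }
    skipped
}
```

Calling it with a chunk size of 128 matches the value named in the commit message. The fallback costs extra calls only for chunks that fail, so the common path still issues one call per chunk.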
LanceDB compaction encounters a pathological case when a table
accumulates thousands of small fragments. The compact operation
enters a CPU loop where the main thread spins at 100% CPU
utilization while worker threads remain idle. This condition
arises after repeated incremental lore refreshes, each of which
appends a new fragment to the table.

A check now examines fragment count before compaction proceeds.
When fragment count exceeds 500, compaction is skipped and a
warning directs the user to rebuild the database with --clear.
This threshold prevents the hang condition while allowing normal
compaction for tables with moderate fragmentation. Prune, index,
and checkout operations remain unaffected; only the compact step
is gated by this fragment limit.

Fixes: 4a16e15 ("semcode-index: optimize database periodically during long-running indexing")
Signed-off-by: Chuck Lever <cel@kernel.org>
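
A fragment-count gate of the kind this commit describes could be sketched as follows. The 500 threshold and the --clear hint come from the commit message; the type, function name, and warning wording are assumptions made for illustration, and obtaining the fragment count from LanceDB is left out.

```rust
/// Threshold taken from the commit message above.
const MAX_FRAGMENTS_FOR_COMPACTION: usize = 500;

/// Outcome of the pre-compaction check (hypothetical type).
#[derive(Debug, PartialEq)]
enum CompactDecision {
    Run,
    Skip { warning: String },
}

/// Decide whether the compact step should run for a table with the given
/// number of fragments. Other maintenance steps are not affected by this.
fn decide_compaction(table: &str, fragment_count: usize) -> CompactDecision {
    if fragment_count > MAX_FRAGMENTS_FOR_COMPACTION {
        CompactDecision::Skip {
            // The wording of this warning is illustrative, not the PR's text.
            warning: format!(
                "table '{}' has {} fragments (limit {}); skipping compaction, \
                 consider rebuilding the database with --clear",
                table, fragment_count, MAX_FRAGMENTS_FOR_COMPACTION
            ),
        }
    } else {
        CompactDecision::Run
    }
}
```

The comparison is strict, so a table with exactly 500 fragments is still compacted, matching the "exceeds 500" wording in the commit message.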
@meta-cla (bot) added the "CLA Signed" label on Feb 11, 2026. This label is managed by the Meta Open Source bot.