Fixes for "semcode-index --lore" by chucklever · Pull Request #18 · facebookexperimental/semcode

chucklever · 2026-02-11T21:38:47Z

Address several recently introduced inefficiencies and a few long-standing bugs in the "--lore" command line option.

I'm not certain if I've completely worked out the CLA issues. Let me know.

Lore indexing required scanning the email table to determine which commits had already been processed. The lore table contains parsed email records, not a direct mapping of indexed commits, making duplicate detection both slow and unreliable.

A dedicated lore_indexed_commits table now tracks processed git commit SHAs. After successful insertion of lore emails, commit SHAs are recorded in this table. Subsequent runs load the full table into a HashSet to skip already-processed commits, avoiding redundant downloads and parsing of mailing list archives. The table contains only short SHA strings, so reading it entirely into memory is inexpensive.

The table has a single git_commit_sha column and integrates into schema initialization and repair.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
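The skip logic described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function names `commits_to_process` and `record_indexed` are hypothetical, and an in-memory `HashSet<String>` stands in for the rows loaded from the lore_indexed_commits table.

```rust
use std::collections::HashSet;

/// Returns the candidate commits that have not been indexed yet,
/// preserving input order. `indexed` stands in for the SHAs loaded
/// from the lore_indexed_commits table (hypothetical helper).
fn commits_to_process<'a>(candidates: &[&'a str], indexed: &HashSet<String>) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|sha| !indexed.contains(*sha))
        .collect()
}

/// Records SHAs as processed. In the flow described above this would
/// happen only after the emails were inserted successfully
/// (hypothetical helper).
fn record_indexed(indexed: &mut HashSet<String>, processed: &[&str]) {
    indexed.extend(processed.iter().map(|sha| sha.to_string()));
}
```

Recording a SHA only after its emails are inserted means a run that fails partway leaves that commit eligible for retry on the next run.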

The --lore refresh path uses buffer_unordered() to index up to four archives concurrently. Each archive pipeline spawns its own set of database inserter tasks, all sharing the same LanceDB connection and its underlying DataFusion memory pool.

With large-row archives such as oe-kbuild-all, concurrent merge_insert operations from separate pipelines exhaust the memory pool simultaneously. Neither pipeline can make progress because each holds a portion of the pool while waiting for more, producing a resource deadlock visible as two frozen progress bars with unchanging "Inserted N emails" counts.

Replace buffer_unordered() with a sequential loop, matching the approach already used by the --lore <args> initial-clone path. The git fetch for each archive still runs inline, so network latency is the only cost; the database insertion -- which dominates wall-clock time -- no longer contends for the shared memory pool.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
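A toy model may help show why serializing the pipelines avoids the stall. Everything below is an assumption made for illustration: `Pool`, `run_sequential`, and `run_interleaved` are hypothetical names, the byte counts are arbitrary, and the real DataFusion memory pool is considerably more involved. Each simulated pipeline reserves memory in two steps and releases it only after the second step succeeds.

```rust
/// Toy stand-in for a shared memory pool with a fixed byte budget.
struct Pool {
    capacity: usize,
    used: usize,
}

impl Pool {
    fn new(capacity: usize) -> Self {
        Pool { capacity, used: 0 }
    }

    /// Reserves `bytes` if the budget allows it; otherwise changes nothing.
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes <= self.capacity {
            self.used += bytes;
            true
        } else {
            false
        }
    }

    fn shrink(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

/// Runs each pipeline to completion before starting the next one.
/// Returns how many pipelines finished.
fn run_sequential(pool: &mut Pool, pipelines: usize, step: usize) -> usize {
    let mut finished = 0;
    for _ in 0..pipelines {
        if pool.try_grow(step) {
            if pool.try_grow(step) {
                pool.shrink(2 * step); // done: release both reservations
                finished += 1;
            } else {
                pool.shrink(step); // give up and release the first reservation
            }
        }
    }
    finished
}

/// Interleaves the pipelines: every pipeline takes its first reservation
/// before any of them asks for its second. Returns how many finished.
fn run_interleaved(pool: &mut Pool, pipelines: usize, step: usize) -> usize {
    let mut holding = 0;
    for _ in 0..pipelines {
        if pool.try_grow(step) {
            holding += 1;
        }
    }
    let mut finished = 0;
    for _ in 0..holding {
        if pool.try_grow(step) {
            pool.shrink(2 * step);
            finished += 1;
        }
        // Otherwise the pipeline keeps its first reservation and waits.
    }
    finished
}
```

With a 30-byte budget and 12-byte steps, two interleaved pipelines each hold 12 bytes and neither can obtain 12 more, while the sequential version completes both.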

insert_lore_emails() feeds an entire pipeline batch (up to 1024 emails) into a single LanceDB merge_insert call. Each lore email carries full headers and body text, so the resulting RecordBatch is far larger than a typical code-analysis batch. LanceDB merge_insert uses DataFusion's RepartitionExec internally, and the oversized batch exhausts the DataFusion memory pool -- particularly when two inserter tasks submit concurrently. The failure manifests as:

    Resources exhausted: Failed to allocate additional 11.6 MB for RepartitionExec[0] with 11.9 MB already allocated for this reservation

Split the deduplicated email indices into chunks of 128 and issue a separate merge_insert per chunk to bound peak memory per operation. When a chunk still fails (e.g. a single email is large enough to exhaust the pool on its own), fall back to inserting each email in the chunk individually so that only genuinely uninsertable messages are skipped.

Signed-off-by: Chuck Lever <cel@kernel.org>
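The chunk-then-fall-back strategy can be sketched generically as below. This is not the body of insert_lore_emails(): `insert_chunked` is a hypothetical helper, the `insert` closure stands in for a merge_insert call, and the `String` error type is a placeholder.

```rust
/// Chunk size named in the commit message.
const CHUNK_SIZE: usize = 128;

/// Inserts `items` in chunks of CHUNK_SIZE. If a chunk fails, each item in
/// that chunk is retried on its own, so only items that cannot be inserted
/// individually are skipped. Returns the number of items inserted.
/// `insert` is a stand-in for the real database call (assumption).
fn insert_chunked<T, F>(items: &[T], mut insert: F) -> usize
where
    F: FnMut(&[T]) -> Result<(), String>,
{
    let mut inserted = 0;
    for chunk in items.chunks(CHUNK_SIZE) {
        if insert(chunk).is_ok() {
            inserted += chunk.len();
            continue;
        }
        // Fallback: retry the failed chunk one item at a time.
        for single in chunk.chunks(1) {
            if insert(single).is_ok() {
                inserted += 1;
            }
        }
    }
    inserted
}
```

The fallback costs up to 128 extra calls for a failing chunk, which seems an acceptable price when failures are rare and the alternative is dropping the whole chunk.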

LanceDB compaction encounters a pathological case when a table accumulates thousands of small fragments. The compact operation enters a CPU loop where the main thread spins at 100% CPU utilization while worker threads remain idle. This condition arises after repeated incremental lore refreshes, each of which appends a new fragment to the table.

A check now examines fragment count before compaction proceeds. When fragment count exceeds 500, compaction is skipped and a warning directs the user to rebuild the database with --clear. This threshold prevents the hang condition while allowing normal compaction for tables with moderate fragmentation.

Prune, index, and checkout operations remain unaffected; only the compact step is gated by this fragment limit.

Fixes: 4a16e15 ("semcode-index: optimize database periodically during long-running indexing")
Signed-off-by: Chuck Lever <cel@kernel.org>
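The gate described above amounts to a threshold check before the compact step. A minimal sketch follows, assuming the fragment count has already been obtained from the table; `CompactDecision`, `decide_compaction`, the constant name, and the warning wording are all hypothetical and not taken from the patch.

```rust
/// Threshold named in the commit message; the constant name is hypothetical.
const MAX_FRAGMENTS_FOR_COMPACTION: usize = 500;

#[derive(Debug, PartialEq)]
enum CompactDecision {
    /// Fragment count is moderate; run compaction as usual.
    Compact,
    /// Too many fragments; skip compaction and surface a warning.
    Skip { warning: String },
}

/// Decides whether the compact step should run for `table`.
/// Only compaction is gated; other maintenance steps are unaffected.
fn decide_compaction(table: &str, fragment_count: usize) -> CompactDecision {
    if fragment_count > MAX_FRAGMENTS_FOR_COMPACTION {
        CompactDecision::Skip {
            // Illustrative wording, not the message emitted by semcode-index.
            warning: format!(
                "table '{}' has {} fragments (limit {}); skipping compaction, rebuild with --clear",
                table, fragment_count, MAX_FRAGMENTS_FOR_COMPACTION
            ),
        }
    } else {
        CompactDecision::Compact
    }
}
```

Because the commit message says compaction is skipped when the count "exceeds 500", a table with exactly 500 fragments is still compacted in this sketch.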

chucklever and others added 4 commits February 11, 2026 16:32

meta-cla bot added the CLA Signed label Feb 11, 2026

chucklever wants to merge 4 commits into facebookexperimental:main from chucklever:main
