Fixes for "semcode-index --lore" #18
Open
chucklever wants to merge 4 commits into facebookexperimental:main from
Lore indexing required scanning the email table to determine which commits had already been processed. The lore table contains parsed email records, not a direct mapping of indexed commits, making duplicate detection both slow and unreliable.

A dedicated lore_indexed_commits table now tracks processed git commit SHAs. After successful insertion of lore emails, commit SHAs are recorded in this table. Subsequent runs load the full table into a HashSet to skip already-processed commits, avoiding redundant downloads and parsing of mailing list archives. The table contains only short SHA strings, so reading it entirely into memory is inexpensive.

The table has a single git_commit_sha column and integrates into schema initialization and repair.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>

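For readers skimming the thread, the skip logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `commits_to_index` and `record_indexed` are hypothetical names, and the `HashSet` stands in for the contents of the lore_indexed_commits table after it has been loaded into memory.

```rust
use std::collections::HashSet;

/// Return the commits that still need lore indexing, preserving input order.
/// `already_indexed` is assumed to hold every SHA read from the tracking table.
fn commits_to_index<'a>(
    candidates: &'a [String],
    already_indexed: &HashSet<String>,
) -> Vec<&'a str> {
    candidates
        .iter()
        .filter(|sha| !already_indexed.contains(sha.as_str()))
        .map(|sha| sha.as_str())
        .collect()
}

/// Record newly processed SHAs. In the scheme described above this happens
/// only after the corresponding emails were inserted successfully.
fn record_indexed(already_indexed: &mut HashSet<String>, processed: &[&str]) {
    already_indexed.extend(processed.iter().map(|sha| sha.to_string()));
}
```

Membership checks on the set take constant time on average, so the cost of a refresh scales with the number of candidate commits rather than with the size of the email table.
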
The --lore refresh path uses buffer_unordered() to index up to four archives concurrently. Each archive pipeline spawns its own set of database inserter tasks, all sharing the same LanceDB connection and its underlying DataFusion memory pool.

With large-row archives such as oe-kbuild-all, concurrent merge_insert operations from separate pipelines exhaust the memory pool simultaneously. Neither pipeline can make progress because each holds a portion of the pool while waiting for more, producing a resource deadlock visible as two frozen progress bars with unchanging "Inserted N emails" counts.

Replace buffer_unordered() with a sequential loop, matching the approach already used by the --lore <args> initial-clone path. The git fetch for each archive still runs inline, so network latency is the only cost; the database insertion -- which dominates wall-clock time -- no longer contends for the shared memory pool.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>

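The hold-and-wait pattern behind the stall can be shown with a toy model. Everything here is an assumption made for illustration: `Pool`, `run_sequential`, and `run_interleaved` are hypothetical, the units are arbitrary, and the model does not reproduce how DataFusion's memory pool actually behaves. It only shows why two pipelines that each hold part of a fixed budget can both stall, while the same pipelines run one after another both finish.

```rust
/// Toy stand-in for a shared memory pool with a fixed capacity (arbitrary units).
struct Pool {
    capacity: usize,
    used: usize,
}

impl Pool {
    /// Grant the reservation only if it fits in the remaining capacity.
    fn try_reserve(&mut self, n: usize) -> bool {
        if self.used + n <= self.capacity {
            self.used += n;
            true
        } else {
            false
        }
    }

    fn release(&mut self, n: usize) {
        self.used -= n;
    }
}

/// Run pipelines one at a time: each grows to `need` in increments of `step`,
/// then frees everything. Returns how many pipelines finished.
fn run_sequential(pool: &mut Pool, pipelines: usize, need: usize, step: usize) -> usize {
    let mut finished = 0;
    for _ in 0..pipelines {
        let mut held = 0;
        while held < need && pool.try_reserve(step) {
            held += step;
        }
        if held >= need {
            finished += 1;
        }
        pool.release(held);
    }
    finished
}

/// Run pipelines round-robin, each keeping what it holds while asking for more.
/// Stops when a full round makes no progress. Returns how many pipelines finished.
fn run_interleaved(pool: &mut Pool, pipelines: usize, need: usize, step: usize) -> usize {
    let mut held = vec![0usize; pipelines];
    let mut done = vec![false; pipelines];
    loop {
        let mut progressed = false;
        for i in 0..pipelines {
            if done[i] {
                continue;
            }
            if pool.try_reserve(step) {
                held[i] += step;
                progressed = true;
                if held[i] >= need {
                    pool.release(held[i]);
                    held[i] = 0;
                    done[i] = true;
                }
            }
        }
        if !progressed {
            break;
        }
    }
    // Give back whatever the stalled pipelines were still holding.
    for h in &held {
        pool.release(*h);
    }
    done.iter().filter(|d| **d).count()
}
```

With a capacity of 12, a per-pipeline need of 8, and a step of 2, the interleaved run stalls with each pipeline holding 6 units, whereas the sequential run completes both.
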
insert_lore_emails() feeds an entire pipeline batch (up to 1024 emails) into a single LanceDB merge_insert call. Each lore email carries full headers and body text, so the resulting RecordBatch is far larger than a typical code-analysis batch. LanceDB merge_insert uses DataFusion's RepartitionExec internally, and the oversized batch exhausts the DataFusion memory pool -- particularly when two inserter tasks submit concurrently. The failure manifests as:

    Resources exhausted: Failed to allocate additional 11.6 MB
    for RepartitionExec[0] with 11.9 MB already allocated for
    this reservation

Split the deduplicated email indices into chunks of 128 and issue a separate merge_insert per chunk to bound peak memory per operation. When a chunk still fails (e.g. a single email is large enough to exhaust the pool on its own), fall back to inserting each email in the chunk individually so that only genuinely uninsertable messages are skipped.

Signed-off-by: Chuck Lever <cel@kernel.org>

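The chunk-then-fall-back strategy can be sketched generically. This is an illustration under stated assumptions rather than the code in this PR: `insert_chunked` and `InsertReport` are hypothetical names, the `insert` closure stands in for one merge_insert call, and the retry assumes that a failed chunk either inserted nothing or that re-inserting a row is harmless.

```rust
/// Outcome of a chunked insert: how many items landed and which indices were skipped.
#[derive(Debug, PartialEq)]
struct InsertReport {
    inserted: usize,
    skipped: Vec<usize>,
}

/// Insert `items` in chunks of `chunk_size`. If a chunk fails, retry its items
/// one by one so that only the items that fail on their own are skipped.
fn insert_chunked<T, E>(
    items: &[T],
    chunk_size: usize,
    mut insert: impl FnMut(&[T]) -> Result<(), E>,
) -> InsertReport {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut report = InsertReport { inserted: 0, skipped: Vec::new() };

    for (chunk_no, chunk) in items.chunks(chunk_size).enumerate() {
        if insert(chunk).is_ok() {
            report.inserted += chunk.len();
            continue;
        }
        // Fallback: isolate the offending item(s) within this chunk.
        let base = chunk_no * chunk_size;
        for (offset, item) in chunk.iter().enumerate() {
            match insert(std::slice::from_ref(item)) {
                Ok(()) => report.inserted += 1,
                Err(_) => report.skipped.push(base + offset),
            }
        }
    }
    report
}
```

Peak memory per call is bounded by the largest chunk, and the worst case for a failing chunk is one extra call per item in that chunk.
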
LanceDB compaction encounters a pathological case when a table accumulates thousands of small fragments. The compact operation enters a CPU loop where the main thread spins at 100% CPU utilization while worker threads remain idle. This condition arises after repeated incremental lore refreshes, each of which appends a new fragment to the table.

A check now examines fragment count before compaction proceeds. When fragment count exceeds 500, compaction is skipped and a warning directs the user to rebuild the database with --clear. This threshold prevents the hang condition while allowing normal compaction for tables with moderate fragmentation.

Prune, index, and checkout operations remain unaffected; only the compact step is gated by this fragment limit.

Fixes: 4a16e15 ("semcode-index: optimize database periodically during long-running indexing")
Signed-off-by: Chuck Lever <cel@kernel.org>
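
The gate described above amounts to a simple threshold check before the compact step. The sketch below is an assumption-laden illustration: `decide_compaction`, `CompactDecision`, the constant name, and the warning wording are all hypothetical, and the fragment count is taken as an input rather than queried from LanceDB.

```rust
/// Hypothetical constant mirroring the limit of 500 fragments described above.
const MAX_FRAGMENTS_FOR_COMPACTION: usize = 500;

#[derive(Debug, PartialEq)]
enum CompactDecision {
    /// Fragment count is within the limit; run compaction as usual.
    Compact,
    /// Fragment count exceeds the limit; skip compaction and warn the user.
    Skip { warning: String },
}

/// Decide whether to compact a table given its current fragment count.
/// Only the compact step is gated; other maintenance steps are not modeled here.
fn decide_compaction(table: &str, fragment_count: usize) -> CompactDecision {
    if fragment_count > MAX_FRAGMENTS_FOR_COMPACTION {
        CompactDecision::Skip {
            warning: format!(
                "table '{table}' has {fragment_count} fragments (limit {MAX_FRAGMENTS_FOR_COMPACTION}); \
                 skipping compaction, consider rebuilding the database with --clear"
            ),
        }
    } else {
        CompactDecision::Compact
    }
}
```

A table with exactly 500 fragments is still compacted in this sketch, since the description says compaction is skipped only when the count exceeds 500.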
Address several recently introduced inefficiencies and a few long-standing bugs in the "--lore" command line option.
I'm not certain if I've completely worked out the CLA issues. Let me know.