feat: cross-file entity links + Tier-3 entity questions #10
Merged
Extend the entities extractor so one `entity` artifact spans multiple files (the structural reason knowbase beats RAG), mirroring how the API extractor grounds a route on its handler + response_model across files.

- entities.py: two-pass extract. Pass 1 classifies every class and indexes entities by short name; pass 2 resolves each entity's field-type references and SQLAlchemy relationship() targets against that index, adding `related_entity` grounding edges (cross-file when the target lives elsewhere) + a `related_entities` payload list. extractor_version 1 -> 2 (derived_from set changes, so ids rotate; gated per DESIGN §6). FK / transitive imports are documented gaps.
- embed/text.py: enrich entity embed text (qualified name, field + related names) so entity questions rank in search_knowledge.
- tier1_entities_test.py: add a cross-file Cart -> Order pair and assert the artifact is grounded on both files (role related_entity).
- questions.py + tier3_rag_test.py: a two-file Order/LineItem entity fixture and 3 entity questions; Tier-3 indexes both extractors and asserts knowbase cross-file recall@5 == 1.0 for entity questions too (now 11 questions).

52 eval tests pass; ruff + mypy --strict clean. End-to-end on knowbase itself: 9/25 entities resolve links, including a true cross-file one (ExtractContext -> ParsedSpan).
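For readers skimming the commit, the two-pass flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `entities.py`: the `Entity` dataclass, the `extract` function, and the `SOURCES` fixture are hypothetical names; "entity" is approximated as "any class with annotated fields"; and references are found by scanning annotation text for identifiers. SQLAlchemy `relationship()` targets are not handled here.

```python
import ast
import re
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical stand-in for an entity artifact (not the real schema)."""
    qualified_name: str
    file: str
    field_types: list[str]  # annotation source text, one per field
    related_entities: list[str] = field(default_factory=list)
    grounding_files: set[str] = field(default_factory=set)


_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def extract(sources: dict[str, str]) -> dict[str, Entity]:
    """Map qualified entity name -> Entity for a {path: source} mapping."""
    # Pass 1: classify classes and index entities by short name.
    # Assumption: a class with at least one annotated field is an entity.
    by_short_name: dict[str, Entity] = {}
    for path, text in sources.items():
        module = path.removesuffix(".py").replace("/", ".")
        for node in ast.walk(ast.parse(text)):
            if not isinstance(node, ast.ClassDef):
                continue
            annotations = [
                ast.unparse(stmt.annotation)
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign)
            ]
            if annotations:
                by_short_name[node.name] = Entity(
                    qualified_name=f"{module}.{node.name}",
                    file=path,
                    field_types=annotations,
                    grounding_files={path},
                )

    # Pass 2: resolve identifiers in each annotation against the index.
    for entity in by_short_name.values():
        for annotation in entity.field_types:
            for name in _IDENT.findall(annotation):
                target = by_short_name.get(name)
                if target is None or target is entity:
                    continue
                if target.qualified_name not in entity.related_entities:
                    entity.related_entities.append(target.qualified_name)
                # A same-file target adds no new file; a cross-file one does.
                entity.grounding_files.add(target.file)

    return {e.qualified_name: e for e in by_short_name.values()}


# Hypothetical two-file fixture: Cart (cart.py) references Order (models.py).
SOURCES = {
    "shop/models.py": "class Order:\n    id: int\n",
    "shop/cart.py": "class Cart:\n    orders: list['Order']\n    total: float\n",
}
entities = extract(SOURCES)
cart = entities["shop.cart.Cart"]
```

Indexing by short name in a first pass is what lets the second pass resolve a reference whose target is defined in a file that has not been visited yet; a single pass would miss forward references. The trade-off in this sketch is that two classes sharing a short name would collide in the index.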
Extends the entities extractor so a single `entity` artifact spans multiple files (the structural reason knowbase beats RAG), mirroring how the API extractor grounds a route on its handler + `response_model` across files. Also brings the Tier-3 A/B to a second knowledge type.

What

- `entities.py` (two-pass). Pass 1 classifies every class and indexes entities by short name; pass 2 resolves each entity's field-type references (`list[LineItem]`, `User | None`, `Mapped[list["Order"]]`) and SQLAlchemy `relationship()` targets against that index, adding `related_entity` grounding edges (cross-file when the target lives elsewhere) + a `related_entities` payload. `extractor_version` 1 → 2 (derived-from set changes → ids rotate, gated per DESIGN §6; one-hop invalidation now links referenced → referencing entity). `ForeignKey(...)`, transitive/aliased imports, and association tables are documented gaps.
- `embed/text.py`. Enriched entity embed text (qualified name + field + related-entity names) so entity questions rank in `search_knowledge`.
- `tier1_entities_test.py`. Cross-file `Cart → Order` pair; `test_cross_file_entity_links_grounded` asserts the artifact is grounded on both files (role `related_entity`) and the reference resolves to `shop.models.Order`.
- `questions.py` + `tier3_rag_test.py`. Two-file `Order`/`LineItem` entity fixture + 3 entity questions; the harness indexes `[FastAPIExtractor, EntityExtractor]` and the generic recall loop asserts knowbase cross-file recall@5 == 1.0 for entity questions too (now 11 questions). RAG arm stays tracked/non-asserted.

Verification

- 52 eval tests pass; ruff + `mypy --strict` clean.
- knowbase cross-file recall@5 == 1.000 for all 11 questions; recall@1 separator knowbase 0.682 vs RAG 0.409.
- End-to-end on knowbase itself: 9/25 entities resolve links, including a true cross-file one, `ExtractContext` (`base.py`) → `ParsedSpan` (`structural/interface.py`); same-file links correctly add no extra file.

Store/queries unchanged (kind-opaque). Out of scope / documented follow-ups: `ForeignKey("table.col")` resolution, transitive imports, a separate `entity_relation` graph kind.
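For context on the recall@5 and recall@1 figures quoted under Verification, here is a small sketch of how recall@k is conventionally computed. The harness's exact definition is not shown in this PR, so this assumes the standard one (fraction of the files a question needs that appear in the top k results); `recall_at_k`, `mean_recall_at_k`, and the sample file paths are hypothetical.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items found among the first k ranked results."""
    if not relevant:
        raise ValueError("relevant must be non-empty")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: list[tuple[list[str], set[str]]], k: int
) -> float:
    """Average recall@k over (ranked results, relevant set) pairs."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical question whose answer needs two files.
ranked = ["shop/api.py", "shop/models.py", "shop/items.py"]
relevant = {"shop/models.py", "shop/items.py"}

r5 = recall_at_k(ranked, relevant, 5)  # both files are in the top 5
r1 = recall_at_k(ranked, relevant, 1)  # the top hit is not a needed file
```

Under this definition a cross-file question scores 1.0 at k=5 only when every file it depends on is retrieved, which is why recall@5 can saturate for both arms while recall@1 still separates them.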