
feat: deterministic entities extractor (pydantic / dataclass / SQLAlchemy)#9

Merged
v0ropaev merged 1 commit into master from feat/entities-extractor
Jun 20, 2026

@v0ropaev

Owner

The next deterministic family after imports + the API contract (DESIGN.md §14 #1, dropped from the README roadmap in the v0.2.0 refresh). A fully static EntityExtractor emits one entity artifact per domain class — pydantic BaseModel, @dataclass, SQLAlchemy declarative model — with its fields, grounded on the class-definition span. Cheapest, highest-trust way to broaden the knowledge surface; immediately served via MCP get_knowledge/search_knowledge.

Approach

  • tree-sitter, no execution — re-parses each class span's raw_text, mirroring fastapi_contract. No new dependency.
  • Detection (best-effort, signals recorded in payload): dataclass decorator; pydantic BaseModel/BaseSettings base; SQLAlchemy __tablename__ / Mapped[...] / mapped_column(...) / Column(...). A bare declarative Base is correctly not an entity. Transitive bases & imperative SQLAlchemy mapping are documented gaps, not silent losses.
  • framework_versions (pydantic/sqlalchemy) folded into the artifact key per DESIGN §6.
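
The detection signals listed above can be sketched roughly as follows. This is a minimal illustration only: it swaps in the standard-library `ast` module instead of tree-sitter so that it runs without extra dependencies, and the function name `detect_signals` and the signal labels are hypothetical, not taken from `entities.py`.

```python
import ast

# Assumed names: these sets and labels are illustrative, not the PR's actual code.
PYDANTIC_BASES = {"BaseModel", "BaseSettings"}
SQLA_COLUMN_CALLS = {"mapped_column", "Column"}


def _name(node):
    """Return the rightmost identifier of a Name/Attribute/Call/Subscript node."""
    if isinstance(node, ast.Call):
        return _name(node.func)
    if isinstance(node, ast.Subscript):
        return _name(node.value)
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def detect_signals(class_src: str) -> list[str]:
    """Parse one class definition (never execute it) and list detection signals."""
    tree = ast.parse(class_src)
    cls = next(n for n in tree.body if isinstance(n, ast.ClassDef))
    signals = set()

    # @dataclass and @dataclass(...) both resolve to the name "dataclass".
    if any(_name(d) == "dataclass" for d in cls.decorator_list):
        signals.add("dataclass_decorator")

    # Direct bases only; transitive bases are a documented gap.
    if any(_name(b) in PYDANTIC_BASES for b in cls.bases):
        signals.add("pydantic_base")

    for stmt in cls.body:
        if isinstance(stmt, ast.Assign):
            if "__tablename__" in [_name(t) for t in stmt.targets]:
                signals.add("sqlalchemy_tablename")
            value = stmt.value
        elif isinstance(stmt, ast.AnnAssign):
            if _name(stmt.annotation) == "Mapped":
                signals.add("sqlalchemy_mapped")
            value = stmt.value
        else:
            continue
        if isinstance(value, ast.Call) and _name(value) in SQLA_COLUMN_CALLS:
            signals.add("sqlalchemy_column_call")

    return sorted(signals)
```

Under this sketch, a class with no signals (such as a bare declarative `Base`) returns an empty list and so would not be emitted as an entity.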

Gate — HARD #8

tier1_entities_test.py (hand-labeled oracle, the imports-gate pattern): extracted entities + fields match the oracle; Base is not an entity; a create_model(...) model is a known gap; every entity grounded on a class span.
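
A hand-labeled oracle gate of this kind might compare results along these lines. The helper name `diff_against_oracle` and the dict-of-field-lists shape are assumptions made for illustration; the actual structure of `tier1_entities_test.py` is not shown in this PR.

```python
def diff_against_oracle(
    extracted: dict[str, list[str]],
    oracle: dict[str, list[str]],
) -> list[str]:
    """Compare extracted entities to a hand-labeled oracle.

    Both arguments map an entity (class) name to its ordered field names.
    Returns human-readable mismatches; an empty list means the gate passes.
    """
    problems = []
    for name in sorted(oracle.keys() - extracted.keys()):
        problems.append(f"missing entity: {name}")
    for name in sorted(extracted.keys() - oracle.keys()):
        problems.append(f"unexpected entity: {name}")
    for name in sorted(oracle.keys() & extracted.keys()):
        if extracted[name] != oracle[name]:
            problems.append(
                f"field mismatch in {name}: "
                f"expected {oracle[name]}, got {extracted[name]}"
            )
    return problems


# Hypothetical fixture: the oracle deliberately omits Base, so an extractor
# that wrongly emits it fails the gate.
oracle = {"User": ["id", "name"]}
good = {"User": ["id", "name"]}
bad = {"User": ["id"], "Base": []}
```

Here `diff_against_oracle(good, oracle)` returns an empty list, while `diff_against_oracle(bad, oracle)` reports both the unexpected `Base` entity and the field mismatch in `User`.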

Verification

  • 51 eval tests pass (was 46); ruff + mypy --strict clean.
  • End-to-end on knowbase itself: 23 entities (17 dataclass, 6 pydantic, 0 false positives — the repo uses SQLAlchemy Core, not declarative classes), correct field counts, clean summarize().

Touch-points

New entities.py + tier1_entities_test.py; register in cli.py; one summarize branch in mcp/records.py; docs (README eight-gates/architecture/status, DESIGN §11/§14, CHANGELOG [Unreleased]). Store/queries unchanged (kind-opaque).

Out of scope (documented follow-ups): cross-file entity links (relationship/FK target grounding, like the API extractor's response_model), Enum/TypedDict/attrs, Tier-3 entity questions.

…hemy)

The next deterministic family after imports + the API contract (DESIGN §14 #1).
A fully static EntityExtractor emits one `entity` artifact per domain class —
pydantic BaseModel, @dataclass, and SQLAlchemy declarative model — with its
fields, grounded on the class-definition span. No code execution (tree-sitter
re-parse, mirroring fastapi_contract); detection signals + limits recorded in
the payload; framework_versions (pydantic/sqlalchemy) folded into the key.

- src/kb/extract/deterministic/entities.py — detection (dataclass decorator;
  pydantic BaseModel/BaseSettings base; SQLAlchemy __tablename__ / Mapped[] /
  mapped_column/Column), field parsing, artifact assembly.
- src/kb/eval/tier1_entities_test.py — HARD gate (#8): hand-labeled oracle for
  entities + fields, a bare declarative Base is NOT an entity, create_model() is
  a known gap, every entity grounded on a `class` span.
- register in the index pipeline (cli.py); add an `entity` branch to MCP
  records.summarize.
- docs: README (eight gates, architecture row, status bullet), DESIGN §11/§14,
  CHANGELOG [Unreleased].

51 eval tests pass (was 46); ruff + mypy --strict clean. End-to-end on knowbase
itself yields 23 entities (17 dataclass, 6 pydantic; 0 false positives — it uses
SQLAlchemy Core, not declarative).
v0ropaev merged commit be80d62 into master on Jun 20, 2026
1 check passed
v0ropaev deleted the feat/entities-extractor branch on June 20, 2026, 21:29