diff --git a/CHANGELOG.md b/CHANGELOG.md index c96a8e5..c288541 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,7 +9,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -- (nothing yet) +- **Deterministic entities extractor** (`kb.extract.deterministic.entities`): a fully static + (tree-sitter) extractor that emits one `entity` artifact per domain class — pydantic `BaseModel`, + `@dataclass`, and SQLAlchemy declarative model — with its fields, grounded on the class-definition + span. Detection signals and limits are recorded in the payload (transitive bases / imperative + SQLAlchemy mapping are documented gaps, not silent losses); `framework_versions` (pydantic / + sqlalchemy) is folded into the artifact key. Surfaced via MCP `get_knowledge`/`search_knowledge`. +- **Tier-1 entities gate** (`kb.eval.tier1_entities_test`): a hand-labeled HARD gate — extracted + entities + fields match the oracle, a bare declarative `Base` is not an entity, a `create_model(...)` + model is asserted as a known gap, and every entity is grounded on a `class` span. Brings the headline + HARD gates to **eight**. ## [0.2.0] - 2026-06-02 diff --git a/DESIGN.md b/DESIGN.md index f8145fd..b33d72c 100644 --- a/DESIGN.md +++ b/DESIGN.md @@ -305,7 +305,7 @@ freshness(current|stale@sha)`, with a deterministic tie-break for reproducible e | Module | Responsibility | Key tech | |--------|----------------|----------| | `kb.structural` | Parse Python without executing it; enumerate symbols/imports/call-sites with per-SHA byte/line ranges; compute content-addressed span identity; incremental reparse. Hidden behind a `StructuralIndex`/`PathEngine` interface so a SCIP backend can replace tree-sitter later. | tree-sitter + tree-sitter-python (canonical bindings) | -| `kb.extract.deterministic` | No-LLM extractors → exact artifacts (confidence=1.0): import graph; FastAPI API contract (static, cross-file grounded); griffe library surface (planned). 
| grimp, tree-sitter queries, griffe (static) | +| `kb.extract.deterministic` | No-LLM extractors → exact artifacts (confidence=1.0): import graph; FastAPI API contract (static, cross-file grounded); domain entities (pydantic/dataclass/SQLAlchemy, static, hand-labeled gate); griffe library surface (planned). | grimp, tree-sitter queries, griffe (static) | | `kb.introspect` | Eval-only runtime oracle: runs a FastAPI app in a network-blocked sandbox and emits `app.openapi()` for the Tier-1 API gate. Never on the index path. | subprocess sandbox, fastapi | | `kb.embed` | Replaceable embedding adapters + snapshot population for `search_knowledge`. Torch isolated behind the `embed` extra and a lazy import. | sentence-transformers (default), OpenAI (optional), pgvector | | `kb.rag` | Frozen pgvector RAG-over-source baseline — the "other arm" of the knowledge-vs-RAG A/B (no provenance/grounding). | deterministic line-window chunker, pgvector | @@ -382,8 +382,8 @@ Review fact-checked these against current (2026) sources. Caveats are first-clas ## 14. Roadmap (post-MVP, indicative) -1. Second deterministic family fully (entities via griffe/SQLAlchemy/pydantic; events where a - real oracle exists). +1. Second deterministic family: **entities (pydantic/dataclass/SQLAlchemy) — shipped** (static + tree-sitter, hand-labeled Tier-1 gate); events where a real oracle exists (next). 2. The **one** grounded business-process extractor (named real path + labeler + validator + deterministic sub-property gate). 3. Recursive invalidation (`artifact_depends_on`), multi-branch dedup, freshness precompute. 
diff --git a/README.md b/README.md index 646888d..cad81f7 100644 --- a/README.md +++ b/README.md @@ -83,12 +83,12 @@ flowchart LR **v0.2 — spine + the first knowledge extractors, MCP serving, and the knowledge-vs-RAG gate.** Everything here grounds what it claims, and nothing it cannot: - **Provenance spine** — content-addressed `span_id` (LOCKED); tree-sitter spans with a normalized S-expression fingerprint and per-SHA location; a single-Postgres, Alembic-managed store with content-addressed idempotent writes; the ≥ 1 `derived_from` anti-hallucination invariant enforced in-app *and* by a deferred DB trigger; pygit2 git ingest (no checkout) with a diff-based invalidation seed. -- **Deterministic extractors** — the **import / dependency graph** (grimp resolves the edge, tree-sitter grounds it on the exact import statement, with an honest `approximate` fallback for re-exports / relative / unmappable imports — never a silent loss), and the **FastAPI API-contract** extractor, which grounds a single route **across files** (handler in `routes.py` + `response_model` class in `schemas.py`). +- **Deterministic extractors** — the **import / dependency graph** (grimp resolves the edge, tree-sitter grounds it on the exact import statement, with an honest `approximate` fallback for re-exports / relative / unmappable imports — never a silent loss); the **FastAPI API-contract** extractor, which grounds a single route **across files** (handler in `routes.py` + `response_model` class in `schemas.py`); and the **domain-entity** extractor (pydantic / dataclass / SQLAlchemy classes and their fields, grounded on the class definition — purely static, with documented detection limits). - **`kb introspect`** — a sandboxed, network-blocked `app.openapi()` oracle, eval-only and never on the index path, that the API gate scores the static contract against. 
- **Read-only MCP server** — `find_provenance`, `get_knowledge`, and `search_knowledge`, each returning provenance-carrying units (method + confidence + freshness). - **pgvector embeddings + semantic search** — a replaceable embedding provider (sentence-transformers by default, OpenAI optional) populated by a separate `kb embed` pass; torch stays out of the index path. - **A frozen RAG-over-source baseline** and the **Tier-3 knowledge-vs-RAG recall gate** — the honest A/B that backs the "knowledge > RAG" thesis. -- **Seven HARD CI eval gates** (see [Development](#development)). +- **Eight HARD CI eval gates** (see [Development](#development)). **Not done yet** (and deliberately not faked): the semantic / **LLM-grounded** extraction layer, the nightly LLM-judged A/B, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the [Roadmap](#roadmap). @@ -112,7 +112,7 @@ The base `--extra dev` install stays torch-free; the `embed` extra pulls sentenc ### Run the gates ```bash -uv run pytest src/kb/eval -q # the seven HARD gates (spins an ephemeral local Postgres) +uv run pytest src/kb/eval -q # the eight HARD gates (spins an ephemeral local Postgres) ``` ### Index a commit @@ -173,12 +173,13 @@ A Python package `kb` (uv, src-layout). Modules and their responsibilities: | `kb.git` | pygit2 ingest — reads blobs at a SHA (no checkout) — plus the diff-based invalidation seed. | | `kb.extract.deterministic.imports` | Deterministic import / dependency edges: tree-sitter spans grounded by line, grimp edge resolution. | | `kb.extract.deterministic.fastapi_contract` | Static FastAPI API-contract extractor; grounds a route across files (handler + `response_model` class), never imports user code. 
| +| `kb.extract.deterministic.entities` | Static domain-entity extractor — pydantic / dataclass / SQLAlchemy classes + their fields, grounded on the class definition; detection signals and limits recorded in the payload. | | `kb.introspect` | Sandboxed, network-blocked `app.openapi()` oracle — eval-only ground truth for the API gate, never on the index path. | | `kb.mcp` | Read-only MCP server and its provenance-carrying records: `find_provenance`, `get_knowledge`, `search_knowledge`. | | `kb.embed` | Replaceable embedding adapters (sentence-transformers default, OpenAI optional) + snapshot population. Torch isolated behind the `embed` extra and a lazy import. | | `kb.rag` | The frozen pgvector RAG-over-source baseline — the "other arm" of the knowledge-vs-RAG A/B (no provenance, no grounding). | | `kb.daemon.cli` | The `kb` CLI: `index`, `embed`, `serve` (MCP), and `introspect` — all functional. | -| `kb.eval` | Seven HARD CI gates (identity reproducibility, adversarial grounding, Tier-1 import oracle, Tier-1 API oracle, Tier-3 knowledge-vs-RAG recall, Tier-4 one-hop invalidation, invariants) plus the supporting MCP / embed / store suite. | +| `kb.eval` | Eight HARD CI gates (identity reproducibility, adversarial grounding, Tier-1 import oracle, Tier-1 API oracle, Tier-1 entities oracle, Tier-3 knowledge-vs-RAG recall, Tier-4 one-hop invalidation, invariants) plus the supporting MCP / embed / store suite. | Core tables: `commit_ref`, `branch_ref`, `code_span`, `span_occurrence`, `artifact` (now with `embedding vector(384)` + `embedding_model_id`), `artifact_derived_from`, `snapshot_entry`, and `rag_chunk` (the baseline arm). 
@@ -188,18 +189,19 @@ Core tables: `commit_ref`, `branch_ref`, `code_span`, `span_occurrence`, `artifa uv sync --extra dev # venv + install uv run ruff check src/kb # lint uv run mypy # strict type-check -uv run pytest src/kb/eval -q # the seven HARD eval gates +uv run pytest src/kb/eval -q # the eight HARD eval gates ``` -CI (GitHub Actions, workflow **"CI"**, `.github/workflows/ci.yml`) runs ruff, `mypy --strict`, and the eval gates against a `pgvector/pgvector:pg17` service (with the embedding model cached). The **seven HARD gates** that block a merge: +CI (GitHub Actions, workflow **"CI"**, `.github/workflows/ci.yml`) runs ruff, `mypy --strict`, and the eval gates against a `pgvector/pgvector:pg17` service (with the embedding model cached). The **eight HARD gates** that block a merge: 1. **Identity reproducibility** — formatting / comment / docstring / location changes must NOT change `span_id`; a rename MUST. Pure identity core, no database. 2. **Adversarial grounding** — an ungrounded artifact is rejected by *both* layers (the app's `GroundingError` and the DB's deferred `artifact_grounded_check` trigger); a genuinely grounded artifact commits cleanly. 3. **Tier-1 import oracle** — extracted import edges match a hand-labeled oracle, grounded on the actual import statement span; a dynamic import is asserted as a *known* gap, not a silent loss. 4. **Tier-1 API oracle** — the statically-extracted FastAPI contract equals the app's own `openapi()` (from the sandboxed introspect oracle), and the route's cross-file grounding (handler + `response_model`) is asserted. -5. **Tier-3 knowledge-vs-RAG recall** — knowbase cross-file recall@k == 1.0 for every contract question (a *structural* floor: one artifact already spans both files, so it holds regardless of embedding quality); the RAG arm is reported but **never asserted**, so a model bump can't redden CI. -6. 
**Tier-4 one-hop invalidation** — a content diff invalidates *exactly* the artifacts whose grounding span changed (set-equality: no over-invalidation, no stale survivors); a version bump invalidates everything. -7. **Invariants** — zero orphans (every snapshot artifact is grounded), and re-indexing the same SHA yields the identical set of artifact ids. +5. **Tier-1 entities oracle** — extracted pydantic / dataclass / SQLAlchemy entities + their fields match a hand-labeled oracle, each grounded on its class span; a bare declarative `Base` is correctly *not* an entity and a `create_model(...)` model is asserted as a *known* gap. +6. **Tier-3 knowledge-vs-RAG recall** — knowbase cross-file recall@k == 1.0 for every contract question (a *structural* floor: one artifact already spans both files, so it holds regardless of embedding quality); the RAG arm is reported but **never asserted**, so a model bump can't redden CI. +7. **Tier-4 one-hop invalidation** — a content diff invalidates *exactly* the artifacts whose grounding span changed (set-equality: no over-invalidation, no stale survivors); a version bump invalidates everything. +8. **Invariants** — zero orphans (every snapshot artifact is grounded), and re-indexing the same SHA yields the identical set of artifact ids. The identity rules in `kb.ids` (and `kb.structural`) are **LOCKED**: changing one is a breaking change, gated behind a `NORMALIZATION_VERSION` / `extractor_version` bump so existing digests are invalidated rather than silently colliding. 
diff --git a/src/kb/daemon/cli.py b/src/kb/daemon/cli.py index d8372b9..dd8703b 100644 --- a/src/kb/daemon/cli.py +++ b/src/kb/daemon/cli.py @@ -11,6 +11,7 @@ import typer from kb.daemon.pipeline import index_commit +from kb.extract.deterministic.entities import EntityExtractor from kb.extract.deterministic.fastapi_contract import FastAPIExtractor from kb.extract.deterministic.imports import ImportExtractor from kb.introspect import introspect_app @@ -27,7 +28,9 @@ def index( ) -> None: """Index one commit: ingest, parse spans, run deterministic extractors, write the snapshot.""" engine = make_engine(db_url) - result = index_commit(engine, repo, sha, extractors=[ImportExtractor(), FastAPIExtractor()]) + result = index_commit( + engine, repo, sha, extractors=[ImportExtractor(), FastAPIExtractor(), EntityExtractor()] + ) engine.dispose() typer.echo( f"indexed {result.sha[:12]}: {result.files_indexed} files, {result.spans} spans, " diff --git a/src/kb/eval/tier1_entities_test.py b/src/kb/eval/tier1_entities_test.py new file mode 100644 index 0000000..fe6ecbb --- /dev/null +++ b/src/kb/eval/tier1_entities_test.py @@ -0,0 +1,134 @@ +"""HARD GATE — Tier 1: domain entities vs a hand-labeled oracle (DESIGN.md §4, §9). + +The hand-labeled ``EXPECTED_ENTITIES`` / ``EXPECTED_FIELDS`` are the real oracle (importing the +models to introspect them would execute user code). A bare declarative ``Base`` must NOT be an +entity, and a dynamically-built model (``create_model``) is a deliberate static-analysis blind spot, +asserted as a KNOWN gap — not a silent loss. Every entity is grounded on its class-definition span. 
+""" + +from __future__ import annotations + +from pathlib import Path + +from sqlalchemy import Engine, select + +from kb.daemon.pipeline import index_commit +from kb.eval._fixtures import make_git_repo +from kb.extract.deterministic.entities import EntityExtractor +from kb.store import models as m + +# A src-layout module: a pydantic model, a dataclass, a SQLAlchemy model (plus a bare declarative +# Base that is NOT an entity), and a dynamically-built model (invisible to static parsing). +FILES = { + "src/shop/__init__.py": "", + "src/shop/models.py": ( + "from dataclasses import dataclass\n" + "from pydantic import BaseModel, create_model\n" + "from sqlalchemy import Column, Integer\n" + "from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column\n" + "\n\n" + "class Order(BaseModel):\n" + " id: int\n" + " total: float = 0.0\n" + " note: str | None = None\n" + "\n\n" + "@dataclass\n" + "class LineItem:\n" + " sku: str\n" + " qty: int = 1\n" + "\n\n" + "class Base(DeclarativeBase):\n" + " pass\n" + "\n\n" + "class User(Base):\n" + ' __tablename__ = "users"\n' + " id: Mapped[int] = mapped_column(primary_key=True)\n" + " name: Mapped[str] = mapped_column()\n" + " legacy = Column(Integer)\n" + "\n\n" + 'Dynamic = create_model("Dynamic", x=(int, ...))\n' + ), +} + +# Hand-labeled oracle: (framework, fq class). `Base` and `Dynamic` are deliberately absent. 
+EXPECTED_ENTITIES = { + ("pydantic", "shop.models.Order"), + ("dataclass", "shop.models.LineItem"), + ("sqlalchemy", "shop.models.User"), +} +EXPECTED_FIELDS = { + "shop.models.Order": {"id", "total", "note"}, + "shop.models.LineItem": {"sku", "qty"}, + "shop.models.User": {"id", "name", "legacy"}, # __tablename__ is metadata, not a field +} +KNOWN_GAP = "shop.models.Dynamic" # create_model(): dynamic, invisible to static analysis + + +def _index(engine: Engine, tmp_path: Path) -> str: + sha = make_git_repo(tmp_path, [FILES])[0] + index_commit(engine, str(tmp_path), sha, extractors=[EntityExtractor()], first_party_root="src") + return sha + + +def _entity_payloads(engine: Engine, sha: str) -> list[dict]: + join = m.snapshot_entry.join( + m.artifact, m.artifact.c.artifact_id == m.snapshot_entry.c.artifact_id + ) + with engine.connect() as conn: + return list( + conn.execute( + select(m.artifact.c.payload) + .select_from(join) + .where(m.snapshot_entry.c.sha == sha, m.artifact.c.kind == "entity") + ).scalars() + ) + + +def test_entities_match_oracle(engine: Engine, tmp_path: Path) -> None: + sha = _index(engine, tmp_path) + found = {(p["framework"], p["qualified_name"]) for p in _entity_payloads(engine, sha)} + assert found == EXPECTED_ENTITIES + + +def test_fields_match_oracle(engine: Engine, tmp_path: Path) -> None: + sha = _index(engine, tmp_path) + by_key = {p["qualified_name"]: p for p in _entity_payloads(engine, sha)} + for qualified_name, expected in EXPECTED_FIELDS.items(): + names = {f["name"] for f in by_key[qualified_name]["fields"]} + assert names == expected, qualified_name + + +def test_bare_declarative_base_is_not_an_entity(engine: Engine, tmp_path: Path) -> None: + sha = _index(engine, tmp_path) + keys = {p["qualified_name"] for p in _entity_payloads(engine, sha)} + assert "shop.models.Base" not in keys # no __tablename__, no columns -> not a domain entity + + +def test_dynamic_model_is_a_known_gap(engine: Engine, tmp_path: Path) -> None: + sha = 
_index(engine, tmp_path) + keys = {p["qualified_name"] for p in _entity_payloads(engine, sha)} + assert KNOWN_GAP not in keys # documented blind spot, surfaced — not silently "found" + + +def test_entities_grounded_on_class_spans(engine: Engine, tmp_path: Path) -> None: + sha = _index(engine, tmp_path) + join = ( + m.snapshot_entry.join( + m.artifact, m.artifact.c.artifact_id == m.snapshot_entry.c.artifact_id + ) + .join( + m.artifact_derived_from, + m.artifact_derived_from.c.artifact_id == m.artifact.c.artifact_id, + ) + .join(m.code_span, m.code_span.c.span_id == m.artifact_derived_from.c.span_id) + ) + with engine.connect() as conn: + rows = conn.execute( + select(m.artifact.c.payload, m.code_span.c.span_kind) + .select_from(join) + .where(m.snapshot_entry.c.sha == sha, m.artifact.c.kind == "entity") + ).all() + assert rows # every entity is grounded (>=1 derived_from) + for row in rows: + assert row.span_kind == "class" + assert row.payload["span_mapping"] == "exact" diff --git a/src/kb/extract/deterministic/entities.py b/src/kb/extract/deterministic/entities.py new file mode 100644 index 0000000..9c10f9f --- /dev/null +++ b/src/kb/extract/deterministic/entities.py @@ -0,0 +1,372 @@ +"""Deterministic domain-entity extractor — pydantic / dataclass / SQLAlchemy (DESIGN.md §4, §14). + +Produces one ``entity`` artifact per domain class, grounded on that class's span (role +``class_definition``). Fully static: re-parses each class span's source with tree-sitter (the same +discipline as the FastAPI contract extractor); it never imports or executes user code. + +Detection is best-effort and the signals are recorded in the payload (never a silent guess): + * **dataclass** — a decorator whose dotted name ends in ``dataclass``. + * **pydantic** — a direct base named ``BaseModel`` / ``BaseSettings``. 
+ * **sqlalchemy** — a ``__tablename__`` assignment, or a field via ``Mapped[...]`` / + ``mapped_column(...)`` / ``Column(...)`` (so a bare declarative ``Base`` with + neither is correctly NOT treated as an entity). +``framework_versions`` (pydantic / sqlalchemy) is read from the ANALYZED repo at the SHA and folded +into the artifact key, since field interpretation can shift across major versions (DESIGN.md §6). +""" + +from __future__ import annotations + +import textwrap +import tomllib +from collections.abc import Sequence +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +import tree_sitter_python as tsp +from tree_sitter import Language, Node, Parser + +from kb.extract.base import DerivedEdge, ExtractContext, ExtractedArtifact +from kb.structural.interface import ParsedSpan + +EXTRACTOR_ID = "entities" +EXTRACTOR_VERSION = "1" + +_LANGUAGE = Language(tsp.language()) +_PYDANTIC_BASES = frozenset({"BaseModel", "BaseSettings"}) +_SA_COLUMN_CALLS = frozenset({"Column", "mapped_column"}) +_OPTIONAL_MARKERS = ("Optional[", "| None", "None |") +_VERSIONED = ("pydantic", "sqlalchemy") + + +@dataclass(frozen=True) +class _RawField: + name: str + annotation: str | None + has_default: bool + value_callee: str | None # innermost name of a call on the RHS, e.g. 
"mapped_column" | "Column" + + +class EntityExtractor: + extractor_id = EXTRACTOR_ID + extractor_version = EXTRACTOR_VERSION + + def __init__(self) -> None: + self._parser = Parser(_LANGUAGE) + + def extract(self, ctx: ExtractContext) -> list[ExtractedArtifact]: + versions = _framework_versions(ctx, _VERSIONED) + artifacts: list[ExtractedArtifact] = [] + for module, spans in ctx.spans_by_module.items(): + for span in spans: + if span.span_kind != "class": + continue + art = self._build_artifact(module, span, versions) + if art is not None: + artifacts.append(art) + return artifacts + + def _build_artifact( + self, module: str, span: ParsedSpan, versions: dict[str, str] + ) -> ExtractedArtifact | None: + root = self._parser.parse(textwrap.dedent(span.raw_text).encode("utf-8")).root_node + deco = _first_child_of_type(root, "decorated_definition") + cls = ( + _first_child_of_type(deco, "class_definition") + if deco is not None + else _first_child_of_type(root, "class_definition") + ) + if cls is None: + return None + + decorators = _decorator_names(deco) if deco is not None else [] + bases = _base_names(cls) + body = cls.child_by_field_name("body") + tablename, raw_fields, relationships = _parse_body(body) + + framework, signals, limitations = _classify(decorators, bases, tablename, raw_fields) + if framework is None: + return None + + fields = _select_fields(framework, raw_fields) + payload: dict[str, Any] = { + "framework": framework, + "class_name": span.fq_symbol_path.rsplit(".", 1)[-1], + "qualified_name": span.fq_symbol_path, + "module": module, + "bases": bases, + "fields": [ + { + "name": f.name, + "annotation": f.annotation, + "has_default": f.has_default, + "required": f.required, + "source": f.source, + } + for f in fields + ], + "tablename": tablename, + "relationships": relationships, + "detection_signals": signals, + "span_mapping": "exact", + "limitations": limitations, + } + framework_versions = ( + {} if framework == "dataclass" else {framework: 
versions.get(framework, "unknown")} + ) + return ExtractedArtifact( + kind="entity", + logical_key=f"entity:{span.fq_symbol_path}", + payload=payload, + derived_from=[DerivedEdge(span.span_id, "class_definition")], + extractor_id=self.extractor_id, + extractor_version=self.extractor_version, + framework_versions=framework_versions, + ) + + +# --- selected field (post-classification) ---------------------------------- + + +@dataclass(frozen=True) +class _Field: + name: str + annotation: str | None + has_default: bool + required: bool + source: str # "annotated" | "column" + + +def _select_fields(framework: str, raw: Sequence[_RawField]) -> list[_Field]: + out: list[_Field] = [] + for rf in raw: + if _is_dunder(rf.name): + continue + annotated = rf.annotation is not None + is_column = rf.value_callee in _SA_COLUMN_CALLS + if framework == "sqlalchemy": + is_mapped = rf.annotation is not None and rf.annotation.startswith("Mapped[") + if not (is_mapped or is_column): + continue + source = "annotated" if annotated else "column" + else: # pydantic / dataclass fields are always annotated + if not annotated: + continue + source = "annotated" + out.append( + _Field( + name=rf.name, + annotation=rf.annotation, + has_default=rf.has_default, + required=not rf.has_default and not _is_optional(rf.annotation), + source=source, + ) + ) + return out + + +def _classify( + decorators: Sequence[str], + bases: Sequence[str], + tablename: str | None, + raw: Sequence[_RawField], +) -> tuple[str | None, list[str], list[str]]: + is_dataclass = any(d.rsplit(".", 1)[-1] == "dataclass" for d in decorators) + has_column_field = any( + rf.value_callee in _SA_COLUMN_CALLS + or (rf.annotation is not None and rf.annotation.startswith("Mapped[")) + for rf in raw + ) + is_sqlalchemy = tablename is not None or has_column_field + is_pydantic = any(b in _PYDANTIC_BASES for b in bases) + + signals: list[str] = [] + if is_dataclass: + signals.append("dataclass_decorator") + if tablename is not None: + 
signals.append("sqlalchemy_tablename") + if has_column_field: + signals.append("sqlalchemy_column_field") + if is_pydantic: + signals.append("pydantic_base") + + limitations: list[str] = [] + if sum((is_dataclass, is_sqlalchemy, is_pydantic)) > 1: + limitations.append("multiple_framework_signals") + + # precedence: a dataclass decorator wins; then SQLAlchemy table/columns; then a pydantic base. + if is_dataclass: + return "dataclass", signals, limitations + if is_sqlalchemy: + return "sqlalchemy", signals, limitations + if is_pydantic: + return "pydantic", signals, limitations + return None, signals, limitations + + +# --- tree-sitter parsing of the class body ---------------------------------- + + +def _parse_body( + body: Node | None, +) -> tuple[str | None, list[_RawField], list[dict[str, str | None]]]: + """Return ``(__tablename__ literal, raw fields, relationships)`` from a class ``block``. + + Only DIRECT body statements are inspected, so assignments inside method bodies are not mistaken + for fields. 
+ """ + if body is None: + return None, [], [] + tablename: str | None = None + fields: list[_RawField] = [] + relationships: list[dict[str, str | None]] = [] + for stmt in body.named_children: + assign = _unwrap_assignment(stmt) + if assign is None: + continue + left = assign.child_by_field_name("left") + if left is None or left.type != "identifier": + continue + name = _text(left) + if name is None: + continue + right = assign.child_by_field_name("right") + callee = _innermost_call_name(right) if right is not None else None + if name == "__tablename__": + tablename = _string_value(right) if right is not None else None + continue + if callee == "relationship": + relationships.append({"name": name, "target": _first_argument_text(right)}) + fields.append( + _RawField( + name=name, + annotation=_text(assign.child_by_field_name("type")), + has_default=right is not None, + value_callee=callee, + ) + ) + return tablename, fields, relationships + + +def _unwrap_assignment(stmt: Node) -> Node | None: + """A class-body field is an ``assignment`` (possibly wrapped in an ``expression_statement``).""" + if stmt.type == "assignment": + return stmt + if stmt.type == "expression_statement": + inner = _first_child_of_type(stmt, "assignment") + if inner is not None: + return inner + return None + + +def _decorator_names(deco: Node) -> list[str]: + names: list[str] = [] + for child in deco.named_children: + if child.type != "decorator": + continue + target = child.named_children[0] if child.named_children else None + if target is None: + continue + if target.type == "call": + target = target.child_by_field_name("function") + text = _text(target) + if text is not None: + names.append(text) + return names + + +def _base_names(cls: Node) -> list[str]: + supers = cls.child_by_field_name("superclasses") + if supers is None: + return [] + names: list[str] = [] + for arg in supers.named_children: + if arg.type == "keyword_argument": # e.g. metaclass=... 
+ continue + text = _text(arg) + if text is not None: + names.append(text.rsplit(".", 1)[-1]) + return names + + +# --- small tree-sitter helpers (kept local; mirror the fastapi extractor) ---- + + +def _first_child_of_type(node: Node, type_name: str) -> Node | None: + for child in node.named_children: + if child.type == type_name: + return child + return None + + +def _innermost_call_name(node: Node) -> str | None: + """If ``node`` is itself a call, return the last dotted segment of its callee, else None.""" + if node.type != "call": + return None + fn = node.child_by_field_name("function") + if fn is None: + return None + text = _text(fn) + return text.rsplit(".", 1)[-1] if text is not None else None + + +def _first_argument_text(node: Node | None) -> str | None: + if node is None or node.type != "call": + return None + args = node.child_by_field_name("arguments") + if args is None: + return None + first = next((c for c in args.named_children), None) + return _text(first) if first is not None else None + + +def _string_value(node: Node) -> str | None: + if node.type != "string": + return None + contents = [ + (child.text or b"").decode("utf-8", errors="replace") + for child in node.named_children + if child.type == "string_content" + ] + if contents: + return "".join(contents) + return (node.text or b"").decode("utf-8", errors="replace").strip("\"'") + + +def _is_optional(annotation: str | None) -> bool: + return annotation is not None and any(marker in annotation for marker in _OPTIONAL_MARKERS) + + +def _is_dunder(name: str) -> bool: + return name.startswith("__") and name.endswith("__") + + +def _text(node: Node | None) -> str | None: + if node is None or node.text is None: + return None + return node.text.decode("utf-8") + + +def _framework_versions(ctx: ExtractContext, names: tuple[str, ...]) -> dict[str, str]: + """Best-effort versions of ``names`` from the repo's lockfiles / pyproject at the SHA.""" + root = Path(ctx.materialized_root) + targets = {name.lower(): 
name for name in names} + found: dict[str, str] = {} + for lock in ("uv.lock", "poetry.lock"): + path = root / lock + if path.exists(): + data = tomllib.loads(path.read_text()) + for pkg in data.get("package", []): + key = str(pkg.get("name", "")).lower() + if key in targets and "version" in pkg and targets[key] not in found: + found[targets[key]] = str(pkg["version"]) + pyproject = root / "pyproject.toml" + if pyproject.exists() and any(name not in found for name in names): + data = tomllib.loads(pyproject.read_text()) + for spec in data.get("project", {}).get("dependencies", []) or []: + # The requirement name is the leading run of letters, digits, ".", "_" and "-". + n = next((i for i, c in enumerate(spec) if not (c.isalnum() or c in "._-")), len(spec)) + canonical = targets.get(spec[:n].lower()) + if canonical is not None and canonical not in found: + found[canonical] = f"spec:{spec}" + return {name: found.get(name, "unknown") for name in names} diff --git a/src/kb/mcp/records.py b/src/kb/mcp/records.py index 8b91b60..0a6fa34 100644 --- a/src/kb/mcp/records.py +++ b/src/kb/mcp/records.py @@ -69,6 +69,10 @@ def summarize(kind: str, payload: dict[str, Any]) -> str: if kind == "api_route": model = payload.get("response_model_base") or "?" return f"{payload.get('method', '?')} {payload.get('path', '?')} -> {model}" + if kind == "entity": + framework = payload.get("framework", "?") + n_fields = len(payload.get("fields", [])) + return f"{payload.get('qualified_name', '?')} ({framework}, {n_fields} fields)" return kind
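
Reviewer note: the detection precedence described in the `entities.py` module docstring (a dataclass decorator wins, then SQLAlchemy table/column signals, then a direct pydantic base) can be tried out in isolation. The sketch below is only an approximation under stated assumptions: it uses the standard-library `ast` module instead of tree-sitter, the helper names `last_segment` and `classify` are hypothetical and not part of `kb`, and it treats any `__tablename__` assignment as a signal rather than requiring a string literal. The sample source is the fixture from `tier1_entities_test.py`.

```python
from __future__ import annotations

import ast

# Assumption: these mirror the constants in the extractor, re-declared here for illustration.
PYDANTIC_BASES = {"BaseModel", "BaseSettings"}
SA_COLUMN_CALLS = {"Column", "mapped_column"}

SOURCE = '''
from dataclasses import dataclass
from pydantic import BaseModel
from sqlalchemy import Column, Integer
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Order(BaseModel):
    id: int
    total: float = 0.0

@dataclass
class LineItem:
    sku: str
    qty: int = 1

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    legacy = Column(Integer)
'''


def last_segment(node: ast.expr) -> str:
    """Return the final dotted segment of an expression's source text."""
    return ast.unparse(node).rsplit(".", 1)[-1]


def classify(cls: ast.ClassDef) -> str | None:
    """Hypothetical stand-in for the extractor's classification step."""
    decorators = [d.func if isinstance(d, ast.Call) else d for d in cls.decorator_list]
    is_dataclass = any(last_segment(d) == "dataclass" for d in decorators)
    is_pydantic = any(last_segment(b) in PYDANTIC_BASES for b in cls.bases)
    is_sqlalchemy = False
    for stmt in cls.body:  # direct body statements only, never method bodies
        if isinstance(stmt, ast.Assign):
            names = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
            value, annotation = stmt.value, ""
        elif isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
            names = [stmt.target.id]
            value, annotation = stmt.value, ast.unparse(stmt.annotation)
        else:
            continue
        callee = last_segment(value.func) if isinstance(value, ast.Call) else None
        if (
            "__tablename__" in names
            or callee in SA_COLUMN_CALLS
            or annotation.startswith("Mapped[")
        ):
            is_sqlalchemy = True
    # Precedence: dataclass decorator, then SQLAlchemy signals, then a pydantic base.
    if is_dataclass:
        return "dataclass"
    if is_sqlalchemy:
        return "sqlalchemy"
    if is_pydantic:
        return "pydantic"
    return None


results = {
    node.name: classify(node)
    for node in ast.parse(SOURCE).body
    if isinstance(node, ast.ClassDef)
}
print(results)
```

The bare declarative `Base` has neither a `__tablename__` nor a column field, so it classifies as `None`, which is the behaviour the Tier-1 gate asserts for the real extractor.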