databricks-genie skill: add data-grounded Genie Space authoring playbook #560

Open
ryanbates99 wants to merge 1 commit into databricks-solutions:main from ryanbates99:enrich-genie-space-authoring
@ryanbates99

What

Adds databricks-skills/databricks-genie/authoring.md — a playbook for authoring a high-quality Genie Space via a fully-curated serialized_space, and wires it into SKILL.md and spaces.md at the natural decision points.

Why

The databricks-genie skill's create guidance stops at the basics: display name, table list, description, and a few sample questions. The manage_genie tool already accepts a complete serialized_space payload, but the skill never teaches the model to author one. As a result, the curation that actually drives Genie answer quality gets skipped:

  • column synonyms / value vocabulary
  • structured instructions
  • certified example question→SQL pairs
  • join specs
  • reusable measures / filters / expressions
  • benchmarks

Spaces created via the skill therefore answer worse than they could, because the model improvises filter values from column names (e.g. status = 'ACTIVE' when the data holds 'active') and leaves the rich layers empty.
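The casing mismatch described above can be caught mechanically once the real values are in hand. The following is a minimal sketch, not the playbook's own code: `distinct_values_query` and `ground_literal` are hypothetical helper names, the table and column names are placeholders, and the assumption is that the distinct values have already been fetched (for example via `execute_sql`).

```python
from typing import Iterable, Optional


def distinct_values_query(table: str, column: str, limit: int = 50) -> str:
    """Build the query used to look at what a column really contains."""
    return f"SELECT DISTINCT {column} FROM {table} LIMIT {limit}"


def ground_literal(candidate: str, observed: Iterable[str]) -> Optional[str]:
    """Map a guessed filter literal onto the spelling stored in the data.

    Returns the stored value that matches case-insensitively, or None when
    the guess does not correspond to any observed value at all.
    """
    by_folded = {value.casefold(): value for value in observed}
    return by_folded.get(candidate.strip().casefold())


# Hypothetical values, as if returned by running the query above.
observed_statuses = ["active", "inactive", "pending"]

grounded = ground_literal("ACTIVE", observed_statuses)   # "active"
missing = ground_literal("archived", observed_statuses)  # None
```

A `None` result is the signal to drop or rewrite the filter rather than embed an invented value.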

What the playbook adds

  • Golden rule: ground everything in real data. Inspect with get_table_stats_and_schema, then pull actual distinct values with execute_sql before generating any SQL/instructions — never invent status/category/tier values.
  • Per-layer authoring guidance with sensible default counts: table/column descriptions + synonyms, 5 sample questions, GSL-structured text instructions (the five canonical section headers, including the verbatim "Instructions you must follow when providing summaries" string), ~12 example question→SQL pairs with usage_guidance and parameters, 5 measures / 5 filters / 3 expressions, join specs for 2+ tables, 10 benchmarks.
  • Exact serialized_space field shapes for every layer (the array-of-lines SQL convention, the --rt=FROM_RELATIONSHIP_TYPE_*-- join tag, parameter type_hint normalization, excluded-column handling).
  • API constraints to respect while assembling: at most 1 text_instructions object, ≤25,000 chars per string, ≤30 tables, ≤3.5 MB total, 32-char lowercase-hex unique IDs, and array sorting rules.
  • A SQL-validation step — test every example query and benchmark with execute_sql and fix or drop failures before embedding.
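The API constraints listed above lend themselves to a pre-flight check before the payload is sent. This is a rough sketch under stated assumptions: the top-level keys `tables` and `text_instructions` and the `id` field name are placeholders rather than the documented `serialized_space` layout, and reading "3.5 MB" as 3.5 MiB is a guess. Only the limits themselves come from the description.

```python
import json
import re
import uuid

HEX_ID = re.compile(r"[0-9a-f]{32}")
MAX_STRING_CHARS = 25_000
MAX_TABLES = 30
MAX_TOTAL_BYTES = int(3.5 * 1024 * 1024)  # assumption: MB interpreted as MiB


def new_id() -> str:
    """uuid4().hex is always 32 lowercase hexadecimal characters."""
    return uuid.uuid4().hex


def check_space(space: dict) -> list:
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    seen_ids = set()

    def visit(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}.{key}"
                # Assumption: every object identifier lives under a key named "id".
                if key == "id" and isinstance(value, str):
                    if not HEX_ID.fullmatch(value):
                        problems.append(f"{child}: not 32-char lowercase hex")
                    elif value in seen_ids:
                        problems.append(f"{child}: duplicate id")
                    seen_ids.add(value)
                visit(value, child)
        elif isinstance(node, list):
            for index, item in enumerate(node):
                visit(item, f"{path}[{index}]")
        elif isinstance(node, str) and len(node) > MAX_STRING_CHARS:
            problems.append(f"{path}: string exceeds {MAX_STRING_CHARS} chars")

    visit(space, "$")
    # Hypothetical top-level keys; adjust to the real payload layout.
    if len(space.get("tables", [])) > MAX_TABLES:
        problems.append(f"more than {MAX_TABLES} tables")
    if len(space.get("text_instructions", [])) > 1:
        problems.append("more than 1 text_instructions object")
    if len(json.dumps(space).encode("utf-8")) > MAX_TOTAL_BYTES:
        problems.append("serialized payload exceeds the size limit")
    return problems
```

Collecting every violation instead of raising on the first one lets the author fix the whole payload in a single pass.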

This is documentation only and adapts the approach to the kit's existing MCP tool surface (get_table_stats_and_schema, execute_sql, manage_genie). No code changes: the skill installer auto-discovers extra files in the skill directory, and the test manifest asserts no specific file list.
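The SQL-validation step can be pictured as a filter over the example list. In this sketch the `execute_sql` tool is stood in by a plain callable backed by an in-memory SQLite database, the `question` and `sql` keys are illustrative names, and joining the array of SQL lines with newlines is an assumption about the array-of-lines convention.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple


def validate_examples(
    examples: Iterable[dict], run_sql: Callable[[str], object]
) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Run each example query; keep the ones that execute, report the rest."""
    kept, dropped = [], []
    for example in examples:
        query = "\n".join(example["sql"])  # assumption: one SQL line per item
        try:
            run_sql(query)
        except Exception as err:  # any execution failure disqualifies the example
            dropped.append((example["question"], str(err)))
        else:
            kept.append(example)
    return kept, dropped


# Local stand-in for the warehouse, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'active')")

examples = [
    {
        "question": "How many active orders are there?",
        "sql": ["SELECT COUNT(*)", "FROM orders", "WHERE status = 'active'"],
    },
    {
        "question": "What is revenue by region?",
        "sql": ["SELECT region, SUM(amount)", "FROM orders", "GROUP BY region"],
    },
]

kept, dropped = validate_examples(examples, lambda q: conn.execute(q).fetchall())
```

This only catches queries that fail to execute. A query that runs but filters on a value the data never contains still passes, which is why grounding the values first matters.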

Testing

  • Verified the installer (install_genie_code_skills.py) uploads all non-SKILL.md files in a skill directory via the git tree listing, so authoring.md ships automatically.
  • Verified the databricks-genie test manifest uses expected_files: [] and does not assert on a file list, so adding a reference file does not break the skill test baseline.
  • Cross-checked every documented serialized_space field shape and API constraint against the Genie Conversation API docs (serialized_space schema + validation rules) and the canonical GSL instruction section vocabulary.

This pull request and its description were written by Isaac.

…skill

The databricks-genie skill's create guidance stops at the basics (name,
tables, description, sample questions) and leaves Genie to infer the rest.
The manage_genie tool already accepts a full serialized_space payload, but
the skill never teaches the model to author one — so curated instructions,
column synonyms, certified example SQL, join specs, reusable
measures/filters, and benchmarks all get skipped, and spaces answer worse
than they could.

Add authoring.md: a playbook for building a high-quality serialized_space,
grounded in the table's real values rather than invented ones. Covers the
canonical GSL text-instruction sections, ~12 example question->SQL pairs
with usage guidance and parameters, 5 measures / 5 filters / 3 expressions,
join specs, 10 benchmarks, exact serialized_space field shapes, the API
constraints (1 text instruction, 25k chars/field, 30 tables, 3.5MB, 32-char
hex IDs, array sorting), and a SQL-validation step before embedding.

Wire it into SKILL.md (when-to-use, Quick Start, Reference Files) and
spaces.md (creation workflow, poor-query-generation troubleshooting). No
code change needed: the installer auto-discovers extra skill files and the
test manifest asserts no specific file list.

Co-authored-by: Isaac