Refactor databricks-apps around capability composition + warehouse mutations by MarioCadenas · Pull Request #132 · databricks/databricks-agent-skills

MarioCadenas · 2026-06-09T16:44:26Z

Summary

Unifies the former #135 + #132 stack into a single PR (based on main). Refactors databricks-apps so agents compose apps from capabilities (reads_warehouse, writes_oltp, genie, files, …) instead of monolithic archetype docs, adds the warehouse-mutations write path, and teaches the skill to handle two environments (local vs agentic mode).

Capability refactor

Add warehouse-mutations.md — Delta/UC DML via appkit.analytics.query() in custom routes
Add data-patterns.md — canonical capability catalog, conditional gates, write/read paths, recipes, checklist slices
Add lifecycle.md — dev / validate / deploy ordering
Slim SKILL.md to a thin orchestrator
Split lakebase.md into router + lakebase-oltp.md + lakebase-synced-reads.md
Strip duplicate pattern tables from plugin guides; trim custom-endpoints.md → points at data-patterns; mark proto-first.md advanced-only

Agentic mode (`DATABRICKS_APPS_AGENTIC_MODE=true`)

Add environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md
In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired.
Short agentic callouts across lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations
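The local-vs-agentic branch described above can be sketched roughly as follows. This is an illustration only, not code from the skill: the step labels, their ordering around the gates, and the case-insensitive match on `true` are assumptions. Only the environment variable name, the skipped phases, and the "omit --profile" rule come from the description.

```python
import os

# Step labels are hypothetical names for this sketch, not identifiers from the skill.
LOCAL_STEPS = [
    "design_gates", "discovery_gates", "provisioning_gates",
    "scaffold", "develop", "validate", "deploy", "smoke_test",
]
# Phases the description says agentic mode skips.
AGENTIC_SKIPPED = {"provisioning_gates", "scaffold", "deploy", "smoke_test"}


def is_agentic(env=None):
    """True when DATABRICKS_APPS_AGENTIC_MODE is set to 'true' (case-insensitivity is an assumption)."""
    env = os.environ if env is None else env
    return env.get("DATABRICKS_APPS_AGENTIC_MODE", "").strip().lower() == "true"


def plan_steps(env=None):
    """Return the steps to run for the detected environment."""
    if is_agentic(env):
        return [s for s in LOCAL_STEPS if s not in AGENTIC_SKIPPED]
    return list(LOCAL_STEPS)


def cli_args(base, profile, env=None):
    """Append --profile only in local mode; agentic mode relies on ambient auth."""
    if is_agentic(env) or not profile:
        return list(base)
    return list(base) + ["--profile", profile]
```

For example, `plan_steps({"DATABRICKS_APPS_AGENTIC_MODE": "true"})` keeps the design and discovery gates, develop, and validate, and `cli_args(["databricks", "apps", "validate"], "dev", ...)` drops the profile flag in that mode.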

Review fixes (P1–P3)

Capability flags marked as concepts, not --features values
One canonical write-path table; downstream guides guard-and-link instead of restating it
warehouse-mutations.md leads with the simple inline pattern (generic optional); non-mutating smoke check; simplified lifecycle matrix; standardized await createApp

Supersedes #135 (its commits are included here).

Test plan

python3 scripts/skills.py generate && python3 scripts/skills.py validate
Verify appkit.analytics.query() supports DML on the shipped AppKit version before relying on warehouse-mutations.md
Spot-check local flow: multi-plugin request → data-patterns gates → correct deep guide
Spot-check agentic flow: DATABRICKS_APPS_AGENTIC_MODE=true → reads appkit.plugins.json, no scaffold/deploy, ambient auth
Spot-check: form/CRUD → Lakebase; Delta write → warehouse mutations; dashboard → SQL queries

Adds a Local-vs-agentic-mode split keyed to DATABRICKS_APPS_AGENTIC_MODE, plus the P1-P3 review fixes from the data-path refactor.

Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true):
- New references/appkit/environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md.
- In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired.
- Short agentic callouts in lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations.

Doc fixes:
- Capability flags marked as concepts, not --features values.
- Single canonical write-path table in data-patterns; custom-endpoints and warehouse-mutations now guard-and-link instead of restating it.
- warehouse-mutations leads with the inline pattern; generic is optional.
- Reframed the warehouse smoke test to a non-mutating check.
- Simplified the lifecycle phase matrix; standardized await createApp.

Co-authored-by: Isaac

keugenek · 2026-06-10T11:47:31Z

🧪 Dev eval run kicked off for this PR

Running the app-evals pipeline with the databricks-apps skill pinned to this branch, to check the capability-composition refactor doesn't regress generated-app quality.

Setup

Skill loaded from this branch — confirmed on the eval cluster:

Installing Databricks skills with ref override: refactor-app-capability-composition
Using skills version refactor-app-capability-composition
Installed 9 skills (databricks-core, databricks-apps, databricks-lakebase, databricks-model-serving, …)

Preset: custom-pr — 10 apps spanning every promptset (warehouse reads, Lakebase OLTP cb_brickhouse, Genie, model serving, devhub) + 5 edit tasks. Covers the surfaces this PR reorganizes (capability composition, warehouse mutations, lakebase split, genie/serving/files guides).
Env: dev-dogfood, CLI v1.2.1, run 119124075098867 (internal Databricks workspace).

Status: generation in progress (~45–60 min). I'll follow up with per-app appeval_100 (build / unit / smoke / typecheck / apps validate / local runability), the average against the ≥ 0.85 health gate, and any edit regressions — compared against last night's stock-skills nightly as a baseline for the overlapping apps.

Note: a prod nightly is running concurrently and shares the Anthropic API key, so an isolated wall_clock_timeout on a heavy app may be capacity contention rather than a skill regression — I'll flag it if that happens.

keugenek · 2026-06-10T12:33:44Z

✅ Eval results — no generation-quality regression from this PR

Run 119124075098867 — custom-pr preset (10 apps across every promptset + 5 edits), databricks-apps skill pinned to this branch.

Metric	Result
Apps at `appeval_100` = 1.0	9 / 10
Avg `appeval_100`	0.90 (health gate ≥ 0.85 ✓)
Edit build regressions	0

Generation — all 1.0 (build + unit + smoke + typecheck + apps validate + local runability): property_search_app, parts_catalog_app, taxi_zones_map, city_performance_app, cb_brickhouse_simple (Lakebase), cb_genie_chat_advanced, cb_pixels_simple, serving_chat, devhub_saas_tracker.
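As a quick sanity check on the table, the average and gate result can be reproduced from the per-app scores reported in this comment. This is a minimal sketch, not the eval pipeline's own aggregation; it assumes the reported average is a plain unweighted mean.

```python
HEALTH_GATE = 0.85  # threshold quoted in the eval comments

# Per-app appeval_100 scores as reported for the original run.
scores = {
    "property_search_app": 1.0,
    "parts_catalog_app": 1.0,
    "taxi_zones_map": 1.0,
    "city_performance_app": 1.0,
    "cb_brickhouse_simple": 1.0,
    "cb_genie_chat_advanced": 1.0,
    "cb_pixels_simple": 1.0,
    "serving_chat": 1.0,
    "devhub_saas_tracker": 1.0,
    "genie_taxi_chat": 0.0,
}


def summarize(scores, gate=HEALTH_GATE):
    """Unweighted mean, count of perfect scores, and gate verdict."""
    avg = sum(scores.values()) / len(scores)
    perfect = sum(1 for s in scores.values() if s == 1.0)
    return {"avg": avg, "perfect": perfect, "total": len(scores), "passes_gate": avg >= gate}


summary = summarize(scores)
print(f"{summary['perfect']} / {summary['total']} at 1.0, "
      f"avg {summary['avg']:.2f}, gate ok: {summary['passes_gate']}")
# prints: 9 / 10 at 1.0, avg 0.90, gate ok: True
```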

The one miss (`genie_taxi_chat` = 0.0) was a non-reproducible scaffold slip, not a regression

In the main run the agent scaffolded the app one directory too deep (genie-taxi-chat/genie-taxi-chat/), so the harness couldn't find package.json → build skipped → 0.0. Generation otherwise reported success but took 48 turns (high), consistent with the agent fumbling the layout. I re-ran it both ways on the same appkit (0.38.1):

Run	Skill	`genie_taxi_chat`	Layout
original	this branch	0.0	`genie-taxi-chat/genie-taxi-chat/` ✗
re-run	this branch	1.0	`genie-taxi-chat/` ✓
control	stock `main`	1.0	`genie-taxi-chat/` ✓

On re-run the PR skill produced a correct app identical to stock — the double-nesting was a one-off. Soft flag: worth a glance at whether the lifecycle/scaffold guidance reorg makes an extra wrapper directory slightly more likely, but it is not deterministic.
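A layout check along these lines could flag the double-nesting case before the build step is skipped. It is a hypothetical helper for illustration, not the harness's actual lookup logic; the function name and return shape are assumptions.

```python
from pathlib import Path


def find_app_root(expected: Path):
    """Locate package.json at the expected root or one wrapper level deeper.

    Returns (root, nested): root is the directory holding package.json
    (or None if neither location has it); nested is True when the app was
    scaffolded into <name>/<name>/ instead of <name>/.
    """
    if (expected / "package.json").is_file():
        return expected, False
    wrapper = expected / expected.name
    if (wrapper / "package.json").is_file():
        return wrapper, True
    return None, False
```

For the original run, `find_app_root(Path("genie-taxi-chat"))` would report the nested case, which makes a one-off scaffold slip easy to tell apart from a genuine build failure.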

Edits — 0 build regressions

Edit	Δ appeval	Note
`property_search_app` · add_emoji	−0.17	smoke test pass→fail (build + unit OK) — likely flaky selector
`city_performance_app` · fix_critical_issue	0	no critical issue found (legit no-op)
`taxi_zones_map` · simplify_code	0	clean
`parts_catalog_app` · drop_unrequested_feature	0	clean
`parts_catalog_app` · multi_turn_additive	0	clean
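One way to read the −0.17 delta: if appeval_100 is an equal-weight mean over the six checks named earlier (build, unit, smoke, typecheck, apps validate, local runability), a single check flipping from pass to fail costs 1/6. The equal weighting is an assumption made for this sketch and is not confirmed anywhere in the thread.

```python
# Check names taken from the eval comments; equal weighting is an assumption.
CHECKS = ["build", "unit", "smoke", "typecheck", "apps_validate", "local_runability"]


def appeval(results):
    """Fraction of checks that passed, under the equal-weight assumption."""
    return sum(1 for c in CHECKS if results[c]) / len(CHECKS)


before = {c: True for c in CHECKS}    # all checks passing before the edit
after = dict(before, smoke=False)     # smoke test flips pass -> fail

delta = appeval(after) - appeval(before)
print(round(delta, 2))  # prints: -0.17
```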

Bottom line

The capability-composition refactor generates apps on par with stock skills across warehouse-read, Lakebase OLTP, Genie, model-serving, and devhub prompts — no build or quality regression. Skill confirmed installed from this branch (Using skills version refactor-app-capability-composition, 9 skills, no rate-limit on CLI v1.2.1).

Caveat: a prod nightly shared the Anthropic API key during the main run; it didn't materially affect results (the single transient slip cleared on re-run).

keugenek · 2026-06-10T13:20:43Z

🧪 Full eval set now running on this PR

Following the custom-pr smoke result above, I kicked off the full nightly-lakebase set (89 apps) with the skill pinned to this branch — the same comprehensive sweep the canonical prod nightly runs (73 nightly + 16 Lakebase, every promptset), plus the edit suite (~50 edit tasks across all 5 edit types).

Run: 526143550950576 on dev-dogfood (internal Databricks workspace), CLI v1.2.1.
Skill: confirmed installed from this branch again (Using skills version refactor-app-capability-composition).
Status: generation in progress (89 apps; ~1.5–2 h end-to-end).

I'll follow up with the aggregate appeval_100 vs the ≥ 0.85 health gate, a per-promptset breakdown, every generation failure / edit regression with a skill-vs-appkit-vs-flaky attribution, and a comparison against the stock-skills nightly. Any ambiguous skill-suspected failure gets a matched stock re-run to confirm (as with genie_taxi_chat above).

MarioCadenas requested review from a team, lennartkats-db and simonfaltum as code owners June 9, 2026 16:44

MarioCadenas force-pushed the replace-trpc-with-custom-endpoints branch from cf2c711 to 4c43aa3 Compare June 10, 2026 09:15

MarioCadenas added 4 commits June 10, 2026 12:09

Add warehouse mutations guide and sharpen app data-path docs.

eea9958

Document Delta/UC DML via custom routes, unify write-path guidance across skills, and expand Lakebase scaffolding and deployment notes.

Restore develop-before-validate workflow in AppKit overview.

76f7afe

The prior reorder conflated Lakebase deploy-first with the full lifecycle; keep Scaffold → Develop → Validate → Deploy and call out the OLTP exception.

Refactor databricks-apps skills around capability composition.

ce35cb6

Add data-patterns and lifecycle guides, slim SKILL.md to a 5-step agent workflow, dedupe overview and plugin guides, and broaden skill frontmatter for multi-plugin apps.

Split Lakebase docs and finish capability-refactor polish.

2cc1f99

Extract OLTP and synced-read guides from the monolithic lakebase doc, add a thin router, point data-patterns and cross-skill links at the right targets, and trim custom-endpoints/proto-first duplication.

MarioCadenas force-pushed the refactor-app-capability-composition branch from f238e3f to 2cc1f99 Compare June 10, 2026 10:10

MarioCadenas changed the base branch from replace-trpc-with-custom-endpoints to app-data-path-docs June 10, 2026 10:11

MarioCadenas changed the title ~~Refactor databricks-apps around capability composition~~ Refactor databricks-apps around capability composition + warehouse mutations Jun 10, 2026

MarioCadenas mentioned this pull request Jun 10, 2026

Add warehouse mutations guide and sharpen app data-path docs #135

Closed

2 tasks

MarioCadenas changed the base branch from app-data-path-docs to main June 10, 2026 10:28

MarioCadenas wants to merge 5 commits into main from refactor-app-capability-composition