Refactor databricks-apps around capability composition + warehouse mutations #132

Open
MarioCadenas wants to merge 5 commits into main from refactor-app-capability-composition
Conversation

@MarioCadenas MarioCadenas commented Jun 9, 2026

Contributor

Summary

Unifies the former #135 + #132 stack into a single PR (based on main). Refactors databricks-apps so agents compose apps from capabilities (reads_warehouse, writes_oltp, genie, files, …) instead of monolithic archetype docs, adds the warehouse-mutations write path, and teaches the skill to handle two environments (local vs agentic mode).

Capability refactor

  • Add warehouse-mutations.md — Delta/UC DML via appkit.analytics.query() in custom routes
  • Add data-patterns.md — canonical capability catalog, conditional gates, write/read paths, recipes, checklist slices
  • Add lifecycle.md — dev / validate / deploy ordering
  • Slim SKILL.md to a thin orchestrator
  • Split lakebase.md into router + lakebase-oltp.md + lakebase-synced-reads.md
  • Strip duplicate pattern tables from plugin guides; trim custom-endpoints.md → points at data-patterns; mark proto-first.md advanced-only

Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true)

  • Add environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md
  • In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired.
  • Short agentic callouts across lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations
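The Step-0 detection branch described above can be sketched as follows. This is a minimal illustration, not the skill's actual logic: the phase names, their ordering, and the function names are hypothetical labels chosen here, and the exact string comparison against `"true"` is an assumption about how the variable is read.

```python
import os

# Hypothetical phase labels. The ordering loosely follows the
# Scaffold -> Develop -> Validate -> Deploy lifecycle named in this PR;
# where the gates and smoke test sit in that order is an assumption.
LOCAL_PHASES = [
    "design_discovery_gates",
    "provisioning_gates",
    "scaffold",
    "develop",
    "validate",
    "deploy",
    "smoke_test",
]

# Per the PR description, agentic mode skips these phases.
AGENTIC_SKIPPED = {"provisioning_gates", "scaffold", "deploy", "smoke_test"}


def is_agentic(env=None):
    """Return True when DATABRICKS_APPS_AGENTIC_MODE is exactly 'true'."""
    env = os.environ if env is None else env
    return env.get("DATABRICKS_APPS_AGENTIC_MODE") == "true"


def phases_to_run(env=None):
    """List the lifecycle phases that apply in the detected environment."""
    if is_agentic(env):
        return [p for p in LOCAL_PHASES if p not in AGENTIC_SKIPPED]
    return list(LOCAL_PHASES)


print(phases_to_run({"DATABRICKS_APPS_AGENTIC_MODE": "true"}))
```

In agentic mode the sketch keeps only the design+discovery gates, development, and validation, which matches the rule that `databricks apps validate` still runs while scaffold and deploy do not.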

Review fixes (P1–P3)

  • Capability flags marked as concepts, not --features values
  • One canonical write-path table; downstream guides guard-and-link instead of restating it
  • warehouse-mutations.md leads with the simple inline pattern (generic optional); non-mutating smoke check; simplified lifecycle matrix; standardized await createApp

Supersedes #135 (its commits are included here).

Test plan

  • python3 scripts/skills.py generate && python3 scripts/skills.py validate
  • Verify appkit.analytics.query() supports DML on the shipped AppKit version before relying on warehouse-mutations.md
  • Spot-check local flow: multi-plugin request → data-patterns gates → correct deep guide
  • Spot-check agentic flow: DATABRICKS_APPS_AGENTIC_MODE=true → reads appkit.plugins.json, no scaffold/deploy, ambient auth
  • Spot-check: form/CRUD → Lakebase; Delta write → warehouse mutations; dashboard → SQL queries
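The last spot-check amounts to a small routing table from request type to data path. A rough sketch, assuming hypothetical request-kind keys and a hypothetical `route` helper (the skill itself expresses this routing in prose in data-patterns.md, not in code):

```python
# Request kinds (keys) are illustrative labels; the targets are the
# three data paths named in the test plan.
ROUTES = {
    "form_crud": "Lakebase",
    "delta_write": "warehouse mutations",
    "dashboard": "SQL queries",
}


def route(request_kind):
    """Return the data path for a request kind, failing loudly if unknown."""
    try:
        return ROUTES[request_kind]
    except KeyError:
        raise ValueError(f"no data path defined for {request_kind!r}") from None


# The spot-check itself, written as assertions.
assert route("form_crud") == "Lakebase"
assert route("delta_write") == "warehouse mutations"
assert route("dashboard") == "SQL queries"
```

Raising on an unknown kind, rather than defaulting to one path, mirrors the PR's preference for stopping and surfacing a gap instead of inferring.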

@MarioCadenas MarioCadenas force-pushed the replace-trpc-with-custom-endpoints branch from cf2c711 to 4c43aa3 Compare June 10, 2026 09:15
Document Delta/UC DML via custom routes, unify write-path guidance across skills, and expand Lakebase scaffolding and deployment notes.
The prior reorder conflated Lakebase deploy-first with the full lifecycle;
keep Scaffold → Develop → Validate → Deploy and call out the OLTP exception.
Add data-patterns and lifecycle guides, slim SKILL.md to a 5-step agent workflow, dedupe overview and plugin guides, and broaden skill frontmatter for multi-plugin apps.
Extract OLTP and synced-read guides from the monolithic lakebase doc,
add a thin router, point data-patterns and cross-skill links at the
right targets, and trim custom-endpoints/proto-first duplication.
@MarioCadenas MarioCadenas force-pushed the refactor-app-capability-composition branch from f238e3f to 2cc1f99 Compare June 10, 2026 10:10
@MarioCadenas MarioCadenas changed the base branch from replace-trpc-with-custom-endpoints to app-data-path-docs June 10, 2026 10:11
@MarioCadenas MarioCadenas changed the title Refactor databricks-apps around capability composition Refactor databricks-apps around capability composition + warehouse mutations Jun 10, 2026
@MarioCadenas MarioCadenas changed the base branch from app-data-path-docs to main June 10, 2026 10:28
Adds a Local-vs-agentic-mode split keyed to DATABRICKS_APPS_AGENTIC_MODE,
plus the P1-P3 review fixes from the data-path refactor.

Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true):
- New references/appkit/environments.md as the canonical Local-vs-agentic
  delta; Step-0 detection branch in SKILL.md.
- In agentic mode the app is pre-scaffolded and all plugin resources are
  provisioned: read enabled plugins from appkit.plugins.json / app.yaml
  (don't infer); ambient auth (no profile, omit --profile); run only
  design+discovery gates; skip provisioning gates, scaffold, deploy, and
  smoke tests; npm run dev hits live resources; still run databricks apps
  validate. Stop and surface if a needed plugin isn't wired.
- Short agentic callouts in lifecycle, data-patterns, lakebase-oltp,
  genie, model-serving, files, jobs, overview, sql-queries,
  warehouse-mutations.

Doc fixes:
- Capability flags marked as concepts, not --features values.
- Single canonical write-path table in data-patterns; custom-endpoints
  and warehouse-mutations now guard-and-link instead of restating it.
- warehouse-mutations leads with the inline pattern; generic is optional.
- Reframed the warehouse smoke test to a non-mutating check.
- Simplified the lifecycle phase matrix; standardized await createApp.

Co-authored-by: Isaac
@keugenek

Contributor

🧪 Dev eval run kicked off for this PR

Running the app-evals pipeline with the databricks-apps skill pinned to this branch, to check the capability-composition refactor doesn't regress generated-app quality.

Setup

  • Skill loaded from this branch — confirmed on the eval cluster:
    Installing Databricks skills with ref override: refactor-app-capability-composition
    Using skills version refactor-app-capability-composition
    Installed 9 skills (databricks-core, databricks-apps, databricks-lakebase, databricks-model-serving, …)
    
  • Preset: custom-pr — 10 apps spanning every promptset (warehouse reads, Lakebase OLTP cb_brickhouse, Genie, model serving, devhub) + 5 edit tasks. Covers the surfaces this PR reorganizes (capability composition, warehouse mutations, lakebase split, genie/serving/files guides).
  • Env: dev-dogfood, CLI v1.2.1, run 119124075098867 (internal Databricks workspace).

Status: generation in progress (~45–60 min). I'll follow up with per-app appeval_100 (build / unit / smoke / typecheck / apps validate / local runability), the average against the ≥ 0.85 health gate, and any edit regressions — compared against last night's stock-skills nightly as a baseline for the overlapping apps.

Note: a prod nightly is running concurrently and shares the Anthropic API key, so an isolated wall_clock_timeout on a heavy app may be capacity contention rather than a skill regression — I'll flag it if that happens.

@keugenek

Contributor

✅ Eval results — no generation-quality regression from this PR

Run 119124075098867 · custom-pr preset (10 apps across every promptset + 5 edits), databricks-apps skill pinned to this branch.

| Metric | Result |
| --- | --- |
| Apps at appeval_100 = 1.0 | 9 / 10 |
| Avg appeval_100 | 0.90 (health gate ≥ 0.85 ✓) |
| Edit build regressions | 0 |
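For reference, the average follows directly from the per-app scores reported in this comment (nine apps at 1.0, `genie_taxi_chat` at 0.0). A quick check, with the variable names chosen here for illustration:

```python
from statistics import mean

# Per-app appeval_100 scores as reported in this comment.
scores = {
    "property_search_app": 1.0,
    "parts_catalog_app": 1.0,
    "taxi_zones_map": 1.0,
    "city_performance_app": 1.0,
    "cb_brickhouse_simple": 1.0,
    "cb_genie_chat_advanced": 1.0,
    "cb_pixels_simple": 1.0,
    "serving_chat": 1.0,
    "devhub_saas_tracker": 1.0,
    "genie_taxi_chat": 0.0,
}

HEALTH_GATE = 0.85  # threshold quoted in the eval comments

avg = mean(scores.values())
passes_gate = avg >= HEALTH_GATE

print(f"avg={avg:.2f} passes_gate={passes_gate}")  # avg=0.90 passes_gate=True
```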

Generation — all 1.0 (build + unit + smoke + typecheck + apps validate + local runability): property_search_app, parts_catalog_app, taxi_zones_map, city_performance_app, cb_brickhouse_simple (Lakebase), cb_genie_chat_advanced, cb_pixels_simple, serving_chat, devhub_saas_tracker.

The one miss (genie_taxi_chat = 0.0) was a non-reproducible scaffold slip, not a regression

In the main run the agent scaffolded the app one directory too deep (genie-taxi-chat/genie-taxi-chat/), so the harness couldn't find package.json → build skipped → 0.0. Generation otherwise reported success but took 48 turns (high), consistent with the agent fumbling the layout. I re-ran it both ways on the same appkit (0.38.1):

| Run | Skill | genie_taxi_chat | Layout |
| --- | --- | --- | --- |
| original | this branch | 0.0 | genie-taxi-chat/genie-taxi-chat/ |
| re-run | this branch | 1.0 | genie-taxi-chat/ |
| control | stock main | 1.0 | genie-taxi-chat/ |

On re-run the PR skill produced a correct app identical to stock — the double-nesting was a one-off. Soft flag: worth a glance at whether the lifecycle/scaffold guidance reorg makes an extra wrapper directory slightly more likely, but it is not deterministic.
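A check along these lines could tell a double-nested scaffold apart from a genuinely missing app. It is only a sketch: the helper names are hypothetical and this is not how the eval harness locates `package.json`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def locate_app_root(app_dir: Path) -> Optional[Path]:
    """Return the directory holding package.json, looking one level down
    into a same-named subdirectory if the expected root has none."""
    if (app_dir / "package.json").is_file():
        return app_dir
    nested = app_dir / app_dir.name
    if (nested / "package.json").is_file():
        return nested
    return None


def is_double_nested(app_dir: Path) -> bool:
    """True when the app was scaffolded into <name>/<name>/ instead of <name>/."""
    root = locate_app_root(app_dir)
    return root is not None and root != app_dir


# Reproduce the layout from the original run in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp) / "genie-taxi-chat"
    (app / "genie-taxi-chat").mkdir(parents=True)
    (app / "genie-taxi-chat" / "package.json").write_text("{}")
    print(is_double_nested(app))  # True
```

Reporting the nesting separately from a build failure would make this kind of slip visible in the results instead of surfacing only as a 0.0.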

Edits — 0 build regressions

| Edit | Δ appeval | Note |
| --- | --- | --- |
| property_search_app · add_emoji | −0.17 | smoke test pass→fail (build + unit OK); likely flaky selector |
| city_performance_app · fix_critical_issue | 0 | no critical issue found (legit no-op) |
| taxi_zones_map · simplify_code | 0 | clean |
| parts_catalog_app · drop_unrequested_feature | 0 | clean |
| parts_catalog_app · multi_turn_additive | 0 | clean |
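The −0.17 delta is consistent with one of six checks flipping, if the six appeval_100 components listed earlier are weighted equally. That weighting is an assumption made here for illustration and is not stated anywhere in this thread:

```python
# ASSUMPTION: appeval_100 is an equally weighted mean of these six checks.
# The check names come from the eval comments; the weighting does not.
CHECKS = ["build", "unit", "smoke", "typecheck", "apps_validate", "local_runability"]


def appeval(results):
    """Fraction of checks that passed, given a {check: bool} mapping."""
    return sum(results[c] for c in CHECKS) / len(CHECKS)


before = {c: True for c in CHECKS}
after = dict(before, smoke=False)  # smoke test went pass -> fail

delta = appeval(after) - appeval(before)
print(round(delta, 2))  # -0.17
```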

Bottom line

The capability-composition refactor generates apps on par with stock skills across warehouse-read, Lakebase OLTP, Genie, model-serving, and devhub prompts — no build or quality regression. Skill confirmed installed from this branch (Using skills version refactor-app-capability-composition, 9 skills, no rate-limit on CLI v1.2.1).

Caveat: a prod nightly shared the Anthropic API key during the main run; it didn't materially affect results (the single transient slip cleared on re-run).

@keugenek

Contributor

🧪 Full eval set now running on this PR

Following the custom-pr smoke result above, I kicked off the full nightly-lakebase set (89 apps) with the skill pinned to this branch — the same comprehensive sweep the canonical prod nightly runs (73 nightly + 16 Lakebase, every promptset), plus the edit suite (~50 edit tasks across all 5 edit types).

  • Run: 526143550950576 on dev-dogfood (internal Databricks workspace), CLI v1.2.1.
  • Skill: confirmed installed from this branch again (Using skills version refactor-app-capability-composition).
  • Status: generation in progress (89 apps; ~1.5–2 h end-to-end).

I'll follow up with the aggregate appeval_100 vs the ≥ 0.85 health gate, a per-promptset breakdown, every generation failure / edit regression with a skill-vs-appkit-vs-flaky attribution, and a comparison against the stock-skills nightly. Any ambiguous skill-suspected failure gets a matched stock re-run to confirm (as with genie_taxi_chat above).

2 participants