
Add sdp-quarantine-pattern asset (reference pipeline + companion agent skill)#12

Merged
vmariiechko merged 4 commits into main from feature/sdp-quarantine-pattern-asset on Jun 6, 2026


vmariiechko commented Jun 6, 2026

Owner

Summary

Adds sdp-quarantine-pattern, a dual-door asset to the asset library: a validated reference Lakeflow Spark Declarative Pipeline that demonstrates the inverse-expectations quarantine pattern on the public samples.nyctaxi.trips dataset, plus a companion agent skill that adapts the pattern to the user's own dataset and verifies it. Critical (drop) expectations route violating rows to a quarantine_trips table while valid rows flow to clean_trips; advisory (warn) expectations annotate the event log without dropping.

The central lesson the asset teaches is the NULL trap: expect_all_or_drop keeps a row whose predicate evaluates to SQL NULL, so a naive inverse expectation double-counts NULL rows into both tables and breaks the partition. Every drop predicate is written NULL-safe so clean and quarantine partition the input exactly once.
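
The trap can be reproduced without Spark. The following is a minimal sketch, assuming SQL NULL is modeled as Python `None` and that a drop expectation keeps any row whose predicate is not FALSE, as described above. The helper names and the sample `fares` values are illustrative and are not taken from the asset.

```python
from typing import Optional

def sql_gt(value: Optional[float], bound: float) -> Optional[bool]:
    """SQL-style `value > bound`: a NULL operand yields NULL."""
    return None if value is None else value > bound

def sql_not(p: Optional[bool]) -> Optional[bool]:
    """SQL-style NOT: NOT NULL is still NULL."""
    return None if p is None else not p

def kept(p: Optional[bool]) -> bool:
    """Keep-on-NULL drop semantics: a row survives unless the predicate is FALSE."""
    return p is not False

def safe_fare(f: Optional[float]) -> bool:
    """NULL-safe predicate `fare IS NOT NULL AND fare > 0`; never evaluates to NULL."""
    return f is not None and f > 0

fares = [12.5, -3.0, None]  # hypothetical input rows

# Naive pair: clean uses `fare > 0`, quarantine uses `NOT (fare > 0)`.
naive_clean = [f for f in fares if kept(sql_gt(f, 0))]
naive_quarantine = [f for f in fares if kept(sql_not(sql_gt(f, 0)))]

# NULL-safe pair: the predicate is always TRUE or FALSE, so its inverse is exact.
safe_clean = [f for f in fares if kept(safe_fare(f))]
safe_quarantine = [f for f in fares if kept(not safe_fare(f))]

print(len(naive_clean) + len(naive_quarantine))  # 4 rows out of 3: NULL is double-counted
print(len(safe_clean) + len(safe_quarantine))    # 3 rows out of 3: exact partition
```

In the naive pair the NULL row survives both filters; in the NULL-safe pair it lands only in quarantine.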

Changes

  • New asset assets/sdp-quarantine-pattern/: pipeline notebook, declarative expectations.json, pure expectations.py helper (loads rules, derives the clean predicate and its inverse), DABs pipeline resource with a published event log and event_log_queries.sql, source parameterization via quarantine.source, and an in-bundle usage doc.
  • Offline unit-test suite shipped inside the asset (<target_dir>/tests/): a real local-Spark pytest that proves the partition invariant and NULL routing on crafted edge rows, modeling expect_all_or_drop keep-on-NULL semantics with IS NOT FALSE.
  • Companion agent skill at <skill_dir>/skills/sdp-quarantine-pattern/ (SKILL.md + references/adapt-the-pattern.md + references/self-verify.md): enforces NULL-safe drop predicates, splits work at a deploy/run trust boundary (Phase A adapt + offline unit tests with no workspace; Phase B gated deploy + live-verify, default human-in-the-middle), and ships a three-tier verification ladder.
  • Repo-level tests tests/assets/test_sdp_quarantine_pattern.py and config tests/configs/assets/sdp_quarantine_pattern.json.
  • Catalog and docs: ASSETS.md, ROADMAP.md, CHANGELOG.md ([1.9.0]), pyproject.toml (Google docstring convention).
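
To illustrate how a helper could derive the clean predicate and its inverse from declarative rules, here is a rough sketch. The JSON layout, the rule names, and both function names are assumptions made for illustration; the asset's actual expectations.json and expectations.py may be structured differently.

```python
import json

# Hypothetical rules layout; the real expectations.json may differ.
RULES_JSON = """
{
  "drop": {
    "valid_fare": "fare_amount IS NOT NULL AND fare_amount > 0",
    "valid_distance": "trip_distance IS NOT NULL AND trip_distance > 0"
  },
  "warn": {
    "reasonable_fare": "fare_amount < 500"
  }
}
"""

def clean_predicate(drop_rules: dict) -> str:
    """AND together every drop rule; a clean row must satisfy all of them."""
    return " AND ".join(f"({expr})" for expr in drop_rules.values())

def quarantine_predicate(drop_rules: dict) -> str:
    """Logical inverse of the clean predicate."""
    return f"NOT ({clean_predicate(drop_rules)})"

rules = json.loads(RULES_JSON)
clean_sql = clean_predicate(rules["drop"])
quarantine_sql = quarantine_predicate(rules["drop"])

print(clean_sql)
print(quarantine_sql)
```

Because each drop rule in this sketch guards its column with `IS NOT NULL`, the combined predicate never evaluates to NULL, so wrapping it in `NOT (...)` selects exactly the rows the clean predicate rejects. Warn rules are left out of both predicates since they annotate rather than route.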

Change Area

  • Asset Library (assets/<name>/)

Configuration Axes Affected

  • Unity Catalog / schemas
  • Template schema (databricks_template_schema.json)
  • Asset Library (new asset, asset schema, or framework changes)

Testing

  • All tests pass (pytest tests/ -v)
  • Manual template generation tested (databricks bundle init . --template-dir assets/sdp-quarantine-pattern)
  • New tests added for new functionality (if applicable)

Additionally, the offline unit-test suite was run with a real local Spark session (PySpark 4.x, JDK 17) and the pattern was validated live on serverless SDP: the partition invariant held exactly (21,847 clean + 85 quarantine = 21,932 raw at baseline) and a NULL drop column routed to quarantine without leaking into clean.
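
The partition invariant amounts to two conditions: no row appears in both tables, and together the tables cover the raw input. A small sketch of that check, assuming rows can be identified by a key (the `partition_holds` function and the sample ids are hypothetical, not part of the asset's test suite):

```python
def partition_holds(raw_ids: list, clean_ids: list, quarantine_ids: list) -> bool:
    """True when clean and quarantine split the raw rows exactly once."""
    clean, quarantine = set(clean_ids), set(quarantine_ids)
    return (
        clean.isdisjoint(quarantine)                             # no double-counting
        and (clean | quarantine) == set(raw_ids)                 # no row lost
        and len(clean_ids) + len(quarantine_ids) == len(raw_ids)  # counts add up
    )

# Count-level check using the figures reported for the live run.
clean_count, quarantine_count, raw_count = 21_847, 85, 21_932
counts_match = clean_count + quarantine_count == raw_count

# Row-level check on made-up ids: id 3 leaking into both tables breaks the partition.
ok = partition_holds([1, 2, 3], [1, 2], [3])
leaky = partition_holds([1, 2, 3], [1, 2, 3], [3])

print(counts_match, ok, leaky)
```

Matching counts alone are a necessary condition only; the row-level disjointness check is what catches a NULL row that is double-counted while another row is lost.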

Asset Changes (if applicable)

  • Asset installs standalone via databricks bundle init . --template-dir assets/<name> --output-dir <dir>
  • Asset is self-contained (no references to library/helpers.tmpl or other assets)
  • tests/configs/assets/<name>.json added
  • Asset appears in ASSETS.md catalog

Checklist

  • Go template syntax is valid (no unclosed {{ }} blocks)
  • No .tmpl files appear in generated output
  • Generated YAML files are valid
  • Documentation updated (if behavior changed)

Lakeflow SDP pipeline demonstrating the inverse-expectations quarantine pattern on samples.nyctaxi.trips: drop expectations route violators to a separate quarantine table, valid rows flow to silver, and warn expectations log to the event log. Drop predicates are NULL-safe so silver and quarantine partition the input exactly once. Also split tables across medallion schemas (bronze_trips in bronze; silver_trips + quarantine_trips in silver), published a queryable event log with parsing queries, and added cost/trace tags. Sanitized all external-course references. Full suite: 2352 passed / 163 skipped.
Turn the asset into a dual-door deliverable: the validated reference pipeline plus a companion agent skill that adapts the inverse-expectations quarantine pattern to the user's own dataset and verifies it.

Skill ships a three-tier verification ladder led by offline local-Spark unit tests (partition invariant + NULL-trap, no workspace), then live read-only audits, then an optional source-parameterized integration test. Workflow splits at a deploy/run trust boundary (Phase A autonomous incl. running the unit tests; Phase B gated, default human-in-the-middle) and delegates deploy/run/query mechanics to the runtime. Validated live by dogfooding on samples.bakehouse via Claude Code + Sonnet 4.6. Also: name all three example tables consistently (raw/clean/quarantine) with fully-qualified identifiers, parameterize the source via quarantine.source, rename the event log table, and ship an offline unit-test suite. New target_dir/tests/ and skill_dir prompt.
Under the asset's keep-on-NULL expect_all_or_drop routing, a NULL-producing drop predicate keeps the row in both the clean and quarantine tables (double-counted), not 'both or neither'. Correct the phrasing in the asset README and SKILL.md and cite the Databricks `is false` operator. Replace 'production' with 'live (main-target)' wording in the self-verify reference.
vmariiechko merged commit d895f08 into main on Jun 6, 2026
1 check passed
vmariiechko deleted the feature/sdp-quarantine-pattern-asset branch on June 6, 2026 12:02