Add sdp-quarantine-pattern asset (reference pipeline + companion agent skill) #12
Merged
Lakeflow SDP pipeline demonstrating the inverse-expectations quarantine pattern on samples.nyctaxi.trips: drop expectations route violators to a separate quarantine table, valid rows flow to silver, and warn expectations log to the event log. Drop predicates are NULL-safe so silver and quarantine partition the input exactly once. Split tables across medallion schemas (bronze_trips in bronze; silver_trips + quarantine_trips in silver), published a queryable event log with parsing queries, and added cost/trace tags. Sanitized all external-course references. Full suite: 2352 passed / 163 skipped.
Turn the asset into a dual-door deliverable: the validated reference pipeline plus a companion agent skill that adapts the inverse-expectations quarantine pattern to the user's own dataset and verifies it. Skill ships a three-tier verification ladder led by offline local-Spark unit tests (partition invariant + NULL-trap, no workspace), then live read-only audits, then an optional source-parameterized integration test. Workflow splits at a deploy/run trust boundary (Phase A autonomous incl. running the unit tests; Phase B gated, default human-in-the-middle) and delegates deploy/run/query mechanics to the runtime. Validated live by dogfooding on samples.bakehouse via Claude Code + Sonnet 4.6. Also: name all three example tables consistently (raw/clean/quarantine) with fully-qualified identifiers, parameterize the source via quarantine.source, rename the event log table, and ship an offline unit-test suite. New target_dir/tests/ and skill_dir prompt.
Under the asset's keep-on-NULL expect_all_or_drop routing, a NULL-producing drop predicate keeps the row in both the clean and quarantine tables (double-counted), not 'both or neither'. Correct the phrasing in the asset README and SKILL.md and cite the Databricks `is false` operator. Replace 'production' with 'live (main-target)' wording in the self-verify reference.
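The double-counting described in the commit above can be reproduced without Spark. The sketch below is a minimal pure-Python model, not the asset's code: `None` stands in for SQL `NULL`, `keep_on_null` models the drop-only-when-FALSE routing, and the `fare` column and both rule functions are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# SQL three-valued logic modelled with Python values: True, False, None (NULL).
Tri = Optional[bool]
Row = dict


def sql_not(value: Tri) -> Tri:
    """SQL NOT: negating NULL yields NULL."""
    return None if value is None else not value


def keep_on_null(value: Tri) -> bool:
    """Model of drop routing: a row is dropped only when the predicate is FALSE."""
    return value is not False


def naive_rule(row: Row) -> Tri:
    """Models `fare > 0`, which evaluates to NULL when fare is NULL."""
    fare = row["fare"]
    return None if fare is None else fare > 0


def null_safe_rule(row: Row) -> Tri:
    """Models `fare IS NOT NULL AND fare > 0`, which never evaluates to NULL."""
    fare = row["fare"]
    return fare is not None and fare > 0


def route(rows: List[Row], rule: Callable[[Row], Tri]) -> Tuple[List[Row], List[Row]]:
    """Clean keeps rows passing the rule; quarantine keeps rows passing its inverse."""
    clean = [r for r in rows if keep_on_null(rule(r))]
    quarantine = [r for r in rows if keep_on_null(sql_not(rule(r)))]
    return clean, quarantine


rows = [{"id": 1, "fare": 12.5}, {"id": 2, "fare": -3.0}, {"id": 3, "fare": None}]

# With the naive rule, the NULL row (id 3) is kept by both tables.
naive_clean, naive_quarantine = route(rows, naive_rule)

# With the NULL-safe rule, the NULL row lands in quarantine only.
safe_clean, safe_quarantine = route(rows, null_safe_rule)
```

In this model the naive rule yields three raw rows but four routed rows, which is the broken partition the commit message describes; the NULL-safe rule restores a one-to-one split.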
Summary
Adds `sdp-quarantine-pattern`, a dual-door asset, to the asset library: a validated reference Lakeflow Spark Declarative Pipeline that demonstrates the inverse-expectations quarantine pattern on the public `samples.nyctaxi.trips` dataset, plus a companion agent skill that adapts the pattern to the user's own dataset and verifies it. Critical (drop) expectations route violating rows to a `quarantine_trips` table while valid rows flow to `clean_trips`; advisory (warn) expectations annotate the event log without dropping.

The central lesson the asset teaches is the NULL trap: `expect_all_or_drop` keeps a row whose predicate evaluates to SQL `NULL`, so a naive inverse expectation double-counts NULL rows into both tables and breaks the partition. Every drop predicate is written NULL-safe so clean and quarantine partition the input exactly once.

Changes
- `assets/sdp-quarantine-pattern/`: pipeline notebook, declarative `expectations.json`, pure `expectations.py` helper (loads rules, derives the clean predicate and its inverse), DABs pipeline resource with a published event log and `event_log_queries.sql`, source parameterization via `quarantine.source`, and an in-bundle usage doc.
- `<target_dir>/tests/`: a real local-Spark `pytest` suite that proves the partition invariant and NULL routing on crafted edge rows, modeling `expect_all_or_drop` keep-on-NULL semantics with `IS NOT FALSE`.
- `<skill_dir>/skills/sdp-quarantine-pattern/` (`SKILL.md` + `references/adapt-the-pattern.md` + `references/self-verify.md`): enforces NULL-safe drop predicates, splits work at a deploy/run trust boundary (Phase A adapt + offline unit tests with no workspace; Phase B gated deploy + live-verify, default human-in-the-middle), and ships a three-tier verification ladder.
- `tests/assets/test_sdp_quarantine_pattern.py` and config `tests/configs/assets/sdp_quarantine_pattern.json`.
- `ASSETS.md`, `ROADMAP.md`, `CHANGELOG.md` ([1.9.0]), `pyproject.toml` (Google docstring convention).

Change Area
- `assets/<name>/`

Configuration Axes Affected
- `databricks_template_schema.json`

Testing
- `pytest tests/ -v`
- `databricks bundle init . --template-dir assets/sdp-quarantine-pattern`

Additionally, the offline unit-test suite was run with a real local Spark session (PySpark 4.x, JDK 17), and the pattern was validated live on serverless SDP: the partition invariant held exactly (21,847 clean + 85 quarantine = 21,932 raw at baseline), and a row with a NULL in a drop column was routed to quarantine without leaking into clean.
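The partition invariant checked here can be expressed as a small offline test. The following is a sketch in plain Python rather than the asset's local-Spark suite: the rule names, the `fare` and `distance` columns, and the `is_clean`, `split`, and `check_partition` helpers are all hypothetical, and `None` stands in for SQL `NULL`.

```python
from typing import Callable, Dict, List, Tuple

Row = dict
Rule = Callable[[Row], bool]  # NULL-safe by construction: always True or False

# Hypothetical drop rules, each written so that a NULL column yields False.
DROP_RULES: Dict[str, Rule] = {
    "positive_fare": lambda r: r["fare"] is not None and r["fare"] > 0,
    "positive_distance": lambda r: r["distance"] is not None and r["distance"] > 0,
}


def is_clean(row: Row) -> bool:
    """Clean predicate: the conjunction of every drop rule."""
    return all(rule(row) for rule in DROP_RULES.values())


def split(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Route each row to clean, or to quarantine via the inverse predicate."""
    clean = [r for r in rows if is_clean(r)]
    quarantine = [r for r in rows if not is_clean(r)]
    return clean, quarantine


def check_partition(raw: List[Row], clean: List[Row], quarantine: List[Row]) -> None:
    """Assert that every raw row lands in exactly one of the two outputs."""
    assert len(clean) + len(quarantine) == len(raw)
    clean_ids = {r["id"] for r in clean}
    quarantine_ids = {r["id"] for r in quarantine}
    assert clean_ids.isdisjoint(quarantine_ids)
    assert clean_ids | quarantine_ids == {r["id"] for r in raw}


# Crafted edge rows: one valid row, NULLs in each drop column, and bad values.
edge_rows = [
    {"id": 1, "fare": 10.0, "distance": 2.0},
    {"id": 2, "fare": None, "distance": 2.0},
    {"id": 3, "fare": 10.0, "distance": None},
    {"id": 4, "fare": -1.0, "distance": 0.0},
    {"id": 5, "fare": None, "distance": None},
]

clean, quarantine = split(edge_rows)
check_partition(edge_rows, clean, quarantine)

# The same count identity, applied to the baseline figures reported above.
assert 21_847 + 85 == 21_932
```

Because each rule here can only return True or False, the inverse predicate is a plain negation and the split is a partition by construction; the check is mainly useful for catching a rule that was written without its NULL guard.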
Asset Changes (if applicable)
- `databricks bundle init . --template-dir assets/<name> --output-dir <dir>`
- `library/helpers.tmpl` or other assets
- `tests/configs/assets/<name>.json` added

Checklist
- `{{ }}` blocks
- `.tmpl` files appear in generated output