databricks-template — agentic development for Databricks + production-ready ETL

🚀 Overview

Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys — all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.

If this saves you time, a star helps others find it. Let's connect on LinkedIn.

🧪 Technologies

Databricks Free Edition (Serverless)
Databricks Runtime 18.0 LTS
Databricks Unity Catalog
Databricks Declarative Automation Bundles (formerly Asset Bundles)
Databricks CLI
Databricks Python SDK
Databricks DQX
Databricks AI Dev Kit
Databricks Dashboards
Claude Code
PySpark 4.1
Spark Declarative Pipelines (SDP)
Python 3.12+
GitHub Actions
Pytest

📦 Features

This project template demonstrates how to:

use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md and a specs/ folder documenting the project's conventions.
structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use GitHub Actions to automate the CI/CD pipeline.
utilize Databricks Lakeflow Jobs to execute a DAG, so you don't need Airflow to manage your DAGs here. Generate job definitions that run with environment-specific conditions using the Databricks SDK.
isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
utilize job tags to track issues, costs, and ownership.
use the medallion architecture to organize your data.
use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
apply Delta liquid clustering and incremental load to build more efficient pipelines.
run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
run integration tests by setting the input data and validating the output data.
run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data.
use Databricks AI/BI Dashboards to visualize the gold layer.
utilize the coverage package to generate test coverage reports.
use structured logging, which gives you full observability during incidents without a code change.
lint and format code with ruff and pre-commit.
use a Makefile to automate repetitive tasks.
utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
utilize service principals to run production code.
utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the scripts folder for examples.
utilize Databricks Unity Catalog to manage permissions and get data lineage.
enforce production guardrails out of the box — identity-locked CI deploys, a health-check task, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting.
track project cloud spend with cost reports from AWS (Cost Explorer) and Databricks (system.billing.usage).
utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
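The split between deploy-time and runtime configuration described in the list above can be pictured with a small standard-library sketch. It is only an illustration under stated assumptions: the variable names `TARGET_ENV` and `CATALOG_PREFIX` and the parameters `--catalog` and `--run-date` are hypothetical and are not taken from this template.

```python
import argparse


def load_deploy_config(env: dict[str, str]) -> dict[str, str]:
    """Deploy-time settings: read once by CI or the Makefile, not by the running job."""
    return {
        # TARGET_ENV and CATALOG_PREFIX are hypothetical variable names.
        "target_env": env.get("TARGET_ENV", "dev"),
        "catalog_prefix": env.get("CATALOG_PREFIX", "template"),
    }


def parse_runtime_params(argv: list[str]) -> argparse.Namespace:
    """Runtime settings: job parameters that can be overridden for a single run."""
    parser = argparse.ArgumentParser(description="ETL task entry point (sketch)")
    # --catalog and --run-date are hypothetical parameter names.
    parser.add_argument("--catalog", required=True)
    parser.add_argument("--run-date", default=None)
    return parser.parse_args(argv)


# Deploy time: the environment decides which catalog the job defaults to.
deploy = load_deploy_config({"TARGET_ENV": "staging"})
default_args = ["--catalog", f"{deploy['catalog_prefix']}_{deploy['target_env']}"]

# Run time: an operator appends an override without rebuilding or redeploying.
params = parse_runtime_params(default_args + ["--run-date", "2025-01-31"])
print(params.catalog, params.run_date)
```

In CI, `os.environ` would be passed to `load_deploy_config` in place of the literal dict. The point of the design is that nothing in `parse_runtime_params` reads the environment, so a run can be re-parameterized from the UI alone.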

📐 Specs

Deep technical detail lives in specs/ (the README stays a landing page):

Architecture — wheel/CLI surface, jobs DAG, job generation, CI/CD, job-level params, deploy-time env vars, logging, production guardrails, folder structure.
Data model — catalog/schema isolation, medallion data flow (diagram), table schemas, price-freeze semantics, liquid clustering, DQX/quarantine, lineage.
Test plan — unit, integration, and load tests.
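As a rough illustration of the logging convention listed under Architecture, structured logging with the standard library could look like the sketch below. The logger name, the `context` key, and the table name are assumptions made for this example; the template's actual logging setup is documented in specs/.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # the `context` key is a convention of this sketch only.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stand-in for stdout or the job's driver log
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("template.etl")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record out of the root logger's handlers

# `silver.orders` is a made-up table name used only for illustration.
logger.info("rows written", extra={"context": {"table": "silver.orders", "rows": 1250}})
entry = json.loads(stream.getvalue())
print(entry)
```

Because every record is a JSON object with stable keys, log lines can be filtered by field (for example by table) during an incident instead of being searched as free text.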

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

Dashboard

Instructions

(Optional) Install Databricks AI Dev Kit and Claude Code.
Create a Databricks Free Edition workspace.
Install and configure the Databricks CLI on your local machine. Check the current version in databricks.yml. Follow the instructions here.
Set up the Python environment and run unit tests on your local machine.
```
 make sync && make unit-test
```
Initialize the workspace. First create an external location in Databricks and update the storage-root parameter in the Makefile. For more details, see Overview of external locations. Then run the command below, which creates the catalogs, schemas, service principal, and the required grants:
```
 make init
```

Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

 [dev]
 host          = https://xxxx.cloud.databricks.com/
 token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                 
 [staging]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

 [prod]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
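Before deploying, it can help to confirm that each profile carries the keys shown above. The following is a minimal sketch using only the standard library; the helper name `missing_keys` and the placeholder values are illustrative and are not part of the template or the Databricks CLI.

```python
import configparser

# Same shape as the profiles above, with placeholder values and no [prod] section.
SAMPLE_CFG = """
[dev]
host  = https://xxxx.cloud.databricks.com/
token = placeholder-token

[staging]
host          = https://xxxx.cloud.databricks.com/
client_id     = placeholder-id
client_secret = placeholder-secret
"""

# Keys each profile is expected to carry, following the layout shown above.
REQUIRED_KEYS = {
    "dev": {"host", "token"},
    "staging": {"host", "client_id", "client_secret"},
    "prod": {"host", "client_id", "client_secret"},
}


def missing_keys(cfg_text: str) -> dict[str, set[str]]:
    """Return, per profile, the required keys that are absent from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    problems: dict[str, set[str]] = {}
    for profile, required in REQUIRED_KEYS.items():
        present = set(parser[profile]) if parser.has_section(profile) else set()
        absent = required - present
        if absent:
            problems[profile] = absent
    return problems


report = missing_keys(SAMPLE_CFG)
print(report)  # the sample lacks a [prod] profile, so only "prod" is reported
```

To check a real file, pass it the contents of `.databrickscfg`, for example `missing_keys((pathlib.Path.home() / ".databrickscfg").read_text())`.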

Deploy and execute on the dev workspace.
```
 make deploy env=dev
```
Configure CI/CD automation by adding the service principal ID and secret as GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).
(Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
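For the CI/CD step above, a workflow could fail fast when a repository secret is missing and rename the secrets to the variables that Databricks unified authentication reads (`DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, `DATABRICKS_CLIENT_SECRET`). The sketch below is an assumption about how such a mapping might look; it does not describe the workflows shipped in .github/workflows, and `build_auth_env` is a hypothetical helper.

```python
# Repository secret name -> variable name read by Databricks unified authentication.
# The left-hand names come from this README; the mapping itself is an assumption.
SECRET_TO_AUTH_VAR = {
    "DATABRICKS_HOST": "DATABRICKS_HOST",
    "DATABRICKS_PRINCIPAL_ID": "DATABRICKS_CLIENT_ID",
    "DATABRICKS_SECRET": "DATABRICKS_CLIENT_SECRET",
}


def build_auth_env(ci_env: dict[str, str]) -> dict[str, str]:
    """Raise if any secret is missing or empty, then rename them for the CLI/SDK."""
    missing = sorted(name for name in SECRET_TO_AUTH_VAR if not ci_env.get(name))
    if missing:
        raise RuntimeError(f"Missing repository secrets: {', '.join(missing)}")
    return {auth_var: ci_env[secret] for secret, auth_var in SECRET_TO_AUTH_VAR.items()}


# Placeholder values in the same style as the profile example above.
auth_env = build_auth_env(
    {
        "DATABRICKS_HOST": "https://xxxx.cloud.databricks.com/",
        "DATABRICKS_PRINCIPAL_ID": "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy",
        "DATABRICKS_SECRET": "placeholder-secret",
    }
)
print(sorted(auth_env))
```

In a real workflow, `os.environ` would be passed in and the returned mapping exported before calling the Databricks CLI, so that a missing secret stops the run before any deploy is attempted.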
