Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys — all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.
If this saves you time, a star helps others find it. Let's connect on LinkedIn.
- Databricks Free Edition (Serverless)
- Databricks Runtime 18.0 LTS
- Databricks Unity Catalog
- Databricks Declarative Automation Bundles (formerly Asset Bundles)
- Databricks CLI
- Databricks Python SDK
- Databricks DQX
- Databricks AI Dev Kit
- Databricks Dashboards
- Claude Code
- PySpark 4.1
- Spark Declarative Pipelines (SDP)
- Python 3.12+
- GitHub Actions
- Pytest
This project template demonstrates how to:
- use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a `CLAUDE.md` and a `specs/` folder documenting the project's conventions.
- structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
- package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use GitHub Actions to automate the CI/CD pipeline.
- utilize Databricks Lakeflow Jobs to execute a DAG (yes, you don't need Airflow to manage your DAGs here!). Generate job definitions that run with environment-specific conditions using the Databricks SDK.
- isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
- separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
- utilize job tags to track issues, costs, and ownership.
- use the medallion architecture to organize your data.
- use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
- apply Delta liquid clustering and incremental load to build more efficient pipelines.
- run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
- run integration tests by setting the input data and validating the output data.
- run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data.
- use Databricks AI/BI Dashboards to visualize the gold layer.
- utilize the coverage package to generate test coverage reports.
- use structured logging, so you can investigate incidents from the logs without a code change.
- lint and format code with ruff and pre-commit.
- use a Makefile to automate repetitive tasks.
- utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
- utilize service principals to run production code.
- utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the `scripts` folder for examples.
- utilize Databricks Unity Catalog to manage permissions and get data lineage.
- enforce production guardrails out of the box — identity-locked CI deploys, a health-check task, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting.
- track project cloud spend with cost reports from AWS (Cost Explorer) and Databricks (`system.billing.usage`).
- utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
Deep technical detail lives in specs/ (the README stays a landing page):
- Architecture — wheel/CLI surface, jobs DAG, job generation, CI/CD, job-level params, deploy-time env vars, logging, production guardrails, folder structure.
- Data model — catalog/schema isolation, medallion data flow (diagram), table schemas, price-freeze semantics, liquid clustering, DQX/quarantine, lineage.
- Test plan — unit, integration, and load tests.
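
To give a feel for the quarantine flow referenced in the data model spec, here is a plain-Python sketch that routes rows failing a null check, a threshold check, or a uniqueness check into a separate quarantine list. It does not use the DQX API, and the column names (`order_id`, `amount`) and rule names are hypothetical; the template's real rules are defined with DQX and run on Spark DataFrames.

```python
from typing import Any, Callable

Row = dict[str, Any]
Check = Callable[[Row], bool]

# Hypothetical row-level rules: each returns True when the row passes.
CHECKS: dict[str, Check] = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: r.get("amount") is not None and r["amount"] >= 0,
}


def split_rows(rows: list[Row]) -> tuple[list[Row], list[Row]]:
    """Return (valid, quarantined); quarantined rows record which checks failed."""
    valid: list[Row] = []
    quarantined: list[Row] = []
    seen_ids: set[Any] = set()
    for row in rows:
        failed = [name for name, check in CHECKS.items() if not check(row)]
        # Uniqueness: any repeat of an already-seen order_id is flagged,
        # even if the first occurrence was itself quarantined.
        order_id = row.get("order_id")
        if order_id is not None:
            if order_id in seen_ids:
                failed.append("order_id_unique")
            seen_ids.add(order_id)
        if failed:
            quarantined.append({**row, "_failed_checks": failed})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 1, "amount": 7.5},
    {"order_id": 2, "amount": -3.0},
]
valid, quarantined = split_rows(rows)
print(len(valid), "valid;", len(quarantined), "quarantined")
```

Recording the failed rule names on each quarantined row makes it possible to triage bad data later without rerunning the checks.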
Agentic development:
- Claude Code: 5 Essentials for Data Engineering
- Mastering Claude Code in 30 minutes
- Introducing Databricks AI Dev Kit - Skills, MCP server, Builder App
Debates on the use of notebooks vs. Python packaging:
- The Rise of The Notebook Engineer
- Please don’t make me use Databricks notebooks
- this LinkedIn thread by Daniel Beach
- this LinkedIn thread by Ryan Chynoweth
- this LinkedIn thread by Jaco van Gelder
Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:
- CI/CD for Databricks: Advanced Asset Bundles and GitHub Actions
- Deploying Databricks Asset Bundles (DABs) at Scale
- A Prescription for Success: Leveraging DABs for Faster Deployment and Better Patient Outcomes
Other resources:
- Goodbye Pip and Poetry. Why UV Might Be All You Need
- The Spark Revolution You Didn’t See Coming: How Apache Spark 4.0 in Databricks Just Changed Everything
- (Optional) Install Databricks AI Dev Kit and Claude Code.
- Create a Databricks Free Edition workspace.
- Install and configure the Databricks CLI on your local machine. Check the current version in `databricks.yml`. Follow the instructions here.
- Set up the Python environment and run unit tests on your local machine.

      make sync && make unit-test

- Initialize the workspace. Create an external location in Databricks and update the `storage-root` parameter in the Makefile. This step will create the catalogs, schemas, service principal, and the required grants. For more details, see Overview of external locations. Then run:

      make init

- Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your `.databrickscfg` file. Your configuration should look similar to this:

      [dev]
      host = https://xxxx.cloud.databricks.com/
      token = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

      [staging]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

      [prod]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

- Deploy and execute on the dev workspace.

      make deploy env=dev

- Configure CI/CD automation with the service principal ID and secret. Configure GitHub Actions repository secrets (`DATABRICKS_HOST`, `DATABRICKS_PRINCIPAL_ID`, `DATABRICKS_SECRET`).
- (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.

