andre-salvati/databricks-template

databricks-template — agentic development for Databricks + production-ready ETL

🚀 Overview

Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys — all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.

If this saves you time, a star helps others find it. Let's connect on LinkedIn.

🧪 Technologies

  • Databricks Free Edition (Serverless)
  • Databricks Runtime 18.0 LTS
  • Databricks Unity Catalog
  • Databricks Declarative Automation Bundles (formerly Asset Bundles)
  • Databricks CLI
  • Databricks Python SDK
  • Databricks DQX
  • Databricks AI Dev Kit
  • Databricks Dashboards
  • Claude Code
  • PySpark 4.1
  • Spark Declarative Pipelines (SDP)
  • Python 3.12+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md and a specs/ folder documenting the project's conventions.
  • structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
  • package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod), and use GitHub Actions to automate the CI/CD pipeline.
  • utilize Databricks Lakeflow Jobs to execute a DAG, so you don't need Airflow to manage your DAGs here. Generate job definitions with environment-specific conditions using the Databricks SDK.
  • isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
  • separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
  • utilize job tags to track issues, costs, and ownership.
  • use the medallion architecture to organize your data.
  • use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
  • apply Delta liquid clustering and incremental load to build more efficient pipelines.
  • run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
  • run integration tests by setting the input data and validating the output data.
  • run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data.
  • use Databricks AI/BI Dashboards to visualize the gold layer.
  • utilize the coverage package to generate test coverage reports.
  • use structured logging, which gives you full observability during incidents without a code change.
  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
  • utilize service principals to run production code.
  • utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the scripts folder for examples.
  • utilize Databricks Unity Catalog to manage permissions and get data lineage.
  • enforce production guardrails out of the box — identity-locked CI deploys, a health-check task, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting.
  • track project cloud spend with cost reports from AWS (Cost Explorer) and Databricks (system.billing.usage).
  • utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
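
To make two of the features above more concrete (per-developer "dev" catalogs, and the split between deploy-time and runtime config), here is a minimal sketch that uses only the Python standard library. It is not the template's actual code: the `DEPLOY_ENV` variable, the `--run-date` and `--full-refresh` job parameters, and the `dev_<user>` catalog naming scheme are all assumptions made for illustration.

```python
import argparse
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class JobConfig:
    env: str            # fixed at deploy time (hypothetical DEPLOY_ENV variable)
    catalog: str        # derived from env; isolated per developer in dev
    run_date: str       # runtime job parameter, overridable per run
    full_refresh: bool  # runtime job parameter, overridable per run


def resolve_catalog(env: str, user: str) -> str:
    """Give each developer a private catalog in dev; share one elsewhere."""
    if env == "dev":
        # Hypothetical naming scheme: dev_<sanitized user name>
        safe_user = "".join(c if c.isalnum() else "_" for c in user.lower())
        return f"dev_{safe_user}"
    return env


def load_config(argv: list[str], environ: Mapping[str, str] = os.environ) -> JobConfig:
    # Deploy-time config: read from the environment set by CI or the bundle.
    env = environ.get("DEPLOY_ENV", "dev")
    user = environ.get("USER", "unknown")

    # Runtime config: parsed from the parameters passed to each job run.
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-date", default="1970-01-01")
    parser.add_argument("--full-refresh", action="store_true")
    args = parser.parse_args(argv)

    return JobConfig(env, resolve_catalog(env, user), args.run_date, args.full_refresh)


cfg = load_config(
    ["--run-date", "2025-01-31"],
    {"DEPLOY_ENV": "dev", "USER": "Ada.Lovelace"},
)
print(cfg.catalog)  # dev_ada_lovelace
```

Because the runtime values arrive as ordinary command-line arguments, a run can override them without rebuilding or redeploying the wheel, while the environment-derived values stay fixed for a given deployment.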

📐 Specs

Deep technical detail lives in specs/ (the README stays a landing page):

  • Architecture — wheel/CLI surface, jobs DAG, job generation, CI/CD, job-level params, deploy-time env vars, logging, production guardrails, folder structure.
  • Data model — catalog/schema isolation, medallion data flow (diagram), table schemas, price-freeze semantics, liquid clustering, DQX/quarantine, lineage.
  • Test plan — unit, integration, and load tests.
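
The quarantine idea listed under the data model can be illustrated without Spark. The sketch below is plain Python and deliberately does not use the DQX API; the rule names, the `order_id` and `amount` columns, and the threshold are hypothetical. It only shows the general pattern of routing rows that fail a check into a separate collection together with the reasons for the failure.

```python
from typing import Callable

Row = dict[str, object]
Rule = Callable[[Row], bool]  # returns True when the row passes the check


def split_valid_and_quarantine(
    rows: list[Row], rules: dict[str, Rule], unique_key: str
) -> tuple[list[Row], list[Row]]:
    """Route rows that fail any rule (or repeat a key) to a quarantine list."""
    valid: list[Row] = []
    quarantine: list[Row] = []
    seen_keys: set[object] = set()

    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        key = row.get(unique_key)
        if key is not None and key in seen_keys:
            failed.append(f"{unique_key}_is_unique")
        if failed:
            # Keep the failure reasons next to the data for later triage.
            quarantine.append({**row, "_failed_rules": failed})
        else:
            seen_keys.add(key)
            valid.append(row)
    return valid, quarantine


rules: dict[str, Rule] = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "amount_within_threshold": lambda r: isinstance(r.get("amount"), (int, float))
    and 0 <= r["amount"] <= 10_000,
}

rows = [
    {"order_id": 1, "amount": 50.0},
    {"order_id": None, "amount": 20.0},   # fails the null check
    {"order_id": 1, "amount": 75.0},      # duplicate key
    {"order_id": 2, "amount": 99_999.0},  # above the threshold
]

valid, quarantine = split_valid_and_quarantine(rows, rules, unique_key="order_id")
print(len(valid), len(quarantine))  # 1 3
```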

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

Dashboard



Instructions

  1. (Optional) Install Databricks AI Dev Kit and Claude Code.

  2. Create a Databricks Free Edition workspace.

  3. Install and configure the Databricks CLI on your local machine. Check the current version in databricks.yml. Follow the instructions here.

  4. Set up the Python environment and run unit tests on your local machine.

     make sync && make unit-test
    
  5. Initialize the workspace. First create an external location in Databricks and update the storage-root parameter in the Makefile (for more details, see Overview of external locations). Then run the command below, which creates the catalogs, schemas, service principal, and required grants:

     make init
    
  6. Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

     [dev]
     host          = https://xxxx.cloud.databricks.com/
     token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                     
     [staging]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
     [prod]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
  7. Deploy and execute on the dev workspace.

     make deploy env=dev
    
  8. Configure CI/CD automation by adding the service principal's credentials as GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).

  9. (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
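
Before deploying, it can help to sanity-check the profiles from step 6. The following sketch parses `.databrickscfg`-style text with the standard library's `configparser` and reports which kind of credentials each profile declares. The helper name `describe_profiles` and the labels it returns are hypothetical, and the sample text uses placeholder values rather than real credentials.

```python
import configparser


def describe_profiles(cfg_text: str) -> dict[str, str]:
    """Map each profile name to the kind of credentials it declares."""
    # Disable interpolation so '%' characters in secrets are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)

    result: dict[str, str] = {}
    for name in parser.sections():
        profile = parser[name]
        if "host" not in profile:
            result[name] = "invalid: missing host"
        elif "client_id" in profile and "client_secret" in profile:
            result[name] = "service principal (client_id + client_secret)"
        elif "token" in profile:
            result[name] = "personal access token"
        else:
            result[name] = "invalid: no credentials"
    return result


sample = """
[dev]
host  = https://xxxx.cloud.databricks.com/
token = bbbb

[prod]
host          = https://xxxx.cloud.databricks.com/
client_id     = yyyy
client_secret = aaaa
"""

for name, kind in describe_profiles(sample).items():
    print(f"{name}: {kind}")
```

To check a real file, pass its contents instead of the sample, for example `describe_profiles(Path.home().joinpath(".databrickscfg").read_text())` with `from pathlib import Path`.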
