Skip to content

Remove deployment logic from build script#5235

Open
zschira wants to merge 23 commits into
mainfrom
rewire-deployment
Open

Remove deployment logic from build script#5235
zschira wants to merge 23 commits into
mainfrom
rewire-deployment

Conversation

@zschira

@zschira zschira commented May 13, 2026

Copy link
Copy Markdown
Member

Overview

Closes #5116.

What problem does this address?

This PR removes deployment logic from the nightly build script, so we can have isolate deployment and builds into their own workflows.

What did you change?

  • Removes deployment logic from build script
  • Rename build-deploy-pudl workflow to build-pudl
  • Sets up deployment workflow to run when build-pudl completes successfully

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update dev docs

Good to do, but out of scope

  • Migrate build logic to python / dagster
  • Send notifications with zulip instead of importing from slack
  • Separate EQR build / deployment logic

@zschira zschira moved this from New to In progress in Catalyst Megaproject May 13, 2026
@zschira zschira self-assigned this May 13, 2026
@zschira zschira marked this pull request as ready for review May 13, 2026 15:00
Comment thread .github/workflows/deploy-pudl.yml Outdated

workflow_run:
workflows: ["build-pudl"]
branches:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be cool to have this automatically kick off a staging deployment if a build completes from a non-main branch, although there wouldn't be an associated tag, so we'd have to rework some things to handle that.

@zschira zschira requested review from jdangerx and zaneselvans May 13, 2026 16:01
Comment thread builds/pudl_batch.sh
echo "Copying outputs to GCP bucket $PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp -r "$PUDL_OUTPUT" "$PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp -r "dbt/seeds/etl_full_row_counts.csv" "$PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp "$LOGFILE" "$PUDL_GCS_OUTPUT" &&

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to wait to copy the logfile, but copying outputs to GCS is now the last step, so we can add the logfile to this function

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on what we want to do with the deploy-pudl logs? I guess we can just go look at them directly in Batch 🤷

@jdangerx jdangerx left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good, and ripping out all that code is so satisfying :)

A few blocking concerns about the wiring of the workflow, plus a few changes we need to make to pudl.scripts.deploy itself before we can fully rip out the functionality in pudl_batch.sh.

One option we have, actually, is just to have both the pudl.scripts.deploy flow and the pudl_batch.sh flow running in parallel first - don't delete anything out of pudl_batch yet, and just force pudl.scripts.deploy to deploy to staging. Then we can see if the GHA wiring is working properly.

If we do that, it might be worth adding a staging version of the data.catalyst.coop service if we wanted to really be able to verify that things work. Something that only logged-in cats can access, and auto-scales down to 0 (like our user metrics internal app).

Finally, we should probably do something to zip up the new fercX_xbrl/* files (probably in pudl.scripts.deploy) before triggering the zenodo release so we don't run into the subdirectories/50 file limit - but I'm ok with punting that to another PR.

Comment thread .github/workflows/deploy-pudl.yml Outdated
--container-arg="${{ inputs.git_tag }}" \
--container-arg="${{ env.GIT_TAG }}" \
--container-arg="--staging" \
--container-arg="${{ inputs.staging }}" \

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: by my read, this would call pudl_deploy $GIT_TAG --staging with no actual staging value in the workflow_run case - does that actually deploy to s3://.../staging? or /nightly? or something else?

If we have no workflow_run trigger because we end up switching to workflow_dispatch from the pudl_batch.sh, then this is moot 😅

Comment thread builds/pudl_batch.sh
return 1
}

function deploy_data_viewer() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: looks like update_pudl_viewer function in deploy script isn't called, and uses the gcloud cli directly instead of triggering the workflow - we should update.

Comment thread builds/pudl_batch.sh
echo "Copying outputs to GCP bucket $PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp -r "$PUDL_OUTPUT" "$PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp -r "dbt/seeds/etl_full_row_counts.csv" "$PUDL_GCS_OUTPUT" &&
gcloud storage --quiet cp "$LOGFILE" "$PUDL_GCS_OUTPUT" &&

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on what we want to do with the deploy-pudl logs? I guess we can just go look at them directly in Batch 🤷

Comment thread pixi.lock

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. As main changed out from under us we probably need to do another re-lock. If you use pixi self-update && git checkout main -- pixi.lock && pixi lock that should get you the cleanest diff (remember to get your local main up to date!)

Comment thread .github/workflows/deploy-pudl.yml Outdated
branches:
- main
types:
- completed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't build-pudl complete right after submitting the batch job? In which case when we run this workflow via this trigger, the build artifacts won't be there yet... should we switch to workflow_dispatch from the pudl_batch.sh script like we do for zenodo?

@jdangerx

Copy link
Copy Markdown
Member

Zach & I talked synchronously, here are the decisions

  • The safest/most confident path is: add pudl-viewer staging env, then make pudl_batch.sh deploy to staging with the deploy-pudl workflow as well as deploying to production with the legacy workflow, then once the staging deploy is verified, we can cut over to only using deploy-pudl. We will do this.
  • we do need to trigger the deploy-pudl workflow via GH API request, not via workflow_run because of the timing / GCP delegation.
  • we should remember to use the workflow trigger for deploying pudl viewer in deploy-pudl workflow.
  • we are OK with just having deploy-pudl logs on Batch until further notice
  • We probably need to put in a Zulip notification for deploy-pudl status.
  • we'll do the XBRL parquet cleanup for Zenodo as a separate change to deploy-pudl after this works.

@zschira

zschira commented May 15, 2026

Copy link
Copy Markdown
Member Author

@jdangerx I've been thinking about creating a pudl-viewer staging environment and wanted to run my current plan by you before leaving.

Plan

  • Refactor pudl-viewer.tf to a reusable terraform module
  • Use module to create prod/staging resources where staging has a few changes from prod:
    • iap_enabled = true so only catalysters have access
    • min_instance_count = 0 to scale down when not in use
  • Modify build-deploy workflow with a staging path that will produce a docker image tagged as staging. Also needs to point viewer at staging parquet files in s3

Open questions

  • Do we reuse the prod cloud sql instance / auth0 instances?
    • I lean towards yes for simplicity, although it could be safer to have separate instances. Could mitigate danger by making the staging instance have read only access to cloud sql

@jdangerx

Copy link
Copy Markdown
Member

@jdangerx I've been thinking about creating a pudl-viewer staging environment and wanted to run my current
plan by you before leaving.

Great! I'm going to answer out of order. Mostly feels right 😄

Open questions

* Do we reuse the prod cloud sql instance / auth0 instances?
  
  * I lean towards yes for simplicity, although it could be safer to have separate instances. Could mitigate danger by making the staging instance have read only access to cloud sql

Re: Postgres:

Re: Auth0:

  • it's not too annoying to set up a new auth0 app
  • the simplest thing to do is just set PUDL_INTEGRATION_TEST=true in the env, but then we won't be able to test login/logout flow on staging, but also maybe that doesn't matter too much.

Plan

* Refactor `pudl-viewer.tf` to a reusable terraform module

I wonder if making pudl-viewer-staging.tf as a separate module would be easier honestly - if we're not doing CloudSQL or Auth0 then we don't need all that secret-wiring machinery, and also if we're doing a docker-compose setup then it will be kind of different anyways.

* Use module to create prod/staging resources where staging has a few changes from prod:
  
  * `iap_enabled = true` so only catalysters have access
  * `min_instance_count = 0` to scale down when not in use

These settings sound great to me!

* Modify `build-deploy` workflow with a staging path that will produce a docker image tagged as `staging`. Also needs to point viewer at staging parquet files in s3

Looks like we already tag the image with the git ref in build-deploy, and then refer to that same git ref when we're running deploy, so we probably don't need to add a special staging tag for build. I sort of think of staging as "special low stakes deploy target" but build probably doesn't need to know about it...

Pointing viewer at staging parquet files in S3 is a good catch - we should definitely make that S3 base path configurable at runtime, via env var or something.

@zaneselvans zaneselvans added dagster Issues related to our use of the Dagster orchestrator nightly-builds Anything having to do with nightly builds or continuous deployment. labels May 15, 2026
@zschira

zschira commented May 15, 2026

Copy link
Copy Markdown
Member Author

Looks like Cloud Run supports docker compose

Oh cool, I didn't realize they'd started to support compose! It looks like this is a new feature and there's not any terraform support yet, which is not ideal, but be ok to have a staging env that's not fully managed by terraform. I definitely think the simplest path would be to use compose with no cloud sql / auth0, and we can always add auth0 down the line if needed.

@zaneselvans

zaneselvans commented May 21, 2026

Copy link
Copy Markdown
Member

@zschira Given that @jdangerx did a pretty thorough review, what role would you like me to play on this PR now? Should I do my own separate review before you work on addressing the things DX brought up? Or wait for the first round of changes to land and then look at it?

@zschira zschira marked this pull request as draft May 28, 2026 15:43
@zaneselvans zaneselvans mentioned this pull request Jun 5, 2026
13 tasks
@zschira zschira marked this pull request as ready for review June 11, 2026 20:16
@zschira

zschira commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Ok @zaneselvans this is finally ready for another look. The big changes I've made since @jdangerx's review are:

  • Kick off deployment in build script so it happens after build is fully complete
  • Actually trigger eel-hole deployment (including staging deployment, which I need to create a PR for over there, but it is ready)
  • Zip FERC XBRL files to stay up-to-date with old build script
  • Added a branch build/deploy pipeline
    • Now if you manually trigger a build with the build-pudl action, it will run a "branch build" will trigger the deployment with staging=True so we can fully inspect outputs without overwriting our nightly or stable data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dagster Issues related to our use of the Dagster orchestrator nightly-builds Anything having to do with nightly builds or continuous deployment.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Remove deployment from build-deploy-pudl job

3 participants