Improve FERC EQR sensor, deployment, testability, & notifications by zaneselvans · Pull Request #5273 · catalyst-cooperative/pudl

zaneselvans · 2026-05-25T09:20:39Z

Overview

Okay, what the hell happened in here? It's +2,500 lines (?!) but ~1000 of them are tests, and ~500 are a simple utility script that checks permissions on a UPath. The core of the changes can be found in:

src/pudl/dagster/resources.py
src/pudl/dagster/sensors.py
src/pudl/dagster/assets/deploy/ferceqr.py
.github/workflows/build-deploy-ferceqr.yml
builds/ferceqr_batch.sh

For the May, 2026 update, after getting a clean archive, I tried launching a new FERC EQR run, and found that there was some stuff that needed to be aligned with the new Dagster setup:

The code location being referenced in the builds/ferceqr_batch.sh script needed to be updated from pudl.etl to just plain pudl
The FERC EQR sensor errored out if it runs and zero partitions have ever been materialized, because it gets a scalar None rather than a dictionary of values back. This doesn't cause the sensor to fail, but the error looks concerning / distracting.
The logfile output path wasn't actually including the BUILD_ID so it kept overwriting a file called ferceqr_logs rather than writing to sub-paths beneath that "directory" like path.
The core_ferceqr__transactions table was, sadly, choking in the pandera schema checks (though not 100% of the time) so I added it to the "high memory" asset list. Polars subsequently fixed their memory leak bug... but it didn't magically fix this problem.

These were all minor tweaks, but in the process of addressing them, I realized how hard it was to test anything in the EQR build + deployment end-to-end without processing everything and (potentially) writing it out to the production distribution bucket.

I also wanted to learn more about how Dagster's sensors and partitioned assets work, and was interested in switching the notifications over from Slack to Zulip.

I did a lot but not all of this on my own time.

Changes

1. Backfill-aware run-status sensors (`sensors.py`)

The old ferceqr_deployment_sensor (a single @dg.sensor polling partition statuses) is replaced with two @dg.run_status_sensor sensors:

ferceqr_success_sensor — triggers on SUCCESS runs of the ferceqr job
ferceqr_failure_sensor — triggers on FAILURE runs of the ferceqr job

Both sensors use the _backfill_sensor_skip_reason_or_runs() function, which ensures exactly one notification and RunRequest is generated per backfill, by waiting until all sibling runs in a backfill have reached a terminal state before producing a RunRequest.

Non-backfill runs are silently skipped (the sensors only react to backfill-tagged runs).
The sensors are aware of the specific subset of partitions that have been requested in the backfill, making it possible to do limited test runs.
While any run in the backfill is still non-terminal, the sensor returns a SkipReason.
When all runs finish: if any failed, only the failure sensor produces a RunRequest. If none failed, then the success sensor produces a RunRequest.
The RunRequest carries partition list and source-run-ID tags so the deployment asset can build rich status tables.

2. Configurable deployment targets (`resources.py`)

Deployment destinations are now configurable resources rather than hardcoded in the deployment assets:

FercEqrDeploymentTargetConfig (dg.Config): A single destination with a path (local, file://, s3://, or gs://), optional storage_options, and an optional append_build_id flag for writing to our test bucket. Includes a Pydantic field_validator that validates local paths exist and are writable, and remote URLs use supported schemes.
FercEqrDeploymentResource (ConfigurableResource): A collection of targets, loadable from a YAML file (via PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH) or inline config. The resolved_targets() method returns UPath objects ready for copying.

Two packaged YAML configs are provided:

ferceqr_deployment_targets.yml — production targets (GCS + S3)
ferceqr_test_deployment_targets.yml — test target (single GCS testing bucket with append_build_id: true)

The deployment mode can be selected in CI via a deployment_mode workflow input (production / test / none).

Generalizing the deployment target(s) with UPath and making it configurable made it easy to do local test deployments to a filesystem path, or cloud-based test deployments to gs://test.catalyst.coop/BUILD_ID/

3. Richer, centralized notifications (`deploy/ferceqr.py`)

The new build_ferceqr_notification() function centralizes notification formatting, and prepares markdown output to feed into the recently implemented ZenodoNotificationResource. Notifications now include:

Asset-by-partition status table: a Markdown table showing each asset × each partition with status emoji.
Backfill duration: computed from earliest start to latest end across all source runs.
Deployment target paths: resolved from the deployment resource.
Build ID and source run ID: for traceability.
Logfile pointers: links to GCS and Cloud Console log viewers.

The deployment_status_asset decorator was updated to pass a full AssetExecutionContext (instead of individual resources), which the handler uses to access run tags, instance APIs for step statuses, and resources.

4. Preflight path-permission checker (`check_path_permissions`)

Added a new CLI script src/pudl/scripts/check_path_permissions.py that validates read, write, and delete access for local and remote paths (via UPath). It supports:

--read, --write flags for selecting checks
--json structured output for machine consumption
--anon for anonymous filesystem access
--check-ferceqr-deployment-paths a shortcut to validate all configured deployment targets, based on the current configuration / environment.

This script is used in ferceqr_batch.sh as a preflight step to fail fast before the ETL starts... rather than at the end after it's been running for 6 hours.

5. FERC EQR pipeline integration test (`tests/pipeline/ferceqr_test.py`)

A session-scoped integration test that:

Starts dagster-daemon run in the background
Submits a dagster job backfill for two test partitions
Polls for FERCEQR_SUCCESS / FERCEQR_FAILURE sentinels
Asserts deployed Parquet outputs match expected partitions, columns, and row counts
Validates the ferceqr_parquet_datapackage.json is present and correct

Uses isolated temporary directories for DAGSTER_HOME, PUDL_OUTPUT, and deployment targets, so no cloud credentials are needed. Only requires a local FERC EQR archive at PUDL_FERCEQR_ARCHIVE_PATH.

6. Unit tests

tests/unit/dagster/ferceqr_deployment_test.py: Tests for both success and failure sensors covering: skip non-backfill runs, skip while backfill in progress, skip when counterpart sensor should handle, aggregated RunRequest production, and success sentinel behavior.
tests/unit/dagster/resources_test.py: Expanded with tests for FercEqrDeploymentTargetConfig validation (accepts valid paths, rejects invalid URIs, missing directories, non-writable directories), FercEqrDeploymentResource loading from YAML, build-id appending, and ZulipNotificationResource error handling.
tests/unit/scripts/check_path_permissions_test.py: Tests for read/write checks and the CLI interface.

7. EQR build script improvements (`ferceqr_batch.sh`)

Removed PostgreSQL. The FERC EQR job actually runs just fine with SQLite for Dagster storage.
Uses dagster-daemon run instead of dagster dev.
Uses trap cleanup_on_exit EXIT to ensure log uploads and cleanup happen regardless of exit path.
Added FERCEQR_START_PARTITION / FERCEQR_END_PARTITION support for partial backfills (e.g. via workflow_dispatch in GitHub Actions)
Added preflight validation: check_path_permissions for archive and deployment paths, plus an inline call to DagsterInstance.get() to verify Dagster storage is initialized (otherwise dagster daemon and the backfill fight over who gets to initialize it)
I accidentally merged similar updates into pudl_batch.sh (trap-based cleanup, consolidated log handling via exec > >(tee ...), and a separate dagster-pudl.yaml config) as well as disabling PostgreSQL -- I imagine these changes will conflict with whatever @zschira has going in the build + deploy splitting PR. I can split these changes out again easily but thought I'd leave them in to get feedback for the moment.

8. GHA workflow (`build-deploy-ferceqr.yml`)

Made the workflow_dispatch configurable so we can use it for partial runs & test deployments.
Added start_partition / end_partition workflow inputs for targeted backfills.
Added deployment_mode input (production / test / none).
Removed hardcoded GCS_OUTPUT_BUCKET / S3_OUTPUT_BUCKET env vars — deployment paths come from the YAML config now.
Added PUDL_FERCEQR_ARCHIVE_PATH and PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH as container env vars.

9. New / updated classes in `resources.py`:

FercEqrArchiveResource — re-named from FercEqrDataConfig since it has a different structure / job than the DataConfig classes (it's more like the Datastore -- where is the data, not what data to process)
FercEqrDeploymentTargetConfig / FercEqrDeploymentResource — deployment destinations

10. Job definition changes (`jobs.py`)

Named jobs extracted as module-level variables (ferceqr_job, ferceqr_deployment_job, etc.) rather than inline in default_jobs.
ferceqr_deployment_job added with deploy_ferceqr and handle_ferceqr_failure asset selection.
PUDL jobs now also exclude the ferceqr_deployment group, preventing accidental materialization during full ETL runs.

Files Changed

File	Nature of Change
`src/pudl/dagster/sensors.py`	Replaced single polling sensor with two coordinated run-status sensors
`src/pudl/dagster/assets/deploy/ferceqr.py`	Richer notifications, configurable deployment, step-status tables
`src/pudl/dagster/resources.py`	Added deployment, notification, and archive resources
`src/pudl/dagster/jobs.py`	Extracted named jobs, added `ferceqr_deployment_job`
`builds/ferceqr_batch.sh`	Simplified, added preflight checks, trap-based cleanup, partition range support
`builds/pudl_batch.sh`	Aligned patterns from `ferceqr_batch.sh` (trap cleanup, config selection)
`.github/workflows/build-deploy-ferceqr.yml`	Deployment mode, partition range, resource adjustments
`src/pudl/scripts/check_path_permissions.py`	New — generic path permission checker
`src/pudl/package_data/settings/ferceqr_deployment_targets.yml`	New — production deployment targets
`src/pudl/package_data/settings/ferceqr_test_deployment_targets.yml`	New — test deployment targets
`builds/dagster-ferceqr.yaml`	New — FERC EQR-specific Dagster config with `QueuedRunCoordinator`
`tests/pipeline/ferceqr_test.py`	New — end-to-end pipeline integration test
`tests/unit/dagster/ferceqr_deployment_test.py`	New — sensor and deployment handler unit tests
`tests/unit/scripts/check_path_permissions_test.py`	New — permission checker unit tests
`tests/unit/dagster/resources_test.py`	Expanded with deployment-target and Zulip resource tests
`pyproject.toml`	Added `check_path_permissions` script entry point; coverage threshold adjustment

Testing

Unit tests

pixi run pytest --no-cov \
    tests/unit/dagster/ferceqr_deployment_test.py \
    tests/unit/dagster/resources_test.py \
    tests/unit/scripts/check_path_permissions_test.py

FERC EQR pipeline test

Note: this requires a local FERC EQR archive.

pixi run pytest --no-cov -s tests/pipeline/ferceqr_test.py -v

I plan to split out the PUDL pipeline tests in a future PR (i.e. all the tests that depend on having actually run the PUDL fast ETL) so that we can have more clearly delineated unit & integration tests (neither of which require the ETL) then the pipeline tests (that actually run Dagster and the fast ETL) and the data validation tests (which depend on the pipeline tests having been run).

Addendum from Production Deployment

6hr Timeout & Truncated Logs

Yesterday I tried to do a full deployment and it hit the 6-hour timeout in the middle of uploading to test.catalyst.coop which resulted in invalid / incomplete outputs being deployed. Bad!
To avoid this or another kind of deployment failure, I added a staging step, where the big, slow upload sends data to a path adjacent to the ultimate destination with a random string in the name.
Only once the data has all been successfully uploaded does it get copied to the final destination, which is extremely fast because it's all server-side. This should make it much less likely that the deployment gets truncated.
Debugging this was more tedious than I expected because the killall dagster that gets run in the build container as soon as the FERCEQR_FAILURE sentinel file shows up is very aggressive... which meant that the logging output that would explain why the failure happened wasn't getting flushed. And it took a while for me to figure out the logging issue before finding and fixing the dumb "removing a file that's already gone" issue from the logging output.
This also highlighted the fact that while we trigger the deployment job (success or failure) based on the outcome of the ferceqr data processing job, we didn't have any kind of handling for the deployment job itself so I added some fallback clean up & notification so that we know wtf happened if the deployment fails.

OOM? Heartbeat timeout? Or something else?

Having fixed that, I doubled the VM resources (and concurrency) and tried to run another full ETL, but... ti failed with something that looked like an OOM error! Why?!
Dropping the concurrency back to 10 (on a 16-core VM) didn't make the ETL run any slower (it's ~3 hours) but did seem to fix the OOM issue.
However, I think there's still some kind of problem because only on the VM the dagster.code_server seems to keep dying and getting restarted over and over again for the entire build. This does not happen locally.
The previous failure was from being unable to connect to the gRPC server because it was apparently not running, which seemed maybe related to the code server either dying from OOM or taking so long to import (with 20 other things running int he background) that it hit the timeout.
But anyway, even with the low concurrency, the infinite karmic cycle of process death and respawning continues and seems broken. So I will make an issue.

Cloud Permissions

And then today I tried to run a production deployment and ran into cloud permissions issues, which (thankfully!) were immediately flagged by the check_path_permissions script that runs before the ETL (nevermind the fact that the permissions issues were also. uh. caused by that script, not effectively passing the storage_options all the way through). So now that's fixed.

Followup Issues

zaneselvans · 2026-05-26T18:10:04Z

 high_memory_assets = [
    "out_vcerare__hourly_available_capacity_factor",
    "core_epacems__hourly_emissions",
+    "core_ferceqr__transactions",


Avoid this chonky boi for the moment so we don't run out of memory during validation.

Replace the FERC EQR deployment control path with run-status sensors and introduce a dedicated deployment job for success/failure follow-up assets. - Switch from partition-status polling to SUCCESS/FAILURE run-status sensors for the `ferceqr` job, and pass source run metadata via run tags. - Keep FERC EQR deployment assets always present in Dagster definitions, with runtime gating via `FERCEQR_BUILD` in the deployment wrapper. - Migrate deployment notifications from Slack to Zulip, including reusable Markdown message construction and step-status summaries. - Add Dagster resources for Zulip notifications and FERC EQR bucket deployment settings, and wire them into default resources. - Refactor deploy/failure handlers to use resource-injected config and write SUCCESS/FAILURE sentinel files for batch orchestration. - Add focused unit tests for sensors, deployment handlers, and new resources.

Replace the direct-to-target file copying in `deploy_ferceqr` with a staging-then-rename pattern to prevent partial deployments on timeout or crash. All Parquet files and the datapackage JSON are first uploaded to a `._staging_{BUILD_ID}_{random}` subdirectory beneath each target, then atomically promoted to the final path via `UPath.rename()`. On GCS and S3 the rename is a server-side operation (no re-upload), so the promotion window is near-instant regardless of data volume. The datapackage JSON is now written to `pudl_output` alongside the Parquet data before deployment, so it can be copied like any other file and remains as a permanent record of the build. In `ferceqr_batch.sh`, capture `inotifywait`'s exit code to distinguish timeouts (exit 2) from other failures. On timeout, log the `PUDL_OUTPUT` directory contents for forensics and write `FERCEQR_FAILURE` so the build is correctly reported as failed. The timeout value is now configurable via the `FERCEQR_BUILD_TIMEOUT_HOURS` environment variable (default 8 hours). Add a `ferceqr_cleanup_staging.py` script that sweeps orphaned `._staging_*` directories from deployment targets (via UPath for both local paths and cloud URIs) if the build is interrupted mid-upload. Called from `cleanup_on_exit` in `ferceqr_batch.sh`. Add unit tests for the new staging helpers: `_staging_path`, `_deploy_to_staging`, `_promote_staging`, `_remove_staging`, and the cleanup-on-failure path.

zaneselvans · 2026-06-13T20:49:22Z

                    ) as filing_archive,
                    tempfile.TemporaryDirectory() as tmp_dir,
                ):
-                    logger.info(f"Extracting CSVs from {filing}.")


This was generating 150,000 lines of logs, but very little useful information.

Fix FERC EQR job launch & sensor issues

df5bdd9

zaneselvans self-assigned this May 25, 2026

zaneselvans added ferceqr Data from FERC's Electric Quarterly Review (EQR) dagster Issues related to our use of the Dagster orchestrator nightly-builds Anything having to do with nightly builds or continuous deployment. labels May 25, 2026

github-project-automation Bot added this to Catalyst Megaproject May 25, 2026

github-project-automation Bot moved this to New in Catalyst Megaproject May 25, 2026

zaneselvans moved this from New to In review in Catalyst Megaproject May 25, 2026

zaneselvans requested a review from pudlbot May 25, 2026 09:20

zaneselvans changed the title ~~hotfix: FERC EQR job launch & sensor issues~~ Fix FERC EQR job launch & sensor issues May 25, 2026

zaneselvans requested review from zschira and removed request for pudlbot May 25, 2026 09:34

zaneselvans added 4 commits May 25, 2026 03:45

Fix dagster location

513d30a

background it

16c3041

Omit --location because there is only one

0c571b9

have to use dagster dev not dg dev or it no work

3b91384

zaneselvans changed the title ~~Fix FERC EQR job launch & sensor issues~~ Fix FERC EQR batch job May 26, 2026

zaneselvans added 2 commits May 26, 2026 00:38

Merge branch 'main' into ferceqr-sensor-race

eb6d01a

Bigger EQR VM; fix log save path; transactions high memory table

bab8382

zaneselvans commented May 26, 2026

View reviewed changes

zaneselvans added 2 commits May 26, 2026 17:10

Merge branch 'main' into ferceqr-sensor-race

e3cc5b5

Merge branch 'main' into ferceqr-sensor-race

d1681c8

zaneselvans changed the title ~~Fix FERC EQR batch job~~ Upeate FERC EQR batch job for new Dagster setup May 27, 2026

zaneselvans added 3 commits May 27, 2026 02:48

Shush some logging noise

0ddf52c

Use ZULIP_API_KEY not SLACK_TOKEN, raise error on notification failure

905266e

zaneselvans marked this pull request as draft May 27, 2026 12:27

zaneselvans changed the title ~~Upeate FERC EQR batch job for new Dagster setup~~ Update FERC EQR batch job for new Dagster setup May 27, 2026

zaneselvans moved this from In review to In progress in Catalyst Megaproject May 27, 2026

Merge branch 'main' into ferceqr-sensor-race

5bf2ebf

zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 9, 2026

zaneselvans added 16 commits June 9, 2026 14:12

Consolidate and parametrize Dagster unit tests

2fbcb20

Temporarily bump VM resources for fast test run

68055c6

Re-order notifications in build-deploy-ferceqr workflow.

7b061ab

Merge branch 'main' into ferceqr-sensor-race

f112164

Bump pixi Dockerfile version

337e52e

Log traceback when _promote_staging() fails and send Zulip notification.

4e190c6

Fix staging cleanup and shush EQR extract logging

351fee9

Suppress DuckDB progress bars; try to fix cloud path cleanup issues

55f5ac8

Silence DuckDB progress bars in other places.

634856d

Guard against trying to remove paths that don't exist.

d0c4653

Drop concurrency to 10 runs; fix wayward logger name

8f81e9c

Add requester_pays=True to gcs deployment target args...

fb84fea

Actually pass storage_options all the way through function calls

bf4be5e

Add preflight-failure notification + target in launch notification

b952f43

Merge branch 'main' into ferceqr-sensor-race

b542162

zaneselvans mentioned this pull request Jun 12, 2026

Consolidate GCloud buckets and VMs in one region #5315

Open

zaneselvans added 2 commits June 11, 2026 23:05

Reduce devcontainer size by not installing coding harnesses.

3a21c32

Track and report FERC EQR deploy duration alongside backfill duration

6c21f62

zaneselvans linked an issue Jun 12, 2026 that may be closed by this pull request

Report both ETL and deployment durations for FERC EQR #5316

Open

zaneselvans mentioned this pull request Jun 12, 2026

Make FERC EQR deployment much faster #5317

Open

zaneselvans added testing Writing tests, creating test data, automating testing, etc. cloud Stuff that has to do with adapting PUDL to work in cloud computing context. labels Jun 12, 2026

zaneselvans added 4 commits June 12, 2026 15:22

Merge branch 'main' into ferceqr-sensor-race

0613f09

Add some missing datasets to README.rst

6d98dfd

Update README data sources & sustainer tiers

33163fc

Merge branch 'main' into ferceqr-sensor-race

67a45df

zaneselvans commented Jun 13, 2026

View reviewed changes

Merge branch 'main' into ferceqr-sensor-race

4081ebb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve FERC EQR sensor, deployment, testability, & notifications#5273

Improve FERC EQR sensor, deployment, testability, & notifications#5273
zaneselvans wants to merge 80 commits into
mainfrom
ferceqr-sensor-race

zaneselvans commented May 25, 2026 •

edited

Loading

Uh oh!

zaneselvans May 26, 2026

Uh oh!

zaneselvans Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

zaneselvans commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Changes

1. Backfill-aware run-status sensors (sensors.py)

2. Configurable deployment targets (resources.py)

3. Richer, centralized notifications (deploy/ferceqr.py)

4. Preflight path-permission checker (check_path_permissions)

5. FERC EQR pipeline integration test (tests/pipeline/ferceqr_test.py)

6. Unit tests

7. EQR build script improvements (ferceqr_batch.sh)

8. GHA workflow (build-deploy-ferceqr.yml)

9. New / updated classes in resources.py:

10. Job definition changes (jobs.py)

Files Changed

Testing

Unit tests

FERC EQR pipeline test

Addendum from Production Deployment

6hr Timeout & Truncated Logs

OOM? Heartbeat timeout? Or something else?

Cloud Permissions

Followup Issues

Uh oh!

zaneselvans May 26, 2026

Choose a reason for hiding this comment

Uh oh!

zaneselvans Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zaneselvans commented May 25, 2026 •

edited

Loading

1. Backfill-aware run-status sensors (`sensors.py`)

2. Configurable deployment targets (`resources.py`)

3. Richer, centralized notifications (`deploy/ferceqr.py`)

4. Preflight path-permission checker (`check_path_permissions`)

5. FERC EQR pipeline integration test (`tests/pipeline/ferceqr_test.py`)

7. EQR build script improvements (`ferceqr_batch.sh`)

8. GHA workflow (`build-deploy-ferceqr.yml`)

9. New / updated classes in `resources.py`:

10. Job definition changes (`jobs.py`)