Skip to content

Improve FERC EQR sensor, deployment, testability, & notifications#5273

Open
zaneselvans wants to merge 80 commits into
mainfrom
ferceqr-sensor-race
Open

Improve FERC EQR sensor, deployment, testability, & notifications#5273
zaneselvans wants to merge 80 commits into
mainfrom
ferceqr-sensor-race

Conversation

@zaneselvans

@zaneselvans zaneselvans commented May 25, 2026

Copy link
Copy Markdown
Member

Overview

Okay, what the hell happened in here? It's +2,500 lines (?!) but ~1000 of them are tests, and ~500 are a simple utility script that checks permissions on a UPath. The core of the changes can be found in:

  • src/pudl/dagster/resources.py
  • src/pudl/dagster/sensors.py
  • src/pudl/dagster/assets/deploy/ferceqr.py
  • .github/workflows/build-deploy-ferceqr.yml
  • builds/ferceqr_batch.sh

For the May, 2026 update, after getting a clean archive, I tried launching a new FERC EQR run, and found that there was some stuff that needed to be aligned with the new Dagster setup:

  • The code location being referenced in the builds/ferceqr_batch.sh script needed to be updated from pudl.etl to just plain pudl
  • The FERC EQR sensor errored out if it runs and zero partitions have ever been materialized, because it gets a scalar None rather than a dictionary of values back. This doesn't cause the sensor to fail, but the error looks concerning / distracting.
  • The logfile output path wasn't actually including the BUILD_ID so it kept overwriting a file called ferceqr_logs rather than writing to sub-paths beneath that "directory" like path.
  • The core_ferceqr__transactions table was, sadly, choking in the pandera schema checks (though not 100% of the time) so I added it to the "high memory" asset list. Polars subsequently fixed their memory leak bug... but it didn't magically fix this problem.

These were all minor tweaks, but in the process of addressing them, I realized how hard it was to test anything in the EQR build + deployment end-to-end without processing everything and (potentially) writing it out to the production distribution bucket.

I also wanted to learn more about how Dagster's sensors and partitioned assets work, and was interested in switching the notifications over from Slack to Zulip.

I did a lot but not all of this on my own time.


Changes

1. Backfill-aware run-status sensors (sensors.py)

The old ferceqr_deployment_sensor (a single @dg.sensor polling partition statuses) is replaced with two @dg.run_status_sensor sensors:

  • ferceqr_success_sensor — triggers on SUCCESS runs of the ferceqr job
  • ferceqr_failure_sensor — triggers on FAILURE runs of the ferceqr job

Both sensors use the _backfill_sensor_skip_reason_or_runs() function, which ensures exactly one notification and RunRequest is generated per backfill, by waiting until all sibling runs in a backfill have reached a terminal state before producing a RunRequest.

  • Non-backfill runs are silently skipped (the sensors only react to backfill-tagged runs).
  • The sensors are aware of the specific subset of partitions that have been requested in the backfill, making it possible to do limited test runs.
  • While any run in the backfill is still non-terminal, the sensor returns a SkipReason.
  • When all runs finish: if any failed, only the failure sensor produces a RunRequest. If none failed, then the success sensor produces a RunRequest.
  • The RunRequest carries partition list and source-run-ID tags so the deployment asset can build rich status tables.

2. Configurable deployment targets (resources.py)

Deployment destinations are now configurable resources rather than hardcoded in the deployment assets:

  • FercEqrDeploymentTargetConfig (dg.Config): A single destination with a path (local, file://, s3://, or gs://), optional storage_options, and an optional append_build_id flag for writing to our test bucket. Includes a Pydantic field_validator that validates local paths exist and are writable, and remote URLs use supported schemes.
  • FercEqrDeploymentResource (ConfigurableResource): A collection of targets, loadable from a YAML file (via PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH) or inline config. The resolved_targets() method returns UPath objects ready for copying.

Two packaged YAML configs are provided:

  • ferceqr_deployment_targets.yml — production targets (GCS + S3)
  • ferceqr_test_deployment_targets.yml — test target (single GCS testing bucket with append_build_id: true)

The deployment mode can be selected in CI via a deployment_mode workflow input (production / test / none).

Generalizing the deployment target(s) with UPath and making it configurable made it easy to do local test deployments to a filesystem path, or cloud-based test deployments to gs://test.catalyst.coop/BUILD_ID/

3. Richer, centralized notifications (deploy/ferceqr.py)

The new build_ferceqr_notification() function centralizes notification formatting, and prepares markdown output to feed into the recently implemented ZenodoNotificationResource. Notifications now include:

  • Asset-by-partition status table: a Markdown table showing each asset × each partition with status emoji.
  • Backfill duration: computed from earliest start to latest end across all source runs.
  • Deployment target paths: resolved from the deployment resource.
  • Build ID and source run ID: for traceability.
  • Logfile pointers: links to GCS and Cloud Console log viewers.

The deployment_status_asset decorator was updated to pass a full AssetExecutionContext (instead of individual resources), which the handler uses to access run tags, instance APIs for step statuses, and resources.

4. Preflight path-permission checker (check_path_permissions)

Added a new CLI script src/pudl/scripts/check_path_permissions.py that validates read, write, and delete access for local and remote paths (via UPath). It supports:

  • --read, --write flags for selecting checks
  • --json structured output for machine consumption
  • --anon for anonymous filesystem access
  • --check-ferceqr-deployment-paths a shortcut to validate all configured deployment targets, based on the current configuration / environment.

This script is used in ferceqr_batch.sh as a preflight step to fail fast before the ETL starts... rather than at the end after it's been running for 6 hours.

5. FERC EQR pipeline integration test (tests/pipeline/ferceqr_test.py)

A session-scoped integration test that:

  1. Starts dagster-daemon run in the background
  2. Submits a dagster job backfill for two test partitions
  3. Polls for FERCEQR_SUCCESS / FERCEQR_FAILURE sentinels
  4. Asserts deployed Parquet outputs match expected partitions, columns, and row counts
  5. Validates the ferceqr_parquet_datapackage.json is present and correct

Uses isolated temporary directories for DAGSTER_HOME, PUDL_OUTPUT, and deployment targets, so no cloud credentials are needed. Only requires a local FERC EQR archive at PUDL_FERCEQR_ARCHIVE_PATH.

6. Unit tests

  • tests/unit/dagster/ferceqr_deployment_test.py: Tests for both success and failure sensors covering: skip non-backfill runs, skip while backfill in progress, skip when counterpart sensor should handle, aggregated RunRequest production, and success sentinel behavior.
  • tests/unit/dagster/resources_test.py: Expanded with tests for FercEqrDeploymentTargetConfig validation (accepts valid paths, rejects invalid URIs, missing directories, non-writable directories), FercEqrDeploymentResource loading from YAML, build-id appending, and ZulipNotificationResource error handling.
  • tests/unit/scripts/check_path_permissions_test.py: Tests for read/write checks and the CLI interface.

7. EQR build script improvements (ferceqr_batch.sh)

  • Removed PostgreSQL. The FERC EQR job actually runs just fine with SQLite for Dagster storage.
  • Uses dagster-daemon run instead of dagster dev.
  • Uses trap cleanup_on_exit EXIT to ensure log uploads and cleanup happen regardless of exit path.
  • Added FERCEQR_START_PARTITION / FERCEQR_END_PARTITION support for partial backfills (e.g. via workflow_dispatch in GitHub Actions)
  • Added preflight validation: check_path_permissions for archive and deployment paths, plus an inline call to DagsterInstance.get() to verify Dagster storage is initialized (otherwise dagster daemon and the backfill fight over who gets to initialize it)
  • I accidentally merged similar updates into pudl_batch.sh (trap-based cleanup, consolidated log handling via exec > >(tee ...), and a separate dagster-pudl.yaml config) as well as disabling PostgreSQL -- I imagine these changes will conflict with whatever @zschira has going in the build + deploy splitting PR. I can split these changes out again easily but thought I'd leave them in to get feedback for the moment.

8. GHA workflow (build-deploy-ferceqr.yml)

  • Made the workflow_dispatch configurable so we can use it for partial runs & test deployments.
  • Added start_partition / end_partition workflow inputs for targeted backfills.
  • Added deployment_mode input (production / test / none).
  • Removed hardcoded GCS_OUTPUT_BUCKET / S3_OUTPUT_BUCKET env vars — deployment paths come from the YAML config now.
  • Added PUDL_FERCEQR_ARCHIVE_PATH and PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH as container env vars.

9. New / updated classes in resources.py:

  • FercEqrArchiveResource — re-named from FercEqrDataConfig since it has a different structure / job than the DataConfig classes (it's more like the Datastore -- where is the data, not what data to process)
  • FercEqrDeploymentTargetConfig / FercEqrDeploymentResource — deployment destinations

10. Job definition changes (jobs.py)

  • Named jobs extracted as module-level variables (ferceqr_job, ferceqr_deployment_job, etc.) rather than inline in default_jobs.
  • ferceqr_deployment_job added with deploy_ferceqr and handle_ferceqr_failure asset selection.
  • PUDL jobs now also exclude the ferceqr_deployment group, preventing accidental materialization during full ETL runs.

Files Changed

File Nature of Change
src/pudl/dagster/sensors.py Replaced single polling sensor with two coordinated run-status sensors
src/pudl/dagster/assets/deploy/ferceqr.py Richer notifications, configurable deployment, step-status tables
src/pudl/dagster/resources.py Added deployment, notification, and archive resources
src/pudl/dagster/jobs.py Extracted named jobs, added ferceqr_deployment_job
builds/ferceqr_batch.sh Simplified, added preflight checks, trap-based cleanup, partition range support
builds/pudl_batch.sh Aligned patterns from ferceqr_batch.sh (trap cleanup, config selection)
.github/workflows/build-deploy-ferceqr.yml Deployment mode, partition range, resource adjustments
src/pudl/scripts/check_path_permissions.py New — generic path permission checker
src/pudl/package_data/settings/ferceqr_deployment_targets.yml New — production deployment targets
src/pudl/package_data/settings/ferceqr_test_deployment_targets.yml New — test deployment targets
builds/dagster-ferceqr.yaml New — FERC EQR-specific Dagster config with QueuedRunCoordinator
tests/pipeline/ferceqr_test.py New — end-to-end pipeline integration test
tests/unit/dagster/ferceqr_deployment_test.py New — sensor and deployment handler unit tests
tests/unit/scripts/check_path_permissions_test.py New — permission checker unit tests
tests/unit/dagster/resources_test.py Expanded with deployment-target and Zulip resource tests
pyproject.toml Added check_path_permissions script entry point; coverage threshold adjustment

Testing

Unit tests

pixi run pytest --no-cov \
    tests/unit/dagster/ferceqr_deployment_test.py \
    tests/unit/dagster/resources_test.py \
    tests/unit/scripts/check_path_permissions_test.py

FERC EQR pipeline test

Note: this requires a local FERC EQR archive.

pixi run pytest --no-cov -s tests/pipeline/ferceqr_test.py -v

I plan to split out the PUDL pipeline tests in a future PR (i.e. all the tests that depend on having actually run the PUDL fast ETL) so that we can have more clearly delineated unit & integration tests (neither of which require the ETL) then the pipeline tests (that actually run Dagster and the fast ETL) and the data validation tests (which depend on the pipeline tests having been run).

Addendum from Production Deployment

6hr Timeout & Truncated Logs

  • Yesterday I tried to do a full deployment and it hit the 6-hour timeout in the middle of uploading to test.catalyst.coop which resulted in invalid / incomplete outputs being deployed. Bad!
  • To avoid this or another kind of deployment failure, I added a staging step, where the big, slow upload sends data to a path adjacent to the ultimate destination with a random string in the name.
  • Only once the data has all been successfully uploaded does it get copied to the final destination, which is extremely fast because it's all server-side. This should make it much less likely that the deployment gets truncated.
  • Debugging this was more tedious than I expected because the killall dagster that gets run in the build container as soon as the FERCEQR_FAILURE sentinel file shows up is very aggressive... which meant that the logging output that would explain why the failure happened wasn't getting flushed. And it took a while for me to figure out the logging issue before finding and fixing the dumb "removing a file that's already gone" issue from the logging output.
  • This also highlighted the fact that while we trigger the deployment job (success or failure) based on the outcome of the ferceqr data processing job, we didn't have any kind of handling for the deployment job itself so I added some fallback clean up & notification so that we know wtf happened if the deployment fails.

OOM? Heartbeat timeout? Or something else?

  • Having fixed that, I doubled the VM resources (and concurrency) and tried to run another full ETL, but... ti failed with something that looked like an OOM error! Why?!
  • Dropping the concurrency back to 10 (on a 16-core VM) didn't make the ETL run any slower (it's ~3 hours) but did seem to fix the OOM issue.
  • However, I think there's still some kind of problem because only on the VM the dagster.code_server seems to keep dying and getting restarted over and over again for the entire build. This does not happen locally.
  • The previous failure was from being unable to connect to the gRPC server because it was apparently not running, which seemed maybe related to the code server either dying from OOM or taking so long to import (with 20 other things running int he background) that it hit the timeout.
  • But anyway, even with the low concurrency, the infinite karmic cycle of process death and respawning continues and seems broken. So I will make an issue.

Cloud Permissions

  • And then today I tried to run a production deployment and ran into cloud permissions issues, which (thankfully!) were immediately flagged by the check_path_permissions script that runs before the ETL (nevermind the fact that the permissions issues were also. uh. caused by that script, not effectively passing the storage_options all the way through). So now that's fixed.

Followup Issues

@zaneselvans zaneselvans self-assigned this May 25, 2026
@zaneselvans zaneselvans added ferceqr Data from FERC's Electric Quarterly Review (EQR) dagster Issues related to our use of the Dagster orchestrator nightly-builds Anything having to do with nightly builds or continuous deployment. labels May 25, 2026
@zaneselvans zaneselvans moved this from New to In review in Catalyst Megaproject May 25, 2026
@zaneselvans zaneselvans requested a review from pudlbot May 25, 2026 09:20
@zaneselvans zaneselvans changed the title hotfix: FERC EQR job launch & sensor issues Fix FERC EQR job launch & sensor issues May 25, 2026
@zaneselvans zaneselvans requested review from zschira and removed request for pudlbot May 25, 2026 09:34
@zaneselvans zaneselvans changed the title Fix FERC EQR job launch & sensor issues Fix FERC EQR batch job May 26, 2026
high_memory_assets = [
"out_vcerare__hourly_available_capacity_factor",
"core_epacems__hourly_emissions",
"core_ferceqr__transactions",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid this chonky boi for the moment so we don't run out of memory during validation.

@zaneselvans zaneselvans changed the title Fix FERC EQR batch job Upeate FERC EQR batch job for new Dagster setup May 27, 2026
Replace the FERC EQR deployment control path with run-status sensors and
introduce a dedicated deployment job for success/failure follow-up assets.

- Switch from partition-status polling to SUCCESS/FAILURE run-status sensors
  for the `ferceqr` job, and pass source run metadata via run tags.
- Keep FERC EQR deployment assets always present in Dagster definitions, with
  runtime gating via `FERCEQR_BUILD` in the deployment wrapper.
- Migrate deployment notifications from Slack to Zulip, including reusable
  Markdown message construction and step-status summaries.
- Add Dagster resources for Zulip notifications and FERC EQR bucket deployment
  settings, and wire them into default resources.
- Refactor deploy/failure handlers to use resource-injected config and write
  SUCCESS/FAILURE sentinel files for batch orchestration.
- Add focused unit tests for sensors, deployment handlers, and new resources.
@zaneselvans zaneselvans marked this pull request as draft May 27, 2026 12:27
@zaneselvans zaneselvans changed the title Upeate FERC EQR batch job for new Dagster setup Update FERC EQR batch job for new Dagster setup May 27, 2026
@zaneselvans zaneselvans moved this from In review to In progress in Catalyst Megaproject May 27, 2026
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 9, 2026
Replace the direct-to-target file copying in `deploy_ferceqr` with a staging-then-rename
pattern to prevent partial deployments on timeout or crash. All Parquet files and the
datapackage JSON are first uploaded to a `._staging_{BUILD_ID}_{random}` subdirectory
beneath each target, then atomically promoted to the final path via `UPath.rename()`. On
GCS and S3 the rename is a server-side operation (no re-upload), so the promotion window
is near-instant regardless of data volume.

The datapackage JSON is now written to `pudl_output` alongside the Parquet data before
deployment, so it can be copied like any other file and remains as a permanent record of
the build.

In `ferceqr_batch.sh`, capture `inotifywait`'s exit code to distinguish timeouts (exit
2) from other failures. On timeout, log the `PUDL_OUTPUT` directory contents for
forensics and write `FERCEQR_FAILURE` so the build is correctly reported as failed. The
timeout value is now configurable via the `FERCEQR_BUILD_TIMEOUT_HOURS` environment
variable (default 8 hours). Add a `ferceqr_cleanup_staging.py` script that sweeps
orphaned `._staging_*` directories from deployment targets (via UPath for both local
paths and cloud URIs) if the build is interrupted mid-upload. Called from
`cleanup_on_exit` in `ferceqr_batch.sh`.

Add unit tests for the new staging helpers: `_staging_path`, `_deploy_to_staging`,
`_promote_staging`, `_remove_staging`, and the cleanup-on-failure path.
@zaneselvans zaneselvans linked an issue Jun 12, 2026 that may be closed by this pull request
@zaneselvans zaneselvans added testing Writing tests, creating test data, automating testing, etc. cloud Stuff that has to do with adapting PUDL to work in cloud computing context. labels Jun 12, 2026
) as filing_archive,
tempfile.TemporaryDirectory() as tmp_dir,
):
logger.info(f"Extracting CSVs from {filing}.")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was generating 150,000 lines of logs, but very little useful information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cloud Stuff that has to do with adapting PUDL to work in cloud computing context. dagster Issues related to our use of the Dagster orchestrator ferceqr Data from FERC's Electric Quarterly Review (EQR) nightly-builds Anything having to do with nightly builds or continuous deployment. testing Writing tests, creating test data, automating testing, etc.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Report both ETL and deployment durations for FERC EQR

1 participant