Improve FERC EQR sensor, deployment, testability, & notifications #5273
Open
zaneselvans wants to merge 80 commits into
zaneselvans
commented
May 26, 2026
high_memory_assets = [
    "out_vcerare__hourly_available_capacity_factor",
    "core_epacems__hourly_emissions",
    "core_ferceqr__transactions",
Member
Author
Avoid this chonky boi for the moment so we don't run out of memory during validation.
Replace the FERC EQR deployment control path with run-status sensors and introduce a dedicated deployment job for success/failure follow-up assets.
- Switch from partition-status polling to SUCCESS/FAILURE run-status sensors for the `ferceqr` job, and pass source run metadata via run tags.
- Keep FERC EQR deployment assets always present in Dagster definitions, with runtime gating via `FERCEQR_BUILD` in the deployment wrapper.
- Migrate deployment notifications from Slack to Zulip, including reusable Markdown message construction and step-status summaries.
- Add Dagster resources for Zulip notifications and FERC EQR bucket deployment settings, and wire them into default resources.
- Refactor deploy/failure handlers to use resource-injected config and write SUCCESS/FAILURE sentinel files for batch orchestration.
- Add focused unit tests for sensors, deployment handlers, and new resources.
Replace the direct-to-target file copying in `deploy_ferceqr` with a staging-then-rename
pattern to prevent partial deployments on timeout or crash. All Parquet files and the
datapackage JSON are first uploaded to a `._staging_{BUILD_ID}_{random}` subdirectory
beneath each target, then promoted to the final path via `UPath.rename()`. On GCS and
S3 the rename runs server-side (a per-object copy and delete, no re-upload), so the
promotion window is short, though on object stores it is not strictly atomic.
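A minimal local-filesystem sketch of the staging-then-rename idea, using `pathlib` rather than `UPath` so it runs without cloud credentials. The function name `deploy_via_staging` and the per-file promotion loop are illustrative assumptions; the PR's own helpers (`_staging_path`, `_deploy_to_staging`, `_promote_staging`, `_remove_staging`) are not shown here and may work differently.

```python
import shutil
import uuid
from pathlib import Path


def deploy_via_staging(source_dir: Path, target_dir: Path, build_id: str) -> Path:
    """Copy files into a hidden staging directory, then promote them by rename.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    staging = target_dir / f"._staging_{build_id}_{uuid.uuid4().hex[:8]}"
    staging.mkdir(parents=True)
    try:
        # Upload phase: a crash here leaves the final paths untouched.
        for src in sorted(source_dir.iterdir()):
            if src.is_file():
                shutil.copy2(src, staging / src.name)
        # Promotion phase: one rename per file, no data is copied again.
        for staged in sorted(staging.iterdir()):
            staged.replace(target_dir / staged.name)
    finally:
        # Remove whatever is left of the staging directory, even on failure.
        shutil.rmtree(staging, ignore_errors=True)
    return target_dir
```

The slow part (copying) only ever touches the staging directory, so readers of the target never see a half-written file; only the short rename loop changes what is published.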
The datapackage JSON is now written to `pudl_output` alongside the Parquet data before
deployment, so it can be copied like any other file and remains as a permanent record of
the build.
In `ferceqr_batch.sh`, capture `inotifywait`'s exit code to distinguish timeouts (exit
2) from other failures. On timeout, log the `PUDL_OUTPUT` directory contents for
forensics and write `FERCEQR_FAILURE` so the build is correctly reported as failed. The
timeout value is now configurable via the `FERCEQR_BUILD_TIMEOUT_HOURS` environment
variable (default 8 hours). Add a `ferceqr_cleanup_staging.py` script that sweeps
orphaned `._staging_*` directories from deployment targets (via UPath for both local
paths and cloud URIs) if the build is interrupted mid-upload. Called from
`cleanup_on_exit` in `ferceqr_batch.sh`.
Add unit tests for the new staging helpers: `_staging_path`, `_deploy_to_staging`,
`_promote_staging`, `_remove_staging`, and the cleanup-on-failure path.
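The orphan sweep performed by `ferceqr_cleanup_staging.py` could look roughly like the sketch below. It is a local-only approximation using `pathlib`; the real script works through `UPath` for cloud URIs, and the function name `sweep_staging_dirs` is a hypothetical stand-in.

```python
import shutil
from pathlib import Path

STAGING_PREFIX = "._staging_"


def sweep_staging_dirs(targets: list[Path]) -> list[Path]:
    """Delete leftover staging directories directly beneath each target.

    Hypothetical helper for illustration; returns the directories removed.
    """
    removed: list[Path] = []
    for target in targets:
        if not target.is_dir():
            continue  # Target does not exist locally, so nothing to sweep.
        for child in sorted(target.iterdir()):
            # Only directories carrying the staging prefix are touched.
            if child.is_dir() and child.name.startswith(STAGING_PREFIX):
                shutil.rmtree(child)
                removed.append(child)
    return removed
```

Matching on the prefix alone keeps the sweep independent of the build ID, so it also clears directories left behind by earlier interrupted builds.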
zaneselvans
commented
Jun 13, 2026
    ) as filing_archive,
    tempfile.TemporaryDirectory() as tmp_dir,
):
    logger.info(f"Extracting CSVs from {filing}.")
Member
Author
This was generating 150,000 lines of logs, but very little useful information.
Overview
Okay, what the hell happened in here? It's +2,500 lines (?!) but ~1,000 of them are tests, and ~500 are a simple utility script that checks permissions on a `UPath`. The core of the changes can be found in:
- `src/pudl/dagster/resources.py`
- `src/pudl/dagster/sensors.py`
- `src/pudl/dagster/assets/deploy/ferceqr.py`
- `.github/workflows/build-deploy-ferceqr.yml`
- `builds/ferceqr_batch.sh`
For the May 2026 update, after getting a clean archive, I tried launching a new FERC EQR run, and found that there was some stuff that needed to be aligned with the new Dagster setup:
- The `builds/ferceqr_batch.sh` script needed to be updated from `pudl.etl` to just plain `pudl`.
- … `None` rather than a dictionary of values back. This doesn't cause the sensor to fail, but the error looks concerning / distracting.
- … `BUILD_ID` so it kept overwriting a file called `ferceqr_logs` rather than writing to sub-paths beneath that "directory"-like path.
- The `core_ferceqr__transactions` table was, sadly, choking in the pandera schema checks (though not 100% of the time) so I added it to the "high memory" asset list. Polars subsequently fixed their memory leak bug... but it didn't magically fix this problem.
These were all minor tweaks, but in the process of addressing them, I realized how hard it was to test anything in the EQR build + deployment end-to-end without processing everything and (potentially) writing it out to the production distribution bucket.
I also wanted to learn more about how Dagster's sensors and partitioned assets work, and was interested in switching the notifications over from Slack to Zulip.
I did a lot but not all of this on my own time.
Changes
1. Backfill-aware run-status sensors (`sensors.py`)
The old `ferceqr_deployment_sensor` (a single `@dg.sensor` polling partition statuses) is replaced with two `@dg.run_status_sensor` sensors:
- `ferceqr_success_sensor`: triggers on `SUCCESS` runs of the `ferceqr` job
- `ferceqr_failure_sensor`: triggers on `FAILURE` runs of the `ferceqr` job
Both sensors use the `_backfill_sensor_skip_reason_or_runs()` function, which ensures exactly one notification and `RunRequest` is generated per backfill, by waiting until all sibling runs in a backfill have reached a terminal state before producing a `RunRequest`.
- … `SkipReason`.
- … `RunRequest`. If none failed, then the success sensor produces a `RunRequest`.
- … `RunRequest` carries partition list and source-run-ID tags so the deployment asset can build rich status tables.
2. Configurable deployment targets (`resources.py`)
Deployment destinations are now configurable resources rather than hardcoded in the deployment assets:
- `FercEqrDeploymentTargetConfig(dg.Config)`: A single destination with a `path` (local, `file://`, `s3://`, or `gs://`), optional `storage_options`, and an optional `append_build_id` flag for writing to our test bucket. Includes a Pydantic `field_validator` that checks that local paths exist and are writable, and that remote URLs use supported schemes.
- `FercEqrDeploymentResource(ConfigurableResource)`: A collection of targets, loadable from a YAML file (via `PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH`) or inline config. The `resolved_targets()` method returns `UPath` objects ready for copying.
Two packaged YAML configs are provided:
- `ferceqr_deployment_targets.yml`: production targets (GCS + S3)
- `ferceqr_test_deployment_targets.yml`: test target (single GCS testing bucket with `append_build_id: true`)
The deployment mode can be selected in CI via a `deployment_mode` workflow input (production / test / none).
Generalizing the deployment target(s) with UPath and making it configurable made it easy to do local test deployments to a filesystem path, or cloud-based test deployments to `gs://test.catalyst.coop/BUILD_ID/`
3. Richer, centralized notifications (`deploy/ferceqr.py`)
The new `build_ferceqr_notification()` function centralizes notification formatting, and prepares markdown output to feed into the recently implemented `ZulipNotificationResource`. Notifications now include: …
The `deployment_status_asset` decorator was updated to pass a full `AssetExecutionContext` (instead of individual resources), which the handler uses to access run tags, instance APIs for step statuses, and resources.
4. Preflight path-permission checker (`check_path_permissions`)
Added a new CLI script `src/pudl/scripts/check_path_permissions.py` that validates read, write, and delete access for local and remote paths (via `UPath`). It supports:
- `--read`, `--write` flags for selecting checks
- `--json` structured output for machine consumption
- `--anon` for anonymous filesystem access
- `--check-ferceqr-deployment-paths` a shortcut to validate all configured deployment targets, based on the current configuration / environment.
This script is used in `ferceqr_batch.sh` as a preflight step to fail fast before the ETL starts... rather than at the end after it's been running for 6 hours.
5. FERC EQR pipeline integration test (`tests/pipeline/ferceqr_test.py`)
A session-scoped integration test that:
- … `dagster-daemon run` in the background
- … `dagster job backfill` for two test partitions
- … `FERCEQR_SUCCESS` / `FERCEQR_FAILURE` sentinels
- … `ferceqr_parquet_datapackage.json` is present and correct
Uses isolated temporary directories for `DAGSTER_HOME`, `PUDL_OUTPUT`, and deployment targets, so no cloud credentials are needed. Only requires a local FERC EQR archive at `PUDL_FERCEQR_ARCHIVE_PATH`.
6. Unit tests
- `tests/unit/dagster/ferceqr_deployment_test.py`: Tests for both success and failure sensors covering: skip non-backfill runs, skip while backfill in progress, skip when counterpart sensor should handle, aggregated `RunRequest` production, and success sentinel behavior.
- `tests/unit/dagster/resources_test.py`: Expanded with tests for `FercEqrDeploymentTargetConfig` validation (accepts valid paths, rejects invalid URIs, missing directories, non-writable directories), `FercEqrDeploymentResource` loading from YAML, build-id appending, and `ZulipNotificationResource` error handling.
- `tests/unit/scripts/check_path_permissions_test.py`: Tests for read/write checks and the CLI interface.
7. EQR build script improvements (`ferceqr_batch.sh`)
- … `dagster-daemon run` instead of `dagster dev`.
- … `trap cleanup_on_exit EXIT` to ensure log uploads and cleanup happen regardless of exit path.
- … `FERCEQR_START_PARTITION` / `FERCEQR_END_PARTITION` support for partial backfills (e.g. via `workflow_dispatch` in GitHub Actions)
- … `check_path_permissions` for archive and deployment paths, plus an inline call to `DagsterInstance.get()` to verify Dagster storage is initialized (otherwise dagster daemon and the backfill fight over who gets to initialize it)
- … `pudl_batch.sh` (trap-based cleanup, consolidated log handling via `exec > >(tee ...)`, and a separate `dagster-pudl.yaml` config) as well as disabling PostgreSQL -- I imagine these changes will conflict with whatever @zschira has going in the build + deploy splitting PR. I can split these changes out again easily but thought I'd leave them in to get feedback for the moment.
8. GHA workflow (`build-deploy-ferceqr.yml`)
- … `workflow_dispatch` configurable so we can use it for partial runs & test deployments.
- … `start_partition` / `end_partition` workflow inputs for targeted backfills.
- … `deployment_mode` input (production / test / none).
- … `GCS_OUTPUT_BUCKET` / `S3_OUTPUT_BUCKET` env vars; deployment paths come from the YAML config now.
- … `PUDL_FERCEQR_ARCHIVE_PATH` and `PUDL_FERCEQR_DEPLOYMENT_CONFIG_PATH` as container env vars.
9. New / updated classes in `resources.py`:
- `FercEqrArchiveResource`: re-named from `FercEqrDataConfig` since it has a different structure / job than the `DataConfig` classes (it's more like the Datastore -- where is the data, not what data to process)
- `FercEqrDeploymentTargetConfig` / `FercEqrDeploymentResource`: deployment destinations
10. Job definition changes (`jobs.py`)
- … `ferceqr_job`, `ferceqr_deployment_job`, etc.) rather than inline in `default_jobs`.
- … `ferceqr_deployment_job` added with `deploy_ferceqr` and `handle_ferceqr_failure` asset selection.
- … `ferceqr_deployment` group, preventing accidental materialization during full ETL runs.
Files Changed
- `src/pudl/dagster/sensors.py`
- `src/pudl/dagster/assets/deploy/ferceqr.py`
- `src/pudl/dagster/resources.py`
- `src/pudl/dagster/jobs.py`: … `ferceqr_deployment_job`
- `builds/ferceqr_batch.sh`
- `builds/pudl_batch.sh`: … `ferceqr_batch.sh` (trap cleanup, config selection)
- `.github/workflows/build-deploy-ferceqr.yml`
- `src/pudl/scripts/check_path_permissions.py`
- `src/pudl/package_data/settings/ferceqr_deployment_targets.yml`
- `src/pudl/package_data/settings/ferceqr_test_deployment_targets.yml`
- `builds/dagster-ferceqr.yaml`: … `QueuedRunCoordinator`
- `tests/pipeline/ferceqr_test.py`
- `tests/unit/dagster/ferceqr_deployment_test.py`
- `tests/unit/scripts/check_path_permissions_test.py`
- `tests/unit/dagster/resources_test.py`
- `pyproject.toml`: … `check_path_permissions` script entry point; coverage threshold adjustment
Testing
Unit tests
    pixi run pytest --no-cov \
      tests/unit/dagster/ferceqr_deployment_test.py \
      tests/unit/dagster/resources_test.py \
      tests/unit/scripts/check_path_permissions_test.py
FERC EQR pipeline test
Note: this requires a local FERC EQR archive.
I plan to split out the PUDL pipeline tests in a future PR (i.e. all the tests that depend on having actually run the PUDL fast ETL) so that we can have more clearly delineated unit & integration tests (neither of which requires the ETL), pipeline tests (which actually run Dagster and the fast ETL), and data validation tests (which depend on the pipeline tests having been run).
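For anyone testing the sensors by hand, the backfill gating described under item 1 of Changes boils down to a wait-then-route decision. The sketch below is a plain-Python approximation with no Dagster dependency. `RunState`, `backfill_decision`, and the choice to treat a canceled run as a failure are all assumptions made for illustration, and it leaves out how duplicate requests from sibling runs are suppressed.

```python
from enum import Enum


class RunState(Enum):
    """Simplified stand-in for run statuses (hypothetical, not Dagster's enum)."""

    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    CANCELED = "CANCELED"


TERMINAL = {RunState.SUCCESS, RunState.FAILURE, RunState.CANCELED}


def backfill_decision(sensor_kind: str, sibling_states: list[RunState]) -> str:
    """Return "skip" or "request" for one evaluation of a sensor.

    sensor_kind is "success" or "failure"; sibling_states holds the state of
    every run in the same backfill, including the run that triggered the sensor.
    """
    # Wait until every run in the backfill has reached a terminal state.
    if any(state not in TERMINAL for state in sibling_states):
        return "skip"
    # Assumption: anything other than SUCCESS counts as a failed backfill.
    any_failed = any(state is not RunState.SUCCESS for state in sibling_states)
    # Exactly one of the two sensors owns a finished backfill.
    if sensor_kind == "failure":
        return "request" if any_failed else "skip"
    return "skip" if any_failed else "request"
```

Because the two sensors apply complementary rules to the same list of sibling states, a finished backfill is claimed by one of them and skipped by the other.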
Addendum from Production Deployment
6hr Timeout & Truncated Logs
- … `test.catalyst.coop` which resulted in invalid / incomplete outputs being deployed. Bad!
- … `killall dagster` that gets run in the build container as soon as the `FERCEQR_FAILURE` sentinel file shows up is very aggressive... which meant that the logging output that would explain why the failure happened wasn't getting flushed. And it took a while for me to figure out the logging issue before finding and fixing the dumb "removing a file that's already gone" issue from the logging output.
- … `ferceqr` data processing job, we didn't have any kind of handling for the deployment job itself so I added some fallback clean up & notification so that we know wtf happened if the deployment fails.
OOM? Heartbeat timeout? Or something else?
- `dagster.code_server` seems to keep dying and getting restarted over and over again for the entire build. This does not happen locally.
Cloud Permissions
- … `check_path_permissions` script that runs before the ETL (never mind the fact that the permissions issues were also. uh. caused by that script, not effectively passing the `storage_options` all the way through). So now that's fixed.
Followup Issues