Add FERC provenance metadata to datapackage by zschira · Pull Request #5264 · catalyst-cooperative/pudl

zschira · 2026-05-19T18:47:31Z

Overview

Closes #5220.

What problem does this address?

This PR allows us to use ferc_to_sqlite outputs in nightly builds as a cache for local development and in CI. It also directly caches outputs in CI to bypass nightly builds in CI runs when possible.

What did you change?

This PR adds FercSqliteProvenanceRecord to the datapackage.json associated with the outputs for a specific FERC Form and data format (DBF or XBRL). It then updates the ferc_to_sqlite flow so that it tries to reuse existing outputs, either local ones or those from nightly builds. It does so as follows:

1. Check if there is a datapackage.json file locally.
2. Check if the provenance in datapackage.json is compatible with the requirements of the current run. If so, go to step 6.
3. Download datapackage.json from nightly builds and check provenance compatibility. If not compatible, go to step 5.
4. Download all relevant outputs from nightly builds (just sqlite/datapackage for dbf; xbrl also includes parquet, duckdb, and taxonomy json).
5. Run extraction from scratch.
6. Return dagster materialization metadata with provenance metadata.
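
The steps above can be sketched as a small decision function. This is only an illustration of the control flow, not the code in this PR: `MaterializationPath`, `choose_materialization_path`, and the dict-based provenance are hypothetical names, and the downloads and the extraction itself are left out.

```python
from enum import Enum
from typing import Callable, Optional


class MaterializationPath(Enum):
    """Hypothetical labels for the three ways a FERC DB can be materialized."""

    LOCAL = "local"
    NIGHTLY = "nightly"
    EXTRACTED = "extracted"


def choose_materialization_path(
    local_provenance: Optional[dict],
    nightly_provenance: Optional[dict],
    is_compatible: Callable[[dict], bool],
) -> MaterializationPath:
    """Pick a materialization path following the numbered steps above."""
    # Steps 1-2: reuse local outputs when their provenance is compatible.
    if local_provenance is not None and is_compatible(local_provenance):
        return MaterializationPath.LOCAL
    # Steps 3-4: otherwise fall back on compatible nightly build outputs.
    if nightly_provenance is not None and is_compatible(nightly_provenance):
        return MaterializationPath.NIGHTLY
    # Step 5: nothing reusable was found, so extract from scratch.
    return MaterializationPath.EXTRACTED
```

Whichever branch is taken, step 6 would then report the chosen path alongside the provenance in the materialization metadata.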

Documentation

Make sure to update relevant aspects of the documentation:

Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

Added unit tests for writing / reading metadata to / from SQLite DBs + compatibility checks using this new cached metadata

To-do list

Add a way to force re-extraction

zschira · 2026-05-20T15:24:12Z

+    @classmethod
+    def from_sqlite(cls, sqlite_path: Path) -> "FercSqliteProvenanceRecord":
+        """Read SQLite provenance metadata from DB."""
+        try:


This is wrapped in a try / except to handle cases where the DB doesn't exist or the _provenance_metadata table doesn't exist (for DBs created before this change).
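
A minimal sketch of that fallback, using only the standard library. The `_provenance_metadata` table name comes from the comment above; the key/value column layout and the `read_provenance` helper are assumptions made for illustration and may not match the real schema.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def read_provenance(sqlite_path: Path) -> Optional[dict]:
    """Return stored provenance as a dict, or None if it cannot be read."""
    if not sqlite_path.exists():
        # sqlite3.connect() would otherwise create an empty DB at this path.
        return None
    conn = sqlite3.connect(sqlite_path)
    try:
        # Assumed layout: one row per provenance field, as (key, value).
        rows = conn.execute(
            "SELECT key, value FROM _provenance_metadata"
        ).fetchall()
    except sqlite3.OperationalError:
        # Raised when the table is missing, e.g. DBs created before this change.
        return None
    finally:
        conn.close()
    return dict(rows)
```

Returning `None` in both failure cases lets the caller treat "no DB" and "DB without provenance" the same way: neither can be trusted as a cache.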

zschira · 2026-05-20T15:42:19Z

            f"required={sorted(required_years)}"
        )

+    if stored.ferc_xbrl_extractor_version != provenance.ferc_xbrl_extractor_version:


I made this also check the version of catalystcoop.ferc_xbrl_extractor if data_format is xbrl. This seemed important if we're going to rely on cached DBs.
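
Roughly, the check described here could look like the sketch below. The `Provenance` dataclass and `is_compatible` function are simplified, hypothetical stand-ins: they cover only the required years and the extractor version, and the real check presumably compares more fields.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Provenance:
    """Simplified, hypothetical stand-in for the stored provenance record."""

    data_format: str  # "dbf" or "xbrl"
    years: FrozenSet[int]
    ferc_xbrl_extractor_version: Optional[str] = None


def is_compatible(stored: Provenance, required: Provenance) -> bool:
    """Return True if a cached DB can stand in for a fresh extraction."""
    if stored.data_format != required.data_format:
        return False
    # Every required year must be present; extra stored years are allowed.
    if not required.years <= stored.years:
        return False
    # The extractor version only matters for XBRL-derived databases.
    if required.data_format == "xbrl":
        return (
            stored.ferc_xbrl_extractor_version
            == required.ferc_xbrl_extractor_version
        )
    return True
```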

zschira · 2026-05-20T15:45:23Z

-        )
-        return
-
-    if instance is None:


Separated out the "get provenance from instance" so we can call this method with provenance records from SQLite as well.

zaneselvans

Some things that seem like bugs (which maybe also point at new tests to add):

We unconditionally delete existing SQLite and DuckDB databases, which causes two downstream problems: we can never use the local SQLite DB even if it would have been compatible, and we lose the DuckDB outputs, which are now only conditionally re-created. There's also no management of the Parquet outputs in here at all, which seems sketchy. The filesystem consequences of all the materialization paths should be identical.
We don't set the ferc_xbrl_extractor_version in the XBRL case, but we do set it in the DBF case. Seems like they got switched accidentally.
Not necessarily a bug but the sqlite_path will always be a local path that pertains only to the system that the DB was created on, so it seems like we probably should not be writing it into the DB and passing it around and relying on it -- it was okay when this was just part of the Dagster metadata and could never move out of that context, but now it seems like it should maybe be handled separately.
Not a bug but it sure would be nice if we could check the provenance without needing to download hundreds of megabytes of database first -- this is something we can do with DuckDB if/when we switch to using that as the primary input. And actually we're already producing the DuckDB databases in the nightly builds, so if we wanted to, we could read the provenance from the DuckDB outputs (if it's being written there, which I think we should do in order to keep the two DBs equivalent) and use that to decide whether we want to download the SQLite outputs (since in the case of the nightly build outputs we'll have very high confidence that the two DBs are from the same run).

Side note: if/when we switch to Parquet being the primary output, where do we store the provenance, and how do we ensure that it is valid for all of the hundreds of tables simultaneously?

There's a lot in here. Let me know if you would like to talk through it and come up with a todo list together.

zaneselvans · 2026-05-21T03:22:08Z

+* The version of ``catalystcoop.ferc_xbrl_extractor`` that was used to perform the
+  conversion when considering XBRL outputs.


Another good argument for splitting out a ferc_dbf_extractor package that depends on dbfread and takes care of the extraction logic to the same degree as ferc_xbrl_extractor is that then we would have a clear "did the code change?" marker for both the XBRL and DBF provenance which would be nice.

zaneselvans · 2026-05-21T03:33:23Z

+                data_config=ferc_to_sqlite,
+                sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"),
+            )
+            provenance.to_sqlite()


It would be great to add a to_duckdb() in here too, so the two DBs are equivalent, and we can read the provenance remotely instead of needing to download the whole SQLite to determine whether it'll work.

I think that switching to adding this to duckdb makes sense, however we're not currently producing duckdb files for the DBF outputs, so it would add more complexity to have to handle each case differently. I lean towards creating a followup issue to update the DBF extraction to output a duckdb file, then switch to reading provenance from duckdb.

zaneselvans · 2026-05-21T03:35:28Z

+                    dataset=form, data_format="xbrl"
+                ),
+                data_config=ferc_to_sqlite,
+                sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"),


I can see how storing the sqlite_path in here is convenient, but it also seems brittle, since that absolute path will only make sense in the context in which the provenance was generated, and we are going to be passing the SQLite DBs around between nightly builds and our development machines... as well as distributing them to the outside world.

Is there a need to store the DB path inside the provenance object? Or could it exist ephemerally alongside it here in code, where we know it is relevant? It made more sense when the provenance only existed in the context of the Dagster metadata and particular Dagster instance / run, but it's not really a portable value. Maybe we make it optional and just don't write it to the database table, but do allow it to exist in the Dagster metadata?

I've just removed sqlite_path altogether. I don't see a lot of benefit from adding it to just the dagster metadata since we always use the same path anyway.

zaneselvans · 2026-05-21T06:23:07Z

@@ -134,33 +259,45 @@ def _asset(context) -> dg.MaterializeResult[str]:
        if duckdb_path.exists():


The duckdb database is unconditionally deleted if it exists, and we assume that it'll get recreated below. But now convert_form is only executed conditionally, depending on whether we're able to find a compatible SQLite DB. So if we do find a cached SQLite DB we never get the DuckDB database back.

I've moved this so it will only get deleted if the provenance check fails.

zaneselvans · 2026-05-21T06:33:06Z

+            f"Provenance metadata for local version of {sqlite_path.name} is incompatible."
+            " Downloading version from nightly builds."


I think this message will get displayed also if there just wasn't any SQLite DB locally, which is a little misleading -- would be nice to differentiate between "There's no local DB" and "The local DB is incompatible." for the user.

from_sqlite and ferc_sqlite_provenance_is_compatible have their own logs that should distinguish these cases, so I've just limited the log here.

zaneselvans · 2026-05-21T06:36:14Z

If we're no longer going to allow skipping, then we should probably remove the "skipped" option from this literal.

zschira · 2026-06-10T20:15:11Z

@zaneselvans this should be ready for another look:

I tried to test the path of refreshing my local FERC outputs from S3 and it failed with an error about the requested file not being present in the ZIP archive, see stack trace in comments.

This is fixed now!

I think the GitHub caching will get stuck pulling from S3 any time the FERC XBRL Extractor version changes, and also any change to the Zenodo DOIs file will require all FERC DBs to be pulled from S3. I left a draft script & GHA workflow in a comment that I think can let us do per-DB, per-format cache refreshes by constructing the specific provenance metadata for hashing, and keep track of the FERC XBRL Extractor version as well.

I'm using the script you developed for generating better cache keys, thanks!
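
The idea of hashing per-DB, per-format provenance into a cache key can be illustrated as follows. This is not the script referenced above: `provenance_cache_key` and the key format are hypothetical, and the provenance is modeled as a plain dict.

```python
import hashlib
import json


def provenance_cache_key(dataset: str, data_format: str, provenance: dict) -> str:
    """Build a deterministic cache key from one DB's provenance metadata."""
    # Sorting keys makes the serialized form independent of dict ordering.
    canonical = json.dumps(provenance, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # A truncated digest keeps the key short while still changing whenever
    # any provenance field changes.
    return f"ferc-{dataset}-{data_format}-{digest[:16]}"
```

Because each key depends only on the provenance of a single database, a change that affects one form or format would invalidate only that cache entry rather than all of them.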

Because we are now downloading all of the outputs (SQLite, DuckDB, Parquet) when refreshing a local cache (not in CI), downloading the individual Parquet files will contaminate the Eel Hole usage metrics. However, we can avoid that by downloading the single zipfile containing all of the parquet files instead.

Now downloading parquet outputs from zipfile.
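
As a rough illustration of unpacking Parquet outputs from one archive, assuming the zipfile has already been downloaded to a local path (`extract_parquet_members` is a hypothetical helper, and the archive layout is an assumption):

```python
import zipfile
from pathlib import Path
from typing import List


def extract_parquet_members(archive_path: Path, output_dir: Path) -> List[str]:
    """Extract only the .parquet members of a zip archive into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Ignore anything in the archive that is not a Parquet file.
        members = [n for n in archive.namelist() if n.endswith(".parquet")]
        archive.extractall(output_dir, members=members)
    return sorted(members)
```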

I would like you to add a field in the Dagster materialization record that indicates which path was taken to materialize the FERC assets (build from scratch, local DB, or download from S3)

Added to the FercProvenanceRecord now

Please update the PR title and description to reflect what has actually ended up happening in here, so we don't confuse ourselves later. Doesn't need to be super long.

Done

Please update the test coverage requirement to reflect the actual test coverage we expect to see when the FERC DB materialization bypasses the extraction code (In pyproject.toml --> tool.coverage.report.fail_under).

Did a rough estimate based on the number of lines in src/pudl/extract/dbf.py that are currently tested divided by the total lines that are tested and found this should result in ~1.5% drop in codecov. Also created an issue to track adding unit tests for dbf extraction (#5308). On that note, also added an issue for standardizing env variable checking (#5309).

codecov · 2026-06-11T17:08:13Z

Codecov Report

❌ Patch coverage is 91.47059% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.21%. Comparing base (0c76673) to head (7717617).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|:-------------------------|--------:|:------|
| src/pudl/scripts/generate_ferc_provenance.py | 42.86% | 16 Missing ⚠️ |
| src/pudl/dagster/assets/raw/ferc_to_sqlite.py | 92.86% | 8 Missing ⚠️ |
| src/pudl/dagster/provenance.py | 92.73% | 4 Missing ⚠️ |
| tests/unit/dagster/io_managers_test.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #5264    +/-   ##
========================================
  Coverage   93.21%   93.21%            
========================================
  Files         241      242     +1     
  Lines       20357    20534   +177     
========================================
+ Hits        18975    19140   +165     
- Misses       1382     1394    +12


zaneselvans · 2026-06-11T18:49:10Z

I'm still working on the review, but in testing the GitHub Actions caching, I've run into a real issue.

When the integration tests pull the FERC databases from the nightly build outputs, they're getting FULL databases with all of the years of data in them, but then the PUDL ETL that runs is a FAST ETL that only expects to be working with databases that have a few years of data in them. The provenance is compatible -- all of the required years of input data are available -- but our FAST ETL has never been run under these conditions, and there are apparently significant side effects. The way that those side effects are showing up right now is in the dbt data validation tests for out_ferc714__hourly_planning_area_demand -- for some reason pulling from the FULL FERC DBs rather than the fast FERC DBs results in unexpectedly high nullness in some columns:

E           AssertionError: failure contexts:
E           source_expect_missingness_between_pudl_out_ferc714__hourly_planning_area_demand_demand_imputed_pudl_mwh__0_0__0_05:
E           
E           |   total_records |   null_records |   null_proportion |
E           |----------------:|---------------:|------------------:|
E           |         2395657 |         814717 |          0.340081 |
E           =====
E           source_expect_missingness_between_pudl_out_ferc714__hourly_planning_area_demand_respondent_id_ferc714_csv__0_0__0_025:
E           
E           |   total_records |   null_records |   null_proportion |
E           |----------------:|---------------:|------------------:|
E           |         2395657 |          70174 |         0.0292922 |

I imagine there are other side-effects that we're just not catching. I don't know how serious they are.

For debugging purposes I think you can recreate this situation locally in the Dagster UI by running a normal extraction of all the FERC databases, and then trying to materialize all in the pudl job after setting the GlobalDataConfigPath to the etl_fast.yml path. I'm trying that now.

Now that there are provenance-compatible datapackages under eel-hole (because I uploaded them directly) I think that pixi run pytest-integration would also fail with the same error if run locally. Also testing that locally now.

zaneselvans · 2026-06-11T22:26:45Z

@zschira Another real issue: we DO actually use the DuckDB outputs in the integration tests, because we are validating that the DuckDB + SQLite outputs are the same in the integration tests. So we'll need to download and cache those in CI as well.

2026-06-11 21:49:20 [    INFO] tests.integration.extract.ferc_xbrl_extract_test:49 Comparing ferc1 SQLite vs. DuckDB outputs...

For the sake of reproducibility and not getting surprised by CI failures that Worked Fine On My Machine we might want to force the integration tests to engage exactly the same caching behavior regardless of where they are run. Which could mean skipping Parquet downloads even when run locally. Or downloading the Parquet on GitHub if there's space (which would mean everything is always using the full FERC outputs). In which case I think the right place to control the behavior might be tests/integration/conftest.py? Which I don't think currently exists. Ideally we want all of these to do the same thing, axiomatically:

pixi run pytest-integration
pixi run pytest tests/integration/my_test.py
whatever is being run in GitHub CI

Or maybe we just make it the Fast ETL in general -- since the only real application of the fast ETL is testing?

zaneselvans

Blocking

❌ Fast ETL can't consume Full ETL FERC inputs

The Fast ETL currently doesn't work when run with FERC inputs from the Full ETL, which means CI will always fail if it tries to use outputs from S3. As you suggested, a quick fix for this is to never use the S3 outputs in CI and rely entirely on GitHub caching.

This doesn't feel great though, since the provenance data / logic pretty explicitly states "A superset of years is fine!" but actually it is not fine. Do we have any idea how hard this is to fix? Or whether this mismatch causes other downstream problems? Or whether having the "report_year" embedded in the downstream data would be helpful for other things?

❌ Need both SQLite & DuckDB in CI

We do actually need both SQLite and DuckDB outputs in CI, because the integration tests check for their equivalence. I've added the DuckDB files to those that are cached on GitHub and what gets downloaded from nightly in CI. However, this still needs to be tested, and to do so we'll need to purge the GitHub cache somehow.

If this causes disk space issues, we can fall back on skipping the SQLite vs. DuckDB equivalence check if os.getenv("GITHUB_ACTIONS", False).

Non-blocking

Request to rename {local,nightly}_parquet_dir_path to {local,nightly}_parquet_path since it's a zip archive in nightly, not a dir.

zaneselvans · 2026-06-11T03:47:27Z

 sort = "miss"
 skip_empty = true
-fail_under = 93
+fail_under = 91


Oh wow, I thought it would drop much more than this!

zaneselvans · 2026-06-11T03:49:42Z

 resource_description = "pudl.scripts.resource_description:main"
 update_zenodo_dois = "pudl.scripts.update_zenodo_dois:main"
 zenodo_data_release = "pudl.scripts.zenodo_data_release:main"
+generate_ferc_provenance = "pudl.scripts.generate_ferc_provenance:main"


Note that the script being called in pytest.yml in the integration test caching step is ferc_provenance not generate_ferc_provenance

zaneselvans · 2026-06-12T03:08:51Z

+            ${{ env.PUDL_OUTPUT }}/ferc1_xbrl.sqlite
+            ${{ env.PUDL_OUTPUT }}/ferc1_xbrl_datapackage.json
+            ${{ env.PUDL_OUTPUT }}/ferc1_xbrl_taxonomy_metadata.json


I had to tell the integration tests to use a non-temporary PUDL_OUTPUT so that we would have a deterministic path to cache and restore to.

I tested this:

✅ With nothing in the cache.

✅ With fast-ETL outputs from a previous CI run in the cache.

❌ With nothing in the cache and "compatible" outputs in S3.

It turns out that the S3 nightly (full ETL) outputs aren't really compatible with running the fast ETL downstream.

We also have an integration test that depends on having both the SQLite and DuckDB outputs available (so that we can verify that they are equivalent) which fails if using cached outputs right now, because we're not caching the DuckDB outputs. And will also fail when the S3 caching is working, unless we also download the DuckDB outputs.

I went ahead and added DuckDB to the outputs that should be cached on GitHub and downloaded from nightly, but I don't think that will fix the CI for now, because the cache key hasn't changed and it also doesn't have the DuckDB outputs to load from the cache yet.

zaneselvans · 2026-06-12T03:21:58Z

+    if not request.config.getoption("--live-pudl-output") and not os.getenv(
+        "GITHUB_ACTIONS", False


Use a predictable, non-temporary PUDL_OUTPUT directory for pytest if we're on GitHub, so we can use the GitHub cache.
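
A sketch of the selection logic this condition implies. `choose_pudl_output` and its parameters are hypothetical and are not the fixture in this PR; the only external fact relied on is that GitHub Actions sets `GITHUB_ACTIONS=true` in its runners.

```python
import os
import tempfile
from pathlib import Path


def choose_pudl_output(live_pudl_output: bool, default_output: Path) -> Path:
    """Return the output directory that the test session should use."""
    on_github = os.getenv("GITHUB_ACTIONS", "").lower() == "true"
    if live_pudl_output or on_github:
        # A fixed location gives the CI cache a deterministic path to
        # save to and restore from.
        return default_output
    # Everywhere else, isolate test outputs in a throwaway directory.
    return Path(tempfile.mkdtemp(prefix="pudl_output_"))
```

Comparing against the string "true" is a deliberate choice in this sketch: `os.getenv("GITHUB_ACTIONS", False)` is truthy for any non-empty value, including "false".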

zaneselvans · 2026-06-12T03:27:09Z

+    local_taxonomy_json_path: Path | None = None
+    nightly_taxonomy_json_path: UPath | None = None
+    local_parquet_dir_path: Path | None = None
+    nightly_parquet_dir_path: UPath | None = None


The _dir_ here feels a little misleading because it's a zipfile, not a directory. Would it be disruptive to change it to local_parquet_path and nightly_parquet_path instead?

Changed to just parquet_path and added comments indicating that the local version points to a directory of parquet and nightly points to a zipfile.

zaneselvans · 2026-06-12T03:36:49Z

+        accepts a regular ``Path``, as we should never try to write directly to s3.
+        """
+        if datapackage_path.exists():
+            json_dict = json.loads(datapackage_path.read_text())


And because this is reading directly from the UPath, we don't have to worry about clobbering the local datapackage with the nightly outputs before we actually know if it's compatible, right?

Yeah this will just read into memory and we overwrite the datapackage later if it's determined to be compatible.
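
The read-then-overwrite order described in this reply can be sketched as below. `refresh_datapackage` is a hypothetical helper, and a plain `pathlib.Path` stands in for the nightly `UPath`, on the assumption that both expose `exists()` and `read_text()` as the diff above suggests.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def refresh_datapackage(
    remote_path: Path,
    local_path: Path,
    is_compatible: Callable[[dict], bool],
) -> Optional[dict]:
    """Overwrite the local datapackage only after the remote one checks out."""
    if not remote_path.exists():
        return None
    # Parse the candidate entirely in memory; nothing local is touched yet.
    candidate = json.loads(remote_path.read_text())
    if not is_compatible(candidate):
        # The existing local datapackage is left exactly as it was.
        return None
    local_path.write_text(json.dumps(candidate, indent=2))
    return candidate
```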

zschira · 2026-06-16T17:21:55Z

Ok @zaneselvans and I've changed this to not use the nightly cache during integration tests. I also cleared the ferc cache in blacksmith and re-ran the integration tests twice. It passed both times and ran the full extraction the first time and used the cache the second time.

Also added a new issue to start caching the etl-fast outputs (#5334)

zschira added 3 commits May 18, 2026 12:07

Add xbrl extractor version to provenance metadata

ba42fa9

Separate provenance check from loading metadata

ef6dda4

Make ferc_to_sqlite check for cached db from nightly builds

554d769

zschira self-assigned this May 19, 2026

zschira added ferc1 Anything having to do with FERC Form 1 ferc714 Anything having to do with FERC Form 714 sqlite Issues related to interacting with sqlite databases labels May 19, 2026

zschira added this to Catalyst Megaproject May 19, 2026

zschira added metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. ferc2 Issues related to the FERC Form 2 dataset xbrl Related to the FERC XBRL transition ferc6 ferc60 labels May 19, 2026

github-project-automation Bot moved this to New in Catalyst Megaproject May 19, 2026

zschira added 3 commits May 20, 2026 10:26

Allow forcing ferc_to_sqlite extraction

0a2b011

Add documentation for new FERC caching capabilities

9ba3fcd

Merge branch 'main' into ferc-caching

a2fefca

zschira marked this pull request as ready for review May 20, 2026 15:20

zschira moved this from New to In review in Catalyst Megaproject May 20, 2026

zschira commented May 20, 2026

Only check xbrl extractor version for xbrl sqlite dbs

d5cbcda

zschira commented May 20, 2026

Comment thread src/pudl/dagster/provenance.py

zschira commented May 20, 2026

zaneselvans self-requested a review May 20, 2026 17:06

Merge branch 'ferc-caching' of github.com:catalyst-cooperative/pudl i…

b0759f8

…nto ferc-caching

zaneselvans requested changes May 21, 2026

github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject May 21, 2026

zschira added 2 commits May 21, 2026 09:26

Set xbrl extractor version for xbrl provenance, not dbf

2e0675d

Create a single ferc_to_sqlite asset factory

086599c

zschira mentioned this pull request Jun 10, 2026

Standardize environment variable handling #5309

Open

1 task

Merge branch 'main' into ferc-caching

56fb720

zschira requested a review from zaneselvans June 10, 2026 20:15

zaneselvans added 3 commits June 11, 2026 11:01

Merge branch 'main' into ferc-caching

1e67327

Fix ferc provenance script name in pytest workflow

391b1f7

Use built-in datapackage.to_json()

0b916d6

zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 11, 2026

Try to fix GitHub Actions caching of FERC outputs

815ad2b

zaneselvans added 2 commits June 11, 2026 21:10

Add caching of XBRL-derived DuckDB outputs.

2a17768

Always download DuckDB outputs.

c93fe51

zaneselvans requested changes Jun 12, 2026

github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject Jun 12, 2026

zschira added 4 commits June 15, 2026 17:32

Don't use nightly ferc cache in integration tests

7717617

Merge branch 'main' into ferc-caching

200e950

Clarify ferc parquet path naming

aa50863

Merge branch 'ferc-caching' of github.com:catalyst-cooperative/pudl i…

721aa95

…nto ferc-caching

zschira mentioned this pull request Jun 16, 2026

Improve FERC Caching Distribution Paths #5334

Open

3 tasks

zschira enabled auto-merge June 16, 2026 17:20

zaneselvans approved these changes Jun 16, 2026

zschira added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit 1a90019 Jun 16, 2026
9 checks passed

zschira deleted the ferc-caching branch June 16, 2026 18:22

github-project-automation Bot moved this from In progress to Done in Catalyst Megaproject Jun 16, 2026

zschira mentioned this pull request Jun 17, 2026

Fix 2026-06-17 build failure #5338

Merged
