Add FERC provenance metadata to datapackage#5264

Merged
zschira merged 67 commits into main from ferc-caching on Jun 16, 2026

Conversation

@zschira zschira commented May 19, 2026

Member

Overview

Closes #5220.

What problem does this address?

This PR allows us to use ferc_to_sqlite outputs in nightly builds as a cache for local development and in CI. It also directly caches outputs in CI to bypass nightly builds in CI runs when possible.

What did you change?

This PR adds FercSqliteProvenanceRecord to the datapackage.json associated with the outputs for a specific FERC Form and data format (DBF or XBRL). It then updates the ferc_to_sqlite flow so that it tries to reuse existing outputs, either local ones or those from nightly builds. It does so as follows:

  1. Check if there is a datapackage.json file locally. If there is none, go to step 3.
  2. Check if provenance in datapackage.json is compatible with requirements of current run. If so, go to step 6.
  3. Download datapackage.json from nightly builds and check provenance compatibility. If it is not compatible, go to step 5.
  4. Download all relevant outputs from nightly builds (just sqlite/datapackage for dbf; xbrl also includes parquet, duckdb, and taxonomy json), then go to step 6.
  5. Run extraction from scratch
  6. Return dagster materialization metadata with provenance metadata

Documentation

Make sure to update relevant aspects of the documentation:

  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

  • Added unit tests for writing / reading metadata to / from SQLite DBs + compatibility checks using this new cached metadata

To-do list

  • Add a way to force re-extraction

@zschira zschira self-assigned this May 19, 2026
@zschira zschira added ferc1 Anything having to do with FERC Form 1 ferc714 Anything having to do with FERC Form 714 sqlite Issues related to interacting with sqlite databases labels May 19, 2026
@zschira zschira added metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. ferc2 Issues related to the FERC Form 2 dataset xbrl Related to the FERC XBRL transition ferc6 ferc60 labels May 19, 2026
@zschira zschira marked this pull request as ready for review May 20, 2026 15:20
@zschira zschira moved this from New to In review in Catalyst Megaproject May 20, 2026
Comment thread src/pudl/dagster/provenance.py Outdated
@classmethod
def from_sqlite(cls, sqlite_path: Path) -> "FercSqliteProvenanceRecord":
"""Read SQLite provenance metadata from DB."""
try:

Member Author

This is wrapped in a try / except to handle cases where the DB doesn't exist or the _provenance_metadata table doesn't exist (for DBs created before this change).

Comment thread src/pudl/dagster/provenance.py Outdated
f"required={sorted(required_years)}"
)

if stored.ferc_xbrl_extractor_version != provenance.ferc_xbrl_extractor_version:

Member Author

I made this also check the version of catalystcoop.ferc_xbrl_extractor if data_format is xbrl. This seemed important if we're going to rely on cached DBs.

Comment thread src/pudl/dagster/provenance.py
)
return

if instance is None:

Member Author

Separated out the "get provenance from instance" so we can call this method with provenance records from SQLite as well.

@zaneselvans zaneselvans self-requested a review May 20, 2026 17:06

@zaneselvans zaneselvans left a comment

Member

Some things that seem like bugs (which maybe also point at new tests to add):

  • We unconditionally delete existing SQLite and DuckDB databases, which causes downstream problems: we can never use the local SQLite DB even if it would have been compatible, and we lose the DuckDB outputs, which are now only conditionally re-created. There's also no management of the Parquet outputs in here at all, which seems sketchy. The filesystem consequences of all the materialization paths should be identical.
  • We don't set the ferc_xbrl_extractor_version in the XBRL case, but we do set it in the DBF case. Seems like they got switched accidentally.
  • Not necessarily a bug but the sqlite_path will always be a local path that pertains only to the system that the DB was created on, so it seems like we probably should not be writing it into the DB and passing it around and relying on it -- it was okay when this was just part of the Dagster metadata and could never move out of that context, but now it seems like it should maybe be handled separately.
  • Not a bug but it sure would be nice if we could check the provenance without needing to download hundreds of megabytes of database first -- this is something we can do with DuckDB if/when we switch to using that as the primary input. And actually we're already producing the DuckDB databases in the nightly builds, so if we wanted to, we could read the provenance from the DuckDB outputs (if it's being written there, which I think we should do in order to keep the two DBs equivalent) and use that to decide whether we want to download the SQLite outputs (since in the case of the nightly build outputs we'll have very high confidence that the two DBs are from the same run).

Side note: if/when we switch to Parquet being the primary output, where do we store the provenance, and how do we ensure that it is valid for all of the hundreds of tables simultaneously?

There's a lot in here. Let me know if you would like to talk through it and come up with a todo list together.

Comment thread docs/dev/clone_ferc1.rst Outdated
Comment thread docs/dev/clone_ferc1.rst
Comment thread docs/dev/clone_ferc1.rst
Comment on lines +122 to +123
* The version of ``catalystcoop.ferc_xbrl_extractor`` that was used to perform the
conversion when considering XBRL outputs.

Member

Another good argument for splitting out a ferc_dbf_extractor package that depends on dbfread and takes care of the extraction logic to the same degree as ferc_xbrl_extractor is that then we would have a clear "did the code change?" marker for both the XBRL and DBF provenance which would be nice.

data_config=ferc_to_sqlite,
sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"),
)
provenance.to_sqlite()

Member

It would be great to add a to_duckdb() in here too, so the two DBs are equivalent, and we can read the provenance remotely instead of needing to download the whole SQLite to determine whether it'll work.

Member Author

I think that switching to adding this to duckdb makes sense, however we're not currently producing duckdb files for the DBF outputs, so it would add more complexity to have to handle each case differently. I lean towards creating a followup issue to update the DBF extraction to output a duckdb file, then switch to reading provenance from duckdb.

dataset=form, data_format="xbrl"
),
data_config=ferc_to_sqlite,
sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"),

Member

I can see how storing the sqlite_path in here is convenient, but it also seems brittle, since that absolute path will only make sense in the context in which the provenance was generated, and we are going to be passing the SQLite DBs around between nightly builds and our development machines... as well as distributing them to the outside world.

Is there a need to store the DB path inside the provenance object? Or could it exist ephemerally alongside it here in code, where we know it is relevant? It made more sense when the provenance only existed in the context of the Dagster metadata and particular Dagster instance / run, but it's not really a portable value. Maybe we make it optional and just don't write it to the database table, but do allow it to exist in the Dagster metadata?

Member Author

I've just removed sqlite_path altogether. I don't see a lot of benefit from adding it to just the dagster metadata since we always use the same path anyway.

@@ -134,33 +259,45 @@ def _asset(context) -> dg.MaterializeResult[str]:
if duckdb_path.exists():

Member

The duckdb database is unconditionally deleted if it exists, and we assume that it'll get recreated below. But now convert_form is only executed conditionally, depending on whether we're able to find a compatible SQLite DB. So if we do find a cached SQLite DB we never get the DuckDB database back.

Member Author

I've moved this so it will only get deleted if the provenance check fails.
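
The ordering being described, deleting existing outputs only once it is known they will be rebuilt, might look something like this. `clear_stale_outputs` is a hypothetical helper and the boolean flag stands in for the result of the provenance check.

```python
from pathlib import Path


def clear_stale_outputs(paths: list[Path], cache_is_compatible: bool) -> list[Path]:
    """Delete existing outputs only when they are about to be re-created."""
    if cache_is_compatible:
        # Cached outputs will be reused as-is, so leave every file in place.
        return []
    removed = []
    for path in paths:
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```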

Comment thread src/pudl/dagster/provenance.py
Comment on lines +116 to +117
f"Provenance metadata for local version of {sqlite_path.name} is incompatible."
" Downloading version from nightly builds."

Member

I think this message will get displayed also if there just wasn't any SQLite DB locally, which is a little misleading -- would be nice to differentiate between "There's no local DB" and "The local DB is incompatible." for the user.

Member Author

from_sqlite and ferc_sqlite_provenance_is_compatible have their own logs that should distinguish these cases, so I've just limited the log here.

Comment thread src/pudl/dagster/provenance.py Outdated

Member

If we're no longer going to allow skipping, then we should probably remove the "skipped" option from this literal.

Member Author

Removed.

Comment thread src/pudl/dagster/provenance.py
@github-project-automation github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject May 21, 2026
zschira commented Jun 10, 2026

Member Author

@zaneselvans this should be ready for another look:

> I tried to test the path of refreshing my local FERC outputs from S3 and it failed with an error about the requested file not being present in the ZIP archive, see stack trace in comments.

This is fixed now!

> I think the GitHub caching will get stuck pulling from S3 any time the FERC XBRL Extractor version changes, and also any change to the Zenodo DOIs file will require all FERC DBs to be pulled from S3. I left a draft script & GHA workflow in a comment that I think can let us do per-DB, per-format cache refreshes by constructing the specific provenance metadata for hashing, and keep track of the FERC XBRL Extractor version as well.

I'm using the script you developed for generating better cache keys, thanks!
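
For readers unfamiliar with the idea, a per-DB, per-format cache key derived from provenance metadata could be built along these lines. This is not the script referenced above: `ferc_cache_key`, the key layout, and the example field names are all assumptions for illustration.

```python
import hashlib
import json


def ferc_cache_key(form: str, data_format: str, provenance: dict) -> str:
    """Build a cache key that changes whenever the provenance changes."""
    # Sorting keys makes the serialization, and therefore the hash, stable.
    canonical = json.dumps(provenance, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{form}-{data_format}-{digest}"
```

Because only the provenance for one form and format feeds the hash, a change that affects a single database would invalidate only that database's cache entry.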

> Because we are now downloading all of the outputs (SQLite, DuckDB, Parquet) when refreshing a local cache (not in CI) if we download the individual Parquet files, that will contaminate the Eel Hole usage metrics. However, we can avoid that by downloading the single zipfile containing all of the parquet files instead.

Now downloading parquet outputs from zipfile.
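
Unpacking Parquet files from a single archive, once it has been downloaded, needs nothing beyond the standard library. The sketch below assumes the archive is already on local disk; `extract_parquet_from_zip` is a hypothetical helper, not a function from this PR.

```python
import zipfile
from pathlib import Path


def extract_parquet_from_zip(archive: Path, dest_dir: Path) -> list[Path]:
    """Extract every .parquet member of `archive` into `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        # Iterate over the names actually present rather than requesting
        # expected names, which raises KeyError when a member is missing.
        for name in zf.namelist():
            if name.endswith(".parquet"):
                extracted.append(Path(zf.extract(name, dest_dir)))
    return extracted
```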

> I would like you to add a field in the Dagster materialization record that indicates which path was taken to materialize the FERC assets (build from scratch, local DB, or download from S3)

Added to the FercProvenanceRecord now

> Please update the PR title and description to reflect what has actually ended up happening in here, so we don't confuse ourselves later. Doesn't need to be super long.

Done

> Please update the test coverage requirement to reflect the actual test coverage we expect to see when the FERC DB materialization bypasses the extraction code (In pyproject.toml --> tool.coverage.report.fail_under).

Did a rough estimate based on the number of lines in src/pudl/extract/dbf.py that are currently tested divided by the total lines that are tested and found this should result in ~1.5% drop in codecov. Also created an issue to track adding unit tests for dbf extraction (#5308). On that note, also added an issue for standardizing env variable checking (#5309).

@zschira zschira requested a review from zaneselvans June 10, 2026 20:15
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 11, 2026
codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 91.47059% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.21%. Comparing base (0c76673) to head (7717617).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|:-------------------------|--------:|:------|
| src/pudl/scripts/generate_ferc_provenance.py | 42.86% | 16 Missing ⚠️ |
| src/pudl/dagster/assets/raw/ferc_to_sqlite.py | 92.86% | 8 Missing ⚠️ |
| src/pudl/dagster/provenance.py | 92.73% | 4 Missing ⚠️ |
| tests/unit/dagster/io_managers_test.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #5264    +/-   ##
========================================
  Coverage   93.21%   93.21%            
========================================
  Files         241      242     +1     
  Lines       20357    20534   +177     
========================================
+ Hits        18975    19140   +165     
- Misses       1382     1394    +12     

zaneselvans commented Jun 11, 2026

Member

I'm still working on the review, but in testing the GitHub Actions caching, I've run into a real issue.

When the integration tests pull the FERC databases from the nightly build outputs, they're getting FULL databases with all of the years of data in them, but then the PUDL ETL that runs is a FAST ETL that only expects to be working with databases that have a few years of data in them. The provenance is compatible -- all of the required years of input data are available -- but our FAST ETL has never been run under these conditions, and there are apparently significant side effects. The way that those side effects are showing up right now is in the dbt data validation tests for out_ferc714__hourly_planning_area_demand -- for some reason pulling from the FULL FERC DBs rather than the fast FERC DBs results in unexpectedly high nullness in some columns:

E           AssertionError: failure contexts:
E           source_expect_missingness_between_pudl_out_ferc714__hourly_planning_area_demand_demand_imputed_pudl_mwh__0_0__0_05:
E           
E           |   total_records |   null_records |   null_proportion |
E           |----------------:|---------------:|------------------:|
E           |         2395657 |         814717 |          0.340081 |
E           =====
E           source_expect_missingness_between_pudl_out_ferc714__hourly_planning_area_demand_respondent_id_ferc714_csv__0_0__0_025:
E           
E           |   total_records |   null_records |   null_proportion |
E           |----------------:|---------------:|------------------:|
E           |         2395657 |          70174 |         0.0292922 |

I imagine there are other side-effects that we're just not catching. I don't know how serious they are.
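
To make the failure concrete, the proportions in the tables above can be recomputed from the record counts. This sketch assumes (the thread does not confirm it) that the trailing numbers in each test name are the allowed lower and upper missingness bounds, i.e. 0.0 to 0.05 and 0.0 to 0.025. `null_proportion` and `within_bounds` are hypothetical helpers.

```python
def null_proportion(null_records: int, total_records: int) -> float:
    """Fraction of records whose value is null."""
    return null_records / total_records


def within_bounds(value: float, lower: float, upper: float) -> bool:
    """Check that a missingness value falls inside the inclusive range."""
    return lower <= value <= upper


# Counts copied from the failure output above.
demand = null_proportion(814_717, 2_395_657)
respondent = null_proportion(70_174, 2_395_657)

# Bounds inferred from the test names (an assumption, see the note above).
demand_ok = within_bounds(demand, 0.0, 0.05)
respondent_ok = within_bounds(respondent, 0.0, 0.025)
```

Under that reading, both columns exceed their upper bound: the imputed demand column by a wide margin, and the respondent ID column only slightly.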

For debugging purposes I think you can recreate this situation locally in the Dagster UI by running a normal extraction of all the FERC databases, and then trying to materialize all in the pudl job after setting the GlobalDataConfigPath to the etl_fast.yml path. I'm trying that now.

Now that there are provenance-compatible datapackages under eel-hole (because I uploaded them directly) I think that pixi run pytest-integration would also fail with the same error if run locally. Also testing that locally now.

zaneselvans commented Jun 11, 2026

Member

@zschira Another real issue: we DO actually use the DuckDB outputs in the integration tests, because we are validating that the DuckDB + SQLite outputs are the same in the integration tests. So we'll need to download and cache those in CI as well.

2026-06-11 21:49:20 [    INFO] tests.integration.extract.ferc_xbrl_extract_test:49 Comparing ferc1 SQLite vs. DuckDB outputs...

For the sake of reproducibility and not getting surprised by CI failures that Worked Fine On My Machine we might want to force the integration tests to engage exactly the same caching behavior regardless of where they are run. Which could mean skipping Parquet downloads even when run locally. Or downloading the Parquet on GitHub if there's space (which would mean everything is always using the full FERC outputs). In which case I think the right place to control the behavior might be tests/integration/conftest.py? Which I don't think currently exists. Ideally we want all of these to do the same thing, axiomatically:

  • pixi run pytest-integration
  • pixi run pytest tests/integration/my_test.py
  • whatever is being run in GitHub CI

Or maybe we just make it the Fast ETL in general -- since the only real application of the fast ETL is testing?

@zaneselvans zaneselvans left a comment

Member

Blocking

❌ Fast ETL can't consume Full ETL FERC inputs

The Fast ETL currently doesn't work when run with FERC inputs from the Full ETL, which means CI will always fail if it tries to use outputs from S3. As you suggested, a quick fix for this is to never use the S3 outputs in CI and rely entirely on GitHub caching.

This doesn't feel great though, since the provenance data / logic pretty explicitly states "A superset of years is fine!" but actually it is not fine. Do we have any idea how hard this is to fix? Or whether this mismatch causes other downstream problems? Or whether having the "report_year" embedded in the downstream data would be helpful for other things?

❌ Need both SQLite & DuckDB in CI

We do actually need both SQLite and DuckDB outputs in CI, because the integration tests check for their equivalence. I've added the DuckDB files to those that are cached on GitHub and what gets downloaded from nightly in CI. However, this still needs to be tested, and to do so we'll need to purge the GitHub cache somehow.

If this causes disk space issues, we can fall back on skipping the SQLite vs. DuckDB equivalence check if os.getenv("GITHUB_ACTIONS", False).

Non-blocking

Request to rename {local,nightly}_parquet_dir_path to {local,nightly}_parquet_path since it's a zip archive in nightly, not a dir.

Comment thread docs/dev/clone_ferc1.rst
Comment thread src/pudl/dagster/io_managers.py
Comment thread pyproject.toml
sort = "miss"
skip_empty = true
-fail_under = 93
+fail_under = 91

Member

Oh wow, I thought it would drop much more than this!

Comment thread pyproject.toml
resource_description = "pudl.scripts.resource_description:main"
update_zenodo_dois = "pudl.scripts.update_zenodo_dois:main"
zenodo_data_release = "pudl.scripts.zenodo_data_release:main"
generate_ferc_provenance = "pudl.scripts.generate_ferc_provenance:main"

Member

Note that the script being called in pytest.yml in the integration test caching step is ferc_provenance not generate_ferc_provenance

Comment on lines +249 to +251
${{ env.PUDL_OUTPUT }}/ferc1_xbrl.sqlite
${{ env.PUDL_OUTPUT }}/ferc1_xbrl_datapackage.json
${{ env.PUDL_OUTPUT }}/ferc1_xbrl_taxonomy_metadata.json

Member

I had to tell the integration tests to use a non-temporary PUDL_OUTPUT so that we would have a deterministic path to cache and restore to.

I tested this:

  • ✅ With nothing in the cache.
  • ✅ With fast-ETL outputs from a previous CI run in the cache.
  • ❌ With nothing in the cache and "compatible" outputs in S3.

It turns out that the S3 nightly (full ETL) outputs aren't really compatible with running the fast ETL downstream.

We also have an integration test that depends on having both the SQLite and DuckDB outputs available (so that we can verify that they are equivalent) which fails if using cached outputs right now, because we're not caching the DuckDB outputs. And will also fail when the S3 caching is working, unless we also download the DuckDB outputs.

I went ahead and added DuckDB to the outputs that should be cached on GitHub and downloaded from nightly, but I don't think that will fix the CI for now, because the cache key hasn't changed and it also doesn't have the DuckDB outputs to load from the cache yet.

Comment thread tests/conftest.py
Comment on lines +670 to +671
if not request.config.getoption("--live-pudl-output") and not os.getenv(
"GITHUB_ACTIONS", False

Member

Use a predictable, non-temporary PUDL_OUTPUT directory for pytest if we're on GitHub, so we can use the GitHub cache.
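
The decision in that condition can be isolated into a small function for clarity. This is a sketch only: `pick_output_dir` and its parameters are hypothetical, and the environment is passed in as a mapping so the logic can be exercised without touching `os.environ`.

```python
import tempfile
from collections.abc import Mapping
from pathlib import Path


def pick_output_dir(
    live_pudl_output: bool, environ: Mapping[str, str], configured: Path
) -> Path:
    """Return the directory a test session should write its outputs to."""
    # Like os.getenv("GITHUB_ACTIONS", False), any non-empty value counts
    # as set, including the string "false".
    on_github = bool(environ.get("GITHUB_ACTIONS"))
    if not live_pudl_output and not on_github:
        # Local default: isolate the session in a throwaway directory.
        return Path(tempfile.mkdtemp(prefix="pudl-output-"))
    # Live outputs or CI: keep the configured path so that a cache can be
    # restored to, and saved from, a deterministic location.
    return configured
```

In a real `conftest.py` the mapping would typically be `os.environ`. GitHub Actions sets `GITHUB_ACTIONS` to `true` on its runners.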

local_taxonomy_json_path: Path | None = None
nightly_taxonomy_json_path: UPath | None = None
local_parquet_dir_path: Path | None = None
nightly_parquet_dir_path: UPath | None = None

Member

The _dir_ here feels a little misleading because it's a zipfile, not a directory. Would it be disruptive to change it to local_parquet_path and nightly_parquet_path instead?

Member Author

Changed to just parquet_path and added comments indicating that the local version points to a directory of parquet and nightly points to a zipfile.

accepts a regular ``Path``, as we should never try to write directly to s3.
"""
if datapackage_path.exists():
json_dict = json.loads(datapackage_path.read_text())

Member

And because this is reading directly from the UPath, we don't have to worry about clobbering the local datapackage with the nightly outputs before we actually know if it's compatible, right?

Member Author

Yeah this will just read into memory and we overwrite the datapackage later if it's determined to be compatible.
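
The read-then-maybe-overwrite pattern described in this exchange can be sketched as below. The PR reads from a `UPath`; this example uses a plain `Path` for the candidate so that it stays within the standard library, and `refresh_datapackage` along with the `is_compatible` callback are hypothetical.

```python
import json
from collections.abc import Callable
from pathlib import Path


def refresh_datapackage(
    candidate_path: Path,
    local_path: Path,
    is_compatible: Callable[[dict], bool],
) -> bool:
    """Overwrite the local datapackage only if the candidate is compatible."""
    # The candidate is parsed into memory first, so nothing on local disk
    # has been touched by the time the compatibility decision is made.
    candidate = json.loads(candidate_path.read_text())
    if not is_compatible(candidate):
        return False
    local_path.write_text(json.dumps(candidate, indent=2))
    return True
```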

@github-project-automation github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject Jun 12, 2026
zschira commented Jun 16, 2026

Member Author

Ok @zaneselvans and I've changed this to not use the nightly cache during integration tests. I also cleared the ferc cache in blacksmith and re-ran the integration tests twice. It passed both times and ran the full extraction the first time and used the cache the second time.

Also added a new issue to start caching the etl-fast outputs (#5334)

@zschira zschira added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit 1a90019 Jun 16, 2026
9 checks passed
@zschira zschira deleted the ferc-caching branch June 16, 2026 18:22
@github-project-automation github-project-automation Bot moved this from In progress to Done in Catalyst Megaproject Jun 16, 2026
Labels

ferc1 Anything having to do with FERC Form 1 ferc2 Issues related to the FERC Form 2 dataset ferc6 ferc60 ferc714 Anything having to do with FERC Form 714 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. sqlite Issues related to interacting with sqlite databases xbrl Related to the FERC XBRL transition

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Attach FERC provenance metadata to extracted SQLite DB's

2 participants