Add FERC provenance metadata to datapackage#5264
Conversation
| @classmethod | ||
| def from_sqlite(cls, sqlite_path: Path) -> "FercSqliteProvenanceRecord": | ||
| """Read SQLite provenance metadata from DB.""" | ||
| try: |
There was a problem hiding this comment.
This is wrapped in a try / except to handle cases where the DB doesn't exist or the _provenance_metadata table doesn't exist (for DBs created before this change).
| f"required={sorted(required_years)}" | ||
| ) | ||
|
|
||
| if stored.ferc_xbrl_extractor_version != provenance.ferc_xbrl_extractor_version: |
There was a problem hiding this comment.
I made this also check the version of catalystcoop.ferc_xbrl_extractor if data_format is xbrl. This seemed important if we're going to rely on cached DBs.
| ) | ||
| return | ||
|
|
||
| if instance is None: |
There was a problem hiding this comment.
Separated out the "get provenance from instance" so we can call this method with provenance records from SQLite as well.
zaneselvans
left a comment
There was a problem hiding this comment.
Some things that seem like bugs (which maybe also point at new tests to add):
- We unconditionally delete existing SQLite and DuckDB databases, which causes downstream problems, either because that means we can't ever use the local SQLite even if it would have been compatible, and we lose the DuckDB outputs which are now only conditionally re-created. There's also no management of the Parquet outputs in here at all which seems sketchy. The filesystem consequences of all the materialization paths should be identical.
- We don't set the ferc_xbrl_extractor_version in the XBRL case, but we do set it in the DBF case. Seems like they got switched accidentally.
- Not necessarily a bug but the
sqlite_pathwill always be a local path that pertains only to the system that the DB was created on, so it seems like we probably should not be writing it into the DB and passing it around and relying on it -- it was okay when this was just part of the Dagster metadata and could never move out of that context, but now it seems like it should maybe be handled separately. - Not a bug but it sure would be nice if we could check the provenance without needing to download hundreds of megabytes of database first -- this is something we can do with DuckDB if/when we switch to using that as the primary input. And actually we're already producing the DuckDB databases in the nightly builds, so if we wanted to, we could read the provenance from the DuckDB outputs (if it's being written there, which I think we should do in order to keep the two DBs equivalent) and use that to decide whether we want to download the SQLite outputs (since in the case of the nightly build outputs we'll have very high confidence that the two DBs are from the same run).
Side note: if/when we switch to Parquet being the primary output, where do we store the provenance, and how do we ensure that it is valid for all of the hundreds of tables simultaneously?
There's a lot in here. Let me know if you would like to talk through it and come up with a todo list together.
| * The version of ``catalystcoop.ferc_xbrl_extractor`` that was used to perform the | ||
| conversion when considering XBRL outputs. |
There was a problem hiding this comment.
Another good argument for splitting out a ferc_dbf_extractor package that depends on dbfread and takes care of the extraction logic to the same degree as ferc_xbrl_extractor is that then we would have a clear "did the code change?" marker for both the XBRL and DBF provenance which would be nice.
| data_config=ferc_to_sqlite, | ||
| sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"), | ||
| ) | ||
| provenance.to_sqlite() |
There was a problem hiding this comment.
It would be great to add a to_duckdb() in here too, so the two DBs are equivalent, and we can read the provenance remotely instead of needing to download the whole SQLite to determine whether it'll work.
There was a problem hiding this comment.
I think that switching to adding this to duckdb makes sense, however we're not currently producing duckdb files for the DBF outputs, so it would add more complexity to have to handle each case differently. I lean towards creating a followup issue to update the DBF extraction to output a duckdb file, then switch to reading provenance from duckdb.
| dataset=form, data_format="xbrl" | ||
| ), | ||
| data_config=ferc_to_sqlite, | ||
| sqlite_path=PudlPaths().sqlite_db_path(f"{form}_xbrl"), |
There was a problem hiding this comment.
I can see how storing the sqlite_path in here is convenient, but it also seems brittle, since that absolute path will only make sense in the context in which the provenance was generated, and we are going to be passing the SQLite DBs around between nightly builds and our development machines... as well as distributing them to the outside world.
Is there a need to store the DB path inside the provenance object? Or could it exist ephemerally alongside it here in code, where we know it is relevant? It made more sense when the provenance only existed in the context of the Dagster metadata and particular Dagster instance / run, but it's not really a portable value. Maybe we make it optional and just don't write it to the database table, but do allow it to exist in the Dagster metadata?
There was a problem hiding this comment.
I've just removed sqlite_path altogether. I don't see a lot of benefit from adding it to just the dagster metadata since we always use the same path anyway.
| @@ -134,33 +259,45 @@ def _asset(context) -> dg.MaterializeResult[str]: | |||
| if duckdb_path.exists(): | |||
There was a problem hiding this comment.
The duckdb database is unconditionally deleted if it exists, and we assume that it'll get recreated below. But now convert_form is only executed conditionally, depending on whether we're able to find a compatible SQLite DB. So if we do find a cached SQLite DB we never get the DuckDB database back.
There was a problem hiding this comment.
I've moved this so it will only get deleted if the provenance check fails.
| f"Provenance metadata for local version of {sqlite_path.name} is incompatible." | ||
| " Downloading version from nightly builds." |
There was a problem hiding this comment.
I think this message will get displayed also if there just wasn't any SQLite DB locally, which is a little misleading -- would be nice to differentiate between "There's no local DB" and "The local DB is incompatible." for the user.
There was a problem hiding this comment.
from_sqlite and ferc_sqlite_provenance_is_compatible have their own logs that should distinguish these cases, so I've just limited the log here.
There was a problem hiding this comment.
If we're no longer to allow skipping, then we should probably remove the "skipped" option from this literal.
|
@zaneselvans this should be ready for another look:
This is fixed now!
I'm using the script you developed for generating better cache keys, thanks!
Now downloading parquet outputs from zipfile.
Added to the
Done
Did a rough estimate based on the number of lines in |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5264 +/- ##
========================================
Coverage 93.21% 93.21%
========================================
Files 241 242 +1
Lines 20357 20534 +177
========================================
+ Hits 18975 19140 +165
- Misses 1382 1394 +12 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
I'm still working on the review, but in testing the GitHub Actions caching, I've run into a real issue. When the integration tests pull the FERC databases from the nightly build outputs, they're getting FULL databases with all of the years of data in them, but then the PUDL ETL that runs is a FAST ETL that only expects to be working with databases that have a few years of data in them. The provenance is compatible -- all of the required years of input data are available -- but our FAST ETL has never been run under these conditions, and there are apparently significant side effects. The way that those side effects are showing up right now is in the I imagine there are other side-effects that we're just not catching. I don't know how serious they are. For debugging purposes I think you can recreate this situation locally in the Dagster UI by running a normal extraction of all the FERC databases, and then trying to materialize all in the Now that there are provenance-compatible datapackages under eel-hole (because I uploaded them directly) I think that |
|
@zschira Another real issue: we DO actually use the For the sake of reproducibility and not getting surprised by CI failures that Worked Fine On My Machine we might want to force the integration tests to engage exactly the same caching behavior regardless of where they are run. Which could mean skipping Parquet downloads even when run locally. Or downloading the Parquet on GitHub if there's space (which would mean everything is always using the full FERC outputs). In which case I think the right place to control the behavior might be
Or maybe we just make it the Fast ETL in general -- since the only real application of the fast ETL is testing? |
zaneselvans
left a comment
There was a problem hiding this comment.
Blocking
❌ Fast ETL can't consume Full ETL FERC inputs
The Fast ETL currently doesn't work when run with FERC inputs from the Full ETL, which means CI will always fail if it tries to use outputs from S3. As you suggested, a quick fix for this is to never use the S3 outputs in CI and rely entirely on GitHub caching.
This doesn't feel great though, since the provenance data / logic pretty explicitly states "A superset of years is fine!" but actually it is not fine. Do we have any idea how hard this is to fix? Or whether this mismatch causes other downstream problems? Or whether having the "report_year" embedded in the downstream data would be helpful for other things?
❌ Need both SQlite & DuckDB in CI
We do actually need both SQLite and DuckDB outputs in CI, because the integration tests check for their equivalence. I've added the DuckDB files to those that are cached on GitHub and what gets downloaded from nightly in CI. However, this still needs to be tested, and to do so we'll need to purge the GitHub cache somehow.
If this causes disk space issues, we can fall back on skipping the SQLite vs. DuckDB equivalence if os.getnev("GITHUB_ACTIONS", False).
Non-blocking
Request to rename {local,nightly}_parquet_dir_path to {local,nightly}_parquet_path since it's a zip archive in nightly, not a dir.
| sort = "miss" | ||
| skip_empty = true | ||
| fail_under = 93 | ||
| fail_under = 91 |
There was a problem hiding this comment.
Oh wow, I thought it would drop much more than this!
| resource_description = "pudl.scripts.resource_description:main" | ||
| update_zenodo_dois = "pudl.scripts.update_zenodo_dois:main" | ||
| zenodo_data_release = "pudl.scripts.zenodo_data_release:main" | ||
| generate_ferc_provenance = "pudl.scripts.generate_ferc_provenance:main" |
There was a problem hiding this comment.
Note that the script being called in pytest.yml in the integration test caching step is ferc_provenance not generate_ferc_provenance
| ${{ env.PUDL_OUTPUT }}/ferc1_xbrl.sqlite | ||
| ${{ env.PUDL_OUTPUT }}/ferc1_xbrl_datapackage.json | ||
| ${{ env.PUDL_OUTPUT }}/ferc1_xbrl_taxonomy_metadata.json |
There was a problem hiding this comment.
I had to tell the integration tests to use a non-temporary PUDL_OUTPUT so that we would have a deterministic path to cache and restore to.
I tested this:
- ✅ With nothing in the cache.
- ✅ With fast-ETL outputs from a previous CI run in the cache.
- ❌ With nothing in the cache and "compatible" outputs in S3.
It turns out that the S3 nightly (full ETL) outputs aren't really compatible with running the fast ETL downstream.
We also have an integration test that depends on having both the SQLite and DuckDB outputs available (so that we can verify that they are equivalent) which fails if using cached outputs right now, because we're not caching the DuckDB outputs. And will also fail when the S3 caching is working, unless we also download the DuckDB outputs.
I went ahead and added DuckDB to the outputs that should be cached on GitHub and downloaded from nightly, but I don't think that will fix the CI for now, because the cache key hasn't changed and it also doesn't have the DuckDB outputs to load from the cache yet.
| if not request.config.getoption("--live-pudl-output") and not os.getenv( | ||
| "GITHUB_ACTIONS", False |
There was a problem hiding this comment.
Use a predictable, non-temporary PUDL_OUTPUT directory for pytest if we're on GitHub, so we can use the GitHub cache.
| local_taxonomy_json_path: Path | None = None | ||
| nightly_taxonomy_json_path: UPath | None = None | ||
| local_parquet_dir_path: Path | None = None | ||
| nightly_parquet_dir_path: UPath | None = None |
There was a problem hiding this comment.
The _dir_ here feels a little misleading because it's a zipfile, not a directory. Would it be disruptive to change it to local_parquet_path and nightly_parquet_path instead?
There was a problem hiding this comment.
Changed to just parquet_path and added comments indicating that the local version points to a directory of parquet and nightly points to a zipfile.
| accepts a regular ``Path``, as we should never try to write directly to s3. | ||
| """ | ||
| if datapackage_path.exists(): | ||
| json_dict = json.loads(datapackage_path.read_text()) |
There was a problem hiding this comment.
And because this is reading directly from the UPath, we don't have to worry about clobbering the local datapackage with the nightly outputs before we actually know if it's compatible, right?
There was a problem hiding this comment.
Yeah this will just read into memory and we overwrite the datapackage later if it's determined to be compatible.
|
Ok @zaneselvans and I've changed this to not use the Also added a new issue to start caching the |
Overview
Closes #5220.
What problem does this address?
This PR allows us to use
ferc_to_sqliteoutputs in nightly builds as a cache for local development and in CI. It also directly caches outputs in CI to bypass nightly builds in CI runs when possible.What did you change?
This PR adds
FercSqliteProvenanceRecordto thedatapckage.jsonassociated with the outputs for a specific FERC Form and data format (DBF or XBRL). It then updates theferc_to_sqlitefllow so it will try to use existing outputs either locally from nightly builds. It does so as follows:datapackage.jsonfile locallydatapackage.jsonis compatible with requirements of current run. If so go to step 6.datapackage.jsonfrom nightly builds and check provenance compatibility. If not go to step 5.Documentation
Make sure to update relevant aspects of the documentation:
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list