Overview
We've decided that for the MVP of the SEC data integration into PUDL, we will keep this codebase separate, and the extracted SEC data will feed into PUDL as a raw input. To keep this code maintainable and well tested, and to keep the data updated as new filings become available, we will need some infrastructure and automation development.
Components
Archival
I developed an archiver in the pudl-archiver repo, but it uses a fairly experimental GCS backend for storage that functions differently from our Zenodo-backed archivers. This worked for initially populating the cloud bucket with filings, but it may need to change to support regular updates. For example, Zenodo provides versioning, staging, and a testing environment, which all help make the archival process safe and reproducible. On GCS we will probably need to come up with our own strategy for these features.
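To make the versioning and staging question more concrete, here is a minimal sketch of one possible strategy, not a description of how the existing archiver works. Everything in it is an assumption made for illustration: the `staging/` and `versions/<version>/` prefixes, the `manifest.json` and `latest.json` objects, and the `InMemoryBucket` class, which stands in for a real GCS bucket so the example runs without credentials.

```python
import hashlib
import json


class InMemoryBucket:
    """Hypothetical stand-in for a GCS bucket: maps object names to bytes."""

    def __init__(self):
        self.objects = {}

    def write(self, name: str, data: bytes) -> None:
        self.objects[name] = data

    def read(self, name: str) -> bytes:
        return self.objects[name]

    def delete(self, name: str) -> None:
        del self.objects[name]

    def list_names(self, prefix: str) -> list[str]:
        return sorted(n for n in self.objects if n.startswith(prefix))


def stage(bucket: InMemoryBucket, filename: str, data: bytes) -> None:
    """Write a file to the staging area without touching published versions."""
    bucket.write(f"staging/{filename}", data)


def publish(bucket: InMemoryBucket, version: str) -> dict:
    """Promote everything in staging to a new, never-overwritten version."""
    if bucket.list_names(f"versions/{version}/"):
        raise ValueError(f"version {version!r} is already published")
    staged = bucket.list_names("staging/")
    if not staged:
        raise ValueError("nothing is staged")

    manifest = {"version": version, "files": {}}
    for name in staged:
        data = bucket.read(name)
        filename = name.removeprefix("staging/")
        bucket.write(f"versions/{version}/{filename}", data)
        # Record a checksum so a later run can verify or diff the archive.
        manifest["files"][filename] = hashlib.sha256(data).hexdigest()

    bucket.write(
        f"versions/{version}/manifest.json",
        json.dumps(manifest, sort_keys=True).encode(),
    )
    # Move the "latest" pointer only after the version is fully written.
    bucket.write("latest.json", json.dumps({"version": version}).encode())
    for name in staged:
        bucket.delete(name)
    return manifest


bucket = InMemoryBucket()
stage(bucket, "filing-a.txt", b"first filing")
stage(bucket, "filing-b.txt", b"second filing")
manifest = publish(bucket, "v1")
```

Under these assumptions, downstream code would read `latest.json` to find the current version, and a failed or partial update would leave earlier versions untouched. A separate test bucket could play the role of Zenodo's testing environment.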
As an aside, we somewhat regularly recreate basic archiving infrastructure in various client projects and smaller "side" projects like this. If we made the archiver a library that could be used as a dependency in any repo, we could transition to shared tooling rather than always reinventing the wheel. This is not necessary for Mozilla: the SEC data will become a PUDL input, so it would make some sense to keep the archiver in the repo like all existing ones (with a separate backend). If we have time, though, this project could be a good place to add this functionality and demonstrate its use.
Tasks:
Extraction infrastructure
We've mostly been doing rapid prototyping in this repo and developing tooling along the way. We should start deciding what we want the final design and infrastructure of the SEC extraction to look like, and start working towards that. For example, we may want to transition to Dagster in this codebase to keep tooling and design patterns consistent with the rest of our work. We also need to decide how frequently the various components of the extraction need to run, and how automated that process needs to be.
Tasks: