Develop infrastructure for MVP version of PUDL integration #57

# Overview

We've decided that for the MVP version of SEC data integration into PUDL, we will keep this codebase separate, and extracted SEC data will feed into PUDL as a raw input. To keep this code maintainable and well tested, and to make sure the data is updated as new filings become available, we will need a certain level of infrastructure/automation development.

## Components
### Archival
I developed an archiver in the [pudl-archiver repo](https://github.com/catalyst-cooperative/pudl-archiver), but it uses a fairly experimental GCS backend for storage that functions differently from our Zenodo-backed archivers. This worked for initially populating the cloud bucket with filings, but may need to change to enable regular updates. For example, Zenodo provides versioning, staging, and a testing environment, which all help make the archival process safe and reproducible. On GCS we will probably need to come up with our own strategy for these features.

As an aside, we fairly regularly recreate basic archiving infrastructure in client projects and smaller "side" projects like this one. If we made the archiver a library that could be used as a dependency in any repo, we could move to shared tooling rather than reinventing the wheel each time. This is not necessary for the Mozilla work: the SEC data will become a PUDL input, so it would make sense to keep the archiver in the pudl-archiver repo like all existing ones (with a separate backend). If we have time, though, this project could be a good place to add this functionality and demonstrate its use.

#### Tasks:

- [ ] Decide how to handle archive updates
    - [ ] Do we need a staging environment to avoid creating a bad archive environment?
    - [ ] How do we handle versioning? Do we want a clear lineage of change to raw archives?
- [ ] Implement backend changes based on update strategy
- [ ] Decide whether we want the SEC archiver to live in the pudl-archiver repo or attempt to transition to a library
- [ ] Set the archiver to run at least once per year
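
One possible shape for the staging and versioning questions above is sketched below. It is only an illustration under stated assumptions: a local directory stands in for the GCS bucket, and the `staging/`, `versions/`, and `LATEST` layout, along with every function name, is hypothetical rather than anything that exists in pudl-archiver. The idea is that a run writes into a staging prefix with a checksum manifest, and is only promoted to a new immutable version if validation passes.

```python
from __future__ import annotations

import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def stage(root: Path, run_id: str, files: dict[str, bytes]) -> Path:
    """Write files and a checksum manifest under staging/<run_id>/."""
    staging = root / "staging" / run_id
    staging.mkdir(parents=True)
    manifest = {}
    for name, data in files.items():
        (staging / name).write_bytes(data)
        manifest[name] = hashlib.sha256(data).hexdigest()
    (staging / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return staging


def validate(staging: Path) -> bool:
    """Check that every file listed in the manifest exists and matches its checksum."""
    manifest = json.loads((staging / "manifest.json").read_text())
    return bool(manifest) and all(
        (staging / name).is_file()
        and hashlib.sha256((staging / name).read_bytes()).hexdigest() == digest
        for name, digest in manifest.items()
    )


def promote(root: Path, run_id: str) -> str:
    """Move a validated staging run to versions/vN and point LATEST at it."""
    staging = root / "staging" / run_id
    if not validate(staging):
        raise ValueError(f"staging run {run_id!r} failed validation")
    versions = root / "versions"
    versions.mkdir(exist_ok=True)
    version = f"v{len(list(versions.iterdir())) + 1}"
    shutil.move(str(staging), str(versions / version))
    # Readers resolve the current archive through this pointer file,
    # so earlier versions stay untouched and the lineage is preserved.
    (root / "LATEST").write_text(version)
    return version


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    stage(root, "2024-annual", {"filing-0001.txt": b"example filing text"})
    print(promote(root, "2024-annual"))
```

A failed validation leaves `LATEST` pointing at the previous version, which is the main property a staging environment would need to provide here. On a real bucket the move would be a copy followed by a delete, since object stores have no directory rename.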

### Extraction infrastructure
We've mostly been doing rapid prototyping in this repo and developing tooling along the way. We should start deciding what we want the final design/infrastructure of the SEC extraction to look like and start working towards it. For example, we may want to transition to using Dagster in this codebase to keep tooling and design patterns consistent with the rest of our work. We also need to decide how frequently the various components of the extraction need to be run, and how automated that process needs to be.

#### Tasks:
- [x] #58
- [ ] Should we use Dagster? If yes, we'll need to do the following:
    - [ ] Turn the cloud interface into a Dagster resource
    - [ ] Create an asset for loading raw filings
    - [ ] Create an asset for a trained model that can be loaded from a cache
    - [ ] Create assets for basic 10-K/Ex. 21 extraction
- [ ] Decide how frequently all elements need to be refreshed and implement automation
    - [ ] How often should the model be retrained?
    - [ ] How often should we rerun the extraction?
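
Whether or not we adopt Dagster, the "model loaded from a cache" and "how often to retrain" items could be combined along the lines below. This is a minimal sketch, not the repo's actual model handling: `load_or_train`, the pickle-based cache, and the age threshold are all hypothetical names and choices made for illustration.

```python
from __future__ import annotations

import pickle
import tempfile
import time
from pathlib import Path
from typing import Any, Callable


def load_or_train(
    cache_path: Path,
    train: Callable[[], Any],
    max_age_seconds: float,
    now: float | None = None,
) -> tuple[Any, bool]:
    """Return (model, retrained).

    The cached model is reused while it is younger than max_age_seconds;
    otherwise train() is called and its result replaces the cache.
    """
    now = time.time() if now is None else now
    if cache_path.exists() and now - cache_path.stat().st_mtime <= max_age_seconds:
        with cache_path.open("rb") as f:
            return pickle.load(f), False
    model = train()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(model, f)
    return model, True


if __name__ == "__main__":
    cache = Path(tempfile.mkdtemp()) / "model.pkl"
    one_year = 365 * 24 * 60 * 60
    # Placeholder "model": any picklable object works for this sketch.
    model, retrained = load_or_train(cache, lambda: {"weights": [0.1, 0.2]}, one_year)
    print(retrained)
```

Keeping the cadence as a single parameter means the retraining decision stays a configuration choice, and the same function could later be wrapped in an asset if we move to Dagster.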


```[tasklist]
## Sub-issues
- [ ] #63 
```
