Add SEC 10-K data source documentation by zaneselvans · Pull Request #4562 · catalyst-cooperative/pudl

zaneselvans · 2025-08-22T23:58:45Z

Overview

Adds a data source documentation page for the SEC 10-K data.

Questions & Issues

Writing

Add intro to irregularities to provide context.
Make sure the original data & availability sections are actually in the right places / have the right context.

Analysis

Look at the record linkage from the EIA side rather than the SEC side.

Communicating / Visualizing the EDA

How can we best refer to / include a sampling of parent-subsidiary data to illustrate the quality and completeness of linkages?
Including a big CSV table is too much. Link to a notebook? Leave the code snippet(s) in so others can run them?
What's the right way to visualize this kind of information / record linkage?
Generating a tree / graph view of the parent-subsidiary relationships for electricity companies and their subsidiaries that have CIKs could be interesting.
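One low-effort option for the tree view is an indented text rendering, which can be pasted into docs or a notebook without plotting dependencies. The sketch below is only an illustration: `render_tree`, the `children` mapping (parent name to a list of subsidiary names), and the toy company names are all hypothetical and not part of PUDL.

```python
def render_tree(root, children, indent="  "):
    """Render a parent-subsidiary hierarchy as indented text.

    ``children`` is a hypothetical mapping of company -> list of subsidiaries.
    A company that reappears on its own ancestry path is flagged instead of
    being expanded again, so circular ownership cannot cause infinite recursion.
    """
    lines = []

    def visit(node, depth, path):
        is_cycle = node in path
        lines.append(f"{indent * depth}{node}{' (cycle)' if is_cycle else ''}")
        if is_cycle:
            return
        for child in sorted(children.get(node, ())):
            visit(child, depth + 1, path | {node})

    visit(root, 0, frozenset())
    return "\n".join(lines)


# Toy data, invented purely for illustration.
example_children = {"Parent Co": ["Sub B", "Sub A"], "Sub A": ["Sub A1"]}
example_tree = render_tree("Parent Co", example_children)
print(example_tree)
```

Sorting the children keeps the output stable between runs, which makes the rendering easier to diff when the underlying data changes.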

What's the future of this dataset? How can it be improved?

Clean up the SIC names/IDs so we can easily select companies (small/easy, would make EDA cleaner)
Fix the company info block bug (requires re-running the text extraction & processing on SEC side)
Make sure we get the 2018-2022 data filled in (is this just re-running on data we already have?)
Capture a new 2025 snapshot (requires re-running the scraper)
Improve the SEC 10-K filer to EIA Utility linkage (this is the only one with lots of rich information).
Compile more training data for the Exhibit 21 ML pipeline (how helpful will this be? How much will we need?)
Enrich / improve linkage to subsidiaries (is there any avenue for this?)
Can we afford to hold on to the raw data? ($200/mo to store it. Can we use cold-storage?)
Run the record linkage across all years of data, so utilities that only appear outside of 2023 are included.
Clean up subsidiary names within each individual SEC 10-K filer across all reporting years so that we can have consistent subsidiary IDs across time, at least within a single parent company.
For SEC 10-K filers, bring in some fundamental financial metrics like total assets / fixed capital investments, net income, market cap, book value, etc. from the SEC 10-K itself for use in weighting their importance.

Better explain use cases / blockers

Rank electric utilities by transition friendliness
- Select all electricity companies (4911, 4931).
- For each of them, make a list of all subsidiaries (including sub-subsidiaries) that have CIKs
- Look up all the industry codes associated with each of those subsidiary CIKs
- Make a list of "fossil" CIKs: natural gas, petroleum, and coal related industries.
- What proportion of each electricity company's subsidiary company CIKs are electric vs. fossil industries?
- Ideally weight the subsidiaries by their book value, market cap, revenues, net-income, etc.
- Rank electric companies according to how purely electric (vs. fossil) they are.
- Now build a graph for each electric company
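The ranking steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not an implementation against the real PUDL tables: `children` (CIK to list of subsidiary CIKs), `sic_by_cik`, and the optional `weights` mapping are hypothetical inputs, and the membership of `FOSSIL_SICS` is an illustrative guess at "fossil" industries rather than a vetted list.

```python
from collections import deque

ELECTRIC_SICS = {"4911", "4931"}
# Assumption: an illustrative, incomplete set of coal, oil, and gas related SIC codes.
FOSSIL_SICS = {"1221", "1311", "2911", "4922", "4923", "4924"}


def all_subsidiaries(parent, children):
    """Collect direct and indirect subsidiary CIKs with a breadth-first walk."""
    seen = set()
    queue = deque(children.get(parent, ()))
    while queue:
        cik = queue.popleft()
        if cik in seen or cik == parent:
            continue  # skip repeats and circular ownership back to the parent
        seen.add(cik)
        queue.extend(children.get(cik, ()))
    return seen


def electric_purity(parent, children, sic_by_cik, weights=None):
    """Weighted share of electric among electric-plus-fossil subsidiaries.

    Returns None when the company has no electric or fossil subsidiaries.
    """
    weights = weights or {}
    electric = fossil = 0.0
    for cik in all_subsidiaries(parent, children):
        weight = weights.get(cik, 1.0)  # unweighted unless a metric is supplied
        sic = sic_by_cik.get(cik)
        if sic in ELECTRIC_SICS:
            electric += weight
        elif sic in FOSSIL_SICS:
            fossil += weight
    total = electric + fossil
    return electric / total if total else None


def rank_electric_companies(children, sic_by_cik, weights=None):
    """Rank electric-SIC companies from most to least purely electric."""
    scores = {
        cik: electric_purity(cik, children, sic_by_cik, weights)
        for cik, sic in sic_by_cik.items()
        if sic in ELECTRIC_SICS
    }
    scored = [(cik, score) for cik, score in scores.items() if score is not None]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data, invented purely for illustration.
children = {"A": ["A1", "A2"], "A1": ["A3"], "B": ["B1", "B2"]}
sic_by_cik = {
    "A": "4911", "A1": "4911", "A2": "1311", "A3": "4931",
    "B": "4931", "B1": "4924", "B2": "1221",
}
ranking = rank_electric_companies(children, sic_by_cik)
```

Passing a `weights` mapping keyed by CIK (book value, revenue, and so on) would cover the weighting step. Companies with no classifiable subsidiaries are dropped from the ranking rather than scored as zero, which is a design choice worth revisiting given how sparse the subsidiary-to-CIK linkage is.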


katie-lamb

I left some edits and suggestions on the data source docs page. I also will add a section to the methodology page and irregularities section of the data source page on how we don't parse through nested layers of subsidiaries.

katie-lamb · 2025-09-22T19:03:37Z

@@ -851,19 +851,9 @@
        "title": "U.S. Securities and Exchange Commission Form 10-K",


Nit but it would be nice to have "SEC" in the title to make it more searchable and have it match the other data source titles which have the dataset abbreviation. "U.S. Securities and Exchange Commission (SEC) Form 10-K" is my suggestion.

katie-lamb · 2025-09-24T21:17:37Z

+First, metadata related to each filing was parsed out of the plaintext headers of the
+HTML documents. Some of these headers pertain to the filing as a whole and others
+describe attributes of the individual companies associated with the filing. Each filing
+has a single company that is the primary filer, and the filing is associated with their
+`Central Index Key <https://www.sec.gov/search-filings/cik-lookup>`__ (CIK) -- a
+persistent company identifier assigned by the SEC that is more durable and standardized
+than the company name.


To me this Background section feels a little too long and the level of detail makes it a bit confusing. I left some suggestions for the next three paragraphs:

"First, metadata related to each filing and the companies associated with the filing was parsed out of the plaintext headers of the HTML documents. Some of these headers pertain to the filing as a whole and others
describe attributes of the individual companies associated with the filing. Each filing
has a single company that is the primary filer, and the filing is associated with their
Central Index Key <https://www.sec.gov/search-filings/cik-lookup>__ (CIK) -- a
persistent company identifier assigned by the SEC that is more durable and standardized
than the company name."

The sentence about the headers pertaining to the filing as a whole vs attributes of the individual filers doesn't feel necessary and was a bit confusing. It could belong in the table level doc string of the filing info table and/or the company info table?

katie-lamb · 2025-09-24T21:19:02Z

+EDGAR database, including the Exhibit 21 attachments. We extract two kinds of data from
+this raw data source using different methods.


Suggestion:

We extract two kinds of data from this raw data source using different methods.

-->
"We extract two kinds of data from this raw data source: metadata about the filing companies and filing itself, and ownership data about the company's subsidiaries."

katie-lamb · 2025-09-24T21:20:27Z

+The plaintext headers are not necessarily intended to be machine readable, but they
+are highly structured and allow us to compile a database of all SEC 10-K filings and the
+companies involved. The ``core_sec10k`` tables derived from the headers provide the


I think this first sentence "The plaintext headers..." could be cut and put into a table level docstring for the company info or filing info tables.

I felt like it was important to communicate that much of what we are doing with this data source is compiling a database of SEC 10-K filers & filings, and doing it in a kind of opportunistic way -- by parsing these headers, rather than pulling from some canonical API or existing SEC database that provides the company information. I think this becomes important later on when trying to interpret the company information table, because it helps explain why there are multiple not necessarily identical entries for the same companies in association with many different filings.

katie-lamb · 2025-09-24T21:30:50Z

+companies involved. The ``core_sec10k`` tables derived from the headers provide the
+context that's necessary to link the subsidiary company information extracted from
+Exhibit 21 to other sets of companies, including the SEC 10-K filers themselves, as well
+as companies that file the EIA Form 860.


I think this paragraph strays away from the main focus of this section which is the two kinds of data that we extract from the raw data source. To be more focused, I think we could move this sentence on using info pulled from headers to link to EIA into the linkage section below.

katie-lamb · 2025-09-25T17:58:34Z

+
+No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21
+attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of
+the main filer with the name and location of the subsidiary company as observed in the


I would say:

"of the parent company" or maybe "of the parent company who is filing" instead of "main filer". I'm finding "main filer" to be confusing since we've only mentioned one or two times previously in this data source page that there is more than one company in a filing. Also, we've made the assumption (and it seems backed up by SEC filing language) that the main filer is the owner company whose subsidiaries are reported in the attached Ex. 21

I lean toward using vocabulary consistent with the SEC -- it would be weird for someone familiar with the SEC 10-K to be excited to use our data only to find we were using different words for central concepts.
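For readers unfamiliar with the ad-hoc ID discussed in this thread, here is a minimal sketch of the idea: join the parent company's CIK with the subsidiary's name and location. The function name, the underscore separator, the lowercasing and whitespace cleanup, and the zero-padding of the CIK to 10 digits are all assumptions made for illustration; the actual PUDL implementation may normalize differently.

```python
def make_subsidiary_id(parent_cik, name, location):
    """Build an ad-hoc subsidiary ID from the parent CIK, name, and location.

    Hypothetical helper: the normalization below is an assumption, not the
    documented PUDL behavior.
    """

    def normalize(text):
        # Lowercase and collapse runs of whitespace; treat missing values as "".
        return " ".join((text or "").lower().split())

    cik = str(parent_cik).zfill(10)
    return "_".join([cik, normalize(name), normalize(location)])


example_id = make_subsidiary_id(1234, "  Acme  Power LLC", "Delaware")
print(example_id)
```

Because the ID embeds the parent's CIK and the name exactly as it appears in one filing, the same real-world subsidiary reported by two parents, or spelled differently across years, gets different IDs. That is why the ID is only meaningful within a single parent company.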

katie-lamb · 2025-09-25T17:59:01Z

+-------------------------------------------------------
+
+No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21
+attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of


Maybe:

"We construct an ad-hoc ID for each subsidiary company by..."

katie-lamb · 2025-09-25T18:09:23Z

+confidently how much of each subsidiary the parent company owns. Anecdotally, based on
+outside information, many of the subsidiaries that do not report an ownership fraction
+seem to be entirely owned by a single parent company, but we don't know how common that
+really is.


It's extremely common for an Ex. 21 to say at the bottom "all subsidiaries are wholly owned" or something like that. So I wouldn't say this is "outside information". Maybe simplify this to say:

"When ownership fractions are not included, it's very common for an Ex. 21 to include that all subsidiaries are wholly owned by the parent, but our existing model does not detect when this information is included."

katie-lamb · 2025-09-25T18:10:38Z

+is part of. However, both columns contain significant numbers of null values and the
+name field is not entirely standardized. Further cleaning of these columns is needed to


This high null percentage is really because of a lack of reporting in the raw data right? Or are we not correctly grabbing this from headers?

I haven't gone back to look at the raw headers to see if we're somehow not extracting data that's there. But I wouldn't be surprised at all if there are just a lot of truly missing values.

In the code changes in this PR I did go ahead and standardize the names (based on canonical names from SEC) and filled in missing names when there was a code, and forward/backward filled gaps where the values before and after the gap were the same. But there are still a lot of missing values because there were significant numbers of companies that had never reported a SIC code.

So the bit in this paragraph about needing to do further cleaning / standardization can probably be removed.
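The cleanup described in the reply above can be sketched as two small steps: fill a gap only when the known values on both sides agree, and derive the industry name from the code wherever a code exists. This is an illustration, not the code from this PR; the function names, the `(code, name)` record layout, and the `canonical_names` mapping are hypothetical.

```python
def fill_matching_gaps(values):
    """Fill runs of None only when the values bounding the gap are equal.

    ``values`` is one company's SIC codes ordered by report date. Leading and
    trailing gaps, and gaps between two different codes, are left as None.
    """
    filled = list(values)
    last_idx = None  # position of the most recent non-null value
    for i, value in enumerate(values):
        if value is None:
            continue
        if last_idx is not None and i - last_idx > 1 and values[last_idx] == value:
            for j in range(last_idx + 1, i):
                filled[j] = value
        last_idx = i
    return filled


def fill_names_from_codes(records, canonical_names):
    """Use the canonical name for every record that has a known code."""
    cleaned = []
    for code, name in records:
        if code is not None and code in canonical_names:
            name = canonical_names[code]
        cleaned.append((code, name))
    return cleaned


codes = ["4911", None, None, "4911", None, "4931", None]
filled_codes = fill_matching_gaps(codes)

# Assumption: a tiny stand-in for the SEC's canonical SIC code list.
canonical_names = {"4911": "Electric Services"}
records = [("4911", None), ("4911", "ELECTRIC SVCS"), (None, None)]
cleaned_records = fill_names_from_codes(records, canonical_names)
```

Requiring agreement on both sides of a gap is deliberately conservative: it avoids guessing when a company may have changed industries during the unreported period, at the cost of leaving more nulls behind.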

katie-lamb · 2025-09-25T18:38:51Z

+
+{% endblock %}
+
+{% block notable_irregularities %}


I'm finding this section a little overwhelming and hard to parse through. Can the irregularities that pertain to record linkage be put in a different section specific to record linkage? That would be:

The SEC 10-K filer to EIA utility linkage is based only on 2023 data

Matches between EIA utility ids and SEC 10-K energy companies are sparse (and everything under this)

Reorganized as suggested (with some minor exceptions for things below Matches that didn't seem to be about linkages)

zaneselvans · 2025-09-26T15:39:32Z

+This is a little trickier to evaluate. SEC filers use a four-digit Standard Industrial
+Classification (SIC) code to identify the company's primary industry, but not all utilities
+file under the SICs for electric services (4911 and 4931), and not all companies that
+file as electric services are utilities. We see SICs as unexpected as computer storage
+devices, nursing facilities, and real estate among our -- high-confidence -- matches to EIA
+utilities.


An intuitive expectation that I have based on what I know about the SEC 10-K filers and browsing through the companies listed under SICs 4911 and 4931 is that conceptually, a very high proportion of the filers in those industries should really show up in the set of EIA utilities. I think to a much higher degree than I would expect that the EIA utilities would necessarily show up as SEC 10-K filers, since there are so many smaller special-purpose companies that exist only to own a single generator or slices of some plant, or to act as an operator without owning the underlying asset that they're operating.

Meaning, if a company is explicitly in the electricity services business, AND it is big enough to be filing the SEC 10-K, then I would be pretty surprised if it doesn't also show up in the list of EIA utilities. Maybe there's some name or other data mismatch that's keeping us from identifying it correctly, but off the top of my head I'm not sure what the common case would be where the electricity company is an SEC 10-K filer and also doesn't report to EIA.
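One way to test this expectation is to measure what share of filers under SICs 4911 and 4931 actually carry an EIA utility match. The sketch below assumes each filer is a dict with hypothetical `sic` and `utility_id_eia` keys; it is not tied to the real table schema, and the toy records are invented.

```python
ELECTRIC_SICS = {"4911", "4931"}


def electric_filer_match_rate(filers):
    """Share of SIC 4911/4931 filers that have an EIA utility ID.

    Returns None when there are no electric-services filers to evaluate.
    """
    electric = [f for f in filers if f.get("sic") in ELECTRIC_SICS]
    if not electric:
        return None
    matched = sum(f.get("utility_id_eia") is not None for f in electric)
    return matched / len(electric)


# Toy records, invented purely for illustration.
filers = [
    {"cik": "0000000001", "sic": "4911", "utility_id_eia": 101},
    {"cik": "0000000002", "sic": "4931", "utility_id_eia": None},
    {"cik": "0000000003", "sic": "4911", "utility_id_eia": 102},
    {"cik": "0000000004", "sic": "6798", "utility_id_eia": 103},
]
match_rate = electric_filer_match_rate(filers)
```

A low rate on the real data would point toward name or data mismatches in the linkage rather than a true absence from EIA reporting, and the unmatched filers would be the natural list to inspect by hand.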


krivard · 2025-09-30T20:38:31Z

+:issue:`4165` and PR :pr:`4134`. This bug has been fixed, and the lost addresses can
+be recovered by re-running the upstream extraction.
+
+Industry classifications applied to companies have poor coverage


I want to collect some more numbers on this, and add a sentence or two on the actual coverage

Future work! #4643

codecov · 2025-10-02T21:19:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.14%. Comparing base (752fe80) to head (0664c3d).
⚠️ Report is 102 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
+ Coverage   93.13%   93.14%   +0.01%     
==========================================
  Files         199      199              
  Lines       16922    16947      +25     
==========================================
+ Hits        15760    15785      +25     
  Misses       1162     1162


zaneselvans added 4 commits August 20, 2025 09:16

Add an initial incomplete data source docs page for SEC 10-K

faf7c7e

Merge branch 'main' into sec10k-docs

60cac11

Merge branch 'main' into sec10k-docs

30abec5

Merge branch 'main' into sec10k-docs

518e6a9

zaneselvans self-assigned this Aug 22, 2025

zaneselvans added docs Documentation for users and contributors. sec10k Issues related to SEC 10K filing data. labels Aug 22, 2025

zaneselvans added this to Catalyst Megaproject Aug 22, 2025

github-project-automation Bot moved this to New in Catalyst Megaproject Aug 22, 2025

zaneselvans changed the title ~~Sec10k docs~~ Add SEC 10-K data source documentation Aug 23, 2025

zaneselvans moved this from New to In progress in Catalyst Megaproject Aug 23, 2025

zaneselvans added 5 commits August 24, 2025 17:27

Merge branch 'main' into sec10k-docs

297575b

Merge branch 'main' into sec10k-docs

0b6c676

Merge branch 'main' into sec10k-docs

43a1ac2

Flesh out SEC 10-K data source docs page.

5e4acea

WIP: Add SEC 10-K data review notebook.

2dbd30a

zaneselvans and others added 13 commits August 27, 2025 11:46

Merge branch 'main' into sec10k-docs

7ce9ad5

Merge branch 'main' into sec10k-docs

17b6c95

Add some questions to SEC 10K template

f752801

Merge branch 'main' into sec10k-docs

8bfa793

add some more analysis to SEC 10-K notebook.

25d266f

Merge branch 'sec10k-docs' of github.com:catalyst-cooperative/pudl in…

4c41a6a

…to sec10k-docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

dc88d7c

For more information, see https://pre-commit.ci

Fix docs source formatting error.

94dfc65

Merge branch 'main' into sec10k-docs

f857891

Merge in environment changes from main

e62d9af

Merge branch 'main' into sec10k-docs

7ff8ca5

Add a notebook checking some SEC 10-K things.

87ab370

edits to SEC 10-K data source docs

45bc970

Merge branch 'main' into sec10k-docs

7372fa4

katie-lamb suggested changes Sep 25, 2025


github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject Sep 25, 2025

katie-lamb added 2 commits September 25, 2025 14:55

add irregularity about nested subsidiaries

0683f33

fix typo

6e07b37

katie-lamb reviewed Sep 25, 2025


Comment thread docs/templates/sec10k_child.rst.jinja

zaneselvans commented Sep 26, 2025


zaneselvans and others added 5 commits September 29, 2025 00:27

Merge branch 'main' into sec10k-docs

473ba24

Update src/pudl/metadata/dfs.py

1b02896

Co-authored-by: Kathryn Mazaitis <1158666+krivard@users.noreply.github.com>

Fix typo

5e1c29c

Merge branch 'sec10k-docs' of github.com:catalyst-cooperative/pudl in…

4e78b11

…to sec10k-docs

Address review comments; rework irregularities section

b7cd997

krivard reviewed Sep 30, 2025


Comment thread docs/templates/sec10k_child.rst.jinja Outdated

Fix TODO

f823843

krivard reviewed Sep 30, 2025


add sec abbreeviation

a2bd82c

katie-lamb approved these changes Oct 2, 2025


katie-lamb added this pull request to the merge queue Oct 2, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Oct 2, 2025

krivard reviewed Oct 2, 2025


Comment thread src/pudl/output/sec10k.py Outdated

Fix set arithmetic

bf6a108

krivard reviewed Oct 2, 2025


Comment thread src/pudl/output/sec10k.py Outdated

If we don't find the known nonunique sics, that's okay

0664c3d

krivard added this pull request to the merge queue Oct 2, 2025

Merged via the queue into main with commit 71cb6cd Oct 2, 2025
25 checks passed

krivard deleted the sec10k-docs branch October 2, 2025 22:24

github-project-automation Bot moved this from In progress to Done in Catalyst Megaproject Oct 2, 2025

jdangerx mentioned this pull request Oct 13, 2025

Evaluate and document usability and limitations of SEC 10K tables #4329

Open

