
Add SEC 10-K data source documentation #4562

Merged
krivard merged 65 commits into main from sec10k-docs
Oct 2, 2025

@zaneselvans zaneselvans commented Aug 22, 2025

Member

Overview

Adds a data source documentation page for the SEC 10-K data.

See also the marimo notebook in this PR

Closes #4347 #4329

Questions & Issues

Writing

  • Add intro to irregularities to provide context.
  • Make sure the original data & availability sections are actually in the right places / have the right context.

Analysis

  • Look at the record linkage from the EIA side rather than the SEC side.

Communicating / Visualizing the EDA

  • How can we best refer to / include a sampling of parent-subsidiary data to illustrate the quality and completeness of linkages?
  • Including a big CSV table is too much. Link to a notebook? Leave the code snippet(s) in so others can run them?
  • What's the right way to visualize this kind of information / record linkage?
  • Generating a tree / graph view of the parent-subsidiary relationships for electricity companies and their subsidiaries that have CIKs could be interesting.

What's the future of this dataset? How can it be improved?

  • Clean up the SIC names/IDs so we can easily select companies (small/easy, would make EDA cleaner)
  • Fix the company info block bug (requires re-running the text extraction & processing on SEC side)
  • Make sure we get the 2018-2022 data filled in (is this just re-running on data we already have?)
  • Capture a new 2025 snapshot (requires re-running the scraper)
  • Improve the SEC 10-K filer to EIA Utility linkage (this is the only one with lots of rich information).
  • Compile more training data for the Exhibit 21 ML pipeline (how helpful will this be? How much will we need?)
  • Enrich / improve linkage to subsidiaries (is there any avenue for this?)
  • Can we afford to hold on to the raw data? ($200/mo to store it. Can we use cold-storage?)
  • Run the record linkage across all years of data, so utilities that only appear outside of 2023 are included.
  • Clean up subsidiary names within each individual SEC 10-K filer across all reporting years so that we can have consistent subsidiary IDs across time, at least within a single parent company.
  • For SEC 10-K filers, bring in some fundamental financial metrics like total assets / fixed capital investments, net income, market cap, book value, etc. from the SEC 10-K itself for use in weighting their importance.

Better explain use cases / blockers

  • Rank electric utilities by transition friendliness
    • Select all electricity companies (4911, 4931).
    • For each of them, make a list of all subsidiaries (including sub-subsidiaries) that have CIKs
    • Look up all the industry codes associated with each of those subsidiary CIKs
    • Make a list of "fossil" CIKs: natural gas, petroleum, and coal related industries.
    • What proportion of each electricity company's subsidiary company CIKs are electric vs. fossil industries?
    • Ideally weight the subsidiaries by their book value, market cap, revenues, net-income, etc.
    • Rank electric companies according to how purely electric (vs. fossil) they are.
    • Now build a graph for each electric company
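
The ranking steps above could be sketched roughly as follows. This is a minimal illustration, not code from this PR: the `children` mapping (parent CIK to direct subsidiary CIKs), the `sic_by_cik` lookup, the function names, and the particular set of "fossil" SIC codes are all assumptions made for the example, and no weighting by book value or revenue is attempted.

```python
# Electric SIC codes come from the list above. The fossil set is an
# illustrative assumption, not a vetted classification.
ELECTRIC_SICS = {"4911", "4931"}
FOSSIL_SICS = {"1221", "1311", "2911", "4922", "4923", "4924"}


def all_subsidiaries(parent, children):
    """Return every CIK reachable from parent, sub-subsidiaries included."""
    seen = set()
    stack = [parent]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(parent)  # guard against ownership cycles
    return seen


def electric_share(parent, children, sic_by_cik):
    """Fraction of electric-or-fossil subsidiaries that are electric."""
    subs = all_subsidiaries(parent, children)
    electric = sum(sic_by_cik.get(cik) in ELECTRIC_SICS for cik in subs)
    fossil = sum(sic_by_cik.get(cik) in FOSSIL_SICS for cik in subs)
    total = electric + fossil
    return electric / total if total else None


def rank_by_electric_share(children, sic_by_cik):
    """Rank electricity companies from most to least purely electric."""
    scored = []
    for cik, sic in sic_by_cik.items():
        if sic in ELECTRIC_SICS:
            share = electric_share(cik, children, sic_by_cik)
            if share is not None:  # skip companies with nothing to score
                scored.append((cik, share))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy data with made-up identifiers standing in for CIKs.
toy_children = {"A": ["A1", "A2"], "A1": ["A3"], "B": ["B1"]}
toy_sics = {
    "A": "4911", "A1": "4911", "A2": "1311",
    "A3": "4924", "B": "4931", "B1": "4911",
}
toy_ranking = rank_by_electric_share(toy_children, toy_sics)
```

An unweighted count treats a tiny shell company the same as a major operating subsidiary, which is why the weighting step in the list matters for any real ranking.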

@zaneselvans zaneselvans self-assigned this Aug 22, 2025
@zaneselvans zaneselvans added docs Documentation for users and contributors. sec10k Issues related to SEC 10K filing data. labels Aug 22, 2025
@zaneselvans zaneselvans changed the title Sec10k docs Add SEC 10-K data source documentation Aug 23, 2025
@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Aug 23, 2025

@katie-lamb katie-lamb left a comment

Contributor

I left some edits and suggestions on the data source docs page. I will also add a section to the methodology page, and to the irregularities section of the data source page, explaining that we don't parse through nested layers of subsidiaries.

Comment thread src/pudl/metadata/sources.py Outdated
@@ -851,19 +851,9 @@
"title": "U.S. Securities and Exchange Commission Form 10-K",

Contributor

Nit but it would be nice to have "SEC" in the title to make it more searchable and have it match the other data source titles which have the dataset abbreviation. "U.S. Securities and Exchange Commission (SEC) Form 10-K" is my suggestion.

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +37 to +43
First, metadata related to each filing was parsed out of the plaintext headers of the
HTML documents. Some of these headers pertain to the filing as a whole and others
describe attributes of the individual companies associated with the filing. Each filing
has a single company that is the primary filer, and the filing is associated with their
`Central Index Key <https://www.sec.gov/search-filings/cik-lookup>`__ (CIK) -- a
persistent company identifier assigned by the SEC that is more durable and standardized
than the company name.

Contributor

To me this Background section feels a little too long and the level of detail makes it a bit confusing. I left some suggestions for the next three paragraphs:

"First, metadata related to each filing and the companies associated with the filing was parsed out of the plaintext headers of the HTML documents. Some of these headers pertain to the filing as a whole and others describe attributes of the individual companies associated with the filing. Each filing has a single company that is the primary filer, and the filing is associated with their `Central Index Key <https://www.sec.gov/search-filings/cik-lookup>`__ (CIK) -- a persistent company identifier assigned by the SEC that is more durable and standardized than the company name."

The sentence about the headers pertaining to the filing as a whole vs attributes of the individual filers doesn't feel necessary and was a bit confusing. It could belong in the table level doc string of the filing info table and/or the company info table?

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +34 to +35
EDGAR database, including the Exhibit 21 attachments. We extract two kinds of data from
this raw data source using different methods.

Contributor

Suggestion:

We extract two kinds of data from this raw data source using different methods.

-->
"We extract two kinds of data from this raw data source: metadata about the filing companies and filing itself, and ownership data about the company's subsidiaries."

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +45 to +47
The plaintext headers are not necessarily intended to be machine readable, but they
are highly structured and allow us to compile a database of all SEC 10-K filings and the
companies involved. The ``core_sec10k`` tables derived from the headers provide the

Contributor

I think this first sentence "The plaintext headers..." could be cut and put into a table level docstring for the company info or filing info tables.

Member Author

I felt like it was important to communicate that much of what we are doing with this data source is compiling a database of SEC 10-K filers & filings, and doing it in a kind of opportunistic way -- by parsing these headers, rather than pulling from some canonical API or existing SEC database that provides the company information. I think this becomes important later on when trying to interpret the company information table, because it helps explain why there are multiple not necessarily identical entries for the same companies in association with many different filings.

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +47 to +50
companies involved. The ``core_sec10k`` tables derived from the headers provide the
context that's necessary to link the subsidiary company information extracted from
Exhibit 21 to other sets of companies, including the SEC 10-K filers themselves, as well
as companies that file the EIA Form 860.

Contributor

I think this paragraph strays away from the main focus of this section which is the two kinds of data that we extract from the raw data source. To be more focused, I think we could move this sentence on using info pulled from headers to link to EIA into the linkage section below.

Comment thread docs/templates/sec10k_child.rst.jinja Outdated

No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21
attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of
the main filer with the name and location of the subsidiary company as observed in the

Contributor

I would say:

"of the parent company" or maybe "of the parent company that is filing" instead of "main filer". I'm finding "main filer" confusing, since we've only mentioned once or twice previously in this data source page that there can be more than one company in a filing. Also, we've made the assumption (and it seems backed up by SEC filing language) that the main filer is the owner company whose subsidiaries are reported in the attached Ex. 21.

Contributor

I lean toward using vocabulary consistent with the SEC -- it would be weird for someone familiar with the SEC 10-K to be excited to use our data, only to find we were using different words for central concepts.
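
For readers following this thread, the ad-hoc subsidiary ID described in the quoted docs (filer CIK concatenated with the subsidiary's name and location) might look something like the sketch below. The function name, the underscore separator, and the whitespace/case normalization are assumptions for illustration only, not the format PUDL actually uses.

```python
import re


def make_subsidiary_id(filer_cik, name, location):
    """Build an illustrative ID from the filer CIK plus subsidiary name and location."""

    def norm(text):
        # Assumed normalization: trim, lowercase, collapse internal whitespace.
        return re.sub(r"\s+", " ", (text or "").strip().lower())

    # CIKs have at most 10 digits, so zero-pad for a fixed-width prefix.
    return "_".join([str(filer_cik).zfill(10), norm(name), norm(location)])


example_id = make_subsidiary_id("1234", "  Acme  Power LLC", "Delaware")
```

Because the ID is built from the name and location as observed in a filing, any change in how a filer spells a subsidiary's name produces a different ID, which is why IDs of this kind are not stable across years.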

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
-------------------------------------------------------

No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21
attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of

Contributor

Maybe:

"We construct an ad-hoc ID for each subsidiary company by..."

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +406 to +409
confidently how much of each subsidiary the parent company owns. Anecdotally, based on
outside information, many of the subsidiaries that do not report an ownership fraction
seem to be entirely owned by a single parent company, but we don't know how common that
really is.

Contributor

It's extremely common for an Ex. 21 to say at the bottom "all subsidiaries are wholly owned" or something like that. So I wouldn't say this is "outside information". Maybe simplify this to say:

"When ownership fractions are not included, it's very common for an Ex. 21 to state that all subsidiaries are wholly owned by the parent, but our existing model does not detect when this statement is present."

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +416 to +417
is part of. However, both columns contain significant numbers of null values and the
name field is not entirely standardized. Further cleaning of these columns is needed to

Contributor

This high null percentage is really because of a lack of reporting in the raw data, right? Or are we not correctly grabbing this from the headers?

@zaneselvans zaneselvans Sep 29, 2025

Member Author

I haven't gone back to look at the raw headers to see if we're somehow not extracting data that's there. But I wouldn't be surprised at all if there are just a lot of truly missing values.

In the code changes in this PR I did go ahead and standardize the names (based on canonical names from SEC), filled in missing names when there was a code, and forward/backward filled gaps where the values before and after the gap were the same. But there are still a lot of missing values because there were significant numbers of companies that had never reported a SIC code.

So the bit in this paragraph about needing to do further cleaning / standardization can probably be removed.
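
The gap-filling rule mentioned above (fill a gap only when the values on both sides agree) can be sketched as follows. This is a standalone illustration, not the code from this PR; it assumes a single company's SIC codes in time order, with `None` marking missing values, and the function name is hypothetical.

```python
def fill_bounded_gaps(values):
    """Fill runs of None only when the known values on both sides are equal."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1  # advance to the end of this run of missing values
        before = filled[start - 1] if start > 0 else None
        after = filled[i] if i < n else None
        if before is not None and before == after:
            filled[start:i] = [before] * (i - start)
    return filled


example_codes = ["4911", None, None, "4911", None, "4931", None]
example_filled = fill_bounded_gaps(example_codes)
```

Gaps at the start or end of a series, and gaps where the surrounding codes differ, are left alone, which matches the observation that companies with no reported SIC code stay null.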


{% endblock %}

{% block notable_irregularities %}

Contributor

I'm finding this section a little overwhelming and hard to parse through. Can the irregularities that pertain to record linkage be put in a different section specific to record linkage? That would be:

  • The SEC 10-K filer to EIA utility linkage is based only on 2023 data
  • Matches between EIA utility ids and SEC 10-K energy companies are sparse (and everything under this)

Contributor

Reorganized as suggested (with some minor exceptions for things below Matches that didn't seem to be about linkages)

@github-project-automation github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject Sep 25, 2025
Comment thread docs/templates/sec10k_child.rst.jinja
Comment thread docs/templates/sec10k_child.rst.jinja Outdated
Comment on lines +253 to +258
This is a little trickier to evaluate. SEC filers use a four-digit Standard Industrial
Classification (SIC) code to identify the company's primary industry, but not all utilities
file under the SICs for electric services (4911 and 4931), and not all companies that
file as electric services are utilities. We see SICs as unexpected as computer storage
devices, nursing facilities, and real estate among our -- high-confidence -- matches to EIA
utilities.

@zaneselvans zaneselvans Sep 26, 2025

Member Author

An intuitive expectation that I have, based on what I know about the SEC 10-K filers and on browsing the companies listed under SICs 4911 and 4931, is that a very high proportion of the filers in those industries should show up in the set of EIA utilities. I expect this to a much higher degree than I would expect EIA utilities to show up as SEC 10-K filers, since there are so many smaller special-purpose companies that exist only to own a single generator or slices of some plant, or to act as an operator without owning the underlying asset that they're operating.

Meaning, if a company is explicitly in the electricity services business, AND it is big enough to be filing the SEC 10-K, then I would be pretty surprised if it doesn't also show up in the list of EIA utilities. Maybe there's some name or other data mismatch that's keeping us from identifying it correctly, but off the top of my head I'm not sure what the common case would be where the electricity company is an SEC 10-K filer and also doesn't report to EIA.
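
One way to put a number on that expectation is to compute the share of electric-services filers that have any EIA match. The sketch below is illustrative only: the record layout and the `sic` and `utility_id_eia` field names are assumptions for this example, and the toy records are made up.

```python
ELECTRIC_SICS = {"4911", "4931"}


def electric_filer_match_rate(filers):
    """Share of electric-services filers that are linked to an EIA utility."""
    electric = [f for f in filers if f.get("sic") in ELECTRIC_SICS]
    if not electric:
        return None  # nothing to measure
    matched = sum(f.get("utility_id_eia") is not None for f in electric)
    return matched / len(electric)


# Made-up records: one dict per SEC 10-K filer.
toy_filers = [
    {"cik": "1", "sic": "4911", "utility_id_eia": 101},
    {"cik": "2", "sic": "4931", "utility_id_eia": None},
    {"cik": "3", "sic": "4911", "utility_id_eia": 102},
    {"cik": "4", "sic": "3572", "utility_id_eia": 103},
]
toy_rate = electric_filer_match_rate(toy_filers)
```

A low rate for these two SIC codes would point toward a name or data mismatch in the linkage rather than a real absence from the EIA data.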

Comment thread docs/templates/sec10k_child.rst.jinja Outdated
:issue:`4165` and PR :pr:`4134`. This bug has been fixed, and the lost addresses can
be recovered by re-running the upstream extraction.

Industry classifications applied to companies have poor coverage

Contributor

I want to collect some more numbers on this, and add a sentence or two on the actual coverage

Contributor

Future work! #4643

@katie-lamb katie-lamb added this pull request to the merge queue Oct 2, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Oct 2, 2025
Comment thread src/pudl/output/sec10k.py Outdated
Comment thread src/pudl/output/sec10k.py Outdated

codecov Bot commented Oct 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.14%. Comparing base (752fe80) to head (0664c3d).
⚠️ Report is 102 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
+ Coverage   93.13%   93.14%   +0.01%     
==========================================
  Files         199      199              
  Lines       16922    16947      +25     
==========================================
+ Hits        15760    15785      +25     
  Misses       1162     1162              

@krivard krivard added this pull request to the merge queue Oct 2, 2025
Merged via the queue into main with commit 71cb6cd Oct 2, 2025
25 checks passed
@krivard krivard deleted the sec10k-docs branch October 2, 2025 22:24
@github-project-automation github-project-automation Bot moved this from In progress to Done in Catalyst Megaproject Oct 2, 2025

Labels

docs Documentation for users and contributors. sec10k Issues related to SEC 10K filing data.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Create data source documentation page for SEC 10-K

3 participants