Add SEC 10-k example notebook #16

Open
krivard wants to merge 4 commits into main from sec10k

Conversation

@krivard krivard commented Nov 7, 2025

Contributor

Overview

Implement the sec10k notebook spec from October 2025

What problem does this address?

What did you change in this PR?

  • Add example notebook for SEC 10-k

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Easy

  • Link to our sec10k docs in the intro
  • Confirm beta status or not
  • Spell out PUDL at first usage and link to main docs index
  • Cut "make file output of namechanges we can use at unspecified future time"
  • link to Methods when we first mention matching to EIA
  • drop rogue polars section
  • clarify why location is important to matching
  • Cut fuel section
  • expand slash notation to "in other words" "put another way"
  • Add conclusions / drop TODO for unique link coverage

Needs discussion / future work

  • connect output tables to the (which?) eia table
  • cut coverage sections? & if so, find a way to keep the cool part (about how prevalent "blank checks" industries are in the electricity generation space)?
  • motivate natural gas
  • annotate subsets of subsidiaries with percentages (but out of what population?)
  • drop counts 2 or less, even though the sample size is already small?
  • find a good example of a utility whose parent or subsidiary is something that has generated clearly biasing interests and caused a scandal

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

In its final version, we'll want to add some copy up top about what the data is, where to find documentation, and what we're going to do with it.


Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

Is central_index_key a field that someone working with the data would be comfortable with? Without more context, it's not clear to me why this is an important first step or what the motivation is here.


Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

Does this wind up getting used somewhere? It's not clear to me what the application is at present.


Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

This feels more oriented towards characterizing data completion. Narratively, I think it'd make more sense for the notebook to focus on data utilization - e.g., how do I get total generation for a series of nested entities or get a list of all plants owned by one entity and its subsidiaries? You could focus in on one entity for narrative simplicity.


Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

Same comment as above - this feels like characterizing data quality, not demonstrating how to best use the data.


Comment thread 07-sec10k-use-cases.ipynb
@@ -0,0 +1,642 @@
{

@e-belfer e-belfer Dec 1, 2025

Member

This is probably the most important part of the notebook and what I'd focus on.


@e-belfer e-belfer left a comment

Member

Looking good! Things are flowing well and the examples are very useful.

I have a high-level question about the relevance of the middle section for this notebook (characterizing whether sector codes make sense), and a couple of thoughts on context that would be helpful to add, but I like the way this is developing!

One thing I would love to see that isn't in your outline is leveraging the connections between the datasets, e.g., how much did a parent and its subsidiaries spend on X rate category overall (from the FERC connection)? How many plants are ultimately controlled by X parent company? And so on.

Comment thread 07-sec10k-use-cases.ipynb
"# Introduction\n",
"\n",
"Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n",
"Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n",

Member

Non-blocking: could link to the main data page somewhere and to our docs page elsewhere?

Comment thread 07-sec10k-use-cases.ipynb Outdated
"\n",
"Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n",
"Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n",
"PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n",

Member

Do we formally consider SEC10k to be in beta?
(It'd be good to have criteria for what we mean by this, but that's a different question).

Contributor Author

In our docs we say:

"We only conducted an initial round of modeling, so this dataset is a beta version and its contents and connections to other datasets are probabilistic in nature."

Comment thread 07-sec10k-use-cases.ipynb
"Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n",
"Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n",
"PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n",
"Four output tables are available:\n",

Member

Non-blocking total nit: Would be helpful to add an extra line break here to make this a distinctive paragraph. Or to make the paragraph break start at "PUDL has..."

Comment thread 07-sec10k-use-cases.ipynb Outdated
"\n",
"Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n",
"Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n",
"PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n",

Member

Possibly the first time we introduce PUDL we want to spell it out and link to the main read the docs index.

Comment thread 07-sec10k-use-cases.ipynb
"PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n",
"Four output tables are available:\n",
"\n",
"* `out_sec10k__quarterly_filings`: information about the Form 10-K filings themselves (filing date, subversion of the 10-K used, source URL, etc)\n",

Member

subversion?

Contributor Author

sub-version maybe? There are a bunch of different forms 10-k:

  • 10-k: the standard annual report.
  • 10-k/a: an amended version of the annual report.
  • 10-k405: filed to report insider trading that was not reported in a timely fashion.
  • 10-k405/a: an amended version of the 10-k405.
  • 10-kt: submitted in lieu of or in addition to a standard 10-K annual report when a company changes the end of its fiscal year (e.g. due to a merger) leaving the company with a longer or shorter reporting period.
  • 10-kt/a: an amended version of the 10-kt.
  • 10-ksb: the annual report for small businesses, also known as penny stocks.
  • 10-ksb/a: an amended version of the 10-ksb.

Comment thread 07-sec10k-use-cases.ipynb
]
},
{
"cell_type": "markdown",

Member

Sure can, but what is the motivation for doing so wrt the rest of our data or the broader SEC data landscape? Might be helpful to spell it out a bit more.

Contributor Author

I don't actually know! ideas?

Comment thread 07-sec10k-use-cases.ipynb Outdated
},
{
"cell_type": "markdown",
"id": "31081288-d9d1-400d-9b98-f28562b29950",

Member

Personally don't need us to do this 3+ times, if having the codes somewhere is helpful we could just leave a list for people to work from.

Contributor Author

cut the Fuel section?

Comment thread 07-sec10k-use-cases.ipynb
"id": "f226a7bf-c548-4849-9f3a-608ab11bc546",
"metadata": {},
"source": [
"The strong presence of natural gas transmission and distribution in this list is notable.\n",

Member

I would consider dropping the counts of 2 or less, they seem kind of random and mainly make me wonder whether the industry codes are useful or reliable.

Contributor Author

hmmm I'm reluctant to drop 10% of an already tiny sample but I also don't like that the small count items might jeopardize the perceived reliability of the data

Comment thread 07-sec10k-use-cases.ipynb
"output_type": "stream",
"text": [
"150505 subsidiaries of companies in electricity industries\n",
" 5023 subsidiaries where SIC is known\n",

Member

Could be helpful to print percentages here as well

Contributor Author

what denominator(s) would make sense here?

  • the total number of subsidiaries in the dataset, across parent companies from all industries
  • the number of subsidiaries of companies in electricity industries (line 1)
  • line-dependent
    • subsidiaries of companies in electricity industries: out of all subsidiaries anywhere
    • subsidiaries where SIC is known: out of line 1
    • subsidiaries with unknown SIC known to be an EIA utility: out of line 1
    • subsidiaries with insufficient industry metadata to decide either way: out of line 1
    • subsidiaries where SIC is not an electricity industry: out of line 2
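
To make the "line-dependent" option concrete, here is a minimal sketch in plain Python that prints a count next to its share of an explicitly chosen denominator. It uses only the two counts from the cell output quoted above; `with_percent` is a hypothetical helper name, not something that exists in the notebook, and the remaining lines would need their own counts filled in.

```python
def with_percent(count: int, denominator: int) -> str:
    """Format a count alongside its share of an explicitly chosen denominator."""
    return f"{count:>7} ({count / denominator:.1%})"


# The two counts printed by the notebook cell under discussion.
electricity_subsidiaries = 150505  # line 1
sic_known = 5023                   # line 2

# Line 2 expressed as a share of line 1.
report_line = (
    f"{with_percent(sic_known, electricity_subsidiaries)} "
    "subsidiaries where SIC is known"
)
print(report_line)
```

Passing the denominator explicitly keeps the choice visible at each call site, which matters if different lines end up using different populations.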

Comment thread 07-sec10k-use-cases.ipynb
}
],
"source": [
"xindustry_electricity_as_subsidiary[[\"parent_company_industry_name_sic\",\"parent_company_industry_id_sic\"]].value_counts()"

Member

Do we have a good example of a utility whose parent or subsidiary is something that has generated clearly biasing interests and caused a scandal? Would be helpful/grounding to flag in this section.

Contributor Author

I'm largely out of my depth there but one of the plans for the "leveraging subsidiary relationships" section below was:

Easy-Medium - Example of multiple layers of subsidiary nesting (Berkshire Hathaway → BHE → PacifiCorp)

  • Pitfall here is that multiple layers get reported at the top level. Potential double-counting.
  • Need a good way to display the output of BH that’s not overwhelming / better signal to noise

So probably not in this round, but might make a good first contribution
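
As a rough illustration of the double-counting pitfall, the sketch below walks a parent-to-subsidiary edge list and collects each descendant once, even when a lower-level subsidiary is also reported directly under the top-level parent. The edge list is hypothetical: only the Berkshire Hathaway → BHE → PacifiCorp chain comes from the plan above, and the direct Berkshire Hathaway → PacifiCorp edge is an assumption added to model the pitfall. `descendants` is an illustrative helper, not part of PUDL.

```python
from collections import defaultdict

# Hypothetical (parent, subsidiary) edges. The third edge is an assumed
# example of a lower layer also being reported at the top level.
edges = [
    ("Berkshire Hathaway", "BHE"),
    ("BHE", "PacifiCorp"),
    ("Berkshire Hathaway", "PacifiCorp"),
]

children = defaultdict(set)
for parent, subsidiary in edges:
    children[parent].add(subsidiary)


def descendants(root: str) -> set:
    """Return every company reachable from root, each counted once."""
    seen = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:  # skip companies already reached by another path
                seen.add(child)
                stack.append(child)
    return seen


all_subsidiaries = descendants("Berkshire Hathaway")
print(sorted(all_subsidiaries))
```

Collecting into a set means PacifiCorp appears once regardless of how many paths lead to it, whereas summing per-edge rows would count it twice.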

@krivard krivard left a comment

Contributor Author

Going to work on the easy fixes but had some questions & clarifications

Comment thread 07-sec10k-use-cases.ipynb
"* `out_sec10k__quarterly_filings`: information about the Form 10-K filings themselves (filing date, subversion of the 10-K used, source URL, etc)\n",
"* `out_sec10k__quarterly_company_information`: attributes describing the companies which file 10-K’s\n",
"* `out_sec10k__parents_and_subsidiaries`: ownership information about parent companies and their subsidiary companies\n",
"* `out_sec10k__changelog_company_name`: information about company name changes\n",

Contributor Author

which EIA table?

Comment thread 07-sec10k-use-cases.ipynb Outdated
"source": [
"Do companies sometimes change their industry code across filings?\n",
"\n",
"/is the number of unique (company, industry) pairs greater than the number of unique companies?"

Contributor Author

that's leftover shorthand for "in other words," will fix
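
For what it's worth, the "in other words" version of the question can be checked directly. This is a minimal sketch in plain Python with made-up rows: `central_index_key` is the company identifier discussed earlier in this review, while the SIC values and the one-row-per-filing layout are assumptions for illustration only.

```python
# Hypothetical (central_index_key, SIC industry id) rows, one per filing.
filings = [
    ("0000001", "4911"),
    ("0000001", "4911"),
    ("0000002", "4911"),
    ("0000002", "4931"),  # same company, different industry code
]

unique_pairs = set(filings)
unique_companies = {cik for cik, _ in unique_pairs}

# More unique pairs than unique companies means at least one company
# reported more than one industry code across its filings.
codes_change = len(unique_pairs) > len(unique_companies)
print(codes_change)
```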

Comment thread 07-sec10k-use-cases.ipynb Outdated
"id": "75c3983d-7bc1-4e22-b7bb-3e9fee5101e3",
"metadata": {},
"source": [
"lol, \"blank checks\"[<sup id=\"fn1-back\">1</sup>](#fn1 \"https://en.wikipedia.org/wiki/Special-purpose_acquisition_company\") -- but otherwise pretty close to the distribution over all quarterly links.\n",

Contributor Author

this is part of the coverage evaluation though -- if I move coverage to another notebook, should I find a way to keep this info here? any thoughts on how to frame it, if not as part of coverage?

Comment thread 07-sec10k-use-cases.ipynb
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"#### Within industries we most associate with electric utilities, what percent of SEC filers have links to an EIA utility ID?"

Contributor Author

in technical terms we mean high-precision, low-recall:

  • precision: (# of correct predictions) / (number of predicted matches)
  • recall: (# of correct predictions) / (total # of ground truth matches)

in general-audience terms we mean:

  • if we say there's a match, we're probably right
  • if we don't say there's a match, it doesn't mean much, a match still might exist
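
A small worked example of the two definitions, as a sketch only: the identifier pairs below are invented for illustration and `precision_recall` is a hypothetical helper, not part of the matching pipeline.

```python
def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    """Compute precision and recall for a set of predicted matches."""
    correct = predicted & truth
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


# Invented (SEC filer, EIA utility) pairs, for illustration only.
truth = {("sec_a", "eia_1"), ("sec_b", "eia_2"), ("sec_c", "eia_3"), ("sec_d", "eia_4")}
predicted = {("sec_a", "eia_1"), ("sec_b", "eia_2")}

precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```

Here every predicted match is correct (high precision), but half of the true matches are never predicted (low recall), which is the behaviour described above.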

@krivard krivard marked this pull request as ready for review February 16, 2026 16:38
Projects

Status: In progress

3 participants