Conversation
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
In its final version, we'll want to add some copy up top about what the data is, where to find documentation, and what we're going to do with it.
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
Is central_index_key a field that someone working with the data would be comfortable with? Without more context, it's not clear to me why this is an important first step or what the motivation is here.
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
Does this wind up getting used somewhere? It's not clear to me what the application is at present.
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
This feels more oriented towards characterizing data completeness. Narratively, I think it'd make more sense for the notebook to focus on data utilization - e.g., how do I get total generation for a series of nested entities, or get a list of all plants owned by one entity and its subsidiaries? You could focus on one entity for narrative simplicity.
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
Same comment as above - this feels like characterizing data quality, not demonstrating how to best use the data.
| @@ -0,0 +1,642 @@ | |||
| { | |||
There was a problem hiding this comment.
Looking good! Things are flowing well and the examples are very useful.
I have a high-level question about the relevance of the middle section for this notebook (characterizing whether sector codes make sense), and a couple of thoughts on context that would be helpful to add, but I like the way this is developing!
One thing I would love to see that isn't in your outline is leveraging the connection between the datasets - e.g., how much did a parent and its subsidiaries spend on X rate category overall (from the FERC connection)? How many plants are ultimately controlled by X parent company? Etc.
| "# Introduction\n", | ||
| "\n", | ||
| "Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n", | ||
| "Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n", |
There was a problem hiding this comment.
Non-blocking: could link to the main data page somewhere and to our docs page elsewhere?
| "\n", | ||
| "Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n", | ||
| "Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n", | ||
| "PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n", |
There was a problem hiding this comment.
Do we formally consider SEC10k to be in beta?
(It'd be good to have criteria for what we mean by this, but that's a different question).
There was a problem hiding this comment.
In our docs we say:
> We only conducted an initial round of modeling, so this dataset is a beta version and its contents and connections to other datasets are probabilistic in nature.
| "Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n", | ||
| "Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n", | ||
| "PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n", | ||
| "Four output tables are available:\n", |
There was a problem hiding this comment.
Non-blocking total nit: Would be helpful to add an extra line break here to make this a distinctive paragraph. Or to make the paragraph break start at "PUDL has..."
| "\n", | ||
| "Utilities are often part of a nested hierarchy of holding companies and subsidiaries which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.\n", | ||
| "Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.\n", | ||
| "PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n", |
There was a problem hiding this comment.
Possibly the first time we introduce PUDL we want to spell it out and link to the main read the docs index.
| "PUDL has extracted a beta version of several tables of corporation data from the SEC Form 10-K and its attachments, including a conservative set of links likely to exist between SEC-identified entities and those from FERC and the EIA.\n", | ||
| "Four output tables are available:\n", | ||
| "\n", | ||
| "* `out_sec10k__quarterly_filings`: information about the Form 10-K filings themselves (filing date, subversion of the 10-K used, source URL, etc)\n", |
There was a problem hiding this comment.
sub-version maybe? There are a bunch of different versions of Form 10-K:
- 10-k: the standard annual report.
- 10-k/a: an amended version of the annual report.
- 10-k405: filed to report insider trading that was not reported in a timely fashion.
- 10-k405/a: an amended version of the 10-k405.
- 10-kt: submitted in lieu of or in addition to a standard 10-K annual report when a company changes the end of its fiscal year (e.g. due to a merger) leaving the company with a longer or shorter reporting period.
- 10-kt/a: an amended version of the 10-kt.
- 10-ksb: the annual report for small businesses, also known as penny stocks.
- 10-ksb/a: an amended version of the 10-ksb.
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", |
There was a problem hiding this comment.
Sure can, but what is the motivation for doing so wrt the rest of our data or the broader SEC data landscape? Might be helpful to spell it out a bit more.
There was a problem hiding this comment.
I don't actually know! ideas?
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "31081288-d9d1-400d-9b98-f28562b29950", |
There was a problem hiding this comment.
Personally, I don't think we need to do this 3+ times; if having the codes somewhere is helpful, we could just leave a list for people to work from.
There was a problem hiding this comment.
cut the Fuel section?
| "id": "f226a7bf-c548-4849-9f3a-608ab11bc546", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "The strong presence of natural gas transmission and distribution in this list is notable.\n", |
There was a problem hiding this comment.
I would consider dropping the items with a count below 2; they seem kind of random and mainly make me wonder whether the industry codes are useful or reliable.
There was a problem hiding this comment.
hmmm I'm reluctant to drop 10% of an already tiny sample but I also don't like that the small count items might jeopardize the perceived reliability of the data
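One way to weigh this trade-off might be to report how much of the sample a count threshold removes before deciding. A minimal sketch, assuming the industry names live in a pandas Series; the toy data, the `drop_rare` helper, and the `min_count` parameter are all hypothetical and not taken from the notebook:

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's industry-name column.
industries = pd.Series(
    ["natural gas transmission"] * 5
    + ["electric services"] * 4
    + ["blank checks"]
)


def drop_rare(values: pd.Series, min_count: int = 2) -> tuple[pd.Series, float]:
    """Return value counts at or above min_count and the share of rows dropped."""
    counts = values.value_counts()
    kept = counts[counts >= min_count]
    dropped_share = 1 - kept.sum() / counts.sum()
    return kept, dropped_share


kept, dropped_share = drop_rare(industries)
print(kept)
print(f"dropped {dropped_share:.0%} of rows")
```

Printing the dropped share next to the filtered table would let a reader see what the cutoff costs without the low-count rows cluttering the output.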
| "output_type": "stream", | ||
| "text": [ | ||
| "150505 subsidiaries of companies in electricity industries\n", | ||
| " 5023 subsidiaries where SIC is known\n", |
There was a problem hiding this comment.
Could be helpful to print percentages here as well
There was a problem hiding this comment.
what denominator(s) would make sense here?
- the total number of subsidiaries in the dataset, across parent companies from all industries
- the number of subsidiaries of companies in electricity industries (line 1)
- line-dependent
- subsidiaries of companies in electricity industries: out of all subsidiaries anywhere
- subsidiaries where SIC is known: out of line 1
- subsidiaries with unknown SIC known to be an EIA utility: out of line 1
- subsidiaries with insufficient industry metadata to decide either way: out of line 1
- subsidiaries where SIC is not an electricity industry: out of line 2
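If the line-dependent option wins out, a small formatting helper could keep each printed count paired with its own denominator. A rough sketch only: `with_pct` is a hypothetical helper, and the two counts are the ones quoted in the output above.

```python
def with_pct(count: int, denominator: int, label: str) -> str:
    """Format a count alongside its share of the chosen denominator."""
    pct = count / denominator if denominator else float("nan")
    return f"{count:>7} ({pct:6.1%}) {label}"


# Counts taken from the notebook output quoted above.
n_electric_subs = 150505
n_sic_known = 5023

# "SIC is known" is reported out of line 1, per the line-dependent option.
line = with_pct(n_sic_known, n_electric_subs, "subsidiaries where SIC is known")
print(line)
```

Passing the denominator explicitly at each call makes the choice visible in the code, which may help if different lines end up using different bases.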
| } | ||
| ], | ||
| "source": [ | ||
| "xindustry_electricity_as_subsidiary[[\"parent_company_industry_name_sic\",\"parent_company_industry_id_sic\"]].value_counts()" |
There was a problem hiding this comment.
Do we have a good example of a utility whose parent or subsidiary is something that has generated clearly biasing interests and caused a scandal? Would be helpful/grounding to flag in this section.
There was a problem hiding this comment.
I'm largely out of my depth there but one of the plans for the "leveraging subsidiary relationships" section below was:
Easy-Medium - Example of multiple layers of subsidiary nesting (Berkshire Hathaway → BHE → PacifiCorp)
- Pitfall here is that multiple layers get reported at the top level. Potential double-counting.
- Need a good way to display the output of BH that’s not overwhelming / better signal to noise
So probably not in this round, but might make a good first contribution
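For whoever picks this up, the double-counting pitfall could be handled by walking the hierarchy and collecting each subsidiary in a set. A minimal sketch under stated assumptions: the edge list is a hypothetical illustration (the third row mimics a lower-tier subsidiary that is also reported at the top level), and `all_subsidiaries` is not an existing PUDL function.

```python
from collections import defaultdict

# Hypothetical (parent, subsidiary) edges. The last row mimics a lower-tier
# subsidiary that is also reported directly under the top-level parent.
edges = [
    ("Berkshire Hathaway", "BHE"),
    ("BHE", "PacifiCorp"),
    ("Berkshire Hathaway", "PacifiCorp"),
]


def all_subsidiaries(root: str, edges: list[tuple[str, str]]) -> set[str]:
    """Collect every direct or indirect subsidiary of root exactly once."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen: set[str] = set()
    stack = [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:  # skip entities already reached via another path
                seen.add(child)
                stack.append(child)
    return seen


subs = all_subsidiaries("Berkshire Hathaway", edges)
print(sorted(subs))
```

Because the result is a set, an entity reachable through several layers appears only once, so any downstream aggregation over it would not count the same subsidiary twice.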
krivard
left a comment
There was a problem hiding this comment.
Going to work on the easy fixes but had some questions & clarifications
| "* `out_sec10k__quarterly_filings`: information about the Form 10-K filings themselves (filing date, subversion of the 10-K used, source URL, etc)\n", | ||
| "* `out_sec10k__quarterly_company_information`: attributes describing the companies which file 10-K’s\n", | ||
| "* `out_sec10k__parents_and_subsidiaries`: ownership information about parent companies and their subsidiary companies\n", | ||
| "* `out_sec10k__changelog_company_name`: information about company name changes\n", |
| "source": [ | ||
| "Do companies sometimes change their industry code across filings?\n", | ||
| "\n", | ||
| "/is the number of unique (company, industry) pairs greater than the number of unique companies?" |
There was a problem hiding this comment.
that's leftover shorthand for "in other words," will fix
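The rephrased question could be checked directly in pandas. A rough sketch on toy data: `central_index_key` is the field discussed earlier in this thread, while `industry_id_sic` is an assumed column name and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy filings; `industry_id_sic` is an assumed column name.
filings = pd.DataFrame(
    {
        "central_index_key": ["0001", "0001", "0002", "0002"],
        "industry_id_sic": ["4911", "4911", "4911", "4931"],
    }
)

n_companies = filings["central_index_key"].nunique()
n_pairs = len(filings[["central_index_key", "industry_id_sic"]].drop_duplicates())

# Companies that report more than one distinct industry code.
codes_per_company = filings.groupby("central_index_key")["industry_id_sic"].nunique()
changers = codes_per_company[codes_per_company > 1].index.tolist()

print(n_pairs > n_companies, changers)
```

One caveat: `drop_duplicates` keeps rows with a missing code as their own pair, while `nunique` ignores missing values by default, so the two counts can disagree when industry codes are null.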
| "id": "75c3983d-7bc1-4e22-b7bb-3e9fee5101e3", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "lol, \"blank checks\"[<sup id=\"fn1-back\">1</sup>](#fn1 \"https://en.wikipedia.org/wiki/Special-purpose_acquisition_company\") -- but otherwise pretty close to the distribution over all quarterly links.\n", |
There was a problem hiding this comment.
this is part of the coverage evaluation though -- if I move coverage to another notebook, should I find a way to keep this info here? any thoughts on how to frame it, if not as part of coverage?
| "jp-MarkdownHeadingCollapsed": true | ||
| }, | ||
| "source": [ | ||
| "#### Within industries we most associate with electric utilities, what percent of SEC filers have links to an EIA utility ID?" |
There was a problem hiding this comment.
in technical terms we mean high-precision, low-recall:
- precision: (# of correct predictions) / (number of predicted matches)
- recall: (# of correct predictions) / (total # of ground truth matches)
in general-audience terms we mean:
- if we say there's a match, we're probably right
- if we don't say there's a match, it doesn't mean much, a match still might exist
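To make the two definitions concrete, here is a small worked sketch. The matched pairs are hypothetical placeholders, not real SEC-to-EIA links, and `precision_recall` is an illustrative helper rather than anything in the codebase.

```python
def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    """Compute precision and recall for predicted matches against ground truth."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else float("nan")
    recall = correct / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical (SEC filer, EIA utility) pairs, for illustration only.
truth = {("sec_a", "eia_1"), ("sec_b", "eia_2"), ("sec_c", "eia_3"), ("sec_d", "eia_4")}
predicted = {("sec_a", "eia_1"), ("sec_b", "eia_2")}

precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted match is right (precision 1.00) but half of the true matches are missed (recall 0.50), which is the high-precision, low-recall shape described above.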
Overview
Implement the sec10k notebook spec from October 2025
What problem does this address?
What did you change in this PR?
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list
Easy
Needs discussion / future work