Add SEC 10-K data source documentation #4562
katie-lamb
left a comment
I left some edits and suggestions on the data source docs page. I also will add a section to the methodology page and irregularities section of the data source page on how we don't parse through nested layers of subsidiaries.
| @@ -851,19 +851,9 @@ | |||
| "title": "U.S. Securities and Exchange Commission Form 10-K", | |||
Nit but it would be nice to have "SEC" in the title to make it more searchable and have it match the other data source titles which have the dataset abbreviation. "U.S. Securities and Exchange Commission (SEC) Form 10-K" is my suggestion.
| First, metadata related to each filing was parsed out of the plaintext headers of the | ||
| HTML documents. Some of these headers pertain to the filing as a whole and others | ||
| describe attributes of the individual companies associated with the filing. Each filing | ||
| has a single company that is the primary filer, and the filing is associated with their | ||
| `Central Index Key <https://www.sec.gov/search-filings/cik-lookup>`__ (CIK) -- a | ||
| persistent company identifier assigned by the SEC that is more durable and standardized | ||
| than the company name. |
To me this Background section feels a little too long and the level of detail makes it a bit confusing. I left some suggestions for the next three paragraphs:
"First, metadata related to each filing and the companies associated with the filing was parsed out of the plaintext headers of the HTML documents. Each filing has a single company that is the primary filer, and the filing is associated with their `Central Index Key <https://www.sec.gov/search-filings/cik-lookup>`__ (CIK) -- a persistent company identifier assigned by the SEC that is more durable and standardized than the company name."
The sentence about the headers pertaining to the filing as a whole vs attributes of the individual filers doesn't feel necessary and was a bit confusing. It could belong in the table level doc string of the filing info table and/or the company info table?
| EDGAR database, including the Exhibit 21 attachments. We extract two kinds of data from | ||
| this raw data source using different methods. |
Suggestion:
We extract two kinds of data from this raw data source using different methods.
-->
"We extract two kinds of data from this raw data source: metadata about the filing companies and filing itself, and ownership data about the company's subsidiaries."
| The plaintext headers are not necessarily intended to be machine readable, but they | ||
| are highly structured and allow us to compile a database of all SEC 10-K filings and the | ||
| companies involved. The ``core_sec10k`` tables derived from the headers provide the |
I think this first sentence "The plaintext headers..." could be cut and put into a table level docstring for the company info or filing info tables.
I felt like it was important to communicate that much of what we are doing with this data source is compiling a database of SEC 10-K filers & filings, and doing it in a kind of opportunistic way -- by parsing these headers, rather than pulling from some canonical API or existing SEC database that provides the company information. I think this becomes important later on when trying to interpret the company information table, because it helps explain why there are multiple not necessarily identical entries for the same companies in association with many different filings.
| companies involved. The ``core_sec10k`` tables derived from the headers provide the | ||
| context that's necessary to link the subsidiary company information extracted from | ||
| Exhibit 21 to other sets of companies, including the SEC 10-K filers themselves, as well | ||
| as companies that file the EIA Form 860. |
I think this paragraph strays away from the main focus of this section which is the two kinds of data that we extract from the raw data source. To be more focused, I think we could move this sentence on using info pulled from headers to link to EIA into the linkage section below.
|
|
||
| No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21 | ||
| attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of | ||
| the main filer with the name and location of the subsidiary company as observed in the |
I would say:
"of the parent company" or maybe "of the parent company who is filing" instead of "main filer". I'm finding "main filer" to be confusing since we've only mentioned one or two times previously in this data source page that there is more than one company in a filing. Also, we've made the assumption (and it seems backed up by SEC filing language) that the main filer is the owner company whose subsidiaries are reported in the attached Ex. 21
I lean toward using vocabulary consistent with the SEC -- it would be weird for someone familiar with the SEC 10-K to be excited to use our data only to find we were using different words for central concepts.
| ------------------------------------------------------- | ||
|
|
||
| No meaningful company ID is reported for subsidiaries that appear in the Exhibit 21 | ||
| attachments. We construct an ad-hoc ID by concatenating the Central Index Key (CIK) of |
Maybe:
"We construct an ad-hoc ID for each subsidiary company by..."
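For readers following this thread, here is a minimal sketch of the ad-hoc ID construction the hunk describes: concatenating the filer's CIK with the subsidiary's name and location as observed in Exhibit 21. The function names, the separator, the zero-padding, the normalization, and the example values are all assumptions made for illustration; this is not the code in this PR.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, hypothetical helper)."""
    return "_".join(text.lower().split())


def make_subsidiary_id(filer_cik: str, name: str, location: str) -> str:
    """Build an illustrative ad-hoc subsidiary ID (hypothetical helper)."""
    # Assumption: the CIK is zero-padded to 10 digits so IDs have a uniform prefix.
    cik = str(filer_cik).strip().zfill(10)
    # Concatenate the filer CIK with the subsidiary name and location.
    return f"{cik}_{_normalize(name)}_{_normalize(location)}"


example_id = make_subsidiary_id("12345", "Acme  Power LLC", "Delaware")
```

An ID built this way is only as stable as the reported strings: if the same subsidiary is spelled differently in another filing, or reported by a different filer, it would get a different ID.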
| confidently how much of each subsidiary the parent company owns. Anecdotally, based on | ||
| outside information, many of the subsidiaries that do not report an ownership fraction | ||
| seem to be entirely owned by a single parent company, but we don't know how common that | ||
| really is. |
It's extremely common for an Ex. 21 to say at the bottom "all subsidiaries are wholly owned" or something like that. So I wouldn't say this is "outside information". Maybe simplify this to say:
"When ownership fractions are not included, it's very common for an Ex. 21 to include a statement that all subsidiaries are wholly owned by the parent, but our existing model does not detect when such a statement is present."
| is part of. However, both columns contain significant numbers of null values and the | ||
| name field is not entirely standardized. Further cleaning of these columns is needed to |
This high null percentage is really because of a lack of reporting in the raw data right? Or are we not correctly grabbing this from headers?
I haven't gone back to look at the raw headers to see if we're somehow not extracting data that's there. But I wouldn't be surprised at all if there are just a lot of truly missing values.
In the code changes in this PR I did go ahead and standardize the names (based on canonical names from SEC), fill in missing names when there was a code, and forward/backward fill gaps where the values before and after the gap were the same. But there are still a lot of missing values because there were significant numbers of companies that had never reported a SIC code.
So the bit in this paragraph about needing to do further cleaning / standardization can probably be removed.
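As a rough illustration of the gap-filling rule described above (fill a run of missing values only when the values immediately before and after the gap agree), here is a plain-Python sketch. It assumes the input is one company's values ordered by filing date, with `None` marking a missing value; the function name is hypothetical, and the PR's actual implementation may look quite different.

```python
def fill_bounded_gaps(values):
    """Fill runs of None only when the nearest non-null values on both sides agree."""
    filled = list(values)
    last_idx = None  # index of the most recent non-null value seen so far
    for i, value in enumerate(values):
        if value is None:
            continue
        # There is a gap if at least one null sits between last_idx and i.
        has_gap = last_idx is not None and i - last_idx > 1
        if has_gap and values[last_idx] == value:
            for j in range(last_idx + 1, i):
                filled[j] = value
        last_idx = i
    return filled


sic_history = ["4911", None, None, "4911", None, "4931", None]
filled_history = fill_bounded_gaps(sic_history)
```

Gaps bounded by two different values, and leading or trailing gaps, are left as missing, which is consistent with the note that many values remain null for companies that never reported a SIC code.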
|
|
||
| {% endblock %} | ||
|
|
||
| {% block notable_irregularities %} |
I'm finding this section a little overwhelming and hard to parse through. Can the irregularities that pertain to record linkage be put in a different section specific to record linkage? That would be:
- The SEC 10-K filer to EIA utility linkage is based only on 2023 data
- Matches between EIA utility ids and SEC 10-K energy companies are sparse (and everything under this)
Reorganized as suggested (with some minor exceptions for things below Matches that didn't seem to be about linkages)
| This is a little trickier to evaluate. SEC filers use a four-digit Standard Industrial | ||
| Classification (SIC) code to identify the company's primary industry, but not all utilities | ||
| file under the SICs for electric services (4911 and 4931), and not all companies that | ||
| file as electric services are utilities. We see SICs as unexpected as computer storage | ||
| devices, nursing facilities, and real estate among our -- high-confidence -- matches to EIA | ||
| utilities. |
An intuitive expectation that I have, based on what I know about the SEC 10-K filers and from browsing through the companies listed under SICs 4911 and 4931, is that a very high proportion of the filers in those industries should really show up in the set of EIA utilities. I'd expect that to hold to a much higher degree than the reverse (EIA utilities showing up as SEC 10-K filers), since there are so many smaller special-purpose companies that exist only to own a single generator or slices of some plant, or to act as an operator without owning the underlying asset that they're operating.
Meaning, if a company is explicitly in the electricity services business, AND it is big enough to be filing the SEC 10-K, then I would be pretty surprised if it doesn't also show up in the list of EIA utilities. Maybe there's some name or other data mismatch that's keeping us from identifying it correctly, but off the top of my head I'm not sure what the common case would be where the electricity company is an SEC 10-K filer and also doesn't report to EIA.
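One way to check this expectation would be to compute the share of electric-services filers that have an EIA match. The sketch below assumes each filer is a dict with hypothetical `sic` and `utility_id_eia` keys, and the records shown are made-up toy data, not results from the dataset.

```python
ELECTRIC_SICS = {"4911", "4931"}  # electric services SICs named in the docs


def electric_filer_match_rate(filers):
    """Return the fraction of electric-services filers with an EIA utility match."""
    electric = [f for f in filers if f["sic"] in ELECTRIC_SICS]
    if not electric:
        return None  # no electric-services filers to evaluate
    matched = sum(1 for f in electric if f["utility_id_eia"] is not None)
    return matched / len(electric)


# Toy records for illustration only (hypothetical field names and values).
toy_filers = [
    {"sic": "4911", "utility_id_eia": 101},
    {"sic": "4931", "utility_id_eia": None},
    {"sic": "4911", "utility_id_eia": 202},
    {"sic": "6798", "utility_id_eia": 303},  # non-electric SIC, ignored here
]
toy_rate = electric_filer_match_rate(toy_filers)
```

A low rate on the real data would point toward name or data mismatches in the linkage, rather than toward electric-services filers genuinely not reporting to EIA.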
Co-authored-by: Kathryn Mazaitis <1158666+krivard@users.noreply.github.com>
| :issue:`4165` and PR :pr:`4134`. This bug has been fixed, and the lost addresses can | ||
| be recovered by re-running the upstream extraction. | ||
|
|
||
| Industry classifications applied to companies have poor coverage |
I want to collect some more numbers on this, and add a sentence or two on the actual coverage
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##             main    #4562    +/-   ##
=========================================
+ Coverage   93.13%   93.14%   +0.01%
=========================================
  Files         199      199
  Lines       16922    16947     +25
=========================================
+ Hits        15760    15785     +25
  Misses       1162     1162
Overview
Adds a data source documentation page for the SEC 10-K data.
See also the marimo notebook in this PR
Closes #4347 #4329
Questions & Issues
Writing
Analysis
Communicating / Visualizing the EDA
What's the future of this dataset? How can it be improved?
Better explain use cases / blockers