Add core_eia176__yearly_company_characteristics #5197
Extracts operational and ownership characteristics from EIA-176 Part 3 (Lines B–F) into a new annual company-level table. Includes boolean is_* columns for operation/ownership types, other_ownership_description, and has_alternative_fuel_fleet derived from the company data table.
…76__yearly_company_characteristics
- Use map({1.0: True}) instead of eq(1.0) for has_alternative_fuel_fleet so years
where the question wasn't asked (1997-2004, 2016-2024) remain NULL rather than False
- Set operating_state to sa.Enum(...) and nullable=False in migration; all is_* columns
to nullable=False to match actual transform output
- Add required: True and enum constraints to FIELD_METADATA_BY_RESOURCE so metadata
matches migration schema (fixes test_migrations_match_metadata)
- Add not_null dbt tests for all is_* columns
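The difference between the two approaches in the first bullet can be seen on a toy series. This is a minimal sketch with made-up values, not the actual transform code: `eq(1.0)` turns missing values into `False`, while `map({1.0: True})` leaves anything not in the dict as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: 1.0 where the box was checked, NaN where the
# question was not asked or was left blank.
raw = pd.Series([1.0, np.nan, 1.0], name="has_alternative_fuel_fleet")

# eq() compares element-wise, so NaN compares unequal and becomes False.
as_eq = raw.eq(1.0)

# map() with a dict leaves every value that is not a key of the dict as NaN,
# so rows with no answer stay missing instead of turning into False.
as_map = raw.map({1.0: True})

print(as_eq.tolist())
print(as_map.isna().tolist())
```

Note that `map({1.0: True})` also maps any other non-missing value (for example `0.0`) to missing, which is only appropriate if the raw column never encodes an explicit "no".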
401c5d5 to
255a29c
Compare
…ny_characteristics
…ne from v2026.3.0
… onto current head
Hi @e-belfer, ready for a review!
e-belfer
left a comment
I think this is off to a good start!
My main question is that at present this covers Part 3 Line A and a bit of B of the original survey, but the description and the original issue indicate this table should cover A-F.
This would mean including the following variables:
- alternative_fleet_size (Part B)
- customer_choice_residential_eligible (Part C)
- customer_choice_residential_participating (Part C)
- sales_acquisitions_1_yes_0_no (Part D)
- natural_gas_pump_price (not on the form but provided through portal)
Part E might require its own table with a PK of operator ID, year and county. However, the rest of these fields can/should probably fit into this existing table. I was only able to find this data in the bulk file and not the report downloads, so this is probably out of scope for this issue. This is also true for the customer choice commercial eligible/participating fields. See #4729.
A few other minor notes:
- See my comment below about the failing operator state ENUM constraint.
- Given the total non-overlap of the other_ownership fields 1 and 2, I think we can safely combine them into one boolean field. This appears identically in the raw data, so it isn't caused by a mismapping on our end but the secondary field only appears in 2016 and isn't conveying any additional information.
- Verified in the raw data that provision of the "other description" column happens often without checking the "is_other" box - I don't think we need any action here, the current status seems fine.
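The suggested combination of the two other_ownership fields could look roughly like the following. This is a sketch on a made-up three-row frame that assumes both columns are already plain booleans; the real transform may handle missing values differently.

```python
import pandas as pd

# Made-up rows: the two flags never co-occur, mirroring what the review
# describes for the raw data.
df = pd.DataFrame(
    {
        "is_other_ownership": [True, False, False],
        "is_other_ownership_2": [False, True, False],
    }
)

# Guard the assumption that the flags never overlap before merging them.
assert not (df["is_other_ownership"] & df["is_other_ownership_2"]).any()

# OR the secondary flag into the primary one, then drop the redundant column.
df["is_other_ownership"] = df["is_other_ownership"] | df["is_other_ownership_2"]
df = df.drop(columns=["is_other_ownership_2"])
```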
| "natural_gas_other_disposition_items": None, | ||
| "natural_gas_supply_items": None, | ||
| "operation_types_and_sector_items": None, | ||
| "type_of_operations_and_sector_items": "operation_types_and_sector_items", |
Great catch about the misname. We actually created this problem by naming the column mapping CSV type_of_operations_and_sector_items - rather than performing a rename here, you could fix this problem at the source by renaming pudl/package_data/eia176/column_maps/type_of_operations_and_sector_items.csv to pudl/package_data/eia176/column_maps/operation_types_and_sector_items.csv
| "type": "boolean", | ||
| "description": "Whether the company operates a public compressed natural gas (CNG) fueling station.", | ||
| }, | ||
| "is_public_liquid_natural_gas_fueling_station": { |
| "is_public_liquid_natural_gas_fueling_station": { | |
| "is_public_lng_fueling_station": { |
Elsewhere in the fields we use lng and I think that's a fine abbreviation here and elsewhere, especially with your helpful field definition.
| "core_eia176__yearly_company_characteristics": { | ||
| "operating_state": { | ||
| "description": "State that the operator is reporting for.", | ||
| "constraints": {"required": True, "enum": SUBDIVISION_CODES_ISO3166}, |
This constraint is currently failing with the following error:
ValueError: Values in operating_state column are not included in categorical values in field enum constraint and will be converted to nulls (['FX', 'OO', 'BL', 'MX']).
These all appear to be derived from adjustment records. I'm relatively confident that MX is Mexico, but I wasn't able to track down precise confirmation of any of these - the best way to confirm this would be to email the EIA and ask (eiainfonaturalgas@eia.gov).
If we're able to map these to specific regions (e.g., federal adjustment, adjustment from Mexican imports) we should keep them and expand the enum. Otherwise, we should null them.
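For illustration, the two options above (inspect the offending codes, then either keep or null them) can be sketched as follows. `valid_codes` is a tiny stand-in for `SUBDIVISION_CODES_ISO3166`, and the series values are invented.

```python
import pandas as pd

# Stand-in for SUBDIVISION_CODES_ISO3166; the real constant is much longer.
valid_codes = {"CO", "TX", "NY"}

# Invented sample including the four codes named in the error message.
operating_state = pd.Series(["CO", "FX", "TX", "MX", "OO", "BL", "NY"])

# List the values that would violate the enum constraint.
unexpected = sorted(set(operating_state.dropna()) - valid_codes)

# Option 2 from the comment: null anything outside the enum.
nulled = operating_state.where(operating_state.isin(valid_codes))
```

Expanding the enum instead would amount to adding the confirmed codes to the allowed set rather than calling `where`.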
| "additional_summary_text": ( | ||
| "a company's operational and ownership characteristics." | ||
| ), | ||
| "additional_source_text": "(Part 3, Lines B–F)", |
This actually covers Part 3, Line A and a bit of Line B at present. Were you planning to add other columns to cover the rest of B-F?
Rename the column-map CSV from type_of_operations_and_sector_items to
operation_types_and_sector_items so the internal page key matches the
shorter name used throughout PUDL. Add a source_filename special case
in the extractor to map the page key back to EIA's original ZIP filename
(eia176_{year}_type_of_operations_and_sector_items.csv). Simplify the
asset dict entry to use None instead of an explicit out_page alias.
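A rough sketch of the special case described in this commit message is shown below. The helper name `source_filename`, its signature, the `SOURCE_PAGE_NAMES` dict, and the default `eia176_{year}_{page}.csv` pattern for other pages are all assumptions made for illustration; only the `type_of_operations_and_sector_items` filename comes from the commit message.

```python
# Hypothetical mapping from PUDL page keys to the page names EIA uses inside
# its ZIP archives. Only pages whose names differ need an entry.
SOURCE_PAGE_NAMES = {
    "operation_types_and_sector_items": "type_of_operations_and_sector_items",
}


def source_filename(page: str, year: int) -> str:
    """Return the CSV filename inside EIA's ZIP archive for a page and year.

    Assumes every page follows the eia176_{year}_{page}.csv pattern unless it
    is listed in SOURCE_PAGE_NAMES.
    """
    source_page = SOURCE_PAGE_NAMES.get(page, page)
    return f"eia176_{year}_{source_page}.csv"
```

Keeping the exception in a small lookup means the shorter page key can be used everywhere else in the pipeline while the extractor still finds EIA's original file.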
- Rename is_public_liquid_natural_gas_fueling_station to is_public_lng_fueling_station for consistency with other LNG fields
- Merge is_other_ownership_2 into is_other_ownership in transform; the two fields never co-occur and _2 only appears in 2016 (27 rows)
- Remove is_other_ownership_2 field definition and all required: True overrides from FIELD_METADATA_BY_RESOURCE (no-op for Parquet-only tables)
- Remove operating_state enum constraint pending EIA clarification of non-US codes FX/OO/BL/MX; add explanatory comment referencing catalyst-cooperative#4729
- Expand resource description to document bulk-only fields excluded from scope (Lines B-D) with reference to catalyst-cooperative#4729
- Log row count before/after dropna on operating_state
- Drop is_other_ownership_2 column (merged into is_other_ownership)
- Rename is_public_liquid_natural_gas_fueling_station to is_public_lng_fueling_station
- Change operating_state from Enum to Text (enum constraint removed pending EIA response)
- Set all non-PK columns to nullable=True to match current Python metadata
@e-belfer Investigated all five fields by checking every layer of the pipeline: all
Removed the enum constraint for now and added a comment explaining the Combine Done - OR-merged the two columns in transform then dropped Page key fix Took your suggestion - renamed the column-map CSV to
Renamed
Removed all Drop count logging Replaced the silent |
Scanned all raw EIA-176 operation_types_and_sector_items CSVs (1997–2024, 55,589 rows total) and found 0 null state values, so the dropna was defensive and never dropped anything. Replaced the log with assert null_operating_state == 0 so the pipeline will fail loudly if that ever changes. I am still waiting on Emmanuel Eboweme, a Survey Statistician at EIA, to confirm the operating state anomalies; the ETA is next Tuesday.
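The "fail loudly" check described above might be sketched like this. The function name `check_no_null_operating_state` is hypothetical; only the `null_operating_state == 0` assertion comes from the comment.

```python
import pandas as pd


def check_no_null_operating_state(df: pd.DataFrame) -> None:
    """Raise if any row is missing operating_state (hypothetical helper)."""
    null_operating_state = int(df["operating_state"].isna().sum())
    # Assert instead of silently dropping, so an upstream change in the raw
    # data is surfaced rather than hidden.
    assert null_operating_state == 0, (
        f"Found {null_operating_state} rows with a null operating_state."
    )
```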
Hi @irubey, I'm a little confused about what you mean by this data not existing - these fields all exist in |
Add alternative_fleet_size, customer_choice_residential_eligible, customer_choice_residential_participating, has_sales_or_acquisitions, and natural_gas_pump_price to core_eia176__yearly_company_characteristics across transform, metadata, migration, dbt schema, and release notes. Drop national-level adjustment records (FX/MX/BL/OO) now that EIA has confirmed their meaning; restore the operating_state ENUM constraint. Update row counts to reflect the drops. Fix natural_gas_pump_price unit from USD_per_mcf to USD_per_Mcf.
Update 48fade8aeee8 to depend on f98868d3f5cb (forensics migration from main) instead of 4f252e9e2ce3, making the migration chain linear.
Hi @e-belfer, Two updates since the last review: Parts B-F fields
EIA response on non-US state codes Heard back from Emmanuel Eboweme (Survey Statistician, EIA) on May 4:
He confirmed these are national-level adjustment records and should be excluded from state-level analysis. They are now explicitly dropped in the transform with a logged count, and the |
| # Drop national-level adjustment records — see NATIONAL_ADJUSTMENT_STATE_CODES | ||
| n_national = df["operating_state"].isin(NATIONAL_ADJUSTMENT_STATE_CODES).sum() | ||
| logger.info( |
To do: add assertion here about expected drop count
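One way the requested assertion could sit alongside the logged drop is sketched below. The function name, the `expected_drops` parameter, and the list form of `NATIONAL_ADJUSTMENT_STATE_CODES` are assumptions; the four codes are the ones named in this PR, and the real expected count would come from the data.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Codes named in this PR; the real constant's type and location may differ.
NATIONAL_ADJUSTMENT_STATE_CODES = ["FX", "MX", "BL", "OO"]


def drop_national_adjustments(df: pd.DataFrame, expected_drops: int) -> pd.DataFrame:
    """Drop national-level adjustment rows and check the drop count."""
    is_national = df["operating_state"].isin(NATIONAL_ADJUSTMENT_STATE_CODES)
    n_national = int(is_national.sum())
    logger.info("Dropping %d national-level adjustment records.", n_national)
    # Fail if the number of dropped rows drifts from what was observed before.
    assert n_national == expected_drops, (
        f"Expected to drop {expected_drops} rows, but found {n_national}."
    )
    return df.loc[~is_national].reset_index(drop=True)
```

An exact-match assertion is the strictest choice; an upper bound would tolerate new report years adding a few more adjustment rows.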
| "customer_choice_residential_participating", | ||
| ]: | ||
| if col in df.columns: | ||
| df[col] = df[col].astype(pd.Int64Dtype()) |
To do: no need to cast dtype, this is already accounted for in writing a schema.
| "Whether the utility plans to operate alternative-fueled vehicles this coming year." | ||
| ), | ||
| }, | ||
| "alternative_fleet_size": { |
To do: drop the line numbers, not consistent with rest of fields
| "type": "number", | ||
| "description": ( | ||
| "Price of natural gas at public fueling stations operated by the company " | ||
| "(EIA Form 176 Part 3, Line E). Reported 2014-2016 only." |
To do: move year warning out into table-level description
| @@ -0,0 +1,75 @@ | |||
| version: 2 | |||
| sources: | |||
To do: add human schema version
aesharpe
left a comment
Hi @irubey, sorry for the long delay. I don't have a whole lot more to add here. Let me know if you'd like to jump back in and make these changes yourself; otherwise I can go ahead and do the final touches! Thanks for all your work here 🙏.
| "is_gatherer": { | ||
| "type": "boolean", | ||
| "description": "Whether the company operates as a natural gas gatherer.", | ||
| }, |
Recommend specifying what a gatherer is if possible
| raw_eia176__operation_types_and_sector_items: Raw EIA-176 RP4 table; primary | ||
| source for all ``is_*`` columns and ``operating_state``. |
Overview
Closes #4697.
What problem does this address?
EIA Form 176 Part 3 (Lines A-B) contains company-level characteristics
(operation type, ownership type, and alternative fuel fleet) but this data
was never extracted into a PUDL output table. The
raw_eia176__operation_types_and_sector_items asset (added in #4710) was also never consumed downstream because its extract
page key was wrong, silently producing an empty table.
What did you change?
- core_eia176__yearly_company_characteristics (55,589 rows, 1997-2024), with one row per (report_year, operator_id_eia) covering 15 boolean operation/ownership flags, operating_state, other_ownership_description, and has_alternative_fuel_fleet.
- operation_types_and_sector_items.csv and adding a source_filename mapping in the extractor to translate back to EIA's ZIP filename eia176_{year}_type_of_operations_and_sector_items.csv.
- is_other_ownership_2 (appeared only in 2016, never co-occurring with is_other_ownership) into is_other_ownership via OR in transform, then dropped the redundant field.
- is_public_liquid_natural_gas_fueling_station to is_public_lng_fueling_station at the column-map level.
- operating_state ENUM constraint pending EIA clarification on four anomalous codes (FX, OO, BL, MX) found on adjustment placeholder records. Emailed eiainfonaturalgas@eia.gov; will restore or filter once confirmed.
- src/pudl/metadata/fields.py and defined the table in src/pudl/metadata/resources/eia176.py, including documentation of which Parts B-D fields are excluded (bulk-download-only) with reference to Investigate bulk vs report data from EIA #4729.
- 48fade8aeee8.
Documentation
- src/metadata). Done: fields and resource metadata added.
Testing
- core_eia176__yearly_company_characteristics via Dagster locally: 55,589 rows, 1997-2024, 0 null operating_state values.
- "X" -> True, NaN -> False), and 1.0 float artifact replacement in other_ownership_description.
To-do list
- dbt tests.
- pixi run prek-run to run linters and static code analysis checks.
- pixi run pytest-ci locally to ensure that the merge queue will accept your PR.