Add core_eia176__yearly_company_characteristics #5197

Draft
irubey wants to merge 18 commits into catalyst-cooperative:main from irubey:4697-eia176-company-characteristics

Conversation

@irubey

irubey commented Apr 21, 2026

Contributor

Overview

Closes #4697.

What problem does this address?

EIA Form 176 Part 3 (Lines A-B) contains company-level characteristics
(operation type, ownership type, and alternative fuel fleet) but this data
was never extracted into a PUDL output table. The raw_eia176__operation_types_and_sector_items
asset (added in #4710) was also never consumed downstream because its extract
page key was wrong, silently producing an empty table.

What did you change?

  • Added core_eia176__yearly_company_characteristics (55,589 rows, 1997-2024),
    with one row per (report_year, operator_id_eia) covering 15 boolean operation/
    ownership flags, operating_state, other_ownership_description, and
    has_alternative_fuel_fleet.
  • Fixed the page key bug by renaming the column-map CSV to
    operation_types_and_sector_items.csv and adding a source_filename mapping
    in the extractor to translate back to EIA's ZIP filename
    eia176_{year}_type_of_operations_and_sector_items.csv.
  • Merged is_other_ownership_2 (appeared only in 2016, never co-occurring with
    is_other_ownership) into is_other_ownership via OR in transform, then
    dropped the redundant field.
  • Renamed is_public_liquid_natural_gas_fueling_station to
    is_public_lng_fueling_station at the column-map level.
  • Removed the operating_state ENUM constraint pending EIA clarification on
    four anomalous codes (FX, OO, BL, MX) found on adjustment placeholder
    records. Emailed eiainfonaturalgas@eia.gov; will restore or filter once confirmed.
  • Added 17 new fields to src/pudl/metadata/fields.py and defined the table
    in src/pudl/metadata/resources/eia176.py, including documentation of which
    Parts B-D fields are excluded (bulk-download-only), with reference to
    #4729 (Investigate bulk vs report data from EIA).
  • Added dbt schema with row-count and not-null tests.
  • Added Alembic migration 48fade8aeee8.
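
For illustration, the boolean conversion and OR-merge described in the list above might look roughly like the sketch below. It is not the PR's transform code: the toy DataFrame, its values, and the assumption that a checked box arrives as the string "X" in both ownership columns are stand-ins chosen for the example.

```python
import pandas as pd

# Toy stand-in for the raw table; values are illustrative, not real EIA data.
raw = pd.DataFrame(
    {
        "report_year": [2015, 2016, 2016],
        "operator_id_eia": ["op_1", "op_2", "op_3"],
        "is_other_ownership": ["X", None, None],
        "is_other_ownership_2": [None, "X", None],
    }
)

flag_cols = ["is_other_ownership", "is_other_ownership_2"]

# "X" marks a checked box; anything else (including missing values) is unchecked.
flags = raw[flag_cols].eq("X")

# OR the two flags together, then keep only the merged column.
merged = raw.drop(columns=flag_cols).assign(
    is_other_ownership=flags["is_other_ownership"] | flags["is_other_ownership_2"]
)
print(merged)
```

Because the two source columns never co-occur, the OR loses no information; it only folds the 2016-only column into the main one.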

Documentation

  • Update the release notes: reference the PR and related issues.
  • Update relevant table or source description metadata (see src/metadata). Done: fields and resource metadata added.
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

  • Materialized core_eia176__yearly_company_characteristics via Dagster locally:
    55,589 rows, 1997-2024, 0 null operating_state values.
  • Verified PK uniqueness (0 duplicates), correct boolean conversion ("X" -> True,
    NaN -> False), and 1.0 float artifact replacement in other_ownership_description.
  • dbt row-count seeds added for all 28 years.
  • dbt: 20/20 tests passed.
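
A primary-key uniqueness check of the kind mentioned above can be sketched in a few lines of pandas. The helper name `count_pk_duplicates` and the toy rows are hypothetical; only the key columns (report_year, operator_id_eia) come from the PR description.

```python
import pandas as pd

PK = ["report_year", "operator_id_eia"]


def count_pk_duplicates(df: pd.DataFrame, pk: list[str] = PK) -> int:
    """Return the number of rows that repeat an earlier primary-key value."""
    return int(df.duplicated(subset=pk).sum())


# Illustrative rows only, not real EIA data.
toy = pd.DataFrame(
    {
        "report_year": [2020, 2020, 2021],
        "operator_id_eia": ["op_1", "op_2", "op_1"],
        "operating_state": ["CO", "TX", "CO"],
    }
)
print(count_pk_duplicates(toy))
```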

To-do list

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run pixi run prek-run to run linters and static code analysis checks.
  • Run pixi run pytest-ci locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.

irubey added 4 commits April 21, 2026 10:55
Extracts operational and ownership characteristics from EIA-176 Part 3
(Lines B–F) into a new annual company-level table. Includes boolean
is_* columns for operation/ownership types, other_ownership_description,
and has_alternative_fuel_fleet derived from the company data table.
…76__yearly_company_characteristics

- Use map({1.0: True}) instead of eq(1.0) for has_alternative_fuel_fleet so years
  where the question wasn't asked (1997-2004, 2016-2024) remain NULL rather than False
- Set operating_state to sa.Enum(...) and nullable=False in migration; all is_* columns
  to nullable=False to match actual transform output
- Add required: True and enum constraints to FIELD_METADATA_BY_RESOURCE so metadata
  matches migration schema (fixes test_migrations_match_metadata)
- Add not_null dbt tests for all is_* columns
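
The difference between the two approaches named in the first bullet can be seen on a toy Series. This is a minimal sketch, assuming the raw column holds 1.0 where the box was checked and NaN otherwise; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed raw encoding: 1.0 = yes, NaN = not answered or not asked.
raw = pd.Series([1.0, np.nan, 1.0, np.nan], name="has_alternative_fuel_fleet")

# eq() compares every element, so missing values collapse to False.
with_eq = raw.eq(1.0)

# map() with a dict leaves unmapped values (including NaN) missing;
# the nullable "boolean" dtype then stores them as <NA>.
with_map = raw.map({1.0: True}).astype("boolean")

print(pd.DataFrame({"eq": with_eq, "map": with_map}))
```

Note that with this mapping any value other than 1.0 also becomes missing, which is the intended behavior only if the raw column never carries an explicit "no".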
@irubey irubey force-pushed the 4697-eia176-company-characteristics branch from 401c5d5 to 255a29c Compare April 21, 2026 16:58
@irubey

irubey commented Apr 21, 2026

Contributor Author

Hi @e-belfer,

Ready for a review!

@cmgosnell cmgosnell added the community and eia176 labels Apr 22, 2026
@cmgosnell cmgosnell requested a review from e-belfer April 22, 2026 16:10
@cmgosnell cmgosnell moved this from New to In review in Catalyst Megaproject Apr 22, 2026

@e-belfer e-belfer left a comment

Member

I think this is off to a good start!

My main question is that at present this covers Part 3 Line A and a bit of B of the original survey, but the description and the original issue indicate this table should cover A-F.

This would mean including the following variables:

  • alternative_fleet_size (Part B)
  • customer_choice_residential_eligible (Part C)
  • customer_choice_residential_participating (Part C)
  • sales_acquisitions_1_yes_0_no (Part D)
  • natural_gas_pump_price (not on the form but provided through portal)

Part E might require its own table with a PK of operator ID, year and county. However, the rest of these fields can/should probably fit into this existing table. I was only able to find this data in the bulk file and not the report downloads, so this is probably out of scope for this issue. This is also true for the customer choice commercial eligible/participating fields. See #4729.

A few other minor notes:

  • See my comment below about the failing operator state ENUM constraint.
  • Given the total non-overlap of the other_ownership fields 1 and 2, I think we can safely combine them into one boolean field. This appears identically in the raw data, so it isn't caused by a mismapping on our end but the secondary field only appears in 2016 and isn't conveying any additional information.
  • Verified in the raw data that provision of the "other description" column happens often without checking the "is_other" box - I don't think we need any action here, the current status seems fine.

Comment thread src/pudl/extract/eia176.py Outdated
"natural_gas_other_disposition_items": None,
"natural_gas_supply_items": None,
"operation_types_and_sector_items": None,
"type_of_operations_and_sector_items": "operation_types_and_sector_items",

Member

Great catch about the misname. We actually created this problem by naming the column mapping CSV type_of_operations_and_sector_items - rather than performing a rename here, you could fix this problem at the source by renaming pudl/package_data/eia176/column_maps/type_of_operations_and_sector_items.csv to pudl/package_data/eia176/column_maps/operation_types_and_sector_items.csv

Comment thread src/pudl/metadata/fields.py Outdated
"type": "boolean",
"description": "Whether the company operates a public compressed natural gas (CNG) fueling station.",
},
"is_public_liquid_natural_gas_fueling_station": {

Member

Suggested change
- "is_public_liquid_natural_gas_fueling_station": {
+ "is_public_lng_fueling_station": {

Elsewhere in the fields we use lng and I think that's a fine abbreviation here and elsewhere, especially with your helpful field definition.

Comment thread src/pudl/metadata/fields.py Outdated
"core_eia176__yearly_company_characteristics": {
"operating_state": {
"description": "State that the operator is reporting for.",
"constraints": {"required": True, "enum": SUBDIVISION_CODES_ISO3166},

Member

This constraint is currently failing with the following error:

ValueError: Values in operating_state column are not included in categorical values in field enum constraint and will be converted to nulls (['FX', 'OO', 'BL', 'MX']).

These all appear to be derived from adjustment records. I'm relatively confident that MX is Mexico, but I wasn't able to track down precise confirmation of any of these - the best way to confirm this would be to email the EIA and ask (eiainfonaturalgas@eia.gov).

If we're able to map these to specific regions (e.g., federal adjustment, adjustment from Mexican imports) we should keep them and expand the enum. Otherwise, we should null them.
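
One way to see which codes would trip the enum constraint, and which records carry them, is sketched below. The `ALLOWED` set is a hypothetical three-state subset standing in for SUBDIVISION_CODES_ISO3166, and the rows are invented for the example.

```python
import pandas as pd

# Hypothetical subset of valid codes; the real list is SUBDIVISION_CODES_ISO3166.
ALLOWED = {"CO", "TX", "NY"}

# Illustrative rows only, not real EIA data.
df = pd.DataFrame(
    {
        "operator_id_eia": ["op_1", "op_2", "op_3", "op_4", "op_5", "op_6", "op_7"],
        "operating_state": ["CO", "FX", "TX", "MX", "OO", "BL", "NY"],
    }
)

# Rows whose state code is not in the allowed set.
outside_enum = df.loc[~df["operating_state"].isin(ALLOWED)]

bad_codes = sorted(outside_enum["operating_state"].unique().tolist())
print(bad_codes)
print(outside_enum)
```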

Comment thread src/pudl/metadata/fields.py
Comment thread src/pudl/metadata/resources/eia176.py Outdated
"additional_summary_text": (
"a company's operational and ownership characteristics."
),
"additional_source_text": "(Part 3, Lines B–F)",

Member

This actually covers Part 3, Line A and a bit of Line B at present. Were you planning to add other columns to cover the rest of B-F?

Comment thread src/pudl/transform/eia176.py
@github-project-automation github-project-automation Bot moved this from In review to In progress in Catalyst Megaproject Apr 27, 2026
irubey added 4 commits April 27, 2026 13:56
Rename the column-map CSV from type_of_operations_and_sector_items to
operation_types_and_sector_items so the internal page key matches the
shorter name used throughout PUDL. Add a source_filename special case
in the extractor to map the page key back to EIA's original ZIP filename
(eia176_{year}_type_of_operations_and_sector_items.csv). Simplify the
asset dict entry to use None instead of an explicit out_page alias.
- Rename is_public_liquid_natural_gas_fueling_station to
  is_public_lng_fueling_station for consistency with other LNG fields
- Merge is_other_ownership_2 into is_other_ownership in transform; the
  two fields never co-occur and _2 only appears in 2016 (27 rows)
- Remove is_other_ownership_2 field definition and all required: True
  overrides from FIELD_METADATA_BY_RESOURCE (no-op for Parquet-only tables)
- Remove operating_state enum constraint pending EIA clarification of
  non-US codes FX/OO/BL/MX; add explanatory comment referencing catalyst-cooperative#4729
- Expand resource description to document bulk-only fields excluded from
  scope (Lines B-D) with reference to catalyst-cooperative#4729
- Log row count before/after dropna on operating_state
- Drop is_other_ownership_2 column (merged into is_other_ownership)
- Rename is_public_liquid_natural_gas_fueling_station to is_public_lng_fueling_station
- Change operating_state from Enum to Text (enum constraint removed pending EIA response)
- Set all non-PK columns to nullable=True to match current Python metadata
@irubey

irubey commented Apr 27, 2026

Contributor Author

@e-belfer
Thanks for the thorough review! Here's what was done for each point:
Missing fields (Parts B-F)

Investigated all five fields by checking every layer of the pipeline: all
four column-map CSVs, all PUDL source files, and the raw 2023 ZIP directly.
None of the five field names appear anywhere. This confirms your finding -
alternative_fleet_size, customer_choice_residential_eligible,
customer_choice_residential_participating, sales_acquisitions, and
natural_gas_pump_price exist only in the EIA-176 bulk download or portal,
not in the individual report CSVs that PUDL ingests. Updated the resource
description to document the exclusions explicitly and reference #4729.
Part E (natural gas delivered by county) would require a separate table with
a PK of (report_year, operator_id_eia, county) and is also out of scope
for this PR.

operating_state ENUM constraint

Removed the enum constraint for now and added a comment explaining the
situation. Emailed eiainfonaturalgas@eia.gov to ask about the four anomalous
codes (FX, OO, BL, MX) - all appear on placeholder adjustment
records (operator IDs like 17699999XX). Will restore the constraint (or
add a filter) once EIA responds.

Combine is_other_ownership and is_other_ownership_2

Done - OR-merged the two columns in transform then dropped
is_other_ownership_2. Removed the field from the resource, field
definitions, migration, and dbt schema.

Page key fix

Took your suggestion - renamed the column-map CSV to
operation_types_and_sector_items.csv and added a source_filename
special case in the extractor to map back to EIA's original ZIP filename.
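
A minimal sketch of such a special case follows. The names `SOURCE_FILENAME_OVERRIDES` and `source_filename` are hypothetical and not taken from the PUDL extractor, and the default pattern eia176_{year}_{page}.csv for other pages is an assumption extrapolated from the single filename quoted in this PR.

```python
# Hypothetical lookup: page keys whose EIA filename stem differs from the PUDL name.
SOURCE_FILENAME_OVERRIDES = {
    "operation_types_and_sector_items": "type_of_operations_and_sector_items",
}


def source_filename(page: str, year: int) -> str:
    """Build the CSV name inside the EIA ZIP for a given page key and year."""
    # Fall back to the page key itself when no override is registered (assumed).
    stem = SOURCE_FILENAME_OVERRIDES.get(page, page)
    return f"eia176_{year}_{stem}.csv"


print(source_filename("operation_types_and_sector_items", 2023))
```

Keeping the override in one dict means the internal page key can stay short while the mapping back to EIA's naming lives in a single place.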

lng abbreviation

Renamed is_public_liquid_natural_gas_fueling_station to
is_public_lng_fueling_station at the column-map level so it comes out
of extract already named correctly.

required: True constraints

Removed all {"constraints": {"required": True}} entries for the is_*
fields. Kept the operating_state override (description only, no enum
pending EIA response).

Drop count logging

Replaced the silent dropna with a logged count - captures row count before
and after, logs at INFO level.

@irubey

irubey commented May 1, 2026

Contributor Author

Scanned all raw EIA-176 operation_types_and_sector_items CSVs (1997–2024, 55,589 rows total) and found 0 null state values, so the dropna was defensive and never dropped anything. Replaced the log with assert null_operating_state == 0 so the pipeline will fail loudly if that ever changes.
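
The fail-loudly check described above could be sketched as follows. The wrapper function `assert_no_null_operating_state` is a hypothetical name used for the example; only the `null_operating_state == 0` assertion itself is quoted from the comment.

```python
import pandas as pd


def assert_no_null_operating_state(df: pd.DataFrame) -> pd.DataFrame:
    """Raise if any operating_state value is missing; otherwise pass df through."""
    null_operating_state = int(df["operating_state"].isna().sum())
    assert null_operating_state == 0, (
        f"Expected no null operating_state values, found {null_operating_state}."
    )
    return df


# Illustrative rows only, not real EIA data.
clean = pd.DataFrame({"operating_state": ["CO", "TX"]})
assert_no_null_operating_state(clean)
```

One caveat: plain `assert` statements are skipped when Python runs with the `-O` flag, so an explicit `raise` may be preferable if the pipeline is ever run optimized.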

I am still waiting on Emmanuel Eboweme, a Survey Statistician at EIA, to confirm the operating state anomalies. The ETA is next Tuesday.

@e-belfer

e-belfer commented May 4, 2026

Member

> Investigated all five fields by checking every layer of the pipeline: all
> four column-map CSVs, all PUDL source files, and the raw 2023 ZIP directly.
> None of the five field names appear anywhere. This confirms your finding -
> alternative_fleet_size, customer_choice_residential_eligible,
> customer_choice_residential_participating, sales_acquisitions, and
> natural_gas_pump_price exist only in the EIA-176 bulk download or portal,
> not in the individual report CSVs that PUDL ingests.

Hi @irubey, I'm a little confused about what you mean by this data not existing - these fields all exist in _core_eia176__yearly_company_data and should be pulled into this table.

irubey added 3 commits May 5, 2026 23:43
Add alternative_fleet_size, customer_choice_residential_eligible,
customer_choice_residential_participating, has_sales_or_acquisitions, and
natural_gas_pump_price to core_eia176__yearly_company_characteristics across
transform, metadata, migration, dbt schema, and release notes. Drop national-level
adjustment records (FX/MX/BL/OO) now that EIA has confirmed their meaning; restore
the operating_state ENUM constraint. Update row counts to reflect the drops.
Fix natural_gas_pump_price unit from USD_per_mcf to USD_per_Mcf.
Update 48fade8aeee8 to depend on f98868d3f5cb (forensics migration
from main) instead of 4f252e9e2ce3, making the migration chain linear.
@irubey

irubey commented May 6, 2026

Contributor Author

Hi @e-belfer,

Two updates since the last review:

Parts B-F fields

alternative_fleet_size, customer_choice_residential_eligible, customer_choice_residential_participating, has_sales_or_acquisitions, and natural_gas_pump_price are now included (found in the items field). Coverage is sparse for some (e.g. natural_gas_pump_price is 2014-2016 only), but they are present in the pipeline, with null values for years where the question wasn't asked.

EIA response on non-US state codes

Heard back from Emmanuel Eboweme (Survey Statistician, EIA) on May 4:

  • FX = Gulf of America
  • MX = Mexico
  • BL = Brazil
  • OO = countries lacking FIPS state codes (assigned 00 as a placeholder)

He confirmed these are national-level adjustment records and should be excluded from state-level analysis. They are now explicitly dropped in the transform with a logged count, and the operating_state ENUM constraint is restored.
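
A rough sketch of dropping those records with a logged count is shown below. `NATIONAL_ADJUSTMENT_STATE_CODES` is the constant name referenced in the PR's transform, but its contents here are inferred from the four codes listed above, and the helper `drop_national_adjustments` and the toy rows are hypothetical.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Assumed contents, based on the four codes EIA described as national-level.
NATIONAL_ADJUSTMENT_STATE_CODES = {"FX", "MX", "BL", "OO"}


def drop_national_adjustments(df: pd.DataFrame) -> pd.DataFrame:
    """Remove national-level adjustment rows and log how many were dropped."""
    is_national = df["operating_state"].isin(NATIONAL_ADJUSTMENT_STATE_CODES)
    logger.info(
        "Dropping %d of %d rows with national-level adjustment codes.",
        int(is_national.sum()),
        len(df),
    )
    return df.loc[~is_national].reset_index(drop=True)


# Illustrative rows only, not real EIA data.
toy = pd.DataFrame({"operating_state": ["CO", "FX", "TX", "MX", "BL", "OO"]})
kept = drop_national_adjustments(toy)
print(kept)
```

With these rows gone, every remaining operating_state value should fall inside the state enum, which is what allows the constraint to be restored.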

@e-belfer e-belfer requested a review from aesharpe June 1, 2026 18:44

@e-belfer e-belfer left a comment

Member

@aesharpe will provide a full review, but I am leaving some interim comments in the meantime - no need to address these just yet!


# Drop national-level adjustment records — see NATIONAL_ADJUSTMENT_STATE_CODES
n_national = df["operating_state"].isin(NATIONAL_ADJUSTMENT_STATE_CODES).sum()
logger.info(

Member

To do: add assertion here about expected drop count

"customer_choice_residential_participating",
]:
if col in df.columns:
df[col] = df[col].astype(pd.Int64Dtype())

Member

To do: no need to cast dtype, this is already accounted for in writing a schema.

"Whether the utility plans to operate alternative-fueled vehicles this coming year."
),
},
"alternative_fleet_size": {

Member

To do: drop the line numbers, not consistent with rest of fields

"type": "number",
"description": (
"Price of natural gas at public fueling stations operated by the company "
"(EIA Form 176 Part 3, Line E). Reported 2014-2016 only."

Member

To do: move year warning out into table-level description

@@ -0,0 +1,75 @@
version: 2
sources:

Member

To do: add human schema version

@aesharpe aesharpe left a comment

Member

Hi @irubey, sorry for the long delay. I don't have a whole lot more to add here. Let me know if you'd like to jump back in and make these changes yourself; otherwise I can go ahead and do the final touches! Thanks for all your work here 🙏.

Comment on lines +4377 to +4380
"is_gatherer": {
"type": "boolean",
"description": "Whether the company operates as a natural gas gatherer.",
},

Member

Recommend specifying what a gatherer is if possible

Comment on lines +569 to +570
raw_eia176__operation_types_and_sector_items: Raw EIA-176 RP4 table; primary
source for all ``is_*`` columns and ``operating_state``.

Member

What does RP4 stand for?

Labels

community: Issues that contributors have volunteered to take on or fostering more community
eia176: Issues related to the EIA Form 176 natural gas supply and disposition dataset.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Create core_eia176__yearly_company_characteristics (Part 3)

5 participants