feat(connectors): failure modes as a declared capability #62

Merged
LGDiMaggio merged 8 commits into main from feat/failure-modes-capability on Jun 13, 2026

@LGDiMaggio

Owner

Summary

Failure-mode catalogs are now a first-class, declared connector capability. Before this PR, the runtime harvested diagnosis content by probing every connector for a private _failure_modes list — only the Generic CMMS demo connector exposed one, so diagnose_failure worked on the YAML demo and silently degraded to an empty catalog on SQL or Excel deployments. Now any substrate can feed the catalog by declaring Capability.READ_FAILURE_MODES and serving its public read_failure_modes(); the runtime discovers providers through the registry, exactly like every other capability.

The same plant data can now also declare which failure codes apply to which asset (failure_modes column/cell on the asset source), so diagnosis filters to the asset's declared modes instead of matching the whole catalog.
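As a rough illustration of the asset-scoped filtering described above, the sketch below narrows a catalog to the codes an asset declares. The `Asset` and `FailureMode` classes here are simplified stand-ins, `candidate_modes` is a hypothetical helper name, and the fallback to the whole catalog for an asset with no declared codes is an assumption rather than confirmed behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FailureMode:
    """Simplified stand-in for the project's FailureMode entity."""
    code: str
    name: str


@dataclass
class Asset:
    """Simplified stand-in; only the linkage field matters here."""
    id: str
    failure_modes: list[str] = field(default_factory=list)


def candidate_modes(asset: Asset, catalog: list[FailureMode]) -> list[FailureMode]:
    """Return the catalog entries that diagnosis should consider for this asset."""
    if not asset.failure_modes:
        # Assumption: an asset with no declared codes matches the whole catalog.
        return list(catalog)
    declared = set(asset.failure_modes)
    # Keep catalog order; drop modes the asset does not declare.
    return [mode for mode in catalog if mode.code in declared]
```

With a pump that declares only `SEAL-LEAK-01`, a bearing-wear entry in the catalog would no longer be offered as a candidate for that asset.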

How it works

| Substrate | Catalog source | Capability declared when |
|---|---|---|
| Generic CMMS | failure_modes.json in the data dir | the file exists (local mode only; REST never declares it until a REST fetch exists) |
| SQL | any table/view mapped with entity: FailureMode | the mapping is configured |
| Excel/CSV | optional failure_modes sheet schema | the sheet is configured |

The harvest (_collect_failure_modes) fans out to all providers concurrently, dedups by code (first registration wins), and treats a provider that raises ConnectorError as contributing nothing — so a not-connected or failing source degrades diagnosis honestly instead of aborting the agent. Both consumers (the workflow FailureAnalyzer and the diagnose_failure tool) read this single source.
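The harvest semantics above (concurrent fan-out, first-registration-wins dedup, ConnectorError as the only tolerated failure) can be sketched roughly as follows. This is not the code in runtime.py: `collect_failure_modes` is an illustrative name, `ConnectorError` is a local stand-in for the project's exception, and modes are plain dicts to keep the example self-contained.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class ConnectorError(Exception):
    """Local stand-in for the project's ConnectorError."""


async def collect_failure_modes(providers):
    """Harvest failure modes from every provider, in registration order."""
    # gather() returns results in argument order, so dedup below stays
    # first-registration-wins even though the reads run concurrently.
    results = await asyncio.gather(
        *(p.read_failure_modes() for p in providers),
        return_exceptions=True,
    )
    catalog = {}
    for provider, result in zip(providers, results):
        if isinstance(result, ConnectorError):
            # Degradation contract: this provider contributes nothing.
            logger.warning(
                "failure_mode_harvest_failed connector=%s error=%s",
                type(provider).__name__,
                result,
            )
            continue
        if isinstance(result, BaseException):
            # Anything else is treated as a connector bug and re-raised.
            raise result
        for mode in result:
            catalog.setdefault(mode["code"], mode)  # first registration wins
    return list(catalog.values())
```

Collecting exceptions with `return_exceptions=True` is one way to keep a single failing provider from cancelling the others; the real implementation may structure this differently.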

Multi-valued cells share one committed encoding across substrates: a semicolon-delimited string ("BEAR-WEAR-01;SEAL-LEAK-01"), implemented once in connectors/_entity_builders.py alongside the existing shared dict→entity builders.
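A minimal sketch of that encoding, following the rules stated in the commit messages (split on ";", trim whitespace per entry, drop empty entries). The name `split_list_cell` comes from this PR, but the signature and the handling of a missing cell shown here are assumptions.

```python
def split_list_cell(cell):
    """Decode a semicolon-delimited cell into a list of trimmed entries."""
    if cell is None:
        # Assumption: a NULL/missing cell decodes to an empty list.
        return []
    parts = (part.strip() for part in str(cell).split(";"))
    return [part for part in parts if part]
```

For example, `split_list_cell("BEAR-WEAR-01; SEAL-LEAK-01;;")` yields `["BEAR-WEAR-01", "SEAL-LEAK-01"]`.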

Key decisions

  • Conditional declaration: a connector declares the capability only when an actual source is configured, so capability discovery is a true signal of "has a catalog" — not "running in local mode". The Generic CMMS gate is snapshotted at construction/connect rather than stat-ing the file per capability access (capability and loaded catalog can never disagree mid-session).
  • ConnectorError is the degradation contract: the harvest skips providers that raise it; anything else is a connector bug and surfaces loudly. The SQL connector was hardened to honor this — un-connected reads raise ConnectorError (was a raw AttributeError), non-transient driver errors are wrapped, and an invalid catalog row raises ConnectorSchemaError with row context instead of leaking a pydantic ValidationError that aborted agent.start().
  • Async cascade: _build_domain_services and _tool_diagnose_failure became async with the harvest. The old "works on un-started agents" claim is gone — the public read method requires a connection, and the harvest degrades to an honest empty catalog when a provider isn't connected.
  • No new MCP tool: READ_FAILURE_MODES is mapped to an explicit empty list in CAPABILITY_TO_TOOL per the existing gap-visibility convention (same as Calendar).
  • Behavior preserved: the demo loads the same 10 modes and produces identical diagnoses (e2e-verified); the empty-catalog note ("No failure-mode data configured on any connector.") is locked by regression tests; an integration suite proves the CSV substrate and the JSON demo substrate yield the same catalog and the same diagnosis, field-for-field.
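To make the conditional-declaration decision concrete, here is a small sketch of a connector that snapshots the presence of its catalog file at construction. `LocalCatalogConnector` and the `READ_ASSETS` member are hypothetical; only `READ_FAILURE_MODES` and the failure_modes.json file name are taken from this PR.

```python
from enum import Enum, auto
from pathlib import Path


class Capability(Enum):
    """Stand-in enum; READ_ASSETS is a hypothetical second member."""
    READ_ASSETS = auto()
    READ_FAILURE_MODES = auto()


class LocalCatalogConnector:
    """Hypothetical connector that gates the capability on a local file."""

    def __init__(self, data_dir):
        self._catalog_path = Path(data_dir) / "failure_modes.json"
        caps = {Capability.READ_ASSETS}
        # Snapshot once: the declared capability cannot drift from the
        # catalog loaded for this session if the file appears or vanishes.
        if self._catalog_path.is_file():
            caps.add(Capability.READ_FAILURE_MODES)
        self._capabilities = frozenset(caps)

    @property
    def capabilities(self):
        # No filesystem access here; returns the construction-time snapshot.
        return self._capabilities
```

The trade-off is that a catalog file added later is only picked up by a new connector instance (or, in the real connector, at connect).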

Migration note (third-party connectors): a connector that exposed _failure_modes without declaring the capability silently stops contributing — declare READ_FAILURE_MODES and implement async read_failure_modes(). Recorded in the CHANGELOG.
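For third-party authors, the migrated shape might look roughly like the sketch below. `MyPlantConnector` is hypothetical, `Capability` and `ConnectorError` are local stand-ins for the project's types, and the method returns plain dicts where a real connector would return failure-mode entities.

```python
import asyncio
from enum import Enum, auto


class Capability(Enum):
    """Stand-in for the project's Capability enum."""
    READ_FAILURE_MODES = auto()


class ConnectorError(Exception):
    """Stand-in for the project's ConnectorError."""


class MyPlantConnector:
    """Hypothetical third-party connector after the migration."""

    # Declared publicly instead of exposing a private _failure_modes list.
    capabilities = frozenset({Capability.READ_FAILURE_MODES})

    def __init__(self, rows):
        self._rows = rows
        self._connected = False

    async def connect(self):
        self._connected = True

    async def read_failure_modes(self):
        if not self._connected:
            # Raising ConnectorError lets the harvest skip this provider
            # instead of treating the failure as a connector bug.
            raise ConnectorError("read_failure_modes() called before connect()")
        return list(self._rows)
```

A connector whose catalog source is optional would declare the capability conditionally rather than as a class-level constant.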

Known Residuals

Accepted from the multi-agent review (none introduced by this PR; recorded for follow-up):

  • P2 — no timeout layer on connector reads (src/machina/agent/runtime.py:881): a hung DB connection blocks agent.start() and diagnose_failure until the driver gives up. Pre-existing property of every connector read path (asset loads included); fixing it only for the harvest would be inconsistent. Needs a cross-cutting timeout decision.
  • P3 — outage vs. absence in the diagnosis note (src/machina/agent/runtime.py:2878): when all providers error, diagnosis reports "No failure-mode data configured" — same note as a genuinely unconfigured system. Distinguishing "configured but currently unavailable" changes the R9 note contract; deliberately deferred.
  • P3 — SQL connection shared across threads without a lock (src/machina/connectors/sql/generic.py:_read_sync): concurrent diagnose calls fan asyncio.to_thread reads onto one pyodbc connection (threadsafety=1). Pre-existing for all SQL reads; the harvest raises the frequency.
  • P3 — MCP failure-mode surface gap: no machina_diagnose_failure MCP tool exists, and the static machina://v1/failure-taxonomy resource diverges from connector catalogs. Follows the documented Calendar convention; pair the tool with the capability when the MCP diagnosis surface is built.

Post-Deploy Monitoring & Validation

Library change — no deployment infrastructure. For operators upgrading:

  • Log signals: failure_mode_harvest_failed (WARNING, with connector= and error=) means a declared provider was skipped — expected for transient outages, a misconfiguration signal if persistent. loaded_failure_modes (DEBUG, count=) and domain_services_ready (failure_modes=) confirm the catalog loaded at start.
  • Healthy signal: diagnose_failure returns probable_failures with no note; the demo agent reports 10 failure modes at start.
  • Failure signal / rollback: diagnosis suddenly returning the empty-catalog note on a previously-working deployment → check the provider's source gate (file present? mapping configured?) before rolling back; the capability is intentionally undeclared when the source is absent.
  • Validation: PYTHONPATH=src python -m pytest tests/integration/test_failure_modes_substrates.py proves substrate-agnostic harvest on the shipped sample data.

Test plan

  • 2319 tests pass (pytest tests/), ruff and mypy (strict) clean.
  • New coverage: capability-gated harvest semantics (dedup first-wins, no-provider start, empty/raising/buggy providers, single-source workflow↔diagnose), conditional declaration per substrate (incl. REST-mode never-declared and file-absent cases), asset↔code linkage and the shared encoding (incl. NULL→'None' regression), SQL ConnectorError contract (un-connected, invalid row), Excel connect-gating, watcher refresh reload + all-or-nothing atomicity, unknown-coercer validation, and CSV↔JSON substrate equivalence (catalog + diagnosis, field-level).
  • Multi-agent review (10 reviewers): 10 findings fixed in 285d295, residuals listed above.
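The NULL→'None' regression in the coverage list comes from a common Python pitfall: coercing a missing value with `str()` produces the literal text "None". A hedged sketch of a guard follows; `clean_required_text` is a hypothetical helper, and raising `ValueError` here stands in for whatever the project's empty-code validator actually does.

```python
def clean_required_text(value, field_name):
    """Normalise a required text column, rejecting NULL and blank values."""
    if value is None:
        # Without this check, str(None) would mint the literal 'None'.
        raise ValueError(f"{field_name} must not be empty")
    text = str(value).strip()
    if not text:
        raise ValueError(f"{field_name} must not be empty")
    return text
```

A regression test for this case only needs a row whose code column is NULL and an assertion that no catalog entry with code 'None' is produced.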

Compound Engineering
Claude Code

LGDiMaggio and others added 8 commits June 12, 2026 17:49
Add READ_FAILURE_MODES to the Capability enum and migrate the runtime
harvest from probing the private _failure_modes attribute to
capability-gated async collection through each connector's public
read_failure_modes() method. A provider that raises ConnectorError
contributes nothing instead of aborting the harvest, preserving the
empty-catalog honesty note. Dedup-by-code, first registration wins.

Generic CMMS declares the capability only when an actual catalog
source exists (local failure_modes.json or a configured REST
endpoint), so capability discovery is a true signal of "has a
catalog" rather than "running in local mode".

The capability is intentionally unmapped in CAPABILITY_TO_TOOL (gap
made visible per the Calendar convention).

Add an optional failure_modes sheet schema to ExcelConnectorConfig,
loaded and header-validated at connect() like the asset registry, with
read_failure_modes() returning the cache. The capabilities ClassVar
becomes a property so READ_FAILURE_MODES is declared only when the
sheet is configured (unconfigured means not-declared, matching the
Generic CMMS conditional-declaration pattern).

Asset rows may carry a column mapped to the failure_modes field: a
single semicolon-delimited string of failure-mode codes (split on ';',
whitespace-trimmed, empty entries dropped) resolved into
Asset.failure_modes. The same encoding serves the list-valued
failure-mode fields (detection_methods, typical_indicators,
recommended_actions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e linkage

Extend TableMapping.entity with "FailureMode" so a table or view can
serve as a failure-mode catalog source. GenericSqlConnector gains
read_failure_modes() via the existing _find_mapping + _execute_read
pattern and declares Capability.READ_FAILURE_MODES only when a
FailureMode mapping is configured (unconfigured means not-declared).

Asset linkage: a failure_modes field mapping reads a single string
column of semicolon-delimited codes (e.g. "BEAR-WEAR-01;SEAL-LEAK-01"),
split on ";" with per-entry whitespace stripping and empty entries
dropped — the encoding shared with the Excel connector. FailureMode
list fields (detection_methods, typical_indicators,
recommended_actions) accept the same encoding.

Malformed FailureMode mapping columns surface ConnectorSchemaError at
connect through the existing _validate_schemas pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add examples/sample_data/failure_modes.csv mirroring the demo JSON
catalog (10 modes, semicolon-delimited list cells) and an integration
test proving the harvest is substrate-agnostic: the CSV-backed Excel
connector and the JSON-backed Generic CMMS connector yield the same
harvested catalog and identical diagnose_failure output.

…red entity builders

Move the semicolon list-cell encoding (split_list_cell) and the
dict_to_failure_mode mapper into connectors/_entity_builders.py — the
module that already hosts the shared dict→entity builders — and teach
dict_to_asset to resolve the failure_modes linkage key (previously it
silently dropped it). Deletes the per-connector copies in the SQL and
Excel substrates, the dead _ENTITY_BUILDERS registry, and one of the
two duplicated encoding test suites; shared-helper tests move to
tests/unit/connectors/test_entity_builders.py.

Also: the runtime harvest now fans out provider reads concurrently via
asyncio.gather (one slow provider no longer serialises startup or
diagnosis; argument order preserves first-registration-wins dedup),
the Excel connector precomputes its capabilities frozenset in __init__
(matching the SQL connector), and the three Excel sheet loaders share
one exists/read/validate/parse pipeline.

- SQL: honor the ConnectorError contract the runtime harvest keys on —
  un-connected reads raise ConnectorError (was raw AttributeError),
  non-transient driver errors are wrapped, and an invalid failure-mode
  row raises ConnectorSchemaError with row context instead of leaking
  a pydantic ValidationError that aborted agent start.
- Generic CMMS: never declare READ_FAILURE_MODES in REST mode (no REST
  fetch exists — declaring it harvested an empty catalog forever), and
  snapshot the local-source presence at construction/connect instead of
  stat-ing the file on every capabilities access (TOCTOU drift between
  the declared capability and the connect-time catalog).
- Entity builders: a NULL code/name column now hits the empty-code
  validator instead of minting a literal 'None' catalog entry.
- Excel: a configured failure-modes sheet read before connect raises
  ConnectorError (a configured catalog never silently reads as absent);
  watcher refresh() is all-or-nothing (cache snapshot restored on a
  mid-refresh failure); unknown coerce names fail loudly at connect
  instead of silently falling back to string coercion.
- Docs: SQL/Excel connector capability tables gain READ_FAILURE_MODES;
  CHANGELOG entry incl. the private-attribute migration note for
  third-party connector authors.
- Tests: non-ConnectorError propagation, un-connected providers (SQL,
  Excel, CMMS-backed agent), NULL-row translation, refresh reload and
  atomicity, coerce validation, field-level substrate parity.

LGDiMaggio merged commit 4b134c9 into main on Jun 13, 2026
5 checks passed