feat(connectors): failure modes as a declared capability #62
Merged
Add READ_FAILURE_MODES to the Capability enum and migrate the runtime harvest from probing the private _failure_modes attribute to capability-gated async collection through each connector's public read_failure_modes() method. A provider that raises ConnectorError contributes nothing instead of aborting the harvest, preserving the empty-catalog honesty note. Dedup-by-code, first registration wins. Generic CMMS declares the capability only when an actual catalog source exists (local failure_modes.json or a configured REST endpoint), so capability discovery is a true signal of "has a catalog" rather than "running in local mode". The capability is intentionally unmapped in CAPABILITY_TO_TOOL (gap made visible per the Calendar convention).
Add an optional failure_modes sheet schema to ExcelConnectorConfig, loaded and header-validated at connect() like the asset registry, with read_failure_modes() returning the cache. The capabilities ClassVar becomes a property so READ_FAILURE_MODES is declared only when the sheet is configured (unconfigured means not-declared, matching the Generic CMMS conditional-declaration pattern). Asset rows may carry a column mapped to the failure_modes field: a single semicolon-delimited string of failure-mode codes (split on ';', whitespace-trimmed, empty entries dropped) resolved into Asset.failure_modes. The same encoding serves the list-valued failure-mode fields (detection_methods, typical_indicators, recommended_actions). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
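The conditional-declaration pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: `SheetConfig`, `SheetConnector`, and `Capability.READ_ASSETS` are hypothetical stand-ins, and computing the set once at construction is an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Capability(Enum):
    READ_ASSETS = "read_assets"  # hypothetical member, for illustration only
    READ_FAILURE_MODES = "read_failure_modes"


@dataclass(frozen=True)
class SheetConfig:
    """Hypothetical stand-in for ExcelConnectorConfig."""

    assets_sheet: str
    failure_modes_sheet: Optional[str] = None  # None means "not configured"


class SheetConnector:
    """Declares READ_FAILURE_MODES only when a catalog sheet is configured."""

    def __init__(self, config: SheetConfig) -> None:
        declared = {Capability.READ_ASSETS}
        if config.failure_modes_sheet is not None:
            declared.add(Capability.READ_FAILURE_MODES)
        # Assumption: computed once here so every access returns the same set.
        self._capabilities: FrozenSet[Capability] = frozenset(declared)

    @property
    def capabilities(self) -> FrozenSet[Capability]:
        return self._capabilities


configured = SheetConnector(SheetConfig("assets", "failure_modes"))
unconfigured = SheetConnector(SheetConfig("assets"))
```

With this shape, capability discovery answers "does this connector have a catalog" directly: `unconfigured.capabilities` simply lacks `READ_FAILURE_MODES`.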
…e linkage Extend TableMapping.entity with "FailureMode" so a table or view can serve as a failure-mode catalog source. GenericSqlConnector gains read_failure_modes() via the existing _find_mapping + _execute_read pattern and declares Capability.READ_FAILURE_MODES only when a FailureMode mapping is configured (unconfigured means not-declared). Asset linkage: a failure_modes field mapping reads a single string column of semicolon-delimited codes (e.g. "BEAR-WEAR-01;SEAL-LEAK-01"), split on ";" with per-entry whitespace stripping and empty entries dropped — the encoding shared with the Excel connector. FailureMode list fields (detection_methods, typical_indicators, recommended_actions) accept the same encoding. Malformed FailureMode mapping columns surface ConnectorSchemaError at connect through the existing _validate_schemas pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
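The semicolon encoding described above (split on ";", strip each entry, drop empties) might look like this in isolation. The function name `split_semicolon_cell` is hypothetical, and treating a NULL cell as an empty list is an assumption of this sketch rather than documented behavior.

```python
from typing import List, Optional


def split_semicolon_cell(cell: Optional[str]) -> List[str]:
    """Split a semicolon-delimited cell into trimmed, non-empty entries."""
    if cell is None:
        # Assumption: a NULL cell means "no entries", not an error.
        return []
    return [part.strip() for part in cell.split(";") if part.strip()]


codes = split_semicolon_cell("BEAR-WEAR-01; SEAL-LEAK-01;;")
print(codes)  # ['BEAR-WEAR-01', 'SEAL-LEAK-01']
```

Because the same function would serve both the asset linkage column and the list-valued FailureMode fields, a stray trailing ";" or padded entry behaves identically on every substrate.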
Add examples/sample_data/failure_modes.csv mirroring the demo JSON catalog (10 modes, semicolon-delimited list cells) and an integration test proving the harvest is substrate-agnostic: the CSV-backed Excel connector and the JSON-backed Generic CMMS connector yield the same harvested catalog and identical diagnose_failure output.
…red entity builders Move the semicolon list-cell encoding (split_list_cell) and the dict_to_failure_mode mapper into connectors/_entity_builders.py — the module that already hosts the shared dict→entity builders — and teach dict_to_asset to resolve the failure_modes linkage key (previously it silently dropped it). Deletes the per-connector copies in the SQL and Excel substrates, the dead _ENTITY_BUILDERS registry, and one of the two duplicated encoding test suites; shared-helper tests move to tests/unit/connectors/test_entity_builders.py. Also: the runtime harvest now fans out provider reads concurrently via asyncio.gather (one slow provider no longer serialises startup or diagnosis; argument order preserves first-registration-wins dedup), the Excel connector precomputes its capabilities frozenset in __init__ (matching the SQL connector), and the three Excel sheet loaders share one exists/read/validate/parse pipeline.
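The concurrent harvest can be sketched as below, under stated assumptions: `StaticProvider` and `collect_failure_modes` are hypothetical names, failure modes are plain dicts keyed by "code", and `ConnectorError` is a local stand-in for the library's exception. The sketch relies on `asyncio.gather` returning results in argument order, which is what keeps first-registration-wins dedup intact even when an earlier provider is slower.

```python
import asyncio
from typing import Dict, List, Sequence


class ConnectorError(Exception):
    """Local stand-in for the library's ConnectorError."""


class StaticProvider:
    """Hypothetical provider serving a fixed catalog after an optional delay."""

    def __init__(self, modes: List[dict], delay: float = 0.0, fail: bool = False) -> None:
        self._modes = modes
        self._delay = delay
        self._fail = fail

    async def read_failure_modes(self) -> List[dict]:
        await asyncio.sleep(self._delay)
        if self._fail:
            raise ConnectorError("not connected")
        return list(self._modes)


async def collect_failure_modes(providers: Sequence[StaticProvider]) -> List[dict]:
    # Results come back in argument order, regardless of completion order.
    results = await asyncio.gather(
        *(p.read_failure_modes() for p in providers),
        return_exceptions=True,
    )
    catalog: Dict[str, dict] = {}
    for result in results:
        if isinstance(result, ConnectorError):
            continue  # a degraded provider contributes nothing
        if isinstance(result, BaseException):
            raise result  # anything else is treated as a connector bug
        for mode in result:
            catalog.setdefault(mode["code"], mode)  # first registration wins
    return list(catalog.values())


providers = [
    StaticProvider([{"code": "A", "name": "first"}], delay=0.05),
    StaticProvider([], fail=True),
    StaticProvider([{"code": "A", "name": "second"}, {"code": "B", "name": "other"}]),
]
harvested = asyncio.run(collect_failure_modes(providers))
```

Here the slow first provider still wins the "A" entry, and the failing provider is skipped without aborting the harvest.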
- SQL: honor the ConnectorError contract the runtime harvest keys on: un-connected reads raise ConnectorError (was raw AttributeError), non-transient driver errors are wrapped, and an invalid failure-mode row raises ConnectorSchemaError with row context instead of leaking a pydantic ValidationError that aborted agent start.
- Generic CMMS: never declare READ_FAILURE_MODES in REST mode (no REST fetch exists, so declaring it harvested an empty catalog forever), and snapshot the local-source presence at construction/connect instead of stat-ing the file on every capabilities access (TOCTOU drift between the declared capability and the connect-time catalog).
- Entity builders: a NULL code/name column now hits the empty-code validator instead of minting a literal 'None' catalog entry.
- Excel: a configured failure-modes sheet read before connect raises ConnectorError (a configured catalog never silently reads as absent); watcher refresh() is all-or-nothing (cache snapshot restored on a mid-refresh failure); unknown coerce names fail loudly at connect instead of silently falling back to string coercion.
- Docs: SQL/Excel connector capability tables gain READ_FAILURE_MODES; CHANGELOG entry incl. the private-attribute migration note for third-party connector authors.
- Tests: non-ConnectorError propagation, un-connected providers (SQL, Excel, CMMS-backed agent), NULL-row translation, refresh reload and atomicity, coerce validation, field-level substrate parity.
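The 'None' regression in the entity-builder fix comes from a plain Python fact: `str(None)` is the string `'None'`, so blindly stringifying a NULL column mints a catalog code. A minimal sketch of the safer coercion follows; `text_field` and `build_failure_mode` are hypothetical helpers, dicts stand in for the real entity, and `ValueError` stands in for whatever validator the library actually uses.

```python
from typing import Any, Dict, Mapping


def text_field(row: Mapping[str, Any], key: str) -> str:
    """Coerce a column to text, mapping NULL to '' instead of 'None'."""
    value = row.get(key)
    return "" if value is None else str(value).strip()


def build_failure_mode(row: Mapping[str, Any]) -> Dict[str, str]:
    """Build a catalog entry, rejecting empty required fields."""
    entry = {key: text_field(row, key) for key in ("code", "name")}
    for key, value in entry.items():
        if not value:
            # Stand-in for the empty-code validator mentioned above.
            raise ValueError(f"failure mode {key} must not be empty")
    return entry


naive_code = str(None)  # 'None': the literal entry the fix avoids
valid = build_failure_mode({"code": " BEAR-WEAR-01 ", "name": "Bearing wear"})
```

Routing NULL through the empty-value check means a bad row fails at load time rather than appearing later as a catalog entry called 'None'.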
Summary
Failure-mode catalogs are now a first-class, declared connector capability. Before this PR, the runtime harvested diagnosis content by probing every connector for a private _failure_modes list. Only the Generic CMMS demo connector exposed one, so diagnose_failure worked on the YAML demo and silently degraded to an empty catalog on SQL or Excel deployments. Now any substrate can feed the catalog by declaring Capability.READ_FAILURE_MODES and serving its public read_failure_modes(); the runtime discovers providers through the registry, exactly like every other capability.
The same plant data can now also declare which failure codes apply to which asset (failure_modes column/cell on the asset source), so diagnosis filters to the asset's declared modes instead of matching the whole catalog.
How it works
Catalog source per connector:
- Generic CMMS: failure_modes.json in the data dir
- SQL: table mapping with entity: FailureMode
- Excel: failure_modes sheet schema
The harvest (_collect_failure_modes) fans out to all providers concurrently, dedups by code (first registration wins), and treats a provider that raises ConnectorError as contributing nothing, so a not-connected or failing source degrades diagnosis honestly instead of aborting the agent. Both consumers (the workflow FailureAnalyzer and the diagnose_failure tool) read this single source.
Multi-valued cells share one committed encoding across substrates: a semicolon-delimited string ("BEAR-WEAR-01;SEAL-LEAK-01"), implemented once in connectors/_entity_builders.py alongside the existing shared dict→entity builders.
Key decisions
- ConnectorError is the degradation contract: the harvest skips providers that raise it; anything else is a connector bug and surfaces loudly. The SQL connector was hardened to honor this: un-connected reads raise ConnectorError (was a raw AttributeError), non-transient driver errors are wrapped, and an invalid catalog row raises ConnectorSchemaError with row context instead of leaking a pydantic ValidationError that aborted agent.start().
- _build_domain_services and _tool_diagnose_failure became async with the harvest. The old "works on un-started agents" claim is gone: the public read method requires a connection, and the harvest degrades to an honest empty catalog when a provider isn't connected.
- READ_FAILURE_MODES is mapped to an explicit empty list in CAPABILITY_TO_TOOL per the existing gap-visibility convention (same as Calendar).
Migration note (third-party connectors): a connector that exposed _failure_modes without declaring the capability silently stops contributing; declare READ_FAILURE_MODES and implement async read_failure_modes(). Recorded in the CHANGELOG.
Known Residuals
Accepted from the multi-agent review (none introduced by this PR; recorded for follow-up):
- (src/machina/agent/runtime.py:881) A hung DB connection blocks agent.start() and diagnose_failure until the driver gives up. Pre-existing property of every connector read path (asset loads included); fixing it only for the harvest would be inconsistent. Needs a cross-cutting timeout decision.
- (src/machina/agent/runtime.py:2878) When all providers error, diagnosis reports "No failure-mode data configured", the same note as a genuinely unconfigured system. Distinguishing "configured but currently unavailable" changes the R9 note contract; deliberately deferred.
- (src/machina/connectors/sql/generic.py:_read_sync) Concurrent diagnose calls fan asyncio.to_thread reads onto one pyodbc connection (threadsafety=1). Pre-existing for all SQL reads; the harvest raises the frequency.
- No machina_diagnose_failure MCP tool exists, and the static machina://v1/failure-taxonomy resource diverges from connector catalogs. Follows the documented Calendar convention; pair the tool with the capability when the MCP diagnosis surface is built.
Post-Deploy Monitoring & Validation
Library change — no deployment infrastructure. For operators upgrading:
- failure_mode_harvest_failed (WARNING, with connector= and error=) means a declared provider was skipped: expected for transient outages, a misconfiguration signal if persistent.
- loaded_failure_modes (DEBUG, count=) and domain_services_ready (failure_modes=) confirm the catalog loaded at start.
- diagnose_failure returns probable_failures with no note; the demo agent reports 10 failure modes at start.
- PYTHONPATH=src python -m pytest tests/integration/test_failure_modes_substrates.py proves substrate-agnostic harvest on the shipped sample data.
Test plan
- … pytest tests/), ruff and mypy (strict) clean.
- … 'None' regression), SQL ConnectorError contract (un-connected, invalid row), Excel connect-gating, watcher refresh reload + all-or-nothing atomicity, unknown-coercer validation, and CSV↔JSON substrate equivalence (catalog + diagnosis, field-level).
- … 285d295, residuals listed above.