Context
Many provisioners and other HPC management tools implement a state machine for nodes. As nodes move in and out of service and through firmware updates, the tooling provides a shared understanding of each node's state and of the steps required to bring it back into service. Different tools define different sets of states and different transitions between them, and sites have come to rely on their own interpretations of the various indicators. Given OpenCHAMI's emphasis on composability, we should consider two questions. First, should OpenCHAMI support a state machine for nodes, or expect sites to bring their own? Second, if the community decides to support a state machine, should it be internal or external?
Proposal
OpenCHAMI needs a mechanism to manage the lifecycle of HPC nodes as they transition through various states such as in_service, maintenance, rma, firmware_update, etc.
Two architectural options are under consideration:
- Embed the state machine inside the Inventory Service (SMD)
- Use an external state machine through a workflow engine
Each approach aims to support:
- Node state transitions
- Integration with site-specific tooling (e.g., firmware checks, diagnostics)
- Notifications or side effects (e.g., webhook calls, logging, tagging)
The goal of this RFD is to evaluate both strategies and support a clear, maintainable design choice.
Option 1: Embedded State Machine in Inventory Server
The state transition logic lives inside the inventory service (SMD), and node objects include a state field. Transitions are enforced by the inventory API. Notifications to other systems are sent via webhooks.
Example: State Machine Logic (Go)
```go
// NodeState names a lifecycle state of a node.
type NodeState string

const (
	StateInService      NodeState = "in_service"
	StateMaintenance    NodeState = "maintenance"
	StateFirmwareUpdate NodeState = "firmware_update"
)

// validTransitions maps each state to the states it may move to.
var validTransitions = map[NodeState][]NodeState{
	StateInService:      {StateMaintenance},
	StateMaintenance:    {StateFirmwareUpdate, StateInService},
	StateFirmwareUpdate: {StateInService},
}

// isValidTransition reports whether from -> to is listed in validTransitions.
func isValidTransition(from, to NodeState) bool {
	for _, valid := range validTransitions[from] {
		if valid == to {
			return true
		}
	}
	return false
}
```
### Applying a State Transition
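```go
// UpdateNodeState validates and persists a transition, then notifies subscribers.
// db and notifyWebhooks are assumed to be provided elsewhere in the service.
func UpdateNodeState(nodeID string, newState NodeState) error {
	node, err := db.GetNode(nodeID)
	if err != nil {
		return err
	}

	if !isValidTransition(node.State, newState) {
		return fmt.Errorf("invalid state transition from %s to %s", node.State, newState)
	}

	node.State = newState
	if err := db.SaveNode(node); err != nil {
		return err
	}

	// Notify asynchronously so slow subscribers do not block the API call.
	go notifyWebhooks(node)
	return nil
}
```
Benefits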
- Simpler deployment: fewer moving parts
- Easier to understand and trace single-system logic
- REST API directly reflects authoritative state
Drawbacks
- Tightly couples state logic to inventory system
- Difficult to customize transitions per site
- Harder to add async or long-lived logic (e.g., firmware retry, RMA waiting)
- Operational complexity increases with embedded hooks
Option 2: External Workflow Engine for State Management
In this model, the inventory system is decoupled from the state transition logic. Instead, an external workflow engine (e.g., Temporal) manages the node lifecycle as a series of state transitions. The inventory system records the current state and also maintains a trail of all past transitions, enabling full reconstruction of a node’s lifecycle history.
Workflow Engine Responsibilities
- Own and enforce the valid transitions between states
- Perform asynchronous tasks (e.g., firmware checks, diagnostics, RMA wait)
- Call back to the inventory service to:
  - Update the node’s current state
  - Append a structured entry to the state transition log
Inventory System Responsibilities
- Expose a REST API for getting and updating node metadata
- Store:
  - `current_state` (e.g., `in_service`)
  - `state_history`: a chronological list of transitions with metadata (timestamps, initiator, reason)
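One way to model the stored fields is sketched below. The JSON keys `current_state`, `state_history`, `from`, `to`, `timestamp`, `initiator`, and `note` come from this proposal; the Go type names, the `id` field, and the `Append` and `ParseEntry` helpers are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TransitionEntry is one element of state_history.
type TransitionEntry struct {
	From      string    `json:"from"`
	To        string    `json:"to"`
	Timestamp time.Time `json:"timestamp"`
	Initiator string    `json:"initiator"`
	Note      string    `json:"note,omitempty"`
}

// NodeRecord holds the current state plus the full transition trail.
// The ID field is a hypothetical node identifier.
type NodeRecord struct {
	ID           string            `json:"id"`
	CurrentState string            `json:"current_state"`
	StateHistory []TransitionEntry `json:"state_history"`
}

// Append records e and advances CurrentState. It rejects an entry whose From
// does not match the current state, so the history stays a contiguous chain.
func (n *NodeRecord) Append(e TransitionEntry) error {
	if e.From != n.CurrentState {
		return fmt.Errorf("entry starts at %q but node is in %q", e.From, n.CurrentState)
	}
	n.StateHistory = append(n.StateHistory, e)
	n.CurrentState = e.To
	return nil
}

// ParseEntry decodes one JSON transition log entry.
func ParseEntry(data []byte) (TransitionEntry, error) {
	var e TransitionEntry
	err := json.Unmarshal(data, &e)
	return e, err
}
```

Keeping the check in `Append` means the inventory only guards the integrity of the log; which transitions are allowed remains the workflow engine's decision.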
Example transition log entry:
```json
{
  "from": "maintenance",
  "to": "firmware_update",
  "timestamp": "2025-06-18T16:44:00Z",
  "initiator": "temporal:ReturnToServiceWorkflow",
  "note": "Automated firmware validation triggered"
}
```
Benefits
- Supports site-specific lifecycle requirements without bloating inventory logic
- Enables asynchronous and long-running transitions (e.g., RMA wait)
- Provides an auditable history of state changes
- Easier to evolve workflows independently of the inventory API
- External workflow systems have a broader support community than an internal state machine would, and are likely to be more customizable
Drawbacks
- Adds operational complexity (deployment and monitoring of workflow engine)
- Requires consistency between workflow engine and inventory state
- Developers must manage state synchronization explicitly (e.g., retries on failed updates)