[RFD] Node Lifecycle State Machine #91

@alexlovelltroy

Context

Many provisioners and other HPC management tools implement a state machine for nodes. As nodes move in and out of service and through firmware updates, the tooling provides a shared understanding of each node's state and of the steps required to bring it back into service. Tools differ in the states they define and in the transitions they allow between those states, and sites have come to rely on their own interpretation of various indicators. Given OpenCHAMI's emphasis on composability, we should consider two questions. First, should OpenCHAMI support a state machine for nodes, or expect sites to bring their own? Second, if the community decides to support a state machine, should it be internal or external?

Proposal

OpenCHAMI needs a mechanism to manage the lifecycle of HPC nodes as they transition through various states such as in_service, maintenance, rma, firmware_update, etc.

Two architectural options are under consideration:

  1. Embed the state machine inside the Inventory Service (SMD)
  2. Use an external state machine through a workflow engine

Each approach aims to support:

  • Node state transitions
  • Integration with site-specific tooling (e.g., firmware checks, diagnostics)
  • Notifications or side effects (e.g., webhook calls, logging, tagging)

The goal of this RFD is to evaluate both strategies and support a clear, maintainable design choice.


Option 1: Embedded State Machine in Inventory Server

The state transition logic lives inside the inventory service (SMD), and node objects include a state field. Transitions are enforced by the inventory API. Notifications to other systems are sent via webhooks.

Example: State Machine Logic (Go)

```go
// NodeState is the lifecycle state stored on each node object.
type NodeState string

const (
    StateInService      NodeState = "in_service"
    StateMaintenance    NodeState = "maintenance"
    StateFirmwareUpdate NodeState = "firmware_update"
)

// validTransitions maps each state to the states reachable from it.
var validTransitions = map[NodeState][]NodeState{
    StateInService:      {StateMaintenance},
    StateMaintenance:    {StateFirmwareUpdate, StateInService},
    StateFirmwareUpdate: {StateInService},
}

// isValidTransition reports whether moving from one state to another is allowed.
func isValidTransition(from, to NodeState) bool {
    for _, valid := range validTransitions[from] {
        if valid == to {
            return true
        }
    }
    return false
}
```

### Applying a State Transition

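```go
// UpdateNodeState validates and applies a transition, then notifies webhooks.
// db and notifyWebhooks are assumed to be provided by the inventory service;
// the snippet also needs `import "fmt"`.
func UpdateNodeState(nodeID string, newState NodeState) error {
    node, err := db.GetNode(nodeID)
    if err != nil {
        return fmt.Errorf("load node %s: %w", nodeID, err)
    }

    if !isValidTransition(node.State, newState) {
        return fmt.Errorf("invalid state transition from %s to %s", node.State, newState)
    }

    node.State = newState
    if err := db.SaveNode(node); err != nil {
        return fmt.Errorf("save node %s: %w", nodeID, err)
    }

    // Notify asynchronously so a slow webhook receiver cannot block the API call.
    go notifyWebhooks(node)

    return nil
}
```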

Benefits

  • Simpler deployment: fewer moving parts
  • Easier to understand and trace single-system logic
  • REST API directly reflects authoritative state

Drawbacks

  • Tightly couples state logic to inventory system
  • Difficult to customize transitions per site
  • Harder to add async or long-lived logic (e.g., firmware retry, RMA waiting)
  • Operational complexity increases with embedded hooks

Option 2: External Workflow Engine for State Management

In this model, the inventory system is decoupled from the state transition logic. Instead, an external workflow engine (e.g., Temporal) manages the node lifecycle as a series of state transitions. The inventory system records the current state and also maintains a trail of all past transitions, enabling full reconstruction of a node’s lifecycle history.

Workflow Engine Responsibilities

  • Own and enforce the valid transitions between states
  • Perform asynchronous tasks (e.g., firmware checks, diagnostics, RMA wait)
  • Call back to the inventory service to:
    • Update the node’s current state
    • Append a structured entry to the state transition log

Inventory System Responsibilities

  • Expose a REST API for getting and updating node metadata
  • Store:
    • current_state (e.g., in_service)
    • state_history: a chronological list of transitions with metadata (timestamps, initiator, reason)

Example transition log entry:

```json
{
  "from": "maintenance",
  "to": "firmware_update",
  "timestamp": "2025-06-18T16:44:00Z",
  "initiator": "temporal:ReturnToServiceWorkflow",
  "note": "Automated firmware validation triggered"
}
```
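
On the inventory side, the entry above could map to Go types along the following lines. The JSON keys for the transition come from the example; `NodeRecord`, its `id` field, `RecordTransition`, and the node ID used in `main` are hypothetical. Rejecting an entry whose `from` does not match the stored state is one possible consistency check, not a requirement stated in this RFD.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Transition mirrors the example log entry.
type Transition struct {
	From      string    `json:"from"`
	To        string    `json:"to"`
	Timestamp time.Time `json:"timestamp"`
	Initiator string    `json:"initiator"`
	Note      string    `json:"note,omitempty"`
}

// NodeRecord is a hypothetical inventory record holding the current
// state and the chronological transition history.
type NodeRecord struct {
	ID           string       `json:"id"`
	CurrentState string       `json:"current_state"`
	StateHistory []Transition `json:"state_history"`
}

// RecordTransition appends the entry and updates the current state.
// It rejects entries whose "from" does not match the stored state.
func (n *NodeRecord) RecordTransition(t Transition) error {
	if t.From != n.CurrentState {
		return fmt.Errorf("stale transition: node %s is in %q, entry says %q", n.ID, n.CurrentState, t.From)
	}
	n.StateHistory = append(n.StateHistory, t)
	n.CurrentState = t.To
	return nil
}

func main() {
	raw := []byte(`{
	  "from": "maintenance",
	  "to": "firmware_update",
	  "timestamp": "2025-06-18T16:44:00Z",
	  "initiator": "temporal:ReturnToServiceWorkflow",
	  "note": "Automated firmware validation triggered"
	}`)

	var t Transition
	if err := json.Unmarshal(raw, &t); err != nil {
		panic(err)
	}

	node := &NodeRecord{ID: "node-1", CurrentState: "maintenance"} // hypothetical node
	if err := node.RecordTransition(t); err != nil {
		panic(err)
	}

	out, err := json.MarshalIndent(node, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```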

Benefits

  • Supports site-specific lifecycle requirements without bloating inventory logic
  • Enables asynchronous and long-running transitions (e.g., RMA wait)
  • Provides an auditable history of state changes
  • Easier to evolve workflows independently of the inventory API
  • External workflow engines have a broader community of support than an internal state machine would, and are likely to be more customizable as well

Drawbacks

  • Adds operational complexity (deployment and monitoring of workflow engine)
  • Requires consistency between workflow engine and inventory state
  • Developers must manage state synchronization explicitly (e.g., retries on failed updates)
