feat: route machine lifecycle power through component backends#2959
feat: route machine lifecycle power through component backends#2959osu wants to merge 2 commits into
Conversation
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
Important Review skippedNo new commits to review since the last review. ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughImplements per-machine compute-tray power routing through Core Redfish (direct), RMS-configured (direct), or rack state-controller (queued) paths. Introduces ChangesCompute Power Routing via Component Manager Backends
Sequence Diagram(s)sequenceDiagram
participant Client
participant ComponentManagerHandler
participant classify as classify_compute_power_targets
participant dispatch as dispatch_compute_power_direct
participant CoreComputeTrayManager
participant RMSComputeTrayManager
participant RackHandler as queue_compute_power_control_via_rack_state_controller
participant DB
Client->>ComponentManagerHandler: component_power_control(MachineIds, action)
ComponentManagerHandler->>classify: deduplicate + classify machines
classify-->>ComponentManagerHandler: CoreDirect / ConfiguredDirect / RackStateController buckets
ComponentManagerHandler->>dispatch: dispatch CoreDirect bucket
dispatch->>DB: update power_options (optimistic version check)
dispatch->>CoreComputeTrayManager: power_control(endpoints, action)
CoreComputeTrayManager-->>dispatch: results
dispatch-->>ComponentManagerHandler: per-machine ComponentResults
ComponentManagerHandler->>dispatch: dispatch ConfiguredDirect bucket
dispatch->>RMSComputeTrayManager: power_control(endpoints, action)
RMSComputeTrayManager-->>dispatch: results
dispatch-->>ComponentManagerHandler: per-machine ComponentResults
ComponentManagerHandler->>RackHandler: queue RackStateController bucket
RackHandler->>DB: SELECT rack FOR UPDATE
RackHandler->>DB: set maintenance_requested (MaintenanceScope + PowerControl)
RackHandler-->>ComponentManagerHandler: queued success
ComponentManagerHandler-->>Client: per-machine results
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-28 20:15:02 UTC | Commit: 3e4bbc4 |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/api-core/src/tests/common/api_fixtures/mod.rs (1)
1366-1382: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winPass the Redfish simulator into
build_component_managerfor Core compute overrides.
TestEnvOverrides::component_manager_confignow exposes the backend choice, but this fixture always callsbuild_component_manager(..., None)for the Redfish pool. Any test that setscompute_tray_backend: Corewill fail during env construction even thoughredfish_simis already available in this scope.Proposed fix
let test_component_manager = component_manager::component_manager::build_component_manager( &component_manager_config, component_manager_rack_profiles, rms_sim.as_rms_client(), None, Some(db_pool.clone()), - None, + Some(redfish_sim.clone()), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 1366 - 1382, The `build_component_manager` call in the test fixture is still hardcoding the Redfish pool argument to `None`, which breaks env construction when `TestEnvOverrides::component_manager_config` selects `compute_tray_backend: Core`. Update the fixture so the existing `redfish_sim` in scope is passed through to `component_manager::component_manager::build_component_manager` when building `test_component_manager`, while preserving the current `rms_sim.as_rms_client()` and `db_pool` wiring. Use the `build_component_manager` and `TestEnvOverrides::component_manager_config` symbols to locate the change.crates/api-core/src/handlers/component_manager.rs (1)
1374-1571: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winReject duplicate BMC IPs before direct dispatch.
ip_to_machine_idkeeps only one machine per BMC IP, whiledispatch_endpointscan still contain multiple endpoints with that same IP. In that case, direct dispatch can issue duplicate power operations and then attribute backend results to the wrong machine. Mirror the rack-controller guard by detecting duplicate BMC IPs during endpoint resolution and returning per-machine errors instead of dispatching them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/component_manager.rs` around lines 1374 - 1571, Duplicate BMC IPs are still allowed into the direct-dispatch path, which can make `dispatch_compute_power_direct` send repeated power actions and misattribute results because `ip_to_machine_id` only retains one machine per IP. Add a duplicate-IP check while building endpoints in `resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated BMC IPs are rejected with per-machine `error_result` entries instead of being added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and `ip_to_machine_id` flow as the place to mirror the rack-controller guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/handlers/rack.rs`:
- Around line 831-845: The maintenance scheduling flow in rack.rs is holding the
rack transaction open while awaiting credential_manager.set_credentials, which
can block DB locks on an external dependency. Move credential storage out of the
transaction scope in the maintenance scheduling path (around the rack token
handling in the maintenance_access_token block), or switch to a
request-scoped/two-phase key so the DB transaction only covers rack state
changes. Keep the transaction limited to the DB update logic and perform the
credential backend write after the transaction has been committed.
In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 337-343: The controller-path test fixture is using only the base
credential manager, so it misses credentials available through the composite
snapshot-based reader. Update the relevant constructors in the test fixture to
inject the composite credential reader built from
test_static_credential_snapshot() plus credential_manager, matching the API path
behavior; use the existing CompositeCredentialManager and credential_reader
symbols to locate the affected setup.
In `@crates/machine-controller/src/handler.rs`:
- Around line 8683-8699: The InitialReset flow in the host state handler is
mixing backends by calling services.power_control before immediately polling
with create_redfish_client_from_machine/get_power_state, which can route rack
machines through RMS and then fail on direct BMC reads. Update the InitialReset
path in handler::StateHandlerOutcome/state transition logic so it stays on
Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path
before any RMS-driven power action; make the same fix in the other matching
InitialReset block referenced by the review.
- Around line 10302-10319: In dispatch_core_host_power_control, the current
action can be downgraded from ACPowercycle to ForceOff, but the later DPU
bookkeeping still sees the original value. Return the effective power action
from dispatch_core_host_power_control, capture that result at the call site, and
use it before the DPU reboot timestamp loop so the recorded action matches the
one actually executed; keep the existing ACPowercycle fallback and the
dispatch_component_manager_power_control path in sync.
---
Outside diff comments:
In `@crates/api-core/src/handlers/component_manager.rs`:
- Around line 1374-1571: Duplicate BMC IPs are still allowed into the
direct-dispatch path, which can make `dispatch_compute_power_direct` send
repeated power actions and misattribute results because `ip_to_machine_id` only
retains one machine per IP. Add a duplicate-IP check while building endpoints in
`resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated
BMC IPs are rejected with per-machine `error_result` entries instead of being
added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the
existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and
`ip_to_machine_id` flow as the place to mirror the rack-controller guard.
In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 1366-1382: The `build_component_manager` call in the test fixture
is still hardcoding the Redfish pool argument to `None`, which breaks env
construction when `TestEnvOverrides::component_manager_config` selects
`compute_tray_backend: Core`. Update the fixture so the existing `redfish_sim`
in scope is passed through to
`component_manager::component_manager::build_component_manager` when building
`test_component_manager`, while preserving the current `rms_sim.as_rms_client()`
and `db_pool` wiring. Use the `build_component_manager` and
`TestEnvOverrides::component_manager_config` symbols to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3d98a856-b97a-4c69-aa11-1a5dde0f6a21
⛔ Files ignored due to path filters (4)
Cargo.lockis excluded by!**/*.lockrest-api/flow/internal/nicoapi/gen/nico.pb.gois excluded by!**/*.pb.go,!**/gen/**,!rest-api/**/*.pb.gorest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.gois excluded by!**/*.pb.go,!rest-api/**/*.pb.gorest-api/workflow-schema/site-agent/workflows/v1/nico_nico.protois excluded by!rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (30)
crates/api-core/src/handlers/component_manager.rscrates/api-core/src/handlers/rack.rscrates/api-core/src/setup.rscrates/api-core/src/tests/common/api_fixtures/mod.rscrates/api-core/src/tests/component_manager_compute_power.rscrates/api-core/src/tests/machine_power.rscrates/api-core/src/tests/machine_states.rscrates/api-core/src/tests/mod.rscrates/api-core/src/tests/rack_state_controller/handler.rscrates/api-db/src/power_options.rscrates/api-model/src/machine/mod.rscrates/api-model/src/rack.rscrates/component-manager/src/component_manager.rscrates/component-manager/src/compute_tray_manager.rscrates/component-manager/src/config.rscrates/component-manager/src/core_compute_manager.rscrates/component-manager/src/rms.rscrates/machine-controller/Cargo.tomlcrates/machine-controller/src/context.rscrates/machine-controller/src/handler.rscrates/machine-controller/src/redfish.rscrates/rack-controller/Cargo.tomlcrates/rack-controller/src/context.rscrates/rack-controller/src/io.rscrates/rack-controller/src/maintenance.rscrates/rack-controller/src/ready.rscrates/rpc/proto/forge.protorest-api/flow/docs/component-manager-config.mdrest-api/flow/internal/nicoapi/nicoproto/nico.protorest-api/flow/internal/task/componentmanager/compute/nico/nico.go
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
/ok to test 89f4e30 |
|
Is this ready for review? |
|
@zhaozhongn no |
Route compute-machine lifecycle power operations through the Component Manager abstraction.
Related issues
Closes #2053
Addresses #2421
Addresses #2679
Type of Change
Breaking Changes
Testing
Validated with:
cargo test -p component-manager --lib -- --test-threads=1(81 passed)cargo test --locked -p carbide-machine-controller -p carbide-rack-controller -p carbide-rack -p carbide-secrets --no-default-features --lib -- --test-threads=1(39, 35, 14, and 42 passed)carbide-api-coredatabase-backed integration tests for standalone Core routing, rack queue/bypass dispatch, duplicate BMC rejection, Core-only initial reset, AC-power-cycle fallback bookkeeping, concurrent maintenance-token ownership, firmware-token cleanup, and force-delete cleanupcarbide-api-modelrack scope serialization and state-SLA testscargo check --lockedfor the affected Component Manager, controller, rack, secrets, API-core, and API-model cratescargo clippy --lockedwith warnings denied for the affected targetscargo +nightly fmt --all -- --checkcargo test --locked --no-runfor all four originally affected cratesAdditional Notes
The durable rack dispatch marker intentionally fails a recovered in-flight request closed instead of replaying non-idempotent restart or AC-cycle actions.
The follow-up removes comment-only protobuf, generated Go, and Flow documentation edits so the PR does not pull unrelated generated-file drift into its CI scope.