fix(onboard): refresh provider state on agent changes#3857
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
📝 WalkthroughWalkthroughDetect agent changes during resume to clear agent-scoped session state and stop tracked model-router processes; treat built-in "brave" as stale when webSearchConfig is absent, filtering it from suggestions and carry-forward logic. Tests added/updated across onboarding and policy tiers. ChangesResume Agent Change & Brave Preset Handling
Sequence Diagram(s)sequenceDiagram
participant ResumeHandler as Onboard Resume
participant AgentCompare as Agent Comparison
participant RouterMgr as stopTrackedModelRouterForAgentChange
participant StateClear as clearAgentScopedResumeState
participant ConflictCheck as getResumeConfigConflicts
participant SandboxLogic as Sandbox Reuse Logic
ResumeHandler->>AgentCompare: compare normalized recordedAgent vs selectedAgent
AgentCompare-->>ResumeHandler: resumeAgentChanged flag
ResumeHandler->>RouterMgr: stop tracked model-router if pid present
ResumeHandler->>StateClear: clear agent-scoped resume/session state
StateClear-->>ResumeHandler: updated Session
ResumeHandler->>ConflictCheck: recompute resume config conflicts
ConflictCheck-->>ResumeHandler: return conflicts (agent mismatch not a conflict)
ResumeHandler->>SandboxLogic: evaluate reuse (requires !resumeAgentChanged)
SandboxLogic-->>ResumeHandler: decide recreate or reuse
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4478-4523: The new resume/preset flow (functions
resetStepForAgentChange and clearAgentScopedResumeState) should be moved out of
the large core onboarding file into a dedicated onboard helpers module: create a
new module exposing resetStepForAgentChange(session, stepName) and
clearAgentScopedResumeState(session, selectedAgentName) and import them where
used; preserve all logic (normalization via normalizeSandboxAgentName, session
field resets, resetSteps array and lastCompletedStep/lastStepStarted
adjustments) and update callers to use the new helpers so the core onboarding
file no longer grows with this agent-change/session-reset logic.
- Around line 9420-9428: When switching agents in the resume branch, stop the
tracked model-router process before clearing routed session state so we don't
leave a live router with discarded credentials; detect the router PID/credential
fields on the current session and call the routine that stops the tracked router
(e.g. stopTrackedModelRouter or the existing stopModelRouter helper) prior to
calling onboardSession.updateSession and clearAgentScopedResumeState; ensure
routerPid and routerCredentialHash are cleared only after the router has been
stopped so reconcileModelRouter() won’t encounter a healthy endpoint with
unknown credentials.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d2d83640-1f40-4b82-b81b-0970e85c110d
📒 Files selected for processing (7)
src/lib/agent/defs.test.tssrc/lib/onboard.tssrc/lib/onboard/policy-selection.tstest/onboard-policy-suggestions.test.tstest/onboard-preset-diff.test.tstest/onboard.test.tstest/policy-tiers-onboard.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26142155076
|
jyaunches
left a comment
There was a problem hiding this comment.
Thanks for the fix — the agent-change resume behavior and Brave policy handling both look directionally correct, and the advisor-required E2Es passed against cb653bbc9bda834cfc5dada51ac709db0dde3afd (onboard-resume-e2e, network-policy-e2e, brave-search-e2e, hermes-e2e).
One blocker before merge:
🔴 Monolith growth / failing budget check: this PR grows src/lib/onboard.ts from 10332 → 10402 lines (+70), and the required onboard-entrypoint-budget check is failing. Please move the new agent-resume state-clearing helpers (clearAgentScopedResumeState / resetStepForAgentChange) into a focused src/lib/onboard/* module and import them from onboard.ts, or otherwise offset the growth by extracting existing onboard logic in the same PR. This is especially important because there is an active onboard extraction stack in flight (#3861, #3868, #3870, #3871, #3872, #3873, #3874, #3876).
Suggestions while touching this:
- Add a short comment near the resume-agent-change branch explaining why agent mismatch is no longer a hard resume conflict and instead clears agent-scoped state.
- Consider hoisting the stale built-in Brave predicate to a small helper if we expect more tests/paths to need the custom-vs-built-in distinction.
No security or E2E blockers found beyond the monolith budget failure.
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review used the provided deterministic GitHub context and supplied diff; no repository commands, package-manager commands, PR scripts, or tests were executed.; The supplied diff is marked truncated if large, so line-specific verification is limited to visible hunks and deterministic metadata.; CI and E2E results for current head f1e26db were incomplete in the trusted context.; No linked issues were present, so acceptance coverage maps PR body clauses rather than issue clauses. Full advisor summaryPR Review AdvisorBase: Directionally aligned with the onboarding resume/policy fix, but blocked by pending CI, BLOCKED merge state/CHANGES_REQUESTED, and missing required E2E evidence for the current head SHA f1e26db. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26172943149
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/lib/onboard/model-router-process.ts (2)
49-51: ⚡ Quick winPrefix unused loop variable with underscore.
The
attemptvariable is only used to control iteration count and is never read. As per coding guidelines, unused variables should be prefixed with underscore following Biome conventions.♻️ Proposed fix
- for (let attempt = 0; attempt < 10; attempt++) { + for (let _attempt = 0; _attempt < 10; _attempt++) { await new Promise((resolve) => setTimeout(resolve, 500)); if (!isProcessRunning(pid) && !(await isRouterHealthy(port, 1000))) return; }Same for line 58:
- for (let attempt = 0; attempt < 5; attempt++) { + for (let _attempt = 0; _attempt < 5; _attempt++) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/model-router-process.ts` around lines 49 - 51, The loop uses an unused loop variable `attempt`; rename it to `_attempt` (e.g., for (let _attempt = 0; _attempt < 10; _attempt++)) to follow the Biome convention for unused variables, and make the same change for the second loop noted (the one similar at line 58); no other logic changes are required—just update the loop variable name wherever `attempt` is declared in these retry loops that call isProcessRunning(pid) and isRouterHealthy(port, 1000).
12-12: ⚡ Quick winPrefer top-level ESM import over inline
require.Using
require("http")inside the function is inconsistent with the rest of the codebase style and the type import on line 21. Move to a top-level import for consistency.♻️ Proposed fix
// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. // SPDX-License-Identifier: Apache-2.0 +import http from "node:http"; import type { Session } from "../state/onboard-session";Then update line 12:
- const http = require("http");And line 20-21:
const request = http - .get(`http://127.0.0.1:${port}/health`, (res: import("node:http").IncomingMessage) => { + .get(`http://127.0.0.1:${port}/health`, (res: http.IncomingMessage) => {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/model-router-process.ts` at line 12, Replace the inline CommonJS usage of const http = require("http") with a top-level ESM import to match the codebase style: remove the require and add a top-level import (e.g. import * as http from "http") and ensure any existing type references (like IncomingMessage/ServerResponse) continue to import from "http" as needed; update uses of the local const http variable accordingly in model-router-process.ts so there is no inline require left.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/model-router-process.ts`:
- Around line 49-51: The loop uses an unused loop variable `attempt`; rename it
to `_attempt` (e.g., for (let _attempt = 0; _attempt < 10; _attempt++)) to
follow the Biome convention for unused variables, and make the same change for
the second loop noted (the one similar at line 58); no other logic changes are
required—just update the loop variable name wherever `attempt` is declared in
these retry loops that call isProcessRunning(pid) and isRouterHealthy(port,
1000).
- Line 12: Replace the inline CommonJS usage of const http = require("http")
with a top-level ESM import to match the codebase style: remove the require and
add a top-level import (e.g. import * as http from "http") and ensure any
existing type references (like IncomingMessage/ServerResponse) continue to
import from "http" as needed; update uses of the local const http variable
accordingly in model-router-process.ts so there is no inline require left.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 151561f4-e716-4727-afe2-e3946a3acde8
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/agent-resume-state.tssrc/lib/onboard/model-router-process.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26173052957
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26174948896
|
…pe-brave-policy Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26192033420
|
…pe-brave-policy Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
9316-9329: Please run an onboarding E2E for the agent-change resume path.This branch now tears down tracked router state, clears agent-scoped session data, and changes resume-time sandbox reuse/recreation semantics. A targeted onboard lifecycle E2E such as
sandbox-operations-e2eorchannels-stop-start-e2ewould give much better coverage than unit-only validation here.As per coding guidelines,
src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9316 - 9329, Add an E2E test exercising the agent-change resume path to validate the new teardown and session-clear behavior: create or update a test (e.g., extend sandbox-operations-e2e or channels-stop-start-e2e) that starts a sandbox with a recorded agent, triggers a resume with a different selected agent, and asserts the tracked router teardown via stopTrackedModelRouterForAgentChange, that onboardSession.updateSession calls clearAgentScopedResumeState for the new selectedAgentName, and that sandbox provider selection/recreation and sandbox reuse semantics (formatSandboxAgentName changes) behave as expected (no residual scoped state, provider refreshed, and resume succeeds). Ensure the E2E covers the router port fallback (loadBlueprintProfile("routed")?.router.port || 4000) and verifies the resumed sandbox lifecycle completes end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9316-9329: Add an E2E test exercising the agent-change resume path
to validate the new teardown and session-clear behavior: create or update a test
(e.g., extend sandbox-operations-e2e or channels-stop-start-e2e) that starts a
sandbox with a recorded agent, triggers a resume with a different selected
agent, and asserts the tracked router teardown via
stopTrackedModelRouterForAgentChange, that onboardSession.updateSession calls
clearAgentScopedResumeState for the new selectedAgentName, and that sandbox
provider selection/recreation and sandbox reuse semantics
(formatSandboxAgentName changes) behave as expected (no residual scoped state,
provider refreshed, and resume succeeds). Ensure the E2E covers the router port
fallback (loadBlueprintProfile("routed")?.router.port || 4000) and verifies the
resumed sandbox lifecycle completes end-to-end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 17235573-6de8-4035-ba27-975667b7c748
📒 Files selected for processing (1)
src/lib/onboard.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26196733519
|
jyaunches
left a comment
There was a problem hiding this comment.
Thanks for addressing the review feedback. The monolith growth blocker is resolved ( is now net-negative vs current and is green), the agent-scoped resume helpers were extracted, and the tracked model-router cleanup now runs before clearing routed session state.
I also kicked off the remaining selective regression E2E for against the current PR head :
https://github.com/NVIDIA/NemoClaw/actions/runs/26197650629
Approving from code review; please make sure that E2E finishes green before merge.
|
Follow-up on my approval comment above (the shell ate the backticked text): the monolith growth blocker is resolved: |
Summary
Validation
Notes
test-clipre-commit/pre-push coverage was attempted twice and failed in unrelated timeout/CLI-dispatch cases outside this change surface; pushed withSKIP=test-cliafter the focused suite and non-test hooks passed.Summary by CodeRabbit
Bug Fixes
New Features
Tests