fix(macos): restore VM inference and Hermes Discord paths#3445
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds VM rootfs DNS monkeypatching and driver-aware inference repair, non-interactive messaging-channel reuse, create-stream readiness output gating, and Discord WebSocket policy/config updates with tests. ChangesDiscord WebSocket messaging support
VM DNS monkeypatch and sandbox integration
Onboarding flow enhancements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
Keep the macOS VM-driver bridge-probe fix under the onboard entrypoint line budget by passing the driver option inline at the existing call sites. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 4449: The new declaration of bridgeProbeOptions is causing a net +1 line;
remove that extra line by inlining or merging it where used: locate the const
bridgeProbeOptions = { drivers: gatewayEnv.OPENSHELL_DRIVERS } and either inline
the object literal at its call site or merge its property into the nearest
existing options object so the standalone declaration is removed and the file
length is reduced by one line.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 485df84d-a4d8-4a69-97a4-b64d9f2f3e6d
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/gateway-sandbox-reachability.test.tssrc/lib/onboard/gateway-sandbox-reachability.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/sandbox/create-stream.ts (1)
318-324:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFlush pending tail output before the close-time ready recovery check.
Line 323 checks
readyCheckOutputMatchedbefore the buffered trailing line is parsed. A final unterminated startup line can be missed, causing an incorrect non-zero result.Suggested fix
child.on("close", (code) => { // One last ready-check: the sandbox may have become Ready between the // last poll tick and the stream exit (e.g. SSH 255 after "Created sandbox:"). + if (pending) { + const trailing = pending; + pending = ""; + flushLine(trailing); + } if (code && code !== 0 && options.readyCheck) { try { if (options.readyCheck() && readyCheckOutputMatched) { finish(0, { forcedReady: true }); return;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/sandbox/create-stream.ts` around lines 318 - 324, The close handler checks readyCheckOutputMatched before the final buffered trailing output is parsed, so flush/parse any pending tail buffer before that check: inside the child.on("close", ...) callback call the same buffered-tail processing function used elsewhere (the routine that consumes the accumulated tail/line buffer used by the ready-check parsing) so any unterminated startup line is processed, then evaluate options.readyCheck() and readyCheckOutputMatched and call finish(0, { forcedReady: true }) if appropriate.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
5921-5934:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCI blocker: reduce net line growth in
src/lib/onboard.ts.Line 5921 introduces a multiline block, and
onboard-entrypoint-budgetis currently failing (+9 net lines). Please collapse this section (and trim a few additional lines if needed) before merge.✂️ One compacting option
- const selectedOpenShellDrivers = (process.env.OPENSHELL_DRIVERS ?? - (process.platform === "darwin" ? "vm" : "docker")) - .split(",") - .map((driver) => driver.trim()) - .filter(Boolean); - const waitForStartupOutputBeforeReadyDetach = selectedOpenShellDrivers.includes("vm"); + const waitForStartupOutputBeforeReadyDetach = (process.env.OPENSHELL_DRIVERS ?? + (process.platform === "darwin" ? "vm" : "docker")) + .split(",") + .some((driver) => driver.trim() === "vm"); const createResult = await streamSandboxCreate(createCommand, sandboxEnv, { @@ - readyCheckOutputPatterns: waitForStartupOutputBeforeReadyDetach - ? [/Setting up NemoClaw/] - : undefined, + readyCheckOutputPatterns: waitForStartupOutputBeforeReadyDetach ? [/Setting up NemoClaw/] : undefined, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 5921 - 5934, Collapse the multiline driver/env/setup block to reduce lines: replace the multi-line selectedOpenShellDrivers and waitForStartupOutputBeforeReadyDetach declarations and the streamSandboxCreate options object into a more compact form (e.g., single-line declaration for selectedOpenShellDrivers using process.env.OPENSHELL_DRIVERS ?? (process.platform === "darwin" ? "vm" : "docker") and inline the waitForStartupOutputBeforeReadyDetach ternary in readyCheckOutputPatterns) while keeping behavior identical for selectedOpenShellDrivers, waitForStartupOutputBeforeReadyDetach, and the readyCheck callback passed to streamSandboxCreate; update references to selectedOpenShellDrivers, waitForStartupOutputBeforeReadyDetach, createResult, streamSandboxCreate, readyCheck, and readyCheckOutputPatterns accordingly so the logic and patterns are unchanged but the code occupies fewer lines.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
4440-4573: Run the onboarding E2E slice for this change.Given this touches gateway reuse logic and sandbox readiness gating, run at least
sandbox-operations-e2eandopenshell-gateway-upgrade-e2eon this branch.As per coding guidelines:
src/lib/onboard.tsis core onboarding logic and includes explicit E2E recommendations for these flows.Also applies to: 5921-5935
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 4440 - 4573, This change affects gateway reuse and sandbox readiness in startDockerDriverGateway (and related helpers like registerDockerDriverGatewayEndpoint, verifySandboxBridgeGatewayReachableOrExit, isDockerDriverGatewayHttpReady), so run the onboarding E2E slice including at minimum sandbox-operations-e2e and openshell-gateway-upgrade-e2e against this branch (locally or in CI) to validate gateway reuse, restart/drift handling, and sandbox gating; checkout the branch, run the project's E2E runner for those two suites (or trigger the CI jobs), reproduce scenarios for existing gateway PID reuse, port-listener adoption, and fresh gateway startup, and report/fix any failing assertions before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/sandbox/create-stream.ts`:
- Around line 318-324: The close handler checks readyCheckOutputMatched before
the final buffered trailing output is parsed, so flush/parse any pending tail
buffer before that check: inside the child.on("close", ...) callback call the
same buffered-tail processing function used elsewhere (the routine that consumes
the accumulated tail/line buffer used by the ready-check parsing) so any
unterminated startup line is processed, then evaluate options.readyCheck() and
readyCheckOutputMatched and call finish(0, { forcedReady: true }) if
appropriate.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5921-5934: Collapse the multiline driver/env/setup block to reduce
lines: replace the multi-line selectedOpenShellDrivers and
waitForStartupOutputBeforeReadyDetach declarations and the streamSandboxCreate
options object into a more compact form (e.g., single-line declaration for
selectedOpenShellDrivers using process.env.OPENSHELL_DRIVERS ??
(process.platform === "darwin" ? "vm" : "docker") and inline the
waitForStartupOutputBeforeReadyDetach ternary in readyCheckOutputPatterns) while
keeping behavior identical for selectedOpenShellDrivers,
waitForStartupOutputBeforeReadyDetach, and the readyCheck callback passed to
streamSandboxCreate; update references to selectedOpenShellDrivers,
waitForStartupOutputBeforeReadyDetach, createResult, streamSandboxCreate,
readyCheck, and readyCheckOutputPatterns accordingly so the logic and patterns
are unchanged but the code occupies fewer lines.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4440-4573: This change affects gateway reuse and sandbox readiness
in startDockerDriverGateway (and related helpers like
registerDockerDriverGatewayEndpoint, verifySandboxBridgeGatewayReachableOrExit,
isDockerDriverGatewayHttpReady), so run the onboarding E2E slice including at
minimum sandbox-operations-e2e and openshell-gateway-upgrade-e2e against this
branch (locally or in CI) to validate gateway reuse, restart/drift handling, and
sandbox gating; checkout the branch, run the project's E2E runner for those two
suites (or trigger the CI jobs), reproduce scenarios for existing gateway PID
reuse, port-listener adoption, and fresh gateway startup, and report/fix any
failing assertions before merging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c27303de-f842-4197-87ca-9930efd4c0d0
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/sandbox/create-stream.test.tssrc/lib/sandbox/create-stream.ts
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
2036-2058:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCI blocker:
src/lib/onboard.tsstill fails the entrypoint budget check.
onboard-entrypoint-budgetis failing becausesrc/lib/onboard.tsgrew by 34 lines. This needs to be neutralized before merge (for example, move the newly added messaging reuse helpers intosrc/lib/onboard/and import them here).Also applies to: 10668-10777
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2036 - 2058, The two new helpers getMessagingProviderNamesForChannel and getReusableStoredMessagingChannelsForNonInteractive are causing the entrypoint budget overflow; extract them into a new module (e.g., onboard messaging helpers) and import them into this file so the main onboard.ts shrinks. Move the implementations as-is but keep their dependencies (isNonInteractive, registry, getKnownMessagingChannels, providerExistsInGateway) as imports in the new module, export the two functions, then replace the inline functions in onboard.ts with imports of getMessagingProviderNamesForChannel and getReusableStoredMessagingChannelsForNonInteractive; ensure types/signatures are unchanged and update any local references.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2036-2040: The Slack provider name check in
getMessagingProviderNamesForChannel only returns `${sandboxName}-slack-bridge`,
which can allow reuse when `${sandboxName}-slack-app` is missing; update
getMessagingProviderNamesForChannel to return both `${sandboxName}-slack-bridge`
and `${sandboxName}-slack-app` for the "slack" channel and make the same change
in the other similar function/block around the second occurrence (lines
referenced near 2053-2057) so reusable Slack validation requires both providers
to be present.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2036-2058: The two new helpers getMessagingProviderNamesForChannel
and getReusableStoredMessagingChannelsForNonInteractive are causing the
entrypoint budget overflow; extract them into a new module (e.g., onboard
messaging helpers) and import them into this file so the main onboard.ts
shrinks. Move the implementations as-is but keep their dependencies
(isNonInteractive, registry, getKnownMessagingChannels, providerExistsInGateway)
as imports in the new module, export the two functions, then replace the inline
functions in onboard.ts with imports of getMessagingProviderNamesForChannel and
getReusableStoredMessagingChannelsForNonInteractive; ensure types/signatures are
unchanged and update any local references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a6a4fd4a-c223-472a-8449-4d7d30e09c43
📒 Files selected for processing (5)
agents/hermes/start.shsrc/lib/onboard.tstest/gateway-liveness-probe.test.tstest/onboard.test.tstest/sandbox-init.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25793891047
|
Selective E2E Results — ❌ Some jobs failedRun: 25794009369
|
…ker-bridge-probe # Conflicts: # src/lib/onboard.ts # src/lib/onboard/gateway-sandbox-reachability.test.ts # src/lib/onboard/gateway-sandbox-reachability.ts # test/gateway-liveness-probe.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25794586425
|
Selective E2E Results — ❌ Some jobs failedRun: 25795005190
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/connect.ts`:
- Around line 41-42: The current LEGACY_CLUSTER_DRIVERS set special-cases too
many drivers (null/undefined/""/"kubernetes") causing Docker to be treated as
non-legacy; change the logic so only the VM driver is special-cased: update the
LEGACY_CLUSTER_DRIVERS definition (and the other occurrence at the same block
referenced in the comment) to only include the VM identifier (e.g., "vm"),
ensuring Docker (e.g., openshellDriver: "docker") falls through the legacy DNS
repair path so repairSandboxInferenceRouteIfNeeded() still calls
runSetupDnsProxy(...); add a small regression test using openshellDriver:
"docker" to verify the legacy path is used.
In `@src/lib/actions/sandbox/vm-dns-monkeypatch.ts`:
- Around line 142-174: realpathIfPresent returning null can mean a dangling
symlink; treat that as a rejected case instead of "missing" to avoid following
the symlink later. After calling realpathIfPresent(target) in the function that
returns {ok, path} (using symbols realpathIfPresent, isPathInside,
opts.mustExist, rootfsReal, relativePath), add a check that calls fs.lstat (or a
helper lstatIfPresent) for target and if it exists and isSymbolicLink() but
realpath returned null, return { ok: false, reason: `refusing to patch
${path.join(...relativePath)} because it is a dangling symlink` } (respecting
existing messaging style); only treat as "missing" when lstat shows no entry.
Also add a regression test for a dangling symlink whose target does not exist.
In `@test/onboard.test.ts`:
- Line 5146: The regex used in the regression assertion makes the third argument
to createSandbox optional by using (?:, sandboxName)?, weakening the test;
update the pattern so the third argument is mandatory (remove the optional
quantifier) so the sequence that includes
getRecordedMessagingChannelsForResume(..., session, sandboxName),
selectedMessagingChannels assignments, setupMessagingChannels(),
readMessagingChannelConfigFromEnv(), onboardSession.updateSession(...
current.messagingChannels/current.messagingChannelConfig ...), and the call to
createSandbox(gpu, model, provider, preferredInferenceApi, sandboxName,
nextWebSearchConfig, selectedMessagingChannels, fromDockerfile, agent,
opts.controlUiPort || null, sandboxGpuConfig) requires sandboxName to be
present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 250033a0-2cdf-4157-8e53-82b65d3d10e8
📒 Files selected for processing (6)
src/lib/actions/sandbox/connect.tssrc/lib/actions/sandbox/vm-dns-monkeypatch.test.tssrc/lib/actions/sandbox/vm-dns-monkeypatch.tssrc/lib/onboard/vm-dns-monkeypatch.tstest/onboard.test.tstest/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/sandbox-connect-inference.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/connect.ts`:
- Around line 166-172: The reapplyVmInferenceRoute function currently
short-circuits health checking when runOpenshell(["inference","set",...])
returns non-zero and omits --no-verify; change it to always append the
"--no-verify" flag to the runOpenshell args (matching the earlier route-switch
path) and do not use the command exit status to decide success—instead always
call isSandboxInferenceRouteHealthy(sandboxName) after runOpenshell (you can
still pass ignoreError: true) and return that probe result; update references to
runOpenshell and isSandboxInferenceHealthy in reapplyVmInferenceRoute
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1d989603-4ffd-422c-bf3c-6e62b3edbce3
📒 Files selected for processing (5)
src/lib/actions/sandbox/connect.tssrc/lib/actions/sandbox/vm-dns-monkeypatch.test.tssrc/lib/actions/sandbox/vm-dns-monkeypatch.tstest/onboard.test.tstest/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/sandbox-connect-inference.test.ts
- test/onboard.test.ts
- src/lib/actions/sandbox/vm-dns-monkeypatch.ts
…ker-bridge-probe Signed-off-by: Carlos Villela <cvillela@nvidia.com> # Conflicts: # src/lib/sandbox/create-stream.test.ts # src/lib/sandbox/create-stream.ts
cv
left a comment
There was a problem hiding this comment.
Resolved the merge conflicts by merging main and preserving both the VM startup-output gate and Docker GPU failure-check paths in create-stream. Reviewed the VM DNS monkeypatch, messaging reuse, Hermes Discord config, and policy changes; no blocking issues found. Local focused suite and PR checks are green.
Selective E2E Results — ✅ All requested jobs passedRun: 25832402958
|
Selective E2E Results — ✅ All requested jobs passedRun: 25859064427
|
Selective E2E Results — ✅ All requested jobs passedRun: 25860706571
|
Selective E2E Results — ✅ All requested jobs passedRun: 25861634603
|
Selective E2E Results — ✅ All requested jobs passedRun: 25861533504
|
Selective E2E Results — ✅ All requested jobs passedRun: 25877688734
|
Selective E2E Results — ✅ All requested jobs passedRun: 25883112020
|
Summary
192.168.127.1) soinference.localresolves from inside VM sandboxes.Readyalone cannot advance onboarding before the sandbox startup command is actually running.*.discord.ggwebsocket policy, and stricter Slack provider reuse checks.Direction / Scope Guardrail
This PR is not the strategic macOS driver direction. It is a narrow compatibility bridge for already-created or explicitly selected OpenShell VM sandboxes while NemoClaw is pinned to the OpenShell behavior that exposed this regression.
Normal macOS onboarding should move back to Docker/Colima in #3454 (
fix(onboard): use Docker driver on macOS). This PR should not defaultOPENSHELL_DRIVERS=vm, should not add installer requirements for VM helper assets, and should not make VM the preferred macOS runtime.OpenShell #1375 has merged upstream and keeps VM driver selection opt-in. Once NemoClaw can consume that OpenShell release path, the durable direction is to rely on Docker/Colima for normal macOS onboarding and keep this VM shim only for explicit/legacy VM cases until it can be removed.
Root Cause
Earlier PR text blamed
#3441. That was too narrow and is not accurate for the final fix. The reverted Docker bridge reachability probe was one visible blocker, but it was not the underlyinginference.localfailure, and that bridge-probe code is no longer part of this PR's final diff.The affected failure chain is a macOS VM-driver mismatch:
8.8.8.8/8.8.4.4). Those can resolve public hostnames, but they cannot resolve OpenShell/NemoClaw synthetic hostnames such asinference.local.inference.localfailed, NemoClaw tried the legacy DNS repair path, which produced misleading gateway-container warnings instead of repairing VM DNS.Readybefore NemoClaw startup output appears. On macOS that allowed onboarding to detach before dashboard/Hermes/OpenClaw startup was actually observable.The Discord failures were downstream runtime issues exposed after the VM sandbox got far enough to run. Discord may use regional websocket hosts such as
gateway-us-east1-d.discord.gg, and Hermes guild-only configuration without explicit user IDs must permit guild members instead of rejecting every Discord user as unauthorized.Tradeoff / Follow-up
The DNS portion of this PR is intentionally a narrow emergency compatibility shim, not the ideal long-term owner boundary. It is Darwin + OpenShell VM-driver gated, best-effort, and disableable with
NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1, but it still depends on today's OpenShell VM rootfs layout, init-script shape, and gvproxy resolver IP (192.168.127.1). That is acceptable only as a compatibility bridge for explicit/legacy VM sandboxes.Durable follow-up is split by owner:
Regression Risk
openshellDriver === "vm"on Darwin, is best-effort, and can be disabled withNEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1.*.discord.gghandling with credential rewrite; it does not broadly open Discord REST beyond the existing Discord policy surface.-slack-bridgeand-slack-app, avoiding partial provider reuse.Validation
npm run build:cligit diff --checknpx vitest run src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts test/sandbox-connect-inference.test.ts test/onboard.test.ts --fileParallelism=falsehttps://inference.local/v1/modelsand chat completions returned 200 from inside the VM sandbox after the DNS patch.Summary by CodeRabbit
New Features
Bug Fixes
Tests