Add channel health diagnostics and gateway recovery fixes#855
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f4a48082c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const buffer = Buffer.allocUnsafe(bytesToRead); | ||
| await file.read(buffer, 0, bytesToRead, position); | ||
| content = `${buffer.toString('utf-8')}${content}`; |
There was a problem hiding this comment.
Respect short reads when tailing diagnostics logs
readTail allocates an unsafe buffer and appends buffer.toString('utf-8') without checking how many bytes were actually read. If gateway.log is truncated or rotated between stat() and read(), fs.read can return fewer bytes than requested, and the unread part of an unsafe buffer may contain stale memory, which can leak garbage into the diagnostics snapshot (and copied clipboard payload). Use bytesRead to slice before decoding (or switch to Buffer.alloc).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR improves channel reliability UX by adding gateway health diagnostics, exposing degraded/unresponsive states in the Channels UI with recovery actions, and sanitizing legacy OpenClaw acpx config/install metadata that can prevent gateway startup after upgrades.
Changes:
- Add gateway health summary + diagnostics snapshot endpoint and surface it in the Channels page (banner, restart, copy/view diagnostics).
- Track gateway transport diagnostics (heartbeat misses, RPC timeouts, socket close, last-alive timestamps) and enable non-Windows heartbeat-based recovery.
- Sanitize legacy
acpxoverrides and stale bundled install metadata to prevent gateway startup failures post-upgrade.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/sanitize-config.test.ts | Adds unit coverage for removing legacy acpx overrides and stale install metadata. |
| tests/unit/gateway-manager-heartbeat.test.ts | Updates heartbeat recovery expectations (restart on threshold; Windows remains conservative). |
| tests/unit/gateway-manager-diagnostics.test.ts | New unit coverage for diagnostics timestamps/counters and health summary behavior. |
| tests/unit/diagnostics-routes.test.ts | New unit coverage for diagnostics snapshot route payload (channel view + tailed logs). |
| tests/unit/channels-page.test.tsx | Adds unit coverage for degraded banner + copy/view diagnostics snapshot behavior. |
| tests/unit/channel-status.test.ts | Adds coverage for degraded overlay interactions and priority rules. |
| tests/unit/channel-routes.test.ts | Adds coverage for degraded channel health when channels.status times out but gateway is running. |
| tests/e2e/channels-health-diagnostics.spec.ts | New E2E coverage for degraded banner, gateway restart, and diagnostics copy. |
| src/types/channel.ts | Extends ChannelStatus to include degraded. |
| src/pages/Channels/index.tsx | Adds gateway health banner, restart action, and diagnostics snapshot copy/view UX. |
| src/lib/channel-status.ts | Adds gateway health overlay support to renderer-side status computation. |
| src/i18n/locales/en/channels.json | Adds localized strings for gateway health banner/actions and “degraded” status. |
| src/i18n/locales/zh/channels.json | Same as above (zh). |
| src/i18n/locales/ja/channels.json | Same as above (ja). |
| electron/utils/openclaw-auth.ts | Sanitizes legacy acpx config keys and removes stale plugins.installs.acpx metadata. |
| electron/utils/gateway-health.ts | New helper to compute gateway health summaries from status + diagnostics. |
| electron/utils/channel-status.ts | Adds gateway health overlay support to main-process status computation. |
| electron/gateway/manager.ts | Introduces diagnostics tracking + non-Windows heartbeat recovery restart behavior. |
| electron/api/server.ts | Registers new diagnostics route handler. |
| electron/api/routes/diagnostics.ts | Adds /api/diagnostics/gateway-snapshot endpoint with tailed logs + gateway/channel snapshot. |
| electron/api/routes/channels.ts | Returns gatewayHealth alongside channel view; exports diagnostics timestamps for snapshot route. |
| .gitignore | Ignores .claude/. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => { | ||
| const snapshot = await hostApiFetch<GatewayDiagnosticSnapshot>('/api/diagnostics/gateway-snapshot'); | ||
| setDiagnosticsSnapshot(snapshot); | ||
| return snapshot; |
There was a problem hiding this comment.
hostApiFetch via the Electron IPC proxy does not throw on non-2xx HTTP responses (it returns data.json even when data.ok/status indicate failure). If /api/diagnostics/gateway-snapshot returns a 500 { success:false, error }, this code will still set it as the snapshot and later copy/display it as if it were valid. Suggest validating the payload shape (e.g. capturedAt/platform/gateway) and/or explicitly throwing when success === false / error is present before calling setDiagnosticsSnapshot.
| const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => { | |
| const snapshot = await hostApiFetch<GatewayDiagnosticSnapshot>('/api/diagnostics/gateway-snapshot'); | |
| setDiagnosticsSnapshot(snapshot); | |
| return snapshot; | |
| const isGatewayDiagnosticSnapshot = (value: unknown): value is GatewayDiagnosticSnapshot => { | |
| if (!value || typeof value !== 'object') { | |
| return false; | |
| } | |
| const snapshot = value as Record<string, unknown>; | |
| return ( | |
| typeof snapshot.capturedAt === 'string' && | |
| typeof snapshot.platform === 'object' && | |
| snapshot.platform !== null && | |
| typeof snapshot.gateway === 'object' && | |
| snapshot.gateway !== null | |
| ); | |
| }; | |
| const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => { | |
| const response = await hostApiFetch<unknown>('/api/diagnostics/gateway-snapshot'); | |
| if (response && typeof response === 'object') { | |
| const payload = response as Record<string, unknown>; | |
| if (payload.success === false || typeof payload.error === 'string') { | |
| throw new Error(typeof payload.error === 'string' ? payload.error : 'Failed to fetch gateway diagnostics snapshot'); | |
| } | |
| } | |
| if (!isGatewayDiagnosticSnapshot(response)) { | |
| throw new Error('Invalid gateway diagnostics snapshot response'); | |
| } | |
| setDiagnosticsSnapshot(response); | |
| return response; |
| const handleRestartGateway = async () => { | ||
| try { | ||
| await hostApiFetch<{ success: boolean; error?: string }>('/api/gateway/restart', { | ||
| method: 'POST', | ||
| }); | ||
| setDiagnosticsSnapshot(null); | ||
| setShowDiagnostics(false); | ||
| toast.success(t('health.restartTriggered')); | ||
| void fetchPageData({ probe: true }); | ||
| } catch (restartError) { | ||
| toast.error(t('health.restartFailed', { error: String(restartError) })); | ||
| } |
There was a problem hiding this comment.
The restart request result is not checked. With the IPC host API proxy, a server-side 500 can come back as a resolved JSON { success:false, error }, so this path can incorrectly show health.restartTriggered even though the restart failed. Recommend checking the returned { success } flag (and surfacing error) and throwing/handling failure before resetting UI state + showing success toast.
| await file.read(buffer, 0, bytesToRead, position); | ||
| content = `${buffer.toString('utf-8')}${content}`; |
There was a problem hiding this comment.
readTail uses Buffer.allocUnsafe(bytesToRead) but ignores the bytesRead result from file.read(...). If the read returns fewer bytes than requested, buffer.toString() can include uninitialized memory, potentially leaking unrelated process data into the diagnostics payload. Use Buffer.alloc(...) or capture { bytesRead } and only stringify buffer.subarray(0, bytesRead).
| await file.read(buffer, 0, bytesToRead, position); | |
| content = `${buffer.toString('utf-8')}${content}`; | |
| const { bytesRead } = await file.read(buffer, 0, bytesToRead, position); | |
| content = `${buffer.subarray(0, bytesRead).toString('utf-8')}${content}`; |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a005dea2b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const hasRuntimeError = runtimeAccounts.some((account) => typeof account.lastError === 'string' && account.lastError.trim()) | ||
| || Boolean(channelSummary?.error?.trim() || channelSummary?.lastError?.trim()); | ||
| const baseGroupStatus = pickChannelRuntimeStatus(runtimeAccounts, channelSummary); | ||
| const groupStatus = !gatewayStatus && ctx.gatewayManager.getStatus().state === 'running' | ||
| ? 'degraded' |
There was a problem hiding this comment.
Ignore filtered runtime errors when computing group status
hasRuntimeError is computed from the full runtimeAccounts array, but earlier in the same function runtime-only stale accounts are explicitly filtered out via shouldIncludeRuntimeAccountId. If a filtered stale row still has lastError, this branch suppresses the degraded overlay (!hasRuntimeError fails) and the group can stay connected even while the visible account rows are degraded under gateway health overlay. This produces inconsistent top-level channel status and can hide degraded state on the channel card; compute error/degraded gating from the same filtered account set used for rendering.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 149ddf4794
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (hasChannelRuntimeError(account)) return 'error'; | ||
| if (healthOverlay?.gatewayHealthState && healthOverlay.gatewayHealthState !== 'healthy') return 'degraded'; | ||
| if (isChannelRuntimeConnected(account)) return 'connected'; |
There was a problem hiding this comment.
Preserve connected precedence in account status computation
computeChannelRuntimeStatus now returns error before checking connectivity, so any runtime row that includes both connected/linked and a non-empty lastError is forced to error. pickChannelRuntimeStatus still treats connected as higher-priority, so the same payload can produce a connected channel group with an error account badge, which is a regression in consistency and can show false account failures when runtimes keep stale lastError text after recovery.
Useful? React with 👍 / 👎.
|
回复下这条状态设计:当前实现里 Channels 页会展示 5 种状态,而不是 3 种。
这里我先有意保留
如果后续希望把展示层再压成 3 态,我可以再单独收敛,但这个 PR 先保持 5 态以便更准确反映真实状态。 |


Summary
Why
Validation
pnpm run typecheckpnpm test tests/unit/gateway-manager-diagnostics.test.ts tests/unit/channel-status.test.ts tests/unit/channel-routes.test.ts tests/unit/channels-page.test.tsx tests/unit/diagnostics-routes.test.ts tests/unit/gateway-manager-heartbeat.test.ts tests/unit/sanitize-config.test.tspnpm exec playwright test tests/e2e/channels-health-diagnostics.spec.ts tests/e2e/gateway-lifecycle.spec.ts