Skip to content

Add channel health diagnostics and gateway recovery fixes#855

Merged
ashione merged 3 commits intomainfrom
auto/channels-health-diagnostics
Apr 15, 2026
Merged

Add channel health diagnostics and gateway recovery fixes#855
ashione merged 3 commits intomainfrom
auto/channels-health-diagnostics

Conversation

@ashione
Copy link
Copy Markdown
Contributor

@ashione ashione commented Apr 14, 2026

Summary

  • add gateway health summaries and diagnostics snapshots to the Channels flow
  • surface degraded/unresponsive channel health in the Channels UI with restart + diagnostics actions
  • sanitize stale OpenClaw acpx config/install metadata that can break gateway startup after upgrades
  • keep non-Windows heartbeat recovery enabled while preserving conservative Windows behavior

Why

  • the app could show channels as connected even when the gateway control plane was degraded or unresponsive
  • diagnostics were not easy to capture from the UI when the gateway was in a bad state
  • stale acpx config from older OpenClaw builds could prevent the gateway from starting
  • non-Windows gateway processes could remain alive but effectively stop responding, while Windows still needs the existing conservative recovery posture

Validation

  • pnpm run typecheck
  • pnpm test tests/unit/gateway-manager-diagnostics.test.ts tests/unit/channel-status.test.ts tests/unit/channel-routes.test.ts tests/unit/channels-page.test.tsx tests/unit/diagnostics-routes.test.ts tests/unit/gateway-manager-heartbeat.test.ts tests/unit/sanitize-config.test.ts
  • pnpm exec playwright test tests/e2e/channels-health-diagnostics.spec.ts tests/e2e/gateway-lifecycle.spec.ts

@ashione ashione marked this pull request as ready for review April 14, 2026 12:50
Copilot AI review requested due to automatic review settings April 14, 2026 12:50
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f4a48082c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread electron/api/routes/diagnostics.ts Outdated
Comment on lines +29 to +31
const buffer = Buffer.allocUnsafe(bytesToRead);
await file.read(buffer, 0, bytesToRead, position);
content = `${buffer.toString('utf-8')}${content}`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect short reads when tailing diagnostics logs

readTail allocates an unsafe buffer and appends buffer.toString('utf-8') without checking how many bytes were actually read. If gateway.log is truncated or rotated between stat() and read(), fs.read can return fewer bytes than requested, and the unread part of an unsafe buffer may contain stale memory, which can leak garbage into the diagnostics snapshot (and copied clipboard payload). Use bytesRead to slice before decoding (or switch to Buffer.alloc).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves channel reliability UX by adding gateway health diagnostics, exposing degraded/unresponsive states in the Channels UI with recovery actions, and sanitizing legacy OpenClaw acpx config/install metadata that can prevent gateway startup after upgrades.

Changes:

  • Add gateway health summary + diagnostics snapshot endpoint and surface it in the Channels page (banner, restart, copy/view diagnostics).
  • Track gateway transport diagnostics (heartbeat misses, RPC timeouts, socket close, last-alive timestamps) and enable non-Windows heartbeat-based recovery.
  • Sanitize legacy acpx overrides and stale bundled install metadata to prevent gateway startup failures post-upgrade.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/sanitize-config.test.ts Adds unit coverage for removing legacy acpx overrides and stale install metadata.
tests/unit/gateway-manager-heartbeat.test.ts Updates heartbeat recovery expectations (restart on threshold; Windows remains conservative).
tests/unit/gateway-manager-diagnostics.test.ts New unit coverage for diagnostics timestamps/counters and health summary behavior.
tests/unit/diagnostics-routes.test.ts New unit coverage for diagnostics snapshot route payload (channel view + tailed logs).
tests/unit/channels-page.test.tsx Adds unit coverage for degraded banner + copy/view diagnostics snapshot behavior.
tests/unit/channel-status.test.ts Adds coverage for degraded overlay interactions and priority rules.
tests/unit/channel-routes.test.ts Adds coverage for degraded channel health when channels.status times out but gateway is running.
tests/e2e/channels-health-diagnostics.spec.ts New E2E coverage for degraded banner, gateway restart, and diagnostics copy.
src/types/channel.ts Extends ChannelStatus to include degraded.
src/pages/Channels/index.tsx Adds gateway health banner, restart action, and diagnostics snapshot copy/view UX.
src/lib/channel-status.ts Adds gateway health overlay support to renderer-side status computation.
src/i18n/locales/en/channels.json Adds localized strings for gateway health banner/actions and “degraded” status.
src/i18n/locales/zh/channels.json Same as above (zh).
src/i18n/locales/ja/channels.json Same as above (ja).
electron/utils/openclaw-auth.ts Sanitizes legacy acpx config keys and removes stale plugins.installs.acpx metadata.
electron/utils/gateway-health.ts New helper to compute gateway health summaries from status + diagnostics.
electron/utils/channel-status.ts Adds gateway health overlay support to main-process status computation.
electron/gateway/manager.ts Introduces diagnostics tracking + non-Windows heartbeat recovery restart behavior.
electron/api/server.ts Registers new diagnostics route handler.
electron/api/routes/diagnostics.ts Adds /api/diagnostics/gateway-snapshot endpoint with tailed logs + gateway/channel snapshot.
electron/api/routes/channels.ts Returns gatewayHealth alongside channel view; exports diagnostics timestamps for snapshot route.
.gitignore Ignores .claude/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +316 to +319
const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => {
const snapshot = await hostApiFetch<GatewayDiagnosticSnapshot>('/api/diagnostics/gateway-snapshot');
setDiagnosticsSnapshot(snapshot);
return snapshot;
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hostApiFetch via the Electron IPC proxy does not throw on non-2xx HTTP responses (it returns data.json even when data.ok/status indicate failure). If /api/diagnostics/gateway-snapshot returns a 500 { success:false, error }, this code will still set it as the snapshot and later copy/display it as if it were valid. Suggest validating the payload shape (e.g. capturedAt/platform/gateway) and/or explicitly throwing when success === false / error is present before calling setDiagnosticsSnapshot.

Suggested change
const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => {
const snapshot = await hostApiFetch<GatewayDiagnosticSnapshot>('/api/diagnostics/gateway-snapshot');
setDiagnosticsSnapshot(snapshot);
return snapshot;
const isGatewayDiagnosticSnapshot = (value: unknown): value is GatewayDiagnosticSnapshot => {
if (!value || typeof value !== 'object') {
return false;
}
const snapshot = value as Record<string, unknown>;
return (
typeof snapshot.capturedAt === 'string' &&
typeof snapshot.platform === 'object' &&
snapshot.platform !== null &&
typeof snapshot.gateway === 'object' &&
snapshot.gateway !== null
);
};
const fetchDiagnosticsSnapshot = useCallback(async (): Promise<GatewayDiagnosticSnapshot> => {
const response = await hostApiFetch<unknown>('/api/diagnostics/gateway-snapshot');
if (response && typeof response === 'object') {
const payload = response as Record<string, unknown>;
if (payload.success === false || typeof payload.error === 'string') {
throw new Error(typeof payload.error === 'string' ? payload.error : 'Failed to fetch gateway diagnostics snapshot');
}
}
if (!isGatewayDiagnosticSnapshot(response)) {
throw new Error('Invalid gateway diagnostics snapshot response');
}
setDiagnosticsSnapshot(response);
return response;

Copilot uses AI. Check for mistakes.
Comment on lines +322 to +333
const handleRestartGateway = async () => {
try {
await hostApiFetch<{ success: boolean; error?: string }>('/api/gateway/restart', {
method: 'POST',
});
setDiagnosticsSnapshot(null);
setShowDiagnostics(false);
toast.success(t('health.restartTriggered'));
void fetchPageData({ probe: true });
} catch (restartError) {
toast.error(t('health.restartFailed', { error: String(restartError) }));
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restart request result is not checked. With the IPC host API proxy, a server-side 500 can come back as a resolved JSON { success:false, error }, so this path can incorrectly show health.restartTriggered even though the restart failed. Recommend checking the returned { success } flag (and surfacing error) and throwing/handling failure before resetting UI state + showing success toast.

Copilot uses AI. Check for mistakes.
Comment thread electron/api/routes/diagnostics.ts Outdated
Comment on lines +30 to +31
await file.read(buffer, 0, bytesToRead, position);
content = `${buffer.toString('utf-8')}${content}`;
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readTail uses Buffer.allocUnsafe(bytesToRead) but ignores the bytesRead result from file.read(...). If the read returns fewer bytes than requested, buffer.toString() can include uninitialized memory, potentially leaking unrelated process data into the diagnostics payload. Use Buffer.alloc(...) or capture { bytesRead } and only stringify buffer.subarray(0, bytesRead).

Suggested change
await file.read(buffer, 0, bytesToRead, position);
content = `${buffer.toString('utf-8')}${content}`;
const { bytesRead } = await file.read(buffer, 0, bytesToRead, position);
content = `${buffer.subarray(0, bytesRead).toString('utf-8')}${content}`;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a005dea2b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread electron/api/routes/channels.ts Outdated
Comment on lines +641 to +645
const hasRuntimeError = runtimeAccounts.some((account) => typeof account.lastError === 'string' && account.lastError.trim())
|| Boolean(channelSummary?.error?.trim() || channelSummary?.lastError?.trim());
const baseGroupStatus = pickChannelRuntimeStatus(runtimeAccounts, channelSummary);
const groupStatus = !gatewayStatus && ctx.gatewayManager.getStatus().state === 'running'
? 'degraded'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore filtered runtime errors when computing group status

hasRuntimeError is computed from the full runtimeAccounts array, but earlier in the same function runtime-only stale accounts are explicitly filtered out via shouldIncludeRuntimeAccountId. If a filtered stale row still has lastError, this branch suppresses the degraded overlay (!hasRuntimeError fails) and the group can stay connected even while the visible account rows are degraded under gateway health overlay. This produces inconsistent top-level channel status and can hide degraded state on the channel card; compute error/degraded gating from the same filtered account set used for rendering.

Useful? React with 👍 / 👎.

@ashione ashione changed the title [codex] Add channel health diagnostics and gateway recovery fixes Add channel health diagnostics and gateway recovery fixes Apr 14, 2026
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 149ddf4794

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 84 to +86
if (hasChannelRuntimeError(account)) return 'error';
if (healthOverlay?.gatewayHealthState && healthOverlay.gatewayHealthState !== 'healthy') return 'degraded';
if (isChannelRuntimeConnected(account)) return 'connected';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve connected precedence in account status computation

computeChannelRuntimeStatus now returns error before checking connectivity, so any runtime row that includes both connected/link​ed and a non-empty lastError is forced to error. pickChannelRuntimeStatus still treats connected as higher-priority, so the same payload can produce a connected channel group with an error account badge, which is a regression in consistency and can show false account failures when runtimes keep stale lastError text after recovery.

Useful? React with 👍 / 👎.

@ashione
Copy link
Copy Markdown
Contributor Author

ashione commented Apr 14, 2026

image Now, we need 3 states : connected , degraded, disconnected

@ashione
Copy link
Copy Markdown
Contributor Author

ashione commented Apr 15, 2026

回复下这条状态设计:当前实现里 Channels 页会展示 5 种状态,而不是 3 种。

  • connected: 频道/账号运行正常
  • degraded: gateway 控制面劣化,但仍有部分连接或状态可见
  • connecting: adapter 正在运行或恢复中,但还没有健康信号
  • disconnected: 未连接或未运行
  • error: 频道/账号本身有明确 runtime/config error

这里我先有意保留 connectingerror,因为它们和 degraded / disconnected 的可操作性不同:

  • connecting 表示恢复中,不适合和真正断开混在一起
  • error 表示已经拿到了明确故障信息,优先级也和单纯 degraded 不一样

如果后续希望把展示层再压成 3 态,我可以再单独收敛,但这个 PR 先保持 5 态以便更准确反映真实状态。

@ashione
Copy link
Copy Markdown
Contributor Author

ashione commented Apr 15, 2026

image Now, we need 3 states : connected , degraded, disconnected

@su8su @vcfgv @hazeone

@ashione ashione merged commit 1f39d1a into main Apr 15, 2026
6 checks passed
@ashione ashione deleted the auto/channels-health-diagnostics branch April 15, 2026 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants