fix(daemon)(#19): keep --detach child alive past process.disconnect()#20
Merged
vrogojin merged 3 commits intoMay 23, 2026
Merged
Conversation
added 3 commits
May 23, 2026 12:43
`sphere daemon start --detach` returned exit 0 with a PID, but the forked child died between fork() and any useful work — leaving a stale PID file and an empty daemon.log. Two compounding causes: 1. The parent forked with `stdio: 'ignore'`, which strips the IPC channel that child_process.fork() normally establishes. The child's `process.connected` was therefore false from the start. 2. The child unconditionally called `process.disconnect()` (guarded only by `if (process.disconnect)`, which is always truthy because the function exists regardless of channel state). With no live IPC channel, disconnect throws "IPC channel is not open". The throw bubbled up to legacy-cli's catch, which called `console.error` (already redirected to an unflushed WriteStream) and `process.exit(1)`. The child died silently — the WriteStream's pending writes never flushed because the underlying fs.open hadn't completed. Fix: * `detachDaemon` now opens the log file in the parent and inherits the fd to the child's stdout AND stderr (`stdio: ['ignore', logFd, logFd, 'ipc']`). Any future startup failure — crash, uncaught exception, raw stderr emission — is captured at the OS level before any Node-level streaming machinery is required. 'ipc' is kept because child_process.fork() throws "Forked processes must have an IPC channel" without it; the channel exists solely to satisfy fork's contract (no messages flow over it). * `runDaemon`'s child-side disconnect is now guarded on `process.connected`, so it correctly no-ops if the channel was already torn down (e.g. parent exited first) and only disconnects when there's a live channel to release. * `log()` no longer double-writes in forked mode (the redirected `console.log` already forwards to the same WriteStream that `log` writes to directly). Pre-existing bug, exposed only because the daemon now actually runs. Reproduction from the issue (`mkdir … && wallet create … && init … && daemon start --detach …`) now reports "Daemon is running (PID X)" after sleep 3, with a non-empty daemon.log containing "Daemon running. Waiting for events." — matching acceptance criteria.
New integration suite per acceptance criteria. Mirrors the pattern of
test/integration/cli-wallet-lifecycle.integration.test.ts (offline
help-shape layer + network lifecycle layer skipped by SKIP_INTEGRATION).
Two layers:
1. Help-shape (offline, 3 pins) — `payments help daemon`,
`daemon start`, `daemon status`. Pins the documented flag surface
(--detach, --event, --action, --log, --pid) so a refactor that
drops a flag from the help registry fails red.
2. Detach lifecycle (network, 1 pin) — end-to-end:
start --detach → exit 0 with "Daemon started in background"
sleep 6s + status → "Daemon is running (PID X)"
(was: "Daemon is not running (stale PID
file, process X)")
log file → non-empty, contains
"Daemon running. Waiting for events."
stop → "Daemon stopped"
post-stop status → "Daemon is not running"
PID file → removed
afterEach calls `daemon stop` defensively so a mid-test failure cannot
leak a forked daemon holding open Nostr WebSocket connections against
the testnet relay.
Why this is an integration test (not unit / mocked): the bug fixed by
issue #19 was inside child_process.fork() + process.disconnect()
semantics that only manifest when an actual node process is forked with
the actual stdio + IPC-channel configuration. A unit test mocking fork
would not catch a recurrence.
Self-review follow-up. The parent's child.disconnect() closes the IPC channel at the OS layer, but the child's JS-level 'disconnect' event (which flips process.connected to false) is delivered async. There's a narrow microtask window where the child reads process.connected as true while the underlying channel is already torn down, in which case the disconnect() call throws "IPC channel is not open" — the exact failure mode this PR fixes, just triggered by a different cause. Wrap the child-side disconnect in try/catch. Swallowing is correct: the goal state (channel closed) already holds either way. In practice the race window is small (child needs ~hundreds of ms to load and reach the disconnect call, by which time the event handler has run) and integration tests have not hit it, but the defensive guard removes a theoretical flake source from production deployments.
This was referenced May 23, 2026
invoice status crashes with 'Cannot read properties of undefined' when invoice not found locally
#21
Open
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #19 —
sphere daemon start --detachexited successfully with a PID, but the forked child died betweenfork()and any useful work, leaving a stale PID file and an emptydaemon.log.Root cause
Two compounding bugs in
src/legacy/daemon.ts:stdio: 'ignore'strips the IPC channel.child_process.fork()normally establishes an IPC channel, but overridingstdioremoves it. The child'sprocess.connectedwas thereforefalsefrom the start.process.disconnect()was guarded only on its existence, not onprocess.connected. The function exists regardless of channel state, but calling it without a live channel throws"IPC channel is not open". The throw bubbled up tolegacy-cli's catch, which calledconsole.error(already redirected to an unflushedWriteStream) andprocess.exit(1). The child died silently — pending WriteStream writes never flushed becausefs.openhadn't completed.The diagnostic trail that pinpointed this:
Fix
src/legacy/daemon.ts:Parent side —
detachDaemonopens the log file in the parent and inherits the fd to the child's stdout AND stderr (stdio: ['ignore', logFd, logFd, 'ipc']). Any future startup failure — crash, uncaught exception, raw stderr emission — now lands in the log file at the OS level, before any Node-level streaming machinery is needed.'ipc'is kept becausefork()rejects without it; the channel exists solely to satisfy fork's contract (no messages flow).Child side —
runDaemon's disconnect is now guarded onprocess.connected, so it correctly no-ops if the channel was already torn down and only disconnects when there's a live channel to release.log()double-write — pre-existing bug: in forked modelog()wrote both vialogStream.writeANDconsole.log(which was redirected to the same stream), producing duplicate lines. Exposed only because the daemon now actually runs. Fixed via if/else.Test plan
npm run build) — clean.npm run typecheck) — clean.npm run lint) — no new warnings from these changes (the 5 warnings ondaemon.tsare pre-existing on lines unchanged here).npm test) — 106/106 passing.mkdir … && wallet create … && init … && daemon start --detach … && sleep 3 && daemon status) now reports:Daemon is not running (stale PID file, process X).test/integration/cli-daemon.integration.test.ts) — 4 tests, all passing:afterEachdefensively stops daemon to avoid leaking forked processes that hold open Nostr WebSocket connections.Acceptance criteria (from the issue)
sphere daemon start --detachfollowed bysleep 3 && sphere daemon statusreports the daemon as running, with a non-emptydaemon.log.manual-test-full-recovery.sh§C.3/§C.4 unblocked — once this lands and SDK builds against it, peer2's daemons will receive events as expected. (Confirmed locally; CI run pending.)test/integration/mirroringcli-wallet-lifecycle.integration.test.ts.Related
108e9ac— original documentation of--detachas a known limitation.