Skip to content

fix(daemon)(#19): keep --detach child alive past process.disconnect()#20

Merged
vrogojin merged 3 commits into
integration/all-fixesfrom
fix/daemon-detach-immediate-exit-19
May 23, 2026
Merged

fix(daemon)(#19): keep --detach child alive past process.disconnect()#20
vrogojin merged 3 commits into
integration/all-fixesfrom
fix/daemon-detach-immediate-exit-19

Conversation

@vrogojin
Copy link
Copy Markdown
Contributor

Summary

Fixes #19sphere daemon start --detach exited successfully with a PID, but the forked child died between fork() and any useful work, leaving a stale PID file and an empty daemon.log.

Root cause

Two compounding bugs in src/legacy/daemon.ts:

  1. stdio: 'ignore' strips the IPC channel. child_process.fork() normally establishes an IPC channel, but overriding stdio removes it. The child's process.connected was therefore false from the start.

  2. process.disconnect() was guarded only on its existence, not on process.connected. The function exists regardless of channel state, but calling it without a live channel throws "IPC channel is not open". The throw bubbled up to legacy-cli's catch, which called console.error (already redirected to an unflushed WriteStream) and process.exit(1). The child died silently — pending WriteStream writes never flushed because fs.open hadn't completed.

The diagnostic trail that pinpointed this:

[1779532169143] forked-enter pid=474909 connected=false   ← no IPC channel
[1779532169143] pre-disconnect pid=474909 connected=false  ← about to throw
[1779532169143] exit pid=474909 code=1                     ← exited via legacy-cli catch

Fix

src/legacy/daemon.ts:

  • Parent sidedetachDaemon opens the log file in the parent and inherits the fd to the child's stdout AND stderr (stdio: ['ignore', logFd, logFd, 'ipc']). Any future startup failure — crash, uncaught exception, raw stderr emission — now lands in the log file at the OS level, before any Node-level streaming machinery is needed. 'ipc' is kept because fork() rejects without it; the channel exists solely to satisfy fork's contract (no messages flow).

  • Child siderunDaemon's disconnect is now guarded on process.connected, so it correctly no-ops if the channel was already torn down and only disconnects when there's a live channel to release.

  • log() double-write — pre-existing bug: in forked mode log() wrote both via logStream.write AND console.log (which was redirected to the same stream), producing duplicate lines. Exposed only because the daemon now actually runs. Fixed via if/else.

Test plan

  • Build (npm run build) — clean.
  • Typecheck (npm run typecheck) — clean.
  • Lint (npm run lint) — no new warnings from these changes (the 5 warnings on daemon.ts are pre-existing on lines unchanged here).
  • Unit tests (npm test) — 106/106 passing.
  • Issue's exact reproduction (mkdir … && wallet create … && init … && daemon start --detach … && sleep 3 && daemon status) now reports:
    Daemon is running (PID 498927).
    Log file size: 736 bytes
    
    Was: Daemon is not running (stale PID file, process X).
  • New integration test (test/integration/cli-daemon.integration.test.ts) — 4 tests, all passing:
    • 3 offline help-shape pins (daemon / daemon start / daemon status)
    • 1 network detach lifecycle pin (start → status running → log non-empty → stop → status not-running → PID file gone)
    • afterEach defensively stops daemon to avoid leaking forked processes that hold open Nostr WebSocket connections.

Acceptance criteria (from the issue)

  • sphere daemon start --detach followed by sleep 3 && sphere daemon status reports the daemon as running, with a non-empty daemon.log.
  • manual-test-full-recovery.sh §C.3/§C.4 unblocked — once this lands and SDK builds against it, peer2's daemons will receive events as expected. (Confirmed locally; CI run pending.)
  • New integration test under test/integration/ mirroring cli-wallet-lifecycle.integration.test.ts.

Related

  • sphere-sdk #226 / #227 (merged 2026-05-23) — invoice delivery via UXF DM, the work that motivated this fix.
  • sphere-sdk PR #222 / commit 108e9ac — original documentation of --detach as a known limitation.

Vladimir Rogojin added 3 commits May 23, 2026 12:43
`sphere daemon start --detach` returned exit 0 with a PID, but the
forked child died between fork() and any useful work — leaving a stale
PID file and an empty daemon.log. Two compounding causes:

1. The parent forked with `stdio: 'ignore'`, which strips the IPC
   channel that child_process.fork() normally establishes. The child's
   `process.connected` was therefore false from the start.

2. The child unconditionally called `process.disconnect()` (guarded only
   by `if (process.disconnect)`, which is always truthy because the
   function exists regardless of channel state). With no live IPC
   channel, disconnect throws "IPC channel is not open". The throw
   bubbled up to legacy-cli's catch, which called `console.error`
   (already redirected to an unflushed WriteStream) and `process.exit(1)`.
   The child died silently — the WriteStream's pending writes never
   flushed because the underlying fs.open hadn't completed.

Fix:

* `detachDaemon` now opens the log file in the parent and inherits the
  fd to the child's stdout AND stderr (`stdio: ['ignore', logFd, logFd,
  'ipc']`). Any future startup failure — crash, uncaught exception,
  raw stderr emission — is captured at the OS level before any
  Node-level streaming machinery is required. 'ipc' is kept because
  child_process.fork() throws "Forked processes must have an IPC
  channel" without it; the channel exists solely to satisfy fork's
  contract (no messages flow over it).

* `runDaemon`'s child-side disconnect is now guarded on
  `process.connected`, so it correctly no-ops if the channel was already
  torn down (e.g. parent exited first) and only disconnects when
  there's a live channel to release.

* `log()` no longer double-writes in forked mode (the redirected
  `console.log` already forwards to the same WriteStream that `log`
  writes to directly). Pre-existing bug, exposed only because the
  daemon now actually runs.

Reproduction from the issue (`mkdir … && wallet create … && init …
&& daemon start --detach …`) now reports "Daemon is running (PID X)"
after sleep 3, with a non-empty daemon.log containing
"Daemon running. Waiting for events." — matching acceptance criteria.
New integration suite per acceptance criteria. Mirrors the pattern of
test/integration/cli-wallet-lifecycle.integration.test.ts (offline
help-shape layer + network lifecycle layer skipped by SKIP_INTEGRATION).

Two layers:

  1. Help-shape (offline, 3 pins) — `payments help daemon`,
     `daemon start`, `daemon status`. Pins the documented flag surface
     (--detach, --event, --action, --log, --pid) so a refactor that
     drops a flag from the help registry fails red.

  2. Detach lifecycle (network, 1 pin) — end-to-end:
       start --detach     →  exit 0 with "Daemon started in background"
       sleep 6s + status  →  "Daemon is running (PID X)"
                             (was: "Daemon is not running (stale PID
                              file, process X)")
       log file           →  non-empty, contains
                             "Daemon running. Waiting for events."
       stop               →  "Daemon stopped"
       post-stop status   →  "Daemon is not running"
       PID file           →  removed

afterEach calls `daemon stop` defensively so a mid-test failure cannot
leak a forked daemon holding open Nostr WebSocket connections against
the testnet relay.

Why this is an integration test (not unit / mocked): the bug fixed by
issue #19 was inside child_process.fork() + process.disconnect()
semantics that only manifest when an actual node process is forked with
the actual stdio + IPC-channel configuration. A unit test mocking fork
would not catch a recurrence.
Self-review follow-up. The parent's child.disconnect() closes the IPC
channel at the OS layer, but the child's JS-level 'disconnect' event
(which flips process.connected to false) is delivered async. There's a
narrow microtask window where the child reads process.connected as true
while the underlying channel is already torn down, in which case the
disconnect() call throws "IPC channel is not open" — the exact failure
mode this PR fixes, just triggered by a different cause.

Wrap the child-side disconnect in try/catch. Swallowing is correct: the
goal state (channel closed) already holds either way.

In practice the race window is small (child needs ~hundreds of ms to
load and reach the disconnect call, by which time the event handler
has run) and integration tests have not hit it, but the defensive
guard removes a theoretical flake source from production deployments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant