fix: reliability fixes for handoff crash detection, output capture, and status detection#352
fix: reliability fixes for handoff crash detection, output capture, and status detection#352chrisdfir wants to merge 9 commits into
Conversation
…of raising When a worker crashes, its DB row is deleted but the pane keeps emitting (shell prompt, "Resume this session" banner, tmux redraws). _process_chunk called provider_manager.get_provider() unguarded, which raises ValueError for an unknown terminal, and run()'s handler logged a full traceback per stray chunk — spamming the log until the pane went quiet. Guard the lookup like get_status() already does: on ValueError, release any residual buffer/latch state once and drop the chunk at debug level. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch env
Generalize per-provider launch-environment injection instead of hand-rolling
inline shell prefixes in each provider's command builder. Providers declare a
class-level DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type
(before the instance exists, since the pane env is needed at tmux session/window
creation) and merges it into extra_env for both the supervisor and every worker.
- base: BaseProvider.DEFAULT_LAUNCH_ENV = {}
- claude_code: DEFAULT_LAUNCH_ENV = {'BUN_JSC_useDFGJIT': '0'} to disable the
Bun DFG JIT tier whose background compile thread can segfault mid-task
- manager: default_launch_env(provider_type) + a registry kept exhaustive over
ProviderType by test
- terminal_service: merge under operator --env (so --env still overrides)
Unlike the process-local --env store, this is re-derived from the provider class
on every spawn, so it survives a cao-server restart.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orn-down terminal
handoff tears the worker down when it finishes (DB row deleted), so a supervisor
that re-reads it via get_terminal_output got a 404 — the in-band handoff result
was the only copy of the worker's report. delete_terminal already persists the
full plain-text {id}.scrollback plus a {id}.snapshot.json recording the provider.
get_output now falls back to those when terminal metadata is gone: FULL returns
the raw scrollback; LAST re-extracts the final message with the recorded
provider's stateless extractor, falling back to raw if extraction can't run. A
genuinely-unknown terminal (no snapshot) still raises not-found (404).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uccess A segfaulted CLI freezes its pane. A frozen screen with a leftover response marker can satisfy the COMPLETED heuristic trivially — death is permanent stability, so the confirm-stable window passes BECAUSE the process is dead — and run_agent_step then extracts a truncated message and returns success. A frozen screen that matches no marker instead pins status at PROCESSING, so the wait runs the full timeout and a 5s crash is mislabeled as a 600s timeout. Add a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent keeps its CLI as the pane foreground; only a crashed/exited one reverts to the shell baseline captured at launch. Probe at both decision points so a crash becomes kind="error" (-> 502 -> handoff success=False, "worker process exited") instead of a false success or a mislabeled timeout. Generalize the shell-baseline capture (previously kiro-only) to every provider. The probe returns False on any uncertainty, so healthy and indeterminate workers are never aborted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ables extract_last_message_from_script terminated at the first full-width U+2500 run, assuming it was always the TUI turn-separator above the input box. But Claude's renderer draws a markdown thematic break (---) and plain table dividers with the same pure-dash run, so any report containing one was clipped mid-message — e.g. a recon report lost its architecture summary and test counts after a '---'. Disambiguate by framing: a dash run terminates extraction only when the next non-blank line is the ❯/> prompt (it frames the input box) or the capture ends. A rule followed by more report text is content — drop the rule line and keep extracting. Real Claude output never places response content after the turn separator (the footer sits below the box), so this is correct for live output; the prior synthetic test that asserted otherwise is updated to a realistic layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
black-format the new tests and type _PROVIDER_CLASS_BY_TYPE / provider_class_for as type[BaseProvider] so the re-extraction path is not Any-typed (clears the only new mypy finding; pre-existing pyte typing notes in status_monitor are unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ponse
The TUI effort indicator renders as '● high · /effort' (also medium/low/xhigh/
max/none). EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when
it rendered inline ('… esc to interrupt ● high · /effort'), but it also renders
on its OWN line. There the leading '●' was read as a response marker, so:
- get_status saw 'marker + ❯ box' on the idle launch screen and returned a false
COMPLETED, which held stable through the handoff's 10s confirm window -> the
handoff 'completed' in ~10s before the worker did any work; and
- extraction returned 'high · /effort' as the message.
Add a negative lookahead rejecting a start-of-line marker whose line ends in
'/effort'. Real responses still match, complete, and extract; a real response
above the footer still anchors correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the worker lifecycle/handoff path so worker crashes, torn-down terminals, and Claude Code TUI quirks don’t produce silent false-successes, truncated outputs, or noisy status-monitor failures.
Changes:
- Add crash/liveness detection using a shell-baseline + foreground-process probe to avoid false COMPLETED/timeout outcomes.
- Improve output handling by re-serving persisted scrollback for deleted terminals and making Claude Code extraction resilient to in-report “rule/table” separators and effort footers.
- Introduce provider-scoped
DEFAULT_LAUNCH_ENVdefaults resolved pre-instantiation and injected at tmux pane creation.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/services/test_terminal_service_full.py | Adds regression tests for provider launch-env injection, persisted output fallback, and liveness probe behavior. |
| test/services/test_status_monitor.py | Adds tests ensuring stray output for deleted terminals is tolerated without exceptions and clears residual state. |
| test/services/test_agent_step.py | Adds tests asserting crashed-worker detection prevents false success/timeout reporting. |
| test/providers/test_provider_manager_unit.py | Adds tests for resolving DEFAULT_LAUNCH_ENV per provider type (including exhaustiveness). |
| test/providers/test_claude_code_unit.py | Adds tests preventing Claude Code report clipping on rules/tables and excluding the effort footer from extraction/completion. |
| src/cli_agent_orchestrator/services/terminal_service.py | Injects provider launch-env defaults into tmux pane env, captures shell baseline, adds persisted-output fallback, and introduces worker_returned_to_shell(). |
| src/cli_agent_orchestrator/services/status_monitor.py | Makes _process_chunk tolerant of missing terminals by dropping stray output and clearing related state. |
| src/cli_agent_orchestrator/services/agent_step.py | Uses liveness probe to classify crashed workers as kind="error" instead of success/timeout. |
| src/cli_agent_orchestrator/providers/manager.py | Adds provider-type → class mapping plus helpers to resolve class-level defaults and stateless extractors. |
| src/cli_agent_orchestrator/providers/claude_code.py | Adds DEFAULT_LAUNCH_ENV, refines extraction response pattern to ignore effort footer, and improves separator handling to avoid truncation. |
| src/cli_agent_orchestrator/providers/base.py | Introduces DEFAULT_LAUNCH_ENV as a supported provider-level declarative default. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| provider = provider_manager.get_provider(terminal_id) | ||
| except ValueError: | ||
| # Terminal's DB row is gone but stray output chunks are still | ||
| # arriving — e.g. a crashed worker whose pane keeps emitting the | ||
| # shell prompt / "Resume this session" banner and tmux redraws | ||
| # after its record was deleted, or chunks draining out of the | ||
| # event bus after delete_terminal stopped the FIFO reader. Release | ||
| # any residual buffer/latch state once and drop the chunk. Without | ||
| # this, get_provider raised ValueError for every such chunk and | ||
| # run()'s handler logged a full traceback per chunk, spamming the | ||
| # log until the pane went quiet. Mirrors the defensive | ||
| # get_provider() handling already in get_status(). | ||
| if terminal_id in self._buffers or terminal_id in self._last_status: | ||
| self.clear_terminal(terminal_id) | ||
| logger.debug("Dropping output for unknown/deleted terminal %s", terminal_id) | ||
| return | ||
| except Exception: | ||
| logger.exception("Error resolving provider for %s; dropping chunk", terminal_id) | ||
| return |
| re-extracts the final message with the recorded provider's (stateless) | ||
| extractor, falling back to the raw scrollback if extraction can't run. | ||
|
|
||
| Returns None when no snapshot exists (genuinely-unknown terminal). |
…t done
A response-marker line ending in the '…' ellipsis ('⏺ Reading 1 file…',
'● Running command…') is an active tool call, not a finished answer. When the
worker emits a preamble and then thinks quietly for >10s, the live spinner
scrolls out of the rolling buffer, the PROCESSING checks miss it, and get_status
falls through to COMPLETED on the marker+idle-prompt. That false-completes a
handoff in ~10s, returning the preamble instead of the report.
Treat the freshest response-marker line ending in '…' as PROCESSING. Gated on
last_completion is None, so a genuinely finished turn (which shows the
'✻ Worked for Ns' summary) still completes. Only the single-char '…' (U+2026)
the TUI emits is matched, so prose ending in ASCII '...' cannot false-trigger.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #352 +/- ##
=======================================
Coverage ? 87.47%
=======================================
Files ? 115
Lines ? 13624
Branches ? 0
=======================================
Hits ? 11917
Misses ? 1707
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
handoff() placed the worker in the cao-server process CWD instead of the supervisor's project directory: assign()/_create_terminal resolve the conductor's working directory when none is pinned, but _handoff_impl never did. So a supervisor launched in ~/dev/personal/myrepo handed off to a developer that ran in whatever directory the daemon was started from — operating on the wrong tree (recon only worked because it used absolute paths). Resolve the conductor's working directory in _handoff_impl when the caller didn't pin one, mirroring _create_terminal, so the worker runs in the same project directory as the supervisor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| snapshot_path = TERMINAL_LOG_DIR / f"{terminal_id}.snapshot.json" | ||
| provider_type = _json.loads(snapshot_path.read_text(encoding="utf-8"))["provider"] | ||
| provider_cls = provider_class_for(provider_type) | ||
| if provider_cls is not None: | ||
| # extract_last_message_from_script reads only class-level regexes, | ||
| # not live terminal state, so a throwaway instance is sufficient. | ||
| extractor = provider_cls(terminal_id, "", "") | ||
| return extractor.extract_last_message_from_script(raw) |
--refresh-meta: rewrite metadata for every open PR and exit, no reviews. Cheap way to correct stale CI/labels/size flags on the dashboard (metadata is a point-in-time snapshot, so CI can go stale between review runs) without spending review agents. Synthesis-stall guard: rewrite the Step-3 wait rule to bias hard toward synthesizing. A parked session produces no report at all — worse than a report missing one angle. Explicitly close the "wait for the highest-signal reviewer" escape hatch that kept tripping the guard (awslabs#329, awslabs#352), and add a 0-1-findings fallback that writes an "incomplete" report rather than parking.
|
@chrisdfir thanks for your contribution! I rechecked I found one confirmed P2 in the new persisted-output path:
extractor = provider_cls(terminal_id, "", "")But the delete-time snapshot already stores Relevant spots:
The new tests only cover the Claude path:
So this provider gap is currently untested. Suggested fix: parse the full snapshot in Focused checks I ran locally: uv run pytest test/services/test_agent_step.py test/services/test_status_monitor.py test/services/test_terminal_service_full.py test/providers/test_claude_code_unit.py test/providers/test_provider_manager_unit.py test/mcp_server/test_handoff_equivalence.py -q
uv run black --check src/ test/
uv run isort --check-only src/ test/
git diff --check origin/main...HEADEverything above passed, but I would still fix the persisted-output gap before FindingsP2 - torn-down
|
The supervisor now fetches existing human comments/reviews on the PR in Step 1 (bots + the dashboard user excluded), and at synthesis (Step 4) checks each finding against them: anything a maintainer already raised is moved out of Blocking/Important/Nits into a new "Prior feedback" section that credits them and says we concur, rather than restating it. Only human feedback is deduped — a finding only a bot (Copilot/CodeQL) raised is still reported. Keeps the posted comment focused on net-new findings so it adds signal instead of repeating a maintainer (motivated by awslabs#352, where our persisted- output nit overlapped haofeif's P2 while our ellipsis-hang + scrollback-leak findings were genuinely new).
Summary
A
claude_codeworker segfaulted mid-handoff during a read-only recon run. Debugging that one incident surfaced a chain of independent reliability bugs in the handoff/teardown/status path — each of which can turn a worker problem into a silent bad result (a crash reported as success, a report truncated, a 10s "completion" that did no work). This PR fixes all of them.Each fix is its own commit with a self-contained message, so the PR can be reviewed commit-by-commit. All changes are unit-tested.
The fixes
status_monitor: drop stray output for deleted terminals instead of raising.When a worker dies, its DB row is deleted but the pane keeps emitting (shell prompt, the "Resume this session" banner, tmux redraws).
_process_chunkcalledprovider_manager.get_provider()unguarded, which raisesValueErrorfor an unknown terminal, andrun()logged a full traceback per stray chunk — flooding the log until the pane went quiet. Guarded likeget_status()already is.agent_step: detect a crashed worker instead of reporting false success.A segfaulted CLI freezes its pane. A frozen screen with a leftover response marker satisfies the
COMPLETEDheuristic trivially — death is permanent stability, so the confirm-stable window passes because the process is dead — and the handoff returnssuccesswith a truncated message. (A frozen screen matching no marker instead pinsPROCESSINGand the wait burns the full timeout, mislabeling a 5s crash as a 600s timeout.) Adds a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent keeps its CLI as the pane foreground; only a crashed/exited one reverts to the shell baseline captured at launch. Probed at both decision points so a crash becomeskind="error"→ 502 → handoffsuccess=False. Generalizes the shell-baseline capture (previously kiro-only) to every provider. The probe returnsFalseon any uncertainty, so healthy/indeterminate workers are never aborted.claude_code: stop clipping reports at in-report markdown rules / tables.extract_last_message_from_scriptstopped at the first full-width─run, assuming it was the TUI turn-separator. But the renderer draws a markdown thematic break (---) and plain table dividers with the same pure-dash run, so any report containing one was clipped mid-message. Now a dash run terminates extraction only when it frames the❯input box (next non-blank line is the prompt, or the capture ends); a rule followed by more report text is content.terminal_service: serve persisted scrollback when re-reading a torn-down terminal.handofftears the worker down when it finishes (DB row deleted), so re-reading it viaget_terminal_outputreturned 404 — the in-band handoff result was the only copy.delete_terminalalready persists{id}.scrollback+{id}.snapshot.json;get_outputnow falls back to them (FULL → raw scrollback; LAST → re-extract with the recorded provider). A genuinely-unknown terminal still 404s.providers: declarativeDEFAULT_LAUNCH_ENVfor per-provider launch env.Generalizes per-provider launch-environment injection instead of hand-rolling inline shell prefixes. Providers declare a class-level
DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type (before the instance exists, since the pane env is assembled at session/window creation) and merges it intoextra_envfor both supervisor and workers, under any operator--env. Unlike the process-local--envstore, it is re-derived from the provider class on every spawn, so it survives acao-serverrestart. First user:claude_codesetsBUN_JSC_useDFGJIT=0to disable the Bun DFG-JIT tier whose background compile thread can segfault mid-task.claude_code: don't read the start-of-line effort footer as a response.The effort indicator renders as
● high · /effort(also medium/low/xhigh/…).EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when rendered inline (… esc to interrupt ● high · /effort), but it also renders on its own line, where the leading●was read as a response marker — flippingget_statusto a falseCOMPLETEDon the idle launch screen (the❯box + this footer satisfy the completion heuristic), so a handoff "completed" in ~10s before the worker did any work and extraction returnedhigh · /effortas the message. Adds a negative lookahead rejecting a start-of-line marker whose line ends in/effort.claude_code: treat an in-progress tool-use marker as PROCESSING, not done.A response-marker line ending in the
…ellipsis (⏺ Reading 1 file…,● Running command…) is an active tool call, not a finished answer. When the worker emits a preamble and then thinks quietly for >10s, the live spinner scrolls out of the rolling buffer, the PROCESSING checks miss it, andget_statusfalls through toCOMPLETEDon the marker + idle prompt — false-completing a handoff in ~10s and returning the preamble instead of the report. Now the freshest response-marker line ending in…reads asPROCESSING. Gated onlast_completion is None, so a genuinely finished turn (which shows the✻ Worked for Nssummary) still completes; only the single-char…(U+2026) the TUI emits is matched, so prose ending in ASCII...cannot false-trigger.handoff: inherit the conductor's working directory.handoff()placed the worker in the cao-server process CWD instead of the supervisor's project directory.assign()/_create_terminalresolve the conductor's working directory when none is pinned, but_handoff_implnever did — so a supervisor launched in~/dev/personal/myrepohanded off to a developer that ran in whatever directory the daemon was started from, operating on the wrong tree (recon only worked because it used absolute paths)._handoff_implnow resolves the conductor's working directory when the caller didn't pin one, mirroring_create_terminal.Testing
uv run pytest test/ --ignore=test/e2e -m "not integration"→ 3226 passed.DEFAULT_LAUNCH_ENVresolution + injection, effort-footer exclusion, in-progress tool-use detection, handoff working-directory inheritance).black,isort,mypy src/clean (no new findings; pre-existing pyte typing notes instatus_monitorunchanged).npm run typecheck+test:unit) and returns an un-truncated report.🤖 Generated with Claude Code