Skip to content

fix: reliability fixes for handoff crash detection, output capture, and status detection#352

Open
chrisdfir wants to merge 9 commits into
awslabs:mainfrom
chrisdfir:reliability-fixes
Open

fix: reliability fixes for handoff crash detection, output capture, and status detection#352
chrisdfir wants to merge 9 commits into
awslabs:mainfrom
chrisdfir:reliability-fixes

Conversation

@chrisdfir

@chrisdfir chrisdfir commented Jun 30, 2026

Copy link
Copy Markdown

Summary

A claude_code worker segfaulted mid-handoff during a read-only recon run. Debugging that one incident surfaced a chain of independent reliability bugs in the handoff/teardown/status path — each of which can turn a worker problem into a silent bad result (a crash reported as success, a report truncated, a 10s "completion" that did no work). This PR fixes all of them.

Each fix is its own commit with a self-contained message, so the PR can be reviewed commit-by-commit. All changes are unit-tested.

The fixes

  1. status_monitor: drop stray output for deleted terminals instead of raising.
    When a worker dies, its DB row is deleted but the pane keeps emitting (shell prompt, the "Resume this session" banner, tmux redraws). _process_chunk called provider_manager.get_provider() unguarded, which raises ValueError for an unknown terminal, and run() logged a full traceback per stray chunk — flooding the log until the pane went quiet. Guarded like get_status() already is.

  2. agent_step: detect a crashed worker instead of reporting false success.
    A segfaulted CLI freezes its pane. A frozen screen with a leftover response marker satisfies the COMPLETED heuristic trivially — death is permanent stability, so the confirm-stable window passes because the process is dead — and the handoff returns success with a truncated message. (A frozen screen matching no marker instead pins PROCESSING and the wait burns the full timeout, mislabeling a 5s crash as a 600s timeout.) Adds a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent keeps its CLI as the pane foreground; only a crashed/exited one reverts to the shell baseline captured at launch. Probed at both decision points so a crash becomes kind="error" → 502 → handoff success=False. Generalizes the shell-baseline capture (previously kiro-only) to every provider. The probe returns False on any uncertainty, so healthy/indeterminate workers are never aborted.

  3. claude_code: stop clipping reports at in-report markdown rules / tables.
    extract_last_message_from_script stopped at the first full-width run, assuming it was the TUI turn-separator. But the renderer draws a markdown thematic break (---) and plain table dividers with the same pure-dash run, so any report containing one was clipped mid-message. Now a dash run terminates extraction only when it frames the input box (next non-blank line is the prompt, or the capture ends); a rule followed by more report text is content.

  4. terminal_service: serve persisted scrollback when re-reading a torn-down terminal.
    handoff tears the worker down when it finishes (DB row deleted), so re-reading it via get_terminal_output returned 404 — the in-band handoff result was the only copy. delete_terminal already persists {id}.scrollback + {id}.snapshot.json; get_output now falls back to them (FULL → raw scrollback; LAST → re-extract with the recorded provider). A genuinely-unknown terminal still 404s.

  5. providers: declarative DEFAULT_LAUNCH_ENV for per-provider launch env.
    Generalizes per-provider launch-environment injection instead of hand-rolling inline shell prefixes. Providers declare a class-level DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type (before the instance exists, since the pane env is assembled at session/window creation) and merges it into extra_env for both supervisor and workers, under any operator --env. Unlike the process-local --env store, it is re-derived from the provider class on every spawn, so it survives a cao-server restart. First user: claude_code sets BUN_JSC_useDFGJIT=0 to disable the Bun DFG-JIT tier whose background compile thread can segfault mid-task.

  6. claude_code: don't read the start-of-line effort footer as a response.
    The effort indicator renders as ● high · /effort (also medium/low/xhigh/…). EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when rendered inline (… esc to interrupt ● high · /effort), but it also renders on its own line, where the leading was read as a response marker — flipping get_status to a false COMPLETED on the idle launch screen (the box + this footer satisfy the completion heuristic), so a handoff "completed" in ~10s before the worker did any work and extraction returned high · /effort as the message. Adds a negative lookahead rejecting a start-of-line marker whose line ends in /effort.

  7. claude_code: treat an in-progress tool-use marker as PROCESSING, not done.
    A response-marker line ending in the ellipsis (⏺ Reading 1 file…, ● Running command…) is an active tool call, not a finished answer. When the worker emits a preamble and then thinks quietly for >10s, the live spinner scrolls out of the rolling buffer, the PROCESSING checks miss it, and get_status falls through to COMPLETED on the marker + idle prompt — false-completing a handoff in ~10s and returning the preamble instead of the report. Now the freshest response-marker line ending in reads as PROCESSING. Gated on last_completion is None, so a genuinely finished turn (which shows the ✻ Worked for Ns summary) still completes; only the single-char (U+2026) the TUI emits is matched, so prose ending in ASCII ... cannot false-trigger.

  8. handoff: inherit the conductor's working directory.
    handoff() placed the worker in the cao-server process CWD instead of the supervisor's project directory. assign()/_create_terminal resolve the conductor's working directory when none is pinned, but _handoff_impl never did — so a supervisor launched in ~/dev/personal/myrepo handed off to a developer that ran in whatever directory the daemon was started from, operating on the wrong tree (recon only worked because it used absolute paths). _handoff_impl now resolves the conductor's working directory when the caller didn't pin one, mirroring _create_terminal.

Testing

  • Full non-integration suite green: uv run pytest test/ --ignore=test/e2e -m "not integration"3226 passed.
  • New regression tests for each fix (status_monitor deleted-terminal, crash detection + liveness probe, extraction of in-report rules, persisted-output fallback, DEFAULT_LAUNCH_ENV resolution + injection, effort-footer exclusion, in-progress tool-use detection, handoff working-directory inheritance).
  • black, isort, mypy src/ clean (no new findings; pre-existing pyte typing notes in status_monitor unchanged).
  • Verified live end-to-end against a real recon: pane-env delivery confirmed; a previously-crashing/false-completing handoff now runs a real ~40s handoff (worker actually executes npm run typecheck + test:unit) and returns an un-truncated report.

🤖 Generated with Claude Code

chrisdfir and others added 7 commits June 29, 2026 19:47
…of raising

When a worker crashes, its DB row is deleted but the pane keeps emitting
(shell prompt, "Resume this session" banner, tmux redraws). _process_chunk
called provider_manager.get_provider() unguarded, which raises ValueError for
an unknown terminal, and run()'s handler logged a full traceback per stray
chunk — spamming the log until the pane went quiet.

Guard the lookup like get_status() already does: on ValueError, release any
residual buffer/latch state once and drop the chunk at debug level.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch env

Generalize per-provider launch-environment injection instead of hand-rolling
inline shell prefixes in each provider's command builder. Providers declare a
class-level DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type
(before the instance exists, since the pane env is needed at tmux session/window
creation) and merges it into extra_env for both the supervisor and every worker.

- base: BaseProvider.DEFAULT_LAUNCH_ENV = {}
- claude_code: DEFAULT_LAUNCH_ENV = {'BUN_JSC_useDFGJIT': '0'} to disable the
  Bun DFG JIT tier whose background compile thread can segfault mid-task
- manager: default_launch_env(provider_type) + a registry kept exhaustive over
  ProviderType by test
- terminal_service: merge under operator --env (so --env still overrides)

Unlike the process-local --env store, this is re-derived from the provider class
on every spawn, so it survives a cao-server restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orn-down terminal

handoff tears the worker down when it finishes (DB row deleted), so a supervisor
that re-reads it via get_terminal_output got a 404 — the in-band handoff result
was the only copy of the worker's report. delete_terminal already persists the
full plain-text {id}.scrollback plus a {id}.snapshot.json recording the provider.

get_output now falls back to those when terminal metadata is gone: FULL returns
the raw scrollback; LAST re-extracts the final message with the recorded
provider's stateless extractor, falling back to raw if extraction can't run. A
genuinely-unknown terminal (no snapshot) still raises not-found (404).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uccess

A segfaulted CLI freezes its pane. A frozen screen with a leftover response
marker can satisfy the COMPLETED heuristic trivially — death is permanent
stability, so the confirm-stable window passes BECAUSE the process is dead — and
run_agent_step then extracts a truncated message and returns success. A frozen
screen that matches no marker instead pins status at PROCESSING, so the wait
runs the full timeout and a 5s crash is mislabeled as a 600s timeout.

Add a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent
keeps its CLI as the pane foreground; only a crashed/exited one reverts to the
shell baseline captured at launch. Probe at both decision points so a crash
becomes kind="error" (-> 502 -> handoff success=False, "worker process exited")
instead of a false success or a mislabeled timeout. Generalize the shell-baseline
capture (previously kiro-only) to every provider. The probe returns False on any
uncertainty, so healthy and indeterminate workers are never aborted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ables

extract_last_message_from_script terminated at the first full-width U+2500 run,
assuming it was always the TUI turn-separator above the input box. But Claude's
renderer draws a markdown thematic break (---) and plain table dividers with the
same pure-dash run, so any report containing one was clipped mid-message — e.g.
a recon report lost its architecture summary and test counts after a '---'.

Disambiguate by framing: a dash run terminates extraction only when the next
non-blank line is the ❯/> prompt (it frames the input box) or the capture ends.
A rule followed by more report text is content — drop the rule line and keep
extracting. Real Claude output never places response content after the turn
separator (the footer sits below the box), so this is correct for live output;
the prior synthetic test that asserted otherwise is updated to a realistic layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
black-format the new tests and type _PROVIDER_CLASS_BY_TYPE /
provider_class_for as type[BaseProvider] so the re-extraction path is not
Any-typed (clears the only new mypy finding; pre-existing pyte typing
notes in status_monitor are unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ponse

The TUI effort indicator renders as '● high · /effort' (also medium/low/xhigh/
max/none). EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when
it rendered inline ('… esc to interrupt ● high · /effort'), but it also renders
on its OWN line. There the leading '●' was read as a response marker, so:
- get_status saw 'marker + ❯ box' on the idle launch screen and returned a false
  COMPLETED, which held stable through the handoff's 10s confirm window -> the
  handoff 'completed' in ~10s before the worker did any work; and
- extraction returned 'high · /effort' as the message.

Add a negative lookahead rejecting a start-of-line marker whose line ends in
'/effort'. Real responses still match, complete, and extract; a real response
above the footer still anchors correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 30, 2026 22:05

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the worker lifecycle/handoff path so worker crashes, torn-down terminals, and Claude Code TUI quirks don’t produce silent false-successes, truncated outputs, or noisy status-monitor failures.

Changes:

  • Add crash/liveness detection using a shell-baseline + foreground-process probe to avoid false COMPLETED/timeout outcomes.
  • Improve output handling by re-serving persisted scrollback for deleted terminals and making Claude Code extraction resilient to in-report “rule/table” separators and effort footers.
  • Introduce provider-scoped DEFAULT_LAUNCH_ENV defaults resolved pre-instantiation and injected at tmux pane creation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/services/test_terminal_service_full.py Adds regression tests for provider launch-env injection, persisted output fallback, and liveness probe behavior.
test/services/test_status_monitor.py Adds tests ensuring stray output for deleted terminals is tolerated without exceptions and clears residual state.
test/services/test_agent_step.py Adds tests asserting crashed-worker detection prevents false success/timeout reporting.
test/providers/test_provider_manager_unit.py Adds tests for resolving DEFAULT_LAUNCH_ENV per provider type (including exhaustiveness).
test/providers/test_claude_code_unit.py Adds tests preventing Claude Code report clipping on rules/tables and excluding the effort footer from extraction/completion.
src/cli_agent_orchestrator/services/terminal_service.py Injects provider launch-env defaults into tmux pane env, captures shell baseline, adds persisted-output fallback, and introduces worker_returned_to_shell().
src/cli_agent_orchestrator/services/status_monitor.py Makes _process_chunk tolerant of missing terminals by dropping stray output and clearing related state.
src/cli_agent_orchestrator/services/agent_step.py Uses liveness probe to classify crashed workers as kind="error" instead of success/timeout.
src/cli_agent_orchestrator/providers/manager.py Adds provider-type → class mapping plus helpers to resolve class-level defaults and stateless extractors.
src/cli_agent_orchestrator/providers/claude_code.py Adds DEFAULT_LAUNCH_ENV, refines extraction response pattern to ignore effort footer, and improves separator handling to avoid truncation.
src/cli_agent_orchestrator/providers/base.py Introduces DEFAULT_LAUNCH_ENV as a supported provider-level declarative default.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +115 to +134
try:
provider = provider_manager.get_provider(terminal_id)
except ValueError:
# Terminal's DB row is gone but stray output chunks are still
# arriving — e.g. a crashed worker whose pane keeps emitting the
# shell prompt / "Resume this session" banner and tmux redraws
# after its record was deleted, or chunks draining out of the
# event bus after delete_terminal stopped the FIFO reader. Release
# any residual buffer/latch state once and drop the chunk. Without
# this, get_provider raised ValueError for every such chunk and
# run()'s handler logged a full traceback per chunk, spamming the
# log until the pane went quiet. Mirrors the defensive
# get_provider() handling already in get_status().
if terminal_id in self._buffers or terminal_id in self._last_status:
self.clear_terminal(terminal_id)
logger.debug("Dropping output for unknown/deleted terminal %s", terminal_id)
return
except Exception:
logger.exception("Error resolving provider for %s; dropping chunk", terminal_id)
return
re-extracts the final message with the recorded provider's (stateless)
extractor, falling back to the raw scrollback if extraction can't run.

Returns None when no snapshot exists (genuinely-unknown terminal).
…t done

A response-marker line ending in the '…' ellipsis ('⏺ Reading 1 file…',
'● Running command…') is an active tool call, not a finished answer. When the
worker emits a preamble and then thinks quietly for >10s, the live spinner
scrolls out of the rolling buffer, the PROCESSING checks miss it, and get_status
falls through to COMPLETED on the marker+idle-prompt. That false-completes a
handoff in ~10s, returning the preamble instead of the report.

Treat the freshest response-marker line ending in '…' as PROCESSING. Gated on
last_completion is None, so a genuinely finished turn (which shows the
'✻ Worked for Ns' summary) still completes. Only the single-char '…' (U+2026)
the TUI emits is matched, so prose ending in ASCII '...' cannot false-trigger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@0214f23). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...li_agent_orchestrator/services/terminal_service.py 81.25% 9 Missing ⚠️
.../cli_agent_orchestrator/services/status_monitor.py 70.00% 3 Missing ⚠️
src/cli_agent_orchestrator/mcp_server/server.py 81.81% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #352   +/-   ##
=======================================
  Coverage        ?   87.47%           
=======================================
  Files           ?      115           
  Lines           ?    13624           
  Branches        ?        0           
=======================================
  Hits            ?    11917           
  Misses          ?     1707           
  Partials        ?        0           
Flag Coverage Δ
unittests 87.47% <85.10%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

handoff() placed the worker in the cao-server process CWD instead of the
supervisor's project directory: assign()/_create_terminal resolve the
conductor's working directory when none is pinned, but _handoff_impl never did.
So a supervisor launched in ~/dev/personal/myrepo handed off to a developer that
ran in whatever directory the daemon was started from — operating on the wrong
tree (recon only worked because it used absolute paths).

Resolve the conductor's working directory in _handoff_impl when the caller
didn't pin one, mirroring _create_terminal, so the worker runs in the same
project directory as the supervisor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 30, 2026 22:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comment on lines +687 to +694
snapshot_path = TERMINAL_LOG_DIR / f"{terminal_id}.snapshot.json"
provider_type = _json.loads(snapshot_path.read_text(encoding="utf-8"))["provider"]
provider_cls = provider_class_for(provider_type)
if provider_cls is not None:
# extract_last_message_from_script reads only class-level regexes,
# not live terminal state, so a throwaway instance is sufficient.
extractor = provider_cls(terminal_id, "", "")
return extractor.extract_last_message_from_script(raw)
gutosantos82 added a commit to gutosantos82/cli-agent-orchestrator that referenced this pull request Jul 1, 2026
--refresh-meta: rewrite metadata for every open PR and exit, no reviews.
Cheap way to correct stale CI/labels/size flags on the dashboard (metadata
is a point-in-time snapshot, so CI can go stale between review runs) without
spending review agents.

Synthesis-stall guard: rewrite the Step-3 wait rule to bias hard toward
synthesizing. A parked session produces no report at all — worse than a
report missing one angle. Explicitly close the "wait for the highest-signal
reviewer" escape hatch that kept tripping the guard (awslabs#329, awslabs#352), and add a
0-1-findings fallback that writes an "incomplete" report rather than parking.
@haofeif

haofeif commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

@chrisdfir thanks for your contribution!

I rechecked #352 carefully on the latest head (117d5f3) against current
origin/main (5dcf319).

I found one confirmed P2 in the new persisted-output path:

_persisted_output() rebuilds the provider extractor as:

extractor = provider_cls(terminal_id, "", "")

But the delete-time snapshot already stores agent_profile, and both
QCliProvider and KiroCliProvider require it positionally in __init__.
So for torn-down q_cli / kiro_cli terminals, get_output(..., LAST) now
throws during reconstruction, swallows that exception, and silently falls back
to raw scrollback instead of returning the provider-extracted final answer.

Relevant spots:

  • src/cli_agent_orchestrator/services/terminal_service.py:687-694
  • src/cli_agent_orchestrator/services/terminal_service.py:902-907
  • src/cli_agent_orchestrator/providers/q_cli.py:30-36
  • src/cli_agent_orchestrator/providers/kiro_cli.py:132-138

The new tests only cover the Claude path:

  • test/services/test_terminal_service_full.py:1270-1283

So this provider gap is currently untested.

Suggested fix: parse the full snapshot in _persisted_output() and pass
agent_profile when rebuilding the extractor, or add a small provider-manager
helper for snapshot-based extractor creation. I’d also add persisted LAST
coverage for at least q_cli and kiro_cli.

Focused checks I ran locally:

uv run pytest test/services/test_agent_step.py test/services/test_status_monitor.py test/services/test_terminal_service_full.py test/providers/test_claude_code_unit.py test/providers/test_provider_manager_unit.py test/mcp_server/test_handoff_equivalence.py -q
uv run black --check src/ test/
uv run isort --check-only src/ test/
git diff --check origin/main...HEAD

Everything above passed, but I would still fix the persisted-output gap before
merge.

Findings

P2 - torn-down LAST output still breaks for q_cli and kiro_cli

_persisted_output() rebuilds a provider extractor as:

  • src/cli_agent_orchestrator/services/terminal_service.py:687-694

It always does:

extractor = provider_cls(terminal_id, "", "")

That only works for providers whose constructor arguments after window_name are optional.

But the delete-time snapshot already persists agent_profile:

  • src/cli_agent_orchestrator/services/terminal_service.py:902-907

And both of these providers require it positionally:

  • src/cli_agent_orchestrator/providers/q_cli.py:30-36
  • src/cli_agent_orchestrator/providers/kiro_cli.py:132-138

So for torn-down Q/Kiro terminals, the new LAST recovery path throws during provider reconstruction, swallows the exception, and silently falls back to raw scrollback instead of the provider-extracted final answer.

That means the new "recover output after worker teardown" behavior is only fully working for providers like claude_code; it is still broken for two major providers that the product already supports.

The current tests only cover the Claude path:

  • test/services/test_terminal_service_full.py:1270-1283

There is no regression coverage for persisted LAST output on q_cli or kiro_cli, which is why this slips through.

Why I still classify this as P2, not P1

I rechecked the actual user-facing blast radius before settling on severity.

I still think this is P2 because:

  • the main handoff success path still returns the worker result inline from run_agent_step; this bug does not break the primary completion path
  • the torn-down reread path no longer 404s; it degrades to raw scrollback
  • the regression is limited to providers whose extractor reconstruction needs snapshot metadata (q_cli / kiro_cli), not all providers
  • this is a correctness/contract gap in a fallback read path, not data loss or a broken core execution flow

What would make it P1 for me:

  • handoff/assign returning the wrong result on the primary success path
  • rereads of torn-down workers failing entirely again (404 / 500)
  • terminal teardown losing the only copy of the worker output
  • the bug affecting the common live get_output(..., LAST) path rather than only the persisted fallback

So I would still ask for a fix before merge, but I do not think it rises to a P1 blocker.

Repro I confirmed locally

With a persisted snapshot like:

{"terminal_id":"dead","provider":"q_cli","agent_profile":"developer"}

and scrollback:

[developer] >
> final answer
[developer] >

the current code returns the full raw scrollback from get_output("dead", OutputMode.LAST).

But the actual provider extractor works when instantiated with the recorded profile:

QCliProvider("t", "", "", "developer").extract_last_message_from_script(raw)
# => "final answer"

Suggested fix

Parse the full snapshot in _persisted_output() and pass agent_profile when rebuilding the extractor, or add a small provider-manager helper that constructs a stateless extractor from snapshot metadata instead of guessing constructor shape in terminal_service.

I would also add direct regression tests for persisted LAST output on at least:

  • q_cli
  • kiro_cli

What I rechecked besides the finding

  • crash-vs-timeout/error classification in run_agent_step
  • handoff working-directory inheritance
  • deleted-terminal chunk handling in StatusMonitor
  • provider shell-baseline capture / restore
  • Claude Code launch-env, status, and extraction changes

I did not confirm another blocker from those areas.

Checks run

In /private/tmp/cao-pr352:

uv run pytest test/services/test_agent_step.py test/services/test_status_monitor.py test/services/test_terminal_service_full.py test/providers/test_claude_code_unit.py test/providers/test_provider_manager_unit.py test/mcp_server/test_handoff_equivalence.py -q
# 246 passed

uv run black --check src/ test/
# passed

uv run isort --check-only src/ test/
# passed

git diff --check origin/main...HEAD
# passed

Review conclusion

I would hold merge on this until the persisted LAST extraction path uses the snapshot metadata it already saves and has coverage beyond claude_code.

@haofeif haofeif added the bug Something isn't working label Jul 1, 2026
gutosantos82 added a commit to gutosantos82/cli-agent-orchestrator that referenced this pull request Jul 1, 2026
The supervisor now fetches existing human comments/reviews on the PR in
Step 1 (bots + the dashboard user excluded), and at synthesis (Step 4) checks
each finding against them: anything a maintainer already raised is moved out
of Blocking/Important/Nits into a new "Prior feedback" section that credits
them and says we concur, rather than restating it. Only human feedback is
deduped — a finding only a bot (Copilot/CodeQL) raised is still reported.

Keeps the posted comment focused on net-new findings so it adds signal
instead of repeating a maintainer (motivated by awslabs#352, where our persisted-
output nit overlapped haofeif's P2 while our ellipsis-hang + scrollback-leak
findings were genuinely new).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants