fix: reliability fixes for handoff crash detection, output capture, and status detection by chrisdfir · Pull Request #352 · awslabs/cli-agent-orchestrator

chrisdfir · 2026-06-30T22:04:59Z

Summary

A claude_code worker segfaulted mid-handoff during a read-only recon run. Debugging that one incident surfaced a chain of independent reliability bugs in the handoff/teardown/status path — each of which can turn a worker problem into a silent bad result (a crash reported as success, a report truncated, a 10s "completion" that did no work). This PR fixes all of them.

Each fix is its own commit with a self-contained message, so the PR can be reviewed commit-by-commit. All changes are unit-tested.

The fixes

status_monitor: drop stray output for deleted terminals instead of raising.
When a worker dies, its DB row is deleted but the pane keeps emitting (shell prompt, the "Resume this session" banner, tmux redraws). _process_chunk called provider_manager.get_provider() unguarded, which raises ValueError for an unknown terminal, and run() logged a full traceback per stray chunk — flooding the log until the pane went quiet. Guarded like get_status() already is.
agent_step: detect a crashed worker instead of reporting false success.
A segfaulted CLI freezes its pane. A frozen screen with a leftover response marker satisfies the COMPLETED heuristic trivially — death is permanent stability, so the confirm-stable window passes because the process is dead — and the handoff returns success with a truncated message. (A frozen screen matching no marker instead pins PROCESSING and the wait burns the full timeout, mislabeling a 5s crash as a 600s timeout.) Adds a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent keeps its CLI as the pane foreground; only a crashed/exited one reverts to the shell baseline captured at launch. Probed at both decision points so a crash becomes kind="error" → 502 → handoff success=False. Generalizes the shell-baseline capture (previously kiro-only) to every provider. The probe returns False on any uncertainty, so healthy/indeterminate workers are never aborted.
claude_code: stop clipping reports at in-report markdown rules / tables.
extract_last_message_from_script stopped at the first full-width ─ run, assuming it was the TUI turn-separator. But the renderer draws a markdown thematic break (---) and plain table dividers with the same pure-dash run, so any report containing one was clipped mid-message. Now a dash run terminates extraction only when it frames the ❯ input box (next non-blank line is the prompt, or the capture ends); a rule followed by more report text is content.
terminal_service: serve persisted scrollback when re-reading a torn-down terminal.
handoff tears the worker down when it finishes (DB row deleted), so re-reading it via get_terminal_output returned 404 — the in-band handoff result was the only copy. delete_terminal already persists {id}.scrollback + {id}.snapshot.json; get_output now falls back to them (FULL → raw scrollback; LAST → re-extract with the recorded provider). A genuinely-unknown terminal still 404s.
providers: declarative DEFAULT_LAUNCH_ENV for per-provider launch env.
Generalizes per-provider launch-environment injection instead of hand-rolling inline shell prefixes. Providers declare a class-level DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type (before the instance exists, since the pane env is assembled at session/window creation) and merges it into extra_env for both supervisor and workers, under any operator --env. Unlike the process-local --env store, it is re-derived from the provider class on every spawn, so it survives a cao-server restart. First user: claude_code sets BUN_JSC_useDFGJIT=0 to disable the Bun DFG-JIT tier whose background compile thread can segfault mid-task.
claude_code: don't read the start-of-line effort footer as a response.
The effort indicator renders as ● high · /effort (also medium/low/xhigh/…). EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when rendered inline (… esc to interrupt ● high · /effort), but it also renders on its own line, where the leading ● was read as a response marker — flipping get_status to a false COMPLETED on the idle launch screen (the ❯ box + this footer satisfy the completion heuristic), so a handoff "completed" in ~10s before the worker did any work and extraction returned high · /effort as the message. Adds a negative lookahead rejecting a start-of-line marker whose line ends in /effort.
claude_code: treat an in-progress tool-use marker as PROCESSING, not done.
A response-marker line ending in the … ellipsis (⏺ Reading 1 file…, ● Running command…) is an active tool call, not a finished answer. When the worker emits a preamble and then thinks quietly for >10s, the live spinner scrolls out of the rolling buffer, the PROCESSING checks miss it, and get_status falls through to COMPLETED on the marker + idle prompt — false-completing a handoff in ~10s and returning the preamble instead of the report. Now the freshest response-marker line ending in … reads as PROCESSING. Gated on last_completion is None, so a genuinely finished turn (which shows the ✻ Worked for Ns summary) still completes; only the single-char … (U+2026) the TUI emits is matched, so prose ending in ASCII ... cannot false-trigger.
handoff: inherit the conductor's working directory.
handoff() placed the worker in the cao-server process CWD instead of the supervisor's project directory. assign()/_create_terminal resolve the conductor's working directory when none is pinned, but _handoff_impl never did — so a supervisor launched in ~/dev/personal/myrepo handed off to a developer that ran in whatever directory the daemon was started from, operating on the wrong tree (recon only worked because it used absolute paths). _handoff_impl now resolves the conductor's working directory when the caller didn't pin one, mirroring _create_terminal.

Testing

Full non-integration suite green: uv run pytest test/ --ignore=test/e2e -m "not integration" → 3226 passed.
New regression tests for each fix (status_monitor deleted-terminal, crash detection + liveness probe, extraction of in-report rules, persisted-output fallback, DEFAULT_LAUNCH_ENV resolution + injection, effort-footer exclusion, in-progress tool-use detection, handoff working-directory inheritance).
black, isort, mypy src/ clean (no new findings; pre-existing pyte typing notes in status_monitor unchanged).
Verified live end-to-end against a real recon: pane-env delivery confirmed; a previously-crashing/false-completing handoff now runs a real ~40s handoff (worker actually executes npm run typecheck + test:unit) and returns an un-truncated report.

🤖 Generated with Claude Code

…of raising When a worker crashes, its DB row is deleted but the pane keeps emitting (shell prompt, "Resume this session" banner, tmux redraws). _process_chunk called provider_manager.get_provider() unguarded, which raises ValueError for an unknown terminal, and run()'s handler logged a full traceback per stray chunk — spamming the log until the pane went quiet. Guard the lookup like get_status() already does: on ValueError, release any residual buffer/latch state once and drop the chunk at debug level. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ch env Generalize per-provider launch-environment injection instead of hand-rolling inline shell prefixes in each provider's command builder. Providers declare a class-level DEFAULT_LAUNCH_ENV; the service layer resolves it by provider type (before the instance exists, since the pane env is needed at tmux session/window creation) and merges it into extra_env for both the supervisor and every worker. - base: BaseProvider.DEFAULT_LAUNCH_ENV = {} - claude_code: DEFAULT_LAUNCH_ENV = {'BUN_JSC_useDFGJIT': '0'} to disable the Bun DFG JIT tier whose background compile thread can segfault mid-task - manager: default_launch_env(provider_type) + a registry kept exhaustive over ProviderType by test - terminal_service: merge under operator --env (so --env still overrides) Unlike the process-local --env store, this is re-derived from the provider class on every spawn, so it survives a cao-server restart. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…orn-down terminal handoff tears the worker down when it finishes (DB row deleted), so a supervisor that re-reads it via get_terminal_output got a 404 — the in-band handoff result was the only copy of the worker's report. delete_terminal already persists the full plain-text {id}.scrollback plus a {id}.snapshot.json recording the provider. get_output now falls back to those when terminal metadata is gone: FULL returns the raw scrollback; LAST re-extracts the final message with the recorded provider's stateless extractor, falling back to raw if extraction can't run. A genuinely-unknown terminal (no snapshot) still raises not-found (404). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uccess A segfaulted CLI freezes its pane. A frozen screen with a leftover response marker can satisfy the COMPLETED heuristic trivially — death is permanent stability, so the confirm-stable window passes BECAUSE the process is dead — and run_agent_step then extracts a truncated message and returns success. A frozen screen that matches no marker instead pins status at PROCESSING, so the wait runs the full timeout and a 5s crash is mislabeled as a 600s timeout. Add a fail-safe pane-liveness probe (worker_returned_to_shell): a live agent keeps its CLI as the pane foreground; only a crashed/exited one reverts to the shell baseline captured at launch. Probe at both decision points so a crash becomes kind="error" (-> 502 -> handoff success=False, "worker process exited") instead of a false success or a mislabeled timeout. Generalize the shell-baseline capture (previously kiro-only) to every provider. The probe returns False on any uncertainty, so healthy and indeterminate workers are never aborted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ables extract_last_message_from_script terminated at the first full-width U+2500 run, assuming it was always the TUI turn-separator above the input box. But Claude's renderer draws a markdown thematic break (---) and plain table dividers with the same pure-dash run, so any report containing one was clipped mid-message — e.g. a recon report lost its architecture summary and test counts after a '---'. Disambiguate by framing: a dash run terminates extraction only when the next non-blank line is the ❯/> prompt (it frames the input box) or the capture ends. A rule followed by more report text is content — drop the rule line and keep extracting. Real Claude output never places response content after the turn separator (the footer sits below the box), so this is correct for live output; the prior synthetic test that asserted otherwise is updated to a realistic layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

black-format the new tests and type _PROVIDER_CLASS_BY_TYPE / provider_class_for as type[BaseProvider] so the re-extraction path is not Any-typed (clears the only new mypy finding; pre-existing pyte typing notes in status_monitor are unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ponse The TUI effort indicator renders as '● high · /effort' (also medium/low/xhigh/ max/none). EXTRACTION_RESPONSE_PATTERN's line-start anchor only excluded it when it rendered inline ('… esc to interrupt ● high · /effort'), but it also renders on its OWN line. There the leading '●' was read as a response marker, so: - get_status saw 'marker + ❯ box' on the idle launch screen and returned a false COMPLETED, which held stable through the handoff's 10s confirm window -> the handoff 'completed' in ~10s before the worker did any work; and - extraction returned 'high · /effort' as the message. Add a negative lookahead rejecting a start-of-line marker whose line ends in '/effort'. Real responses still match, complete, and extract; a real response above the footer still anchors correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR hardens the worker lifecycle/handoff path so worker crashes, torn-down terminals, and Claude Code TUI quirks don’t produce silent false-successes, truncated outputs, or noisy status-monitor failures.

Changes:

Add crash/liveness detection using a shell-baseline + foreground-process probe to avoid false COMPLETED/timeout outcomes.
Improve output handling by re-serving persisted scrollback for deleted terminals and making Claude Code extraction resilient to in-report “rule/table” separators and effort footers.
Introduce provider-scoped DEFAULT_LAUNCH_ENV defaults resolved pre-instantiation and injected at tmux pane creation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
test/services/test_terminal_service_full.py	Adds regression tests for provider launch-env injection, persisted output fallback, and liveness probe behavior.
test/services/test_status_monitor.py	Adds tests ensuring stray output for deleted terminals is tolerated without exceptions and clears residual state.
test/services/test_agent_step.py	Adds tests asserting crashed-worker detection prevents false success/timeout reporting.
test/providers/test_provider_manager_unit.py	Adds tests for resolving `DEFAULT_LAUNCH_ENV` per provider type (including exhaustiveness).
test/providers/test_claude_code_unit.py	Adds tests preventing Claude Code report clipping on rules/tables and excluding the effort footer from extraction/completion.
src/cli_agent_orchestrator/services/terminal_service.py	Injects provider launch-env defaults into tmux pane env, captures shell baseline, adds persisted-output fallback, and introduces `worker_returned_to_shell()`.
src/cli_agent_orchestrator/services/status_monitor.py	Makes `_process_chunk` tolerant of missing terminals by dropping stray output and clearing related state.
src/cli_agent_orchestrator/services/agent_step.py	Uses liveness probe to classify crashed workers as `kind="error"` instead of success/timeout.
src/cli_agent_orchestrator/providers/manager.py	Adds provider-type → class mapping plus helpers to resolve class-level defaults and stateless extractors.
src/cli_agent_orchestrator/providers/claude_code.py	Adds `DEFAULT_LAUNCH_ENV`, refines extraction response pattern to ignore effort footer, and improves separator handling to avoid truncation.
src/cli_agent_orchestrator/providers/base.py	Introduces `DEFAULT_LAUNCH_ENV` as a supported provider-level declarative default.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        try:
+            provider = provider_manager.get_provider(terminal_id)
+        except ValueError:
+            # Terminal's DB row is gone but stray output chunks are still
+            # arriving — e.g. a crashed worker whose pane keeps emitting the
+            # shell prompt / "Resume this session" banner and tmux redraws
+            # after its record was deleted, or chunks draining out of the
+            # event bus after delete_terminal stopped the FIFO reader. Release
+            # any residual buffer/latch state once and drop the chunk. Without
+            # this, get_provider raised ValueError for every such chunk and
+            # run()'s handler logged a full traceback per chunk, spamming the
+            # log until the pane went quiet. Mirrors the defensive
+            # get_provider() handling already in get_status().
+            if terminal_id in self._buffers or terminal_id in self._last_status:
+                self.clear_terminal(terminal_id)
+            logger.debug("Dropping output for unknown/deleted terminal %s", terminal_id)
+            return
+        except Exception:
+            logger.exception("Error resolving provider for %s; dropping chunk", terminal_id)
+            return


+    re-extracts the final message with the recorded provider's (stateless)
+    extractor, falling back to the raw scrollback if extraction can't run.
+
+    Returns None when no snapshot exists (genuinely-unknown terminal).


…t done A response-marker line ending in the '…' ellipsis ('⏺ Reading 1 file…', '● Running command…') is an active tool call, not a finished answer. When the worker emits a preamble and then thinks quietly for >10s, the live spinner scrolls out of the rolling buffer, the PROCESSING checks miss it, and get_status falls through to COMPLETED on the marker+idle-prompt. That false-completes a handoff in ~10s, returning the preamble instead of the report. Treat the freshest response-marker line ending in '…' as PROCESSING. Gated on last_completion is None, so a genuinely finished turn (which shows the '✻ Worked for Ns' summary) still completes. Only the single-char '…' (U+2026) the TUI emits is matched, so prose ending in ASCII '...' cannot false-trigger. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-30T22:42:35Z

Codecov Report

❌ Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@0214f23). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...li_agent_orchestrator/services/terminal_service.py	81.25%	9 Missing ⚠️
.../cli_agent_orchestrator/services/status_monitor.py	70.00%	3 Missing ⚠️
src/cli_agent_orchestrator/mcp_server/server.py	81.81%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #352   +/-   ##
=======================================
  Coverage        ?   87.47%           
=======================================
  Files           ?      115           
  Lines           ?    13624           
  Branches        ?        0           
=======================================
  Hits            ?    11917           
  Misses          ?     1707           
  Partials        ?        0

Flag	Coverage Δ
unittests	`87.47% <85.10%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

handoff() placed the worker in the cao-server process CWD instead of the supervisor's project directory: assign()/_create_terminal resolve the conductor's working directory when none is pinned, but _handoff_impl never did. So a supervisor launched in ~/dev/personal/myrepo handed off to a developer that ran in whatever directory the daemon was started from — operating on the wrong tree (recon only worked because it used absolute paths). Resolve the conductor's working directory in _handoff_impl when the caller didn't pin one, mirroring _create_terminal, so the worker runs in the same project directory as the supervisor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

+        snapshot_path = TERMINAL_LOG_DIR / f"{terminal_id}.snapshot.json"
+        provider_type = _json.loads(snapshot_path.read_text(encoding="utf-8"))["provider"]
+        provider_cls = provider_class_for(provider_type)
+        if provider_cls is not None:
+            # extract_last_message_from_script reads only class-level regexes,
+            # not live terminal state, so a throwaway instance is sufficient.
+            extractor = provider_cls(terminal_id, "", "")
+            return extractor.extract_last_message_from_script(raw)


--refresh-meta: rewrite metadata for every open PR and exit, no reviews. Cheap way to correct stale CI/labels/size flags on the dashboard (metadata is a point-in-time snapshot, so CI can go stale between review runs) without spending review agents. Synthesis-stall guard: rewrite the Step-3 wait rule to bias hard toward synthesizing. A parked session produces no report at all — worse than a report missing one angle. Explicitly close the "wait for the highest-signal reviewer" escape hatch that kept tripping the guard (awslabs#329, awslabs#352), and add a 0-1-findings fallback that writes an "incomplete" report rather than parking.

haofeif · 2026-07-01T03:14:04Z

@chrisdfir thanks for your contribution!

I rechecked #352 carefully on the latest head (117d5f3) against current
origin/main (5dcf319).

I found one confirmed P2 in the new persisted-output path:

_persisted_output() rebuilds the provider extractor as:

extractor = provider_cls(terminal_id, "", "")

But the delete-time snapshot already stores agent_profile, and both
QCliProvider and KiroCliProvider require it positionally in __init__.
So for torn-down q_cli / kiro_cli terminals, get_output(..., LAST) now
throws during reconstruction, swallows that exception, and silently falls back
to raw scrollback instead of returning the provider-extracted final answer.

Relevant spots:

src/cli_agent_orchestrator/services/terminal_service.py:687-694
src/cli_agent_orchestrator/services/terminal_service.py:902-907
src/cli_agent_orchestrator/providers/q_cli.py:30-36
src/cli_agent_orchestrator/providers/kiro_cli.py:132-138

The new tests only cover the Claude path:

test/services/test_terminal_service_full.py:1270-1283

So this provider gap is currently untested.

Suggested fix: parse the full snapshot in _persisted_output() and pass
agent_profile when rebuilding the extractor, or add a small provider-manager
helper for snapshot-based extractor creation. I’d also add persisted LAST
coverage for at least q_cli and kiro_cli.

Focused checks I ran locally:

uv run pytest test/services/test_agent_step.py test/services/test_status_monitor.py test/services/test_terminal_service_full.py test/providers/test_claude_code_unit.py test/providers/test_provider_manager_unit.py test/mcp_server/test_handoff_equivalence.py -q
uv run black --check src/ test/
uv run isort --check-only src/ test/
git diff --check origin/main...HEAD

Everything above passed, but I would still fix the persisted-output gap before
merge.

Findings

P2 - torn-down `LAST` output still breaks for `q_cli` and `kiro_cli`

_persisted_output() rebuilds a provider extractor as:

src/cli_agent_orchestrator/services/terminal_service.py:687-694

It always does:

extractor = provider_cls(terminal_id, "", "")

That only works for providers whose constructor arguments after window_name are optional.

But the delete-time snapshot already persists agent_profile:

src/cli_agent_orchestrator/services/terminal_service.py:902-907

And both of these providers require it positionally:

src/cli_agent_orchestrator/providers/q_cli.py:30-36
src/cli_agent_orchestrator/providers/kiro_cli.py:132-138

So for torn-down Q/Kiro terminals, the new LAST recovery path throws during provider reconstruction, swallows the exception, and silently falls back to raw scrollback instead of the provider-extracted final answer.

That means the new "recover output after worker teardown" behavior is only fully working for providers like claude_code; it is still broken for two major providers that the product already supports.

The current tests only cover the Claude path:

test/services/test_terminal_service_full.py:1270-1283

There is no regression coverage for persisted LAST output on q_cli or kiro_cli, which is why this slips through.

Why I still classify this as P2, not P1

I rechecked the actual user-facing blast radius before settling on severity.

I still think this is P2 because:

the main handoff success path still returns the worker result inline from run_agent_step; this bug does not break the primary completion path
the torn-down reread path no longer 404s; it degrades to raw scrollback
the regression is limited to providers whose extractor reconstruction needs snapshot metadata (q_cli / kiro_cli), not all providers
this is a correctness/contract gap in a fallback read path, not data loss or a broken core execution flow

What would make it P1 for me:

handoff/assign returning the wrong result on the primary success path
rereads of torn-down workers failing entirely again (404 / 500)
terminal teardown losing the only copy of the worker output
the bug affecting the common live get_output(..., LAST) path rather than only the persisted fallback

So I would still ask for a fix before merge, but I do not think it rises to a P1 blocker.

Repro I confirmed locally

With a persisted snapshot like:

{"terminal_id":"dead","provider":"q_cli","agent_profile":"developer"}

and scrollback:

[developer] >
> final answer
[developer] >

the current code returns the full raw scrollback from get_output("dead", OutputMode.LAST).

But the actual provider extractor works when instantiated with the recorded profile:

QCliProvider("t", "", "", "developer").extract_last_message_from_script(raw)
# => "final answer"

Suggested fix

Parse the full snapshot in _persisted_output() and pass agent_profile when rebuilding the extractor, or add a small provider-manager helper that constructs a stateless extractor from snapshot metadata instead of guessing constructor shape in terminal_service.

I would also add direct regression tests for persisted LAST output on at least:

q_cli
kiro_cli

What I rechecked besides the finding

crash-vs-timeout/error classification in run_agent_step
handoff working-directory inheritance
deleted-terminal chunk handling in StatusMonitor
provider shell-baseline capture / restore
Claude Code launch-env, status, and extraction changes

I did not confirm another blocker from those areas.

Checks run

In /private/tmp/cao-pr352:

uv run pytest test/services/test_agent_step.py test/services/test_status_monitor.py test/services/test_terminal_service_full.py test/providers/test_claude_code_unit.py test/providers/test_provider_manager_unit.py test/mcp_server/test_handoff_equivalence.py -q
# 246 passed

uv run black --check src/ test/
# passed

uv run isort --check-only src/ test/
# passed

git diff --check origin/main...HEAD
# passed

Review conclusion

I would hold merge on this until the persisted LAST extraction path uses the snapshot metadata it already saves and has coverage beyond claude_code.

The supervisor now fetches existing human comments/reviews on the PR in Step 1 (bots + the dashboard user excluded), and at synthesis (Step 4) checks each finding against them: anything a maintainer already raised is moved out of Blocking/Important/Nits into a new "Prior feedback" section that credits them and says we concur, rather than restating it. Only human feedback is deduped — a finding only a bot (Copilot/CodeQL) raised is still reported. Keeps the posted comment focused on net-new findings so it adds signal instead of repeating a maintainer (motivated by awslabs#352, where our persisted- output nit overlapped haofeif's P2 while our ellipsis-hang + scrollback-leak findings were genuinely new).

chrisdfir and others added 7 commits June 29, 2026 19:47

Copilot AI review requested due to automatic review settings June 30, 2026 22:05

Copilot started reviewing on behalf of chrisdfir June 30, 2026 22:05 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 30, 2026 22:59

Copilot started reviewing on behalf of chrisdfir June 30, 2026 22:59 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

haofeif added the bug Something isn't working label Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: reliability fixes for handoff crash detection, output capture, and status detection#352

fix: reliability fixes for handoff crash detection, output capture, and status detection#352
chrisdfir wants to merge 9 commits into
awslabs:mainfrom
chrisdfir:reliability-fixes

chrisdfir commented Jun 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

codecov-commenter commented Jun 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

haofeif commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

chrisdfir commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The fixes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov-commenter commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

haofeif commented Jul 1, 2026

Findings

P2 - torn-down LAST output still breaks for q_cli and kiro_cli

Why I still classify this as P2, not P1

Repro I confirmed locally

Suggested fix

What I rechecked besides the finding

Checks run

Review conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chrisdfir commented Jun 30, 2026 •

edited

Loading

codecov-commenter commented Jun 30, 2026 •

edited

Loading

P2 - torn-down `LAST` output still breaks for `q_cli` and `kiro_cli`