
fix(daemon): eliminate flakiness in daemon and wrapper-daemon test suites #1064

Open
svarlamov wants to merge 11 commits into main from fix/daemon-wrapper-daemon-flakiness

Conversation


svarlamov commented Apr 12, 2026

Summary

Three root causes of daemon/wrapper-daemon test flakiness fixed:

  • Snapshot timestamp bug: payload_timestamp_ns in trace_normalizer.rs was calling Value::as_u64 on git trace2's time field, which is an RFC3339 string. It fell through to t_abs (elapsed seconds since program start, ~2–3ms), making finished_at_ns nearly zero. The mtime-cutoff filter then treated every file as "modified after cutoff", producing empty carryover snapshots and silently dropping amend-authorship working-log data. Fix: parse time as RFC3339 via chrono.

  • Missing pending-root slot for current_dir invocations: Commands run via current_dir (not -C) have no -C in argv, so the git trace2 start event carries cwd=None → no worktree hint → no PendingRoot slot → command was processed via the no-op "applied" path with no deferred snapshot capture. Subdirectory amend tests failed because the carryover snapshot was never captured. Fix: when replace_pending_root_entry finds no slot but the command has a resolved family_key, insert it directly as a ReadyCommand (late registration).

  • Async drain races captured-checkpoint tests: The async per-family drain workers made captured checkpoints (delegated with wait=false) truly fire-and-forget. Tests that called git_ai checkpoint mock_ai <files> and immediately read current_working_logs() raced against the daemon. The production code path is correct. Fix in the test harness only: git_ai_with_env and git_ai_with_stdin now capture the daemon's total completion-log baseline before running a checkpoint, then wait for it to advance if the output contains "Checkpoint queued" (the existing production marker for async-delegated checkpoints).
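To make the magnitude of the timestamp bug concrete, here is a stdlib-only sketch. The real fix uses chrono's RFC3339 parsing; `rfc3339_utc_to_epoch_ns` below is a hypothetical hand-rolled helper (UTC "Z" times only), included purely to show that parsing the `time` string yields an epoch value on the order of 10^18 ns, versus the ~2–3 ms `t_abs` fallback that made `finished_at_ns` nearly zero.

```rust
// Hand-rolled RFC3339 parser sketch (UTC "Z" only, optional fractional
// seconds). Illustration only: the actual fix in trace_normalizer.rs
// uses chrono, not this function.
fn rfc3339_utc_to_epoch_ns(s: &str) -> Option<u64> {
    let s = s.strip_suffix('Z')?;
    let (date, time) = s.split_once('T')?;
    let mut d = date.splitn(3, '-');
    let (y, m, day): (i64, i64, i64) = (
        d.next()?.parse().ok()?,
        d.next()?.parse().ok()?,
        d.next()?.parse().ok()?,
    );
    let (hms, frac) = time.split_once('.').unwrap_or((time, ""));
    let mut t = hms.splitn(3, ':');
    let (hh, mm, ss): (i64, i64, i64) = (
        t.next()?.parse().ok()?,
        t.next()?.parse().ok()?,
        t.next()?.parse().ok()?,
    );
    // Days since the Unix epoch via the days-from-civil algorithm.
    let y2 = if m <= 2 { y - 1 } else { y };
    let era = y2.div_euclid(400);
    let yoe = y2 - era * 400;
    let doy = (153 * ((m + 9) % 12) + 2) / 5 + day - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    let days = era * 146097 + doe - 719468;
    // Fractional seconds contribute up to 9 digits of nanoseconds.
    let mut ns = 0u64;
    for (i, c) in frac.chars().take(9).enumerate() {
        ns += c.to_digit(10)? as u64 * 10u64.pow(8 - i as u32);
    }
    Some((days * 86_400 + hh * 3_600 + mm * 60 + ss) as u64 * 1_000_000_000 + ns)
}
```

A 2026 wall-clock time parses to roughly 1.78 × 10^18 ns, so any mtime cutoff derived from it sits far after program start, unlike the near-zero `t_abs`-derived value.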

Test plan

  • All 2937 integration tests pass at 4, 8, 12, and 16 threads with 0 failures
  • Verified status_ignore and subdirs amend tests pass in daemon and wrapper-daemon modes
  • No production code paths changed by the test harness fix

🤖 Generated with Claude Code



svarlamov and others added 5 commits April 12, 2026 18:10
…ites

Three root causes addressed:

1. **Snapshot timestamp bug (trace_normalizer.rs)**: `payload_timestamp_ns`
   was calling `Value::as_u64` on the `time` field, but git trace2 sends
   `time` as an RFC3339 string. This fell through to `t_abs` (elapsed
   seconds since program start, ~2–3ms), making `finished_at_ns` nearly
   zero. The mtime-cutoff filter in `read_worktree_snapshot_for_files_at_or_before`
   then treated every file as "modified after cutoff", producing empty
   carryover snapshots and causing amend-authorship attribution to silently
   drop all working-log data.
   Fix: parse `time` as RFC3339 via chrono before falling back to `t_abs`.

2. **Missing pending-root slot for `current_dir` invocations (daemon.rs)**:
   Commands run via `current_dir` (not the `-C` flag) have no `-C` in
   their argv, so the git trace2 `start` event carries `cwd=None` → no
   worktree hint → no `PendingRoot` slot → `replace_pending_root_entry`
   returned `None` → the command was processed via the "applied" (no-op)
   path with no deferred snapshot capture. Subdirectory amend tests failed
   because the carryover snapshot was never captured.
   Fix: when `replace_pending_root_entry` finds no slot but the command
   has a resolved `family_key`, insert it directly as a `ReadyCommand`
   into the family sequencer (late registration), preserving deferred
   capture and correct drain ordering.

3. **Async drain races captured-checkpoint tests (test harness)**:
   Fix 2's async per-family drain workers made captured checkpoints
   (delegated with `wait=false`) truly fire-and-forget. Tests that called
   `git_ai checkpoint mock_ai <files>` and immediately read
   `current_working_logs()` raced against the daemon processing the
   queued checkpoint. The production code path is correct; the test harness
   needed to wait for completion.
   Fix: `git_ai_with_env` and `git_ai_with_stdin` now capture the daemon's
   total completion-log count before running a checkpoint command, then
   wait for the count to advance if the output contains "Checkpoint queued"
   (the marker the production code already prints for async-delegated
   checkpoints). Live checkpoints already block until complete, so no
   waiting is needed for them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
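The late-registration path in fix 2 can be modeled with a minimal sketch. `Entry`, `FamilySequencer`, and `register_command` below are hypothetical stand-ins, not the daemon's real types: the point is only the control flow, where a command that finds no reserved `PendingRoot` slot but still has a resolved family key is appended directly as a `ReadyCommand`.

```rust
use std::collections::{HashMap, VecDeque};

// Stand-in sequencer entry: either a reserved slot awaiting its command,
// or a command ready to drain.
#[derive(Debug, PartialEq)]
enum Entry {
    PendingRoot,
    ReadyCommand(String),
}

struct FamilySequencer {
    queues: HashMap<String, VecDeque<Entry>>,
}

impl FamilySequencer {
    // Normal path: fill the reserved PendingRoot slot in place.
    fn replace_pending_root_entry(&mut self, family: &str, cmd: String) -> bool {
        let Some(queue) = self.queues.get_mut(family) else {
            return false;
        };
        if let Some(slot) = queue.iter_mut().find(|e| **e == Entry::PendingRoot) {
            *slot = Entry::ReadyCommand(cmd);
            return true;
        }
        false
    }

    fn register_command(&mut self, family: &str, cmd: String) {
        if !self.replace_pending_root_entry(family, cmd.clone()) {
            // Late registration (the `current_dir` case, where trace2
            // carried no worktree hint and no slot was reserved): append
            // directly so deferred capture and drain ordering survive.
            self.queues
                .entry(family.to_string())
                .or_default()
                .push_back(Entry::ReadyCommand(cmd));
        }
    }
}
```

Either way the command ends up in the family queue, instead of falling through to the no-op "applied" path that skipped the deferred snapshot capture.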
…test harness gap

- Drain worker: create `Notified` future *before* the drain phase so any
  `notify_one()` fired during drain stores a Tokio permit and is not lost.
  The `notified.await` at the bottom of the loop consumes it immediately,
  triggering an extra drain pass rather than sleeping indefinitely.
- Remove dead code: `has_log` and `was_snapshot_none` variables that were
  assigned but immediately silenced with `let _ = var` after debug logging
  was removed in a prior commit.
- Test harness: `sync_queued_checkpoint_if_needed` now also waits when the
  output contains "Checkpoint dispatched" (multi-repo mix of queued/completed
  repos). Both "Checkpoint queued" and "Checkpoint dispatched" contain
  "queued" as a substring, so the check is simplified to `.contains("queued")`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
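The permit semantics the drain-worker fix relies on can be modeled without tokio. With `tokio::sync::Notify`, creating the `Notified` future before the drain means a `notify_one()` fired during the drain stores a permit that the later `await` consumes immediately. The std-only `Permit` type below is a hypothetical analogue (a boolean flag plus a `Condvar`), sketched to show the same store-then-consume behavior.

```rust
use std::sync::{Arc, Condvar, Mutex};

// One-permit notifier: notify_one() stores a permit even when nobody is
// waiting yet; wait() consumes it, returning immediately if it is stored.
struct Permit {
    flag: Mutex<bool>,
    cv: Condvar,
}

impl Permit {
    fn new() -> Self {
        Permit { flag: Mutex::new(false), cv: Condvar::new() }
    }

    fn notify_one(&self) {
        // Store the permit so a notification fired "during the drain"
        // is not lost before the worker reaches its wait.
        *self.flag.lock().unwrap() = true;
        self.cv.notify_one();
    }

    fn wait(&self) {
        let mut stored = self.flag.lock().unwrap();
        while !*stored {
            stored = self.cv.wait(stored).unwrap();
        }
        *stored = false; // consume the permit
    }
}
```

Without the stored permit (plain `Condvar::notify_one` with no flag), a notification that fires before the wait begins is simply dropped, which is exactly the lost-wakeup the commit fixes by creating the `Notified` future up front.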
- Wrap long `modified_after_cutoff` assignment to satisfy rustfmt
- Collapse nested `if is_none() { if let Some(...) { } }` into a single
  `if is_none() && let Some(...) = ...` to satisfy clippy::collapsible_if
- Apply `cargo fmt` to catch any remaining formatting drift

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two parallel-attribution tests were flaky because the async drain
worker could process a checkpoint AFTER the git commit subprocess
finished.  At that point the file is committed/clean, so
resolve_explicit_path_execution sees status_entry.is_none() and skips
it — no AI attribution is recorded.

Fix by waiting for all checkpoint completions (filtered to
primary_command == "checkpoint") before calling git commit:

* TestRepo: add daemon_checkpoint_completion_count() and
  wait_for_daemon_checkpoint_completion_count() that filter the
  completion log by primary_command == "checkpoint".

* WorkdirRaceHarness: add checkpoint_completion_count() and
  wait_for_checkpoint_completion_count() with the same filter.

* spawn_worktree_ai_stream: accept pre_stream_checkpoint_baseline and
  expected_new_checkpoint_completions; after the write/add loop, wait
  for baseline + expected checkpoints before committing.

* concurrent_checkpoint test: capture checkpoint_baseline before
  threads, then wait for baseline + file_count checkpoints between
  git add and git commit.

* parallel_worktree_streams test: both workers use baseline +
  2*file_count so BOTH wait for ALL 16 checkpoints from both workers
  before either commits, preventing the shared-counter race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
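The filtered barrier these helpers implement can be sketched as follows. `Entry`, `checkpoint_completion_count`, and the `read_log` closure are stand-ins for the harness's real completion-log accessors, not the actual code: the sketch only shows the two pieces the commit describes, counting entries filtered to `primary_command == "checkpoint"` and polling until that count reaches a target.

```rust
use std::time::{Duration, Instant};

// Stand-in for one completion-log record.
struct Entry {
    primary_command: String,
}

// Count only checkpoint completions, ignoring command entries that would
// otherwise advance a raw total count prematurely.
fn checkpoint_completion_count(entries: &[Entry]) -> usize {
    entries
        .iter()
        .filter(|e| e.primary_command == "checkpoint")
        .count()
}

// Poll the log until the checkpoint count reaches `target`, or time out.
fn wait_for_checkpoint_completion_count(
    read_log: impl Fn() -> Vec<Entry>,
    target: usize,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        if checkpoint_completion_count(&read_log()) >= target {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(format!("checkpoint count never reached {target}"));
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```

Callers capture `checkpoint_completion_count(...)` as a baseline before spawning work, then wait for `baseline + expected` before running `git commit`, which is the pattern the bullets above apply to both harness types.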
…ents

- Remove WorkdirRaceHarness::completion_count and
  wait_for_completion_count which were never called (dead_code becomes
  a hard error under -D warnings in CI).
- Add #[allow(clippy::too_many_arguments)] on spawn_worktree_ai_stream
  which now takes 8 parameters after adding the checkpoint-baseline
  pair needed for the race-condition fix.
- rustfmt is satisfied by removing the multi-line chain in the now-
  deleted completion_count method.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
devin-ai-integration bot left a comment


Devin Review found 1 potential issue.


Comment on lines +5419 to +5421
```rust
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}
```

🔴 get_or_create_drain_notifier returns stale notifier for a dead drain worker, silently losing notifications

If a per-family drain worker exits unexpectedly (e.g., side_effect_exec_lock returns Err at src/daemon.rs:5450-5452 due to a poisoned std::sync::Mutex), its Arc<Notify> clone is dropped, bringing the strong_count of the map entry down to 1. However, get_or_create_drain_notifier only checks whether an entry exists in the map — it does not verify the worker is still alive. Subsequent callers receive the stale Arc<Notify> and call notify_one(), but no worker is running to consume the permit. All sequencer entries for that family (commands and checkpoints) are silently stuck until gc_stale_family_state runs (every 500 trace payloads) and removes the dead entry, after which the next call spawns a fresh worker.

For wait=true checkpoint requests, this means the oneshot::Receiver returned to the control-socket handler will block indefinitely until GC happens to run and a subsequent event for the same family triggers a new worker spawn — potentially causing long hangs.

Suggested change

```rust
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}
```

```rust
if let Some(existing) = map.get(family) {
    if Arc::strong_count(existing) > 1 {
        return Ok(existing.clone());
    }
    // The drain worker for this family has exited (only the map holds
    // a reference). Remove the stale entry and fall through to spawn
    // a new worker.
    map.remove(family);
}
```

svarlamov and others added 2 commits April 13, 2026 01:49
…ent burst tests

Same race as the concurrent/parallel tests: the async drain may process
a checkpoint AFTER git commit finishes, at which point the file is
committed/clean and resolve_explicit_path_execution skips it.

- high_throughput_ai_commit_burst: capture checkpoint_baseline before
  the 16-file loop, wait for baseline + 16 checkpoint completions
  before git commit.
- concurrent_worktree_burst: capture checkpoint_baseline before the
  10-file-per-worker loop, wait for baseline + 20 completions (both
  workers combined) before either worker commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orking_dir

In daemon mode, configure_git_ai_env sets GIT_AI_DAEMON_CHECKPOINT_DELEGATE=true,
which causes checkpoints to be queued to the daemon asynchronously.
git_ai_from_working_dir is used for cross-repo and cross-CWD scenarios
where the checkpoint may target a *different* repo's family than self
(e.g. nested subrepos).  Waiting on self's family completion log after
an async delegate would either time out (wrong family) or race.

Fix: remove the delegate flag immediately after configure_git_ai_env so
the checkpoint runs synchronously in the subprocess, guaranteeing the
working log is written before the function returns.  This restores
correct behavior for all cross-repo attribution tests that check the
working log immediately after the checkpoint call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +5454 to +5470
```rust
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```

🟡 Drain worker holds exec_lock during notified.await, blocking the lock indefinitely between drain passes

The drain worker loop acquires _guard = exec_lock.lock().await at line 5454 and then waits on notified.await at line 5470 — but _guard lives until the end of the loop iteration (line 5471), so the side_effect_exec_lock for this family is held for the entire duration the worker sleeps waiting for a notification. Currently the drain worker appears to be the sole acquirer of this lock (the two previous callers in replace_pending_root_entry and append_checkpoint_to_family_sequencer were changed to no longer acquire it), so this doesn't cause a deadlock today. However, side_effect_exec_lock() is still a public method and any future code that acquires it for the same family will block indefinitely until the drain worker happens to receive a notification. The fix is to drop the guard before awaiting: drop(_guard); before notified.await;.

Suggested change

```rust
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```

```rust
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
drop(_guard);
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```

…n wrapper-daemon

In wrapper-daemon mode, apply_default_config_patch sets async_mode=true
in the feature flags.  checkpoint_delegation_enabled() checks async_mode
FIRST, so env_remove("GIT_AI_DAEMON_CHECKPOINT_DELEGATE") had no effect
-- the checkpoint was still delegated asynchronously.

Add GIT_AI_CHECKPOINT_FORCE_SYNC env var to checkpoint_delegation_enabled()
that takes priority over both async_mode and the delegate env var.  Set
this in git_ai_from_working_dir (replacing the previous env_remove call)
so cross-repo and subrepo tests always get a synchronous checkpoint
regardless of which test mode is active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
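The priority order this commit describes can be sketched in isolation. The parameterized function below is a hypothetical model of `checkpoint_delegation_enabled` (it takes the env-var values as arguments instead of reading the process environment, so the sketch stays testable): force-sync wins over `async_mode`, which wins over the delegate env var.

```rust
// Hypothetical model of the described decision, with env values passed
// in as parameters rather than read via std::env.
fn checkpoint_delegation_enabled(
    force_sync: Option<&str>, // GIT_AI_CHECKPOINT_FORCE_SYNC
    async_mode: bool,         // feature flag set by apply_default_config_patch
    delegate: Option<&str>,   // GIT_AI_DAEMON_CHECKPOINT_DELEGATE
) -> bool {
    // Highest priority: force-sync always disables delegation, which is
    // why env_remove on the delegate var alone was not enough.
    if force_sync.is_some() {
        return false;
    }
    // async_mode is checked before the delegate env var.
    if async_mode {
        return true;
    }
    delegate == Some("true")
}
```

With only `env_remove` on the delegate var, the `async_mode` branch still returned true in wrapper-daemon mode; the force-sync override short-circuits ahead of both.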
devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +1687 to +1691
```rust
if let Some(error_entry) = entries
    .iter()
    .skip(baseline_count as usize)
    .find(|entry| entry.status == "error")
{
```

🟡 Error-detection skip() in wait_for_daemon_checkpoint_completion_count uses checkpoint count to skip over all-type entries

In wait_for_daemon_checkpoint_completion_count, baseline_count is the number of checkpoint-specific entries (from daemon_checkpoint_completion_count()), but entries returned by daemon_completion_entries_for_family() contains entries of all types (commands, checkpoints, etc.). Using skip(baseline_count as usize) on the full entry list skips fewer entries than intended when non-checkpoint entries precede the baseline checkpoints. This can expose pre-existing error entries to the find(|entry| entry.status == "error") check, potentially causing a false panic on an old unrelated error.

For example, if at baseline there are 5 checkpoint entries and 5 non-checkpoint entries (10 total), skip(5) leaves 5 entries visible — some of which may be pre-baseline non-checkpoint errors. The analogous wait_for_daemon_total_completion_count (test_repo.rs:1605) correctly uses a total-entry baseline for its skip().

Prompt for agents
In wait_for_daemon_checkpoint_completion_count (test_repo.rs around line 1676), the error-scanning skip(baseline_count as usize) uses the checkpoint-specific baseline count to skip entries in the full (all-type) entry list. This is inconsistent with wait_for_daemon_total_completion_count which correctly uses a total count for skipping.

To fix: either (a) capture the total entry count at the start and use that for the skip, or (b) filter the entries for checkpoints before applying the skip and error check, or (c) simply remove the skip and scan all entries for errors (since pre-existing errors would have already panicked in prior waits).

Approach (a) is most aligned with the existing wait_for_daemon_total_completion_count pattern. You would need to pass a separate total_baseline alongside the checkpoint baseline, or compute the total count internally.
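Approach (b) can be sketched with a minimal stand-in `Entry` type (hypothetical, not the harness's real struct): filtering to checkpoint entries before applying the skip makes the skip count and the list being skipped agree, so pre-baseline non-checkpoint errors can no longer leak into the scan.

```rust
// Stand-in completion-log record for illustration.
struct Entry {
    primary_command: &'static str,
    status: &'static str,
}

// Filter first, then skip the checkpoint-specific baseline: only
// checkpoint entries added after the baseline are scanned for errors.
fn first_new_checkpoint_error<'a>(
    entries: &'a [Entry],
    checkpoint_baseline: usize,
) -> Option<&'a Entry> {
    entries
        .iter()
        .filter(|e| e.primary_command == "checkpoint")
        .skip(checkpoint_baseline)
        .find(|e| e.status == "error")
}
```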

svarlamov and others added 3 commits April 13, 2026 04:02
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_needed

Replace daemon_total_completion_count / wait_for_daemon_total_completion_count
with daemon_checkpoint_completion_count / wait_for_daemon_checkpoint_completion_count
in the checkpoint sync barrier.

Previously, the barrier could return prematurely when a command completion entry
(not a checkpoint entry) advanced the total count to baseline+1 before the actual
async checkpoint was processed by the daemon. This caused intermittent attribution
failures in wrapper-daemon mode tests such as
squash_merge::test_prepare_working_log_squash_multiple_sessions_in_worktree,
where a subsequent commit was processed before the preceding checkpoint had written
its working-log update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracked ones

In all three wait-for-daemon-completion helpers
(wait_for_daemon_completion_count, wait_for_daemon_completion_sessions,
wait_for_daemon_checkpoint_completion_count), the idle-progress timer was
only reset when a *tracked* or *checkpoint* entry appeared in the log.

In wrapper-daemon mode, an untracked operation (e.g. git-add) can sit at
the front of the FamilySequencer, blocking the tracked checkout/checkpoint
behind it from being processed. While the daemon is busy draining that
git-add entry, no tracked entries appear, so the idle timer fires after
20 s — even though the daemon is actively making progress.

Fix by also tracking the raw total-entry count: any new completion log
entry (tracked or not) now resets last_progress. The functions still only
*return* when the specific condition is satisfied; this change only makes
the idle-timeout fence more lenient toward daemon activity on loaded CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
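The lenient idle fence this commit describes can be sketched as a polling loop. The counter closures below are hypothetical stand-ins for the real completion-log accessors: the wait still only returns when the tracked condition holds, but any change in the raw total resets the idle timer, so an untracked `git-add` draining at the head of the sequencer no longer trips the 20 s fence.

```rust
use std::time::{Duration, Instant};

// Return Ok once tracked_count reaches `target`; give up only after the
// daemon has produced no new completion entries of ANY kind for
// `idle_timeout` (the "idle fence").
fn wait_with_idle_fence(
    target: u64,
    tracked_count: impl Fn() -> u64,
    total_count: impl Fn() -> u64,
    idle_timeout: Duration,
) -> Result<(), String> {
    let mut last_total = total_count();
    let mut last_progress = Instant::now();
    loop {
        if tracked_count() >= target {
            return Ok(());
        }
        let total = total_count();
        if total != last_total {
            // The daemon is still making progress (e.g. draining an
            // untracked git-add), so this does not count as idle time.
            last_total = total;
            last_progress = Instant::now();
        }
        if last_progress.elapsed() >= idle_timeout {
            return Err("daemon idle past timeout".to_string());
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```

Before the fix, only tracked entries reset `last_progress`, so a busy-but-untracked daemon looked idle; the raw-total reset keeps the fence a liveness check rather than a progress check.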