fix(daemon): eliminate flakiness in daemon and wrapper-daemon test suites (#1064)
Conversation
…ites

Three root causes addressed:

1. **Snapshot timestamp bug (trace_normalizer.rs)**: `payload_timestamp_ns` was calling `Value::as_u64` on the `time` field, but git trace2 sends `time` as an RFC3339 string. This fell through to `t_abs` (elapsed seconds since program start, ~2–3 ms), making `finished_at_ns` nearly zero. The mtime-cutoff filter in `read_worktree_snapshot_for_files_at_or_before` then treated every file as "modified after cutoff", producing empty carryover snapshots and causing amend-authorship attribution to silently drop all working-log data. Fix: parse `time` as RFC3339 via chrono before falling back to `t_abs`.

2. **Missing pending-root slot for `current_dir` invocations (daemon.rs)**: Commands run via `current_dir` (not the `-C` flag) have no `-C` in their argv, so the git trace2 `start` event carries `cwd=None` → no worktree hint → no `PendingRoot` slot → `replace_pending_root_entry` returned `None` → the command was processed via the "applied" (no-op) path with no deferred snapshot capture. Subdirectory amend tests failed because the carryover snapshot was never captured. Fix: when `replace_pending_root_entry` finds no slot but the command has a resolved `family_key`, insert it directly as a `ReadyCommand` into the family sequencer (late registration), preserving deferred capture and correct drain ordering.

3. **Async drain races captured-checkpoint tests (test harness)**: Fix 2's async per-family drain workers made captured checkpoints (delegated with `wait=false`) truly fire-and-forget. Tests that called `git_ai checkpoint mock_ai <files>` and immediately read `current_working_logs()` raced against the daemon processing the queued checkpoint. The production code path is correct; the test harness needed to wait for completion.

Fix: `git_ai_with_env` and `git_ai_with_stdin` now capture the daemon's total completion-log count before running a checkpoint command, then wait for the count to advance if the output contains "Checkpoint queued" (the marker the production code already prints for async-delegated checkpoints). Live checkpoints already block until complete, so no waiting is needed for them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
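The timestamp bug in root cause 1 can be reproduced with a std-only sketch. The toy `Value` enum below stands in for serde_json's type, and the field names and epoch arithmetic are illustrative, not the real `trace_normalizer.rs` code; it shows why `as_u64` on an RFC3339 string silently falls through to a near-zero `t_abs`-derived cutoff that rejects every file:

```rust
use std::collections::HashMap;

enum Value {
    U64(u64),
    Str(String),
}

impl Value {
    // Mirrors serde_json's Value::as_u64: None for anything but an integer.
    fn as_u64(&self) -> Option<u64> {
        match self {
            Value::U64(n) => Some(*n),
            Value::Str(_) => None,
        }
    }
}

fn main() {
    // git trace2 sends `time` as an RFC3339 string, not an integer.
    let time = Value::Str("2024-06-01T12:00:00.000000Z".to_string());

    // Buggy path: as_u64 on a string is None, so the code fell through to
    // t_abs (elapsed seconds since program start, here ~2 ms).
    let t_abs_secs = 0.002_f64;
    let finished_at_ns = time.as_u64().unwrap_or((t_abs_secs * 1e9) as u64);
    assert_eq!(finished_at_ns, 2_000_000); // ~2 ms after the epoch, not 2024

    // With a near-zero cutoff, every real file mtime is "after cutoff",
    // so the carryover snapshot comes back empty.
    let mtimes_ns: HashMap<&str, u64> = HashMap::from([
        ("a.rs", 1_717_243_200_000_000_000),
        ("b.rs", 1_717_243_201_000_000_000),
    ]);
    let kept: Vec<&str> = mtimes_ns
        .iter()
        .filter(|&(_, &m)| m <= finished_at_ns) // keep files modified at/before cutoff
        .map(|(&p, _)| p)
        .collect();
    assert!(kept.is_empty());
    println!("cutoff {} ns -> snapshot keeps {} files", finished_at_ns, kept.len());
}
```

The fix parses the string path first (via chrono in the real code), so `finished_at_ns` lands in the correct epoch range and the mtime filter behaves.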
…test harness gap
- Drain worker: create `Notified` future *before* the drain phase so any
`notify_one()` fired during drain stores a Tokio permit and is not lost.
The `notified.await` at the bottom of the loop consumes it immediately,
triggering an extra drain pass rather than sleeping indefinitely.
- Remove dead code: `has_log` and `was_snapshot_none` variables that were
assigned but immediately silenced with `let _ = var` after debug logging
was removed in a prior commit.
- Test harness: `sync_queued_checkpoint_if_needed` now also waits when the
output contains "Checkpoint dispatched" (multi-repo mix of queued/completed
repos). Both "Checkpoint queued" and "Checkpoint dispatched" contain
"queued" as a substring, so the check is simplified to `.contains("queued")`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
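The first bullet relies on tokio's one-permit `Notify` semantics: a notification fired while nobody is waiting is stored and consumed by the next wait. A minimal std stand-in (Condvar plus a permit flag; not the real tokio API) makes the "create the future before the drain" reasoning concrete:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// One-permit notifier mimicking tokio::sync::Notify's behaviour: a notify
// fired before anyone waits is stored as a permit, not lost.
struct PermitNotify {
    permit: Mutex<bool>,
    cv: Condvar,
}

impl PermitNotify {
    fn new() -> Self {
        PermitNotify { permit: Mutex::new(false), cv: Condvar::new() }
    }
    fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
        self.cv.notify_one();
    }
    fn wait(&self) {
        let mut permit = self.permit.lock().unwrap();
        while !*permit {
            permit = self.cv.wait(permit).unwrap();
        }
        *permit = false; // consume the stored permit
    }
}

fn main() {
    let n = Arc::new(PermitNotify::new());
    let n2 = Arc::clone(&n);

    // "During the drain": the notification fires before anyone is waiting.
    let t = thread::spawn(move || n2.notify_one());
    t.join().unwrap();

    // The later wait consumes the stored permit and returns immediately,
    // triggering an extra drain pass instead of sleeping indefinitely.
    n.wait();
    println!("permit consumed; extra drain pass would run");
}
```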
- Wrap long `modified_after_cutoff` assignment to satisfy rustfmt
- Collapse nested `if is_none() { if let Some(...) { } }` into a single
`if is_none() && let Some(...) = ...` to satisfy clippy::collapsible_if
- Apply `cargo fmt` to catch any remaining formatting drift
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two parallel-attribution tests were flaky because the async drain worker could process a checkpoint AFTER the git commit subprocess finished. At that point the file is committed/clean, so resolve_explicit_path_execution sees status_entry.is_none() and skips it — no AI attribution is recorded. Fix by waiting for all checkpoint completions (filtered to primary_command == "checkpoint") before calling git commit:

* TestRepo: add daemon_checkpoint_completion_count() and wait_for_daemon_checkpoint_completion_count() that filter the completion log by primary_command == "checkpoint".
* WorkdirRaceHarness: add checkpoint_completion_count() and wait_for_checkpoint_completion_count() with the same filter.
* spawn_worktree_ai_stream: accept pre_stream_checkpoint_baseline and expected_new_checkpoint_completions; after the write/add loop, wait for baseline + expected checkpoints before committing.
* concurrent_checkpoint test: capture checkpoint_baseline before threads, then wait for baseline + file_count checkpoints between git add and git commit.
* parallel_worktree_streams test: both workers use baseline + 2*file_count so BOTH wait for ALL 16 checkpoints from both workers before either commits, preventing the shared-counter race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
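The barrier pattern used by these helpers can be sketched with std primitives. Names and the polling interval are illustrative (the real helpers read the daemon's completion log filtered by `primary_command`); the point is the shape: capture a baseline, then block until the count reaches baseline plus the expected checkpoints before running `git commit`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Poll a completion count until it reaches `target` or the deadline passes.
fn wait_for_count(count: &AtomicUsize, target: usize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if count.load(Ordering::SeqCst) >= target {
            return true;
        }
        thread::sleep(Duration::from_millis(5));
    }
    false
}

fn main() {
    let completions = Arc::new(AtomicUsize::new(0));
    let baseline = completions.load(Ordering::SeqCst);
    let file_count = 3;

    // Simulated async drain worker processing queued checkpoints.
    let c = Arc::clone(&completions);
    let worker = thread::spawn(move || {
        for _ in 0..3 {
            thread::sleep(Duration::from_millis(10));
            c.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The fix: block until baseline + file_count checkpoints have completed
    // BEFORE `git commit`, so the files are still dirty when drained.
    assert!(wait_for_count(&completions, baseline + file_count, Duration::from_secs(5)));
    worker.join().unwrap();
    println!("all {} checkpoints completed before commit", file_count);
}
```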
…ents

- Remove WorkdirRaceHarness::completion_count and wait_for_completion_count which were never called (dead_code becomes a hard error under -D warnings in CI).
- Add #[allow(clippy::too_many_arguments)] on spawn_worktree_ai_stream which now takes 8 parameters after adding the checkpoint-baseline pair needed for the race-condition fix.
- rustfmt is satisfied by removing the multi-line chain in the now-deleted completion_count method.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```rust
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}
```
🔴 get_or_create_drain_notifier returns stale notifier for a dead drain worker, silently losing notifications
If a per-family drain worker exits unexpectedly (e.g., side_effect_exec_lock returns Err at src/daemon.rs:5450-5452 due to a poisoned std::sync::Mutex), its Arc<Notify> clone is dropped, bringing the strong_count of the map entry down to 1. However, get_or_create_drain_notifier only checks whether an entry exists in the map — it does not verify the worker is still alive. Subsequent callers receive the stale Arc<Notify> and call notify_one(), but no worker is running to consume the permit. All sequencer entries for that family (commands and checkpoints) are silently stuck until gc_stale_family_state runs (every 500 trace payloads) and removes the dead entry, after which the next call spawns a fresh worker.
For wait=true checkpoint requests, this means the oneshot::Receiver returned to the control-socket handler will block indefinitely until GC happens to run and a subsequent event for the same family triggers a new worker spawn — potentially causing long hangs.
Suggested change:

```rust
// Before
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}

// After
if let Some(existing) = map.get(family) {
    if Arc::strong_count(existing) > 1 {
        return Ok(existing.clone());
    }
    // The drain worker for this family has exited (only the map holds
    // a reference). Remove the stale entry and fall through to spawn
    // a new worker.
    map.remove(family);
}
```
…ent burst tests

Same race as the concurrent/parallel tests: the async drain may process a checkpoint AFTER git commit finishes, at which point the file is committed/clean and resolve_explicit_path_execution skips it.

- high_throughput_ai_commit_burst: capture checkpoint_baseline before the 16-file loop, wait for baseline + 16 checkpoint completions before git commit.
- concurrent_worktree_burst: capture checkpoint_baseline before the 10-file-per-worker loop, wait for baseline + 20 completions (both workers combined) before either worker commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orking_dir

In daemon mode, configure_git_ai_env sets GIT_AI_DAEMON_CHECKPOINT_DELEGATE=true, which causes checkpoints to be queued to the daemon asynchronously. git_ai_from_working_dir is used for cross-repo and cross-CWD scenarios where the checkpoint may target a *different* repo's family than self (e.g. nested subrepos). Waiting on self's family completion log after an async delegate would either time out (wrong family) or race.

Fix: remove the delegate flag immediately after configure_git_ai_env so the checkpoint runs synchronously in the subprocess, guaranteeing the working log is written before the function returns. This restores correct behavior for all cross-repo attribution tests that check the working log immediately after the checkpoint call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```rust
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```
🟡 Drain worker holds exec_lock during notified.await, blocking the lock indefinitely between drain passes
The drain worker loop acquires _guard = exec_lock.lock().await at line 5454 and then waits on notified.await at line 5470 — but _guard lives until the end of the loop iteration (line 5471), so the side_effect_exec_lock for this family is held for the entire duration the worker sleeps waiting for a notification. Currently the drain worker appears to be the sole acquirer of this lock (the two previous callers in replace_pending_root_entry and append_checkpoint_to_family_sequencer were changed to no longer acquire it), so this doesn't cause a deadlock today. However, side_effect_exec_lock() is still a public method and any future code that acquires it for the same family will block indefinitely until the drain worker happens to receive a notification. The fix is to drop the guard before awaiting: drop(_guard); before notified.await;.
Suggested change:

```rust
// Before
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;

// After
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
drop(_guard);
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```
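The essence of the suggested fix, releasing the guard before the blocking wait so other acquirers are not starved, can be shown with a std mutex (a sketch, not the async tokio code; the drain step is elided):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let exec_lock = Arc::new(Mutex::new(()));

    let guard = exec_lock.lock().unwrap();
    // ... the drain pass would run here while holding the lock ...
    drop(guard); // release BEFORE the blocking wait (the review's fix)

    // With the guard dropped, another thread can take the lock while this
    // worker would be parked on `notified.await`.
    let other = Arc::clone(&exec_lock);
    let acquired = thread::spawn(move || other.try_lock().is_ok())
        .join()
        .unwrap();
    assert!(acquired);
    println!("exec lock free while worker waits for next notification");
}
```

Without the `drop`, the guard would live to the end of the loop iteration and `try_lock` from any other acquirer would fail for as long as the worker sleeps.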
…n wrapper-daemon
In wrapper-daemon mode, apply_default_config_patch sets async_mode=true
in the feature flags. checkpoint_delegation_enabled() checks async_mode
FIRST, so env_remove("GIT_AI_DAEMON_CHECKPOINT_DELEGATE") had no effect
-- the checkpoint was still delegated asynchronously.
Add GIT_AI_CHECKPOINT_FORCE_SYNC env var to checkpoint_delegation_enabled()
that takes priority over both async_mode and the delegate env var. Set
this in git_ai_from_working_dir (replacing the previous env_remove call)
so cross-repo and subrepo tests always get a synchronous checkpoint
regardless of which test mode is active.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
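The precedence this commit describes can be sketched as a pure function. The env reads are modeled as plain `Option` parameters (an assumption for testability; the real `checkpoint_delegation_enabled` reads process env and feature flags):

```rust
// Force-sync wins over both the async_mode feature flag and the
// delegate env var; async_mode is otherwise checked first, which is
// why env_remove of the delegate var alone had no effect.
fn checkpoint_delegation_enabled(
    force_sync: Option<&str>,   // GIT_AI_CHECKPOINT_FORCE_SYNC
    async_mode: bool,           // wrapper-daemon feature flag
    delegate: Option<&str>,     // GIT_AI_DAEMON_CHECKPOINT_DELEGATE
) -> bool {
    if force_sync.is_some() {
        return false; // forced synchronous: never delegate
    }
    if async_mode {
        return true; // checked before the delegate env var
    }
    delegate == Some("true")
}

fn main() {
    // wrapper-daemon mode: async_mode alone enables delegation, so
    // removing the delegate var does not help.
    assert!(checkpoint_delegation_enabled(None, true, None));

    // The new override forces a synchronous checkpoint in every mode.
    assert!(!checkpoint_delegation_enabled(Some("1"), true, Some("true")));
    assert!(!checkpoint_delegation_enabled(Some("1"), false, None));

    // daemon mode without the override: the delegate var still works.
    assert!(checkpoint_delegation_enabled(None, false, Some("true")));
    println!("force-sync overrides async_mode and the delegate flag");
}
```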
```rust
if let Some(error_entry) = entries
    .iter()
    .skip(baseline_count as usize)
    .find(|entry| entry.status == "error")
{
```
🟡 Error-detection skip() in wait_for_daemon_checkpoint_completion_count uses checkpoint count to skip over all-type entries
In wait_for_daemon_checkpoint_completion_count, baseline_count is the number of checkpoint-specific entries (from daemon_checkpoint_completion_count()), but entries returned by daemon_completion_entries_for_family() contains entries of all types (commands, checkpoints, etc.). Using skip(baseline_count as usize) on the full entry list skips fewer entries than intended when non-checkpoint entries precede the baseline checkpoints. This can expose pre-existing error entries to the find(|entry| entry.status == "error") check, potentially causing a false panic on an old unrelated error.
For example, if at baseline there are 5 checkpoint entries and 5 non-checkpoint entries (10 total), skip(5) leaves 5 entries visible — some of which may be pre-baseline non-checkpoint errors. The analogous wait_for_daemon_total_completion_count (test_repo.rs:1605) correctly uses a total-entry baseline for its skip().
Prompt for agents
In wait_for_daemon_checkpoint_completion_count (test_repo.rs around line 1676), the error-scanning skip(baseline_count as usize) uses the checkpoint-specific baseline count to skip entries in the full (all-type) entry list. This is inconsistent with wait_for_daemon_total_completion_count which correctly uses a total count for skipping.
To fix: either (a) capture the total entry count at the start and use that for the skip, or (b) filter the entries for checkpoints before applying the skip and error check, or (c) simply remove the skip and scan all entries for errors (since pre-existing errors would have already panicked in prior waits).
Approach (a) is most aligned with the existing wait_for_daemon_total_completion_count pattern. You would need to pass a separate total_baseline alongside the checkpoint baseline, or compute the total count internally.
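The mismatch is easy to demonstrate with toy data. The `Entry` struct and field names below are illustrative; the sketch shows the checkpoint-count skip leaking a pre-baseline error, and fix (b) from the review (filter before skipping) suppressing it:

```rust
// Minimal stand-in for a completion-log entry.
struct Entry {
    primary_command: &'static str,
    status: &'static str,
}

fn main() {
    // Baseline state: 2 checkpoints plus 2 non-checkpoint entries, one of
    // which errored long before this wait started; one new checkpoint
    // arrives after the baseline.
    let entries = vec![
        Entry { primary_command: "checkpoint", status: "ok" },
        Entry { primary_command: "checkpoint", status: "ok" },
        Entry { primary_command: "checkout", status: "error" }, // pre-baseline
        Entry { primary_command: "add", status: "ok" },         // pre-baseline
        Entry { primary_command: "checkpoint", status: "ok" },  // post-baseline
    ];
    let checkpoint_baseline = 2; // checkpoint-only count at baseline

    // Buggy: skipping the ALL-type list by the checkpoint-only count
    // skips too few entries and exposes the old checkout error.
    let leaked = entries
        .iter()
        .skip(checkpoint_baseline)
        .any(|e| e.status == "error");
    assert!(leaked); // would trigger a false panic

    // Fix (b): filter to checkpoint entries before skipping.
    let fixed = entries
        .iter()
        .filter(|e| e.primary_command == "checkpoint")
        .skip(checkpoint_baseline)
        .any(|e| e.status == "error");
    assert!(!fixed);
    println!("filter-then-skip sees only post-baseline checkpoints");
}
```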
…_needed

Replace daemon_total_completion_count / wait_for_daemon_total_completion_count with daemon_checkpoint_completion_count / wait_for_daemon_checkpoint_completion_count in the checkpoint sync barrier.

Previously, the barrier could return prematurely when a command completion entry (not a checkpoint entry) advanced the total count to baseline+1 before the actual async checkpoint was processed by the daemon. This caused intermittent attribution failures in wrapper-daemon mode tests such as squash_merge::test_prepare_working_log_squash_multiple_sessions_in_worktree, where a subsequent commit was processed before the preceding checkpoint had written its working-log update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracked ones

In all three wait-for-daemon-completion helpers (wait_for_daemon_completion_count, wait_for_daemon_completion_sessions, wait_for_daemon_checkpoint_completion_count), the idle-progress timer was only reset when a *tracked* or *checkpoint* entry appeared in the log. In wrapper-daemon mode, an untracked operation (e.g. git-add) can sit at the front of the FamilySequencer, blocking the tracked checkout/checkpoint behind it from being processed. While the daemon is busy draining that git-add entry, no tracked entries appear, so the idle timer fires after 20 s — even though the daemon is actively making progress.

Fix by also tracking the raw total-entry count: any new completion log entry (tracked or not) now resets last_progress. The functions still only *return* when the specific condition is satisfied; this change only makes the idle-timeout fence more lenient toward daemon activity on loaded CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
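The two-counter idle fence can be sketched with simulated polls. The function and parameter names are illustrative (the real helpers poll the daemon's completion log on a wall-clock timeout); the sketch shows that untracked-only progress keeps the timer alive while the return condition stays strict:

```rust
// One (total_entries, tracked_entries) pair per poll. Returns true once
// `tracked` reaches the target; returns false if `idle_limit` consecutive
// polls pass with NO growth in the total count.
fn wait_tracked(
    samples: &[(usize, usize)],
    target_tracked: usize,
    idle_limit: usize,
) -> bool {
    let mut last_total = samples[0].0;
    let mut idle = 0;
    for &(total, tracked) in samples {
        if tracked >= target_tracked {
            return true; // only the specific condition satisfies the wait
        }
        if total > last_total {
            last_total = total;
            idle = 0; // ANY entry (tracked or not) now counts as progress
        } else {
            idle += 1;
        }
        if idle > idle_limit {
            return false; // idle fence fired: daemon truly stalled
        }
    }
    false
}

fn main() {
    // Untracked git-add entries keep arriving, then the checkpoint lands.
    let samples = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)];
    assert!(wait_tracked(&samples, 1, 1));
    // Under the old rule (only tracked entries reset the timer), the same
    // trace would have hit the idle fence before the checkpoint appeared.
    println!("idle fence survives untracked-only progress");
}
```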
Summary

Three root causes of daemon/wrapper-daemon test flakiness fixed:

1. **Snapshot timestamp bug**: `payload_timestamp_ns` in `trace_normalizer.rs` was calling `Value::as_u64` on git trace2's `time` field, which is an RFC3339 string. It fell through to `t_abs` (elapsed seconds since program start, ~2–3 ms), making `finished_at_ns` nearly zero. The mtime-cutoff filter then treated every file as "modified after cutoff", producing empty carryover snapshots and silently dropping amend-authorship working-log data. Fix: parse `time` as RFC3339 via chrono.

2. **Missing pending-root slot for `current_dir` invocations**: Commands run via `current_dir` (not `-C`) have no `-C` in argv, so the git trace2 `start` event carries `cwd=None` → no worktree hint → no `PendingRoot` slot → command was processed via the no-op "applied" path with no deferred snapshot capture. Subdirectory amend tests failed because the carryover snapshot was never captured. Fix: when `replace_pending_root_entry` finds no slot but the command has a resolved `family_key`, insert it directly as a `ReadyCommand` (late registration).

3. **Async drain races captured-checkpoint tests**: The async per-family drain workers made captured checkpoints (delegated with `wait=false`) truly fire-and-forget. Tests that called `git_ai checkpoint mock_ai <files>` and immediately read `current_working_logs()` raced against the daemon. The production code path is correct. Fix in the test harness only: `git_ai_with_env` and `git_ai_with_stdin` now capture the daemon's total completion-log baseline before running a checkpoint, then wait for it to advance if the output contains "Checkpoint queued" (the existing production marker for async-delegated checkpoints).

Test plan

- `status_ignore` and `subdirs` amend tests pass in daemon and wrapper-daemon modes

🤖 Generated with Claude Code