fix(daemon): eliminate flakiness in daemon and wrapper-daemon test suites (#1064)
Conversation
…ites

Three root causes addressed:

1. **Snapshot timestamp bug (trace_normalizer.rs)**: `payload_timestamp_ns` was calling `Value::as_u64` on the `time` field, but git trace2 sends `time` as an RFC3339 string. This fell through to `t_abs` (elapsed seconds since program start, ~2–3 ms), making `finished_at_ns` nearly zero. The mtime-cutoff filter in `read_worktree_snapshot_for_files_at_or_before` then treated every file as "modified after cutoff", producing empty carryover snapshots and causing amend-authorship attribution to silently drop all working-log data. Fix: parse `time` as RFC3339 via chrono before falling back to `t_abs`.

2. **Missing pending-root slot for `current_dir` invocations (daemon.rs)**: Commands run via `current_dir` (not the `-C` flag) have no `-C` in their argv, so the git trace2 `start` event carries `cwd=None` → no worktree hint → no `PendingRoot` slot → `replace_pending_root_entry` returned `None` → the command was processed via the "applied" (no-op) path with no deferred snapshot capture. Subdirectory amend tests failed because the carryover snapshot was never captured. Fix: when `replace_pending_root_entry` finds no slot but the command has a resolved `family_key`, insert it directly as a `ReadyCommand` into the family sequencer (late registration), preserving deferred capture and correct drain ordering.

3. **Async drain races captured-checkpoint tests (test harness)**: Fix 2's async per-family drain workers made captured checkpoints (delegated with `wait=false`) truly fire-and-forget. Tests that called `git_ai checkpoint mock_ai <files>` and immediately read `current_working_logs()` raced against the daemon processing the queued checkpoint. The production code path is correct; the test harness needed to wait for completion.

Fix: `git_ai_with_env` and `git_ai_with_stdin` now capture the daemon's total completion-log count before running a checkpoint command, then wait for the count to advance if the output contains "Checkpoint queued" (the marker the production code already prints for async-delegated checkpoints). Live checkpoints already block until complete, so no waiting is needed for them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
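The timestamp bug in root cause 1 can be reproduced with a std-only sketch. The toy `Value` enum below stands in for serde_json's type, and the field names and epoch arithmetic are illustrative, not the real `trace_normalizer.rs` code; it shows why `as_u64` on an RFC3339 string silently falls through to a near-zero `t_abs`-derived cutoff that rejects every file:

```rust
use std::collections::HashMap;

enum Value {
    U64(u64),
    Str(String),
}

impl Value {
    // Mirrors serde_json's Value::as_u64: None for anything but an integer.
    fn as_u64(&self) -> Option<u64> {
        match self {
            Value::U64(n) => Some(*n),
            Value::Str(_) => None,
        }
    }
}

fn main() {
    // git trace2 sends `time` as an RFC3339 string, not an integer.
    let time = Value::Str("2024-06-01T12:00:00.000000Z".to_string());

    // Buggy path: as_u64 on a string is None, so the code fell through to
    // t_abs (elapsed seconds since program start, here ~2 ms).
    let t_abs_secs = 0.002_f64;
    let finished_at_ns = time.as_u64().unwrap_or((t_abs_secs * 1e9) as u64);
    assert_eq!(finished_at_ns, 2_000_000); // ~2 ms after the epoch, not 2024

    // With a near-zero cutoff, every real file mtime is "after cutoff",
    // so the carryover snapshot comes back empty.
    let mtimes_ns: HashMap<&str, u64> = HashMap::from([
        ("a.rs", 1_717_243_200_000_000_000),
        ("b.rs", 1_717_243_201_000_000_000),
    ]);
    let kept: Vec<&str> = mtimes_ns
        .iter()
        .filter(|&(_, &m)| m <= finished_at_ns) // keep files modified at/before cutoff
        .map(|(&p, _)| p)
        .collect();
    assert!(kept.is_empty());
    println!("cutoff {} ns -> snapshot keeps {} files", finished_at_ns, kept.len());
}
```

The fix parses the string path first (via chrono in the real code), so `finished_at_ns` lands in the correct epoch range and the mtime filter behaves.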
…test harness gap
- Drain worker: create `Notified` future *before* the drain phase so any
`notify_one()` fired during drain stores a Tokio permit and is not lost.
The `notified.await` at the bottom of the loop consumes it immediately,
triggering an extra drain pass rather than sleeping indefinitely.
- Remove dead code: `has_log` and `was_snapshot_none` variables that were
assigned but immediately silenced with `let _ = var` after debug logging
was removed in a prior commit.
- Test harness: `sync_queued_checkpoint_if_needed` now also waits when the
output contains "Checkpoint dispatched" (multi-repo mix of queued/completed
repos). Both "Checkpoint queued" and "Checkpoint dispatched" contain
"queued" as a substring, so the check is simplified to `.contains("queued")`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
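The first bullet relies on tokio's one-permit `Notify` semantics: a notification fired while nobody is waiting is stored and consumed by the next wait. A minimal std stand-in (Condvar plus a permit flag; not the real tokio API) makes the "create the future before the drain" reasoning concrete:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// One-permit notifier mimicking tokio::sync::Notify's behaviour: a notify
// fired before anyone waits is stored as a permit, not lost.
struct PermitNotify {
    permit: Mutex<bool>,
    cv: Condvar,
}

impl PermitNotify {
    fn new() -> Self {
        PermitNotify { permit: Mutex::new(false), cv: Condvar::new() }
    }
    fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
        self.cv.notify_one();
    }
    fn wait(&self) {
        let mut permit = self.permit.lock().unwrap();
        while !*permit {
            permit = self.cv.wait(permit).unwrap();
        }
        *permit = false; // consume the stored permit
    }
}

fn main() {
    let n = Arc::new(PermitNotify::new());
    let n2 = Arc::clone(&n);

    // "During the drain": the notification fires before anyone is waiting.
    let t = thread::spawn(move || n2.notify_one());
    t.join().unwrap();

    // The later wait consumes the stored permit and returns immediately,
    // triggering an extra drain pass instead of sleeping indefinitely.
    n.wait();
    println!("permit consumed; extra drain pass would run");
}
```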
- Wrap long `modified_after_cutoff` assignment to satisfy rustfmt
- Collapse nested `if is_none() { if let Some(...) { } }` into a single
`if is_none() && let Some(...) = ...` to satisfy clippy::collapsible_if
- Apply `cargo fmt` to catch any remaining formatting drift
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two parallel-attribution tests were flaky because the async drain worker could process a checkpoint AFTER the git commit subprocess finished. At that point the file is committed/clean, so resolve_explicit_path_execution sees status_entry.is_none() and skips it — no AI attribution is recorded. Fix by waiting for all checkpoint completions (filtered to primary_command == "checkpoint") before calling git commit:

* TestRepo: add daemon_checkpoint_completion_count() and wait_for_daemon_checkpoint_completion_count() that filter the completion log by primary_command == "checkpoint".
* WorkdirRaceHarness: add checkpoint_completion_count() and wait_for_checkpoint_completion_count() with the same filter.
* spawn_worktree_ai_stream: accept pre_stream_checkpoint_baseline and expected_new_checkpoint_completions; after the write/add loop, wait for baseline + expected checkpoints before committing.
* concurrent_checkpoint test: capture checkpoint_baseline before threads, then wait for baseline + file_count checkpoints between git add and git commit.
* parallel_worktree_streams test: both workers use baseline + 2*file_count so BOTH wait for ALL 16 checkpoints from both workers before either commits, preventing the shared-counter race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
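The barrier pattern used by these helpers can be sketched with std primitives. Names and the polling interval are illustrative (the real helpers read the daemon's completion log filtered by `primary_command`); the point is the shape: capture a baseline, then block until the count reaches baseline plus the expected checkpoints before running `git commit`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Poll a completion count until it reaches `target` or the deadline passes.
fn wait_for_count(count: &AtomicUsize, target: usize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if count.load(Ordering::SeqCst) >= target {
            return true;
        }
        thread::sleep(Duration::from_millis(5));
    }
    false
}

fn main() {
    let completions = Arc::new(AtomicUsize::new(0));
    let baseline = completions.load(Ordering::SeqCst);
    let file_count = 3;

    // Simulated async drain worker processing queued checkpoints.
    let c = Arc::clone(&completions);
    let worker = thread::spawn(move || {
        for _ in 0..3 {
            thread::sleep(Duration::from_millis(10));
            c.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The fix: block until baseline + file_count checkpoints have completed
    // BEFORE `git commit`, so the files are still dirty when drained.
    assert!(wait_for_count(&completions, baseline + file_count, Duration::from_secs(5)));
    worker.join().unwrap();
    println!("all {} checkpoints completed before commit", file_count);
}
```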
…ents

- Remove WorkdirRaceHarness::completion_count and wait_for_completion_count which were never called (dead_code becomes a hard error under -D warnings in CI).
- Add #[allow(clippy::too_many_arguments)] on spawn_worktree_ai_stream which now takes 8 parameters after adding the checkpoint-baseline pair needed for the race-condition fix.
- rustfmt is satisfied by removing the multi-line chain in the now-deleted completion_count method.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```rust
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}
```
🔴 get_or_create_drain_notifier returns stale notifier for a dead drain worker, silently losing notifications
If a per-family drain worker exits unexpectedly (e.g., side_effect_exec_lock returns Err at src/daemon.rs:5450-5452 due to a poisoned std::sync::Mutex), its Arc<Notify> clone is dropped, bringing the strong_count of the map entry down to 1. However, get_or_create_drain_notifier only checks whether an entry exists in the map — it does not verify the worker is still alive. Subsequent callers receive the stale Arc<Notify> and call notify_one(), but no worker is running to consume the permit. All sequencer entries for that family (commands and checkpoints) are silently stuck until gc_stale_family_state runs (every 500 trace payloads) and removes the dead entry, after which the next call spawns a fresh worker.
For wait=true checkpoint requests, this means the oneshot::Receiver returned to the control-socket handler will block indefinitely until GC happens to run and a subsequent event for the same family triggers a new worker spawn — potentially causing long hangs.
Suggested change:

```rust
// Before
if let Some(existing) = map.get(family) {
    return Ok(existing.clone());
}

// After
if let Some(existing) = map.get(family) {
    if Arc::strong_count(existing) > 1 {
        return Ok(existing.clone());
    }
    // The drain worker for this family has exited (only the map holds
    // a reference). Remove the stale entry and fall through to spawn
    // a new worker.
    map.remove(family);
}
```
…ent burst tests

Same race as the concurrent/parallel tests: the async drain may process a checkpoint AFTER git commit finishes, at which point the file is committed/clean and resolve_explicit_path_execution skips it.

- high_throughput_ai_commit_burst: capture checkpoint_baseline before the 16-file loop, wait for baseline + 16 checkpoint completions before git commit.
- concurrent_worktree_burst: capture checkpoint_baseline before the 10-file-per-worker loop, wait for baseline + 20 completions (both workers combined) before either worker commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orking_dir

In daemon mode, configure_git_ai_env sets GIT_AI_DAEMON_CHECKPOINT_DELEGATE=true, which causes checkpoints to be queued to the daemon asynchronously. git_ai_from_working_dir is used for cross-repo and cross-CWD scenarios where the checkpoint may target a *different* repo's family than self (e.g. nested subrepos). Waiting on self's family completion log after an async delegate would either time out (wrong family) or race.

Fix: remove the delegate flag immediately after configure_git_ai_env so the checkpoint runs synchronously in the subprocess, guaranteeing the working log is written before the function returns. This restores correct behavior for all cross-repo attribution tests that check the working log immediately after the checkpoint call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```rust
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```
🟡 Drain worker holds exec_lock during notified.await, blocking the lock indefinitely between drain passes
The drain worker loop acquires _guard = exec_lock.lock().await at line 5454 and then waits on notified.await at line 5470 — but _guard lives until the end of the loop iteration (line 5471), so the side_effect_exec_lock for this family is held for the entire duration the worker sleeps waiting for a notification. Currently the drain worker appears to be the sole acquirer of this lock (the two previous callers in replace_pending_root_entry and append_checkpoint_to_family_sequencer were changed to no longer acquire it), so this doesn't cause a deadlock today. However, side_effect_exec_lock() is still a public method and any future code that acquires it for the same family will block indefinitely until the drain worker happens to receive a notification. The fix is to drop the guard before awaiting: drop(_guard); before notified.await;.
Suggested change:

```rust
// Before
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;

// After
let _guard = exec_lock.lock().await;
if coordinator.is_shutting_down() {
    break;
}
if let Err(e) = coordinator
    .drain_ready_family_sequencer_entries_locked(&family_clone)
    .await
{
    debug_log(&format!(
        "drain worker error for family {}: {}",
        family_clone, e
    ));
}
drop(_guard);
// Wait for the next notification now that the drain pass is complete.
// Any `notify_one()` that fired during the drain will have stored a
// permit, so this await will return immediately in that case.
notified.await;
```
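The essence of the suggested fix, releasing the guard before the blocking wait so other acquirers are not starved, can be shown with a std mutex (a sketch, not the async tokio code; the drain step is elided):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let exec_lock = Arc::new(Mutex::new(()));

    let guard = exec_lock.lock().unwrap();
    // ... the drain pass would run here while holding the lock ...
    drop(guard); // release BEFORE the blocking wait (the review's fix)

    // With the guard dropped, another thread can take the lock while this
    // worker would be parked on `notified.await`.
    let other = Arc::clone(&exec_lock);
    let acquired = thread::spawn(move || other.try_lock().is_ok())
        .join()
        .unwrap();
    assert!(acquired);
    println!("exec lock free while worker waits for next notification");
}
```

Without the `drop`, the guard would live to the end of the loop iteration and `try_lock` from any other acquirer would fail for as long as the worker sleeps.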
…n wrapper-daemon
In wrapper-daemon mode, apply_default_config_patch sets async_mode=true
in the feature flags. checkpoint_delegation_enabled() checks async_mode
FIRST, so env_remove("GIT_AI_DAEMON_CHECKPOINT_DELEGATE") had no effect
-- the checkpoint was still delegated asynchronously.
Add GIT_AI_CHECKPOINT_FORCE_SYNC env var to checkpoint_delegation_enabled()
that takes priority over both async_mode and the delegate env var. Set
this in git_ai_from_working_dir (replacing the previous env_remove call)
so cross-repo and subrepo tests always get a synchronous checkpoint
regardless of which test mode is active.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
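The precedence this commit describes can be sketched as a pure function. The env reads are modeled as plain `Option` parameters (an assumption for testability; the real `checkpoint_delegation_enabled` reads process env and feature flags):

```rust
// Force-sync wins over both the async_mode feature flag and the
// delegate env var; async_mode is otherwise checked first, which is
// why env_remove of the delegate var alone had no effect.
fn checkpoint_delegation_enabled(
    force_sync: Option<&str>,   // GIT_AI_CHECKPOINT_FORCE_SYNC
    async_mode: bool,           // wrapper-daemon feature flag
    delegate: Option<&str>,     // GIT_AI_DAEMON_CHECKPOINT_DELEGATE
) -> bool {
    if force_sync.is_some() {
        return false; // forced synchronous: never delegate
    }
    if async_mode {
        return true; // checked before the delegate env var
    }
    delegate == Some("true")
}

fn main() {
    // wrapper-daemon mode: async_mode alone enables delegation, so
    // removing the delegate var does not help.
    assert!(checkpoint_delegation_enabled(None, true, None));

    // The new override forces a synchronous checkpoint in every mode.
    assert!(!checkpoint_delegation_enabled(Some("1"), true, Some("true")));
    assert!(!checkpoint_delegation_enabled(Some("1"), false, None));

    // daemon mode without the override: the delegate var still works.
    assert!(checkpoint_delegation_enabled(None, false, Some("true")));
    println!("force-sync overrides async_mode and the delegate flag");
}
```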
```rust
if let Some(error_entry) = entries
    .iter()
    .skip(baseline_count as usize)
    .find(|entry| entry.status == "error")
{
```
🟡 Error-detection skip() in wait_for_daemon_checkpoint_completion_count uses checkpoint count to skip over all-type entries
In wait_for_daemon_checkpoint_completion_count, baseline_count is the number of checkpoint-specific entries (from daemon_checkpoint_completion_count()), but entries returned by daemon_completion_entries_for_family() contains entries of all types (commands, checkpoints, etc.). Using skip(baseline_count as usize) on the full entry list skips fewer entries than intended when non-checkpoint entries precede the baseline checkpoints. This can expose pre-existing error entries to the find(|entry| entry.status == "error") check, potentially causing a false panic on an old unrelated error.
For example, if at baseline there are 5 checkpoint entries and 5 non-checkpoint entries (10 total), skip(5) leaves 5 entries visible — some of which may be pre-baseline non-checkpoint errors. The analogous wait_for_daemon_total_completion_count (test_repo.rs:1605) correctly uses a total-entry baseline for its skip().
Prompt for agents
In wait_for_daemon_checkpoint_completion_count (test_repo.rs around line 1676), the error-scanning skip(baseline_count as usize) uses the checkpoint-specific baseline count to skip entries in the full (all-type) entry list. This is inconsistent with wait_for_daemon_total_completion_count which correctly uses a total count for skipping.
To fix: either (a) capture the total entry count at the start and use that for the skip, or (b) filter the entries for checkpoints before applying the skip and error check, or (c) simply remove the skip and scan all entries for errors (since pre-existing errors would have already panicked in prior waits).
Approach (a) is most aligned with the existing wait_for_daemon_total_completion_count pattern. You would need to pass a separate total_baseline alongside the checkpoint baseline, or compute the total count internally.
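The mismatch is easy to demonstrate with toy data. The `Entry` struct and field names below are illustrative; the sketch shows the checkpoint-count skip leaking a pre-baseline error, and fix (b) from the review (filter before skipping) suppressing it:

```rust
// Minimal stand-in for a completion-log entry.
struct Entry {
    primary_command: &'static str,
    status: &'static str,
}

fn main() {
    // Baseline state: 2 checkpoints plus 2 non-checkpoint entries, one of
    // which errored long before this wait started; one new checkpoint
    // arrives after the baseline.
    let entries = vec![
        Entry { primary_command: "checkpoint", status: "ok" },
        Entry { primary_command: "checkpoint", status: "ok" },
        Entry { primary_command: "checkout", status: "error" }, // pre-baseline
        Entry { primary_command: "add", status: "ok" },         // pre-baseline
        Entry { primary_command: "checkpoint", status: "ok" },  // post-baseline
    ];
    let checkpoint_baseline = 2; // checkpoint-only count at baseline

    // Buggy: skipping the ALL-type list by the checkpoint-only count
    // skips too few entries and exposes the old checkout error.
    let leaked = entries
        .iter()
        .skip(checkpoint_baseline)
        .any(|e| e.status == "error");
    assert!(leaked); // would trigger a false panic

    // Fix (b): filter to checkpoint entries before skipping.
    let fixed = entries
        .iter()
        .filter(|e| e.primary_command == "checkpoint")
        .skip(checkpoint_baseline)
        .any(|e| e.status == "error");
    assert!(!fixed);
    println!("filter-then-skip sees only post-baseline checkpoints");
}
```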
…_needed

Replace daemon_total_completion_count / wait_for_daemon_total_completion_count with daemon_checkpoint_completion_count / wait_for_daemon_checkpoint_completion_count in the checkpoint sync barrier.

Previously, the barrier could return prematurely when a command completion entry (not a checkpoint entry) advanced the total count to baseline+1 before the actual async checkpoint was processed by the daemon. This caused intermittent attribution failures in wrapper-daemon mode tests such as squash_merge::test_prepare_working_log_squash_multiple_sessions_in_worktree, where a subsequent commit was processed before the preceding checkpoint had written its working-log update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracked ones

In all three wait-for-daemon-completion helpers (wait_for_daemon_completion_count, wait_for_daemon_completion_sessions, wait_for_daemon_checkpoint_completion_count), the idle-progress timer was only reset when a *tracked* or *checkpoint* entry appeared in the log. In wrapper-daemon mode, an untracked operation (e.g. git-add) can sit at the front of the FamilySequencer, blocking the tracked checkout/checkpoint behind it from being processed. While the daemon is busy draining that git-add entry, no tracked entries appear, so the idle timer fires after 20 s — even though the daemon is actively making progress.

Fix by also tracking the raw total-entry count: any new completion log entry (tracked or not) now resets last_progress. The functions still only *return* when the specific condition is satisfied; this change only makes the idle-timeout fence more lenient toward daemon activity on loaded CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
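The two-counter idle fence can be sketched with simulated polls. The function and parameter names are illustrative (the real helpers poll the daemon's completion log on a wall-clock timeout); the sketch shows that untracked-only progress keeps the timer alive while the return condition stays strict:

```rust
// One (total_entries, tracked_entries) pair per poll. Returns true once
// `tracked` reaches the target; returns false if `idle_limit` consecutive
// polls pass with NO growth in the total count.
fn wait_tracked(
    samples: &[(usize, usize)],
    target_tracked: usize,
    idle_limit: usize,
) -> bool {
    let mut last_total = samples[0].0;
    let mut idle = 0;
    for &(total, tracked) in samples {
        if tracked >= target_tracked {
            return true; // only the specific condition satisfies the wait
        }
        if total > last_total {
            last_total = total;
            idle = 0; // ANY entry (tracked or not) now counts as progress
        } else {
            idle += 1;
        }
        if idle > idle_limit {
            return false; // idle fence fired: daemon truly stalled
        }
    }
    false
}

fn main() {
    // Untracked git-add entries keep arriving, then the checkpoint lands.
    let samples = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)];
    assert!(wait_tracked(&samples, 1, 1));
    // Under the old rule (only tracked entries reset the timer), the same
    // trace would have hit the idle fence before the checkpoint appeared.
    println!("idle fence survives untracked-only progress");
}
```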
Summary

Three root causes of daemon/wrapper-daemon test flakiness fixed:

1. **Snapshot timestamp bug**: `payload_timestamp_ns` in `trace_normalizer.rs` was calling `Value::as_u64` on git trace2's `time` field, which is an RFC3339 string. It fell through to `t_abs` (elapsed seconds since program start, ~2–3 ms), making `finished_at_ns` nearly zero. The mtime-cutoff filter then treated every file as "modified after cutoff", producing empty carryover snapshots and silently dropping amend-authorship working-log data. Fix: parse `time` as RFC3339 via chrono.

2. **Missing pending-root slot for `current_dir` invocations**: Commands run via `current_dir` (not `-C`) have no `-C` in argv, so the git trace2 `start` event carries `cwd=None` → no worktree hint → no `PendingRoot` slot → command was processed via the no-op "applied" path with no deferred snapshot capture. Subdirectory amend tests failed because the carryover snapshot was never captured. Fix: when `replace_pending_root_entry` finds no slot but the command has a resolved `family_key`, insert it directly as a `ReadyCommand` (late registration).

3. **Async drain races captured-checkpoint tests**: The async per-family drain workers made captured checkpoints (delegated with `wait=false`) truly fire-and-forget. Tests that called `git_ai checkpoint mock_ai <files>` and immediately read `current_working_logs()` raced against the daemon. The production code path is correct. Fix in the test harness only: `git_ai_with_env` and `git_ai_with_stdin` now capture the daemon's total completion-log baseline before running a checkpoint, then wait for it to advance if the output contains "Checkpoint queued" (the existing production marker for async-delegated checkpoints).

Test plan

- `status_ignore` and `subdirs` amend tests pass in daemon and wrapper-daemon modes

🤖 Generated with Claude Code