
bug: token-aligned diff splits attribution mid-line, causing overrode boundary to drop subsequent AI lines from git note #995

@svarlamov

Summary

When a Human checkpoint computes its diff using build_token_aligned_diffs, the LCS algorithm can find a matching token in the middle of a line, placing the human attribution boundary 42+ chars into a line rather than at the line start. This mid-line split causes:

  • Line N (split line): has both AI chars (prefix) + human chars (suffix) → overrode is set → kept in line_attributions (visible in git note as "human overrode AI")
  • Lines N+1..M (subsequent new lines): covered only by human attribution → overrode = None → stripped by attributions_to_line_attributions_for_checkpoint

The git note ends up with a gap at lines N+1..M: those AI-written lines are effectively attributed to the human. Line N itself appears as human (with overrode), but the lines after it are silently dropped.

Reproduced from session 3c663e49-9c9f-4225-aea3-8efb93ab4471, commit 89cdae17, file tests/integration/rebase_realworld.rs lines 603–614.

Root Cause

In src/authorship/attribution_tracker.rs, attributions_to_line_attributions_for_checkpoint strips pure-human lines:

// src/authorship/attribution_tracker.rs ~line 2073
merged_line_authors.retain(|line_attr| {
    line_attr.author_id != CheckpointKind::Human.to_str() || line_attr.overrode.is_some()
});

overrode is only set when a line has both AI and human char-level attributions overlapping it (find_dominant_author_for_line_candidates). If the Human checkpoint's attribution boundary starts mid-line, the split line gets overrode but subsequent lines do not — they're pure human and get stripped.
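To make the effect of the filter concrete, here is a minimal, self-contained sketch. The `LineAttr` struct, the `HUMAN` constant, and the `strip_pure_human` helper are simplified stand-ins invented for illustration; only the predicate inside `retain` mirrors the snippet above, and the real types in attribution_tracker.rs may differ.

```rust
// Simplified stand-in for the real line attribution type (assumed shape).
#[derive(Debug, Clone, PartialEq)]
struct LineAttr {
    line: u32,
    author_id: String,
    overrode: Option<String>,
}

// Stand-in for CheckpointKind::Human.to_str() (assumed value).
const HUMAN: &str = "human";

// Same predicate as the retain() call in the issue: keep a line unless it is
// human-authored AND carries no overrode marker.
fn strip_pure_human(mut lines: Vec<LineAttr>) -> Vec<LineAttr> {
    lines.retain(|l| l.author_id != HUMAN || l.overrode.is_some());
    lines
}

fn main() {
    let ai = "36ee87f956a9e26f";
    let lines = vec![
        LineAttr { line: 602, author_id: ai.to_string(), overrode: None },
        // Split line: human-attributed, but overrode is set, so it survives.
        LineAttr { line: 603, author_id: HUMAN.to_string(), overrode: Some(ai.to_string()) },
        // Lines after the split: pure human, so they are stripped.
        LineAttr { line: 604, author_id: HUMAN.to_string(), overrode: None },
        LineAttr { line: 605, author_id: HUMAN.to_string(), overrode: None },
    ];
    let kept: Vec<u32> = strip_pure_human(lines).iter().map(|l| l.line).collect();
    println!("{:?}", kept); // prints [602, 603]
}
```

Lines 604 and 605 vanish from the result even though they sit directly after a line that recorded an override, which is the gap described in the summary.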

Why the boundary ends up mid-line

The Human checkpoint uses build_token_aligned_diffs (not force_split). The LCS token matching finds a common token between old content and new content. In the original session:

  • Old content (line 603 before Subagent A's edit): // sha4 = C5': all 5 files
  • New content (line 603 after Subagent A's edit): assert_blame_sample_at_commit(&repo, &chain[3], "users.py", ...

Both lines contain the token chain[3]. The LCS match anchors at chain[3], so the human attribution starts at char 23933 — which is 42 chars into line 603, not at the line boundary (char 23891).

Line 603 chars [23891, 23983):
  [23891, 23933) = "    assert_blame_sample_at_commit(&repo, &"  ← AI attributed
  [23933, 23983) = "chain[3], \"users.py\", ..."                ← Human attributed
                   ^--- split here (LCS token match at `chain[3]`)

Line 603: AI + Human overlap → overrode set → KEPT in note
Lines 604-614: only Human attribution → stripped → ABSENT from note
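The "42 chars into the line" figure is just the boundary offset minus the line start (23933 - 23891). The sketch below shows the same calculation on a small stand-in string; `line_and_column` is a hypothetical helper, not a function from the codebase, and it assumes byte offsets on ASCII text.

```rust
// Hypothetical helper: map a byte offset to (1-based line, 0-based column).
// Assumes ASCII content, so byte offsets and char offsets coincide.
fn line_and_column(text: &str, offset: usize) -> (usize, usize) {
    let prefix = &text.as_bytes()[..offset.min(text.len())];
    let line = prefix.iter().filter(|&&b| b == b'\n').count() + 1;
    let line_start = prefix
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (line, prefix.len() - line_start)
}

fn main() {
    // Stand-in file content; the second line mimics line 603 from the issue.
    let text = "fn f() {\n    assert_blame_sample_at_commit(&repo, &chain[3], \"users.py\");\n}\n";
    let boundary = text.find("chain[3]").expect("token present");
    let (line, col) = line_and_column(text, boundary);
    println!("boundary at line {line}, column {col}"); // line 2, column 42

    // Same arithmetic with the offsets reported in the issue.
    println!("{}", 23933 - 23891); // 42
}
```

A column greater than 0 means the attribution boundary is not line-aligned, which is the precondition for the split described above.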

Exact Data from the Real Bug

From the final char attributions in checkpoint #71 (blob ed766fcc60e13c3c):

AI:    {start: 23042, end: 23933, author: '36ee87f956a9e26f'}
Human: {start: 23933, end: 24521, author: 'human', ts: 1775504264429}

Line 603 = chars [23891, 23983)
  → straddles the AI/Human boundary at 23933
  → overrode is set for line 603

Lines 604-614 = chars [23983, 24521)
  → entirely within Human attribution [23933, 24521)
  → overrode = None → stripped by retain()

Final git note: 1-602, 615-8664 as AI → gap at 603-614 = human.
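The classification above can be checked with a few lines of half-open interval arithmetic. `Span`, `overlaps`, and `touched_by_both` are illustrative names, not the project's types; only the offsets come from the data quoted in this issue.

```rust
// Half-open char range [start, end); an illustrative type, not the real one.
#[derive(Debug, Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

// Two half-open ranges overlap when each starts before the other ends.
fn overlaps(a: Span, b: Span) -> bool {
    a.start < b.end && b.start < a.end
}

// A line can only get `overrode` when both AI and human chars fall inside it.
fn touched_by_both(line: Span, ai: Span, human: Span) -> bool {
    overlaps(line, ai) && overlaps(line, human)
}

fn main() {
    let ai = Span { start: 23042, end: 23933 };
    let human = Span { start: 23933, end: 24521 };

    let line_603 = Span { start: 23891, end: 23983 };
    let lines_604_614 = Span { start: 23983, end: 24521 };

    println!("603: {}", touched_by_both(line_603, ai, human)); // true
    println!("604-614: {}", touched_by_both(lines_604_614, ai, human)); // false
}
```

Only line 603 straddles the boundary at 23933; everything from 23983 onward lies entirely inside the human range, so no override can be recorded there.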

Reproduction

This bug is secondary to #994 (daemon race) — you need the Human checkpoint to form first. Once it does, construct a file where a prior AI-attributed line shares a token suffix with the new AI content being added:

# Old line 603 (in last AI checkpoint): "    // sha4 = C5': all 5 files changed"
# New line 603 (written by parallel subagent): "    assert_blame_sample_at_commit(&repo, &chain[3], ...)"
# Common LCS token: chain[3]
# → Human attribution starts mid-line 603 at chain[3]
# → Lines 604+ are pure human → stripped
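As a rough illustration of how token-level LCS can anchor mid-line, the sketch below runs a textbook LCS over a naive tokenization and reports where the first matched token sits in the new line. This is not build_token_aligned_diffs: the tokenizer, the `first_lcs_anchor` helper, and the old line used in `main` (chosen so that it visibly contains `chain[3]`) are all assumptions made for the example.

```rust
// Naive tokenizer (assumption): identifier-like runs and single punctuation
// characters, each paired with its byte offset. Whitespace is skipped.
fn tokenize(s: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in s.char_indices() {
        if c.is_alphanumeric() || c == '_' {
            if start.is_none() {
                start = Some(i);
            }
        } else {
            if let Some(st) = start.take() {
                out.push((st, &s[st..i]));
            }
            if !c.is_whitespace() {
                out.push((i, &s[i..i + c.len_utf8()]));
            }
        }
    }
    if let Some(st) = start {
        out.push((st, &s[st..]));
    }
    out
}

// Byte offset in `new` of the first token on an LCS path, if any.
fn first_lcs_anchor(old: &str, new: &str) -> Option<usize> {
    let (a, b) = (tokenize(old), tokenize(new));
    // dp[i][j] = LCS length of a[i..] and b[j..]
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            dp[i][j] = if a[i].1 == b[j].1 {
                dp[i + 1][j + 1] + 1
            } else {
                dp[i + 1][j].max(dp[i][j + 1])
            };
        }
    }
    // Walk the table until the first matched pair.
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].1 == b[j].1 {
            return Some(b[j].0);
        }
        if dp[i + 1][j] >= dp[i][j + 1] {
            i += 1;
        } else {
            j += 1;
        }
    }
    None
}

fn main() {
    // Hypothetical old line; the new line mimics line 603 from the issue.
    let old = "    let sha4 = chain[3];";
    let new = "    assert_blame_sample_at_commit(&repo, &chain[3], \"users.py\");";
    println!("{:?}", first_lcs_anchor(old, new)); // Some(42)
}
```

Everything before the anchor is treated as unmatched, so a boundary derived from it lands 42 chars into the line rather than at column 0.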

Affected Code

  • src/authorship/attribution_tracker.rs → attributions_to_line_attributions_for_checkpoint: retain filter (lines ~2073–2076)
  • src/authorship/attribution_tracker.rs → find_dominant_author_for_line_candidates: only sets overrode when both AI and human overlap the same line
  • src/authorship/attribution_tracker.rs → build_token_aligned_diffs: LCS can produce non-line-aligned boundaries

Potential Fix Directions

  1. Clamp attribution boundaries to line starts in build_token_aligned_diffs when the diff is for a Human checkpoint — ensuring human attribution never starts mid-line.
  2. In attributions_to_line_attributions_for_checkpoint, when a human line immediately follows a line with overrode, carry forward the overrode context so subsequent lines aren't silently stripped.
  3. In AI checkpoints, scan for human char attributions that have no overrode in adjacent lines and re-evaluate whether those chars should be reclaimed.
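Direction 1 could look roughly like the sketch below, which snaps a boundary offset back to the start of the line that contains it. `clamp_to_line_start` is a hypothetical helper, not existing code, and it assumes byte offsets into the new file content; where it would hook into build_token_aligned_diffs is left open.

```rust
// Hypothetical helper: move a byte offset back to the start of its line.
// Offsets past the end of the text are treated as the end of the text.
fn clamp_to_line_start(text: &str, offset: usize) -> usize {
    let prefix = &text.as_bytes()[..offset.min(text.len())];
    prefix
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1)
}

fn main() {
    // Stand-in content; the second line mimics line 603 from the issue.
    let text = "fn f() {\n    assert_blame_sample_at_commit(&repo, &chain[3], \"users.py\");\n}\n";

    // Mid-line boundary as an LCS token match might produce it.
    let boundary = text.find("chain[3]").expect("token present");
    let clamped = clamp_to_line_start(text, boundary);

    println!("raw boundary: {boundary}, clamped: {clamped}"); // 51 and 9
    // After clamping, the AI range would end and the human range would start
    // at `clamped`, so no line is shared between the two attributions.
}
```

Note that clamping makes the split line purely human as well, so on its own it changes which lines carry overrode; whether that interacts correctly with the retain filter would need to be checked against the real pipeline.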

Relationship to #994

These two bugs compound each other.

Without #994, this bug does not manifest in the observed session (no spurious Human checkpoint), so fixing #994 alone would fix the observed 12-line misattribution. This bug would still affect any intentional human edits that partially overlap AI-attributed content.
