
fix: de-duplicate identical editions in all_editions (#317) #319

Open
grossir wants to merge 2 commits into main from fix-317-duplicate-editions

@grossir grossir commented Jul 1, 2026

Contributor

What

Fixes #317. get_citations() was returning the same Edition object twice in all_editions, e.g.:

from eyecite import get_citations

c = get_citations("238 F.3d 273")[0]
len(c.all_editions)                     # 2 (was 1 pre-#305)
len(set(c.all_editions))                # 1  -> identical objects
c.all_editions[0] is c.all_editions[1]  # True

Root cause

The whitespace relaxation from #305 turns each period/space in a reporter regex into \s*. For a reporter with a whitespace-only variation like F. 3d, the variation regex becomes F\.\s*\s*3d, which also matches the zero-space text F.3d. So two extractors fire on the same span — the exact one and the variation one — and both register the same Edition object (one into exact_editions, one into variation_editions). CitationToken.merge only deduped within each list, never across them, so all_editions = exact + variation carried the duplicate. The S.W.2d custom regex hit the same bug independently (its regex omits $edition), predating #305.
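
The overlap can be reproduced with the standard library alone. This is a minimal sketch, not eyecite's code: `relax()` is a hypothetical stand-in for the #305 relaxation, written only to produce the two patterns quoted above.

```python
import re


def relax(reporter: str) -> str:
    """Hypothetical stand-in for the #305 relaxation (not eyecite's code).

    A period keeps its literal match and gains optional trailing
    whitespace; a space becomes optional whitespace.
    """
    parts = []
    for ch in reporter:
        if ch == ".":
            parts.append(r"\.\s*")
        elif ch == " ":
            parts.append(r"\s*")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


exact_pattern = relax("F.3d")       # F\.\s*3d
variation_pattern = relax("F. 3d")  # F\.\s*\s*3d

text = "F.3d"
# Both patterns match the zero-space text, so both extractors would fire.
exact_hit = re.fullmatch(exact_pattern, text) is not None
variation_hit = re.fullmatch(variation_pattern, text) is not None
```

Because `\s*` also matches the empty string, the variation pattern cannot tell `F. 3d` apart from `F.3d`, which is why both extractors end up registering the same Edition.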

Fix

  • New _dedupe_editions() helper: dedupes each list while preserving order, and drops any variation edition already present as an exact match.
  • Routed ResourceCitation.__post_init__, CitationToken.__post_init__, and CitationToken.merge through it.
  • Genuinely distinct editions (real ambiguity) are preserved.
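
The behaviour described in the list above can be sketched as follows. This is an illustration under assumptions, not the helper from the diff: the name `dedupe_editions_sketch`, its signature, and the use of plain strings in place of Edition objects are all hypothetical.

```python
from typing import Hashable, Sequence, Tuple


def dedupe_editions_sketch(
    exact: Sequence[Hashable],
    variation: Sequence[Hashable],
) -> Tuple[tuple, tuple]:
    """Hypothetical sketch of the dedup described in this PR.

    dict.fromkeys keeps the first occurrence of each item and preserves
    insertion order, so each list is deduped without reordering.
    """
    exact_out = tuple(dict.fromkeys(exact))
    already_exact = set(exact_out)
    # Drop any variation edition that is already present as an exact match.
    variation_out = tuple(
        e for e in dict.fromkeys(variation) if e not in already_exact
    )
    return exact_out, variation_out


# Same edition registered by both extractors: the variation copy is dropped.
same_exact, same_variation = dedupe_editions_sketch(["F.3d"], ["F.3d"])

# Two genuinely distinct editions sharing one reporter string both survive.
amb_exact, amb_variation = dedupe_editions_sketch(
    ["B.R. (edition A)", "B.R. (edition B)"], []
)
```

With these inputs, the exact + variation union holds one entry for the F.3d case and two for the ambiguous case.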

Impact

Fixes downstream false positives in callers that use len(citation.all_editions) to detect reporter ambiguity (e.g. CourtListener's update_casenames_wl_dataset, which was skipping every whitespace-variation reporter as "unable to disambiguate").
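
To illustrate the false positive, here is a hedged sketch of the kind of length-based check a caller might use. Both function names are hypothetical, strings stand in for Edition objects, and this is not CourtListener's actual code.

```python
from typing import Hashable, Sequence


def looks_ambiguous(all_editions: Sequence[Hashable]) -> bool:
    """Hypothetical length-based check: miscounts a duplicated edition."""
    return len(all_editions) > 1


def has_distinct_editions(all_editions: Sequence[Hashable]) -> bool:
    """Hypothetical defensive variant: counts distinct editions only."""
    return len(set(all_editions)) > 1


duplicated = ["F.3d", "F.3d"]  # the shape of all_editions before this fix
ambiguous = ["B.R. (edition A)", "B.R. (edition B)"]

false_positive = looks_ambiguous(duplicated) and not has_distinct_editions(duplicated)
real_ambiguity = has_distinct_editions(ambiguous)
```

With the dedup in place the two checks agree, so callers relying on `len(citation.all_editions)` need no change.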

Hash note: this changes the hash of law and journal citations that previously carried a duplicate edition. Case citations are unaffected — CaseCitation.__hash__ never depended on all_editions. Resolution grouping within a run is unchanged; only absolute hash values (never persisted) differ across versions.

Tests

  • New test_no_duplicate_editions — covers F.3d, U.S., the S.W.2d collapse, and B.R. (genuine ambiguity must survive dedup).
  • Updated test_reporter_tokenizer assertion.
  • Full suite: 54 tests pass.

🤖 Generated with Claude Code

grossir and others added 2 commits June 30, 2026 20:05
- Add _dedupe_editions() helper that order-preservingly dedupes each list
  and drops variation editions already present as exact matches, so a single
  Edition is never listed twice while genuinely distinct editions (real
  ambiguity, e.g. "B.R.") are preserved.
- Route ResourceCitation.__post_init__, CitationToken.__post_init__, and
  CitationToken.merge through the helper; all_editions is now the deduped
  exact + variation union.
- Root cause: the #305 whitespace relaxation makes a reporter's canonical
  regex and its whitespace-only variation regex (e.g. F.3d / F. 3d) both
  match the plain text, so the same Edition is registered as both an exact
  match and a variation. The S.W.2d custom regex hit this before #305 too.
- Fixes downstream false positives in callers that use len(all_editions) to
  detect reporter ambiguity (e.g. CourtListener).
- Add test_no_duplicate_editions (F.3d, U.S., S.W.2d collapse, and B.R.
  ambiguity preservation); update test_reporter_tokenizer assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@grossir grossir moved this to PRs to Review in Sprint (Case Law) Jul 1, 2026
@grossir grossir requested a review from quevon24 July 1, 2026 01:09

github-actions Bot commented Jul 1, 2026

Contributor

The Eyecite Report 👁️

Gains and Losses

There were 0 gains and 0 losses.

Total citations found: base 15988, PR 15988 (net +0).

Time Chart

[Image: time chart]

Generated Files

Base (main) Output
PR Output
Full Output CSV

Labels

None yet

Projects

Status: PRs to Review

Development

Successfully merging this pull request may close these issues.

all_editions contains duplicate (identical) editions for whitespace-spaced reporters since 2.7.7

2 participants