fix: de-duplicate identical editions in all_editions (#317) #319
Open
grossir wants to merge 2 commits into
- Add _dedupe_editions() helper that dedupes each list while preserving order and drops variation editions already present as exact matches, so a single Edition is never listed twice while genuinely distinct editions (real ambiguity, e.g. "B.R.") are preserved.
- Route ResourceCitation.__post_init__, CitationToken.__post_init__, and CitationToken.merge through the helper; all_editions is now the deduped exact + variation union.
- Root cause: the #305 whitespace relaxation makes a reporter's canonical regex and its whitespace-only variation regex (e.g. F.3d / F. 3d) both match the plain text, so the same Edition is registered as both an exact match and a variation. The S.W.2d custom regex hit this before #305 too.
- Fixes downstream false positives in callers that use len(all_editions) to detect reporter ambiguity (e.g. CourtListener).
- Add test_no_duplicate_editions (F.3d, U.S., S.W.2d collapse, and B.R. ambiguity preservation); update test_reporter_tokenizer assertion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
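The dedup rule in the commit message can be sketched in a few lines. This is only an illustration under stated assumptions, not the PR's actual code: the `Edition` class below is a hypothetical stand-in for eyecite's own type (only hashability matters), and `dedupe_editions` and the reporter names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edition:
    """Hypothetical stand-in for eyecite's Edition; frozen so it is hashable."""
    short_name: str
    reporter: str


def dedupe_editions(exact, variation):
    """Return (exact, variation) tuples with no edition listed twice.

    Each list is deduped in first-seen order, and any variation edition
    that already appears as an exact match is dropped.
    """
    # dict.fromkeys keeps insertion order, so this is an order-preserving dedup.
    exact_unique = list(dict.fromkeys(exact))
    seen = set(exact_unique)
    variation_unique = [e for e in dict.fromkeys(variation) if e not in seen]
    return tuple(exact_unique), tuple(variation_unique)


# Same edition registered as both exact and variation: collapses to one.
f3d = Edition("F.3d", "Hypothetical Reporter")
f3d_exact, f3d_variation = dedupe_editions([f3d], [f3d])
f3d_all = f3d_exact + f3d_variation

# Two distinct editions sharing an abbreviation: both must survive.
br_a = Edition("B.R.", "Hypothetical Reporter A")
br_b = Edition("B.R.", "Hypothetical Reporter B")
br_exact, br_variation = dedupe_editions([br_a, br_b], [br_a])
br_all = br_exact + br_variation
```

With this rule, a caller that treats `len(all_editions) > 1` as ambiguity sees one edition for the `F.3d` case and two for the `B.R.` case.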
The Eyecite Report 👁️
Gains and Losses
There were 0 gains and 0 losses. Total citations found: base 15988, PR 15988 (net +0).

What
Fixes #317.
get_citations() was returning the same Edition object twice in all_editions.

Root cause
The whitespace relaxation from #305 turns each period/space in a reporter regex into \s*. For a reporter with a whitespace-only variation like F. 3d, the variation regex becomes F\.\s*\s*3d, which also matches the zero-space text F.3d. So two extractors fire on the same span, the exact one and the variation one, and both register the same Edition object (one into exact_editions, one into variation_editions). CitationToken.merge only deduped within each list, never across them, so all_editions = exact + variation carried the duplicate. The S.W.2d custom regex hit the same bug independently (its regex omits $edition), predating #305.

Fix
- Add a _dedupe_editions() helper: it dedupes each list while preserving order and drops any variation edition already present as an exact match.
- Route ResourceCitation.__post_init__, CitationToken.__post_init__, and CitationToken.merge through it.

Impact
Fixes downstream false positives in callers that use len(citation.all_editions) to detect reporter ambiguity (e.g. CourtListener's update_casenames_wl_dataset, which was skipping every whitespace-variation reporter as "unable to disambiguate").

Hash note: this changes the hash of law and journal citations that previously carried a duplicate edition. Case citations are unaffected: CaseCitation.__hash__ never depended on all_editions. Resolution grouping within a run is unchanged; only absolute hash values (never persisted) differ across versions.

Tests
- Add test_no_duplicate_editions, which covers F.3d, U.S., the S.W.2d collapse, and B.R. (genuine ambiguity must survive dedup).
- Update the test_reporter_tokenizer assertion.

🤖 Generated with Claude Code
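The regex overlap described under the root cause can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration: `relax_whitespace` is a hypothetical helper reconstructed from the single regex quoted above (period becomes `\.\s*`, space becomes `\s*`), not eyecite's actual transformation, and the sample text is made up.

```python
import re


def relax_whitespace(reporter: str) -> str:
    """Hypothetical relaxation: allow optional whitespace after each period
    and in place of each literal space."""
    parts = []
    for ch in reporter:
        if ch == ".":
            parts.append(r"\.\s*")
        elif ch == " ":
            parts.append(r"\s*")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


exact_regex = relax_whitespace("F.3d")       # canonical form
variation_regex = relax_whitespace("F. 3d")  # whitespace-only variation

text = "123 F.3d 456"
exact_span = re.search(exact_regex, text).span()
variation_span = re.search(variation_regex, text).span()

# Both patterns match the same characters, so two extractors would fire
# on one span and each would register the same edition.
same_span = exact_span == variation_span
```

Because the two patterns are equivalent once whitespace is optional, deduping across the exact and variation lists addresses the symptom without changing which text is matched.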