refactor: decompose helpers.py into domain-focused modules #303

Open
Wolfvin wants to merge 1 commit into freelawproject:main from Wolfvin:refactor/split-helpers-into-domain-modules

@Wolfvin Wolfvin commented Jun 14, 2026

What was refactored and why

The 1146-line helpers.py was a god module containing 10+ public functions spanning 6 distinct domains: citation metadata extraction, case name finding, pin cite extraction, court matching, token pattern matching, and citation filtering. This made the file difficult to navigate, maintain, and test in isolation.

New module structure

| Module | Responsibility | Key functions |
|---|---|---|
| citation_metadata.py | Extract and attach metadata to citations | add_post_citation, add_pre_citation, add_law_metadata, add_journal_metadata, get_year |
| case_name.py | Find case names in plain text and HTML | find_case_name, find_case_name_in_html |
| pin_cite.py | Pin cite extraction | extract_pin_cite |
| court_matching.py | Court code lookup from citation strings | get_court_by_paren |
| token_matching.py | Token pattern matching utility | match_on_tokens |
| citation_filter.py | Citation filtering and disambiguation | filter_citations, disambiguate_reporters |

helpers.py is preserved as a thin re-export module for full backward compatibility. All existing imports (from eyecite.helpers import ...) continue to work unchanged.
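
A minimal sketch of the re-export pattern described above, not the PR's actual code. It builds throwaway in-memory modules so it runs without eyecite installed; the `demo_pkg` names and the stub function body are assumptions made purely for illustration.

```python
import sys
import types


def _stub_get_court_by_paren(paren_string):
    """Placeholder body (assumption); the real function lives in eyecite."""
    return paren_string.strip().lower()


# Hypothetical parent package so the dotted import below resolves.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # an (empty) __path__ marks the module as a package
sys.modules["demo_pkg"] = pkg

# Hypothetical domain module that owns the implementation.
court_matching = types.ModuleType("demo_pkg.court_matching")
court_matching.get_court_by_paren = _stub_get_court_by_paren
sys.modules["demo_pkg.court_matching"] = court_matching

# Thin facade: defines nothing itself and only re-exports names.
helpers = types.ModuleType("demo_pkg.helpers")
helpers.get_court_by_paren = court_matching.get_court_by_paren
helpers.__all__ = ["get_court_by_paren"]
sys.modules["demo_pkg.helpers"] = helpers

# The old import path still resolves to the very same function object.
from demo_pkg.helpers import get_court_by_paren

same_object = get_court_by_paren is court_matching.get_court_by_paren
```

In an on-disk package the facade would reduce to plain `from .court_matching import get_court_by_paren` lines plus an `__all__` list.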

Refactoring principles applied

  • DECOMPOSITION: 1146-line god module split into 6 focused modules (each < 200 lines)
  • COHESION: Each module groups functions that share a single domain
  • SINGLE RESPONSIBILITY: Each module has one clear purpose
  • REDUCE COUPLING: Deferred imports updated to point to specific new modules
  • BACKWARD COMPATIBILITY: All existing imports preserved via re-exports

Verification Results

This refactoring was verified using the Regrets regression testing framework with dual-truth verification:

Truth 1: Raw Output Comparison

All entry function outputs are identical before and after refactoring:

| Cluster | Input (sample) | Output Match |
|---|---|---|
| find-case-citation | "1 U.S. 1" | ✅ Identical |
| find-law-citation | "Mass. Gen. Laws ch. 1, § 2" | ✅ Identical |
| find-journal-citation | "1 Minn. L. Rev. 1" | ✅ Identical |
| find-supra-citation | "Adarand, supra, at 240" | ✅ Identical |
| find-id-citation | "1 U.S. 1. Id. at 2" | ✅ Identical |
| clean-html | "<p>1 U.S. 1</p>" | ✅ Identical |
| clean-whitespace | "1 U.S. 1" | ✅ Identical |
| clean-underscores | "1 U.S. ___ 1" | ✅ Identical |
| resolve-citations | "Foo v. Bar, 1 U.S. 1..." | ✅ Identical |
| get-year-validation | "1999" | ✅ Identical |
| get-court-by-paren | "2d Cir" | ✅ Identical |
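
As a rough illustration of what a before/after raw output check involves, the sketch below runs two implementations over the same samples and lists any input where they disagree. It does not use the Regrets API, and `legacy_get_year` and `refactored_get_year` are hypothetical stand-ins, not eyecite functions.

```python
import re


def legacy_get_year(text):
    """Hypothetical pre-refactor implementation (stand-in only)."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None


def refactored_get_year(text):
    """Hypothetical post-refactor implementation (stand-in only)."""
    found = re.findall(r"\b\d{4}\b", text)
    return int(found[0]) if found else None


def compare_raw_outputs(before, after, samples):
    """Return the sample inputs for which the two callables disagree."""
    return [s for s in samples if repr(before(s)) != repr(after(s))]


samples = ["1999", "(2d Cir. 2004)", "no year here"]
mismatches = compare_raw_outputs(legacy_get_year, refactored_get_year, samples)
```

An empty `mismatches` list corresponds to a table in which every row is marked identical.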

Truth 2: Fingerprint Comparison

All behavioral fingerprints match:

| Cluster | Fingerprint Before | Fingerprint After | Match |
|---|---|---|---|
| find-case-citation | 4vu1cs8 | 4vu1cs8 | ✅ |
| find-law-citation | 4iuaxdk | 4iuaxdk | ✅ |
| find-journal-citation | 1dphgtl | 1dphgtl | ✅ |
| find-supra-citation | 4i7izr5 | 4i7izr5 | ✅ |
| find-id-citation | 2xa6o4n | 2xa6o4n | ✅ |
| clean-html | 3a7rklb | 3a7rklb | ✅ |
| clean-whitespace | 4p1mzbe | 4p1mzbe | ✅ |
| clean-underscores | 3xk6cql | 3xk6cql | ✅ |
| resolve-citations | 4bfnrj3 | 4bfnrj3 | ✅ |
| get-year-validation | 1awb7q1 | 1awb7q1 | ✅ |
| get-court-by-paren | 104ugq6 | 104ugq6 | ✅ |
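
The PR does not say how Regrets derives its fingerprints, so the following is only a sketch of one plausible approach: serialize the output canonically, hash it, and shorten the digest. The choice of SHA-256, the 4-byte truncation, and the base-36 encoding are all assumptions.

```python
import hashlib
import json


def to_base36(number):
    """Encode a non-negative integer using digits 0-9 and a-z."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if number == 0:
        return "0"
    out = []
    while number:
        number, rem = divmod(number, 36)
        out.append(digits[rem])
    return "".join(reversed(out))


def fingerprint(output):
    """Short, deterministic digest of a JSON-serializable output."""
    # sort_keys makes the serialization independent of dict ordering.
    canonical = json.dumps(output, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Keep only the first 4 bytes so the result stays short (assumption).
    return to_base36(int.from_bytes(digest[:4], "big"))


before = fingerprint({"volume": "1", "reporter": "U.S.", "page": "1"})
after = fingerprint({"page": "1", "reporter": "U.S.", "volume": "1"})
```

Because the serialization is canonical, `before` and `after` are equal even though the two dictionaries were built in a different key order.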

Chain Hash Comparison

| Chain | Hash Before | Hash After | Match |
|---|---|---|---|
| full-citation-pipeline | 1nhozlg | 1nhozlg | ✅ |
| short-cite-resolution | 4egd5zg | 4egd5zg | ✅ |
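
One way to think about a chain hash is as a single digest folded over the ordered fingerprints of each step in a pipeline. The sketch below assumes that interpretation; it is not taken from Regrets, and the step values in the usage line are arbitrary examples.

```python
import hashlib


def chain_hash(step_fingerprints):
    """Fold an ordered list of fingerprints into one short hex digest."""
    running = hashlib.sha256()
    for fp in step_fingerprints:
        encoded = fp.encode("utf-8")
        # A length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        running.update(len(encoded).to_bytes(4, "big"))
        running.update(encoded)
    return running.hexdigest()[:12]


# Arbitrary example values standing in for per-step fingerprints.
pipeline = chain_hash(["step-one", "step-two", "step-three"])
```

Because every step feeds the same running digest, a change in any step, or in the order of steps, changes the final value.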

Additional Verification

  • Drift detection: 5 consecutive runs, zero drift on all clusters
  • Existing test suite: 20 non-hyperscan tests pass
  • Core functionality: all citation types (case, law, journal, supra, id) extracted correctly
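
The drift check in the list above can be pictured as calling the same function repeatedly on the same input and confirming that only one distinct result appears. This is a hedged sketch of that idea, with hypothetical `stable` and `unstable` functions; it is not how Regrets is known to implement it.

```python
import itertools


def detect_drift(func, sample, runs=5):
    """Call func repeatedly and return the set of distinct repr() results."""
    return {repr(func(sample)) for _ in range(runs)}


def stable(text):
    """Hypothetical deterministic function: same input, same output."""
    return sorted(set(text.split()))


_calls = itertools.count()


def unstable(text):
    """Hypothetical function whose result changes on every call."""
    return next(_calls)


stable_results = detect_drift(stable, "1 U.S. 1")
unstable_results = detect_drift(unstable, "1 U.S. 1")
```

Zero drift corresponds to a result set of size one; anything larger means the output varied between runs.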

Split the 1146-line helpers.py into 6 focused modules following
single-responsibility and domain-cohesion principles:

- citation_metadata.py: extract and attach metadata to citations
  (add_post_citation, add_pre_citation, add_law_metadata,
   add_journal_metadata, get_year, clean_pin_cite, process_parenthetical)

- case_name.py: find case names in plain text and HTML
  (find_case_name, find_case_name_in_html, plus all helper functions)

- pin_cite.py: pin cite extraction
  (extract_pin_cite)

- court_matching.py: court code lookup from citation strings
  (get_court_by_paren)

- token_matching.py: token pattern matching utility
  (match_on_tokens, MAX_MATCH_CHARS)

- citation_filter.py: citation filtering and disambiguation
  (filter_citations, disambiguate_reporters, overlapping_citations, joke_cite)

helpers.py preserved as thin re-export module for backward compatibility.
All existing imports (from eyecite.helpers import ...) continue to work.

Verified with Regrets regression testing:
- 11 clusters: all GREEN (find-case-citation, find-law-citation,
  find-journal-citation, find-supra-citation, find-id-citation,
  clean-html, clean-whitespace, clean-underscores, resolve-citations,
  get-year-validation, get-court-by-paren)
- 2 chains: all match (full-citation-pipeline, short-cite-resolution)
- Zero drift across 5 consecutive runs
- Raw output identical to pre-refactor baseline (Truth 1)
- All fingerprints match pre-refactor baseline (Truth 2)
- Existing test suite: 20 non-hyperscan tests pass
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@mlissner

Member

Thanks! Are you using eyecite for something or is this a drive-by PR from an AI or something?

A few immediate thoughts:

  • Looks like we need some linting work
  • Please sign the CLA
  • We don't need backwards compat
  • Pin cites can probably go in with citations

Thank you!
