refactor: decompose helpers.py into domain-focused modules by Wolfvin · Pull Request #303 · freelawproject/eyecite

Wolfvin · 2026-06-14T02:50:38Z

What was refactored and why

The 1146-line helpers.py was a god module containing 10+ public functions spanning 6 distinct domains: citation metadata extraction, case name finding, pin cite extraction, court matching, token pattern matching, and citation filtering. This made the file difficult to navigate, maintain, and test in isolation.

New module structure

| Module | Responsibility | Key functions |
| --- | --- | --- |
| `citation_metadata.py` | Extract and attach metadata to citations | `add_post_citation`, `add_pre_citation`, `add_law_metadata`, `add_journal_metadata`, `get_year` |
| `case_name.py` | Find case names in plain text and HTML | `find_case_name`, `find_case_name_in_html` |
| `pin_cite.py` | Pin cite extraction | `extract_pin_cite` |
| `court_matching.py` | Court code lookup from citation strings | `get_court_by_paren` |
| `token_matching.py` | Token pattern matching utility | `match_on_tokens` |
| `citation_filter.py` | Citation filtering and disambiguation | `filter_citations`, `disambiguate_reporters` |

helpers.py is preserved as a thin re-export module for full backward compatibility. All existing imports (from eyecite.helpers import ...) continue to work unchanged.
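As a rough illustration of the re-export pattern described here, the sketch below builds a stand-in package in memory and checks that a name imported through the shim is the same object as the one defined in the new module. The package name `demo_pkg` and the function body are hypothetical placeholders, not eyecite code; only the names `court_matching` and `get_court_by_paren` come from the table above.

```python
import sys
import types

# Hypothetical stand-in package; "demo_pkg" is not a real library.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # an empty __path__ marks the module as a package
sys.modules["demo_pkg"] = pkg

# Stand-in for one of the new domain modules.
court_matching = types.ModuleType("demo_pkg.court_matching")


def get_court_by_paren(paren_string: str) -> str:
    """Placeholder body only; the real lookup logic is not reproduced here."""
    return paren_string.strip().lower()


court_matching.get_court_by_paren = get_court_by_paren
sys.modules["demo_pkg.court_matching"] = court_matching

# What a thin re-export shim might contain: imports plus __all__, no logic.
SHIM_SOURCE = """
from demo_pkg.court_matching import get_court_by_paren

__all__ = ["get_court_by_paren"]
"""
helpers = types.ModuleType("demo_pkg.helpers")
exec(SHIM_SOURCE, helpers.__dict__)
sys.modules["demo_pkg.helpers"] = helpers

# Old-style and new-style imports resolve to the same function object.
from demo_pkg.helpers import get_court_by_paren as via_shim
from demo_pkg.court_matching import get_court_by_paren as via_new_module

assert via_shim is via_new_module
```

Because the shim holds no logic of its own, removing it later only requires updating import paths at the call sites.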

Refactoring principles applied

DECOMPOSITION: 1146-line god module split into 6 focused modules (each < 200 lines)
COHESION: Each module groups functions that share a single domain
SINGLE RESPONSIBILITY: Each module has one clear purpose
REDUCE COUPLING: Deferred imports updated to point to specific new modules
BACKWARD COMPATIBILITY: All existing imports preserved via re-exports

Verification Results

This refactoring was verified using the Regrets regression testing framework with dual-truth verification:

Truth 1: Raw Output Comparison

All entry function outputs are identical before and after refactoring:

| Cluster | Input (sample) | Output Match |
| --- | --- | --- |
| find-case-citation | "1 U.S. 1" | ✅ Identical |
| find-law-citation | "Mass. Gen. Laws ch. 1, § 2" | ✅ Identical |
| find-journal-citation | "1 Minn. L. Rev. 1" | ✅ Identical |
| find-supra-citation | "Adarand, supra, at 240" | ✅ Identical |
| find-id-citation | "1 U.S. 1. Id. at 2" | ✅ Identical |
| clean-html | "<p>1 U.S. 1</p>" | ✅ Identical |
| clean-whitespace | "1 U.S. 1" | ✅ Identical |
| clean-underscores | "1 U.S. ___ 1" | ✅ Identical |
| resolve-citations | "Foo v. Bar, 1 U.S. 1..." | ✅ Identical |
| get-year-validation | "1999" | ✅ Identical |
| get-court-by-paren | "2d Cir" | ✅ Identical |

Truth 2: Fingerprint Comparison

All behavioral fingerprints match:

| Cluster | Fingerprint Before | Fingerprint After | Match |
| --- | --- | --- | --- |
| find-case-citation | 4vu1cs8 | 4vu1cs8 | ✅ |
| find-law-citation | 4iuaxdk | 4iuaxdk | ✅ |
| find-journal-citation | 1dphgtl | 1dphgtl | ✅ |
| find-supra-citation | 4i7izr5 | 4i7izr5 | ✅ |
| find-id-citation | 2xa6o4n | 2xa6o4n | ✅ |
| clean-html | 3a7rklb | 3a7rklb | ✅ |
| clean-whitespace | 4p1mzbe | 4p1mzbe | ✅ |
| clean-underscores | 3xk6cql | 3xk6cql | ✅ |
| resolve-citations | 4bfnrj3 | 4bfnrj3 | ✅ |
| get-year-validation | 1awb7q1 | 1awb7q1 | ✅ |
| get-court-by-paren | 104ugq6 | 104ugq6 | ✅ |
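This PR does not describe how Regrets computes its fingerprints, so the following is only a generic sketch of the idea: hash the outputs of a function over a fixed set of inputs, then compare the digest before and after a refactor. The `fingerprint` helper, the two `get_year_*` functions, and the digest format are all assumptions made for illustration and do not reproduce the values in the table.

```python
import hashlib
from typing import Any, Callable, Iterable, Optional


def fingerprint(func: Callable[[str], Any], inputs: Iterable[str], length: int = 7) -> str:
    """Hash the repr of every output into one short hex digest."""
    digest = hashlib.sha256()
    for text in inputs:
        digest.update(repr(func(text)).encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent outputs cannot run together
    return digest.hexdigest()[:length]


# Hypothetical "before" and "after" versions of the same helper.
def get_year_before(word: str) -> Optional[int]:
    return int(word) if word.isdigit() and len(word) == 4 else None


def get_year_after(word: str) -> Optional[int]:
    if len(word) != 4 or not word.isdigit():
        return None
    return int(word)


samples = ["1999", "99", "abcd", "2024"]
before = fingerprint(get_year_before, samples)
after = fingerprint(get_year_after, samples)
assert before == after  # same behavior on the samples, same fingerprint
```

A matching fingerprint only shows that behavior agrees on the sampled inputs, which is why it complements a test suite rather than replacing one.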

Chain Hash Comparison

| Chain | Hash Before | Hash After | Match |
| --- | --- | --- | --- |
| full-citation-pipeline | 1nhozlg | 1nhozlg | ✅ |
| short-cite-resolution | 4egd5zg | 4egd5zg | ✅ |
Additional Verification

Drift detection: 5 consecutive runs, zero drift on all clusters
Existing test suite: 20 non-hyperscan tests pass
Core functionality: all citation types (case, law, journal, supra, id) extracted correctly
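Drift detection, as used in the list above, can be approximated by calling the same function repeatedly on the same input and counting how many distinct outputs appear. The helper name `count_distinct_outputs` and both sample functions are illustrative assumptions; the run count of 5 mirrors the figure reported in this PR.

```python
import hashlib
import itertools
from typing import Any, Callable


def count_distinct_outputs(func: Callable[[str], Any], sample: str, runs: int = 5) -> int:
    """Call func repeatedly on the same input and count the distinct output digests."""
    digests = set()
    for _ in range(runs):
        output = repr(func(sample)).encode("utf-8")
        digests.add(hashlib.sha256(output).hexdigest())
    return len(digests)  # 1 means every run agreed, i.e. zero drift


def stable_tokens(text: str) -> list:
    # Deterministic: the same input always gives the same list.
    return sorted(set(text.split()))


_call_counter = itertools.count()


def drifting_tokens(text: str) -> list:
    # Deliberately unstable: a call counter leaks into the output.
    return text.split() + [str(next(_call_counter))]


stable_runs = count_distinct_outputs(stable_tokens, "1 U.S. 1. Id. at 2")
drifting_runs = count_distinct_outputs(drifting_tokens, "1 U.S. 1. Id. at 2")
assert stable_runs == 1
assert drifting_runs == 5
```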

Split the 1146-line helpers.py into 6 focused modules following single-responsibility and domain-cohesion principles:

- citation_metadata.py: extract and attach metadata to citations (add_post_citation, add_pre_citation, add_law_metadata, add_journal_metadata, get_year, clean_pin_cite, process_parenthetical)
- case_name.py: find case names in plain text and HTML (find_case_name, find_case_name_in_html, plus all helper functions)
- pin_cite.py: pin cite extraction (extract_pin_cite)
- court_matching.py: court code lookup from citation strings (get_court_by_paren)
- token_matching.py: token pattern matching utility (match_on_tokens, MAX_MATCH_CHARS)
- citation_filter.py: citation filtering and disambiguation (filter_citations, disambiguate_reporters, overlapping_citations, joke_cite)

helpers.py preserved as thin re-export module for backward compatibility. All existing imports (from eyecite.helpers import ...) continue to work.

Verified with Regrets regression testing:

- 11 clusters: all GREEN (find-case-citation, find-law-citation, find-journal-citation, find-supra-citation, find-id-citation, clean-html, clean-whitespace, clean-underscores, resolve-citations, get-year-validation, get-court-by-paren)
- 2 chains: all match (full-citation-pipeline, short-cite-resolution)
- Zero drift across 5 consecutive runs
- Raw output identical to pre-refactor baseline (Truth 1)
- All fingerprints match pre-refactor baseline (Truth 2)
- Existing test suite: 20 non-hyperscan tests pass

CLAassistant · 2026-06-14T02:50:50Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

mlissner · 2026-06-14T22:43:06Z

Thanks! Are you using eyecite for something or is this a drive-by PR from an AI or something?

A few immediate thoughts:

Looks like we need some linting work
Please sign the CLA
We don't need backwards compat
Pin cites can probably go in with citations

Thank you!

Wolfvin wants to merge 1 commit into freelawproject:main from Wolfvin:refactor/split-helpers-into-domain-modules
