Problem
The invisible character regex in _invisible.py is manually concatenated from 9 separate set/range sources (~20 lines of string concatenation). This has two risks:
C0 range fragility — The three C0 sub-ranges use hardcoded indexes (C0_CONTROL_RANGES[0], [1], [2]). Adding or removing a range silently breaks the regex.
Set iteration order — Fixed in 0.2.0 via sorted(), but the root cause is that regex construction shouldn't depend on container iteration order at all.
Proposed Solution
Replace manual string concatenation with programmatic generation:
    # Single source of truth: all invisible codepoints as a flat sorted list
    _ALL_INVISIBLE_CODEPOINTS = sorted(
        ZERO_WIDTH_CHARS
        | FORMAT_CHARS
        | MONGOLIAN_FVS_CHARS
        | BIDI_CONTROL_CHARS
        | {chr(cp)
           for start, end in [VARIATION_SELECTOR_RANGE, TAG_BLOCK_RANGE, ...]
           for cp in range(start, end + 1)}
        | {chr(cp) for start, end in C0_CONTROL_RANGES for cp in range(start, end + 1)}
        # Assumes C1_CONTROL_RANGE is an inclusive (start, end) pair like the
        # ranges above; range(*C1_CONTROL_RANGE) would drop the final codepoint.
        | {chr(cp) for cp in range(C1_CONTROL_RANGE[0], C1_CONTROL_RANGE[1] + 1)}
    )
    # Generate the regex from the flat list, collapsing contiguous runs into ranges
    INVISIBLE_RE = _build_char_class(_ALL_INVISIBLE_CODEPOINTS)
This makes the regex self-documenting, auditable, and immune to ordering/indexing bugs.
Context
Surfaced by Copilot and Grippy code review on PR #7. See also the regression test added in 0.2.0 that asserts all 492 codepoints are matched.
Scope
Refactor _invisible.py regex construction
Keep all existing character sets as named constants (documentation value)
Verify regex matches identical codepoint set before and after
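The last item could be checked by enumerating every Unicode codepoint against both patterns and comparing the resulting sets. The sketch below assumes both regexes are available as compiled patterns; matched_codepoints is a hypothetical helper name, and the two small patterns only stand in for the old and new INVISIBLE_RE.

```python
import re
import sys


def matched_codepoints(pattern: "re.Pattern[str]") -> set:
    """Return every codepoint whose single character the pattern fully matches."""
    return {
        cp
        for cp in range(sys.maxunicode + 1)
        if pattern.fullmatch(chr(cp)) is not None
    }


# Stand-ins for the hand-concatenated and the generated regex (illustrative only).
old_re = re.compile("[\u200b\u200c\u200d\ufeff]")
new_re = re.compile("[\u200b-\u200d\ufeff]")

old_set = matched_codepoints(old_re)
new_set = matched_codepoints(new_re)

# Report differences in both directions so a failure names the exact codepoints.
missing = sorted(old_set - new_set)
extra = sorted(new_set - old_set)
assert not missing and not extra, (missing, extra)
```

Run against the real patterns, the same comparison could sit next to the existing regression test, with len(new_set) additionally checked against the 492 codepoints that test asserts.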