refactor: generate invisible char regex programmatically from data structures #8

@Fieldnote-Echo

Description

Problem

The invisible character regex in _invisible.py is manually concatenated from 9 separate set/range sources (~20 lines of string concatenation). This has three risks:

  1. Silent regression: a typo, merge conflict, or range edit can shrink coverage with no test failure (until the regression test from #8 catches it).
  2. C0 range fragility: the three C0 sub-ranges are referenced by hardcoded indexes (C0_CONTROL_RANGES[0], [1], [2]), so adding or removing a range silently breaks the regex.
  3. Set iteration order: fixed in 0.2.0 via sorted(), but the root cause remains that regex construction shouldn't depend on container iteration order at all.
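
To make risk 2 concrete, here is a minimal sketch of how hardcoded indexing can drop a newly added range without any error. The range values and both helper functions are hypothetical stand-ins, not the actual contents of _invisible.py.

```python
# Hypothetical values: C0 controls minus tab, newline, and carriage return.
# The real C0_CONTROL_RANGES in _invisible.py may differ.
C0_CONTROL_RANGES = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F)]


def c0_class_indexed(ranges):
    """Build the class body by index, mimicking the current fragile style."""
    parts = [ranges[0], ranges[1], ranges[2]]  # any fourth range is ignored
    return "".join(f"\\x{lo:02x}-\\x{hi:02x}" for lo, hi in parts)


def c0_class_iterated(ranges):
    """Build the class body by iterating over every range that is present."""
    return "".join(f"\\x{lo:02x}-\\x{hi:02x}" for lo, hi in ranges)


# Someone adds DEL (0x7F) as a fourth range.
extended = C0_CONTROL_RANGES + [(0x7F, 0x7F)]

# The indexed version is unchanged, so the new range never reaches the regex.
print(c0_class_indexed(extended) == c0_class_indexed(C0_CONTROL_RANGES))
# The iterated version picks it up.
print("\\x7f" in c0_class_iterated(extended))
```

Iterating over the container removes the dependence on its length, which is the property the proposal below relies on.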

Proposed Solution

Replace manual string concatenation with programmatic generation:

# Single source of truth: all invisible codepoints as a flat sorted list
_ALL_INVISIBLE_CODEPOINTS = sorted(
    ZERO_WIDTH_CHARS | FORMAT_CHARS | MONGOLIAN_FVS_CHARS | BIDI_CONTROL_CHARS
    | {chr(cp) for start, end in [VARIATION_SELECTOR_RANGE, TAG_BLOCK_RANGE, ...] 
       for cp in range(start, end + 1)}
    | {chr(cp) for start, end in C0_CONTROL_RANGES for cp in range(start, end + 1)}
    | {chr(cp) for cp in range(C1_CONTROL_RANGE[0], C1_CONTROL_RANGE[1] + 1)}  # assumes an inclusive (start, end) pair, like the ranges above
)

# Generate regex from the flat sorted list, collapsing contiguous runs into ranges
INVISIBLE_RE = _build_char_class(_ALL_INVISIBLE_CODEPOINTS)

This makes the regex self-documenting, auditable, and immune to ordering/indexing bugs.

Context

Surfaced by Copilot and Grippy code review on PR #7. See also the regression test added in 0.2.0 that asserts all 492 codepoints are matched.

Scope

  • Refactor _invisible.py regex construction
  • Keep all existing character sets as named constants (documentation value)
  • Verify regex matches identical codepoint set before and after
  • Target: 0.3.0
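
The "identical codepoint set before and after" check could be done by brute force over the whole Unicode range. The sketch below assumes both regexes are available as compiled patterns. The function names are hypothetical, and the two small patterns stand in for the old and new INVISIBLE_RE.

```python
import re
import sys


def matched_codepoints(pattern):
    """Return every codepoint whose single character the pattern fully matches."""
    return {cp for cp in range(sys.maxunicode + 1) if pattern.fullmatch(chr(cp))}


def assert_same_coverage(old, new):
    """Fail with a readable diff if the two patterns cover different codepoints."""
    old_set, new_set = matched_codepoints(old), matched_codepoints(new)
    missing = sorted(old_set - new_set)
    extra = sorted(new_set - old_set)
    assert not missing and not extra, (
        f"lost: {[hex(cp) for cp in missing]}, gained: {[hex(cp) for cp in extra]}"
    )
    return len(new_set)


# Stand-ins for the hand-concatenated and the generated regex.
old_re = re.compile("[\u200b\u200c\u200d\ufeff]")
new_re = re.compile("[\\u200b-\\u200d\\ufeff]")
print(assert_same_coverage(old_re, new_re))
```

In the real module, the returned count could additionally be compared against the 492 codepoints mentioned under Context. The scan visits about 1.1 million codepoints per pattern, which is affordable as a one-off refactoring check.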

Metadata

Labels: enhancement