
fix: reject names that cannot be represented as alphanumeric ASCII #22

Open
Sivva2 wants to merge 1 commit into tictactrip:master from Sivva2:fix/reject-non-alphanum-ids


Sivva2 commented May 4, 2026

Closes #2.

Context

Issue #2 lists generated ids containing characters that should not appear in a gpuid: Cyrillic, Greek (c|XXκαβάλα__@sx32g), Arabic, German eszett (g|XXweißenfe@u30e29), curly apostrophes (c|FRlessabd’@gbq8r), typographic dashes (g|ITadb6–pxs@srbj4j), emoji (g|XX🚌______@u2dhf7), and a handful of malformed inputs where backslash escape sequences leaked through.

The sanitize() pipeline already maps a wide range of Latin-extended characters to ASCII via replaceChar(), but anything outside that table flows through unchanged and ends up in the final id.

Change

A single regex check at the end of sanitize():

if (!/^[a-z0-9 ]+$/.test(sanitized)) {
  throw new Error(
    `Cannot generate gpuid: name "${str}" contains characters that cannot be represented as alphanumeric ASCII (got "${sanitized}" after sanitization).`,
  );
}

The check runs after replaceChar() and stop-word removal, so anything recoverable as Latin (é, ñ, ø, …) still goes through. Only characters that the existing pipeline cannot normalize trigger the error, and the message reports both the original input and the post-sanitization form to make upstream debugging easy.
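To make the ordering concrete, here is a minimal sketch of where the check sits. It is not the library's actual code: the replacement table, the stop-word list, and the separator handling below are hypothetical stand-ins, and only the final regex and error message are taken from this PR.

```typescript
// Hypothetical stand-in for the table behind replaceChar(); the real one is much larger.
const REPLACEMENTS = new Map<string, string>([
  ["\u00e9", "e"], // é
  ["\u00f1", "n"], // ñ
  ["\u00f8", "o"], // ø
  ["\u00fc", "u"], // ü
]);

// Hypothetical stop-word list, for illustration only.
const STOP_WORDS = new Set<string>(["de", "la", "le"]);

function replaceChar(ch: string): string {
  const mapped = REPLACEMENTS.get(ch);
  return mapped === undefined ? ch : mapped;
}

function sanitize(str: string): string {
  // 1. Lowercase and map known Latin-extended characters to ASCII.
  const replaced = Array.from(str.toLowerCase()).map(replaceChar).join("");

  // 2. Assumption: ASCII apostrophes and hyphens act as word separators,
  //    then stop words are dropped.
  const sanitized = replaced
    .replace(/['-]/g, " ")
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word))
    .join(" ");

  // 3. The check added by this PR: anything still outside [a-z0-9 ] is rejected.
  //    The `+` quantifier also rejects an empty result.
  if (!/^[a-z0-9 ]+$/.test(sanitized)) {
    throw new Error(
      `Cannot generate gpuid: name "${str}" contains characters that cannot be represented as alphanumeric ASCII (got "${sanitized}" after sanitization).`,
    );
  }
  return sanitized;
}

console.log(sanitize("\u00c9vry")); // "evry": É is lowercased, then mapped to e
```

Under these assumptions, `sanitize("Weißenfels")` throws because ß is not in the stand-in table, while `sanitize("Pont-de-l'Arche")` returns `"pont l arche"`.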

Test changes

The existing "should return array of gpuid" test contained a B\u00fcsum, … entry whose expected id was c|DEb\u00fcs@u1w7c. In other words, the test was pinning a malformed input/output pair: the source contains the literal characters \, u, 0, 0, f, c rather than a ü. I removed that entry from the array test and added a dedicated test asserting that this input now throws.
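For readers unsure what "literal characters" means here, the difference can be illustrated in isolation. The variable names are made up for this sketch; only the two spellings of the name and the regex come from the PR.

```typescript
// Six separate characters after the B: backslash, u, 0, 0, f, c.
const literalEscape = "B\\u00fcsum";

// A real escape sequence: the source text decodes to a single ü (U+00FC).
const realUmlaut = "B\u00fcsum";

// Same character class as the check added to sanitize().
const ALNUM_ASCII = /^[a-z0-9 ]+$/;

console.log(literalEscape.length); // 10
console.log(realUmlaut.length); // 5
console.log(literalEscape.includes("\\")); // true

// The backslash is outside [a-z0-9 ], so the malformed form fails the check.
console.log(ALNUM_ASCII.test(literalEscape.toLowerCase())); // false
```

The well-formed `"Büsum"` would also fail this bare regex; it only passes in the real pipeline if replaceChar() maps ü to ASCII first.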

A new Non-alphanumeric inputs describe block adds 11 tests covering each category from the issue:

  • curly apostrophe (’)
  • em dash ()
  • German eszett (ß)
  • Cyrillic (Брод)
  • Greek (Καβάλα)
  • Arabic (العربية)
  • emoji (🚌)
  • name reducing to nothing after sanitization (!!!)
  • malformed backslash escapes (B\u00fcsum, …)
  • regression: ASCII apostrophe (Pont-de-l'Arche) still works
  • regression: digits (73 Rue Victor Hugo, …) still work
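The categories above can be sketched as a table-driven check. This is not the test file from the PR: `passesCheck` is a hypothetical, simplified stand-in for sanitize() (lowercasing, plus the assumption that ASCII apostrophes and hyphens become separators), and the curly-apostrophe and em-dash inputs are made-up samples.

```typescript
const ALNUM_ASCII = /^[a-z0-9 ]+$/;

// Simplified stand-in for sanitize(): no replaceChar() table, no stop words.
function passesCheck(name: string): boolean {
  const candidate = name.toLowerCase().replace(/['-]/g, " ").trim();
  return ALNUM_ASCII.test(candidate);
}

// [category, input, expected to pass?]
const cases: Array<[string, string, boolean]> = [
  ["curly apostrophe", "l\u2019Arche", false], // made-up sample
  ["em dash", "a\u2014b", false], // made-up sample
  ["German eszett", "\u00df", false],
  ["Cyrillic", "Брод", false],
  ["Greek", "Καβάλα", false],
  ["Arabic", "العربية", false],
  ["emoji", "🚌", false],
  ["punctuation only", "!!!", false],
  ["malformed backslash escape", "B\\u00fcsum", false],
  ["regression: ASCII apostrophe", "Pont-de-l'Arche", true],
  ["regression: digits", "73 Rue Victor Hugo", true],
];

// Collect the categories whose outcome differs from the expectation.
const mismatches = cases
  .filter(([, input, expected]) => passesCheck(input) !== expected)
  .map(([category]) => category);

console.log(mismatches.length === 0 ? "all categories behave as listed" : mismatches);
```

In the real suite each row would presumably be its own test asserting that the id generator throws (or does not throw); the table form just keeps the expectations side by side.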

yarn test reports 15/15 passing with 100% coverage, yarn lint is clean, yarn build is clean.

Note on backwards compatibility

This is a breaking behavioural change: inputs that previously produced malformed ids now throw. The issue is filed under the v2.0.0 milestone, but the package is currently at v2.1.1, so I'll let maintainers decide whether to ship this in v3 or as part of v2.x. Happy to adjust either way.



Development

Successfully merging this pull request may close these issues.

Do not authorize non-alphanum-ids
