@MichaelRFairhurst (Collaborator) commented Aug 23, 2025

CodeQL Coding Standards is implementing MISRA rules that refer to Unicode standard concepts such as UAX #44-compliant identifiers and NFC normalization checks.

These concepts are specific to neither MISRA nor C, and thus deserve a home in qtil.

This pull request introduces:

  • extensible predicates that encode raw data from the Unicode standard
  • a script to generate the extensible predicate YAML from local downloads of the Unicode property databases
  • a few general ASCII/Unicode helpers such as isAscii and unescapeUnicode
  • a few specific APIs to make UAX #44 validity and NFC quick checking more user-friendly (and efficient)
  • an import of these Unicode features in Qtil.qll
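To make the intent of the two named helpers concrete, here is a minimal Python sketch of what an ASCII check and a Unicode unescape step typically do. It is an analogue only: the actual helpers in this PR are CodeQL predicates, and the function names, signatures, and accepted escape forms (`\uXXXX` and `\UXXXXXXXX`) below are assumptions for illustration.

```python
import re

# Hypothetical Python analogues of isAscii and unescapeUnicode.
# The escape forms handled here (\uXXXX and \UXXXXXXXX) are an assumption;
# the real qtil predicates may accept different forms.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def is_ascii(text: str) -> bool:
    """Return True when every code point lies in U+0000..U+007F."""
    return all(ord(ch) < 0x80 for ch in text)


def unescape_unicode(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""

    def replace(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)
```

For example, `unescape_unicode(r"caf\u00e9")` yields a four-character string ending in U+00E9, for which `is_ascii` returns `False`.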

These features are pretty advanced, so I'm not sure they're worth adding to the README.md.

Adds UAX #44 identifier checking, and NFC quick check support, along with a
few helpers like `isAscii` and `unescapeUnicode`.
@MichaelRFairhurst force-pushed the michaelrfairhurst/add-unicode branch from d02924f to 2e0ffdb on August 24, 2025 06:05
@MichaelRFairhurst marked this pull request as ready for review on August 24, 2025 06:15
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces comprehensive Unicode support to the qtil library, providing CodeQL predicates for Unicode property checking, UAX #44 identifier validation, and NFC normalization checking. The implementation includes raw Unicode data generation, string utilities for Unicode escape handling, and efficient APIs for common Unicode operations.

  • Adds extensible predicates for Unicode properties (enumeration, boolean, and numeric)
  • Implements UAX #44 identifier validation and NFC normalization quick checking
  • Provides utilities for Unicode escape sequences and ASCII validation
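As a rough illustration of what an NFC check decides, the following Python sketch uses the standard library's `unicodedata.is_normalized` (available from Python 3.8). It is not the qtil API: the predicates in this PR are CodeQL, and the function name `is_nfc` is an assumption made for this example. In the Unicode standard, the NFC_Quick_Check property classifies each code point as Yes, No, or Maybe, which lets many strings be accepted or rejected without performing a full normalization.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True when text is already in Normalization Form C.

    Hypothetical helper for illustration; it delegates to the standard
    library rather than reading the NFC_Quick_Check data directly.
    """
    return unicodedata.is_normalized("NFC", text)


# U+00E9 is the precomposed form of "e" followed by U+0301 (combining acute).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
```

Here `is_nfc(precomposed)` holds while `is_nfc(decomposed)` does not, and `unicodedata.normalize("NFC", decomposed)` produces the precomposed string.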

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/qtil/strings/Unicode.qll | Core Unicode module with extensible predicates and helper functions |
| src/qtil/Qtil.qll | Imports the new Unicode module |
| src/qlpack.yml | Adds data extension for generated Unicode data |
| scripts/generate_unicode.py | Python script to generate Unicode property data from Unicode standard files |
| test/qtil/strings/UnicodeTest.ql | Comprehensive test suite for Unicode functionality |
| test/qtil/strings/UnicodeTest.expected | Test expectations file |

Comment on lines +125 to +130
if '..' not in code_point_hex_pair:
    code_point_start = code_point_end = int(code_point_hex_pair, 16)
else:
    # handle ranges like '00A0..00A7'
    code_point_hex_start, code_point_hex_end = code_point_hex_pair.split('..')
    code_point_start, code_point_end = int(code_point_hex_start, 16), int(code_point_hex_end, 16)
Copilot AI Aug 24, 2025

The comment and code that handle ranges are duplicated on lines 128-130 and 157-159. Consider extracting this logic into a helper function to reduce duplication.

Suggested change
- if '..' not in code_point_hex_pair:
-     code_point_start = code_point_end = int(code_point_hex_pair, 16)
- else:
-     # handle ranges like '00A0..00A7'
-     code_point_hex_start, code_point_hex_end = code_point_hex_pair.split('..')
-     code_point_start, code_point_end = int(code_point_hex_start, 16), int(code_point_hex_end, 16)
+ code_point_start, code_point_end = parse_code_point_range(code_point_hex_pair)
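The suggestion calls `parse_code_point_range` but does not show its body. One possible shape for that helper, assuming it simply wraps the existing branch logic and returns an inclusive `(start, end)` pair, might be:

```python
from typing import Tuple


def parse_code_point_range(code_point_hex_pair: str) -> Tuple[int, int]:
    """Parse '00A0' or '00A0..00A7' into an inclusive (start, end) pair.

    A sketch of the helper named in the suggestion; its real signature and
    return type are not given in the review, so both are assumptions here.
    """
    if '..' not in code_point_hex_pair:
        # A single code point is treated as a range of length one.
        code_point_start = code_point_end = int(code_point_hex_pair, 16)
    else:
        # handle ranges like '00A0..00A7'
        code_point_hex_start, code_point_hex_end = code_point_hex_pair.split('..')
        code_point_start = int(code_point_hex_start, 16)
        code_point_end = int(code_point_hex_end, 16)
    return code_point_start, code_point_end
```

Both call sites flagged in the review could then unpack the result directly, as the suggested replacement line does.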

Comment on lines +154 to +159
if '..' not in code_point_hex_pair:
    code_point_start = code_point_end = int(code_point_hex_pair, 16)
else:
    # handle ranges like '00A0..00A7'
    code_point_hex_start, code_point_hex_end = code_point_hex_pair.split('..')
    code_point_start, code_point_end = int(code_point_hex_start, 16), int(code_point_hex_end, 16)
Copilot AI Aug 24, 2025

The comment and code that handle ranges are duplicated on lines 128-130 and 157-159. Consider extracting this logic into a helper function to reduce duplication.

Suggested change
- if '..' not in code_point_hex_pair:
-     code_point_start = code_point_end = int(code_point_hex_pair, 16)
- else:
-     # handle ranges like '00A0..00A7'
-     code_point_hex_start, code_point_hex_end = code_point_hex_pair.split('..')
-     code_point_start, code_point_end = int(code_point_hex_start, 16), int(code_point_hex_end, 16)
+ code_point_start, code_point_end = parse_code_point_range(code_point_hex_pair)
