Skip to content

Conversation

@TonyWelte
Copy link

@TonyWelte TonyWelte commented Apr 27, 2025

This PR fixes strings containing unicode characters not being escaped properly and their size being inaccurate. For instance on WString it results in the following change

38c38
< static char test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default2[] = "Hell\\xc3\\xb6 w\\xc3\\xb6rld!";
---
> static char test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default2[] = "Hell\xc3\xb6 w\xc3\xb6rld!";
40c40
< static char test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default3[] = "\\xe3\\x83\\x8f\\xe3\\x83\\xad\\xe3\\x83\\xbc\\xe3\\x83\\xaf\\xe3\\x83\\xbc\\xe3\\x83\\xab\\xe3\\x83\\x89";
---
> static char test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default3[] = "\xe3\x83\x8f\xe3\x83\xad\xe3\x83\xbc\xe3\x83\xaf\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x89";
74c74
<     {test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default2, 12, 12},
---
>     {test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default2, 14, 14},
84c84
<     {test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default3, 7, 7},
---
>     {test_msgs__msg__WStrings__DEFAULT_VALUE__wstring_value_default3, 21, 21},
140,141c140,141
<   "wstring wstring_value_default2 \"Hell\\xc3\\xb6 w\\xc3\\xb6rld!\"\n"
<   "wstring wstring_value_default3 \"\\xe3\\x83\\x8f\\xe3\\x83\\xad\\xe3\\x83\\xbc\\xe3\\x83\\xaf\\xe3\\x83\\xbc\\xe3\\x83\\xab\\xe3\\x83\\x89\"\n"
---
>   "wstring wstring_value_default2 \"Hell\xc3\xb6 w\xc3\xb6rld!\"\n"
>   "wstring wstring_value_default3 \"\xe3\x83\x8f\xe3\x83\xad\xe3\x83\xbc\xe3\x83\xaf\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x89\"\n"
161c161
<     {toplevel_type_raw_source, 399, 399},
---
>     {toplevel_type_raw_source, 415, 415},

It also fixes " and ' being over escaped (and some other escaped characters). For instance on https://github.com/Tonywelte/test_interface_files/blob/rolling/msg/Strings.msg (see ros2/test_interface_files#25)

46c46
< static char test_msgs__msg__Strings__DEFAULT_VALUE__string_value_default6[] = "Hello\\\\world!";
---
> static char test_msgs__msg__Strings__DEFAULT_VALUE__string_value_default6[] = "Hello\\world!";
48c48
< static char test_msgs__msg__Strings__DEFAULT_VALUE__string_value_default7[] = "Hello\\tworld!";
---
> static char test_msgs__msg__Strings__DEFAULT_VALUE__string_value_default7[] = "Hello  world!";
61c61
< static char test_msgs__msg__Strings__DEFAULT_VALUE__bounded_string_value_default6[] = "Hello\\\\world!";
---
> static char test_msgs__msg__Strings__DEFAULT_VALUE__bounded_string_value_default6[] = "Hello\\world!";
63c63
< static char test_msgs__msg__Strings__DEFAULT_VALUE__bounded_string_value_default7[] = "Hello\\tworld!";
---
> static char test_msgs__msg__Strings__DEFAULT_VALUE__bounded_string_value_default7[] = "Hello  world!";
250,255c250,255
<   "string string_value_default2 \"Hello\\'world!\"\n"
<   "string string_value_default3 \\'Hello\"world!\\'\n"
<   "string string_value_default4 'Hello\\\\'world!'\n"
<   "string string_value_default5 \"Hello\\\\\"world!\"\n"
<   "string string_value_default6 \"Hello\\\\\\\\world!\"\n"
<   "string string_value_default7 \"Hello\\\\tworld!\"\n"
---
>   "string string_value_default2 \"Hello'world!\"\n"
>   "string string_value_default3 'Hello\"world!'\n"
>   "string string_value_default4 'Hello\\'world!'\n"
>   "string string_value_default5 \"Hello\\\"world!\"\n"
>   "string string_value_default6 \"Hello\\\\world!\"\n"
>   "string string_value_default7 \"Hello\\tworld!\"\n"
260,265c260,265
<   "string<=22 bounded_string_value_default2 \"Hello\\'world!\"\n"
<   "string<=22 bounded_string_value_default3 \\'Hello\"world!\\'\n"
<   "string<=22 bounded_string_value_default4 'Hello\\\\'world!'\n"
<   "string<=22 bounded_string_value_default5 \"Hello\\\\\"world!\"\n"
<   "string<=22 bounded_string_value_default6 \"Hello\\\\\\\\world!\"\n"
<   "string<=22 bounded_string_value_default7 \"Hello\\\\tworld!\"\n"
---
>   "string<=22 bounded_string_value_default2 \"Hello'world!\"\n"
>   "string<=22 bounded_string_value_default3 'Hello\"world!'\n"
>   "string<=22 bounded_string_value_default4 'Hello\\'world!'\n"
>   "string<=22 bounded_string_value_default5 \"Hello\\\"world!\"\n"
>   "string<=22 bounded_string_value_default6 \"Hello\\\\world!\"\n"
>   "string<=22 bounded_string_value_default7 \"Hello\\tworld!\"\n"

Additional notes

  • While working on this fix I noticed that rosidl_runtime_c__String states that the capacity should include the null byte but the size shouldn't. None of the string defined by rosidl generator do this. I haven't fixed it in this PR since I'm not sure what needs to be changed. It should be as simple as adding +1 in static_seq but for toplevel_type_raw_source it should be -1. I'm not sure what's the difference.
  • A think it should be clarified what escaped characters are supported in default strings. \' and \" are mentioned in https://design.ros2.org/articles/legacy_interface_definition.html but other (\\ and \t) worked in rosidl_generator_cpp but rosidl_generator_c didn't escape them properly. \n doesn't work at all.

Fixes #855

@TonyWelte TonyWelte force-pushed the bugfix/unicode-string-length branch from 2360fe6 to 6eba9f8 Compare April 27, 2025 11:12
@TonyWelte TonyWelte force-pushed the bugfix/unicode-string-length branch from 6eba9f8 to 8c50d41 Compare May 4, 2025 14:10
@TonyWelte TonyWelte marked this pull request as ready for review May 4, 2025 14:12
TonyWelte added a commit to TonyWelte/rosidlcpp that referenced this pull request May 4, 2025
Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me, just a minor nitpick.

i would like to have an another review for this PR before merge.

@fujitatomoya
Copy link
Contributor

Pulls: #862
Gist: https://gist.githubusercontent.com/fujitatomoya/88476ec104d8b6970d44034c3f4a6fac/raw/354a5a9e3a0ddbe768c322cf3f1ae478f06cbdf5/ros2.repos
BUILD args: --packages-above-and-dependencies rosidl_generator_c
TEST args: --packages-above rosidl_generator_c
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/15920

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@TonyWelte TonyWelte force-pushed the bugfix/unicode-string-length branch from 8c50d41 to 0c29c98 Compare May 6, 2025 18:33
@fujitatomoya
Copy link
Contributor

Pulls: #862
Gist: https://gist.githubusercontent.com/fujitatomoya/9157bd03771275331596df97b01c4e91/raw/354a5a9e3a0ddbe768c322cf3f1ae478f06cbdf5/ros2.repos
BUILD args: --packages-above-and-dependencies rosidl_generator_c
TEST args: --packages-above rosidl_generator_c
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/16050

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Signed-off-by: Anthony Welte <[email protected]>
@TonyWelte TonyWelte force-pushed the bugfix/unicode-string-length branch from 0c29c98 to 52876f4 Compare May 21, 2025 19:37
@TonyWelte
Copy link
Author

TonyWelte commented May 21, 2025

Looks like the RHEL build uses an older Python version (before 3.12) which doesn't like strings within fstrings. I've pushed the following modification:

   """Statically define a runtime Sequence or String type."""
   if values:
     if isinstance(values, str):
-      return f'{{{varname}, {len(values.encode('utf-8'))}, {len(values.encode('utf-8'))}}}'
+      utf8_values = values.encode('utf-8')
+      return f'{{{varname}, {len(utf8_values)}, {len(utf8_values)}}}'
     else:
       return f'{{{varname}, {len(values)}, {len(values)}}}'
   return '{NULL, 0, 0}'

@fujitatomoya
Copy link
Contributor

Pulls: #862
Gist: https://gist.githubusercontent.com/fujitatomoya/56af7eafdd1a192e3fc92c458054b5f5/raw/354a5a9e3a0ddbe768c322cf3f1ae478f06cbdf5/ros2.repos
BUILD args: --packages-above-and-dependencies rosidl_generator_c
TEST args: --packages-above rosidl_generator_c
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/16062

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@TonyWelte
Copy link
Author

After using this fix with our interfaces we noticed another bug that could cause issues once this is merged.

If a Unicode character is directly followed by 0-9 or a-f, we get a warning warning: hex escape sequence out of range. It compiles but our tests fail because of an address sanitizer error global-buffer-overflow.

The solution would be to split the string after a Unicode sequence by adding "" (maybe only if the next character is 0-9 or a-f ?). Turning

char test[] = "Temperature \xc2\xb0C";

into

char test[] = "Temperature \xc2\xb0""C";

see https://godbolt.org/z/5rPKh4qa9

I can attempt a fix in this MR or make a new one. What do you think @fujitatomoya ?

@fujitatomoya
Copy link
Contributor

@TonyWelte to be honest, i am not sure about that.

@sloretz @cottsay @ahcorde do you have opinion about this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default string containing unicode characters are not handled properly

2 participants