Skip to content

Conversation

arturo32
Copy link

@arturo32 arturo32 commented Aug 30, 2025

This PR fixes #4120 where the RegEx in api/rich-text does not follow any standard. The current RegEx does not validate URIs such as ko-fi.com or 日本語.jp.

The new RegEx is based on the WHATWG Living Standard by using an altered version of a well-reviewed gist by Diego Perini.

The new RegEx

The new RegEx is considerably longer than the old one. This is mainly due to the complex IPv4 validation. It is also important to notice that the new RegEx does not cover (yet) IPv6.

If the complexity of the new RegEx makes this PR hard to review, the IPv4 validation can be relaxed (allowing any pattern in the format 4 groups of digits that can have between 1 and 3 digits each [ (\d{1,3}\.){3}\d{1,3} ]).

Here is the original RegEx from Perini:

/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?$/i

And here is the one that I used:

/(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,})\.?)(?::\d{2,5})?(?:[/?#]\S*)?)/gim

Main differences between Perini's and my RegEx:

  1. Removal of the required begin of line (^) and end of line ($) tokens;
  2. Removal of 64 characters/bytes limit to the domain;
  3. Only http and https protocols are accepted, if a protocol is present;
  4. Keeping of (?:^|\s|\() from the old RegEx at the beginning;
  5. Extensive use of named capture groups;
  6. Change from (?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]) to (?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)), putting [1-9]\d? in the last position (necessary after the removal of $ ).

The new logic

The added named groups and the new logic in the detectFacets function solves the problem of not accepting URLs in the upper-case format. It also helps with readability of the code.

The validation of a URI by its TLD part is made only when the TLD exists, i.e., it is not an IP.

Tests

This fix is compatible with all previous tests that existed and it also add more tests in here:

'HTTPS://google.com',
'https://google.COM',
'ko-fi.com',
'日本語.jp',
'GOOGLE.com',
'https://34.64.0.52',
'198.185.159.145',
'invalid IPs: http://127.0.0.1 https://255.255.255.255 https://0.0.0.0 https://169.254.1.1 https://1.1.1.011',
'invalid URIs: https://google.a https://localhost',

The last two are considered only text.

Notes

This PR, if accepted, should close #3002 as it solves the case-sensitive problem.

@@ -1,6 +1,8 @@
export const MENTION_REGEX = /(^|\s|\()(@)([a-zA-Z0-9.-]+)(\b)/g
// inspired by https://gist.github.com/dperini/729294 (2018/09/12 version)
// gist credit: Diego Perini
Copy link
Contributor

@matthieusieben matthieusieben Aug 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that simply referencing this does not satisfy the license terms.

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omg, you are right. I will include this text this afternoon.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added a comment of the same length of the regex to visualize better the parts of it:

export const URL_REGEX =
  /(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,}))(?::\d{2,5})?(?:[/?#]\S*)?)/gim
//-(-prefix--)(uri---(-------protocol-------)-(domain---(not-private-and-loopback-ips)(---not-system-and-class-c-private-ips--)(----------not-class-b-private-ips-----------)(----------ip-1st-oct------------)(----------ip-2nd-and-3rd-oct---------)--(-----------ip-4th-oct------------)-(--------------------------------dns-domain---------------------------------)-(-------------tld------------))(---port---)-(---path---)-)

@arturo32
Copy link
Author

arturo32 commented Sep 7, 2025

Hey, @matthieusieben, is there something I could do to make the review of this PR faster? Like,

  1. Is performance a concern? I could do some tests with thousands of posts with links on them (in regex101 the old regex seems to have a mean of 1ms while the new regex takes 3ms);
  2. Is the copyright notice an impediment? I could remove the IP identification part, making it different enough from the original code so that a copyright notice is not needed anymore (I would keep the "inspired by" old comment);
  3. Should I propose a library instead? Is reliability in this long regex a concern?

There are more than 5 issues in social-app that are related to URLs, so the acceptance of non-latin characters, dashes and non-case-sensitive matching that this PR provides are things that I consider important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Link recognition in api/richtext is broken
2 participants