Fix URL regex in api/rich-text #4156

arturo32 · 2025-08-30T04:36:07Z

This PR fixes #4120, where the RegEx in api/rich-text does not follow any standard. The current RegEx fails to match valid URIs such as ko-fi.com or 日本語.jp.

The new RegEx is based on the WHATWG Living Standard and uses an altered version of a well-reviewed gist by Diego Perini.

The new RegEx

The new RegEx is considerably longer than the old one, mainly because of the complex IPv4 validation. Note also that the new RegEx does not cover IPv6 yet.

If the complexity of the new RegEx makes this PR hard to review, the IPv4 validation can be relaxed to accept any four dot-separated groups of 1 to 3 digits each: (\d{1,3}\.){3}\d{1,3}.
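To make the trade-off concrete, here is a minimal sketch comparing the strict IPv4 part of the new RegEx with the relaxed alternative. The names STRICT_IPV4 and RELAXED_IPV4 are only for illustration, and both patterns are anchored with ^ and $ so they can be tested in isolation (the real RegEx is not anchored).

```typescript
// Strict IPv4 pieces, copied from the PR's regex and split up for readability.
const strictParts: RegExp[] = [
  /(?!(?:10|127)(?:\.\d{1,3}){3})/, // not private (10.x.x.x) or loopback (127.x.x.x)
  /(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})/, // not 169.254.x.x or 192.168.x.x
  /(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})/, // not 172.16.x.x to 172.31.x.x
  /(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])/, // 1st octet: 1 to 223
  /(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}/, // 2nd and 3rd octets: 0 to 255
  /\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)/, // 4th octet: 1 to 254
]

// Anchored only for this demo.
const STRICT_IPV4 = new RegExp('^' + strictParts.map((p) => p.source).join('') + '$')

// The relaxed alternative: any four groups of 1 to 3 digits.
const RELAXED_IPV4 = /^(?:\d{1,3}\.){3}\d{1,3}$/

for (const ip of ['198.185.159.145', '127.0.0.1', '255.255.255.255', '999.999.999.999']) {
  console.log(ip, 'strict:', STRICT_IPV4.test(ip), 'relaxed:', RELAXED_IPV4.test(ip))
}
```

The relaxed pattern accepts all four inputs, including 999.999.999.999, while the strict one accepts only 198.185.159.145. So relaxing the validation would mean that some of the "invalid IPs" in the tests become links.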

Here is the original RegEx from Perini:

/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?$/i

And here is the one that I used:

/(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,})\.?)(?::\d{2,5})?(?:[/?#]\S*)?)/gim

Main differences between Perini's and my RegEx:

Removal of the required start-of-line (^) and end-of-line ($) anchors;
Removal of the 64-character limit on each domain label ({0,62} became *);
Only the http and https protocols are accepted when a protocol is present;
The (?:^|\s|\() prefix from the old RegEx is kept at the beginning;
Extensive use of named capture groups;
Change from (?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]) to (?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?), moving [1-9]\d? to the last position. This is necessary after removing $: without the anchor, the engine stops at the first alternative that matches and could capture only the first two digits of a three-digit octet.
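The reordering in the last item can be checked in isolation. This is a small sketch using only the two fourth-octet alternations; the constant and helper names are made up for the example.

```typescript
// Perini's order: the short alternative [1-9]\d? comes first.
const PERINI_ORDER = /(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4])/
// The PR's order: three-digit alternatives first, [1-9]\d? last.
const PR_ORDER = /(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)/
// Perini's order again, but anchored as in the original gist.
const PERINI_ANCHORED = /^(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4])$/

// Returns the matched text, or null when nothing matches.
function firstMatch(re: RegExp, text: string): string | null {
  return text.match(re)?.[0] ?? null
}

console.log(firstMatch(PERINI_ORDER, '145')) // prints 14
console.log(firstMatch(PR_ORDER, '145')) // prints 145
console.log(firstMatch(PERINI_ANCHORED, '145')) // prints 145
```

With the $ anchor, the engine is forced to backtrack until the whole octet is consumed, so the order does not matter. Without it, the first alternative that succeeds wins, which is why 198.185.159.145 would otherwise be captured as 198.185.159.14.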

The new logic

The added named groups and the new logic in the detectFacets function solve the problem of URLs in upper case not being accepted. They also make the code more readable.

A URI is validated by its TLD only when a TLD exists, i.e., when the domain is not an IP address.
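As a rough illustration of how the named groups can drive this logic, here is a sketch. It is not the code from this PR: SIMPLE_URL_REGEX is a much simplified stand-in (ASCII-only labels, unvalidated IPv4), KNOWN_TLDS is a hypothetical allow-list, and detectLinks is an invented name. It also reports plain string indices, whereas real facets use byte offsets.

```typescript
// Simplified stand-in with the same group names as the PR's regex (assumption).
const SIMPLE_URL_REGEX =
  /(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>\d{1,3}(?:\.\d{1,3}){3}|(?:[a-z0-9-]+\.)+(?<tld>[a-z]{2,}))(?:[/?#]\S*)?)/gim

// Hypothetical allow-list; a real implementation would consult a full TLD list.
const KNOWN_TLDS = new Set(['com', 'jp', 'org'])

interface LinkCandidate {
  uri: string
  start: number // string index, not a byte offset
  end: number
}

function detectLinks(text: string): LinkCandidate[] {
  const links: LinkCandidate[] = []
  for (const match of text.matchAll(SIMPLE_URL_REGEX)) {
    const uri: string | undefined = match.groups?.uri
    if (uri === undefined || match.index === undefined) continue
    const protocol: string | undefined = match.groups?.protocol
    const tld: string | undefined = match.groups?.tld

    // The tld group is only set for DNS names, so IPs skip this check.
    // Lower-casing here is what makes GOOGLE.COM behave like google.com.
    if (tld !== undefined && !KNOWN_TLDS.has(tld.toLowerCase())) continue

    // The match may include the prefix (whitespace or "("), so locate the uri in it.
    const start = match.index + match[0].indexOf(uri)
    links.push({
      uri: protocol ? uri : `https://${uri}`,
      start,
      end: start + uri.length,
    })
  }
  return links
}

console.log(detectLinks('GOOGLE.com and https://google.COM/path'))
console.log(detectLinks('198.185.159.145'))
```

Because the protocol and the TLD are captured separately, the code no longer needs to test the raw match with case-sensitive string checks.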

Tests

This fix is compatible with all previously existing tests, and it also adds the following tests:

'HTTPS://google.com',
'https://google.COM',
'ko-fi.com',
'日本語.jp',
'GOOGLE.com',
'https://34.64.0.52',
'198.185.159.145',
'invalid IPs: http://127.0.0.1 https://255.255.255.255 https://0.0.0.0 https://169.254.1.1 https://1.1.1.011',
'invalid URIs: https://google.a https://localhost',

The last two are treated as plain text.

Notes

If accepted, this PR should also close #3002, as it solves the case-sensitivity problem.

matthieusieben · 2025-08-30T13:25:13Z

packages/api/src/rich-text/util.ts

@@ -1,6 +1,8 @@
 export const MENTION_REGEX = /(^|\s|\()(@)([a-zA-Z0-9.-]+)(\b)/g
+// inspired by https://gist.github.com/dperini/729294 (2018/09/12 version)
+// gist credit: Diego Perini


Note that simply referencing this does not satisfy the license terms, which state:

"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

arturo32 · 2025-08-30

Omg, you are right. I will include this text this afternoon.

arturo32 · 2025-08-31

I also added a comment of the same length as the regex to better visualize its parts:

export const URL_REGEX =
  /(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,}))(?::\d{2,5})?(?:[/?#]\S*)?)/gim
//-(-prefix--)(uri---(-------protocol-------)-(domain---(not-private-and-loopback-ips)(---not-system-and-class-c-private-ips--)(----------not-class-b-private-ips-----------)(----------ip-1st-oct------------)(----------ip-2nd-and-3rd-oct---------)--(-----------ip-4th-oct------------)-(--------------------------------dns-domain---------------------------------)-(-------------tld------------))(---port---)-(---path---)-)

arturo32 · 2025-09-07T15:42:16Z

Hey @matthieusieben, is there anything I could do to speed up the review of this PR? For example:

Is performance a concern? I could run some tests with thousands of posts with links in them (on regex101, the old regex averages about 1 ms while the new one takes about 3 ms);
Is the copyright notice an impediment? I could remove the IP identification part, making it different enough from the original code that a copyright notice is no longer needed (I would keep the old "inspired by" comment);
Should I propose a library instead? Is the reliability of this long regex a concern?

There are more than 5 issues in social-app related to URLs, so I consider the support for non-Latin characters, dashes, and case-insensitive matching that this PR provides important.

Commit dafe564: Fix URL regex in api/rich-text

matthieusieben reviewed Aug 30, 2025

Commit 23009d2: add license text to url regex in api/rich-text
