-
Notifications
You must be signed in to change notification settings - Fork 756
Fix URL regex in api/rich-text #4156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
packages/api/src/rich-text/util.ts
Outdated
@@ -1,6 +1,8 @@ | |||
export const MENTION_REGEX = /(^|\s|\()(@)([a-zA-Z0-9.-]+)(\b)/g | |||
// inspired by https://gist.github.com/dperini/729294 (2018/09/12 version) | |||
// gist credit: Diego Perini |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that simply referencing this does not satisfy the license terms.
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omg, you are right. I will include this text this afternoon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added a comment of the same length of the regex to visualize better the parts of it:
export const URL_REGEX =
/(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,}))(?::\d{2,5})?(?:[/?#]\S*)?)/gim
//-(-prefix--)(uri---(-------protocol-------)-(domain---(not-private-and-loopback-ips)(---not-system-and-class-c-private-ips--)(----------not-class-b-private-ips-----------)(----------ip-1st-oct------------)(----------ip-2nd-and-3rd-oct---------)--(-----------ip-4th-oct------------)-(--------------------------------dns-domain---------------------------------)-(-------------tld------------))(---port---)-(---path---)-)
Hey, @matthieusieben, is there something I could do to make the review of this PR faster? Like,
There are more than 5 issues in social-app that are related to URLs, so the acceptance of non-latin characters, dashes and non-case-sensitive matching that this PR provides are things that I consider important. |
This PR fixes #4120 where the RegEx in api/rich-text does not follow any standard. The current RegEx does not validate URIs such as
ko-fi.com
or日本語.jp
.The new RegEx is based on the WHATWG Living Standard by using an altered version of a well-reviewed gist by Diego Perini.
The new RegEx
The new RegEx is considerably longer than the old one. This is mainly due to the complex IPv4 validation. It is also important to notice that the new RegEx does not cover (yet) IPv6.
If the complexity of the new RegEx makes this PR hard to review, the IPv4 validation can be relaxed (allowing any pattern in the format 4 groups of digits that can have between 1 and 3 digits each [
(\d{1,3}\.){3}\d{1,3}
]).Here is the original RegEx from Perini:
/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?$/i
And here is the one that I used:
/(?:^|\s|\()(?<uri>(?<protocol>https?:\/\/)?(?<domain>(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}\.(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?)|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]*)?[a-z0-9\u00a1-\uffff]\.)+(?<tld>[a-z\u00a1-\uffff]{2,})\.?)(?::\d{2,5})?(?:[/?#]\S*)?)/gim
Main differences between Perini's and my RegEx:
^
) and end of line ($
) tokens;http
andhttps
protocols are accepted, if a protocol is present;(?:^|\s|\()
from the old RegEx at the beginning;(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4])
to(?:1\d\d|2[0-4]\d|25[0-4]|[1-9]\d?))
, putting[1-9]\d?
in the last position (necessary after the removal of$
).The new logic
The added named groups and the new logic in the
detectFacets
function solves the problem of not accepting URLs in the upper-case format. It also helps with readability of the code.The validation of a URI by its TLD part is made only when the TLD exists, i.e., it is not an IP.
Tests
This fix is compatible with all previous tests that existed and it also add more tests in here:
The last two are considered only text.
Notes
This PR, if accepted, should close #3002 as it solves the case-sensitive problem.