-
-
Notifications
You must be signed in to change notification settings - Fork 27
refactor: make extension detection more robust #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Current extension detection relies on string search which can lead to situation where a normal file is seen as an archive such as with the brotli extension which is br and the Bru markup language which uses bru. This refactor implement the detection based on splitting the file name on dots and then check the artifacts base on that. While not bullet proof it should keep the current behaviour while avoiding sitution where close extension gets mixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. While I do agree the extension detection needs improvement, I think this change does too, before I can merge it.
First, I don't think we need a whole internal
package just for 2 small function, one of which I don't even know is necessary -- just make an unexported function in the archives package.
Then, I don't think we should use any Contains()
, since that's going to be brittle all-around. For example:
#7 (comment)
We should probably add more test cases and make sure that we are accurately detecting the real extension, while accounting for the fact that archive filenames may have multiple dots in them. The extension would obviously be only at the end, but it may be one dot component, or two.
lzip.go
Outdated
|
||
// match filename | ||
if filepath.Ext(strings.ToLower(filename)) == lz.Extension() { | ||
if extensions.EndsWith(filename, lz.Extension()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this change, what didn't work before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing wrong with the original implementation. I was thinking about consistency across the code base. Part of it is using string parsing and some filepath.
Sure thing
Will do but see below for my last comment.
I read that issue and I think I misunderstood that comment. In any case, my original goal was to keep the current behaviour while essentially ensuring that only the known extensions are detected.
For the test cases, would that be the ones I wrote in Just to be sure I understand you correctly:
If so, then I think that |
Thanks! Yes, I think that's correct, although I think the reason I didn't use So we might need to be mindful of that; or we need to refactor how we do format identification, maybe starting with the outer-most format, and working to the inner-most (there should only be 2 at most: compression on the outside, and archive on the inner, though that's not strictly true -- I guess you could have So yeah, I'm also talking about using your new test cases and expanding them even more if possible. So:
How's that sound? |
All file types are now tested using equality with their respective extension except for .tar (which matches the original implementation).
Sounds good, I started with the cleanup but I haven't touched the refactoring of |
Current extension detection relies on string search which can lead to situation where a normal file is seen as an archive such as with the brotli extension which is br and the Bru markup language which uses bru.
This refactor implement the detection based on splitting the file name on dots and then check the artifacts base on that. While not bullet proof it should keep the current behaviour while avoiding situation where close extension gets mixed.
Fixes gitleaks/gitleaks#1949