refactor: make extension detection more robust #52

sgaist · 2025-10-01T19:53:46Z

Extension detection currently relies on a substring search, which can lead to a normal file being treated as an archive. For example, the Brotli extension is br, and the Bru markup language uses bru.

This refactor implements detection by splitting the file name on dots and then checking the resulting components. While not bulletproof, it should keep the current behaviour while avoiding situations where similar extensions get mixed up.
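To illustrate the difference, here is a minimal sketch, not the code from this PR. containsExt and hasComponent are hypothetical names chosen for the example; the first mimics substring matching, the second splits on dots and compares whole components.

```go
package main

import (
	"fmt"
	"strings"
)

// containsExt mimics substring-based detection: it reports whether ext
// appears anywhere in the lowercased file name.
func containsExt(filename, ext string) bool {
	return strings.Contains(strings.ToLower(filename), ext)
}

// hasComponent is a hypothetical helper (not the PR's actual code). It splits
// the lowercased name on dots and reports whether any component after the
// first equals ext exactly. ext is given with its leading dot, e.g. ".br".
func hasComponent(filename, ext string) bool {
	parts := strings.Split(strings.ToLower(filename), ".")
	want := strings.TrimPrefix(ext, ".")
	for _, p := range parts[1:] {
		if p == want {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"request.bru", "page.html.br", "backup.tar.gz"} {
		fmt.Printf("%-14s contains(.br)=%-5v component(.br)=%v\n",
			name, containsExt(name, ".br"), hasComponent(name, ".br"))
	}
}
```

With the substring check, request.bru is matched as Brotli; comparing whole components avoids that while still matching .tar inside .tar.gz.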

Fixes gitleaks/gitleaks#1949

Current extension detection relies on string search which can lead to situation where a normal file is seen as an archive such as with the brotli extension which is br and the Bru markup language which uses bru. This refactor implement the detection based on splitting the file name on dots and then check the artifacts base on that. While not bullet proof it should keep the current behaviour while avoiding sitution where close extension gets mixed.

mholt

Thanks. While I do agree the extension detection needs improvement, I think this change does too, before I can merge it.

First, I don't think we need a whole internal package just for two small functions, one of which I'm not even sure is necessary -- just make an unexported function in the archives package.

Then, I don't think we should use any Contains(), since that's going to be brittle all-around. For example:
#7 (comment)

We should probably add more test cases and make sure that we are accurately detecting the real extension, while accounting for the fact that archive filenames may have multiple dots in them. The extension would obviously be only at the end, but it may be one dot component, or two.
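As a rough sketch of what such cases might look like, assuming an end-anchored match rather than Contains(): knownExts and trailingExt below are illustrative names, and the extension list is not the library's real registry.

```go
package main

import (
	"fmt"
	"strings"
)

// knownExts is an illustrative list, not the library's real registry.
// Two-component extensions come first so they win over their last component.
var knownExts = []string{".tar.gz", ".tar.br", ".tar", ".gz", ".br", ".lz"}

// trailingExt returns the first known extension that the lowercased name
// ends with, or "" if none does. Only the end of the name is considered.
func trailingExt(filename string) string {
	lower := strings.ToLower(filename)
	for _, ext := range knownExts {
		if strings.HasSuffix(lower, ext) {
			return ext
		}
	}
	return ""
}

func main() {
	cases := []struct{ name, want string }{
		{"release-1.2.3.tar.gz", ".tar.gz"}, // dots in the base name
		{"notes.v2.br", ".br"},              // single-component extension
		{"request.bru", ""},                 // similar to .br but not an archive
		{"foo.tar.bak", ""},                 // .tar is not at the end
		{"DATA.LZ", ".lz"},                  // case-insensitive
	}
	for _, c := range cases {
		got := trailingExt(c.name)
		fmt.Printf("%-22s got=%q want=%q ok=%v\n", c.name, got, c.want, got == c.want)
	}
}
```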

mholt · 2025-10-01T21:54:55Z

lzip.go


 	// match filename
-	if filepath.Ext(strings.ToLower(filename)) == lz.Extension() {
+	if extensions.EndsWith(filename, lz.Extension()) {


What is the point of this change, what didn't work before?

Nothing wrong with the original implementation. I was thinking about consistency across the code base: part of it uses string parsing and part of it uses filepath.

sgaist · 2025-10-02T07:38:11Z

> Thanks. While I do agree the extension detection needs improvement, I think this change does too, before I can merge it.

Sure thing.

> First, I don't think we need a whole internal package just for two small functions, one of which I'm not even sure is necessary -- just make an unexported function in the archives package.

Will do, but see below for my last comment.

> Then, I don't think we should use any Contains(), since that's going to be brittle all-around. For example: #7 (comment)

I read that issue and I think I misunderstood that comment. In any case, my original goal was to keep the current behaviour while essentially ensuring that only the known extensions are detected.

> We should probably add more test cases and make sure that we are accurately detecting the real extension, while accounting for the fact that archive filenames may have multiple dots in them. The extension would obviously be only at the end, but it may be one dot component, or two.

For the test cases, would those be the ones I wrote in extensionutils_test.go?

Just to be sure I understand you correctly:

- The library would assume (and document/comment) that file name detection means the official extension(s) appear as the last element of the file path.
- All currently supported formats should be ported to do that (essentially == rather than contains).

If so, then I think that filepath.Ext could be used everywhere and extensionutils can be completely dropped.

mholt · 2025-10-03T15:41:28Z

Thanks!

Yes, I think that's correct, although I think the reason I didn't use filepath.Ext() is because I started with the Tar format, and that often has another extension after it, like .tar.gz, so I had to use Contains(). I think, anyway.

So we might need to be mindful of that; or we need to refactor how we do format identification, maybe starting with the outer-most format, and working to the inner-most (there should only be 2 at most: compression on the outside, and archive on the inner, though that's not strictly true -- I guess you could have foo.gz.tar).

So yeah, I'm also talking about using your new test cases and expanding them even more if possible.

So:

- Let's drop extensionutils
- Probably use filepath.Ext(), but we have to be mindful of double-extensions.
- Possibly refactor Identify() to work outer-inner when it comes to file extension (it already does this for streams); might involve stripping a detected file extension, then we always can use filepath.Ext(), I think.
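The outer-inner idea could be sketched roughly like this. It is only an illustration under assumptions: identifyByName, compressionExts, and archiveExts are hypothetical names, the extension sets are abbreviated, and this is not how Identify() is actually implemented.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative extension sets; not the library's actual format list.
var (
	compressionExts = map[string]bool{".gz": true, ".br": true, ".lz": true}
	archiveExts     = map[string]bool{".tar": true, ".zip": true}
)

// identifyByName is a hypothetical helper. It reads the outermost extension
// with filepath.Ext; if that is a compression format, it strips it and reads
// the next one. It returns (archive, compression); either may be "".
func identifyByName(filename string) (archive, compression string) {
	name := strings.ToLower(filepath.Base(filename))
	ext := filepath.Ext(name)
	if compressionExts[ext] {
		compression = ext
		name = strings.TrimSuffix(name, ext)
		ext = filepath.Ext(name)
	}
	if archiveExts[ext] {
		archive = ext
	}
	return archive, compression
}

func main() {
	for _, name := range []string{"backup.tar.gz", "page.html.br", "request.bru", "foo.gz.tar"} {
		a, c := identifyByName(name)
		fmt.Printf("%-14s archive=%q compression=%q\n", name, a, c)
	}
}
```

Because each step only compares the final dot component for equality, request.bru is never mistaken for Brotli, and foo.gz.tar is reported as a plain tar since the sketch only looks for compression on the outside.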

How's that sound?

sgaist · 2025-10-07T22:44:13Z

Sounds good. I started with the cleanup, but I haven't touched the refactoring of Identify() yet.

mholt requested changes Oct 1, 2025

refactor: remove internal utils and use filepath

b7e23a9

All file types are now tested using equality with their respective extension except for .tar (which matches the original implementation).
