@sgaist sgaist commented Oct 1, 2025

The current extension detection relies on a substring search, which can cause a regular file to be treated as an archive. For example, the Brotli extension is br and the Bru markup language uses bru, so a .bru file is matched as Brotli.

This refactor implements detection by splitting the file name on dots and then checking the resulting parts against the known extensions. While not bulletproof, it should keep the current behaviour while avoiding situations where similar extensions get mixed up.

Fixes gitleaks/gitleaks#1949
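
To illustrate the mismatch described above, here is a minimal sketch. The helper names `containsExt` and `hasDotComponent` are hypothetical and are not taken from this PR or from the archives package; they only contrast a substring check with a check on dot-separated components.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// containsExt mimics a substring-based check (hypothetical helper).
func containsExt(filename, ext string) bool {
	return strings.Contains(strings.ToLower(filename), ext)
}

// hasDotComponent reports whether ext (given with its leading dot, e.g. ".br")
// equals one of the dot-separated components that follow the base name
// (hypothetical helper).
func hasDotComponent(filename, ext string) bool {
	name := strings.ToLower(filepath.Base(filename))
	parts := strings.Split(name, ".")
	want := strings.TrimPrefix(ext, ".")
	// parts[0] is the base name, so only later components count as extensions.
	for _, p := range parts[1:] {
		if p == want {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"request.bru", "bundle.js.br", "backup.tar.br"} {
		fmt.Printf("%-14s contains=%v component=%v\n",
			name, containsExt(name, ".br"), hasDotComponent(name, ".br"))
	}
}
```

With the substring check, `request.bru` is reported as Brotli; with the component check it is not, while `bundle.js.br` still is.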

Owner

@mholt mholt left a comment

Thanks. While I do agree the extension detection needs improvement, I think this change does too, before I can merge it.

First, I don't think we need a whole internal package just for 2 small functions, one of which I don't even know is necessary -- just make an unexported function in the archives package.

Then, I don't think we should use any Contains(), since that's going to be brittle all-around. For example:
#7 (comment)

We should probably add more test cases and make sure that we are accurately detecting the real extension, while accounting for the fact that archive filenames may have multiple dots in them. The extension would obviously be only at the end, but it may be one dot component, or two.

lzip.go Outdated

  // match filename
- if filepath.Ext(strings.ToLower(filename)) == lz.Extension() {
+ if extensions.EndsWith(filename, lz.Extension()) {
Owner

What is the point of this change, what didn't work before?

Author

Nothing was wrong with the original implementation. I was aiming for consistency across the code base: part of it uses string parsing and part uses filepath.

Author

sgaist commented Oct 2, 2025

Thanks. While I do agree the extension detection needs improvement, I think this change does too, before I can merge it.

Sure thing

First, I don't think we need a whole internal package just for 2 small functions, one of which I don't even know is necessary -- just make an unexported function in the archives package.

Will do but see below for my last comment.

Then, I don't think we should use any Contains(), since that's going to be brittle all-around. For example: #7 (comment)

I read that issue and I think I misunderstood that comment. In any case, my original goal was to keep the current behaviour while essentially ensuring that only the known extensions are detected.

We should probably add more test cases and make sure that we are accurately detecting the real extension, while accounting for the fact that archive filenames may have multiple dots in them. The extension would obviously be only at the end, but it may be one dot component, or two.

For the test cases, would those be the ones I wrote in extensionutils_test.go?

Just to be sure I understand you correctly:

  • The library would assume (and state in its documentation/comments) that file name detection means matching the official extension(s) as the last element of the file path
  • All the code for the currently supported formats should be ported to do that (essentially == rather than Contains)

If so, then I think that filepath.Ext could be used everywhere and extensionutils can be completely dropped.

Owner

mholt commented Oct 3, 2025

Thanks!

Yes, I think that's correct, although I think the reason I didn't use filepath.Ext() is because I started with the Tar format, and that often has another extension after it, like .tar.gz, so I had to use Contains(). I think, anyway.

So we might need to be mindful of that; or we need to refactor how we do format identification, maybe starting with the outer-most format, and working to the inner-most (there should only be 2 at most: compression on the outside, and archive on the inner, though that's not strictly true -- I guess you could have foo.gz.tar).

So yeah, I'm also talking about using your new test cases and expanding them even more if possible.

So:

  • Let's drop extensionutils
  • Probably use filepath.Ext(), but we have to be mindful of double-extensions.
  • Possibly refactor Identify() to work outer-inner when it comes to file extension (it already does this for streams); might involve stripping a detected file extension, then we always can use filepath.Ext(), I think.

How's that sound?

All file types are now tested using equality with
their respective extension except for .tar (which
matches the original implementation).
Author

sgaist commented Oct 7, 2025

Sounds good. I started with the cleanup, but I haven't touched the refactoring of Identify() yet.

Development

Successfully merging this pull request may close these issues.

Unable to scan .bru files by default, only scans when renamed to .txt