fix: harden zip/tar extraction and redact leaked credentials#332
Open
willronchetti wants to merge 4 commits into
Open
fix: harden zip/tar extraction and redact leaked credentials#332willronchetti wants to merge 4 commits into
willronchetti wants to merge 4 commits into
Conversation
…ncomplete redaction regex - zip_utils.py: unpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory called extractall() with no member-path validation, allowing a malicious archive entry (e.g. "../../../../tmp/evil") to write outside the target temp directory (Zip Slip / CVE-2007-4559-class tar slip). This is reachable from structured_data.py's metadata-bundle ingestion path (used by submitr and portal metadata uploads). Fixed by validating every member's resolved path stays within the target directory before extraction, and by passing tarfile's filter="data" (PEP 706) as defense in depth against unsafe member types. - ff_utils.py: UnifiedAuthenticator.AuthenticationError embedded the raw auth object (the actual access-key/secret credential) into its exception message, which routinely ends up in logs, error trackers, and Foursight check output whenever a caller passes a truthy but malformed auth value (e.g. a dict missing/misnaming the "secret" key). Fixed by redacting the auth payload before building the message while preserving the original value on exception.auth for legitimate handling. - obfuscation_utils.py: _SENSITIVE_KEY_NAMES_REGEX (the heuristic backing obfuscate_json and trace_utils.py's TRACE_REDACT credential-scrubbing mechanism) didn't match "token", "authorization", "api_key", "access_key", "bearer", "jwt", or "private_key", so this security control silently failed to redact the most common HTTP auth credential shapes. Extended the regex to cover these cases.
…entry needs human-supplied PR/version
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Intent
Security review of dcicutils (4dn-dcic/utils), depended on by snovault, smaht-portal, encoded-core. Task: review auth/credential handling, input handling, secrets/AWS config, dependency risk, and injection issues; fix the top 3 most urgent, concretely-exploitable issues with minimal targeted fixes only. Fixed: (1) Zip Slip/Tar Slip path traversal in dcicutils/zip_utils.py - extractall() had no member-path validation, reachable from structured_data.py metadata-bundle ingestion (submitr/portal uploads); fixed by validating each member's resolved path stays within the target dir before extraction, plus tarfile filter='data' (PEP 706) as defense in depth. (2) Credential leak into exception message text in dcicutils/ff_utils.py UnifiedAuthenticator.AuthenticationError - it interpolated the raw auth key/secret object into the exception message (ends up in logs/Sentry/Foursight output) whenever a caller passed a truthy-but-malformed auth value; fixed by redacting the auth payload in the message while preserving the original on exception.auth. (3) dcicutils/obfuscation_utils.py _SENSITIVE_KEY_NAMES_REGEX (backing obfuscate_json and trace_utils.py TRACE_REDACT credential scrubbing) didn't match token, authorization, api_key, access_key, bearer, jwt, or private_key, silently failing to redact common HTTP auth credential shapes; extended the regex. Deliberate scope decisions: kept fixes minimal, did not fix lower-confidence/legacy issues noted separately for the captain's awareness (EBDeployer public-read S3 upload - likely dead code with no registered entry point; command_utils.py ShellScript shell=True - no confirmed untrusted-input caller in this repo). The captain has already reviewed a prior pipeline run's review-step findings (tar-filter-compat: filter='data' needs Python 3.8.17+/3.9.17+/3.10.12+/3.11.4+ while pyproject.toml allows >=3.8.1,<3.13; auth-attr-not-redacted: AuthenticationError.self.auth still stores the raw unredacted credential even though the message is redacted; extract-file-from-zip-not-hardened: sibling function extract_file_from_zip() in zip_utils.py wasn't hardened with the same zip-slip guard) and made a deliberate decision to ACCEPT all three as documented residual/follow-up risk rather than fix them now, so if review resurfaces the same findings they should be approved as accepted, not fixed. An earlier draft comment in obfuscation_utils.py had a fabricated attribution/signature line as if the user wrote it; user asked it be removed since the agent wrote it - stripped, and the rest of the diff was grepped to confirm no other fabricated attributions. Commit intentionally has no Co-Authored-By trailer per this repo's convention against attributing commits to an agent.
What Changed
dcicutils/zip_utils.py: added a path-traversal guard (_assert_within_directory) that validates each archive member's resolved path stays within the target directory before extraction inunpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory, and passesfilter="data"totarfile.extractallas defense in depth against Zip Slip/Tar Slip.dcicutils/ff_utils.py:UnifiedAuthenticator.AuthenticationErrornow redacts the raw auth payload when building its exception message (previously the raw key/secret was interpolated into the message, exposing it in logs/Sentry/Foursight output), while still preserving the original value onexception.auth.dcicutils/obfuscation_utils.py: extended_SENSITIVE_KEY_NAMES_REGEXto also matchtoken,authorization,api_key,access_key,bearer,jwt, andprivate_key, soobfuscate_json/TRACE_REDACTnow scrub these common credential key shapes; updated the related docstring.test/test_zip_utils.py,test/test_ff_utils.py, andtest/test_obfuscation_utils.py.Risk Assessment
✅ Low: The three fixes (zip/tar-slip path validation, credential redaction in exception messages, expanded secret-key regex) are minimal, well-targeted, and verified correct against traversal/absolute-path/symlink edge cases and dict/tuple/list auth shapes; the only remaining items are pre-existing, lower-severity residual risks the author already surfaced and the reviewer/captain already accepted as follow-ups rather than blockers.
Testing
test
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
dcicutils/zip_utils.py:40- tarfile.extractall(..., filter="data") requires Python 3.8.17+/3.9.17+/3.10.12+/3.11.4+ (PEP 706 security backport); pyproject.toml allows >=3.8.1,<3.13, so on an unpatched patch version within that range this raises TypeError (unexpected keyword argument 'filter') and breaks all tar extraction. Previously identified and accepted as a documented residual risk rather than fixed now.dcicutils/ff_utils.py:1360- AuthenticationError.init redacts the credential in the exception message but still stores the raw, unredacted auth on self.auth, so any caller/log sink that accesses exc.auth directly (rather than str(exc)) still leaks the credential. Previously identified and accepted as a documented residual risk rather than fixed now.dcicutils/zip_utils.py:83- extract_file_from_zip() calls zipf.extract(file_to_extract, path=tmp_directory_name) without the same _assert_within_directory zip-slip guard applied to unpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory, so this sibling extraction path remains unhardened. Previously identified and accepted as a documented residual risk rather than fixed now.test/test_zip_utils.py- new test file written by agent: test/test_zip_utils.pyaCHANGELOG.rst- CHANGELOG.rst has no entry for this security-fix commit (zip/tar-slip guard, redacted AuthenticationError message, expanded credential-redaction regex), breaking the repo's established per-change convention (author initials / date / branch / PR-<number> plus a version bump, seen in every recent entry e.g. 8.18.4/8.18.3/8.18.1). I did not add one myself because two required fields — the next PR number and the version-bump target (currently 8.18.4) — aren't yet knowable: this branch has no open PR yet (gh shows SN Refactor custom_excel for portal query #331 as the highest existing PR), and inventing either would risk exactly the kind of fabricated-attribution content the author already asked to have stripped from this diff. A maintainer should add the entry once the PR is opened.✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.