Skip to content

fix: harden zip/tar extraction and redact leaked credentials#332

Open
willronchetti wants to merge 4 commits into
masterfrom
fm/sec-fix-util1
Open

fix: harden zip/tar extraction and redact leaked credentials#332
willronchetti wants to merge 4 commits into
masterfrom
fm/sec-fix-util1

Conversation

@willronchetti

Copy link
Copy Markdown
Member

Intent

Security review of dcicutils (4dn-dcic/utils), depended on by snovault, smaht-portal, encoded-core. Task: review auth/credential handling, input handling, secrets/AWS config, dependency risk, and injection issues; fix the top 3 most urgent, concretely-exploitable issues with minimal targeted fixes only. Fixed: (1) Zip Slip/Tar Slip path traversal in dcicutils/zip_utils.py - extractall() had no member-path validation, reachable from structured_data.py metadata-bundle ingestion (submitr/portal uploads); fixed by validating each member's resolved path stays within the target dir before extraction, plus tarfile filter='data' (PEP 706) as defense in depth. (2) Credential leak into exception message text in dcicutils/ff_utils.py UnifiedAuthenticator.AuthenticationError - it interpolated the raw auth key/secret object into the exception message (ends up in logs/Sentry/Foursight output) whenever a caller passed a truthy-but-malformed auth value; fixed by redacting the auth payload in the message while preserving the original on exception.auth. (3) dcicutils/obfuscation_utils.py _SENSITIVE_KEY_NAMES_REGEX (backing obfuscate_json and trace_utils.py TRACE_REDACT credential scrubbing) didn't match token, authorization, api_key, access_key, bearer, jwt, or private_key, silently failing to redact common HTTP auth credential shapes; extended the regex. Deliberate scope decisions: kept fixes minimal, did not fix lower-confidence/legacy issues noted separately for the captain's awareness (EBDeployer public-read S3 upload - likely dead code with no registered entry point; command_utils.py ShellScript shell=True - no confirmed untrusted-input caller in this repo). The captain has already reviewed a prior pipeline run's review-step findings (tar-filter-compat: filter='data' needs Python 3.8.17+/3.9.17+/3.10.12+/3.11.4+ while pyproject.toml allows >=3.8.1,<3.13; auth-attr-not-redacted: AuthenticationError.self.auth still stores the raw unredacted credential even though the message is redacted; extract-file-from-zip-not-hardened: sibling function extract_file_from_zip() in zip_utils.py wasn't hardened with the same zip-slip guard) and made a deliberate decision to ACCEPT all three as documented residual/follow-up risk rather than fix them now, so if review resurfaces the same findings they should be approved as accepted, not fixed. An earlier draft comment in obfuscation_utils.py had a fabricated attribution/signature line as if the user wrote it; user asked it be removed since the agent wrote it - stripped, and the rest of the diff was grepped to confirm no other fabricated attributions. Commit intentionally has no Co-Authored-By trailer per this repo's convention against attributing commits to an agent.

What Changed

  • dcicutils/zip_utils.py: added a path-traversal guard (_assert_within_directory) that validates each archive member's resolved path stays within the target directory before extraction in unpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory, and passes filter="data" to tarfile.extractall as defense in depth against Zip Slip/Tar Slip.
  • dcicutils/ff_utils.py: UnifiedAuthenticator.AuthenticationError now redacts the raw auth payload when building its exception message (previously the raw key/secret was interpolated into the message, exposing it in logs/Sentry/Foursight output), while still preserving the original value on exception.auth.
  • dcicutils/obfuscation_utils.py: extended _SENSITIVE_KEY_NAMES_REGEX to also match token, authorization, api_key, access_key, bearer, jwt, and private_key, so obfuscate_json/TRACE_REDACT now scrub these common credential key shapes; updated the related docstring.
  • Added corresponding tests in test/test_zip_utils.py, test/test_ff_utils.py, and test/test_obfuscation_utils.py.

Risk Assessment

✅ Low: The three fixes (zip/tar-slip path validation, credential redaction in exception messages, expanded secret-key regex) are minimal, well-targeted, and verified correct against traversal/absolute-path/symlink edge cases and dict/tuple/list auth shapes; the only remaining items are pre-existing, lower-severity residual risks the author already surfaced and the reviewer/captain already accepted as follow-ups rather than blockers.

Testing

test

  • Outcome: ⚠️ 1 info across 1 run (11m23s)

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

⚠️ **Review** - 3 warnings
  • ⚠️ dcicutils/zip_utils.py:40 - tarfile.extractall(..., filter="data") requires Python 3.8.17+/3.9.17+/3.10.12+/3.11.4+ (PEP 706 security backport); pyproject.toml allows >=3.8.1,<3.13, so on an unpatched patch version within that range this raises TypeError (unexpected keyword argument 'filter') and breaks all tar extraction. Previously identified and accepted as a documented residual risk rather than fixed now.
  • ⚠️ dcicutils/ff_utils.py:1360 - AuthenticationError.init redacts the credential in the exception message but still stores the raw, unredacted auth on self.auth, so any caller/log sink that accesses exc.auth directly (rather than str(exc)) still leaks the credential. Previously identified and accepted as a documented residual risk rather than fixed now.
  • ⚠️ dcicutils/zip_utils.py:83 - extract_file_from_zip() calls zipf.extract(file_to_extract, path=tmp_directory_name) without the same _assert_within_directory zip-slip guard applied to unpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory, so this sibling extraction path remains unhardened. Previously identified and accepted as a documented residual risk rather than fixed now.
⚠️ **Test** - 1 info
  • ℹ️ test/test_zip_utils.py - new test file written by agent: test/test_zip_utils.py
  • a
⚠️ **Document** - 1 info
  • ℹ️ CHANGELOG.rst - CHANGELOG.rst has no entry for this security-fix commit (zip/tar-slip guard, redacted AuthenticationError message, expanded credential-redaction regex), breaking the repo's established per-change convention (author initials / date / branch / PR-<number> plus a version bump, seen in every recent entry e.g. 8.18.4/8.18.3/8.18.1). I did not add one myself because two required fields — the next PR number and the version-bump target (currently 8.18.4) — aren't yet knowable: this branch has no open PR yet (gh shows SN Refactor custom_excel for portal query #331 as the highest existing PR), and inventing either would risk exactly the kind of fabricated-attribution content the author already asked to have stripped from this diff. A maintainer should add the entry once the PR is opened.
✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

…ncomplete redaction regex

- zip_utils.py: unpack_zip_file_to_temporary_directory/unpack_tar_file_to_temporary_directory
  called extractall() with no member-path validation, allowing a malicious archive entry
  (e.g. "../../../../tmp/evil") to write outside the target temp directory (Zip Slip /
  CVE-2007-4559-class tar slip). This is reachable from structured_data.py's metadata-bundle
  ingestion path (used by submitr and portal metadata uploads). Fixed by validating every
  member's resolved path stays within the target directory before extraction, and by passing
  tarfile's filter="data" (PEP 706) as defense in depth against unsafe member types.

- ff_utils.py: UnifiedAuthenticator.AuthenticationError embedded the raw auth object (the
  actual access-key/secret credential) into its exception message, which routinely ends up
  in logs, error trackers, and Foursight check output whenever a caller passes a truthy but
  malformed auth value (e.g. a dict missing/misnaming the "secret" key). Fixed by redacting
  the auth payload before building the message while preserving the original value on
  exception.auth for legitimate handling.

- obfuscation_utils.py: _SENSITIVE_KEY_NAMES_REGEX (the heuristic backing obfuscate_json and
  trace_utils.py's TRACE_REDACT credential-scrubbing mechanism) didn't match "token",
  "authorization", "api_key", "access_key", "bearer", "jwt", or "private_key", so this
  security control silently failed to redact the most common HTTP auth credential shapes.
  Extended the regex to cover these cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant