fix(pathfinder): make find_nvidia_binary_utility deterministic, never search CWD#2196
fix(pathfinder): make find_nvidia_binary_utility deterministic, never search CWD#2196aryanputta wants to merge 4 commits into
Conversation
… search CWD find_nvidia_binary_utility assembled a bounded list of trusted directories (NVIDIA wheel bin/, CONDA_PREFIX, CUDA_HOME/CUDA_PATH) and then delegated to shutil.which(name, path=trusted_dirs). On Windows shutil.which prepends the process current working directory to the search even when an explicit path= is supplied, so a binary located in an arbitrary (possibly attacker-writable) CWD could be returned in preference to the trusted CUDA / Conda / wheel binary. That violates the pathfinder contract of a deterministic lookup over a documented, bounded set of trusted roots. Replace the shutil.which delegation with an explicit resolver that searches only the trusted directories, in order, returning the first executable match. The current working directory and ambient PATH are never consulted. POSIX execute-bit (X_OK) and Windows extension semantics are preserved, so behavior is unchanged except for removing the CWD/PATH leakage. Names resolved in the existing trusted dirs return exactly as before. Rewrites the search-path tests to assert the deterministic probe order and adds TestResolveInTrustedDirs covering CWD isolation, first-match-wins, empty/duplicate dir skipping, and POSIX non-executable rejection. Fixes NVIDIA#2119
|
Hi @aryanputta, this is great, but in isolation it'll probably break some users. This was pointed out by @kkraus14 here before:
I ran the context and my thoughts by GPT around the same time, which lead to the conclusion at the end of this comment:
I still think that's the right compromise:
The behavior difference is significant enough though that I'd want to bump the pathfinder minor version, which we can do; I don't have concerns about that. How do you feel about expanding this PR to fold in the CTK-root canary fallback? — I believe the code changes will still be quite modest. |
…utility After the deterministic search over the explicit trusted directories (NVIDIA wheel bin/, CONDA_PREFIX, CUDA_HOME/CUDA_PATH) misses, fall back to a CTK-root canary probe: resolve cudart through the OS dynamic loader, which honors LD_LIBRARY_PATH on Linux and the native DLL search on Windows, derive the CUDA Toolkit root from its absolute path, and search that root's bin layout. This addresses the concern raised on NVIDIA#2196: users who follow the CUDA Linux installation guide set LD_LIBRARY_PATH for libraries and PATH for executables. The bounded finder alone would stop finding the utility for them because PATH is intentionally never consulted. The canary fallback recovers that case through LD_LIBRARY_PATH instead of PATH. LD_LIBRARY_PATH is still an attack vector, but a significantly harder one to exploit than PATH, and the ambient PATH and process CWD remain unused. The canary runs only after the explicit trusted dirs miss, so the common wheel/conda/CUDA_HOME cases never spawn the resolver subprocess. The canary -> CTK-root resolution is factored into a shared resolve_ctk_root_via_canary helper reused by the dynamic-library CTK-root canary flow. Adds tests for the fallback (found, ordering, Windows bin layout, not consulted when found earlier, cached) and for resolve_ctk_root_via_canary. Adds 1.6.0 release notes for the minor version bump.
|
Thanks @rwgk, that compromise makes sense and I've folded it in (pushed What the new commit adds Search order is unchanged for the explicit trusted dirs, with the canary as a strict fallback:
Key points:
Version Added Tests Added coverage for the fallback in |
pre-commit.ci mypy flagged returning Any from resolve_ctk_root_via_canary and _resolve_ctk_root_via_canary (both declared -> str | None), because derive_ctk_root resolves to Any under the pathfinder mypy config. Bind the result to an annotated local before returning, matching the pattern used elsewhere in the package.
|
Quick check-in: is there anything else you would like changed here after the CTK-root canary fallback update, or is this mainly waiting on review / runner validation now? Happy to adjust if the version note or fallback behavior should be shaped differently. |
rwgk
left a comment
There was a problem hiding this comment.
Generally, in private functions I'd add docstrings only to explain things that are not already obvious. Simple comments are often most efficient.
| # CUDA Toolkit canary library used to derive the toolkit root when it is only | ||
| # visible through the dynamic loader. ``cudart`` always ships with the CTK and | ||
| # matches the anchor used by the dynamic-library CTK-root canary flow. | ||
| _CTK_ROOT_CANARY_ANCHOR_LIBNAME = "cudart" |
There was a problem hiding this comment.
Headers, dynamic-library descriptors, and now binary discovery all rely on the same "cudart" choice. Could you please centralize the anchor in a neutral shared module, e.g. _utils/ctk_root_canary.py?
I think this will go well with the existing code:
CTK_ROOT_CANARY_ANCHOR_LIBNAMES = ("cudart",)
| On Windows the CTK ships binaries under ``bin/x64`` (CTK 13), ``bin/x86_64``, | ||
| and ``bin`` (CTK 12); on Linux they live in ``bin``. |
There was a problem hiding this comment.
Please remove: it essentially just explains the code immediately below, which is actually easier to read than this prose; and it's a private helper function.
| ``cudart`` is resolved by the OS dynamic loader, which honors | ||
| ``LD_LIBRARY_PATH`` on Linux and the native DLL search on Windows, and the | ||
| toolkit root is derived from its absolute path. The ambient ``PATH`` is | ||
| never consulted. The loader module is imported lazily to avoid pulling the |
There was a problem hiding this comment.
Please remove the first two sentences here: they only explain code elsewhere; for a private helper function that's more noise than signal.
Also throughout: Please avoid all mentions like "unlike shutil.which" and "ambient PATH". That only makes sense in the context of this particular PR, but will be puzzling later. Explaining the which/PATH/CWD pitfalls is suitable in the PR description and release notes, but not in code comments (maybe in general terms, in one or two places).
|
|
||
|
|
||
| def _resolve_in_trusted_dirs(normalized_name: str, dirs: list[str]) -> str | None: | ||
| """Resolve ``normalized_name`` against ``dirs`` only, in order. |
There was a problem hiding this comment.
Resolve ``normalized_name`` against ``dirs`` in order.
("only" will be puzzling later)
| Unlike ``shutil.which``, this never consults the current working directory | ||
| or the ambient ``PATH``. On Windows ``shutil.which`` prepends the process | ||
| CWD to the search even when an explicit ``path=`` is supplied, which lets a | ||
| binary sitting in an arbitrary CWD shadow the trusted CUDA / Conda / wheel | ||
| binary that pathfinder is contracted to discover. Searching the trusted | ||
| directories explicitly keeps the lookup deterministic and bounded. |
There was a problem hiding this comment.
See above: please delete this entire paragraph.
| if not directory or directory in seen: | ||
| continue |
There was a problem hiding this comment.
Skipping over empty strings can easily mask bugs (i.e. string slicing or split bug). None doesn't make sense here, too.
A very easy way to be stricter/safer:
if directory in seen:
continue
assert directory
| and search that root's bin layout. This finds the utility for users | ||
| who follow the CUDA install guide and set ``LD_LIBRARY_PATH`` for | ||
| libraries without also setting ``CUDA_HOME`` / ``CUDA_PATH``. |
There was a problem hiding this comment.
I think it's better to delete the last sentence: LD_LIBRARY_PATH is mentioned in the previous sentence already, and the second part will just make people wonder unnecessarily/distractingly ("why would one do that?").
| CTK root listed above; the process working directory and the ambient | ||
| ``PATH`` are never consulted. |
There was a problem hiding this comment.
See above: delete part after semicolon.
|
|
||
| # 4. CTK-root canary fallback: only when the explicit trusted dirs above | ||
| # miss. Resolve cudart via the dynamic loader (honors LD_LIBRARY_PATH), | ||
| # derive the toolkit root, and search its bin layout. PATH is never used. |
There was a problem hiding this comment.
See above: delete the last line here.
| * :func:`find_nvidia_binary_utility` now resolves binaries through a bounded, | ||
| deterministic search of trusted directories instead of ``shutil.which``. The | ||
| process working directory and the ambient ``PATH`` are never consulted, which | ||
| closes a lookup ambiguity on Windows where ``shutil.which`` prepends the CWD | ||
| even when an explicit search path is supplied. | ||
| (`PR #2196 <https://github.com/NVIDIA/cuda-python/pull/2196>`_) | ||
|
|
||
| * :func:`find_nvidia_binary_utility` gains a CTK-root canary fallback. When the | ||
| NVIDIA wheel, ``CONDA_PREFIX``, and ``CUDA_HOME`` / ``CUDA_PATH`` directories | ||
| all miss, ``cudart`` is resolved through the OS dynamic loader, which honors | ||
| ``LD_LIBRARY_PATH`` on Linux and the native DLL search on Windows. The CUDA | ||
| Toolkit root is derived from that path and its ``bin`` layout is searched. | ||
| This locates the utility for users who follow the CUDA installation guide and | ||
| set ``LD_LIBRARY_PATH`` for libraries without also setting ``CUDA_HOME`` / | ||
| ``CUDA_PATH``, while still never falling back to ``PATH``. | ||
| (`PR #2196 <https://github.com/NVIDIA/cuda-python/pull/2196>`_) |
There was a problem hiding this comment.
Could you please combine this into one bullet point? I think it's best to also delete the "This locates the utility for users who follow..." sentence here.
|
@rwgk I pushed |
|
@rwgk gentle nudge on this one. The CTK-root canary fallback is folded in ( |
Summary
Fixes #2119.
find_nvidia_binary_utilitypreviously delegated toshutil.which(name, path=trusted_dirs). On Windows, CPython'sshutil.whichprepends the process current working directory even when an explicitpath=is supplied. That allowed a binary in an arbitrary CWD to shadow the CUDA / Conda / wheel binary that pathfinder is meant to discover through a bounded search.This PR replaces that delegation with an explicit resolver over trusted directories only, then folds in the maintainer-requested CTK-root canary fallback so users who follow the CUDA installation guide continue to work when the toolkit is visible through the dynamic loader.
Fix
The binary lookup order is now:
bin/directories from site-packages.CONDA_PREFIX.CUDA_HOME/CUDA_PATH.cudartthrough the OS dynamic loader, derive the CUDA Toolkit root, and search that root'sbinlayout.The current working directory and ambient
PATHare never consulted.Notes:
Tests
bin/x64,bin/x86_64, andbinlayout.resolve_ctk_root_via_canary(...).Local checks:
ruff checkandruff format --checkpass on the touched files.