remote: make remote_setup reliably provision the GPU node#11
Open
ykhrustalev wants to merge 7 commits into
Open
remote: make remote_setup reliably provision the GPU node#11ykhrustalev wants to merge 7 commits into
ykhrustalev wants to merge 7 commits into
Conversation
ssh_run forces `bash -lc` so venv activate and other bash-isms work, but that means bash never sees PATH additions written by tool installers into fish-only config (e.g. uv's installer drops ~/.config/fish/conf.d/ uv.env.fish when $SHELL is fish, and never touches ~/.bashrc). Result: bootstrap's `command -v uv` check fails on fish-shell hosts and we silently fall back to pip — or blow up when pip is also missing. Prepend $HOME/.local/bin:$HOME/.cargo/bin to PATH in the bash -lc wrapper, plus a defense-in-depth fallback in detect_environment that tests the uv binary directly.
The previous commit prepended ~/.local/bin and ~/.cargo/bin to PATH in ssh_run, which misses /snap/bin (snap installs) and any other layout we didn't think of. Replace that with an explicit search: try `command -v uv` first, then probe a list of canonical install dirs (~/.local/bin, ~/.cargo/bin, /snap/bin, /usr/local/bin, /opt/homebrew/bin, /home/linuxbrew/.linuxbrew/bin, /usr/bin) in a single round-trip. detect_environment now returns the absolute path to uv (or None) instead of a bool, and bootstrap_remote invokes uv via that path so it works regardless of what's on bash's PATH. Reverts the ssh_run PATH-mangling since the explicit lookup makes it unnecessary.
Before this change, ssh_run wraps every remote command in `bash -lc`, so bash dotfiles are the only source of truth for PATH. The Astral uv installer (and most Rust/snap/pipx-style installers) writes its PATH config into the *login* shell's rc — `~/.config/fish/conf.d/...` for fish, `~/.zprofile` for zsh — leaving bash blind to it. Result: detection silently misses tools that are installed and on PATH for every interactive command the user actually runs. Read $SHELL once per bootstrap, then resolve each tool by running `<login-shell> -lc 'command -v <tool>'`. The -lc invocation form is accepted by bash, zsh, fish, and dash, so it covers everyone short of csh / nushell. Skip the lookup entirely for /bin/false and /sbin/nologin service-account shells. Three-layer resolution: login shell first, then bash PATH, then a direct test -x walk over canonical install dirs (snap, Homebrew, ~/.local/bin, …) so even tools no shell rc has been taught about still get found. Both `python3` and `uv` use the same path; the absolute path is threaded through bootstrap so venv creation and `uv pip install` work regardless of what's on bash's PATH.
Three follow-ups from review of 3ddaeff: (#2) `_which_via_login_shell` previously accepted any line that started with "/". A MOTD ending in e.g. "/etc/motd updated 2024-01-01" would be returned as the tool path, then bootstrap would try to run "/etc/motd updated 2024-01-01 venv …" and fail confusingly. `command -v` always emits a single whitespace-free token, so reject lines containing a space. (#5) Add a regression test that the absolute python3 path returned by detect_environment is the one threaded into `<python3> -m venv` and likewise for `<uv> venv` / `<uv> pip install`. Fixes the existing TestBootstrapRemote mocks while doing so — they made `test -x .lqh-env/bin/python` return rc=0, which made venv creation get skipped entirely and the assertions vacuously fail. (#4) Rewrite test_no_login_shell_skips_first_probe's docstring and assertions: the test passed login_shell=None, so checking for "/bin/false" in the calls was always trivially true. Replace with an ordering assertion (first call must be the bash probe, not a nested `<shell> -lc`).
The outer ssh_run wrapper was invoked as `bash -lc ...` from the user's login shell. If that shell is fish/zsh and the user has a function or alias named `bash` in their config, it would silently shadow the real binary. Routing through `/usr/bin/env` skips shell-level name resolution (env is exec'd as an external binary and looks up bash on PATH directly). Also makes the wrapper more robust on non-FHS systems (NixOS lacks /bin/bash but provides /usr/bin/env by systemd-tmpfiles convention) and on hardened images where the login shell may have a minimal PATH but still includes /usr/bin. Cost: one extra exec (~µs) and a new assumption that /usr/bin/env exists — true on every Linux/macOS/WSL system coreutils ships on.
Reverting most of 3ddaeff and the parts of 71a454e that depended on it. The shell-aware lookup (_detect_login_shell + _which_via_login_shell + _locate_tool) covered an edge case — "uv installed in a non-standard location that happens to be on fish/zsh's PATH" — that nobody on a real GPU box actually hits. The canonical-dirs probe in _locate_uv already covers Astral's installer, cargo, snap, /usr/local, Homebrew, and /usr/bin, which between them are how anyone actually installs uv. python3 absolute-path threading was likewise wasted effort: /usr/bin/python3 is on bash's PATH on every GPU host worth supporting, and bare `python3 -m venv` has never had a problem. Drops ~150 lines of code and the entire MOTD-output parser, plus all the tests that only existed because of the indirection. Keeps: - _locate_uv with canonical-dirs probe (the actual fix from f32b5af). - The mock fix that lets TestBootstrapRemote actually run the venv creation branch. - The "absolute uv path is the one threaded into `uv venv` and `uv pip install`" assertion. - /usr/bin/env bash wrapping (20bdd14, separate concern).
remote_setup only found the local lqh source when it was installed as an editable checkout — _find_lqh_package_root required a pyproject.toml sitting next to lqh/. For a normal `pip install` (lqh in site-packages) it returned None and bootstrap fell back to `pip install lqh[train]` from PyPI. `lqh` on PyPI is an unrelated package, so the remote ended up with the wrong code and `python -m lqh.train` failed with `ModuleNotFoundError: No module named 'lqh.train'`. The importable lqh/ directory is always available — we're running it. bootstrap_remote now always rsyncs that directory; for dependency metadata it uses the checkout's pyproject.toml when one exists, else synthesizes a minimal one from importlib.metadata. The PyPI fallback is removed entirely: the remote always gets exactly the lqh running locally. - _local_lqh_root: parent of lqh/__init__.py; always resolvable. - _rsync_lqh_source: extracted from bootstrap_remote, unit-testable. - _synthesize_pyproject: minimal hatchling pyproject.toml built from the installed distribution's version + requires (base + train extra). - compute_local_lqh_hash works for non-checkout installs too, so remote_status drift detection no longer reports "pypi".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
lqh remote_setupfailed to provision a GPU node end-to-end in two independent ways. Both produced confusing downstream failures rather than a clear error.1.
uvnot found when the login shell isn't bashssh_helpers.ssh_runwraps every remote command inbash -lc— deliberately, because venvactivate,&&-chaining,source,export, and the#!/bin/bashlauncher all need bash. But that makes bash's dotfiles the only source of truth for PATH. Astral'suvinstaller keys off$SHELL: for a fish user it writes~/.config/fish/conf.d/uv.env.fishand never touches~/.bashrc. Sobash -lc 'command -v uv'found nothing even thoughuvwas installed — and bootstrap silently downgraded topip(or failed if pip was missing too).2. Wrong
lqhinstalled on the remote →ModuleNotFoundError: No module named 'lqh.train'_find_lqh_package_root()only located the local source whenlqhwas an editable checkout (it required apyproject.tomlnext tolqh/). For a normalpip install(lqh insite-packages) it returnedNone, and bootstrap fell back topip install lqh[train]from PyPI.lqhon PyPI is an unrelated package — so the remote got the wrong code andpython -m lqh.trainblew up.Fix
Find
uvwherever it actually is. New_locate_uvtriescommand -v uv, then probes the canonical install dirs directly (~/.local/bin,~/.cargo/bin,/snap/bin,/usr/local/bin,/opt/homebrew/bin,/home/linuxbrew/.linuxbrew/bin,/usr/bin). Returns the absolute path, which is threaded into<uv> venvand<uv> pip installso bash never has to resolveuvon PATH.Always install the lqh that's running locally. The importable
lqh/directory is always available (Path(lqh.__file__).parent).bootstrap_remotenow always rsyncs it; for the[train]dependency metadata it uses the checkout'spyproject.tomlwhen present, else synthesizes a minimal one fromimportlib.metadata. The PyPI fallback is removed entirely — the remote always gets exactly the local code.compute_local_lqh_hashworks for non-checkout installs too, soremote_statusdrift detection no longer reports "pypi".Harden the bash wrapper.
ssh_runnow invokes/usr/bin/env bash -lc …instead ofbash -lc …— skips shell-level name resolution so a function/alias namedbashcan't shadow the binary, and works on non-FHS layouts (NixOS).Deliberately not done
<shell> -lc 'command -v uv'probing. Reviewed and dropped — the canonical-dirs probe already covers every install method in real use, and the shell-aware layer added a fragile MOTD parser and extra round-trips for an edge case nobody hits./usr/bin/python3is on bash's PATH on every Linux/macOS GPU host; barepython3 -m venvworks.Test plan
TestLocateUv— PATH hit,/snap/binfallback, probe covers every canonical location, not-found.TestLocalLqhRoot/TestSynthesizePyproject— the importable-root resolver and the syntheticpyproject.toml(built structure, base + train deps).TestBootstrapRemote— uv path threaded into<uv> venv/<uv> pip install;python3 -m venv+ plain pip when uv is absent; synthesizespyproject.tomland installs from the synced tree (never PyPI) when there's no checkout.TestSSHRun::test_invokes_bash_via_env— pins the/usr/bin/env bash -lc …form.Full suite: 276 passed, 0 failed, 26 skipped.
Session