Skip to content

remote: make remote_setup reliably provision the GPU node#11

Open
ykhrustalev wants to merge 7 commits into
mainfrom
claude/fix-gpu-shell-compatibility-FowxA
Open

remote: make remote_setup reliably provision the GPU node#11
ykhrustalev wants to merge 7 commits into
mainfrom
claude/fix-gpu-shell-compatibility-FowxA

Conversation

@ykhrustalev

@ykhrustalev ykhrustalev commented May 19, 2026

Copy link
Copy Markdown
Member

Problem

lqh remote_setup failed to provision a GPU node end-to-end in two independent ways. Both produced confusing downstream failures rather than a clear error.

1. uv not found when the login shell isn't bash

ssh_helpers.ssh_run wraps every remote command in bash -lc — deliberately, because venv activate, &&-chaining, source, export, and the #!/bin/bash launcher all need bash. But that makes bash's dotfiles the only source of truth for PATH. Astral's uv installer keys off $SHELL: for a fish user it writes ~/.config/fish/conf.d/uv.env.fish and never touches ~/.bashrc. So bash -lc 'command -v uv' found nothing even though uv was installed — and bootstrap silently downgraded to pip (or failed if pip was missing too).

2. Wrong lqh installed on the remote → ModuleNotFoundError: No module named 'lqh.train'

_find_lqh_package_root() only located the local source when lqh was an editable checkout (it required a pyproject.toml next to lqh/). For a normal pip install (lqh in site-packages) it returned None, and bootstrap fell back to pip install lqh[train] from PyPI. lqh on PyPI is an unrelated package — so the remote got the wrong code and python -m lqh.train blew up.

Fix

Find uv wherever it actually is. New _locate_uv tries command -v uv, then probes the canonical install dirs directly (~/.local/bin, ~/.cargo/bin, /snap/bin, /usr/local/bin, /opt/homebrew/bin, /home/linuxbrew/.linuxbrew/bin, /usr/bin). Returns the absolute path, which is threaded into <uv> venv and <uv> pip install so bash never has to resolve uv on PATH.

Always install the lqh that's running locally. The importable lqh/ directory is always available (Path(lqh.__file__).parent). bootstrap_remote now always rsyncs it; for the [train] dependency metadata it uses the checkout's pyproject.toml when present, else synthesizes a minimal one from importlib.metadata. The PyPI fallback is removed entirely — the remote always gets exactly the local code. compute_local_lqh_hash works for non-checkout installs too, so remote_status drift detection no longer reports "pypi".

Harden the bash wrapper. ssh_run now invokes /usr/bin/env bash -lc … instead of bash -lc … — skips shell-level name resolution so a function/alias named bash can't shadow the binary, and works on non-FHS layouts (NixOS).

Deliberately not done

  • No login-shell detection. An earlier iteration of this branch added <shell> -lc 'command -v uv' probing. Reviewed and dropped — the canonical-dirs probe already covers every install method in real use, and the shell-aware layer added a fragile MOTD parser and extra round-trips for an edge case nobody hits.
  • No python3 path resolution. /usr/bin/python3 is on bash's PATH on every Linux/macOS GPU host; bare python3 -m venv works.

Test plan

  • TestLocateUv — PATH hit, /snap/bin fallback, probe covers every canonical location, not-found.
  • TestLocalLqhRoot / TestSynthesizePyproject — the importable-root resolver and the synthetic pyproject.toml (built structure, base + train deps).
  • TestBootstrapRemote — uv path threaded into <uv> venv / <uv> pip install; python3 -m venv + plain pip when uv is absent; synthesizes pyproject.toml and installs from the synced tree (never PyPI) when there's no checkout.
  • TestSSHRun::test_invokes_bash_via_env — pins the /usr/bin/env bash -lc … form.

Full suite: 276 passed, 0 failed, 26 skipped.

Session

claude added 7 commits May 19, 2026 19:08
ssh_run forces `bash -lc` so venv activate and other bash-isms work,
but that means bash never sees PATH additions written by tool installers
into fish-only config (e.g. uv's installer drops ~/.config/fish/conf.d/
uv.env.fish when $SHELL is fish, and never touches ~/.bashrc). Result:
bootstrap's `command -v uv` check fails on fish-shell hosts and we
silently fall back to pip — or blow up when pip is also missing.

Prepend $HOME/.local/bin:$HOME/.cargo/bin to PATH in the bash -lc
wrapper, plus a defense-in-depth fallback in detect_environment that
tests the uv binary directly.
The previous commit prepended ~/.local/bin and ~/.cargo/bin to PATH in
ssh_run, which misses /snap/bin (snap installs) and any other layout we
didn't think of. Replace that with an explicit search: try `command -v
uv` first, then probe a list of canonical install dirs (~/.local/bin,
~/.cargo/bin, /snap/bin, /usr/local/bin, /opt/homebrew/bin,
/home/linuxbrew/.linuxbrew/bin, /usr/bin) in a single round-trip.

detect_environment now returns the absolute path to uv (or None) instead
of a bool, and bootstrap_remote invokes uv via that path so it works
regardless of what's on bash's PATH. Reverts the ssh_run PATH-mangling
since the explicit lookup makes it unnecessary.
Before this change, ssh_run wraps every remote command in `bash -lc`,
so bash dotfiles are the only source of truth for PATH. The Astral
uv installer (and most Rust/snap/pipx-style installers) writes its
PATH config into the *login* shell's rc — `~/.config/fish/conf.d/...`
for fish, `~/.zprofile` for zsh — leaving bash blind to it. Result:
detection silently misses tools that are installed and on PATH for
every interactive command the user actually runs.

Read $SHELL once per bootstrap, then resolve each tool by running
`<login-shell> -lc 'command -v <tool>'`. The -lc invocation form is
accepted by bash, zsh, fish, and dash, so it covers everyone short
of csh / nushell. Skip the lookup entirely for /bin/false and
/sbin/nologin service-account shells.

Three-layer resolution: login shell first, then bash PATH, then a
direct test -x walk over canonical install dirs (snap, Homebrew,
~/.local/bin, …) so even tools no shell rc has been taught about
still get found. Both `python3` and `uv` use the same path; the
absolute path is threaded through bootstrap so venv creation and
`uv pip install` work regardless of what's on bash's PATH.
Three follow-ups from review of 3ddaeff:

(#2) `_which_via_login_shell` previously accepted any line that
started with "/". A MOTD ending in e.g. "/etc/motd updated 2024-01-01"
would be returned as the tool path, then bootstrap would try to run
"/etc/motd updated 2024-01-01 venv …" and fail confusingly. `command -v`
always emits a single whitespace-free token, so reject lines containing
a space.

(#5) Add a regression test that the absolute python3 path returned by
detect_environment is the one threaded into `<python3> -m venv` and
likewise for `<uv> venv` / `<uv> pip install`. Fixes the existing
TestBootstrapRemote mocks while doing so — they made `test -x
.lqh-env/bin/python` return rc=0, which made venv creation get skipped
entirely and the assertions vacuously fail.

(#4) Rewrite test_no_login_shell_skips_first_probe's docstring and
assertions: the test passed login_shell=None, so checking for
"/bin/false" in the calls was always trivially true. Replace with an
ordering assertion (first call must be the bash probe, not a nested
`<shell> -lc`).
The outer ssh_run wrapper was invoked as `bash -lc ...` from the
user's login shell. If that shell is fish/zsh and the user has a
function or alias named `bash` in their config, it would silently
shadow the real binary. Routing through `/usr/bin/env` skips
shell-level name resolution (env is exec'd as an external binary
and looks up bash on PATH directly).

Also makes the wrapper more robust on non-FHS systems (NixOS lacks
/bin/bash but provides /usr/bin/env by systemd-tmpfiles convention)
and on hardened images where the login shell may have a minimal
PATH but still includes /usr/bin.

Cost: one extra exec (~µs) and a new assumption that /usr/bin/env
exists — true on every Linux/macOS/WSL system coreutils ships on.
Reverting most of 3ddaeff and the parts of 71a454e that depended on
it. The shell-aware lookup (_detect_login_shell + _which_via_login_shell
+ _locate_tool) covered an edge case — "uv installed in a non-standard
location that happens to be on fish/zsh's PATH" — that nobody on a real
GPU box actually hits. The canonical-dirs probe in _locate_uv already
covers Astral's installer, cargo, snap, /usr/local, Homebrew, and
/usr/bin, which between them are how anyone actually installs uv.

python3 absolute-path threading was likewise wasted effort: /usr/bin/python3
is on bash's PATH on every GPU host worth supporting, and bare
`python3 -m venv` has never had a problem.

Drops ~150 lines of code and the entire MOTD-output parser, plus
all the tests that only existed because of the indirection. Keeps:

- _locate_uv with canonical-dirs probe (the actual fix from f32b5af).
- The mock fix that lets TestBootstrapRemote actually run the venv
  creation branch.
- The "absolute uv path is the one threaded into `uv venv` and
  `uv pip install`" assertion.
- /usr/bin/env bash wrapping (20bdd14, separate concern).
remote_setup only found the local lqh source when it was installed as
an editable checkout — _find_lqh_package_root required a pyproject.toml
sitting next to lqh/. For a normal `pip install` (lqh in site-packages)
it returned None and bootstrap fell back to `pip install lqh[train]`
from PyPI. `lqh` on PyPI is an unrelated package, so the remote ended
up with the wrong code and `python -m lqh.train` failed with
`ModuleNotFoundError: No module named 'lqh.train'`.

The importable lqh/ directory is always available — we're running it.
bootstrap_remote now always rsyncs that directory; for dependency
metadata it uses the checkout's pyproject.toml when one exists, else
synthesizes a minimal one from importlib.metadata. The PyPI fallback
is removed entirely: the remote always gets exactly the lqh running
locally.

- _local_lqh_root: parent of lqh/__init__.py; always resolvable.
- _rsync_lqh_source: extracted from bootstrap_remote, unit-testable.
- _synthesize_pyproject: minimal hatchling pyproject.toml built from
  the installed distribution's version + requires (base + train extra).
- compute_local_lqh_hash works for non-checkout installs too, so
  remote_status drift detection no longer reports "pypi".
@ykhrustalev ykhrustalev changed the title remote: find uv reliably when the user's login shell isn't bash remote: make remote_setup reliably provision the GPU node May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants