@Aurashk
Contributor

@Aurashk Aurashk commented Dec 5, 2025

Description

Fixes #658

Modifies both flavours of the SSH process manager (shell and paramiko) to query and kill remote processes directly rather than through the local ssh client. This is achieved by running the remote processes headless and storing the PID of each remote process in its metadata. The PID can then be used to send signals over ssh directly to the remote process.
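
As a rough illustration of the approach (not the code in this PR), the sketch below builds the ssh command lines one could use to start a detached remote process, capture its PID, and later signal it. The function names, the use of nohup, and the host and log path in the usage note are all assumptions made for the example.

```python
import shlex


def build_launch_command(host: str, command: str, log_path: str) -> list[str]:
    """Return an ssh argv that starts `command` detached on `host`.

    The remote shell backgrounds the command and prints its PID ($!),
    so the caller can store that PID in the process metadata.
    `command` is passed through as-is and is assumed to be trusted.
    """
    remote = f"nohup {command} > {shlex.quote(log_path)} 2>&1 < /dev/null & echo $!"
    return ["ssh", host, remote]


def parse_pid(stdout: str) -> int:
    """Take the last non-empty line of the launch output as the remote PID."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return int(lines[-1])


def build_signal_command(host: str, pid: int, sig: str = "TERM") -> list[str]:
    """Return an ssh argv that sends `sig` straight to the remote PID."""
    return ["ssh", host, f"kill -s {sig} {int(pid)}"]


def build_alive_command(host: str, pid: int) -> list[str]:
    """Return an ssh argv whose exit status is 0 if the remote PID can be signalled."""
    return ["ssh", host, f"kill -0 {int(pid)}"]
```

Each returned list can be handed to subprocess.run. For instance, build_signal_command("remote-host", 1234, "QUIT") would deliver SIGQUIT to the remote process itself rather than to the local ssh client.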

This has some desired effects:

  • We get informative exit codes from the remote processes rather than the ssh client processes
  • More control over cleanup, because we can send signals directly. Otherwise all we can do is kill the client, which sends a SIGHUP to the remote process

Less desirable effects:

  • Some commands in drunc-unified-shell are slower. For example, on my laptop ps now takes a couple of seconds rather than being instant, and killing each process takes around a second. I would expect some slowdown, since we went from purely local process communication to talking to each process remotely through ssh. The implementation can probably be optimised further if this is a concern.

Note: There is an adjacent comment in the issue about fixing the terminate order to match K8s. I think that would be straightforward to add to this PR, but I will wait for feedback on the approach first.

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@Aurashk Aurashk marked this pull request as ready for review December 8, 2025 10:10
Contributor

@jamesturner246 jamesturner246 left a comment

Not sure if you can still see our meeting chat any more, but I think I have signal forwarding working without any monitor threads or other hacks.

The secret is to run the ssh client with the -t option, which forces the remote side to create a PTY and run the command inside it, rather than running the command directly without one. The advantage is that signals like SIGTERM et al. are actually forwarded to the remote command properly, thanks to the PTY's built-in signal handling.

So all it means in practice is using ssh -t ... instead of ssh ..., and it should work without the hacks.

@jamesturner246
Contributor

One caveat though: SIGTERMing the local ssh means that a SIGHUP is what actually appears on the remote command side. But for this use case I don't think it matters which of SIGTERM or SIGHUP reaches the remote command, just that a TERM-ish signal reaches the remote side reliably.

@jamesturner246
Contributor

jamesturner246 commented Dec 8, 2025

One can in fact be EXTRA safe (I would probably recommend this) by sending the ^C byte through explicitly and letting the remote program deal with shutting itself down. The ssh return code would then be the return code of the remote command. If that fails, it would fall back to getting SIGHUP.
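
A minimal sketch of that idea, assuming the ssh client was started with a pipe on stdin and a forced PTY (for example subprocess.Popen(["ssh", "-tt", host, command], stdin=subprocess.PIPE)). The helper name interrupt_remote and the grace period are hypothetical and not part of drunc.

```python
import subprocess

CTRL_C = b"\x03"  # ETX byte; a PTY with ISIG enabled turns it into SIGINT


def interrupt_remote(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the remote command to stop via ^C, then fall back to killing ssh.

    `proc` is assumed to be an ssh client whose stdin is a binary pipe and
    which has a remote PTY, so the byte reaches the remote line discipline.
    Returns the exit status of the ssh client.
    """
    try:
        proc.stdin.write(CTRL_C)
        proc.stdin.flush()
        return proc.wait(timeout=grace_seconds)
    except (BrokenPipeError, subprocess.TimeoutExpired):
        # Fallback: terminating the client drops the connection, so the
        # remote command sees a hangup (SIGHUP) instead of SIGINT.
        proc.terminate()
        return proc.wait()
```

The double -t matters in this setup: with stdin attached to a pipe, ssh has no local tty, and a single -t would not force PTY allocation.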

@Aurashk
Contributor Author

Aurashk commented Jan 16, 2026

> One caveat though: SIGTERMing the local ssh means that a SIGHUP is what actually appears on the remote command side. But for this use case I don't think it matters which of SIGTERM or SIGHUP reaches the remote command, just that a TERM-ish signal reaches the remote side reliably.

From my understanding of #658 and #649, this is exactly the problem we want to solve. We want to explicitly send signals to the remote process, but the current implementation sends signals to the local process. A SIGHUP reaching a remote process is ambiguous, as it only means the connection/terminal was closed from the client side. The current intended behaviour is to shut everything down cleanly with a SIGQUIT (but this doesn't work because it SIGQUITs the client instead).

This is all my interpretation of what the code should be doing, based on the issues and the current implementation. Ideally we should have these nuances documented somewhere, so it's clear what boot and kill should do. Maybe add a few lines in process_manager.md.

@Aurashk
Contributor Author

Aurashk commented Jan 16, 2026

> Not sure if you can still see our meeting chat any more, but I think I have signal forwarding working without any monitor threads or other hacks.

> The secret is to run the ssh client with the -t option, which forces the remote side to create a PTY and run the command inside it, rather than running the command directly without one. The advantage is that signals like SIGTERM et al. are actually forwarded to the remote command properly, thanks to the PTY's built-in signal handling.

> So all it means in practice is using ssh -t ... instead of ssh ..., and it should work without the hacks.

Monitor threads are used for a few things in the shell ssh process manager, which we should probably document better: reading logs from remote processes asynchronously, checking that the remote process is alive, and running a callback function when the remote process exits. I don't think it's that straightforward to remove them.
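
To make those three jobs concrete, here is a minimal sketch of a monitor thread around a child process. It is not the drunc implementation: start_monitor, is_alive, and the callback signatures are hypothetical, and argv stands in for whatever ssh command line the process manager builds.

```python
import subprocess
import threading
from typing import Callable


def start_monitor(
    argv: list[str],
    on_line: Callable[[str], None],
    on_exit: Callable[[int], None],
) -> tuple[subprocess.Popen, threading.Thread]:
    """Start `argv` and watch it from a background thread.

    The thread forwards each output line to `on_line` as it arrives and
    calls `on_exit` with the exit status once the process has finished.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )

    def _run() -> None:
        for line in proc.stdout:  # blocks until output arrives or the pipe closes
            on_line(line.rstrip("\n"))
        on_exit(proc.wait())

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return proc, thread


def is_alive(proc: subprocess.Popen) -> bool:
    """Liveness check: poll() returns None while the process is running."""
    return proc.poll() is None
```

Note that when argv is an ssh client, both the exit status and the liveness check describe the client, which is why the PR queries the remote PID instead.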

We are already using -t in all ssh commands. I actually think we might be overusing it and should use -T for some of the ssh calls for better efficiency. I'm not too sure why -t is already used everywhere; it may be that some of the processes require a full interactive shell, so the process manager wouldn't work robustly without it.

Development

Successfully merging this pull request may close these issues.

[Bug]: Terminate incorret implementation in SSH PM

3 participants