fix: port-forward process cleanup in connect() failure paths#461
fix: port-forward process cleanup in connect() failure paths#461Goku2099 wants to merge 7 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Fixes subprocess lifecycle handling for kubectl port-forward created during Spark Connect setup, aiming to prevent orphaned/zombie processes when connect() fails or retries.
Changes:
- Wraps
connect()port-forward setup/connection logic intry/finallyto ensure cleanup on failure paths. - Adds explicit termination/reaping of prior port-forward processes before restarting in retry scenarios.
- Adds a new
test_pf_leak.pyscript intended to demonstrate subprocess leakage.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
kubeflow/spark/backends/kubernetes/backend.py |
Adds cleanup logic for port-forward subprocesses across failure/retry/timeout paths in connect(). |
test_pf_leak.py |
Introduces a standalone subprocess-leak reproduction script (currently conflicts with pytest collection). |
|
cc @andreyvelich @Shekharrajak @tariq-hasan |
Shekharrajak
left a comment
There was a problem hiding this comment.
We should refine this for production; the current method is quite cluttered and could use some simplification.
Always add some unit tests (even though unit tests not present) show that we can validate the changes and requirements .
Thanks for the review, I will simplify the logic and add a basic unit test to validate the failure scenarios. |
|
@Shekharrajak |
|
@Shekharrajak @tariq-hasan gentle ping on this PR, would appreciate a review when you have time. |
tariq-hasan
left a comment
There was a problem hiding this comment.
@Goku2099 Thanks for raising the PR. I have left a few comments.
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
…ailure path Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
2958fb3 to
253d6f6
Compare
tariq-hasan
left a comment
There was a problem hiding this comment.
@Goku2099 Thanks for addressing the review comments. I have left a few nits and additional comments on the tests.
… fix tests Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Problem
In
kubeflow/spark/backends/kubernetes/backend.pyinsideconnect(), the port-forward process created viasubprocess.Popenis not properly cleaned up in failure scenarios.Specifically:
wait()not called).Fix
pf_procin retry pathswait()is always called to properly reap subprocessestry/finallyblock to guarantee cleanup in failure scenariosImpact
kubectl port-forwardprocessesRelated
Fixes #460