fix(trainer): changed ValueError to raise RuntimeError in localprocess and container in backends #433 #469

Open
aashvgit wants to merge 1 commit into kubeflow:main from aashvgit:fix/433-valueerror-to-runtimeerror

Conversation

@aashvgit

What this PR does / why we need it:

  • get_job, get_job_logs, wait_for_job_status, delete_job in LocalProcess
    backend now raise RuntimeError for job-not-found
  • _get_job_containers and __get_trainjob_from_containers in Container
    backend now raise RuntimeError for job/network/runtime not found
  • Keep ValueError for input validation (polling_interval > timeout)
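The split described above can be illustrated with a minimal sketch. `LocalJobStore`, `add_job`, and `classify` are hypothetical names invented for this example and are not the Kubeflow SDK implementation; only the method names `get_job` and `wait_for_job_status`, the exception types, and the `polling_interval > timeout` check come from the description.

```python
class LocalJobStore:
    """Hypothetical stand-in for a backend; not the real Kubeflow SDK class."""

    def __init__(self):
        self._jobs = {}

    def add_job(self, name, status="Running"):
        # Hypothetical helper so the example has something to look up.
        self._jobs[name] = status

    def get_job(self, name):
        # Operational failure: the caller's input is well-formed,
        # but the job does not exist, so raise RuntimeError.
        if name not in self._jobs:
            raise RuntimeError(f"No TrainJob with name {name}")
        return self._jobs[name]

    def wait_for_job_status(self, name, timeout=600, polling_interval=2):
        # Input validation: the arguments themselves are inconsistent,
        # so this stays a ValueError.
        if polling_interval > timeout:
            raise ValueError(
                f"polling_interval ({polling_interval}) must not exceed timeout ({timeout})"
            )
        return self.get_job(name)


def classify(call):
    """Show how a caller can tell the two failure modes apart."""
    try:
        call()
    except ValueError:
        return "bad input"
    except RuntimeError:
        return "not found"
    return "ok"


store = LocalJobStore()
store.add_job("basic-job")

found = classify(lambda: store.get_job("basic-job"))
missing = classify(lambda: store.get_job("nonexistent-job"))
invalid = classify(
    lambda: store.wait_for_job_status("basic-job", timeout=1, polling_interval=5)
)
```

With this separation, a caller can retry or report a missing job differently from a call that was malformed to begin with.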

Fixes #433

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aashvgit marked this pull request as ready for review on April 21, 2026 18:18
Copilot AI review requested due to automatic review settings April 21, 2026 18:18
Contributor

Copilot AI left a comment

Pull request overview

Aligns LocalProcess and Container backends with the documented TrainerClient error contract by raising RuntimeError (instead of ValueError) for operational “not found” conditions such as missing jobs, containers, networks, or runtimes.

Changes:

  • LocalProcess backend: switch job-not-found and runtime-not-found errors to RuntimeError while keeping ValueError for input validation (e.g., polling_interval > timeout).
  • Container backend: switch job/container/network/runtime lookup failures to RuntimeError and update related docstrings.
  • Update unit tests to assert RuntimeError for these not-found scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| kubeflow/trainer/backends/localprocess/backend.py | Raises RuntimeError for runtime/job not found to match the public API contract. |
| kubeflow/trainer/backends/localprocess/backend_test.py | Updates expectations to RuntimeError for missing runtime/job cases. |
| kubeflow/trainer/backends/container/backend.py | Raises RuntimeError for missing containers/network/runtime during job reconstruction and lookup. |
| kubeflow/trainer/backends/container/backend_test.py | Updates expectations to RuntimeError for missing job cases. |

Contributor

@kramaranya kramaranya left a comment

Minor formatting issue found — see inline comment.

Contributor

@kramaranya kramaranya left a comment

Thank you @aashvgit!

  local_runtime = next((rt for rt in local_runtimes if rt.name == runtime.name), None)
  if not local_runtime:
-     raise ValueError(f"Runtime '{runtime.name}' not found.")
+     raise RuntimeError(f"Runtime '{runtime.name}' not found.")

Suggested change
raise RuntimeError(f"Runtime '{runtime.name}' not found.")
raise RuntimeError(f"Runtime '{runtime.name}' not found.")

@aashvgit
Author

Thanks @kramaranya for the review. I've fixed the line now.

@kramaranya
Contributor

Could you rebase the changes and sign your commits please?

@aashvgit force-pushed the fix/433-valueerror-to-runtimeerror branch from fb06144 to 811cfb9 on April 30, 2026 14:40
@aashvgit requested review from Copilot and kramaranya on May 1, 2026 18:53
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment on lines 457 to 461
config={"name": BASIC_TRAIN_JOB_NAME, "timeout": 10, "polling_interval": -1},
expected_error=ValueError,
config={"job_name": "nonexistent-job"},
expected_error=RuntimeError,
),

Copilot AI May 1, 2026

This TestCase currently passes config and expected_error twice, which is a Python syntax error and will prevent the test module from importing; split the job-not-found assertion into its own TestCase entry (using config={"name": "nonexistent-job"} and expected_error=RuntimeError, since wait_for_job_status(**config) expects name).
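As a rough illustration of the fix suggested in this comment, the sketch below first checks that a call with repeated keyword arguments fails to compile, then lists the two scenarios as separate entries. The `TestCase` dataclass, the case names, and the value of `BASIC_TRAIN_JOB_NAME` are assumptions made for this example; the project's own test helper may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Type

BASIC_TRAIN_JOB_NAME = "basic-job"  # assumed value, for illustration only


@dataclass
class TestCase:
    """Hypothetical stand-in for the test helper used in the test module."""

    name: str
    config: Dict[str, Any] = field(default_factory=dict)
    expected_error: Optional[Type[Exception]] = None


# Repeating a keyword argument in one call is rejected at compile time,
# so a module containing such a call cannot even be imported.
broken_call = (
    'dict(config={"timeout": 10}, expected_error=ValueError, '
    'config={"job_name": "nonexistent-job"}, expected_error=RuntimeError)'
)
try:
    compile(broken_call, "<example>", "eval")
    duplicate_kwargs_compile = True
except SyntaxError:
    duplicate_kwargs_compile = False

# The fix: one TestCase per scenario, each with a single config and a
# single expected_error. The job-not-found case uses the "name" key.
cases = [
    TestCase(
        name="invalid polling interval",
        config={"name": BASIC_TRAIN_JOB_NAME, "timeout": 10, "polling_interval": -1},
        expected_error=ValueError,
    ),
    TestCase(
        name="job not found",
        config={"name": "nonexistent-job"},
        expected_error=RuntimeError,
    ),
]
```

Keeping the two scenarios in separate entries also keeps the input-validation case and the not-found case independently reportable when one of them fails.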

Contributor

@aashvgit could you address it please?

@aashvgit force-pushed the fix/433-valueerror-to-runtimeerror branch from 811cfb9 to fb1c808 on May 5, 2026 09:52
@aashvgit requested a review from Copilot on May 5, 2026 09:52
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines 509 to +510
  if not containers:
-     raise ValueError(f"No TrainJob with name {name}")
+     raise RuntimeError(f"No TrainJob with name {name}")
Comment on lines 678 to +684
  network_id = containers[0]["labels"].get(f"{self.label_prefix}/network-id")
  if not network_id:
-     raise ValueError(f"TrainJob {job_name} is missing network metadata")
+     raise RuntimeError(f"TrainJob {job_name} is missing network metadata")

  network_info = self._adapter.get_network(network_id)
  if not network_info:
-     raise ValueError(f"TrainJob {job_name} network not found")
+     raise RuntimeError(f"TrainJob {job_name} network not found")
Comment on lines 690 to +696
  try:
      job_runtime = self.get_runtime(runtime_name) if runtime_name else None
  except Exception as e:
-     raise ValueError(f"Runtime {runtime_name} not found for job {job_name}") from e
+     raise RuntimeError(f"Runtime {runtime_name} not found for job {job_name}") from e

  if not job_runtime:
-     raise ValueError(f"Runtime {runtime_name} not found for job {job_name}")
+     raise RuntimeError(f"Runtime {runtime_name} not found for job {job_name}")
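The runtime lookup above re-raises with `from e`, which keeps the original exception reachable as `__cause__` even though the outward type is now RuntimeError. A small standalone sketch of that behavior follows; `KNOWN_RUNTIMES`, the free function `get_runtime`, and `resolve_job_runtime` are hypothetical stand-ins, not the Container backend's code.

```python
KNOWN_RUNTIMES = {"torch-distributed": {"name": "torch-distributed"}}  # assumed data


def get_runtime(name):
    # Hypothetical lookup; a missing key surfaces as KeyError here.
    return KNOWN_RUNTIMES[name]


def resolve_job_runtime(job_name, runtime_name):
    try:
        job_runtime = get_runtime(runtime_name) if runtime_name else None
    except Exception as e:
        # Same pattern as the diff: wrap the low-level error in a
        # RuntimeError and chain it so the root cause is not lost.
        raise RuntimeError(
            f"Runtime {runtime_name} not found for job {job_name}"
        ) from e

    if not job_runtime:
        # No runtime name recorded on the job at all.
        raise RuntimeError(f"Runtime {runtime_name} not found for job {job_name}")
    return job_runtime


try:
    resolve_job_runtime("job-a", "missing-runtime")
except RuntimeError as err:
    chained_cause = err.__cause__   # the original KeyError
    chained_message = str(err)

try:
    resolve_job_runtime("job-b", None)
except RuntimeError as err:
    unchained_cause = err.__cause__  # None: nothing was wrapped

resolved = resolve_job_runtime("job-c", "torch-distributed")
```

Callers only need to catch RuntimeError, while a traceback still shows the underlying lookup failure for debugging.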
@aashvgit force-pushed the fix/433-valueerror-to-runtimeerror branch from fb1c808 to 41eba7d on May 5, 2026 10:01
…s and container backends

Signed-off-by: aashvgit <167199295+aashvgit@users.noreply.github.com>
@aashvgit force-pushed the fix/433-valueerror-to-runtimeerror branch from 41eba7d to 912039f on May 8, 2026 18:38