feat(trainer): add TPU utility to compute numNodes for multi-slice TPU #498

Open
richabanker wants to merge 1 commit into kubeflow:main from richabanker:tpu-multislice-util

Conversation

richabanker commented May 20, 2026

What this PR does / why we need it:
Introduce the get_tpu_num_nodes utility function in the Trainer backend to calculate VM host/node requirements for multi-slice TPU TrainJobs based on the user-specified slice count and physical TPU topology.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Issue: kubeflow/trainer#3407

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings May 20, 2026 18:58
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a TPU helper utility to compute required host/node counts from slice count and topology, along with parametrized tests to validate expected behavior and error handling.

Changes:

  • Introduced get_tpu_num_nodes() to compute total TPU host nodes from num_slices, topology, and chips_per_host.
  • Added a parametrized pytest suite covering common 2D/3D topologies, multi-slice scaling, and invalid topology inputs.
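
For readers without the diff open, the calculation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in this PR: the function name `get_tpu_num_nodes_sketch`, the default of 4 chips per host, the round-up behavior for partially filled hosts, and the exact validation rules are all assumptions. Only the parameter names `num_slices`, `topology`, and `chips_per_host` come from the summary above.

```python
import math
import re

# Assumption: a topology is 2 or 3 positive integers joined by "x", e.g. "4x4" or "2x2x4".
_TOPOLOGY_RE = re.compile(r"^\d+(x\d+){1,2}$")


def get_tpu_num_nodes_sketch(num_slices: int, topology: str, chips_per_host: int = 4) -> int:
    """Hypothetical sketch: total TPU hosts across all slices.

    chips_per_host=4 is an assumed default; the real value depends on the
    TPU generation and machine type.
    """
    if num_slices < 1:
        raise ValueError(f"num_slices must be >= 1, got {num_slices}")
    if chips_per_host < 1:
        raise ValueError(f"chips_per_host must be >= 1, got {chips_per_host}")
    if not _TOPOLOGY_RE.match(topology):
        raise ValueError(f"invalid TPU topology: {topology!r}")

    dims = [int(d) for d in topology.split("x")]
    if any(d < 1 for d in dims):
        raise ValueError(f"topology dimensions must be positive: {topology!r}")

    # Chips in one slice is the product of the topology dimensions.
    chips_per_slice = math.prod(dims)
    # Round up so a slice smaller than one host still needs one host (assumption).
    hosts_per_slice = -(-chips_per_slice // chips_per_host)
    return num_slices * hosts_per_slice
```

Under these assumptions, a single `2x2x4` slice has 16 chips and needs 4 hosts, and two `4x4` slices need 8 hosts in total.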

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File | Description
kubeflow/trainer/backends/kubernetes/utils.py | Adds get_tpu_num_nodes() utility and related validation/error handling.
kubeflow/trainer/backends/kubernetes/utils_test.py | Adds parametrized unit tests for get_tpu_num_nodes() success and failure cases.

Comment thread kubeflow/trainer/backends/kubernetes/utils.py
Comment thread kubeflow/trainer/backends/kubernetes/utils.py
Comment thread kubeflow/trainer/backends/kubernetes/utils.py Outdated
Comment thread kubeflow/trainer/backends/kubernetes/utils.py Outdated
Comment thread kubeflow/trainer/backends/kubernetes/utils_test.py Outdated
Comment thread kubeflow/trainer/backends/kubernetes/utils.py
richabanker force-pushed the tpu-multislice-util branch from ac9f162 to 21213d5 on May 20, 2026 19:02
richabanker changed the title from "feat(api): add TPU utility to compute numNodes for multi-slice TPU" to "feat(trainer): add TPU utility to compute numNodes for multi-slice TPU" on May 20, 2026
richabanker force-pushed the tpu-multislice-util branch 2 times, most recently from f3357e8 to f7b465c on May 20, 2026 19:06
@richabanker
Author

cc @siyuanfoundation @fedebongio

Comment thread kubeflow/trainer/backends/kubernetes/utils.py Outdated
Comment thread kubeflow/trainer/backends/kubernetes/utils.py
@andreyvelich
Member

/ok-to-test
/retest

Signed-off-by: Richa Banker <richabanker@google.com>
richabanker force-pushed the tpu-multislice-util branch from f7b465c to 50c0017 on May 20, 2026 22:41
@richabanker
Author

/ok-to-test

@richabanker
Author

/retest

@siyuanfoundation

/lgtm Thank you!

4 participants