feat(trainer): add TPU utility to compute numNodes for multi-slice TPU #498
richabanker wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a TPU helper utility to compute required host/node counts from slice count and topology, along with parametrized tests to validate expected behavior and error handling.
Changes:
- Introduced `get_tpu_num_nodes()` to compute total TPU host nodes from `num_slices`, `topology`, and `chips_per_host`.
- Added a parametrized pytest suite covering common 2D/3D topologies, multi-slice scaling, and invalid topology inputs.
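The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of how such a helper could work, not the PR's actual implementation. The function and parameter names come from the summary above. The default of 4 chips per host, the rounding up, and the exact validation rules are assumptions made for illustration.

```python
import math


def get_tpu_num_nodes(num_slices: int, topology: str, chips_per_host: int = 4) -> int:
    """Illustrative sketch: total TPU host nodes across all slices.

    Assumptions (not confirmed by the PR): the topology is a string such as
    "4x4" or "2x2x4", chips_per_host defaults to 4, and a partially filled
    host still counts as one node.
    """
    if num_slices < 1:
        raise ValueError(f"num_slices must be >= 1, got {num_slices}")
    if chips_per_host < 1:
        raise ValueError(f"chips_per_host must be >= 1, got {chips_per_host}")

    # Accept only 2D or 3D topologies made of positive integers, e.g. "4x4".
    parts = topology.lower().split("x")
    if len(parts) not in (2, 3) or not all(p.isdecimal() and int(p) > 0 for p in parts):
        raise ValueError(f"invalid TPU topology: {topology!r}")

    # Chips in one slice is the product of the topology dimensions.
    chips_per_slice = math.prod(int(p) for p in parts)

    # Hosts needed for one slice, rounded up, then scaled by the slice count.
    hosts_per_slice = math.ceil(chips_per_slice / chips_per_host)
    return num_slices * hosts_per_slice


# Two slices of a 4x4 topology with 4 chips per host: 16 / 4 = 4 hosts each.
print(get_tpu_num_nodes(num_slices=2, topology="4x4", chips_per_host=4))
```

Under these assumptions, a malformed topology such as `"4xx4"` raises `ValueError`, which is the kind of failure case the parametrized tests are described as covering.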
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `kubeflow/trainer/backends/kubernetes/utils.py` | Adds `get_tpu_num_nodes()` utility and related validation/error handling. |
| `kubeflow/trainer/backends/kubernetes/utils_test.py` | Adds parametrized unit tests for `get_tpu_num_nodes()` success and failure cases. |
/ok-to-test
Signed-off-by: Richa Banker <richabanker@google.com>
/ok-to-test
/retest
/lgtm Thank you!
What this PR does / why we need it:
Introduce the `get_tpu_num_nodes` utility function in the Trainer backend to calculate VM host/node requirements for multi-slice TPU TrainJobs based on the user-specified slice count and physical TPU topology.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Issue #kubeflow/trainer#3407
Checklist: