feat(api): add TPU utility to compute numNodes for multi-slice TPU #3532

Closed
richabanker wants to merge 1 commit into kubeflow:master from richabanker:tpu-multislice-util

@richabanker

What this PR does / why we need it:
Introduce a TPU utility function, get_num_nodes, in the Python API package to calculate the total number of VM hosts (numNodes) for multi-slice TPU configurations.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Issue #3407

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Richa Banker <richabanker@google.com>
Copilot AI review requested due to automatic review settings May 20, 2026 02:40
@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@richabanker
Author

cc @siyuanfoundation

@google-oss-prow google-oss-prow Bot requested a review from jinchihe May 20, 2026 02:41
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

Adds a small TPU helper to the Python API package to compute the total host count (numNodes) for multi-slice TPU topologies, intended to simplify configuring multi-slice TPU TrainJobs (Issue #3407).

Changes:

  • Added get_num_nodes(num_slices, topology, chips_per_host=4) to compute total VM hosts across slices.
  • Added unit tests covering common 2D/3D TPU topologies and invalid inputs.
  • Re-exported get_num_nodes from kubeflow_trainer_api package __init__.py.
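To make the described computation concrete, here is a minimal sketch of what a function with this signature might look like. Only the signature, the empty-topology check, and the parsing error message come from the PR excerpts; the host-count formula (chips per slice divided by chips_per_host, rounded up) and the extra argument checks are assumptions, not the PR's confirmed implementation.

```python
import math


def get_num_nodes(num_slices: int, topology: str, chips_per_host: int = 4) -> int:
    """Return the total number of TPU VM hosts across all slices (sketch)."""
    # Assumption: both counts must be positive; the PR excerpt does not show this check.
    if num_slices < 1 or chips_per_host < 1:
        raise ValueError("num_slices and chips_per_host must be positive integers.")
    if not topology:
        raise ValueError("TPU topology must be specified.")

    # Parse the topology dimensions (e.g. "2x2" or "2x2x2").
    try:
        dims = [int(d) for d in topology.lower().split("x")]
    except ValueError:
        raise ValueError(
            f"Invalid topology format: '{topology}'. "
            "Must be formatted as 'AxB' or 'AxBxC' (e.g. '2x2', '2x2x2')."
        )

    # Assumption: the chip count of a slice is the product of its dimensions,
    # and a slice smaller than one host still occupies one host.
    chips_per_slice = math.prod(dims)
    hosts_per_slice = math.ceil(chips_per_slice / chips_per_host)
    return num_slices * hosts_per_slice
```

Under these assumptions, two slices of a "2x2x2" topology (8 chips each, 4 chips per host) would need 2 hosts per slice, so get_num_nodes(2, "2x2x2") evaluates to 4.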

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
--- | ---
api/python_api/kubeflow_trainer_api/tpu.py | Introduces TPU topology parsing + host count computation utility.
api/python_api/kubeflow_trainer_api/tpu_test.py | Adds unit tests for the new TPU utility function.
api/python_api/kubeflow_trainer_api/__init__.py | Exposes get_num_nodes as part of the package public surface.

Comment on lines +35 to +39
    if not topology:
        raise ValueError("TPU topology must be specified.")

    # Parse the topology dimensions (e.g. "2x2" or "2x2x2")
    try:
Comment on lines +40 to +44
        dims = [int(d) for d in topology.lower().split("x")]
    except ValueError:
        raise ValueError(
            f"Invalid topology format: '{topology}'. Must be formatted as 'AxB' or 'AxBxC' (e.g. '2x2', '2x2x2')."
        )
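The parsing step quoted above accepts any number of dimensions and any integer values. As an illustration only, a stricter variant could look like the sketch below. The helper name parse_topology, the restriction to two or three positive dimensions, and the second error message are assumptions introduced here; they are not taken from the PR or from the review comments.

```python
def parse_topology(topology: str) -> tuple[int, ...]:
    """Parse 'AxB' or 'AxBxC' into a tuple of positive ints (hypothetical helper)."""
    if not topology:
        raise ValueError("TPU topology must be specified.")
    try:
        dims = tuple(int(d) for d in topology.lower().split("x"))
    except ValueError as err:
        # "from err" keeps the original int() failure attached as __cause__.
        raise ValueError(
            f"Invalid topology format: '{topology}'. "
            "Must be formatted as 'AxB' or 'AxBxC' (e.g. '2x2', '2x2x2')."
        ) from err
    # Assumption: only 2D and 3D topologies with positive dimensions are valid.
    if len(dims) not in (2, 3) or any(d < 1 for d in dims):
        raise ValueError(
            f"Invalid topology format: '{topology}'. "
            "Expected two or three positive dimensions."
        )
    return dims
```

Explicit chaining with `from err` preserves the underlying parse failure in the traceback, which can make a malformed topology string easier to debug.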
Comment on lines +15 to +18
import unittest
from kubeflow_trainer_api.tpu import get_num_nodes

class TestTPUUtils(unittest.TestCase):
# See the License for the specific language governing permissions and
# limitations under the License.

def get_num_nodes(num_slices: int, topology: str, chips_per_host: int = 4) -> int:
Member

Thanks for this @richabanker!
I think we should contribute this utility function to the Kubeflow SDK instead: https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/backends/kubernetes/utils.py
The kubeflow_trainer_api package only exposes Python models for Trainer CRDs.

Author

Ah my bad, opened kubeflow/sdk#498 against the SDK repo.

Will close this one out. Thanks!

@siyuanfoundation

I agree with Andrey; this function should be in the Kubeflow SDK.
Thank you for working on this Richa!

@richabanker
Author

/close

in favor of kubeflow/sdk#498

@google-oss-prow google-oss-prow Bot closed this May 20, 2026
@google-oss-prow

@richabanker: Closed this PR.

Details

In response to this:

/close

in favor of kubeflow/sdk#498

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


4 participants