Skip to content

fix(operator): support multi-replica endpoint generation in IdentifyP…#3539

Open
fedebongio wants to merge 2 commits into
kubeflow:masterfrom
fedebongio:fix-multi-slice-endpoints
Open

fix(operator): support multi-replica endpoint generation in IdentifyP…#3539
fedebongio wants to merge 2 commits into
kubeflow:masterfrom
fedebongio:fix-multi-slice-endpoints

Conversation

@fedebongio
Copy link
Copy Markdown

What this PR does / why we need it:
This PR resolves a limitation in the IdentifyPodNetwork function within pkg/runtime/framework/plugins/jobset/jobset.go. Previously, the operator hardcoded the number of replicated job replicas to 1 (rJobReplicas := constants.DefaultJobReplicas) and only generated network endpoints for replica index 0.
With the introduction of multi-slice TPU support (where replicas > 1 represents the slice count, e.g., in JobSet), this 1-replica assumption causes endpoint generation to fail for all slices other than the first one.
This PR updates IdentifyPodNetwork to:

  1. Retrieve the actual replica count from the ReplicatedJob configuration template.
  2. Iterate over all replicas (slices) and pods to generate unique, correct network endpoints (e.g., ...-node-0-0... to ...-node-3-7...).
    This is an essential follow-up fix to ensure complete multi-slice TPU support in the operator.
    Which issue(s) this PR fixes:
    Fixes Support multi-slice TPU in trainer #3407 (Companion/follow-up to feat(operator): support multi-slice TPU by enabling trainer replicas > 1 #3408)
    Checklist:
  • Docs included if any changes are user facing
    (Note: This is an internal operator networking fix, so no user-facing documentation changes are required. Checked for checklist completeness).

Copilot AI review requested due to automatic review settings May 22, 2026 23:06
@google-oss-prow google-oss-prow Bot requested review from jinchihe and kuizhiqing May 22, 2026 23:06
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link
Copy Markdown

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates JobSet pod network endpoint generation to support multiple replicas per replicated job, rather than assuming a single replica.

Changes:

  • Use rJob.Replicas to determine replica count (with a default) instead of a fixed constant.
  • Generate endpoints for every (replicaIdx, podIdx) combination.

Comment thread pkg/runtime/framework/plugins/jobset/jobset.go Outdated
Comment thread pkg/runtime/framework/plugins/jobset/jobset.go Outdated
Copy link
Copy Markdown
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @fedebongio!
Please can you sign commits for DCO?

// REF: https://github.com/kubeflow/trainer/issues/2318
podCount := info.TemplateSpec.PodSets[rJobIdx].Count
rJobReplicas := constants.DefaultJobReplicas
rJobReplicas := ptr.Deref(rJob.Replicas, constants.DefaultJobReplicas)
Copy link
Copy Markdown
Member

@andreyvelich andreyvelich May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that users will be responsible to use RuntimePatches API to configure correct replicas from the TrainJob's numNodes?
It is a bit different compare to what we discussed here: #3408

cc @siyuanfoundation @krishdef7 @richabanker @kaisoz @kubeflow/kubeflow-trainer-team

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is correct. The users would configure the correct replicas/numSlices and the total numNodes, #3408 have the logic to compute the per slice Parallelism. We added the util function to compute numNodes from numSlices and topology in kubeflow/sdk#498, but the user would need to know and specify numSlices

Copy link
Copy Markdown

@richabanker richabanker May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some naive questions, but putting it out there:

  1. Is the concern that relying on rJob.Replicas is problematic due to RuntimePatches API not supporting numSlices/replicas field, so rjob.Replicas is only reliable when its coming from the SDK or raw manifest for TrainingRuntime CR and not from RuntimePatches API (which doesnt support replicas) ? If so, is a follow up here to add support for replicas field in RuntimePatches API?
  2. Why is using rJob.Replicas in IdentifyPodNetwork() being raised as a concern here but not in feat(operator): support multi-slice TPU by enabling trainer replicas > 1 #3408 where we use the same field in (j *JobSet) Build() ?
  3. Regardless of the fix, IdentifyPodNetwork using constants.DefaultJobReplicas (=1) is still a bug right ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just to close the loop, beyond the open questions I signed the commits for DCO)

…odNetwork

Signed-off-by: Federico Bongiovanni <fbongiovanni@google.com>
Signed-off-by: Federico Bongiovanni <fbongiovanni@google.com>
@fedebongio fedebongio force-pushed the fix-multi-slice-endpoints branch from c4c70ba to 8fcbcdd Compare May 26, 2026 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multi-slice TPU in trainer

5 participants