fix(operator): support multi-replica endpoint generation in IdentifyP…#3539
fedebongio wants to merge 2 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates JobSet pod network endpoint generation to support multiple replicas per replicated job, rather than assuming a single replica.
Changes:
- Use `rJob.Replicas` to determine replica count (with a default) instead of a fixed constant.
- Generate endpoints for every `(replicaIdx, podIdx)` combination.
andreyvelich left a comment
Thank you for this @fedebongio!
Please can you sign commits for DCO?
 // REF: https://github.com/kubeflow/trainer/issues/2318
 podCount := info.TemplateSpec.PodSets[rJobIdx].Count
-rJobReplicas := constants.DefaultJobReplicas
+rJobReplicas := ptr.Deref(rJob.Replicas, constants.DefaultJobReplicas)
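The changed line swaps a fixed constant for a pointer dereference with a fallback. A minimal sketch of that defaulting behavior follows. The local `deref` helper is a stand-in written for this example (it is meant to mirror `ptr.Deref` from `k8s.io/utils/ptr`), and `defaultJobReplicas` is a hypothetical constant assumed to equal 1, as the PR description says of `constants.DefaultJobReplicas`.

```go
package main

import "fmt"

// defaultJobReplicas is a stand-in for constants.DefaultJobReplicas.
// Assumption: the value is 1, as stated in the PR description.
const defaultJobReplicas int32 = 1

// deref returns *p when p is non-nil and def otherwise.
// It is a local helper for illustration, not the operator's code.
func deref[T any](p *T, def T) T {
	if p != nil {
		return *p
	}
	return def
}

func main() {
	var unset *int32 // replicas omitted in the template
	four := int32(4) // replicas explicitly set to 4

	fmt.Println(deref(unset, defaultJobReplicas))
	fmt.Println(deref(&four, defaultJobReplicas))
}
```

With this pattern, a template that omits replicas keeps the old single-replica behavior, while an explicit value is honored.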
Does this mean that users will be responsible for using the RuntimePatches API to configure the correct replicas from the TrainJob's numNodes?
This is a bit different compared to what we discussed here: #3408
cc @siyuanfoundation @krishdef7 @richabanker @kaisoz @kubeflow/kubeflow-trainer-team
That is correct. Users would configure the correct replicas/numSlices and the total numNodes; #3408 has the logic to compute the per-slice Parallelism. We added a util function to compute numNodes from numSlices and topology in kubeflow/sdk#498, but the user would still need to know and specify numSlices.
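The relationship between numNodes, numSlices, and per-slice parallelism can be sketched as simple arithmetic. The helper below is hypothetical and is not the logic from #3408. It assumes nodes are split evenly across slices and rejects inputs that do not divide cleanly.

```go
package main

import (
	"errors"
	"fmt"
)

// perSliceParallelism is a hypothetical helper for illustration only.
// Assumption: numNodes is the total across all slices and is divided
// evenly, so each slice runs numNodes / numSlices pods.
func perSliceParallelism(numNodes, numSlices int32) (int32, error) {
	if numSlices <= 0 {
		return 0, errors.New("numSlices must be positive")
	}
	if numNodes <= 0 || numNodes%numSlices != 0 {
		return 0, fmt.Errorf("numNodes=%d is not a positive multiple of numSlices=%d", numNodes, numSlices)
	}
	return numNodes / numSlices, nil
}

func main() {
	// 32 nodes spread over 4 slices.
	p, err := perSliceParallelism(32, 4)
	fmt.Println(p, err)
}
```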
Maybe some naive questions, but putting them out there:
- Is the concern that relying on `rJob.Replicas` is problematic because the RuntimePatches API does not support a numSlices/replicas field, so `rJob.Replicas` is only reliable when it comes from the SDK or a raw TrainingRuntime CR manifest and not from the RuntimePatches API (which doesn't support replicas)? If so, is the follow-up here to add support for the replicas field in the RuntimePatches API?
- Why is using `rJob.Replicas` in `IdentifyPodNetwork()` being raised as a concern here but not in feat(operator): support multi-slice TPU by enabling trainer replicas > 1 #3408, where we use the same field in `(j *JobSet) Build()`?
- Regardless of the fix, `IdentifyPodNetwork` using `constants.DefaultJobReplicas` (= 1) is still a bug, right?
(just to close the loop, beyond the open questions I signed the commits for DCO)
…odNetwork Signed-off-by: Federico Bongiovanni <fbongiovanni@google.com>
Signed-off-by: Federico Bongiovanni <fbongiovanni@google.com>
c4c70ba to 8fcbcdd
What this PR does / why we need it:
This PR resolves a limitation in the `IdentifyPodNetwork` function within `pkg/runtime/framework/plugins/jobset/jobset.go`. Previously, the operator hardcoded the number of replicated job replicas to 1 (`rJobReplicas := constants.DefaultJobReplicas`) and only generated network endpoints for replica index `0`.
With the introduction of multi-slice TPU support (where `replicas > 1` represents the slice count, e.g., in JobSet), this 1-replica assumption causes endpoint generation to fail for all slices other than the first one.
This PR updates `IdentifyPodNetwork` to:
- Read the replica count from the `ReplicatedJob` configuration template.
- Generate endpoints for every replica and pod index (`...-node-0-0...` to `...-node-3-7...`).
This is an essential follow-up fix to ensure complete multi-slice TPU support in the operator.
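The enumeration described above can be sketched as a nested loop over replica and pod indices. This is an illustrative sketch, not the code in `jobset.go`. The function `podEndpoints`, its parameters, and the name layout `<jobSet>-<rJob>-<replicaIdx>-<podIdx>.<subdomain>` are assumptions chosen to match the `-node-0-0` through `-node-3-7` range mentioned in the description.

```go
package main

import "fmt"

// podEndpoints lists one endpoint per (replicaIdx, podIdx) pair.
// Assumption: endpoints follow <jobSet>-<rJob>-<replicaIdx>-<podIdx>.<subdomain>;
// the real operator may build names differently.
func podEndpoints(jobSet, rJob, subdomain string, replicas, podCount int32) []string {
	if replicas <= 0 || podCount <= 0 {
		return nil
	}
	endpoints := make([]string, 0, int(replicas)*int(podCount))
	for r := int32(0); r < replicas; r++ {
		for p := int32(0); p < podCount; p++ {
			endpoints = append(endpoints,
				fmt.Sprintf("%s-%s-%d-%d.%s", jobSet, rJob, r, p, subdomain))
		}
	}
	return endpoints
}

func main() {
	// 4 slices (replicas) with 8 pods each, using made-up names.
	eps := podEndpoints("demo", "node", "demo", 4, 8)
	fmt.Println(len(eps), eps[0], eps[len(eps)-1])
}
```

With the previous single-replica assumption, only the inner loop for replica index 0 would run, so the endpoints for the remaining slices would never be listed.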
Which issue(s) this PR fixes:
Fixes #3407 (Support multi-slice TPU in trainer). Companion/follow-up to #3408 (feat(operator): support multi-slice TPU by enabling trainer replicas > 1).
Checklist:
(Note: This is an internal operator networking fix, so no user-facing documentation changes are required. Checked for checklist completeness).