Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features #15
+829
−416
Summary
Rebase our slurm-operator fork onto upstream Slinky v1.0 (v1beta1 APIs).
Re-introduce the Together-specific behavior from PR #13 ("Changes Compiled"), adapted to the new 1.0 architecture.
Key changes
Add slinky.slurm.net/node-cordon annotation constant.
Implement updateNodeCordonAnnotation / isNodeCordoned logic in the nodeset controller so that Kubernetes Nodes are annotated when Slurm nodes are drained or undrained.
Keep using upstream pod‑level annotations (AnnotationPodCordon, AnnotationPodDeletionCost) from v1.0.
Add login section to helm/slurm/values.yaml (image, resources, nodeSelector, affinity, tolerations, extra volumes).
Add login helpers in _slurm.tpl.
Add helm/slurm/templates/login/login-deployment.yaml and login-service.yaml to deploy the login pod.
Extend Slurm chart values to support:
  Additional tolerations for controller/accounting/restapi where needed.
  shmSize and persistence.existingDataClaims for compute NodeSets (e.g. /data, /scratch).
Wire these values into the corresponding templates.
Add .Values.operator.tolerations and .Values.operator.affinity to the operator Deployment.
Ensure RBAC includes update on resources required by new behavior (e.g. nodes).
Set module path to github.com/togethercomputer/slurm-operator and adjust imports so tcloud can depend on this fork cleanly.
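For reviewers, the node-cordon annotation behavior listed under Key changes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names and the annotation key come from the description above, but the signatures, the "true" value, and the use of a plain map in place of a Kubernetes Node's annotations are assumptions made to keep the example self-contained.

```go
package main

import "fmt"

// annotationNodeCordon is the annotation key named in this PR.
const annotationNodeCordon = "slinky.slurm.net/node-cordon"

// cordonValue is an assumed value; the PR description does not state
// what the annotation is set to.
const cordonValue = "true"

// isNodeCordoned reports whether the annotations carry the cordon marker.
// A plain map stands in for a Node's metadata annotations (assumption).
func isNodeCordoned(annotations map[string]string) bool {
	return annotations[annotationNodeCordon] == cordonValue
}

// updateNodeCordonAnnotation returns annotations that reflect the Slurm
// drain state, plus a flag saying whether anything changed. Returning the
// flag lets a caller skip the API update when the state already matches.
func updateNodeCordonAnnotation(annotations map[string]string, drained bool) (map[string]string, bool) {
	if isNodeCordoned(annotations) == drained {
		return annotations, false
	}
	// Copy so the caller's map is never mutated in place.
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	if drained {
		out[annotationNodeCordon] = cordonValue
	} else {
		delete(out, annotationNodeCordon)
	}
	return out, true
}

func main() {
	annotations := map[string]string{"example.com/owner": "team-a"}

	// Slurm node drained: the marker is added.
	annotations, changed := updateNodeCordonAnnotation(annotations, true)
	fmt.Println(isNodeCordoned(annotations), changed) // true true

	// Same state again: no change, so no update would be needed.
	annotations, changed = updateNodeCordonAnnotation(annotations, true)
	fmt.Println(isNodeCordoned(annotations), changed) // true false

	// Slurm node undrained: the marker is removed.
	annotations, changed = updateNodeCordonAnnotation(annotations, false)
	fmt.Println(isNodeCordoned(annotations), changed) // false true
}
```

In a real controller the update would go through the Kubernetes API, which is why the RBAC item above mentions update on nodes; the sketch only shows the idempotent set/clear decision.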