jhu-svg commented on Nov 24, 2025

Summary

  • Rebase our slurm-operator fork onto upstream Slinky v1.0 (v1beta1 APIs).

  • Re‑introduce the Together‑specific behavior from PR #13 ("Changes Compiled"), adapted to the new v1.0 architecture.

Key changes

  • Node cordon signal
    Add slinky.slurm.net/node-cordon annotation constant.
    Implement updateNodeCordonAnnotation / isNodeCordoned logic in the nodeset controller so that K8s Nodes are annotated when the corresponding Slurm nodes are drained or undrained.
    Keep using upstream pod‑level annotations (AnnotationPodCordon, AnnotationPodDeletionCost) from v1.0.
  • Login support
    Add login section to helm/slurm/values.yaml (image, resources, nodeSelector, affinity, tolerations, extra volumes).
    Add login helpers in _slurm.tpl.
    Add helm/slurm/templates/login/login-deployment.yaml and login-service.yaml to deploy the login pod.
  • Helm configuration knobs
    Extend the Slurm chart values to support:
      Additional tolerations for controller/accounting/restapi where needed.
      shmSize and persistence.existingDataClaims for compute NodeSets (e.g. /data, /scratch).
    Wire these values into the corresponding templates.
  • Operator chart / RBAC
    Add .Values.operator.tolerations and .Values.operator.affinity to the operator Deployment.
    Ensure RBAC grants the update verb on the resources the new behavior touches (e.g. nodes).
  • Module / wiring
    Set module path to github.com/togethercomputer/slurm-operator and adjust imports so tcloud can depend on this fork cleanly.
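
For reviewers, here is a minimal sketch of how the node cordon signal could behave. Only the annotation key and the two function names come from this PR. The signatures, the "true" value convention, and the choice to work on a plain annotation map rather than a Kubernetes Node object are assumptions made for illustration, so the real controller code may differ.

```go
package main

import "fmt"

// AnnotationNodeCordon is the annotation key named in this PR.
const AnnotationNodeCordon = "slinky.slurm.net/node-cordon"

// isNodeCordoned reports whether the annotations mark the node as cordoned.
// Assumption: the value "true" means cordoned; the PR does not state the value.
func isNodeCordoned(annotations map[string]string) bool {
	// Reading from a nil map is safe in Go and yields the zero value.
	return annotations[AnnotationNodeCordon] == "true"
}

// updateNodeCordonAnnotation returns annotations that reflect the Slurm
// drain state, plus a flag saying whether an update is needed.
// Hypothetical signature: the real controller presumably operates on a
// Node object and issues the update through the Kubernetes client.
func updateNodeCordonAnnotation(annotations map[string]string, drained bool) (map[string]string, bool) {
	if isNodeCordoned(annotations) == drained {
		// Already in the desired state, so no API update is required.
		return annotations, false
	}
	// Copy so the caller's map is never mutated in place.
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	if drained {
		out[AnnotationNodeCordon] = "true"
	} else {
		delete(out, AnnotationNodeCordon)
	}
	return out, true
}

func main() {
	// "example.com/other" is a made-up annotation showing unrelated keys survive.
	annotations := map[string]string{"example.com/other": "kept"}

	annotations, changed := updateNodeCordonAnnotation(annotations, true)
	fmt.Println("after drain:", isNodeCordoned(annotations), "changed:", changed)

	annotations, changed = updateNodeCordonAnnotation(annotations, false)
	fmt.Println("after undrain:", isNodeCordoned(annotations), "changed:", changed)
}
```

Returning a changed flag lets the caller skip the Node update when the annotation already matches the Slurm state, which is also why the RBAC change only needs the update verb on nodes when a drain transition actually occurs.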

jhu-svg changed the title from "add node-cordon, login chart, Helm/operator tweaks" to "Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features" on Nov 24, 2025