Motivation
The motivation behind this issue is to allow Claudie clusters to remain operational using only controller nodes while idle (that is, when no CPU/GPU workloads are running), and to provision GPU compute nodes only when GPU workloads are actually requested.
Description
When Claudie deploys infrastructure with:
- a controller node pool using a static node count, and
- a compute node pool managed by an autoscaler with min=0,
it initially provisions only the controller nodes. However, shortly after deployment, the autoscaler provisions compute nodes on its own, even though no user workloads have been requested.
This happens because some pods from the longhorn-system namespace and several pods from kube-system (notably hubble-* pods) do not tolerate the control-plane taint:
node-role.kubernetes.io/control-plane:NoSchedule
As a result, these pods cannot be scheduled onto the controller nodes. The cluster autoscaler then detects unschedulable workloads and immediately provisions a new node from the autoscaled compute pool.
Because of this behavior, the autoscaled compute node pool never remains at zero nodes after deployment.
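To illustrate the mechanism, here is a minimal Python sketch of the taint/toleration matching rule as documented for the Kubernetes scheduler. It is not Claudie or Kubernetes source code. The names `Taint`, `Toleration`, `tolerates`, and `schedulable_on` are hypothetical and used only for illustration, and the empty toleration list for the hubble pod is an assumption based on the behavior described above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches every taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def schedulable_on(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """A pod fits a node only if every blocking taint is tolerated.

    PreferNoSchedule is a soft preference, so it never blocks scheduling.
    """
    tolerations = list(tolerations)
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# The taint carried by the controller nodes, as quoted in this issue.
control_plane = Taint(key="node-role.kubernetes.io/control-plane", effect="NoSchedule")

# Assumption: a pod with no matching toleration, like the hubble-* pods described above.
hubble_tolerations: list[Toleration] = []

# Hypothetical toleration that would let a pod run on the controller nodes.
with_toleration = [
    Toleration(key="node-role.kubernetes.io/control-plane",
               operator="Exists", effect="NoSchedule")
]

print(schedulable_on(hubble_tolerations, [control_plane]))  # False
print(schedulable_on(with_toleration, [control_plane]))     # True
```

In the first case the pod stays pending on a cluster that has only controller nodes, and a pending pod is what prompts the cluster autoscaler to add a node from the compute pool.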