Feature: Autoscaled compute node pool cannot stay at zero #2118

@m-brando

Motivation

Claudie clusters should be able to stay operational on controller nodes alone while idle, that is, when no CPU/GPU workloads are running, and should provision GPU compute nodes only when GPU workloads are actually requested.

Description

When Claudie deploys infrastructure with:

  • a controller node pool using a static node count, and
  • a compute node pool managed by an autoscaler with min=0,

it initially provisions only the controller nodes. Shortly after deployment, however, the autoscaler provisions compute nodes on its own.

This happens because some pods from the longhorn-system namespace and several pods from kube-system (notably hubble-* pods) do not tolerate the control-plane taint:

node-role.kubernetes.io/control-plane:NoSchedule

As a result, these pods cannot be scheduled onto the controller nodes. The cluster autoscaler then detects unschedulable workloads and immediately provisions a new node from the autoscaled compute pool.

Because of this behavior, the autoscaled compute node pool never remains at zero nodes after deployment.
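
The chain of events above can be sketched roughly as follows. This is a simplified model of Kubernetes taint/toleration matching written for illustration only; it is not Claudie, scheduler, or cluster-autoscaler code. The pod names `hubble-example` and `tolerant-example`, the helper names, and the three-controller layout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every taint effect


CONTROL_PLANE_TAINT = Taint("node-role.kubernetes.io/control-plane", "NoSchedule")


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration match rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(tolerations: List[Toleration], node_taints: List[Taint]) -> bool:
    """A pod fits a node only if every blocking taint is tolerated."""
    blocking = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


def unschedulable_pods(pods: Dict[str, List[Toleration]],
                       nodes: List[List[Taint]]) -> List[str]:
    """Pods that fit on no existing node; these are what trigger a scale-up."""
    return [name for name, tols in pods.items()
            if not any(can_schedule(tols, taints) for taints in nodes)]


# Hypothetical cluster: three controller nodes, compute pool at zero.
controller_nodes = [[CONTROL_PLANE_TAINT] for _ in range(3)]

# Hypothetical pods: one without tolerations, one tolerating everything.
pods = {
    "hubble-example": [],
    "tolerant-example": [Toleration(operator="Exists")],
}

pending = unschedulable_pods(pods, controller_nodes)
print(pending)
```

In this model any non-empty `pending` list stands in for the condition that makes the autoscaler add a node, so a single system pod without the control-plane toleration is enough to lift the compute pool above zero. A real autoscaler additionally checks that a node from the pool would actually fit the pod and that the pool is below its maximum size.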
