| 1 | +# DCGM Exporter Monitoring with SkyPilot API Server |
| 2 | + |
| 3 | +This document explains how to configure the SkyPilot API server to automatically monitor DCGM exporters running in the same Kubernetes cluster. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Monitoring relies on two components: |
| 8 | + |
| 9 | +1. **Kubernetes Service Discovery**: Prometheus uses Kubernetes-native service discovery to automatically find DCGM exporters |
| 10 | +2. **Grafana Dashboards**: Pre-built dashboards for visualizing GPU metrics from the cluster |
| 11 | + |
| 12 | +**Note**: The current implementation works within a single Kubernetes cluster where the SkyPilot API server, Prometheus, and DCGM exporters are all deployed together. |
| 13 | + |
| 14 | +## Architecture |
| 15 | + |
| 16 | +``` |
| 17 | +┌─────────────────┐    ┌──────────────────┐ |
| 18 | +│   Prometheus    │───▶│  DCGM Exporters  │ |
| 19 | +│                 │    │  (Same Cluster)  │ |
| 20 | +└─────────────────┘    └──────────────────┘ |
| 21 | +         │                       │ |
| 22 | +         │                       │ |
| 23 | +         ▼                       ▼ |
| 24 | +┌─────────────────┐    ┌──────────────────┐ |
| 25 | +│    Grafana      │    │   GPU Metrics    │ |
| 26 | +│   Dashboards    │    │ (Kubernetes SD)  │ |
| 27 | +└─────────────────┘    └──────────────────┘ |
| 28 | +``` |
| 29 | + |
| 30 | +## Prerequisites |
| 31 | + |
| 32 | +### 1. DCGM Exporters in the Same Cluster |
| 33 | + |
| 34 | +The Kubernetes cluster where SkyPilot is deployed must have DCGM exporters running with proper Prometheus configuration. |
| 35 | + |
| 36 | +#### Using GPU Operator |
| 37 | +See the common deployment scenarios in the NVIDIA GPU Operator getting-started guide: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-deployment-scenarios |
| 38 | + |
| 39 | +While the GPU Operator provides DCGM exporter capabilities, **manual configuration is typically required** to enable proper Prometheus scraping. |
| 40 | + |
| 41 | +#### Manual DCGM Exporter Configuration |
| 42 | + |
| 43 | +After installing the GPU Operator, you may need to manually configure or deploy DCGM exporters with the correct Prometheus annotations: |
| 44 | + |
| 45 | +```yaml |
| 46 | +# Example DCGM exporter deployment with Prometheus annotations |
| 47 | +apiVersion: apps/v1 |
| 48 | +kind: DaemonSet |
| 49 | +metadata: |
| 50 | +  name: nvidia-dcgm-exporter |
| 51 | +  namespace: gpu-operator |
| 52 | +spec: |
| 53 | +  selector: |
| 54 | +    matchLabels: |
| 55 | +      app: nvidia-dcgm-exporter |
| 56 | +  template: |
| 57 | +    metadata: |
| 58 | +      labels: |
| 59 | +        app: nvidia-dcgm-exporter |
| 60 | +      annotations: |
| 61 | +        prometheus.io/scrape: "true" |
| 62 | +        prometheus.io/port: "9400" |
| 63 | +        prometheus.io/path: "/metrics" |
| 64 | +    spec: |
| 65 | +      # ... rest of DCGM exporter configuration |
| 66 | +``` |
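
If the DaemonSet already exists, the same three annotations could be applied with a merge patch instead of editing the manifest. The sketch below only builds the patch JSON; the helper name `build_annotation_patch` is hypothetical, and it assumes the DaemonSet name, namespace, port, and path shown above. An operator-managed DaemonSet may have such changes reverted, so treat this as an illustration, not a recommended procedure.

```bash
# Hypothetical helper: print a JSON merge patch that adds the Prometheus
# scrape annotations to a DaemonSet's pod template.
# $1 = metrics port (default 9400), $2 = metrics path (default /metrics)
build_annotation_patch() {
  port="${1:-9400}"
  path="${2:-/metrics}"
  printf '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"%s","prometheus.io/path":"%s"}}}}}\n' "$port" "$path"
}

# Usage (requires cluster access; names taken from the example above):
# kubectl patch daemonset nvidia-dcgm-exporter -n gpu-operator \
#   --type merge -p "$(build_annotation_patch)"
```

Annotation values are emitted as JSON strings because Kubernetes annotations must be strings, which is also why the manifest quotes `"true"` and `"9400"`.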
| 67 | + |
| 68 | +#### Expected DCGM Exporter Configuration |
| 69 | + |
| 70 | +Once properly configured, you should see: |
| 71 | + |
| 72 | +**NVIDIA GPU Operator DCGM Exporter:** |
| 73 | +- **Service Name**: `nvidia-dcgm-exporter` |
| 74 | +- **Namespace**: `gpu-operator` |
| 75 | +- **Port**: `9400` |
| 76 | +- **Metrics Path**: `/metrics` |
| 77 | +- **Prometheus Annotation**: `prometheus.io/scrape: "true"` |
| 78 | + |
| 79 | +**Note**: Depending on your cloud provider or cluster setup, you may also see additional DCGM exporters (e.g., cloud provider-specific ones like `nebius-dcgm`). Prometheus will automatically discover and scrape all properly annotated DCGM exporters. |
| 80 | + |
| 81 | +#### Verification Commands |
| 82 | + |
| 83 | +```bash |
| 84 | +# Check if DCGM exporters are running |
| 85 | +kubectl get pods -A | grep dcgm |
| 86 | + |
| 87 | +# Check DCGM exporter services and annotations |
| 88 | +kubectl get svc -A | grep dcgm |
| 89 | +kubectl describe svc -n gpu-operator nvidia-dcgm-exporter |
| 90 | + |
| 91 | +# Test metrics endpoint (port-forward blocks, so run curl from a second terminal) |
| 92 | +kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 |
| 93 | +curl http://localhost:9400/metrics |
| 94 | +``` |
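
The raw `/metrics` output can be long, so it may help to narrow it to a single metric. The following is a minimal sketch: `filter_metric` is a hypothetical helper, and `DCGM_FI_DEV_GPU_UTIL` is assumed to be among the metrics your exporter is configured to emit.

```bash
# Hypothetical helper: read Prometheus exposition text on stdin and print
# only the sample lines for one metric. "# HELP" and "# TYPE" comment lines
# are skipped because they do not start with the metric name.
filter_metric() {
  metric="$1"
  # A sample line is the metric name followed by "{" (labels) or a space.
  grep -E "^${metric}[{ ]" || true
}

# Usage (with the port-forward from above running in another terminal):
# curl -s http://localhost:9400/metrics | filter_metric DCGM_FI_DEV_GPU_UTIL
```

An empty result means the exporter is reachable but is not emitting that metric, which usually points at the exporter's metric configuration, not at Prometheus.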
| 95 | + |
| 96 | +## Configuration |
| 97 | + |
| 98 | +### Deploy with Helm |
| 99 | + |
| 100 | +```bash |
| 101 | +helm install skypilot ./charts/skypilot \ |
| 102 | + --set apiService.metrics.enabled=true \ |
| 103 | + --set prometheus.enabled=true \ |
| 104 | + --set grafana.enabled=true |
| 105 | +``` |
| 106 | + |
| 107 | +### Dashboard Configuration |
| 108 | + |
| 109 | +The NVIDIA DCGM dashboard is automatically provisioned through Grafana's file-based dashboard provider: |
| 110 | + |
| 111 | +```yaml |
| 112 | +# In values.yaml |
| 113 | +grafana: |
| 114 | +  enabled: true |
| 115 | +  dashboardProviders: |
| 116 | +    dashboardproviders.yaml: |
| 117 | +      apiVersion: 1 |
| 118 | +      providers: |
| 119 | +      - name: 'default' |
| 120 | +        orgId: 1 |
| 121 | +        folder: '' |
| 122 | +        type: file |
| 123 | +        disableDeletion: false |
| 124 | +        allowUiUpdates: false |
| 125 | +        updateIntervalSeconds: 30 |
| 126 | +        options: |
| 127 | +          path: /var/lib/grafana/dashboards/default |
| 128 | +``` |
| 129 | + |
| 130 | +## How It Works |
| 131 | + |
| 132 | +### 1. Kubernetes Service Discovery |
| 133 | + |
| 134 | +Prometheus is configured to use Kubernetes service discovery to automatically find DCGM exporters: |
| 135 | + |
| 136 | +```yaml |
| 137 | +# Prometheus scrape config (automatically generated) |
| 138 | +- job_name: 'kubernetes-pods' |
| 139 | +  kubernetes_sd_configs: |
| 140 | +  - role: pod |
| 141 | +  relabel_configs: |
| 142 | +  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] |
| 143 | +    action: keep |
| 144 | +    regex: true |
| 145 | +  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] |
| 146 | +    action: replace |
| 147 | +    target_label: __metrics_path__ |
| 148 | +    regex: (.+) |
| 149 | +``` |
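
To make the two relabel rules concrete, here is a small shell simulation of their effect on a single pod. It is a sketch of the matching logic only, not Prometheus code, and both function names are hypothetical. It relies on two Prometheus behaviours: relabel regexes are fully anchored, and the default metrics path is `/metrics`.

```bash
# Simulate the `keep` rule: a pod is kept only when its
# prometheus.io/scrape annotation is exactly the string "true".
would_keep() {
  case "$1" in
    true) return 0 ;;
    *) return 1 ;;
  esac
}

# Simulate the `replace` rule: regex (.+) needs a non-empty annotation
# value, so an absent prometheus.io/path leaves the default path in place.
metrics_path() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    printf '/metrics\n'
  fi
}

# Usage:
# would_keep "true" && echo "scraped at $(metrics_path "")"
```

One consequence is that a pod annotated with `"True"` or `"yes"` is silently dropped by the `keep` rule, so the annotation value should be checked first when an exporter does not show up as a target.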
| 150 | + |
| 151 | +### 2. Dashboard Provisioning |
| 152 | + |
| 153 | +Grafana automatically: |
| 154 | +1. Discovers dashboards defined in `charts/skypilot/manifests/` |
| 155 | +2. Uses the Prometheus datasource |