Commit 4cd31e2

sync with upstream chart
Signed-off-by: David Young <[email protected]>
1 parent 8b4cb4d commit 4cd31e2

37 files changed (+7737, -126 lines)

charts/stable/skypilot/Chart.yaml

Lines changed: 11 additions & 2 deletions
@@ -1,11 +1,20 @@
 apiVersion: v2
 name: skypilot
 description: A Helm chart for deploying SkyPilot API server on Kubernetes
+icon: "https://raw.githubusercontent.com/skypilot-org/skypilot/master/charts/skypilot/skypilot.svg"
 type: application
-version: 0.0.1-pre-07
+version: 0.0.0
 appVersion: "0.0"
 dependencies:
   - name: ingress-nginx
-    version: 4.11.3
+    version: 4.11.8
     repository: https://kubernetes.github.io/ingress-nginx
     condition: ingress-nginx.enabled
+  - name: prometheus
+    version: 27.20.0
+    repository: https://prometheus-community.github.io/helm-charts
+    condition: prometheus.enabled
+  - name: grafana
+    version: 9.2.2
+    repository: https://grafana.github.io/helm-charts
+    condition: grafana.enabled
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# DCGM Exporter Monitoring with SkyPilot API Server

This document explains how to configure the SkyPilot API server to automatically monitor DCGM exporters running in the same Kubernetes cluster.

## Overview

The SkyPilot API server can automatically monitor DCGM exporters deployed in the same Kubernetes cluster. This is achieved through:

1. **Kubernetes Service Discovery**: Prometheus uses Kubernetes-native service discovery to automatically find DCGM exporters
2. **Grafana Dashboards**: Pre-built dashboards for visualizing GPU metrics from the cluster

**Note**: The current implementation works within a single Kubernetes cluster where the SkyPilot API server, Prometheus, and DCGM exporters are all deployed together.

## Architecture

```
┌─────────────────┐    ┌──────────────────┐
│   Prometheus    │───▶│  DCGM Exporters  │
│                 │    │  (Same Cluster)  │
└─────────────────┘    └──────────────────┘
         │                      │
         │                      │
         ▼                      ▼
┌─────────────────┐    ┌──────────────────┐
│     Grafana     │    │   GPU Metrics    │
│   Dashboards    │    │ (Kubernetes SD)  │
└─────────────────┘    └──────────────────┘
```

## Prerequisites

### 1. DCGM Exporters in the Same Cluster

The Kubernetes cluster where SkyPilot is deployed must have DCGM exporters running and annotated so that Prometheus can scrape them.

#### Using GPU Operator

See the NVIDIA GPU Operator getting-started guide for deployment options: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-deployment-scenarios

While the GPU Operator provides DCGM exporter capabilities, **manual configuration is typically required** to enable proper Prometheus scraping.

#### Manual DCGM Exporter Configuration

After installing the GPU Operator, you may need to manually configure or deploy DCGM exporters with the correct Prometheus annotations:

```yaml
# Example DCGM exporter DaemonSet with Prometheus annotations
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app: nvidia-dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
        prometheus.io/path: "/metrics"
    spec:
      # ... rest of DCGM exporter configuration
```

#### Expected DCGM Exporter Configuration

Once properly configured, you should see:

**NVIDIA GPU Operator DCGM Exporter:**
- **Service Name**: `nvidia-dcgm-exporter`
- **Namespace**: `gpu-operator`
- **Port**: `9400`
- **Metrics Path**: `/metrics`
- **Prometheus Annotation**: `prometheus.io/scrape: "true"`

**Note**: Depending on your cloud provider or cluster setup, you may also see additional DCGM exporters (e.g., cloud provider-specific ones like `nebius-dcgm`). Prometheus will automatically discover and scrape all properly annotated DCGM exporters.

#### Verification Commands

```bash
# Check if DCGM exporters are running
kubectl get pods -A | grep dcgm

# Check DCGM exporter services and annotations
kubectl get svc -A | grep dcgm
kubectl describe svc -n gpu-operator nvidia-dcgm-exporter

# Test metrics endpoint. port-forward blocks, so run it in the background
# (or in a separate terminal) before calling curl.
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
PF_PID=$!
sleep 2  # give the tunnel a moment to come up
curl http://localhost:9400/metrics
kill "$PF_PID"  # stop the port-forward
```
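
The raw `/metrics` output is long, so it can help to narrow it to a single metric when checking that an exporter is healthy. The following is a minimal sketch, not part of the chart: `filter_metric` is a hypothetical helper, and `DCGM_FI_DEV_GPU_UTIL` is assumed to be among the metrics your exporter is configured to emit. The sample text is illustrative; real output carries more labels.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format output on stdin and keep
# only the sample lines for one metric (drops "# HELP" / "# TYPE" lines).
filter_metric() {
  local metric="$1"
  # Match the metric name followed by "{" (labels) or a space (no labels).
  grep -E "^${metric}(\{| )" || true
}

# Illustrative sample standing in for real exporter output.
sample='# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 61'

printf '%s\n' "$sample" | filter_metric DCGM_FI_DEV_GPU_UTIL
```

With a port-forward running, the same helper can be fed from the live endpoint: `curl -s http://localhost:9400/metrics | filter_metric DCGM_FI_DEV_GPU_UTIL`.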

## Configuration

### Deploy with Helm

```bash
helm install skypilot ./charts/skypilot \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true
```
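
The same three settings can also be kept in a values file, which is easier to version than a list of `--set` flags. This is a sketch under one assumption: the file name `monitoring-values.yaml` is an arbitrary choice, not something the chart requires. The nested keys are simply the dotted `--set` paths from the command above written out as YAML.

```bash
# Write a values file equivalent to the three --set flags.
# The file name is arbitrary (an assumption for this example).
cat > monitoring-values.yaml <<'EOF'
apiService:
  metrics:
    enabled: true
prometheus:
  enabled: true
grafana:
  enabled: true
EOF
```

The file can then be passed with `helm install skypilot ./charts/skypilot -f monitoring-values.yaml`. Because the chart declares `prometheus` and `grafana` as dependencies, installing from a local path may first require `helm dependency build ./charts/skypilot`.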

### Dashboard Configuration

The NVIDIA DCGM dashboard is automatically provisioned using Grafana's dashboard import feature:

```yaml
# In values.yaml
grafana:
  enabled: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          allowUiUpdates: false
          updateIntervalSeconds: 30
          options:
            path: /var/lib/grafana/dashboards/default
```

## How It Works

### 1. Kubernetes Service Discovery

Prometheus is configured to use Kubernetes service discovery to automatically find DCGM exporters:

```yaml
# Prometheus scrape config (automatically generated)
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
```
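
To make the effect of the two relabel rules concrete, the sketch below emulates them for a single pod in plain shell. It is an illustration only: `resolve_target_path` is a hypothetical function, and Prometheus performs this logic internally. It relies on two Prometheus behaviors: relabel regexes are fully anchored, so `regex: true` matches only the exact string `true`, and a `replace` rule whose regex does not match leaves the target label unchanged, so the default `/metrics` path is kept.

```bash
# Emulates the keep and replace rules above for one pod.
# $1: value of the prometheus.io/scrape annotation ("" if absent)
# $2: value of the prometheus.io/path annotation ("" if absent)
resolve_target_path() {
  local scrape="$1" path="$2"
  # action: keep, regex: true -> only the exact value "true" is kept.
  [ "$scrape" = "true" ] || return 1
  # action: replace, regex: (.+) -> an empty value does not match,
  # so the default metrics path stays in place.
  printf '%s\n' "${path:-/metrics}"
}

resolve_target_path "true" "/custom/metrics"               # kept, custom path
resolve_target_path "true" ""                              # kept, default path
resolve_target_path "false" "/metrics" || echo "dropped"   # not scraped
```

This is why the annotation value in the DaemonSet example is the quoted string `"true"`: any other value causes the pod to be dropped from the scrape targets.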

### 2. Dashboard Provisioning

Grafana automatically:
1. Discovers dashboards defined in `charts/skypilot/manifests/`
2. Uses the Prometheus datasource

charts/stable/skypilot/README.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
![SkyPilot](https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png)

[![Documentation](https://img.shields.io/badge/docs-gray?logo=readthedocs&logoColor=f5f5f5)](https://docs.skypilot.co/)
[![GitHub Release](https://img.shields.io/github/release/skypilot-org/skypilot.svg)](https://github.com/skypilot-org/skypilot/releases)
[![Join Slack](https://img.shields.io/badge/SkyPilot-Join%20Slack-blue?logo=slack)](http://slack.skypilot.co)
[![Downloads](https://img.shields.io/pypi/dm/skypilot)](https://github.com/skypilot-org/skypilot/releases)

## Run AI on Any Infra — Unified, Faster, Cheaper
### [🌟 **SkyPilot Demo** 🌟: Click to see a 1-minute tour](https://demo.skypilot.co/dashboard/)
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Developer Guide

## SkyPilot Config Persistence

**Design Decision:** Use a PVC as the source of truth for SkyPilot configuration.

**Drawback:** Helm upgrades cannot directly change the SkyPilot config for
the API server. Instead, we provide instructions for updating the config
(see: https://docs.skypilot.co/en/latest/reference/api-server/api-server-admin-deploy.html#setting-the-skypilot-config).

### Why PVC?
- **Fast:** Immediate reflection of config changes
- **Unified:** Consistent with non-Kubernetes deployments
- **Future-proof:** Enables migration to external storage sources

### Why Not ConfigMap?
- **Slow updates:** Changes take 2-3 seconds to reflect in mounted files,
  making SkyPilot unresponsive (known Kubernetes issue with no planned fix:
  [kubernetes/kubernetes#50345](https://github.com/kubernetes/kubernetes/issues/50345#issuecomment-585344794))
- **Poor persistence:** Config is not persisted across Kubernetes clusters
  and backing up is difficult

**Note:** SkyPilot syncs the config back to the ConfigMap for user convenience,
but the ConfigMap may not always be in sync with the PVC (e.g., when a user runs
`helm upgrade` with a new ConfigMap). Sync occurs when config changes are made
through the workspace API.

**TODO:** Provide an API to get the config directly from the API server to
eliminate the ConfigMap dependency.
