geek-cookbook
diff --git a/charts/stable/skypilot/Chart.yaml b/charts/stable/skypilot/Chart.yaml
Lines changed: 11 additions & 2 deletions
diff --git a/charts/stable/skypilot/DCGM_MONITORING.md b/charts/stable/skypilot/DCGM_MONITORING.md
Lines changed: 155 additions & 0 deletions
diff --git a/charts/stable/skypilot/README.md b/charts/stable/skypilot/README.md
Lines changed: 10 additions & 0 deletions
diff --git a/charts/stable/skypilot/developer.md b/charts/stable/skypilot/developer.md
Lines changed: 29 additions & 0 deletions
@@ -1,11 +1,20 @@
 apiVersion: v2
 name: skypilot
 description: A Helm chart for deploying SkyPilot API server on Kubernetes
+icon: "https://raw.githubusercontent.com/skypilot-org/skypilot/master/charts/skypilot/skypilot.svg"
 type: application
-version: 0.0.1-pre-07
+version: 0.0.0
 appVersion: "0.0"
 dependencies:
   - name: ingress-nginx
-    version: 4.11.3
+    version: 4.11.8
     repository: https://kubernetes.github.io/ingress-nginx
     condition: ingress-nginx.enabled
+  - name: prometheus
+    version: 27.20.0
+    repository: https://prometheus-community.github.io/helm-charts
+    condition: prometheus.enabled
+  - name: grafana
+    version: 9.2.2
+    repository: https://grafana.github.io/helm-charts
+    condition: grafana.enabled
@@ -0,0 +1,155 @@
+# DCGM Exporter Monitoring with SkyPilot API Server
+
+This document explains how to configure the SkyPilot API server to automatically monitor DCGM exporters running in the same Kubernetes cluster.
+
+## Overview
+
+The SkyPilot API server can automatically monitor DCGM exporters deployed in the same Kubernetes cluster. This is achieved through:
+
+1. **Kubernetes Service Discovery**: Prometheus uses Kubernetes-native service discovery to automatically find DCGM exporters
+2. **Grafana Dashboards**: Pre-built dashboards for visualizing GPU metrics from the cluster
+
+**Note**: The current implementation works within a single Kubernetes cluster where the SkyPilot API server, Prometheus, and DCGM exporters are all deployed together.
+
+## Architecture
+
+```
+┌─────────────────┐    ┌──────────────────┐
+│   Prometheus    │───▶│  DCGM Exporters  │
+│                 │    │  (Same Cluster)  │
+└─────────────────┘    └──────────────────┘
+         │                       │
+         │                       │
+         ▼                       ▼
+┌─────────────────┐    ┌──────────────────┐
+│    Grafana      │    │   GPU Metrics    │
+│   Dashboards    │    │  (Kubernetes SD) │
+└─────────────────┘    └──────────────────┘
+```
+
+## Prerequisites
+
+### 1. DCGM Exporters in the Same Cluster
+
+The Kubernetes cluster where SkyPilot is deployed must have DCGM exporters running with proper Prometheus configuration.
+
+#### Using GPU Operator
+
+See NVIDIA's GPU Operator guide, [Common Deployment Scenarios](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-deployment-scenarios), for installation options.
+
+While the GPU Operator provides DCGM exporter capabilities, **manual configuration is typically required** to enable proper Prometheus scraping.
+
+#### Manual DCGM Exporter Configuration
+
+After installing the GPU Operator, you may need to manually configure or deploy DCGM exporters with the correct Prometheus annotations:
+
+```yaml
+# Example DCGM exporter deployment with Prometheus annotations
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: nvidia-dcgm-exporter
+  namespace: gpu-operator
+spec:
+  selector:
+    matchLabels:
+      app: nvidia-dcgm-exporter
+  template:
+    metadata:
+      labels:
+        app: nvidia-dcgm-exporter
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "9400"
+        prometheus.io/path: "/metrics"
+    spec:
+      # ... rest of DCGM exporter configuration
+```
+
+#### Expected DCGM Exporter Configuration
+
+Once properly configured, you should see:
+
+**NVIDIA GPU Operator DCGM Exporter:**
+- **Service Name**: `nvidia-dcgm-exporter`
+- **Namespace**: `gpu-operator`
+- **Port**: `9400`
+- **Metrics Path**: `/metrics`
+- **Prometheus Annotation**: `prometheus.io/scrape: "true"`
+
+**Note**: Depending on your cloud provider or cluster setup, you may also see additional DCGM exporters (e.g., cloud provider-specific ones like `nebius-dcgm`). Prometheus will automatically discover and scrape all properly annotated DCGM exporters.
+
+#### Verification Commands
+
+```bash
+# Check if DCGM exporters are running
+kubectl get pods -A | grep dcgm
+
+# Check DCGM exporter services and annotations
+kubectl get svc -A | grep dcgm
+kubectl describe svc -n gpu-operator nvidia-dcgm-exporter
+
+# Test metrics endpoint (port-forward blocks, so run it in a separate terminal)
+kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
+# In a second terminal:
+curl http://localhost:9400/metrics
+```
+
+## Configuration
+
+### Deploy with Helm
+
+```bash
+helm install skypilot ./charts/skypilot \
+  --set apiService.metrics.enabled=true \
+  --set prometheus.enabled=true \
+  --set grafana.enabled=true
+```
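+
+The same three toggles can be kept in a values file instead of `--set` flags. A minimal sketch, assuming a hypothetical file name `monitoring-values.yaml` passed with `-f`; the keys mirror the flags above and nothing else is implied about the chart's defaults:
+
+```yaml
+# monitoring-values.yaml (hypothetical file name)
+# Equivalent to the --set flags in the helm install command above.
+apiService:
+  metrics:
+    enabled: true
+prometheus:
+  enabled: true   # matches the `prometheus.enabled` condition in Chart.yaml
+grafana:
+  enabled: true   # matches the `grafana.enabled` condition in Chart.yaml
+```
+
+Usage would then be `helm install skypilot ./charts/skypilot -f monitoring-values.yaml`.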
+
+### Dashboard Configuration
+
+The NVIDIA DCGM dashboard is automatically provisioned using Grafana's dashboard import feature:
+
+```yaml
+# In values.yaml
+grafana:
+  enabled: true
+  dashboardProviders:
+    dashboardproviders.yaml:
+      apiVersion: 1
+      providers:
+      - name: 'default'
+        orgId: 1
+        folder: ''
+        type: file
+        disableDeletion: false
+        allowUiUpdates: false
+        updateIntervalSeconds: 30
+        options:
+          path: /var/lib/grafana/dashboards/default
+```
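+
+The provider above only tells Grafana where to look for dashboard files. One way the Grafana chart can populate that directory is its `dashboards` value, which downloads a dashboard from grafana.com by ID. The sketch below assumes that mechanism and the community NVIDIA DCGM Exporter dashboard; the dashboard key, the ID, and the datasource name are illustrative assumptions, not values confirmed by this chart:
+
+```yaml
+# In values.yaml (illustrative; not necessarily what this chart ships)
+grafana:
+  dashboards:
+    default:                  # files land in /var/lib/grafana/dashboards/default
+      nvidia-dcgm:            # hypothetical dashboard key
+        gnetId: 12239         # assumed grafana.com dashboard ID
+        datasource: Prometheus  # assumed datasource name
+```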
+
+## How It Works
+
+### 1. Kubernetes Service Discovery
+
+Prometheus is configured to use Kubernetes service discovery to automatically find DCGM exporters:
+
+```yaml
+# Prometheus scrape config (automatically generated)
+- job_name: 'kubernetes-pods'
+  kubernetes_sd_configs:
+  - role: pod
+  relabel_configs:
+  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+    action: keep
+    regex: true
+  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
+    action: replace
+    target_label: __metrics_path__
+    regex: (.+)
+```
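+
+The excerpt above shows only the `scrape` and `path` rules. For the `prometheus.io/port` annotation from the DaemonSet example to take effect, the job also needs a rule that rewrites `__address__`. A sketch of the rule commonly used for this, assumed here rather than copied from the chart's generated config:
+
+```yaml
+# Assumed additional entry in relabel_configs of the 'kubernetes-pods' job.
+# Joins "<host>[:port];<annotation port>" and keeps the host plus the annotated port.
+- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+  action: replace
+  regex: ([^:]+)(?::\d+)?;(\d+)
+  replacement: $1:$2
+  target_label: __address__
+```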
+
+### 2. Dashboard Provisioning
+
+Grafana automatically:
+
+1. Discovers dashboards defined in `charts/skypilot/manifests/`
+2. Uses the Prometheus datasource
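+
+For reference, a Prometheus datasource is typically declared through the Grafana chart's `datasources` value. The following is a sketch only: the datasource name and the URL are assumptions (the URL assumes a release named `skypilot` and the Prometheus chart's default server service naming), so check the chart's `values.yaml` for the actual settings:
+
+```yaml
+# In values.yaml (illustrative)
+grafana:
+  datasources:
+    datasources.yaml:
+      apiVersion: 1
+      datasources:
+      - name: Prometheus                        # assumed datasource name
+        type: prometheus
+        url: http://skypilot-prometheus-server  # assumed in-cluster service URL
+        access: proxy
+        isDefault: true
+```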
@@ -0,0 +1,10 @@
+![SkyPilot](https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png)
+
+[![Documentation](https://img.shields.io/badge/docs-gray?logo=readthedocs&logoColor=f5f5f5)](https://docs.skypilot.co/)
+[![GitHub Release](https://img.shields.io/github/release/skypilot-org/skypilot.svg)](https://github.com/skypilot-org/skypilot/releases)
+[![Join Slack](https://img.shields.io/badge/SkyPilot-Join%20Slack-blue?logo=slack)](http://slack.skypilot.co)
+[![Downloads](https://img.shields.io/pypi/dm/skypilot)](https://github.com/skypilot-org/skypilot/releases)
+
+## Run AI on Any Infra — Unified, Faster, Cheaper
+
+### [🌟 **SkyPilot Demo** 🌟: Click to see a 1-minute tour](https://demo.skypilot.co/dashboard/)
@@ -0,0 +1,29 @@
+# Developer Guide
+
+## SkyPilot Config Persistency
+
+**Design Decision:** Use PVC as the source of truth for SkyPilot configuration.
+
+**Drawback:** Helm upgrades cannot directly change the SkyPilot config for 
+the API server. Instead, we provide instructions for updating config 
+(see: https://docs.skypilot.co/en/latest/reference/api-server/api-server-admin-deploy.html#setting-the-skypilot-config).
+
+### Why PVC?
+- **Fast:** Immediate reflection of config changes
+- **Unified:** Consistent with non-Kubernetes deployments  
+- **Future-proof:** Enables migration to external storage sources
+
+### Why Not ConfigMap?
+- **Slow updates:** Changes take 2-3 seconds to reflect in mounted files, 
+  making SkyPilot unresponsive (known Kubernetes issue with no planned fix: 
+  [kubernetes/kubernetes#50345](https://github.com/kubernetes/kubernetes/issues/50345#issuecomment-585344794))
+- **Poor persistence:** Config is not persisted across Kubernetes clusters 
+  and backing up is difficult
+
+**Note:** SkyPilot syncs config back to the ConfigMap for user convenience, but
+the ConfigMap may not always be in sync with the PVC (e.g., after a user runs
+`helm upgrade` with a new ConfigMap). Sync occurs when config changes are made
+through the workspace API.
+
+**TODO:** Provide an API to get config directly from the API server to
+eliminate the ConfigMap dependency.