Helm charts for deploying NVIDIA NeMo microservices infrastructure and demos on OpenShift.
- OpenShift 4.x cluster
- Helm 3.x
- `oc` CLI configured
- NGC API key (for pulling NVIDIA images)
- GPU nodes (for training/inference workloads)
- `deploy/nemo-infra`: Infrastructure components (databases, MLflow, Argo, Milvus, Volcano, Operators)
- `deploy/nemo-instances`: Example NeMo microservices deployments
Before installation, ensure you have:
- NGC Helm Repository (required for NeMo Operator):
  helm repo add nvidia-nemo https://helm.ngc.nvidia.com/nvidia-nemo
  helm repo update
- NGC API Key for pulling NVIDIA images and downloading models
- OpenShift Cluster with GPU nodes available
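A quick way to sanity-check these prerequisites before installing (the GPU node label shown below is the one applied by the NVIDIA GPU Operator; adjust it if your cluster labels GPU nodes differently):
# Verify CLI tooling and cluster login
helm version --short
oc whoami
oc version
# List GPU nodes (assumes the NVIDIA GPU Operator's node label)
oc get nodes -l nvidia.com/gpu.present=true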
Deploy all infrastructure components (PostgreSQL, MLflow, Argo Workflows, Milvus, Volcano, Operators):
cd deploy/nemo-infra
# Update dependencies
helm dependency update
# Deploy infrastructure
helm install nemo-infra . \
-n <namespace> \
--create-namespace

Verify installation:
oc get pods -n <namespace> | grep -E "(postgresql|mlflow|volcano|argo|milvus|opentelemetry|jupyter|minio)"

Expected Output:
jupyter-notebook-5fc745674d-9gq29 1/1 Running 0 175m
nemo-infra-argo-workflows-server-6ccf84f45d-rnvxg 1/1 Running 0 175m
nemo-infra-argo-workflows-workflow-controller-68d456d755-hds5p 1/1 Running 0 175m
nemo-infra-customizer-mlflow-tracking-6787ff598-fjmmr 1/1 Running 0 175m
nemo-infra-customizer-opentelemetry-648455b458-zq8v6 1/1 Running 0 175m
nemo-infra-customizer-postgresql-0 1/1 Running 0 175m
nemo-infra-datastore-postgresql-0 1/1 Running 0 175m
nemo-infra-entity-store-postgresql-0 1/1 Running 0 175m
nemo-infra-evaluator-milvus-standalone-86599fc78f-gkhhv 1/1 Running 0 175m
nemo-infra-evaluator-opentelemetry-577d4d757-lnlcm 1/1 Running 0 175m
nemo-infra-evaluator-postgresql-0 1/1 Running 0 175m
nemo-infra-guardrail-postgresql-0 1/1 Running 0 175m
nemo-infra-minio-89bdcdd79-s28km 1/1 Running 0 175m
nemo-infra-postgresql-0 1/1 Running 0 175m
All pods should show 1/1 Running status. If any pods are not running, check the troubleshooting section in deploy/nemo-infra/README.md.
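If a pod is stuck, the standard oc commands usually surface the cause (image pull failures, unbound PVCs, crash loops). For example:
# Inspect a failing pod; the Events section usually shows the root cause
oc describe pod <pod-name> -n <namespace>
# Check container logs, including the previous crashed container
oc logs <pod-name> -n <namespace>
oc logs <pod-name> -n <namespace> --previous
# Pending pods are often waiting on a PVC
oc get pvc -n <namespace>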
📖 Configuration options: deploy/nemo-infra/README.md
Before deploying NeMo instances, create the required secrets:
export NGC_API_KEY="<YOUR_NGC_API_KEY>"
# NGC Image Pull Secret (for pulling images)
oc create secret docker-registry ngc-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_KEY \
-n <namespace>
# NGC API Secret (for model downloads)
oc create secret generic ngc-api-secret \
--from-literal=NGC_API_KEY=$NGC_API_KEY \
-n <namespace>

These secrets must match the PostgreSQL passwords from your infrastructure deployment. Default passwords (development only):
# Customizer PostgreSQL
oc create secret generic customizer-pg-existing-secret \
--from-literal=password=ncspassword \
-n <namespace>
# Datastore PostgreSQL
oc create secret generic datastore-pg-existing-secret \
--from-literal=password=ndspass \
-n <namespace>
# Entity Store PostgreSQL
oc create secret generic entity-store-pg-existing-secret \
--from-literal=password=nespass \
-n <namespace>
# Evaluator PostgreSQL
oc create secret generic evaluator-pg-existing-secret \
--from-literal=password=evalpass \
-n <namespace>
# Guardrail PostgreSQL
oc create secret generic guardrail-pg-existing-secret \
--from-literal=password=guardrailpass \
-n <namespace>

📖 Full secrets documentation: deploy/nemo-instances/README.md
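To confirm a secret decodes to the password your infrastructure deployment expects, you can read it back. Shown here for the customizer secret; the same pattern works for the others:
# Decode the stored password and compare it with the value used by nemo-infra
oc get secret customizer-pg-existing-secret -n <namespace> \
  -o jsonpath='{.data.password}' | base64 -d; echo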
Deploy example NeMo microservices (Customizer, Datastore, Entity Store, Evaluator, Guardrail, NIM services):
cd deploy/nemo-instances
# Update dependencies (downloads operator charts)
helm dependency update
# Deploy operators and create NeMo instances
helm install nemo-instances . \
-n <namespace> \
--set namespace.name=<namespace>

Deployment Order:
- NeMo Operator is installed first
- NIM Operator is installed second (depends on NeMo Operator)
- NeMo instances are deployed last (depends on both operators)
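Helm does not block between these stages by default. If you prefer the install command to wait until the release's resources report ready, --wait with a generous timeout is one option (a sketch, not a requirement), and you can watch the ordered rollout from a second terminal:
# Optional: block until the release's resources are ready
helm install nemo-instances . \
-n <namespace> \
--set namespace.name=<namespace> \
--wait --timeout 15m
# In another terminal, watch the operators come up before the instances
oc get pods -n <namespace> -w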
Verify installation:
# Check all NeMo microservices
oc get -n <namespace> nemoentitystore,nemodatastore,nemoguardrails,nemocustomizer,nemoevaluator
# Check NIM services
oc get -n <namespace> nimpipeline,nimcache,nimservice
# Check pods
oc get pods -n <namespace> | grep -E "(nemo-operator|nim-operator|nemocustomizer|nemodatastore|nemoentitystore|nemoevaluator|nemoguardrails|nimcache|nimpipeline)"

Expected Output:
nemo-samples-nemo-operator-controller-manager-b4bd4bd69-vbh7z 2/2 Running 0 152m
nemo-samples-nim-operator-666b78dd44-crmtm 1/1 Running 0 152m
nemocustomizer-sample-5d8b5fb5fb-qqd8c 1/1 Running 0 50m
nemodatastore-sample-74dcb5568d-qbkc8 1/1 Running 0 152m
nemoentitystore-sample-66b4fc4fdc-9n82g 1/1 Running 0 152m
nemoevaluator-sample-79544995db-b4h4b 1/1 Running 0 152m
nemoguardrails-sample-74bb7f5bc9-w2q68 1/1 Running 0 152m
All pods should show Running status. The NeMo Operator pod shows 2/2 Running (controller + manager containers), while other pods show 1/1 Running. If any pods are not running, check the troubleshooting section in deploy/nemo-instances/README.md.
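When an instance pod is missing or crash-looping, the operator-managed Custom Resource usually explains why; its status conditions and recent namespace events are the first places to look. For example, for the customizer instance:
# Inspect the CR's status conditions reported by the NeMo Operator
oc describe nemocustomizer nemocustomizer-sample -n <namespace>
# Recent events often surface reconcile or scheduling errors
oc get events -n <namespace> --sort-by=.lastTimestamp | tail -20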
📖 Configuration options: deploy/nemo-instances/README.md
To upgrade existing deployments:
# Upgrade infrastructure
cd deploy/nemo-infra
helm upgrade nemo-infra . -n <namespace>
# Upgrade instances
cd deploy/nemo-instances
helm upgrade nemo-instances . -n <namespace> --set namespace.name=<namespace>

OpenShift uses Security Context Constraints (SCCs) to control what security settings pods can use. The NeMo deployment follows the principle of least privilege by granting specific SCCs only to the ServiceAccounts that need them.
1. NemoCustomizer SCC (`nemo-customizer-scc`)
   - Purpose: Allows model downloader jobs to run with specific security contexts
   - Granted to: `nemocustomizer-sample` ServiceAccount only
   - Why: Model downloader jobs need `RunAsAny` to match PVC ownership (user 1000, group 2000)
   - Not granted to: `default` ServiceAccount (principle of least privilege)
2. Nonroot SCC (`nonroot`)
   - Purpose: Standard SCC for inference workloads (requires non-root user)
   - Granted to: `default` ServiceAccount (for UI-created InferenceServices)
   - Why: UI-created InferenceServices that don't go through the NIMService operator use the `default` ServiceAccount
   - Note: NIMService-created InferenceServices use their own ServiceAccounts (handled by the NIMService operator)
- Dedicated ServiceAccounts: Each workload should have its own ServiceAccount
  - The NIMService operator creates a ServiceAccount per NIMService (name = NIMService name)
  - Customizer uses the `nemocustomizer-sample` ServiceAccount
  - UI-created InferenceServices use the `default` ServiceAccount
- Explicit SCC Access: SCCs are granted via RBAC Role/Binding, not via `oc adm policy` (a minimal sketch follows this list)
  - See `deploy/nemo-instances/templates/nemocustomizer-oc-rbac.yaml`
  - Follows OpenShift best practices for RBAC
- Principle of Least Privilege: Only grant the minimum SCC required
  - Customizer jobs need `nemo-customizer-scc` (RunAsAny)
  - Inference workloads need `nonroot` (MustRunAsNonRoot)
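For reference, granting an SCC through RBAC boils down to a Role with the `use` verb on the named SCC, bound to the ServiceAccount. The sketch below illustrates the pattern with hypothetical resource names; the real definitions live in deploy/nemo-instances/templates/nemocustomizer-oc-rbac.yaml:
# Illustrative only; see nemocustomizer-oc-rbac.yaml for the actual template
oc apply -n <namespace> -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-nemo-customizer-scc
rules:
  - apiGroups: ["security.openshift.io"]
    resources: ["securitycontextconstraints"]
    resourceNames: ["nemo-customizer-scc"]
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: use-nemo-customizer-scc-binding
subjects:
  - kind: ServiceAccount
    name: nemocustomizer-sample
    namespace: <namespace>
roleRef:
  kind: Role
  name: use-nemo-customizer-scc
  apiGroup: rbac.authorization.k8s.io
EOF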
If pods fail with SCC errors:
# Check which SCCs are available to a ServiceAccount
oc get rolebinding -n <namespace> | grep <serviceaccount-name>
# Check SCC details
oc get scc nonroot -o yaml
oc get scc nemo-customizer-scc -o yaml
# Verify ServiceAccount is bound correctly
oc get rolebinding default-scc-nonroot-binding -n <namespace> -o yaml
oc get rolebinding customizer-scc-nemo-customizer-scc-binding -n <namespace> -o yaml

Common Issues:
- Permission denied errors: Check if pod's ServiceAccount has the correct SCC access
- SCC conflicts: Grant each ServiceAccount only the SCC it needs; when multiple SCCs are available, OpenShift selects by priority and, among equal priorities, the most restrictive SCC that admits the pod
- UI-created InferenceServices: Should use the `default` ServiceAccount with the `nonroot` SCC (automatically configured)
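To see which SCC a running pod was actually admitted under, OpenShift records it in a pod annotation:
# The openshift.io/scc annotation shows the SCC the pod was admitted under
oc get pod <pod-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}'; echo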
Always uninstall `nemo-instances` BEFORE `nemo-infra` because:
- `nemo-instances` depends on infrastructure components
- Custom Resources in `nemo-instances` must be deleted first
- The pre-delete hook in `nemo-instances` handles CR cleanup automatically
Simply run helm uninstall - the pre-delete hook will automatically clean up Custom Resources:
# Step 1: Uninstall nemo-instances first (pre-delete hook handles CR cleanup)
helm uninstall nemo-instances -n <namespace>
# Step 2: Uninstall infrastructure
helm uninstall nemo-infra -n <namespace>

The pre-delete hook (pre-delete-cleanup.yaml) in nemo-instances automatically:
- Deletes all Custom Resources (NemoCustomizer, NemoDatastore, etc.)
- Handles finalizers, automatically removing them if they block deletion
- Waits for pods to terminate
Helm then proceeds with the normal uninstall.
That's it! The hook handles everything automatically, including finalizer cleanup.
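If you want to observe the hook while the uninstall runs, it executes as a Job in the release namespace. The exact Job name comes from pre-delete-cleanup.yaml, so the name below is a placeholder:
# In a second terminal while `helm uninstall nemo-instances` runs:
oc get jobs -n <namespace> -w
# Follow the hook's logs (substitute the actual Job name from the listing above)
oc logs -f job/<pre-delete-hook-job-name> -n <namespace>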
When you run helm uninstall, Helm only removes resources it manages directly. However:
- Custom Resources (NemoCustomizer, NemoDatastore, etc.) are managed by the NeMo Operator
- The operator continues to reconcile these CRs and create pods
- Helm doesn't delete CRs because they're not part of the Helm chart templates (they're created by the operator)
The pre-delete hook (pre-delete-cleanup.yaml) automatically deletes CRs before Helm uninstall, so helm uninstall works seamlessly.
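A quick way to compare what Helm tracks with what actually lives in the cluster under the operator's control (a sketch; resource kinds as in the verification step above):
# Resources Helm tracks for the release
helm get manifest nemo-instances -n <namespace> | grep '^kind:' | sort | uniq -c
# Operator-managed Custom Resources present in the cluster
oc get nemocustomizer,nemodatastore,nemoentitystore,nemoevaluator,nemoguardrails -n <namespace>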
If the pre-delete hook fails (rare), you can manually delete CRs first:
NAMESPACE="<your-namespace>"
# Step 1: Delete Custom Resources first (CRITICAL - must be done first)
oc delete nemocustomizer,nemodatastore,nemoentitystore,nemoevaluator,nemoguardrail,nimcache,nimpipeline --all -n "$NAMESPACE"
# Step 2: Wait for pods to terminate
oc wait --for=delete pod --all -n "$NAMESPACE" --timeout=300s
# Step 3: Uninstall Helm releases (in correct order)
helm uninstall nemo-instances -n "$NAMESPACE"
helm uninstall nemo-infra -n "$NAMESPACE"
# Step 4: Clean up remaining resources (optional)
oc delete all --all -n "$NAMESPACE" --ignore-not-found=true
oc delete jobs --all -n "$NAMESPACE" --ignore-not-found=true
oc delete configmap,secret,serviceaccount,role,rolebinding --all -n "$NAMESPACE" --ignore-not-found=true
# Step 5: Clean up cluster resources (optional)
oc delete clusterrole,clusterrolebinding -l component=volcano --ignore-not-found=true
oc delete validatingwebhookconfigurations \
volcano-admission-service-jobs-validate \
volcano-admission-service-pods-validate \
volcano-admission-service-queues-validate \
--ignore-not-found=true
# Step 6: Optionally delete PVCs (preserves data by default)
# oc delete pvc --all -n "$NAMESPACE"

If resources are stuck after cleanup:
# Check for finalizers blocking deletion
oc get nemocustomizer -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.finalizers}'
# Force delete by removing finalizers (use with caution)
oc patch nemocustomizer <name> -n "$NAMESPACE" -p '{"metadata":{"finalizers":[]}}' --type=merge
# Check for stuck pods
oc get pods -n "$NAMESPACE"
# Force delete stuck pods (if needed)
oc delete pod <pod-name> -n "$NAMESPACE" --force --grace-period=0

Some resources are retained by default due to resource policies:
- CustomResourceDefinitions (CRDs): Cluster-scoped, typically retained
- PVCs: Retained to preserve data (delete manually if needed)
- ClusterRole/ClusterRoleBinding: May be retained if used by other namespaces
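If you are decommissioning the stack entirely and want to remove these as well, list before deleting rather than guessing names (CRD names vary by operator version):
# List NeMo/NIM-related CRDs before deleting anything
oc get crd | grep -Ei 'nemo|nim'
# Delete retained PVCs (destroys data!)
oc delete pvc --all -n <namespace>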