Pavan Madduri pmady

Senior Cloud Platform Engineer building GPU/AI infrastructure at scale.
CNCF Golden Kubestronaut. Oracle ACE Associate. Dragonfly Community Member.
31+ PRs across 17 open-source projects in CNCF, ASWF, and beyond.
If GPUs need scheduling, scaling, or observability on Kubernetes — that's what I build.

⚡ What I'm Building


🎮 GPU Autoscaling	KEDA External Scaler with native NVML metrics, DaemonSet architecture, scaling profiles for vLLM, Triton, and training workloads. Referenced in KEDA #7538 and published on CNCF Blog.
🔬 GPU NUMA Topology	Volcano scheduler plugin for NUMA-aware GPU placement — topology discovery via sysfs, CRD extensions, and cross-socket affinity optimization.
📡 GPU Observability	OpenTelemetry Collector receiver for GPU metrics (NVML-native) and Docker Desktop Extension for real-time GPU monitoring dashboards.
🧠 Topology-Aware AIOps	Knowledge graph of Kubernetes resources with graph-based root-cause traversal, AlertManager webhook integration, and blast-radius analysis.
☁️ Platform Engineering	Kubernetes, ArgoCD, Crossplane, Docker, KEDA — production platforms serving enterprise workloads at scale.
📝 Technical Writing	20 published articles across CNCF Blog, IEEE ComSoc, Platform Engineering, VKTR, Cloud Native Now, and Medium.

🏆 Certifications & Recognition

Golden Kubestronaut: holds all five Kubernetes certifications (KCNA, CKA, CKAD, CKS, KCSA)

🚀 Featured Projects

🎮 KEDA GPU Scaler

KEDA External gRPC Scaler for GPU/AI workloads

🎮 Native NVML — Direct GPU metrics via go-nvml
🚀 Scaling Profiles — vLLM, Triton, training presets
📦 DaemonSet — Per-node GPU metric collection
🔄 Scale-to-Zero — GPU-aware idle detection
📈 Prometheus — Optional /metrics endpoint

Tech: Go · gRPC · NVIDIA NVML · Kubernetes · Helm

Referenced in KEDA #7538 | CNCF Blog

📡 OpenTelemetry GPU Receiver

OpenTelemetry Collector receiver for GPU metrics

🔋 NVIDIA NVML — GPU utilization, memory, temperature
📊 OTel Native — Standard OTLP export pipeline
🖥️ Multi-GPU — All devices on the node
📈 Prometheus — Built-in Prometheus exporter

Tech: Go · OpenTelemetry Collector SDK · NVML

🐳 Docker GPU Dashboard Extension

Real-time NVIDIA GPU metrics in Docker Desktop

📊 Live Dashboard — Utilization, memory, temperature, power
📈 History Charts — 2-minute rolling Recharts graphs
🚦 Alert Thresholds — Color-coded green/yellow/red
🎭 Mock Mode — Develop without GPU hardware

Tech: Go · React · Recharts · Docker Extension SDK · NVML

🧠 Kube Topology Agent

K8s knowledge graph & automated root-cause analysis

🗺️ Knowledge Graph — Real-time resource topology
🔍 Root-Cause Traversal — Graph-based incident investigation
🎮 GPU Aware — Training/inference/batch classification
🔔 AlertManager — Webhook integration for auto-investigation

Tech: Go · Kubernetes API · Gorilla Mux · Helm

More projects: KubeAI Autoscaler · Ingress2Gateway · Golden Kubestronaut Learning · LLMOps

🌱 Open Source Contributions

31+ PRs across 17 projects in CNCF, ASWF, and open-source foundations.

CNCF (Cloud Native Computing Foundation)

Project	Description	Contributions
Dragonfly	P2P-based file distribution and image acceleration	client#1861 - Fix error chain propagation in backend stream failures, client#1665 - Add Hugging Face backend support with hf:// protocol, client#1673 - Add ModelScope backend support with modelscope:// protocol, d7y.io#386 - Add hf:// protocol documentation, d7y.io#398 - Add P2P-accelerated AI model downloads blog post, helm-charts#455 - Add injector support to helm chart, helm-charts#480 - Replace deprecated bitnamilegacy/mysql with bitnami/mysql
Kubernetes	Production-Grade Container Orchestration	#53891 - Document deployment.kubernetes.io/* annotations, #53892 - Add kubectl apply view-last-applied documentation
TiKV	Distributed transactional key-value database	#19225 - Add AGENTS.md for AI agent guidance
Volcano	Cloud-native batch scheduling for AI/HPC	#5328 - Fix typos in scheduler comments, #5095 - GPU NUMA topology awareness in scheduler, apis#229 - Add GPUInfo type to NumatopoSpec CRD, resource-exporter#12 - GPU NUMA topology discovery via sysfs
HAMi	Heterogeneous AI Computing Virtualization Middleware	#1893 - Add unit tests for nvinternal info, mig, and watch packages
KEDA	Kubernetes Event-driven Autoscaling	keda-docs#1658 - Removing metricName from the kedadocs, keda-docs#1769 - Fix datadog scaler typos across all versions, #7538 - GPU/AI inference scaler architectural analysis
Metal³	Bare metal host provisioning for Kubernetes	#624 - Fix redirect links in tryit.md
OpenTelemetry	Observability framework	#8632 - Add .NET troubleshooting page
kpt	Kubernetes-native packaging and resource management	#4278 - Fix kpt fn doc command for KRM functions expecting input
traceAI	Open-source LLM observability SDK	#165 - Fix exporter shutdown and thread safety in Python SDK, #166 - Add Go SDK with OpenAI instrumentor

ASWF (Academy Software Foundation)

Project	Description	Contributions
OpenColorIO	Color management library	#2229 - Add release signing workflow, #2230 - Add Dependabot configuration, #2243 - Add Vulkan unit test framework
OpenCue	Cloud rendering management system	#2134 - Add scheduled subscription recalculation task
OpenImageIO	Image processing library	#4976 - Fix IBA::compare_Yee() channel access
RAWtoACES	RAW to ACES image conversion	#222 - Add build developer documentation
xSTUDIO	Playback and review application	#186 - Fix broken build guide links

🧰 Tech Stack

📝 Publications

20 articles published across CNCF Blog, IEEE ComSoc, Platform Engineering, VKTR, Cloud Native Now, and Medium.

Title	Publication	Date
Stop Wasting GPU Budget: Autoscaling AI Inference on Kubernetes with KEDA	Cloud Native Now	Jun 2026
GPU Autoscaling on Kubernetes with KEDA: Building an External Scaler	CNCF Blog	May 2026
Shattering the Kubernetes Registry Bottleneck: Scaling Enterprise CI/CD with P2P Mesh Architecture	Cloud Native Now	May 2026
The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs	Cloud Native Now	May 2026
Agentic AIOps: Building the Guardrails for Autonomous Infrastructure	VKTR	May 2026
Architecting Enterprise GitOps: Scaling Argo CD on OKE	Cloud Native Now	May 2026
Deploying Docker AI Agents on OCI and OKE	Cloud Native Now	May 2026
Abstracting AI Infrastructure: Native GPU Scaling for Internal Developer Platforms	Platform Engineering	May 2026
Why Enterprise AI Fails: The 4 Infrastructure Bottlenecks Nobody Wants to Talk About	VKTR	Apr 2026
From public static void main to Golden Kubestronaut: The Art of Unlearning	CNCF Blog	Apr 2026
Peer-to-Peer Acceleration for AI Model Distribution with Dragonfly	CNCF Blog	Apr 2026
The IDP Paradox: Why Your Internal Developer Platform Needs a "Java-First" Strategy	Platform Engineering	Apr 2026
The Financial Trap of Autonomous Networks: Scaling Agentic AI in the Telecom Core	IEEE ComSoc	Mar 2026
Zero-Trust on OKE: How to Actually Secure Your Clusters With Terraform	Cloud Native Now	Mar 2026
Beyond the Green Checkmark: Using Formal Verification to Stop ArgoCD Drift	Cloud Native Now	Mar 2026
The Efficiency Era: How Kubernetes v1.35 Finally Solves the "Restart" Headache	Cloud Native Now	Mar 2026
Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering	Platform Engineering	Feb 2026
From PagerDuty to 'Agentic Ops': The Rise of Self-Healing Kubernetes	Cloud Native Now	Feb 2026
I Replaced a $3/hr GPU Dev Workflow with Docker Model Runner	Medium	May 2026
GPU-Aware Autoscaling for Docker Containers	Medium	May 2026

📰 Media Mentions

Quoted as a subject-matter expert across 11+ publications on enterprise AI, GPU infrastructure, cloud security, and platform engineering.

Featured & Quoted

Publication	Article	Quote / Mention
AI Business (Informa PLC)	OpenAI vs. Anthropic vs. Google: But the Model Isn't the Point	"The real dependency risk comes from the orchestration, workflow and data integration layers built around them... Relying on third-party orchestration is where real lock-ins happen."
VKTR (Simpler Media Group)	Enterprise AI Costs Climb as GPU Demand Outpaces Supply	"The architecture that works is a routing layer: simple tasks go to a lightweight SLM, complex reasoning escalates to the frontier model. You stop paying frontier prices for envelope-delivery workloads."
Techopedia (10M+ monthly visitors)	AI Experts Call for a Reality Check on Allbirds' Pivot	"GPU capacity is genuinely hard to get right now... You can't buy that institutional knowledge with a convertible note and a rebrand."
Reworked (Simpler Media Group)	AI Agents and the Process Documentation Fallacy	"If an AI agent is trained purely by observing the official workflow in the ticketing platform, it's learning a fantasy... You have to fence the AI in."
InfoSec Relations	Agentic AI is Exposing the Accountability Gap in Cloud Security Governance	"We enforce this with Policy-as-Code at the admission layer, so the agent's available responses are constrained by the infrastructure itself, not by a governance doc that someone wrote once and nobody checks."
Tech Round (UK)	Meta Acquires Moltbook: What Responsibility Do Meta And Regulators Have?	"We are building autonomous agents without implementing Zero Trust security... Regulators must urgently pivot to regulating Agentic Privileges."
TLDR Newsletter (3M+ subscribers)	Featured Mention	CNCF GPU autoscaling blog featured to 3M+ subscribers
Habr (VKTech / VK Group)	GPU Auto-Scaling on Kubernetes with KEDA	Russian-language adaptation of CNCF blog — 4,500+ views in 13 hours
Cloud Native Now (Techstrong Group)	Stop Wasting GPU Budget: Autoscaling AI Inference on Kubernetes with KEDA	Primary author — GPU autoscaling architecture and scale-to-zero for AI inference
Y Square Technology	AI Agent Documentation Reality Gap	Quoted on enterprise AI agent deployment challenges

CNCF Official Recognition

CNCF LinkedIn (500K+ followers)

"Pavan Madduri breaks down how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly — cutting metric latency from 15–30s to 2–4s."

204+ likes · 28 reposts · 3 comments

CNCF Twitter/X (@CloudNativeFdn)

"See how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly, with scaling profiles for vLLM, Triton, and training workloads."

2,122 views · 24 likes · 7 bookmarks

CNCF Bluesky (cncf.io)

"GPU autoscaling on Kubernetes with KEDA: Building an external scaler"

Featured across all 3 CNCF social platforms

CNCF LinkedIn (500K+ followers)

"From public static void main to Golden Kubestronaut: The Art of Unlearning — Pavan Madduri shares his journey through all five Kubernetes certifications."

26+ likes · 1 repost

📊 GitHub Stats

Stats updated on 2026-06-11 15:25 UTC

🐍 Contribution Activity

🤝 Let's Connect

Building GPU infrastructure for Kubernetes? Working on CNCF projects? Let's collaborate.
