PD utils

This repository includes Agentic Skills that automate common llm-d operational tasks. Skills are custom slash commands defined by SKILL.md files. Invoke one in Claude Code with /<skill-name> and it guides the AI agent through a multi-step workflow (deploying configs, running tests, interpreting results) using the instructions and scripts in this repo.

Preflight Checks Skill

The /llm-d-preflight-checks skill patches an llm-d model server deployment to run a diagnostics script before vLLM starts. It collects environment variables, GPU topology (nvidia-smi topo -m, NVLink status), and CPU/PCI info, giving operators a window to inspect the pod environment and run network tests before the model loads.

In pause mode (LLMD_PREFLIGHT_CHECKS=pause), the script starts an HTTP server on the vLLM port that satisfies K8s health probes while blocking vLLM startup. Call /exit on the server to release the port and let vLLM proceed. See the SKILL.md for deployment instructions covering the llm-d quickstart, P/D disaggregation, and llm-d-benchmark.
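The pause behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's actual diagnostics script: the port number 8000, the use of GET for /exit, and the names PauseHandler and hold_port are all assumptions made for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PauseHandler(BaseHTTPRequestHandler):
    """Answer every GET with 200 so health probes pass; /exit stops the server."""

    def do_GET(self):
        body = b"exiting\n" if self.path == "/exit" else b"paused\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        if self.path == "/exit":
            # shutdown() blocks until serve_forever() returns, so run it on
            # another thread to avoid deadlocking this request handler.
            threading.Thread(target=self.server.shutdown, daemon=True).start()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def hold_port(port: int = 8000, host: str = "0.0.0.0") -> None:
    """Block until /exit is requested, then close the socket to free the port.

    The default port is an assumption; use whatever port vLLM is configured for.
    """
    server = HTTPServer((host, port), PauseHandler)
    try:
        server.serve_forever()
    finally:
        server.server_close()
```

Calling hold_port() blocks the caller, which is the point: a wrapper would call it first and start vLLM only after it returns. From inside the pod, something like curl localhost:8000/exit would then release the port, assuming the same port as above.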

The problem addressed and the design rationale are described in preflight-checks-design.md in the docs directory.

Networking Test Skill

The /llm-d-networking-tests skill validates GPU topology and inter-pod network performance for llm-d deployments. It drives the run-tests.sh automation scripts in this repo to:

Discover GPU topology — verify that GPUs within each pod are optimally connected (NVLink/NVSwitch for NVIDIA, Infinity Fabric for AMD) rather than separated by PCIe hops across NUMA nodes.
Run network performance tests — measure RDMA bandwidth and latency between pods using perftest (ib_write_bw, ib_read_bw), iperf3, NCCL/RCCL collectives, and nixlbench (GPU VRAM-to-VRAM via UCX).
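As an illustration of the topology check, the sketch below reads the GPU-to-GPU matrix printed by nvidia-smi topo -m and lists GPU pairs that are not connected by NVLink. It is not the repo's implementation. It covers NVIDIA only and assumes the usual matrix layout, in which each row starts with a GPU name followed by one link code per GPU column. The function names are hypothetical.

```python
import re

# Link codes from the nvidia-smi topo legend: NV# means a bonded set of NVLinks;
# PIX, PXB, PHB, NODE and SYS are progressively longer PCIe/CPU paths.
PCIE_CODES = {"PIX", "PXB", "PHB", "NODE", "SYS"}


def _is_link(token: str) -> bool:
    return token == "X" or token in PCIE_CODES or re.fullmatch(r"NV\d+", token) is not None


def parse_gpu_links(topo_text: str) -> dict[tuple[str, str], str]:
    """Return {(gpu_a, gpu_b): link_code} for each GPU pair in the matrix."""
    rows = []
    for line in topo_text.splitlines():
        tokens = line.split()
        # Data rows look like "GPU0 X NV12 ..."; the header row is skipped
        # because its second token is a column name, not a link code.
        if len(tokens) >= 2 and re.fullmatch(r"GPU\d+", tokens[0]) and _is_link(tokens[1]):
            rows.append(tokens)
    links = {}
    for i, row in enumerate(rows):
        for j in range(i + 1, len(rows)):
            if 1 + j < len(row):
                links[(row[0], rows[j][0])] = row[1 + j]
    return links


def non_nvlink_pairs(links: dict[tuple[str, str], str]) -> list[tuple[str, str]]:
    """GPU pairs whose traffic would cross PCIe instead of NVLink."""
    return sorted(pair for pair, code in links.items() if not code.startswith("NV"))


SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    SYS     0-31    0
GPU1    NV12     X      NV12    SYS     0-31    0
GPU2    NV12    NV12     X      SYS     0-31    0
GPU3    SYS     SYS     SYS      X      32-63   1
"""

links = parse_gpu_links(SAMPLE)
flagged = non_nvlink_pairs(links)
```

In the made-up sample, GPU3 reaches every other GPU only over SYS (PCIe plus the inter-NUMA interconnect), so all three pairs involving GPU3 are flagged.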

The skill asks for the target namespace and pod label selector, then runs the tests and helps interpret the results. See the SKILL.md for the full testing workflow and troubleshooting guide.
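To show how a namespace and label selector translate into a list of test targets, here is a small sketch built on kubectl get pods with JSON output. The helper names are hypothetical and the repo's scripts may discover pods differently. The fields read (metadata.name, status.phase, status.podIP) are standard Kubernetes Pod fields.

```python
import json
import subprocess


def kubectl_get_pods_cmd(namespace: str, selector: str) -> list[str]:
    """Build the kubectl command that lists pods matching a label selector."""
    return ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "json"]


def running_pods(pod_list_json: str) -> list[tuple[str, str]]:
    """Extract (name, podIP) for every Running pod in a kubectl JSON pod list."""
    items = json.loads(pod_list_json).get("items", [])
    result = []
    for pod in items:
        status = pod.get("status", {})
        if status.get("phase") == "Running" and status.get("podIP"):
            result.append((pod["metadata"]["name"], status["podIP"]))
    return result


def discover(namespace: str, selector: str) -> list[tuple[str, str]]:
    """Run kubectl and return the Running pods; requires cluster access."""
    out = subprocess.run(
        kubectl_get_pods_cmd(namespace, selector),
        check=True, capture_output=True, text=True,
    ).stdout
    return running_pods(out)
```

For example, discover("my-namespace", "app=my-model") would return the pods between which bandwidth and latency tests could be run. Both argument values here are placeholders.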

The network testing scripts are the run-*.py files; their design is described in networking-tests-design.md in the docs directory.

Build NIXL Image

docker build -t nixl:latest .
docker tag nixl:latest ghcr.io/<>/nixl:latest
docker push ghcr.io/<>/nixl:latest

Deploy & Test

cd benchmarks
./benchmark_deployment.sh nixl -rdma

Use -rdma when deploying to a cluster with RoCE enabled, for better performance. This deploys nixl-client-roce and nixl-server-roce and runs a benchmarking script that measures transfer throughput (GB/s). Refer to benchmarking for more details.
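When reading the reported throughput, it helps to convert between GB/s and the Gbit/s figure in which NIC line rates are usually quoted. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes); the benchmarking script's own convention is not stated here and may differ. The function names are illustrative only.

```python
def throughput_gb_per_s(bytes_transferred: int, seconds: float) -> float:
    """Transfer throughput in decimal GB/s (assumes 1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred / seconds / 1e9


def gb_per_s_to_gbit_per_s(gb_per_s: float) -> float:
    """Convert GB/s to Gbit/s for comparison with a NIC line rate."""
    return gb_per_s * 8


# Worked example with made-up numbers: 50 GB moved in 2.5 s.
rate = throughput_gb_per_s(50_000_000_000, 2.5)   # 20.0 GB/s
line_rate_equiv = gb_per_s_to_gbit_per_s(rate)     # 160.0 Gbit/s
```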

