
llm-d/llm-d-pd-utils

PD utils

This repository includes Agentic Skills that automate common llm-d operational tasks. Skills are custom slash commands defined by SKILL.md files. Invoke one in Claude Code with /<skill-name> and it guides the AI agent through a multi-step workflow (deploying configs, running tests, interpreting results) using the instructions and scripts in this repo.

Preflight Checks Skill

The /llm-d-preflight-checks skill patches an llm-d model server deployment to run a diagnostics script before vLLM starts. It collects environment variables, GPU topology (nvidia-smi topo -m, NVLink status), and CPU/PCI info, giving operators a window to inspect the pod environment and run network tests before the model loads.
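
To make the kind of data collected concrete, here is a minimal sketch of a diagnostics collector. It is not the script shipped in this repo: the function names and the output layout are hypothetical, and it assumes the standard nvidia-smi, lscpu, and lspci binaries, skipping any that are not installed in the pod.

```python
import os
import shutil
import subprocess


def run_if_available(cmd):
    """Return stdout of cmd, or None if the binary is missing or the call fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


def collect_diagnostics():
    """Gather a snapshot of the pod environment (hypothetical layout)."""
    return {
        "env": dict(os.environ),
        "gpu_topology": run_if_available(["nvidia-smi", "topo", "-m"]),
        "nvlink_status": run_if_available(["nvidia-smi", "nvlink", "--status"]),
        "cpu": run_if_available(["lscpu"]),
        "pci": run_if_available(["lspci"]),
    }
```

Calling collect_diagnostics() returns a dictionary that can be printed or dumped as JSON; entries are None for tools that are absent from the container image.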

In pause mode (LLMD_PREFLIGHT_CHECKS=pause), the script starts an HTTP server on the vLLM port that satisfies K8s health probes while blocking vLLM startup. Call /exit on the server to release the port and let vLLM proceed. See the SKILL.md for deployment instructions covering the llm-d quickstart, P/D disaggregation, and llm-d-benchmark.
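
The pause behavior can be pictured with the sketch below. It is an illustration only, not the repo's implementation: it assumes probes and /exit are plain HTTP GET requests, and the names PauseHandler and make_pause_server are hypothetical. Every GET is answered with 200 so that probes pass, and a GET on /exit stops the server so the port can be reused.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PauseHandler(BaseHTTPRequestHandler):
    """Answer every GET with 200; a GET on /exit also stops the server."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        if self.path == "/exit":
            self.wfile.write(b"releasing port\n")
            # shutdown() blocks until serve_forever() returns, so it must run
            # in a different thread from the one handling this request.
            threading.Thread(target=self.server.shutdown, daemon=True).start()
        else:
            self.wfile.write(b"ok\n")

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the pod logs


def make_pause_server(port):
    """Bind the placeholder server; the caller runs serve_forever()."""
    return HTTPServer(("0.0.0.0", port), PauseHandler)


# Usage (assuming the model server would listen on port 8000):
#   server = make_pause_server(8000)
#   server.serve_forever()   # returns after a GET on /exit
#   server.server_close()    # free the port before the model server starts
```

Once serve_forever() returns and server_close() has run, the port is free and the wrapper can start the model server in the usual way.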

The problem addressed and the design rationale are described in preflight-checks-design.md in the docs directory.

Networking Test Skill

The /llm-d-networking-tests skill validates GPU topology and inter-pod network performance for llm-d deployments. It drives the run-tests.sh automation scripts in this repo to:

  1. Discover GPU topology — verify that GPUs within each pod are optimally connected (NVLink/NVSwitch for NVIDIA, Infinity Fabric for AMD) rather than separated by PCIe hops across NUMA nodes.
  2. Run network performance tests — measure RDMA bandwidth and latency between pods using perftest (ib_write_bw, ib_read_bw), iperf3, NCCL/RCCL collectives, and nixlbench (GPU VRAM-to-VRAM via UCX).
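
As an illustration of the topology check in step 1, the sketch below scans an nvidia-smi topo -m style matrix and lists GPU pairs that are not connected by NVLink. It is not taken from the repo's scripts: the function name is hypothetical, the sample matrix is made up for the example, and the parser assumes GPU columns come first and rows follow the header order. AMD topologies are not covered.

```python
# Made-up sample in the shape of `nvidia-smi topo -m` output (not real data).
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV12    SYS     SYS     0-31
GPU1    NV12     X      SYS     SYS     0-31
GPU2    SYS     SYS      X      NV12    32-63
GPU3    SYS     SYS     NV12     X      32-63
"""


def slow_gpu_pairs(topo_text):
    """Return (gpu_a, gpu_b, link) for each GPU pair without an NVLink path."""
    rows = [
        line.split()
        for line in topo_text.splitlines()
        if line.strip().startswith("GPU")
    ]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    gpus = [name for name in header if name.startswith("GPU")]
    pairs = []
    for i, tokens in enumerate(body):
        links = tokens[1:1 + len(gpus)]
        for j, link in enumerate(links):
            # Look only above the diagonal so each pair is reported once.
            if j > i and not link.startswith("NV"):
                pairs.append((tokens[0], gpus[j], link))
    return pairs


for gpu_a, gpu_b, link in slow_gpu_pairs(SAMPLE):
    print(f"{gpu_a} <-> {gpu_b}: {link}")
```

In the nvidia-smi legend, NV# denotes a bonded set of NVLinks, while codes such as PIX, PXB, PHB, NODE, and SYS describe PCIe paths of increasing distance, with SYS crossing the interconnect between NUMA nodes.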

The skill asks for the target namespace and pod label selector, then runs the tests and helps interpret the results. See the SKILL.md for the full testing workflow and troubleshooting guide.

The network testing scripts are the run-*.py files, and their design is described in networking-tests-design.md in the docs directory.

Build NIXL Image

docker build -t nixl:latest .
docker tag nixl:latest ghcr.io/<>/nixl:latest
docker push ghcr.io/<>/nixl:latest

Deploy & Test

cd benchmarks
./benchmark_deployment.sh nixl -rdma

Use -rdma when deploying to a cluster with RoCE enabled, for better performance. This deploys nixl-client-roce and nixl-server-roce, then runs a benchmarking script that measures transfer throughput (GB/s). Refer to benchmarking for more details.
