Benchmark mode enables framework-level performance benchmarking for LLM inference engines (vLLM, SGLang) with integrated trace analysis capabilities.
Execution: benchmarks use `run_mode: docker` (default), `local` (host or in-pod, set via YAML or `--run-mode local`), or `ray` (the driver submits a `RayJobExecutor`, and a GPU worker runs the same InferenceX → vLLM/SGLang flow; see Ray + Magpie). InferenceX is cloned automatically when `inferencex_path` is empty (see `benchmark.inferencex_path` in `Magpie/config.yaml`).
```
┌─────────────────────────────────────────────────────────────────────┐
│                           Benchmark Mode                            │
├─────────────────────────────────────────────────────────────────────┤
│  ┌───────────────┐    ┌───────────────┐    ┌────────────────────┐   │
│  │BenchmarkConfig│ →  │ BenchmarkMode │ →  │  BenchmarkResult   │   │
│  │    (YAML)     │    │               │    │   (JSON + CSV)     │   │
│  └───────────────┘    └───────┬───────┘    └────────────────────┘   │
│                               │                                     │
│                               ▼                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                Runtime: docker │ local │ ray                 │   │
│  │  ┌─────────────┐    ┌─────────────────────────────────┐      │   │
│  │  │ InferenceX  │ →  │  vLLM / SGLang Server + Client  │      │   │
│  │  │  scripts    │    │       + Torch Profiler          │      │   │
│  │  └─────────────┘    └─────────────────────────────────┘      │   │
│  │  Ray: Magpie driver → RayJobExecutor → GPU worker runs the   │   │
│  │  same stack (local/docker on worker; NFS for cache/          │   │
│  │  results). See docs/ray-magpie.md                            │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                 │                                   │
│                        ┌────────┴────────┐                          │
│                        ▼                 ▼                          │
│  ┌────────────────────────┐  ┌─────────────────────────────────┐   │
│  │     Gap Analysis       │  │       TraceLens Analysis        │   │
│  │ • Time window filter   │  │ • Perf report (per-rank)        │   │
│  │ • Category filter      │  │ • Multi-rank collective report  │   │
│  │ • Kernel stats CSV     │  │                                 │   │
│  └────────────────────────┘  └─────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
```
### Quick Start

```bash
# Basic vLLM benchmark (paths are under examples/benchmarks/)
python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_dsr1.yaml

# vLLM with TraceLens analysis
python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_tracelens.yaml

# vLLM with gap analysis (kernel bottleneck report)
python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_kimi_k2.yaml

# Standalone gap analysis on existing traces
python -m Magpie benchmark gap-analysis --trace-dir results/benchmark_vllm_<timestamp>/

# SGLang benchmark
python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_sglang_dsr1.yaml

# Ad-hoc CLI without a YAML file (framework + model; optional torch profiler)
python -m Magpie benchmark vllm --model deepseek-ai/DeepSeek-R1-0528 --torch-profiler
```

### Configuration

Minimal example:

```yaml
benchmark:
  framework: vllm                  # "vllm" or "sglang"
  model: deepseek-ai/DeepSeek-R1-0528
  precision: fp8                   # "fp8", "fp16", "bf16"
  envs:
    TP: 8                          # Tensor parallelism
    CONC: 32                       # Concurrency (num_prompts = CONC * 10)
    ISL: 1024                      # Input sequence length
    OSL: 1024                      # Output sequence length
  profiler:
    torch_profiler:
      enabled: true                # Generate torch profiling traces
  timeout_seconds: 3600
```

Full reference:

```yaml
benchmark:
  # Framework selection
  framework: vllm                  # Required: "vllm" or "sglang"
  model: <model_name>              # Required: HuggingFace model name/path
  precision: fp8                   # Optional: "fp8" (default), "fp16", "bf16"

  # Benchmark parameters
  envs:
    TP: 8                          # Tensor parallelism (GPU count)
    CONC: 32                       # Request concurrency
    ISL: 1024                      # Input sequence length
    OSL: 1024                      # Output sequence length
    RANDOM_RANGE_RATIO: 1          # Length randomization (0-1)
    MAX_MODEL_LEN: 131072          # Max model context length
    GPU_MEM_UTIL: 0.95             # GPU memory utilization (0-1)
    ENABLE_PROFILE: "true"         # Enable profiling in benchmark script

  # Profiler configuration
  profiler:
    # PyTorch profiler (generates JSON traces)
    torch_profiler:
      enabled: true                # Sets VLLM_TORCH_PROFILER_DIR
    # System profiler (rocprof-compute / ncu)
    system_profiler:
      enabled: false
      profile_args: []             # Additional profiler arguments

  # TraceLens trace analysis
  tracelens:
    enabled: true                  # Enable TraceLens analysis
    export_format: csv             # "csv" or "excel"
    perf_report_enabled: true      # Single-rank performance report
    multi_rank_report_enabled: true  # Multi-rank collective report
    gpu_arch_config: null          # Optional: GPU arch config for roofline

  # Gap analysis (kernel bottleneck report)
  gap_analysis:
    enabled: true                  # Enable gap analysis after benchmark
    trace_start_pct: 50            # Start of analysis window (0-100)
    trace_end_pct: 80              # End of analysis window (0-100)
    top_k: 20                      # Number of top kernels in report
    min_duration_us: 0.0           # Filter out events shorter than this (us)
    categories:                    # Event category whitelist (default: [kernel, gpu])
      - kernel
      - gpu
    ignore_categories:             # Event category blacklist (default: [gpu_user_annotation])
      - gpu_user_annotation

  # Auto-pick idle GPU(s) before launching (enabled by default).
  # See "Automatic GPU Selection" below for details.
  gpu_selection:
    auto: true                     # Default: true. Set false to disable.
    min_free_memory_gb: 8.0        # Reject GPUs with less free VRAM
    count: null                    # Number of GPUs; null -> use envs.TP
    candidates: null               # Optional whitelist of physical GPU ids

  # Execution settings
  run_mode: docker                 # "docker" (default), "local" (host / in-container), or "ray"
  docker_image: null               # Optional: override auto-selected image
  gpu_arch: null                   # Optional: force GPU architecture
  timeout_seconds: 3600            # Benchmark timeout

  # Paths
  inferencex_path: /path/to/InferenceX  # InferenceX installation
  hf_cache_path: null              # HuggingFace cache directory

  # InferenceX specific
  runner_type: mi300x              # Hardware runner type
  benchmark_script: null           # Override benchmark script
```

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `TP` | Tensor parallelism (number of GPUs) | `1` |
| `CONC` | Request concurrency | `32` |
| `ISL` | Input sequence length | `1024` |
| `OSL` | Output sequence length | `512` |
| `RANDOM_RANGE_RATIO` | Length randomization ratio | `0.5` |
| `MAX_MODEL_LEN` | Maximum model context length | - |
| `GPU_MEM_UTIL` | GPU memory utilization | `0.95` |
| `ENABLE_PROFILE` | Enable torch profiler | `"false"` |
| `EXTRA_VLLM_ARGS` | Additional arguments passed to `vllm serve` | `""` |
### Automatic GPU Selection

Before launching the benchmark, Magpie scans the host (`rocm-smi` / `nvidia-smi`), picks the least-busy GPU(s) with enough free VRAM, and pins the run via vendor-specific environment variables:

- AMD: `ROCR_VISIBLE_DEVICES=<ids>` (the launcher script remaps `HIP_VISIBLE_DEVICES` to the post-filter logical range `0..N-1`).
- NVIDIA: `CUDA_VISIBLE_DEVICES=<ids>` + `CUDA_DEVICE_ORDER=PCI_BUS_ID`.

GPU ids use the same index space as `rocm-smi` / `nvidia-smi`. By default the selector asks for `envs.TP` idle GPUs; override with `gpu_selection.count`.

Config knobs (all optional; the `gpu_selection` block is enabled by default):
| Field | Default | Description |
|---|---|---|
| `auto` | `true` | Set `false` to disable and let the framework see every GPU |
| `min_free_memory_gb` | `8.0` | Reject GPUs with less free VRAM than this |
| `count` | `null` | Number of GPUs to pin; `null` → `envs.TP` |
| `candidates` | `null` | Optional whitelist of physical GPU ids to consider |
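The selection idea is straightforward to sketch for the NVIDIA path, assuming `nvidia-smi` is on `PATH` (the query flags are standard `nvidia-smi` options, but the function and ranking heuristic here are illustrative, not the Magpie implementation):

```python
import subprocess

def pick_idle_gpus(count: int, min_free_gb: float = 8.0) -> list[int]:
    """Keep GPUs with enough free VRAM; prefer least-busy, then most-free."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.free,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    candidates = []
    for line in out.strip().splitlines():
        idx, free_mib, util = (v.strip() for v in line.split(","))
        if float(free_mib) / 1024 >= min_free_gb:
            candidates.append((int(util), -float(free_mib), int(idx)))
    candidates.sort()  # lowest utilization first, then most free memory
    if len(candidates) < count:
        raise RuntimeError(f"gpu_selection.auto failed: need {count} idle GPUs")
    return [idx for _, _, idx in candidates[:count]]

# e.g. CUDA_VISIBLE_DEVICES = ",".join(map(str, pick_idle_gpus(count=8)))
```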
Manual override: setting `HIP_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` in `envs:` pins the run to specific cards and skips auto-selection:

```yaml
envs:
  TP: 1
  ROCR_VISIBLE_DEVICES: "3"      # AMD: pin to rocm-smi GPU[3]
  # CUDA_VISIBLE_DEVICES: "3"    # NVIDIA alternative
  # CUDA_DEVICE_ORDER: PCI_BUS_ID
```

Ray mode (`run_mode: ray`): `gpu_selection` is ignored; Ray schedules devices itself via `num_gpus`. To restrict the cluster to specific cards, export `ROCR_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` in the shell before running `ray start`.
### Torch Profiler

When `torch_profiler.enabled: true`:

- Sets `VLLM_TORCH_PROFILER_DIR` automatically
- Generates JSON trace files for each GPU rank
- Saves traces to `results/benchmark_<framework>_<timestamp>/torch_trace/`
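Traces are gzipped Chrome-format JSON, so they can be inspected directly. A minimal sketch, assuming a finished run (substitute the real `<timestamp>` directory):

```python
import gzip
import json
from pathlib import Path

# Torch profiler traces use the Chrome trace format: events live under
# the "traceEvents" key, each with a "cat" (category) field that gap
# analysis later filters on.
trace_dir = Path("results/benchmark_vllm_<timestamp>/torch_trace")
trace_file = next(trace_dir.glob("*rank-0*.pt.trace.json.gz"))

with gzip.open(trace_file, "rt") as f:
    trace = json.load(f)

events = trace["traceEvents"]
cats = sorted({e["cat"] for e in events if "cat" in e})
print(f"{len(events)} events; categories: {cats}")
```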
### TraceLens Analysis

TraceLens provides automated analysis of torch profiler traces:

| Command | Description | Output |
|---|---|---|
| `TraceLens_generate_perf_report_pytorch` | Single-rank performance report | `tracelens_rank0_csvs/` |
| `TraceLens_generate_multi_rank_collective_report_pytorch` | Multi-rank collective analysis | `tracelens_collective_csvs/` |

Single-rank report (`tracelens_rank0_csvs/`):

- `gpu_timeline.csv` - GPU kernel timeline
- `ops_summary.csv` - Operation summary
- `ops_summary_by_category.csv` - Operations by category
- `coll_analysis.csv` - Collective communication analysis
- `kernel_summary.csv` - Kernel summary statistics

Multi-rank collective report (`tracelens_collective_csvs/`):
- Aggregated statistics across all GPU ranks
- Communication pattern analysis
- Load balancing metrics
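For ad-hoc exploration, the report CSVs load straight into pandas. A sketch that assumes only the file names listed above (the column layout is whatever TraceLens emits):

```python
import pandas as pd
from pathlib import Path

# Preview every single-rank TraceLens table from a finished run.
csv_dir = Path("results/benchmark_vllm_<timestamp>/tracelens_rank0_csvs")
for csv_file in sorted(csv_dir.glob("*.csv")):
    df = pd.read_csv(csv_file)
    print(f"{csv_file.name}: {len(df)} rows, columns: {list(df.columns)}")
```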
### Gap Analysis

Gap analysis identifies GPU kernel bottlenecks from torch profiler traces. It applies a configurable time window to focus on the steady-state portion of the trace, then aggregates kernel durations by category.

Pipeline:

- Apply the time window (`trace_start_pct`–`trace_end_pct`) to isolate steady-state events
- Filter by category (case-insensitive substring matching on the event `cat` field)
- Aggregate stats per kernel name, then rank kernels by total duration

CSV output columns: `Name`, `Calls`, `Self CUDA total (us)`, `Avg time (us)`, `% Total`

Defaults (no YAML needed):

- `categories`: `["kernel", "gpu"]`
- `ignore_categories`: `["gpu_user_annotation"]`
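The pipeline is small enough to sketch end to end. The function below illustrates the algorithm on Chrome-trace events (dicts with `name`, `cat`, `ts`, and `dur` in microseconds); it is not the `GapAnalyzer` API:

```python
from collections import defaultdict

def gap_analysis(events, start_pct=50, end_pct=80,
                 categories=("kernel", "gpu"),
                 ignore_categories=("gpu_user_annotation",),
                 min_duration_us=0.0):
    """Illustrative sketch: time window -> category filter -> kernel stats."""
    timed = [e for e in events if "ts" in e and "dur" in e]
    t0 = min(e["ts"] for e in timed)
    t1 = max(e["ts"] + e["dur"] for e in timed)
    lo = t0 + (t1 - t0) * start_pct / 100  # window start (steady state)
    hi = t0 + (t1 - t0) * end_pct / 100    # window end

    stats = defaultdict(lambda: [0, 0.0])  # name -> [calls, total_us]
    for e in timed:
        cat = str(e.get("cat", "")).lower()
        if not (lo <= e["ts"] <= hi):
            continue
        if not any(c in cat for c in categories):     # whitelist (substring)
            continue
        if any(c in cat for c in ignore_categories):  # blacklist
            continue
        if e["dur"] < min_duration_us:
            continue
        stats[e["name"]][0] += 1
        stats[e["name"]][1] += e["dur"]

    total = sum(t for _, t in stats.values()) or 1.0
    # Rows mirror the CSV columns: Name, Calls, Self CUDA total (us),
    # Avg time (us), % Total -- ranked by total duration.
    rows = [(name, calls, tot, tot / calls, 100 * tot / total)
            for name, (calls, tot) in stats.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```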
Minimal config:

```yaml
gap_analysis:
  enabled: true
  trace_start_pct: 50
  trace_end_pct: 80
```

### Standalone Gap Analysis

Run gap analysis on existing trace directories without re-running the benchmark:
```bash
# Basic usage (CLI defaults: --start-pct 0 --end-pct 100 unless you override)
python -m Magpie benchmark gap-analysis \
    --trace-dir results/benchmark_vllm_<timestamp>/

# With custom window and categories (align with the YAML gap_analysis window if desired)
python -m Magpie benchmark gap-analysis \
    --trace-dir results/benchmark_vllm_<timestamp>/torch_trace \
    --start-pct 50 --end-pct 80 \
    --top-k 15 \
    --categories kernel gpu \
    --ignore-categories gpu_user_annotation
```

The `--trace-dir` argument accepts either a benchmark workspace directory (a `torch_trace/` inside is auto-detected) or a direct path to the trace directory. Output is written to a `gap_analysis/` subfolder under the trace directory's parent.
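The resolution rule is easy to sketch (hypothetical helper name, not the actual CLI internals):

```python
from pathlib import Path

def resolve_trace_dir(path: str) -> Path:
    """A workspace dir containing torch_trace/ resolves to that subfolder;
    anything else is treated as the trace directory itself."""
    p = Path(path)
    return p / "torch_trace" if (p / "torch_trace").is_dir() else p

# Gap analysis output then lands under the trace directory's parent:
# resolve_trace_dir("results/benchmark_vllm_<timestamp>/").parent / "gap_analysis"
```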
### Output Structure

```
results/benchmark_vllm_<timestamp>/
├── benchmark_report.json          # Main benchmark results
├── summary.txt                    # Human-readable summary
├── config.yaml                    # Snapshot of benchmark configuration
├── container_stdout.log           # Container stdout
├── container_stderr.log           # Container stderr
├── inferencex_result.json         # Raw InferenceX output
├── torch_trace/                   # Raw torch profiler traces
│   ├── *-rank-0.*.pt.trace.json.gz
│   ├── *-rank-1.*.pt.trace.json.gz
│   └── ...
├── gap_analysis/                  # Gap analysis output (if enabled)
│   ├── gap_analysis.csv           # Merged kernel stats across all ranks
│   ├── gap_analysis_rank0.csv     # Per-rank kernel stats
│   ├── gap_analysis_rank1.csv
│   └── ...
├── tracelens_rank0_csvs/          # Single-rank TraceLens analysis
│   ├── gpu_timeline.csv
│   ├── ops_summary.csv
│   └── ...
└── tracelens_collective_csvs/     # Multi-rank TraceLens analysis
    └── ...
```
### Benchmark Report

The primary summary file is `benchmark_report.json` in the run workspace (see `WorkspaceManager.save_report`). It aggregates throughput, latency, and optional `gap_analysis` / `tracelens_analysis` sections. A typical shape:
```json
{
  "success": true,
  "framework": "vllm",
  "model": "amd/Kimi-K2-Thinking-MXFP4",
  "throughput": {
    "request_throughput": 0.16,
    "output_throughput": 1.13,
    "total_token_throughput": 1192.76,
    "completed_requests": 40
  },
  "latency": {
    "ttft": { "mean_ms": 1185.44, "p99_ms": 1969.59 },
    "tpot": { "mean_ms": 131.09, "p99_ms": 282.21 }
  },
  "gap_analysis": {
    "config": { "trace_start_pct": 50, "trace_end_pct": 80, "categories": ["kernel", "gpu"] },
    "csv_path": "results/.../gap_analysis/gap_analysis.csv",
    "top_kernels": [
      { "name": "rcclGenericKernel<...>", "calls": 19620, "self_cuda_total_us": 28999961.95, "pct_total": 44.0 },
      { "name": "kernel_moe_mxgemm_2lds<...>", "calls": 9360, "self_cuda_total_us": 12495324.68, "pct_total": 18.9 }
    ]
  },
  "tracelens_analysis": { "output_files": [...] }
}
```
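A quick way to pull headline numbers out of the report, assuming the shape above (the `gap_analysis` and `tracelens_analysis` keys appear only when those analyses ran):

```python
import json

# Substitute the real run directory for <timestamp>.
with open("results/benchmark_vllm_<timestamp>/benchmark_report.json") as f:
    report = json.load(f)

print("total token throughput:",
      report["throughput"]["total_token_throughput"], "tok/s")
print("mean TTFT:", report["latency"]["ttft"]["mean_ms"], "ms")
print("mean TPOT:", report["latency"]["tpot"]["mean_ms"], "ms")

# Top kernels, if gap analysis was enabled.
for k in report.get("gap_analysis", {}).get("top_kernels", [])[:3]:
    print(f"{k['pct_total']:5.1f}%  {k['calls']:>6} calls  {k['name']}")
```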
### Example Configurations

Minimal configuration for fast trace collection:

```yaml
benchmark:
  framework: vllm
  model: deepseek-ai/DeepSeek-R1-0528
  precision: fp8
  envs:
    TP: 8
    CONC: 4              # Small concurrency for quick run
    ISL: 128
    OSL: 64
    GPU_MEM_UTIL: 0.85
  profiler:
    torch_profiler:
      enabled: true
  tracelens:
    enabled: true
    export_format: csv
    multi_rank_report_enabled: false  # Skip multi-rank for speed
  timeout_seconds: 1200
```

Full profiling configuration:

```yaml
benchmark:
  framework: vllm
  model: deepseek-ai/DeepSeek-R1-0528
  precision: fp8
  envs:
    TP: 8
    CONC: 64
    ISL: 2048
    OSL: 2048
    MAX_MODEL_LEN: 131072
  profiler:
    torch_profiler:
      enabled: true
  tracelens:
    enabled: true
    export_format: csv
    perf_report_enabled: true
    multi_rank_report_enabled: true
  timeout_seconds: 7200
```

SGLang benchmark configuration:

```yaml
benchmark:
  framework: sglang
  model: meta-llama/Llama-3.1-70B-Instruct
  precision: fp16
  envs:
    TP: 4
    CONC: 32
    ISL: 1024
    OSL: 512
  profiler:
    torch_profiler:
      enabled: true
  timeout_seconds: 3600
```

### Troubleshooting

**1. GPU Memory Error**
```
ValueError: Free memory on device (...) is less than desired GPU memory utilization
```

Solution: Reduce `GPU_MEM_UTIL` in the config (e.g., `0.85`).
**2. Docker Permission Error**

```
docker: permission denied
```

Solution: Add your user to the `docker` group or run with `sudo`.

**3. TraceLens Not Found**

```
TraceLens CLI command 'TraceLens_generate_perf_report_pytorch' not found
```

Solution: TraceLens is auto-installed. If issues persist, install it manually:

```bash
pip install git+https://github.com/AMD-AIG-AIMA/TraceLens.git
```

**4. Timeout During Model Loading**
Large models (e.g., DeepSeek-R1) may need longer timeouts:

```yaml
timeout_seconds: 7200  # 2 hours
```

**5. `gpu_selection.auto failed: ...`**
Not enough idle GPUs are available on the host. Either free a GPU, lower `gpu_selection.min_free_memory_gb`, narrow `gpu_selection.candidates`, or pin manually via `envs.ROCR_VISIBLE_DEVICES` (AMD) / `envs.CUDA_VISIBLE_DEVICES` (NVIDIA). See "Automatic GPU Selection" above.
Enable verbose logging:

```bash
python -m Magpie benchmark --benchmark-config config.yaml --log-level DEBUG
```

### Components

| Component | File | Description |
|---|---|---|
| `BenchmarkMode` | `benchmarker.py` | Main orchestrator |
| `BenchmarkConfig` | `config.py` | Configuration dataclasses |
| `TraceLensAnalyzer` | `tracelens.py` | TraceLens CLI integration |
| `GapAnalyzer` | `gap_analysis.py` | Kernel bottleneck analysis |
| `BenchmarkResult` | `result.py` | Result data structures |
### Execution Flow

- Configuration Loading: Parse the YAML config into `BenchmarkConfig`
- Runtime Setup: For `run_mode: docker`, prepare a container with InferenceX; for `local`, use the host environment
- Server Launch: Start the vLLM/SGLang server (in a container or on the host, per `run_mode`)
- Client Execution: Run the benchmark client with profiling enabled
- Trace Collection: Save torch profiler traces to the workspace
- TraceLens Analysis: Run TraceLens CLI commands on the host (if enabled)
- Gap Analysis: Analyze kernel bottlenecks within the time window (if enabled)
- Result Generation: Aggregate metrics and generate reports
### See Also

- Analyze vs Compare — kernel evaluation modes (orthogonal to Benchmark)
- TraceLens — Trace analysis library
- InferenceX — Benchmark scripts (auto-clone target in default config)
- vLLM — LLM inference engine
- SGLang — LLM serving framework
- Ray + Magpie — optional remote benchmark scheduling