Merged
20 commits
51 changes: 4 additions & 47 deletions .github/workflows/README.md
@@ -3,23 +3,16 @@
In order to test configurations described in `configs`, the primary workflow file used is `.github/workflows/e2e-tests.yml`. As input, this workflow takes in the CLI arguments for the `utils/matrix_logic/generate_sweep_configs.py` script. The usage for this script is shown below:

```
usage: generate_sweep_configs.py [-h] {full-sweep,runner-model-sweep,test-config} ...
usage: generate_sweep_configs.py [-h] {full-sweep,test-config} ...

Generate benchmark configurations from YAML config files

positional arguments:
{full-sweep,runner-model-sweep,test-config}
{full-sweep,test-config}
Available commands
full-sweep Generate full sweep configurations with optional
filtering by model, precision, framework, runner type,
and sequence lengths
runner-model-sweep Given a runner type, find all configurations matching
the type, and run that configuration on all individual
runner nodes for the specified runner type. This is
meant to validate that all runner nodes work on all
configurations for a runner type. For instance, to
validate that all configs that specify an h200 runner
successfully run across all h200 runner nodes.
test-config Generate full sweep for specific config keys.
Supports wildcard patterns (* and ?) for matching
multiple keys at once.
@@ -92,46 +85,10 @@ full-sweep --single-node --max-conc 64 --max-tp 4 --config-files configs/nvidia-
full-sweep --multi-node --config-files configs/nvidia-master.yaml
```

## `runner-model-sweep` Command

The `runner-model-sweep` command validates that all runner nodes of a specific type work with all model configurations. You can specify `--single-node`, `--multi-node`, or both. If neither is specified, both types are generated.

```
usage: generate_sweep_configs.py runner-model-sweep
--config-files CONFIG_FILES [CONFIG_FILES ...]
[--runner-config RUNNER_CONFIG]
[--no-evals | --evals-only] [--all-evals]
--runner-type RUNNER_TYPE
[--runner-node-filter RUNNER_NODE_FILTER]
[--single-node] [--multi-node]
**Test agentic configurations:**
```

### Scenario: Validating Runner Infrastructure

I just upgraded the CUDA drivers on all H200 runners and need to verify that all models that use H200 still work correctly across all H200 nodes.

Go to the GitHub Actions UI, click on the `End-to-End Tests` workflow, and enter the following command as the text input:
full-sweep --scenario-type agentic-coding --config-files configs/nvidia-master.yaml configs/amd-master.yaml
```
runner-model-sweep --single-node --runner-type h200 --config-files configs/amd-master.yaml configs/nvidia-master.yaml
```

This will run a test (just the highest available parallelism and lowest available concurrency) for each configuration that specifies the `h200` runner type, across all H200 runner nodes defined in `configs/runners.yaml`.

For example, if you have configs `dsr1-fp8-h200-sglang`, `dsr1-fp8-h200-trt`, and `gptoss-fp4-h200-vllm` that all use `runner: h200`, and you have 8 H200 nodes (`h200-cw_0`, `h200-cw_1`, etc.), this will run all 3 configs on all 8 nodes (24 total test runs).

This is particularly useful when:
- You've made infrastructure changes to a specific runner type (driver updates, system configuration, Docker setup)
- You've added new runner nodes and want to validate they work with all existing model configurations
- You want to verify that all models remain compatible with a specific GPU type after system updates

### Filtering Runner Nodes

Use `--runner-node-filter` to only test a subset of runner nodes:
```
runner-model-sweep --single-node --runner-type mi300x --runner-node-filter mi300x-amd --config-files configs/amd-master.yaml
```

This will only include runner nodes whose names contain "mi300x-amd"

## `test-config` Command

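With `runner-model-sweep` removed, the documented CLI surface reduces to two subcommands. As a rough illustration only (this is not the code in `utils/matrix_logic/generate_sweep_configs.py`, which this diff does not show), a two-subcommand parser could look like the sketch below. The `full-sweep` flags are the ones that appear in the README examples; the `--scenario-type` choices are taken from the workflow input description; `--config-keys` on `test-config` and the `match_keys` helper are hypothetical names chosen for the example.

```python
import argparse
import fnmatch


def build_parser():
    """Build a two-subcommand parser (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="generate_sweep_configs.py",
        description="Generate benchmark configurations from YAML config files",
    )
    sub = parser.add_subparsers(dest="command", required=True, help="Available commands")

    full = sub.add_parser("full-sweep")
    full.add_argument("--config-files", nargs="+", required=True)
    full.add_argument("--single-node", action="store_true")
    full.add_argument("--multi-node", action="store_true")
    full.add_argument("--max-conc", type=int)
    full.add_argument("--max-tp", type=int)
    # Assumption: the allowed values mirror the workflow's scenario-type input.
    full.add_argument(
        "--scenario-type",
        choices=["fixed-seq-len", "agentic-coding"],
        default="fixed-seq-len",
    )

    test = sub.add_parser("test-config")
    test.add_argument("--config-files", nargs="+", required=True)
    # Hypothetical flag name; the real option is not shown in this diff.
    test.add_argument("--config-keys", nargs="+", required=True)
    return parser


def match_keys(patterns, available):
    """Expand * and ? wildcards against known config keys, keeping input order."""
    return [
        key
        for key in available
        if any(fnmatch.fnmatchcase(key, pattern) for pattern in patterns)
    ]
```

Under this sketch, passing `runner-model-sweep` as the subcommand is rejected by `argparse` as an invalid choice, which is the behaviour a caller of the old command would now see.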
66 changes: 38 additions & 28 deletions .github/workflows/benchmark-multinode-tmpl.yml
@@ -91,26 +91,30 @@ on:
type: string
required: false
default: ""
# Agentic-coding inputs. Fixed-seq-len jobs leave these empty.
scenario-type:
description: "Scenario type (fixed-seq-len or agentic-coding)"
type: string
required: false
default: fixed-seq-len
conc:
description: "Concurrency for agentic-coding scenarios (single value per matrix entry)"
description: "First concurrency for agentic-coding scenarios; CONC_LIST carries the full batch"
type: string
required: false
default: ""
duration:
description: "Agentic trace replay duration in seconds"
type: string
required: false
default: "1800"
offloading:
description: "KV offload backend for agentic scenarios (none/cpu/ssd)"
default: "3600"
kv-offloading:
description: "KV offload mode for agentic scenarios (none/dram)"
required: false
type: string
kv-offload-backend:
description: "KV offload backend for agentic scenarios when kv-offloading is not none"
required: false
type: string
default: 'none'
total-cpu-dram-gb:
description: "Total CPU DRAM in GB for KV offloading"
required: false
@@ -143,12 +147,14 @@ env:
RUN_EVAL: ${{ inputs.run-eval }}
EVAL_ONLY: ${{ inputs.eval-only }}
EVAL_CONC: ${{ inputs.eval-conc }}
# Agentic-coding env. Fixed-seq-len jobs leave these empty.
SCENARIO_TYPE: ${{ inputs.scenario-type }}
SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || 'fixed_seq_len/' }}
IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }}
CONC: ${{ inputs.conc }}
DURATION: ${{ inputs.duration }}
OFFLOADING: ${{ inputs.offloading }}
KV_OFFLOADING: ${{ inputs.kv-offloading }}
KV_OFFLOAD_BACKEND: ${{ inputs.kv-offload-backend }}
TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
PYTHONDONTWRITEBYTECODE: '1'
PYTHONPYCACHEPREFIX: /tmp/inferencex-pycache
@@ -181,10 +187,10 @@ jobs:
- name: Slurm cleanup (pre-run)
run: &slurm-cleanup |
if command -v squeue >/dev/null 2>&1; then
echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
echo "[Slurm] Cleaning up jobs named: ${{ runner.name }} ..."
scancel --name="${{ runner.name }}" || true
while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
squeue --name="${{ runner.name }}"
while [ -n "$(squeue --user="$USER" --name='${{ runner.name }}' --noheader --format='%i')" ]; do
squeue --user="$USER" --name="${{ runner.name }}"
sleep 5
done
fi
@@ -197,13 +203,6 @@ jobs:
clean: true
submodules: true

- name: Cleanup stale eval outputs (pre-run)
if: ${{ inputs.run-eval || inputs.eval-only }}
run: |
rm -f meta_env.json || true
rm -f results*.json || true
rm -f sample*.jsonl || true

- name: Launch multi-node job script
env:
RUNNER_NAME: ${{ runner.name }}
@@ -213,7 +212,7 @@ jobs:
run: |
set -x
# Export RESULT_FILENAME early so it's available for artifact uploads even if cancelled
echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV
echo "RESULT_FILENAME=${RESULT_FILENAME}" >> "$GITHUB_ENV"

export ${{ join(fromJson(inputs.prefill-additional-settings), ' ') }} ${{ join(fromJson(inputs.decode-additional-settings), ' ') }}
export IS_MULTINODE=true
@@ -226,12 +225,25 @@ jobs:
exit 1
fi
elif [ "${{ inputs.scenario-type }}" = "agentic-coding" ]; then
if [ -f "${RESULT_FILENAME}.json" ]; then
echo "Found agentic result file: ${RESULT_FILENAME}.json"
else
echo "Run failed: Agentic benchmark result ${RESULT_FILENAME}.json not found." >&2
expected_count=$(wc -w <<< "$CONC_LIST" | tr -d ' ')
shopt -s nullglob
agentic_results=("${RESULT_FILENAME}"_conc*.json)
shopt -u nullglob
if [ "${#agentic_results[@]}" -ne "$expected_count" ]; then
echo "Run failed: expected $expected_count agentic results, found ${#agentic_results[@]}." >&2
exit 1
fi
# Existence is not enough: the agentic aggregation step writes the
# aggregate even when aiperf recorded zero valid requests. Require
# successful requests in every concurrency result.
for result_file in "${agentic_results[@]}"; do
echo "Found agentic result file: $result_file"
ok=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(int(bool(d.get('num_requests_successful'))))" "$result_file" 2>/dev/null || echo 0)
if [ "$ok" != "1" ]; then
echo "Run failed: $result_file has zero successful requests." >&2
exit 1
fi
done
else
# Check if at least one result file was created
if ls ${RESULT_FILENAME}_*.json 1> /dev/null 2>&1; then
@@ -281,6 +293,7 @@ jobs:
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: multinode_server_logs_${{ env.RESULT_FILENAME }}
# multinode launchers package server logs into this tarball.
path: multinode_server_logs.tar.gz
if-no-files-found: ignore

@@ -289,20 +302,17 @@ jobs:
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: bmk_agentic_${{ env.RESULT_FILENAME }}
path: ${{ env.RESULT_FILENAME }}.json
path: ${{ env.RESULT_FILENAME }}_conc*.json

- name: Upload agentic raw results
if: ${{ always() && inputs.scenario-type == 'agentic-coding' }}
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: agentic_${{ env.RESULT_FILENAME }}
path: |
LOGS/agentic/benchmark.log
LOGS/agentic/benchmark_command.txt
LOGS/agentic/workload_distribution_summary.txt
LOGS/agentic/workload_distribution_plots.png
LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl
LOGS/agentic/**
!LOGS/agentic/**/aiperf_artifacts/inputs.json
!LOGS/agentic/**/aiperf_artifacts/profile_export_raw.jsonl
if-no-files-found: ignore

- name: Upload eval results (if any)
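The agentic branch of the `Launch multi-node job script` step now requires one `${RESULT_FILENAME}_conc*.json` file per entry in `CONC_LIST`, and a truthy `num_requests_successful` in each. The same rule can be written in Python as a minimal sketch. The function name `check_agentic_results` and its `directory` parameter are made up for this illustration, and unlike the shell version, which exits at the first bad file, it collects every failure.

```python
import glob
import json
import os


def check_agentic_results(result_prefix, conc_list, directory="."):
    """Return a list of error strings; an empty list means the run passes."""
    # CONC_LIST is whitespace-separated, so split() mirrors `wc -w`.
    expected = len(conc_list.split())
    pattern = os.path.join(
        glob.escape(directory), glob.escape(result_prefix) + "_conc*.json"
    )
    found = sorted(glob.glob(pattern))
    if len(found) != expected:
        return [f"expected {expected} agentic results, found {len(found)}"]

    errors = []
    for path in found:
        try:
            with open(path) as handle:
                data = json.load(handle)
            ok = bool(data.get("num_requests_successful"))
        except (OSError, ValueError, AttributeError):
            # Unreadable or malformed files count as failures, like `|| echo 0`.
            ok = False
        if not ok:
            errors.append(f"{path} has zero successful requests")
    return errors
```

A caller would treat a non-empty return value as a failed job, for example by printing the messages to stderr and exiting with status 1.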
82 changes: 27 additions & 55 deletions .github/workflows/benchmark-tmpl.yml
@@ -67,26 +67,30 @@ on:
description: "Git ref (branch/sha) to checkout"
required: false
type: string
# Agentic-coding inputs. Fixed-seq-len jobs leave these empty.
scenario-type:
description: "Scenario type (fixed-seq-len or agentic-coding)"
required: false
type: string
default: 'fixed-seq-len'
offloading:
description: "KV offload backend for agentic scenarios (none/cpu/ssd/lmcache/lmcache-mp/hicache)"
kv-offloading:
description: "KV offload mode for agentic scenarios (none/dram)"
required: false
type: string
kv-offload-backend:
description: "KV offload backend for agentic scenarios when kv-offloading is not none"
required: false
type: string
default: 'none'
total-cpu-dram-gb:
description: "Total CPU DRAM in GB for KV offloading"
description: "Configured CPU DRAM capacity in GB for KV offloading"
required: false
type: string
default: '600'
default: '0'
duration:
description: "Benchmark duration in seconds"
required: false
type: string
default: '1800'
default: '3600'
env:
RANDOM_RANGE_RATIO: 0.8
HF_TOKEN: ${{ secrets.INFERENCEX_OFFICIAL_RO_HF_TOKEN }}
@@ -108,12 +112,15 @@ env:
DISAGG: ${{ inputs.disagg }}
RUN_EVAL: ${{ inputs.run-eval }}
EVAL_ONLY: ${{ inputs.eval-only }}
# Agentic-coding env. Fixed-seq-len jobs leave these empty.
SCENARIO_TYPE: ${{ inputs.scenario-type }}
SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || 'fixed_seq_len/' }}
IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }}
OFFLOADING: ${{ inputs.offloading }}
KV_OFFLOADING: ${{ inputs.kv-offloading }}
KV_OFFLOAD_BACKEND: ${{ inputs.kv-offload-backend }}
TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
DURATION: ${{ inputs.duration }}
AIPERF_FAILED_REQUEST_THRESHOLD: '0.10'
RESULT_DIR: /workspace/results
PYTHONDONTWRITEBYTECODE: '1'
PYTHONPYCACHEPREFIX: /tmp/inferencex-pycache
@@ -154,12 +161,6 @@ jobs:
done
fi

# Cleanup results/ from a prior job on this runner. Agentic jobs
# write to fixed subpaths (aiperf_artifacts/, metrics_*, etc.), so stale
# data from a previous job would otherwise be picked up as this
# job's output when replay fails early.
rm -rf "${{ github.workspace }}/results" 2>/dev/null || true

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
token: ${{ secrets.REPO_PAT }}
@@ -168,14 +169,6 @@ jobs:
clean: true
submodules: true

- name: Cleanup stale outputs (pre-run)
run: |
rm -f meta_env.json || true
rm -f results*.json || true
rm -f sample*.jsonl || true
rm -f server.log || true
rm -f gpu_metrics.csv || true

- name: Launch job script
env:
RUNNER_NAME: ${{ runner.name }}
@@ -213,6 +206,12 @@ jobs:
echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
exit 1
fi

if [ "${{ inputs.scenario-type }}" = "agentic-coding" ]; then
python3 -m utils.agentic.validation.validate_agentic_result \
results/aiperf_artifacts \
--failed-request-threshold "$AIPERF_FAILED_REQUEST_THRESHOLD"
fi
fi

- name: Process result
@@ -242,48 +241,21 @@ jobs:
with:
name: agentic_${{ env.RESULT_FILENAME }}
path: |
results/server.log
results/lmcache_server.log
results/benchmark.log
results/config.yaml
results/lmcache_command.txt
results/sglang_command.txt
results/vllm_command.txt
results/benchmark_command.txt
results/workload_distribution_summary.txt
results/workload_distribution_plots.png
results/metrics_plots.png
results/aiperf_artifacts/profile_export.jsonl
results/aiperf_artifacts/profile_export_aiperf.json
results/aiperf_artifacts/profile_export_aiperf.csv
results/aiperf_artifacts/profile_export_aiperf_timeslices.json
results/aiperf_artifacts/profile_export_aiperf_timeslices.csv
results/aiperf_artifacts/profile_export_aiperf_aggregate.json
results/aiperf_artifacts/profile_export_aiperf_aggregate.csv
results/aiperf_artifacts/profile_export_aiperf_collated.json
results/aiperf_artifacts/server_metrics_export.json
results/aiperf_artifacts/server_metrics_export.jsonl
results/aiperf_artifacts/server_metrics_export.csv
results/aiperf_artifacts/server_metrics_export.parquet
results/aiperf_artifacts/gpu_telemetry_export.jsonl
results/aiperf_artifacts/logs/aiperf.log
results/aiperf_artifacts/logs/*.log
# Excluded by design (multi-GB debug artifacts, not consumed by
# post-processing): results/aiperf_artifacts/inputs.json (pre-formatted
# request bodies — the mmap'd binary equivalent is rebuilt from
# --public-dataset + --random-seed) and
# results/aiperf_artifacts/profile_export_raw.jsonl (full HTTP bodies
# per request — recoverable by re-running the same trace).
results/**
!results/aiperf_artifacts/inputs.json
!results/aiperf_artifacts/profile_export_raw.jsonl
if-no-files-found: ignore

- name: Upload server logs
if: always()
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
# fixed-seq writes server.log at root; agentic writes logs under results/.
path: |
${{ inputs.scenario-type == 'agentic-coding' && 'results/server.log' || 'server.log' }}
${{ inputs.scenario-type == 'agentic-coding' && 'results/lmcache_server.log' || '' }}
server.log
results/*.log
results/*_config.json
if-no-files-found: ignore

- name: Upload GPU metrics
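Both workflow templates derive `SCENARIO_SUBDIR` and `IS_AGENTIC` with the `cond && a || b` expression idiom. In GitHub Actions expressions `&&` and `||` return one of their operands rather than a boolean, so the idiom behaves like a ternary only while `a` is truthy. The sketch below models that with Python truthiness, which matches for the strings and booleans used here but is an approximation for other types. The helper names are invented for the example.

```python
def gha_and(left, right):
    """Model of `left && right`: yields right only when left is truthy."""
    return right if left else left


def gha_or(left, right):
    """Model of `left || right`: yields left when it is truthy, else right."""
    return left if left else right


def ternary(cond, if_true, if_false):
    """Model of `cond && if_true || if_false`."""
    return gha_or(gha_and(cond, if_true), if_false)


def scenario_subdir(scenario_type):
    # Mirrors: inputs.scenario-type == 'agentic-coding' && 'agentic/' || 'fixed_seq_len/'
    return ternary(scenario_type == "agentic-coding", "agentic/", "fixed_seq_len/")


def is_agentic(scenario_type):
    # Mirrors: inputs.scenario-type == 'agentic-coding' && '1' || '0'
    return ternary(scenario_type == "agentic-coding", "1", "0")
```

Because `'agentic/'`, `'1'`, and `'0'` are all non-empty strings, the expressions in these templates are safe. The idiom would fall through to the `||` branch if the middle operand were an empty string, which is worth keeping in mind when adding similar env entries.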
2 changes: 1 addition & 1 deletion .github/workflows/claude.yml
@@ -95,7 +95,7 @@ jobs:
)
```
The `generate-cli-command` input accepts arguments for `generate_sweep_configs.py`. Usage: `generate_sweep_configs.py` `[-h]` `{full-sweep,runner-model-sweep,test-config}`
The `generate-cli-command` input accepts arguments for `generate_sweep_configs.py`. Usage: `generate_sweep_configs.py` `[-h]` `{full-sweep,test-config}`
**Subcommand reference:**
- `full-sweep`: Use this subcommand with filter flags like `--model-prefix`, `--framework`, `--precision`, `--runner-type`, `--min-conc`, `--max-conc`, `--seq-lens`. This is the primary subcommand for running benchmarks.