SemiAnalysisAI · cquil11 · Jul 3, 2026 · Jul 2, 2026 · Jul 2, 2026 · Jul 2, 2026
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -3,23 +3,16 @@
 In order to test configurations described in `configs`, the primary workflow file used is `.github/workflows/e2e-tests.yml`. As input, this workflow takes in the CLI arguments for the `utils/matrix_logic/generate_sweep_configs.py` script. The usage for this script is shown below:
 
 ```
-usage: generate_sweep_configs.py [-h] {full-sweep,runner-model-sweep,test-config} ...
+usage: generate_sweep_configs.py [-h] {full-sweep,test-config} ...
 
 Generate benchmark configurations from YAML config files
 
 positional arguments:
-  {full-sweep,runner-model-sweep,test-config}
+  {full-sweep,test-config}
                         Available commands
     full-sweep          Generate full sweep configurations with optional
                         filtering by model, precision, framework, runner type,
                         and sequence lengths
-    runner-model-sweep  Given a runner type, find all configurations matching
-                        the type, and run that configuration on all individual
-                        runner nodes for the specified runner type. This is
-                        meant to validate that all runner nodes work on all
-                        configurations for a runner type. For instance, to
-                        validate that all configs that specify an h200 runner
-                        successfully run across all h200 runner nodes.
     test-config         Generate full sweep for specific config keys.
                         Supports wildcard patterns (* and ?) for matching
                         multiple keys at once.
@@ -92,46 +85,10 @@ full-sweep --single-node --max-conc 64 --max-tp 4 --config-files configs/nvidia-
 full-sweep --multi-node --config-files configs/nvidia-master.yaml
 ```
 
-## `runner-model-sweep` Command
-
-The `runner-model-sweep` command validates that all runner nodes of a specific type work with all model configurations. You can specify `--single-node`, `--multi-node`, or both. If neither is specified, both types are generated.
-
-```
-usage: generate_sweep_configs.py runner-model-sweep
-    --config-files CONFIG_FILES [CONFIG_FILES ...]
-    [--runner-config RUNNER_CONFIG]
-    [--no-evals | --evals-only] [--all-evals]
-    --runner-type RUNNER_TYPE
-    [--runner-node-filter RUNNER_NODE_FILTER]
-    [--single-node] [--multi-node]
+**Test agentic configurations:**
 ```
-
-### Scenario: Validating Runner Infrastructure
-
-I just upgraded the CUDA drivers on all H200 runners and need to verify that all models that use H200 still work correctly across all H200 nodes.
-
-Go to the GitHub Actions UI, click on the `End-to-End Tests` workflow, and enter the following command as the text input:
+full-sweep --scenario-type agentic-coding --config-files configs/nvidia-master.yaml configs/amd-master.yaml
 ```
-runner-model-sweep --single-node --runner-type h200 --config-files configs/amd-master.yaml configs/nvidia-master.yaml
-```
-
-This will run a test (just the highest available parallelism and lowest available concurrency) for each configuration that specifies the `h200` runner type, across all H200 runner nodes defined in `configs/runners.yaml`.
-
-For example, if you have configs `dsr1-fp8-h200-sglang`, `dsr1-fp8-h200-trt`, and `gptoss-fp4-h200-vllm` that all use `runner: h200`, and you have 8 H200 nodes (`h200-cw_0`, `h200-cw_1`, etc.), this will run all 3 configs on all 8 nodes (24 total test runs).
-
-This is particularly useful when:
-- You've made infrastructure changes to a specific runner type (driver updates, system configuration, Docker setup)
-- You've added new runner nodes and want to validate they work with all existing model configurations
-- You want to verify that all models remain compatible with a specific GPU type after system updates
-
-### Filtering Runner Nodes
-
-Use `--runner-node-filter` to only test a subset of runner nodes:
-```
-runner-model-sweep --single-node --runner-type mi300x --runner-node-filter mi300x-amd --config-files configs/amd-master.yaml
-```
-
-This will only include runner nodes whose names contain "mi300x-amd"
 
 ## `test-config` Command
 

diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml
@@ -91,26 +91,30 @@ on:
         type: string
         required: false
         default: ""
+      # Agentic-coding inputs. Fixed-seq-len jobs leave these empty.
       scenario-type:
         description: "Scenario type (fixed-seq-len or agentic-coding)"
         type: string
         required: false
         default: fixed-seq-len
       conc:
-        description: "Concurrency for agentic-coding scenarios (single value per matrix entry)"
+        description: "First concurrency for agentic-coding scenarios; CONC_LIST carries the full batch"
         type: string
         required: false
         default: ""
       duration:
         description: "Agentic trace replay duration in seconds"
         type: string
         required: false
-        default: "1800"
-      offloading:
-        description: "KV offload backend for agentic scenarios (none/cpu/ssd)"
+        default: "3600"
+      kv-offloading:
+        description: "KV offload mode for agentic scenarios (none/dram)"
+        required: false
+        type: string
+      kv-offload-backend:
+        description: "KV offload backend for agentic scenarios when kv-offloading is not none"
         required: false
         type: string
-        default: 'none'
       total-cpu-dram-gb:
         description: "Total CPU DRAM in GB for KV offloading"
         required: false
@@ -143,12 +147,14 @@ env:
   RUN_EVAL: ${{ inputs.run-eval }}
   EVAL_ONLY: ${{ inputs.eval-only }}
   EVAL_CONC: ${{ inputs.eval-conc }}
+  # Agentic-coding env. Fixed-seq-len jobs leave these empty.
   SCENARIO_TYPE: ${{ inputs.scenario-type }}
   SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || 'fixed_seq_len/' }}
   IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }}
   CONC: ${{ inputs.conc }}
   DURATION: ${{ inputs.duration }}
-  OFFLOADING: ${{ inputs.offloading }}
+  KV_OFFLOADING: ${{ inputs.kv-offloading }}
+  KV_OFFLOAD_BACKEND: ${{ inputs.kv-offload-backend }}
   TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
   PYTHONDONTWRITEBYTECODE: '1'
   PYTHONPYCACHEPREFIX: /tmp/inferencex-pycache
@@ -181,10 +187,10 @@ jobs:
       - name: Slurm cleanup (pre-run)
         run: &slurm-cleanup |
           if command -v squeue >/dev/null 2>&1; then
-            echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
+            echo "[Slurm] Cleaning up jobs named: ${{ runner.name }} ..."
             scancel --name="${{ runner.name }}" || true
-            while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
-              squeue --name="${{ runner.name }}"
+            while [ -n "$(squeue --user="$USER" --name='${{ runner.name }}' --noheader --format='%i')" ]; do
+              squeue --user="$USER" --name="${{ runner.name }}"
               sleep 5
             done
           fi
@@ -197,13 +203,6 @@ jobs:
           clean: true
           submodules: true
 
-      - name: Cleanup stale eval outputs (pre-run)
-        if: ${{ inputs.run-eval || inputs.eval-only }}
-        run: |
-          rm -f meta_env.json || true
-          rm -f results*.json || true
-          rm -f sample*.jsonl || true
-
       - name: Launch multi-node job script
         env:
           RUNNER_NAME: ${{ runner.name }}
@@ -213,7 +212,7 @@ jobs:
         run: |
           set -x
           # Export RESULT_FILENAME early so it's available for artifact uploads even if cancelled
-          echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV
+          echo "RESULT_FILENAME=${RESULT_FILENAME}" >> "$GITHUB_ENV"
 
           export ${{ join(fromJson(inputs.prefill-additional-settings), ' ') }} ${{ join(fromJson(inputs.decode-additional-settings), ' ') }}
           export IS_MULTINODE=true
@@ -226,12 +225,25 @@ jobs:
               exit 1
             fi
           elif [ "${{ inputs.scenario-type }}" = "agentic-coding" ]; then
-            if [ -f "${RESULT_FILENAME}.json" ]; then
-              echo "Found agentic result file: ${RESULT_FILENAME}.json"
-            else
-              echo "Run failed: Agentic benchmark result ${RESULT_FILENAME}.json not found." >&2
+            expected_count=$(wc -w <<< "$CONC_LIST" | tr -d ' ')
+            shopt -s nullglob
+            agentic_results=("${RESULT_FILENAME}"_conc*.json)
+            shopt -u nullglob
+            if [ "${#agentic_results[@]}" -ne "$expected_count" ]; then
+              echo "Run failed: expected $expected_count agentic results, found ${#agentic_results[@]}." >&2
               exit 1
             fi
+            # Existence is not enough: the agentic aggregation step writes the
+            # aggregate even when aiperf recorded zero valid requests. Require
+            # successful requests in every concurrency result.
+            for result_file in "${agentic_results[@]}"; do
+              echo "Found agentic result file: $result_file"
+              ok=$(python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(int(bool(d.get('num_requests_successful'))))" "$result_file" 2>/dev/null || echo 0)
+              if [ "$ok" != "1" ]; then
+                echo "Run failed: $result_file has zero successful requests." >&2
+                exit 1
+              fi
+            done
           else
             # Check if at least one result file was created
             if ls ${RESULT_FILENAME}_*.json 1> /dev/null 2>&1; then
@@ -281,6 +293,7 @@ jobs:
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: multinode_server_logs_${{ env.RESULT_FILENAME }}
+          # multinode launchers package server logs into this tarball.
           path: multinode_server_logs.tar.gz
           if-no-files-found: ignore
 
@@ -289,20 +302,17 @@ jobs:
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: bmk_agentic_${{ env.RESULT_FILENAME }}
-          path: ${{ env.RESULT_FILENAME }}.json
+          path: ${{ env.RESULT_FILENAME }}_conc*.json
 
       - name: Upload agentic raw results
         if: ${{ always() && inputs.scenario-type == 'agentic-coding' }}
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: agentic_${{ env.RESULT_FILENAME }}
           path: |
-            LOGS/agentic/benchmark.log
-            LOGS/agentic/benchmark_command.txt
-            LOGS/agentic/workload_distribution_summary.txt
-            LOGS/agentic/workload_distribution_plots.png
-            LOGS/agentic/aiperf_artifacts/detailed_results.csv
-            LOGS/agentic/aiperf_artifacts/debug_trace.jsonl
+            LOGS/agentic/**
+            !LOGS/agentic/**/aiperf_artifacts/inputs.json
+            !LOGS/agentic/**/aiperf_artifacts/profile_export_raw.jsonl
           if-no-files-found: ignore
 
       - name: Upload eval results (if any)

diff --git a/.github/workflows/benchmark-tmpl.yml b/.github/workflows/benchmark-tmpl.yml
@@ -67,26 +67,30 @@ on:
         description: "Git ref (branch/sha) to checkout"
         required: false
         type: string
+      # Agentic-coding inputs. Fixed-seq-len jobs leave these empty.
       scenario-type:
         description: "Scenario type (fixed-seq-len or agentic-coding)"
         required: false
         type: string
         default: 'fixed-seq-len'
-      offloading:
-        description: "KV offload backend for agentic scenarios (none/cpu/ssd/lmcache/lmcache-mp/hicache)"
+      kv-offloading:
+        description: "KV offload mode for agentic scenarios (none/dram)"
+        required: false
+        type: string
+      kv-offload-backend:
+        description: "KV offload backend for agentic scenarios when kv-offloading is not none"
         required: false
         type: string
-        default: 'none'
       total-cpu-dram-gb:
-        description: "Total CPU DRAM in GB for KV offloading"
+        description: "Configured CPU DRAM capacity in GB for KV offloading"
         required: false
         type: string
-        default: '600'
+        default: '0'
       duration:
         description: "Benchmark duration in seconds"
         required: false
         type: string
-        default: '1800'
+        default: '3600'
 env:
   RANDOM_RANGE_RATIO: 0.8
   HF_TOKEN: ${{ secrets.INFERENCEX_OFFICIAL_RO_HF_TOKEN }}
@@ -108,12 +112,15 @@ env:
   DISAGG: ${{ inputs.disagg }}
   RUN_EVAL: ${{ inputs.run-eval }}
   EVAL_ONLY: ${{ inputs.eval-only }}
+  # Agentic-coding env. Fixed-seq-len jobs leave these empty.
   SCENARIO_TYPE: ${{ inputs.scenario-type }}
   SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || 'fixed_seq_len/' }}
   IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }}
-  OFFLOADING: ${{ inputs.offloading }}
+  KV_OFFLOADING: ${{ inputs.kv-offloading }}
+  KV_OFFLOAD_BACKEND: ${{ inputs.kv-offload-backend }}
   TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
   DURATION: ${{ inputs.duration }}
+  AIPERF_FAILED_REQUEST_THRESHOLD: '0.10'
   RESULT_DIR: /workspace/results
   PYTHONDONTWRITEBYTECODE: '1'
   PYTHONPYCACHEPREFIX: /tmp/inferencex-pycache
@@ -154,12 +161,6 @@ jobs:
             done
           fi
 
-          # Cleanup results/ from a prior job on this runner. Agentic jobs
-          # write to fixed subpaths (aiperf_artifacts/, metrics_*, etc.), so stale
-          # data from a previous job would otherwise be picked up as this
-          # job's output when replay fails early.
-          rm -rf "${{ github.workspace }}/results" 2>/dev/null || true
-
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
           token: ${{ secrets.REPO_PAT }}
@@ -168,14 +169,6 @@ jobs:
           clean: true
           submodules: true
 
-      - name: Cleanup stale outputs (pre-run)
-        run: |
-          rm -f meta_env.json || true
-          rm -f results*.json || true
-          rm -f sample*.jsonl || true
-          rm -f server.log || true
-          rm -f gpu_metrics.csv || true
-
       - name: Launch job script
         env:
           RUNNER_NAME: ${{ runner.name }}
@@ -213,6 +206,12 @@ jobs:
               echo "Run failed: Benchmark result $RESULT_FILENAME.json not found." >&2
               exit 1
             fi
+
+            if [ "${{ inputs.scenario-type }}" = "agentic-coding" ]; then
+              python3 -m utils.agentic.validation.validate_agentic_result \
+                results/aiperf_artifacts \
+                --failed-request-threshold "$AIPERF_FAILED_REQUEST_THRESHOLD"
+            fi
           fi
 
       - name: Process result
@@ -242,48 +241,21 @@ jobs:
         with:
           name: agentic_${{ env.RESULT_FILENAME }}
           path: |
-            results/server.log
-            results/lmcache_server.log
-            results/benchmark.log
-            results/config.yaml
-            results/lmcache_command.txt
-            results/sglang_command.txt
-            results/vllm_command.txt
-            results/benchmark_command.txt
-            results/workload_distribution_summary.txt
-            results/workload_distribution_plots.png
-            results/metrics_plots.png
-            results/aiperf_artifacts/profile_export.jsonl
-            results/aiperf_artifacts/profile_export_aiperf.json
-            results/aiperf_artifacts/profile_export_aiperf.csv
-            results/aiperf_artifacts/profile_export_aiperf_timeslices.json
-            results/aiperf_artifacts/profile_export_aiperf_timeslices.csv
-            results/aiperf_artifacts/profile_export_aiperf_aggregate.json
-            results/aiperf_artifacts/profile_export_aiperf_aggregate.csv
-            results/aiperf_artifacts/profile_export_aiperf_collated.json
-            results/aiperf_artifacts/server_metrics_export.json
-            results/aiperf_artifacts/server_metrics_export.jsonl
-            results/aiperf_artifacts/server_metrics_export.csv
-            results/aiperf_artifacts/server_metrics_export.parquet
-            results/aiperf_artifacts/gpu_telemetry_export.jsonl
-            results/aiperf_artifacts/logs/aiperf.log
-            results/aiperf_artifacts/logs/*.log
-          # Excluded by design (multi-GB debug artifacts, not consumed by
-          # post-processing): results/aiperf_artifacts/inputs.json (pre-formatted
-          # request bodies — the mmap'd binary equivalent is rebuilt from
-          # --public-dataset + --random-seed) and
-          # results/aiperf_artifacts/profile_export_raw.jsonl (full HTTP bodies
-          # per request — recoverable by re-running the same trace).
+            results/**
+            !results/aiperf_artifacts/inputs.json
+            !results/aiperf_artifacts/profile_export_raw.jsonl
           if-no-files-found: ignore
 
       - name: Upload server logs
         if: always()
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: ${{ inputs.eval-only && 'eval_server_logs_' || 'server_logs_' }}${{ env.RESULT_FILENAME }}
+          # fixed-seq writes server.log at root; agentic writes logs under results/.
           path: |
-            ${{ inputs.scenario-type == 'agentic-coding' && 'results/server.log' || 'server.log' }}
-            ${{ inputs.scenario-type == 'agentic-coding' && 'results/lmcache_server.log' || '' }}
+            server.log
+            results/*.log
+            results/*_config.json
           if-no-files-found: ignore
 
       - name: Upload GPU metrics

diff --git a/.github/workflows/claude.yml b/.github/workflows/claude.yml
@@ -95,7 +95,7 @@ jobs:
             )
             ```
 
-            The `generate-cli-command` input accepts arguments for `generate_sweep_configs.py`. Usage: `generate_sweep_configs.py` `[-h]` `{full-sweep,runner-model-sweep,test-config}`
+            The `generate-cli-command` input accepts arguments for `generate_sweep_configs.py`. Usage: `generate_sweep_configs.py` `[-h]` `{full-sweep,test-config}`
 
             **Subcommand reference:**
             - `full-sweep`: Use this subcommand with filter flags like `--model-prefix`, `--framework`, `--precision`, `--runner-type`, `--min-conc`, `--max-conc`, `--seq-lens`. This is the primary subcommand for running benchmarks.