Summary
Add a local "dry run" mode that reproduces everything the e2e benchmark pipeline does to turn a dispatch into runnable work — expand the matrix, fill in the per-job inputs, and generate the config files and command scripts — but stops short of executing anything: no workflow_dispatch, no runner allocation, no server launch, no benchmark/aiperf run, nothing touching the clusters.
The point is to preview exactly what a given generate-cli-command would produce (the bucketed matrices, the per-job config.yaml, the vllm_command.txt / sglang_command.txt / benchmark_command.txt / lmcache_command.txt / mooncake_config.json) before spending a real run on the GPU fleet.
Motivation
Right now the only way to see what a dispatch expands into is to actually fire e2e-tests.yml, which allocates self-hosted Slurm runners and starts jobs on the GPU clusters. That's expensive and slow for what is often just "did I get the matrix / concurrency sweep / command flags right?"
Almost all of the "fill in inputs + create scripts" logic is our own Python and bash, not native GHA steps — which is good news: a dry run can call the same code directly and stay faithful to CI without emulating GitHub at all.
What a dry run must reproduce (grounded inventory)
Mapped from the current pipeline:
| Stage | Where | What happens |
| --- | --- | --- |
| Matrix generation | utils/matrix_logic/generate_sweep_configs.py (generate_full_sweep ~L223-575, mark_eval_entries ~L124-220) | Reads .github/configs/{nvidia,amd}-master.yaml + runners.yaml, filters by model-prefix/precision/framework/runner/seq-lens/conc/tp/ep, expands conc-start..conc-end by step-size, chunks multi-node conc lists, marks eval entries. Emits config JSON to stdout. |
| Matrix bucketing | .github/workflows/e2e-tests.yml get-jobs (~L70-83) | Six inline python3 -c filters split the JSON into single-node / multi-node / eval / multi-node-eval / agentic / multi-node-agentic matrices. This logic lives only in the YAML today. |
| Per-job input fill | benchmark-tmpl.yml / benchmark-multinode-tmpl.yml | Each matrix entry is unpacked into ~93 job inputs and exported as env vars. |
| Runner + script launch | runners/launch_*.sh | Parses runner name, allocates cluster (salloc/srun), mounts image + workspaces, invokes the scenario script under benchmarks/single_node/.... |
| Command/config generation | benchmarks/benchmark_lib.sh (build_replay_cmd ~L1210-1348, run_agentic_replay_and_write_outputs ~L1365-1400) + scenario scripts (e.g. benchmarks/single_node/agentic/*.sh) | Builds the server launch command array and writes sglang_command.txt / vllm_command.txt; builds the full aiperf CLI ($REPLAY_CMD, 40+ flags from env) and writes benchmark_command.txt; then immediately executes the server + benchmark. |
The critical wrinkle: matrix generation is already a clean Python function (trivial to reuse), but script generation and execution are interleaved in bash — the same functions that build $SGLANG_CMD / $REPLAY_CMD also launch the server and run aiperf. So a real "generate but don't run" needs the generation split from the execution.
There is no existing dry-run/preview path in the pipeline (only unrelated --dry-run flags in utils/agentic/sample_proxy_traces.py).
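To make the "per-job input fill" stage concrete, here is a minimal Python sketch of how a dry run might turn one matrix entry into the env vars a job would see. The key-to-name rule (dashes to underscores, upper-cased), the `run-eval` key, and the helper names `entry_to_env` / `render_exports` are assumptions for illustration; the real mapping lives in benchmark-tmpl.yml / benchmark-multinode-tmpl.yml and may differ.

```python
import shlex


def entry_to_env(entry: dict) -> dict:
    """Map one matrix entry to env-var style names.

    Assumed rule (not taken from the workflow templates):
    'model-prefix' -> 'MODEL_PREFIX', values stringified, booleans lower-cased.
    """
    env = {}
    for key, value in entry.items():
        name = key.replace("-", "_").upper()
        if isinstance(value, bool):
            value = "true" if value else "false"
        env[name] = str(value)
    return env


def render_exports(env: dict) -> str:
    """Render the mapping as shell export lines, safely quoted and sorted."""
    return "\n".join(
        f"export {name}={shlex.quote(value)}" for name, value in sorted(env.items())
    )


# Hypothetical entry; 'run-eval' is an illustrative key, not a confirmed field.
example_entry = {"model-prefix": "llama", "precision": "fp8", "tp": 8, "conc": 64, "run-eval": False}
print(render_exports(entry_to_env(example_entry)))
```

Writing these exports to a file next to the generated commands would let a reviewer diff the inputs of two dispatches without running either.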
Proposed shape (two layers)
Layer 1 — matrix/config preview (cheap, do first). A CLI (utils/matrix_logic/dry_run.py or make dry-run GEN_CMD="...") that takes the same generate-cli-command, calls generate_sweep_configs.py, and applies the same bucketing to print all six matrices + a summary (models, SKUs, concurrency points, job count, which entries are eval-marked). Requires factoring the six inline python3 -c filters out of e2e-tests.yml into a shared module (e.g. utils/matrix_logic/bucket_configs.py) that both the workflow and the dry run import — so they can't drift.
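A rough sketch of what the shared bucketing module for Layer 1 could look like, in Python. Everything specific here is an assumption: the entry keys `multinode`, `agentic`, and `run-eval` are placeholders, and the sketch treats the six buckets as mutually exclusive, which the real inline filters in e2e-tests.yml may not do. The actual predicates should be copied from the workflow when the module is extracted.

```python
import json

BUCKETS = (
    "single-node",
    "multi-node",
    "eval",
    "multi-node-eval",
    "agentic",
    "multi-node-agentic",
)


def bucket_of(entry: dict) -> str:
    """Pick a bucket name for one matrix entry (hypothetical predicates)."""
    if entry.get("agentic"):
        kind = "agentic"
    elif entry.get("run-eval"):
        kind = "eval"
    else:
        kind = ""
    if entry.get("multinode"):
        return f"multi-node-{kind}" if kind else "multi-node"
    return kind or "single-node"


def bucket_configs(entries: list) -> dict:
    """Split the generated config list into the six named matrices."""
    buckets = {name: [] for name in BUCKETS}
    for entry in entries:
        buckets[bucket_of(entry)].append(entry)
    return buckets


def summarize(buckets: dict) -> dict:
    """Job count per bucket, for the dry-run summary."""
    return {name: len(items) for name, items in buckets.items()}


# Illustrative entries only; real ones come from generate_sweep_configs.py.
entries = [
    {"model": "a"},
    {"model": "b", "multinode": True},
    {"model": "c", "run-eval": True},
    {"model": "d", "multinode": True, "agentic": True},
]
print(json.dumps(summarize(bucket_configs(entries)), indent=2))
```

If both the workflow step and the dry-run CLI call `bucket_configs`, the inline python3 -c filters can shrink to a single import, which is what keeps the two paths from drifting.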
Layer 2 — script/command preview (the valuable part). Thread a DRY_RUN=1 through benchmark_lib.sh, the scenario scripts, and the launchers so they still build the command arrays and write config.yaml / *_command.txt / mooncake_config.json into an output dir, but skip salloc/srun, the backgrounded server launch, and the aiperf invocation. This makes the generated artifacts inspectable locally without a cluster. Guard the execution lines (server launch, build_replay_cmd → run, run_agentic_replay_and_write_outputs run/tee) behind the flag.
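The split that Layer 2 asks for (always generate, only conditionally execute) can be illustrated with a small Python sketch. The real code is bash in benchmark_lib.sh and the scenario scripts, so this only shows the pattern, not the implementation. The helper names and the placeholder server command are assumptions; the pipeline's actual flags are not reproduced here.

```python
import os
import shlex
import subprocess
import tempfile
from pathlib import Path


def dry_run_enabled(env=None) -> bool:
    """True when DRY_RUN=1 is set (the flag name comes from the proposal)."""
    env = os.environ if env is None else env
    return env.get("DRY_RUN") == "1"


def write_command(cmd: list, out_dir: Path, name: str) -> Path:
    """Generation step: persist the exact command line that would run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_text(shlex.join(cmd) + "\n")
    return path


def launch(cmd: list, out_dir: Path, name: str, dry_run: bool):
    """Write the artifact unconditionally; start the process only on a real run."""
    write_command(cmd, out_dir, name)
    if dry_run:
        return None  # execution step skipped, artifact still on disk
    return subprocess.Popen(cmd)


# Placeholder command for illustration; not the pipeline's real server flags.
server_cmd = ["vllm", "serve", "some/model", "--tensor-parallel-size", "8"]
with tempfile.TemporaryDirectory() as tmp:
    launch(server_cmd, Path(tmp), "vllm_command.txt", dry_run=True)
    print((Path(tmp) / "vllm_command.txt").read_text(), end="")
```

The same shape in bash would be a guard around the lines that background the server and invoke aiperf, leaving the lines that build the arrays and write the *_command.txt files untouched.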
Net: dry-run GEN_CMD="full-sweep --config-files ... --runner-type h200 --precision fp8" writes a tree of exactly-what-would-run scripts and configs, dispatching nothing.
Library research
I surveyed the landscape for tools that could power this (full report + sources below). Conclusion up front: no off-the-shelf tool does what we want end-to-end. The two hard pieces, matrix expansion and previewing step-generated files, are exactly the parts GitHub keeps server-side, and only partial public reimplementations exist. Since our generation logic is already our own Python/bash, building a thin custom harness on top of it is both the most faithful and the least work.
Full local GHA runners
- nektos/act (~71k stars, active): the incumbent. Important gotcha for us: act --dryrun/-n writes each step's shell script to disk but never executes it (issue #1347), so our Python/bash generation never runs and none of the config/command files materialize; it does not give us the preview we want. act in normal mode would run the generation, but needs Docker + Ubuntu containers and has no Slurm/self-hosted-runner fidelity. Not a fit.
- wrkflw (Rust, ~3.3k stars, v0.8.0 2026-04): the real newcomer and closest off-the-shelf fit. Its emulation mode runs steps directly on the host with no container and no dispatch, so our generation actually runs and files appear. Worth keeping as a cross-check of our custom harness, not as the primary tool (younger, host-emulation quirks; it still runs on your host, not the cluster, which is fine for preview).
- gflows (abandoned), Cirrus CLI (different YAML schema), Dagger (pipeline-as-code rewrite, not a preview), Forgejo/Gitea act_runner (server-coupled act): none fit.
Static analysis / linters (all lint-only — none expand matrices or evaluate expressions)
- actionlint (~4k stars, active): gold standard for YAML correctness. It type-checks ${{ }}, runs shellcheck on run: shell and pyflakes on run: python, and validates needs:/matrix refs/runner labels/cron. It's a type-checker, not an evaluator: it won't enumerate matrix combos or compute values. Worth adding as an orthogonal CI gate; doesn't power the dry run.
- zizmor (security), action-validator (JSON-schema only), check-jsonschema, poutine, ghalint, octoscan — orthogonal/weaker; none help.
Workflow parsers
- @actions/workflow-parser (GitHub official, JS/TS, MIT): parses + validates and models workflow_dispatch inputs, but does not evaluate expressions and has no matrix/strategy converter at all. TS-only (needs a Node sidecar from Python).
- Python side: nothing usable. actionlint-py just wraps the Go binary; gha/gha-workflow/actions-workflow-parser don't exist on PyPI. Realistic Python path is ruamel.yaml/PyYAML + our own model (which we already effectively have).
${{ }} expression evaluators (the piece people ask for)
- @actions/expressions (GitHub official, JS/TS, MIT, v0.3.58 2026-06): a genuinely standalone, maintained evaluator implementing the full grammar (fromJSON/format/contains/... ; hashFiles/success()/etc. are runner-supplied, you register them). This is the only standalone evaluator worth using, but it's JS/TS.
- Python: no GHA-dialect evaluator exists (CEL libs are a different grammar; simpleeval is Python syntax). We'd have to port or shell out, but we likely don't need it: our workflow expressions are simple and the real logic already lives in Python.
Matrix expansion (include/exclude semantics)
- No published spec-perfect standalone library; the correct implementations are embedded in act (Go, ~80 lines), austenstone/actio (TS, most faithful, unmaintained), and the wrkflw-matrix crate (Rust, minor deviations). Moot for us: our matrix isn't a native GHA strategy.matrix; it's generated by generate_sweep_configs.py, so we already own expansion.
GitHub's own tooling
- No first-party dry-run/validate/matrix-preview anywhere: gh workflow run/view have no such flag; the REST/GraphQL dispatches endpoint fires a real run and only returns HTTP 422 on bad input (validation with the side effect we're avoiding).
Recommendation
- Build Layer 1 + Layer 2 on our own generation code — highest value, no Docker/Node/dispatch, most faithful. Extract the bucketing out of the YAML into a shared module so dry-run and CI share one implementation.
- Optionally run wrkflw (emulation mode) as a sanity cross-check that the workflow steps behave locally.
- Add actionlint in CI as an orthogonal YAML-correctness gate.
- Only reach for @actions/expressions (via a small Node sidecar) if we ever need faithful ${{ }} evaluation; currently unnecessary.
- Do not rely on act --dryrun (writes scripts, never runs them) or wait for a GitHub/gh dry-run (doesn't exist).
Acceptance criteria
- dry-run GEN_CMD="..." prints all six bucketed matrices + a summary (job count, SKUs, conc points, eval-marked entries) from the same generate-cli-command CI uses.
- Bucketing logic is factored out of e2e-tests.yml into a shared module imported by both the workflow and the dry run (no duplication / drift).
- A DRY_RUN path through benchmark_lib.sh + scenario scripts + launchers writes config.yaml, *_command.txt, and mooncake_config.json to an output dir while skipping salloc/srun, server launch, and aiperf execution.
Research: two-pass investigation (repo pipeline map + tooling landscape survey, cross-verified July 2026). Key sources: act / act #1347, wrkflw, actionlint, @actions/expressions, @actions/workflow-parser, GHA matrix docs.