Add local "dry run" mode: run all GHA input/script-generation logic without triggering a run #1972

## Summary

Add a **local "dry run" mode** that reproduces everything the e2e benchmark pipeline does to turn a dispatch into runnable work — expand the matrix, fill in the per-job inputs, and generate the config files and command scripts — **but stops short of executing anything**: no `workflow_dispatch`, no runner allocation, no server launch, no benchmark/aiperf run, nothing touching the clusters.

The point is to preview *exactly* what a given `generate-cli-command` would produce (the bucketed matrices, the per-job `config.yaml`, the `vllm_command.txt` / `sglang_command.txt` / `benchmark_command.txt` / `lmcache_command.txt` / `mooncake_config.json`) before spending a real run on the GPU fleet.

## Motivation

Right now the only way to see what a dispatch expands into is to actually fire `e2e-tests.yml`, which allocates self-hosted Slurm runners and starts jobs on the GPU clusters. That's expensive and slow for what is often just "did I get the matrix / concurrency sweep / command flags right?"

Almost all of the "fill in inputs + create scripts" logic is **our own Python and bash**, not native GHA steps — which is good news: a dry run can call the same code directly and stay faithful to CI without emulating GitHub at all.

## What a dry run must reproduce (grounded inventory)

Mapped from the current pipeline:

| Stage | Where | What happens |
|---|---|---|
| Matrix generation | `utils/matrix_logic/generate_sweep_configs.py` (`generate_full_sweep` ~L223-575, `mark_eval_entries` ~L124-220) | Reads `.github/configs/{nvidia,amd}-master.yaml` + `runners.yaml`, filters by model-prefix/precision/framework/runner/seq-lens/conc/tp/ep, expands `conc-start..conc-end` by `step-size`, chunks multi-node conc lists, marks eval entries. Emits config JSON to stdout. |
| Matrix bucketing | `.github/workflows/e2e-tests.yml` `get-jobs` (~L70-83) | Six inline `python3 -c` filters split the JSON into single-node / multi-node / eval / multi-node-eval / agentic / multi-node-agentic matrices. **This logic lives only in the YAML today.** |
| Per-job input fill | `benchmark-tmpl.yml` / `benchmark-multinode-tmpl.yml` | Each matrix entry is unpacked into ~93 job inputs and exported as env vars. |
| Runner + script launch | `runners/launch_*.sh` | Parses runner name, allocates cluster (salloc/srun), mounts image + workspaces, invokes the scenario script under `benchmarks/single_node/...`. |
| Command/config generation | `benchmarks/benchmark_lib.sh` (`build_replay_cmd` ~L1210-1348, `run_agentic_replay_and_write_outputs` ~L1365-1400) + scenario scripts (e.g. `benchmarks/single_node/agentic/*.sh`) | Builds the server launch command array and writes `sglang_command.txt` / `vllm_command.txt`; builds the full aiperf CLI (`$REPLAY_CMD`, 40+ flags from env) and writes `benchmark_command.txt`; then **immediately executes** the server + benchmark. |
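
To run the scenario scripts locally, a dry run has to reproduce the "per-job input fill" row without GitHub doing the unpacking. A minimal sketch of that step, assuming a naming convention that is **not** confirmed from the templates (kebab-case keys become `UPPER_SNAKE` env names, lists are space-joined); `entry_to_env` and the `run-eval` key are hypothetical:

```python
import re


def entry_to_env(entry: dict, prefix: str = "") -> dict:
    """Flatten one matrix entry into env-var style name/value strings.

    ASSUMPTION: the real mapping lives in benchmark-tmpl.yml and may differ.
    Here kebab-case keys become UPPER_SNAKE, booleans become "true"/"false",
    and lists are space-joined.
    """
    env = {}
    for key, value in entry.items():
        name = prefix + re.sub(r"[^A-Za-z0-9]+", "_", key).upper()
        if isinstance(value, bool):  # check bool before the generic str() path
            env[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            env[name] = " ".join(str(v) for v in value)
        else:
            env[name] = str(value)
    return env


if __name__ == "__main__":
    # Hypothetical entry; the key names are illustrative only.
    sample = {"model-prefix": "llama", "tp": 8, "conc": [4, 8], "run-eval": False}
    for name, value in entry_to_env(sample).items():
        print(f"{name}={value}")
```

Whatever the real mapping turns out to be, keeping it in one function means the dry run and any future tests read the same table of inputs.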

The critical wrinkle: matrix generation is already a clean Python function (trivial to reuse), but **script generation and execution are interleaved in bash** — the same functions that build `$SGLANG_CMD` / `$REPLAY_CMD` also launch the server and run aiperf. So a real "generate but don't run" needs the generation split from the execution.

There is **no existing dry-run/preview path** in the pipeline (only unrelated `--dry-run` flags in `utils/agentic/sample_proxy_traces.py`).

## Proposed shape (two layers)

**Layer 1 — matrix/config preview (cheap, do first).** A CLI (`utils/matrix_logic/dry_run.py` or `make dry-run GEN_CMD="..."`) that takes the same `generate-cli-command`, calls `generate_sweep_configs.py`, and applies the *same* bucketing to print all six matrices + a summary (models, SKUs, concurrency points, job count, which entries are eval-marked). Requires factoring the six inline `python3 -c` filters out of `e2e-tests.yml` into a shared module (e.g. `utils/matrix_logic/bucket_configs.py`) that both the workflow and the dry run import — so they can't drift.
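
As a rough illustration of the shared bucketing module, the sketch below assigns each generated entry to one of the six matrices and counts them. The predicates are placeholders: the keys `multinode`, `run-eval`, and `agentic` are hypothetical, and the sketch assumes each entry lands in exactly one bucket, which the real inline filters in `e2e-tests.yml` may not do.

```python
BUCKETS = (
    "single-node",
    "multi-node",
    "eval",
    "multi-node-eval",
    "agentic",
    "multi-node-agentic",
)


def bucket_name(entry: dict) -> str:
    """Pick the matrix an entry belongs to.

    ASSUMPTION: the keys below are hypothetical stand-ins for whatever the
    six inline `python3 -c` filters actually test.
    """
    if entry.get("agentic"):
        kind = "agentic"
    elif entry.get("run-eval"):
        kind = "eval"
    else:
        kind = None
    if entry.get("multinode"):
        return f"multi-node-{kind}" if kind else "multi-node"
    return kind or "single-node"


def bucket_configs(entries: list) -> dict:
    """Split the generator's JSON list into the six named matrices."""
    buckets = {name: [] for name in BUCKETS}
    for entry in entries:
        buckets[bucket_name(entry)].append(entry)
    return buckets


def summarize(buckets: dict) -> dict:
    """Job count per matrix, for the dry-run summary."""
    return {name: len(items) for name, items in buckets.items()}
```

With the predicates in one importable module, the workflow step could call it instead of repeating six inline filters, and the dry run would print `summarize(bucket_configs(entries))` for the same input.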

**Layer 2 — script/command preview (the valuable part).** Thread a `DRY_RUN=1` through `benchmark_lib.sh`, the scenario scripts, and the launchers so they still build the command arrays and write `config.yaml` / `*_command.txt` / `mooncake_config.json` into an output dir, but **skip** `salloc`/`srun`, the backgrounded server launch, and the aiperf invocation. This makes the generated artifacts inspectable locally without a cluster. Guard the execution lines (server launch, `build_replay_cmd` → run, `run_agentic_replay_and_write_outputs` run/tee) behind the flag.
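
The actual change would be made in bash, but the "always write the artifact, only sometimes execute" split is easy to show in a few lines of Python. This is a sketch of the pattern only; `record_then_run` and `is_dry_run` are hypothetical helpers, not functions in `benchmark_lib.sh`:

```python
import os
import shlex
import subprocess
from pathlib import Path
from typing import Optional


def is_dry_run(env=os.environ) -> bool:
    """True when the DRY_RUN=1 flag proposed above is set."""
    return env.get("DRY_RUN", "0") == "1"


def record_then_run(cmd: list, out_file: Path, dry_run: bool) -> Optional[int]:
    """Write the command to out_file; execute it only when dry_run is False.

    Returns the process exit code, or None when execution was skipped.
    """
    out_file.parent.mkdir(parents=True, exist_ok=True)
    # shlex.join quotes each argument, so the file can be pasted into a shell.
    out_file.write_text(shlex.join(cmd) + "\n")
    if dry_run:
        return None  # artifact written, nothing launched
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder command; a real caller would pass the built command array.
    code = record_then_run(
        ["echo", "hello world"],
        Path("dry_run_out/benchmark_command.txt"),
        dry_run=is_dry_run(),
    )
    print("skipped" if code is None else f"exit code {code}")
```

The design point is that the write happens before the branch, so the file contents are identical in both modes and only the launch is conditional.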

Net: `make dry-run GEN_CMD="full-sweep --config-files ... --runner-type h200 --precision fp8"` writes a tree of exactly-what-would-run scripts and configs, dispatching nothing.

## Library research

I surveyed the landscape for tools that could power this (full report + sources below). Conclusion up front: **no off-the-shelf tool does what we want end-to-end. The two hard pieces, matrix expansion and previewing step-generated files, are exactly the parts GitHub keeps server-side, and only partial public reimplementations of them exist. Since our generation logic is already our own Python/bash, building a thin custom harness on top of it is both the most faithful option and the least work.**

**Full local GHA runners**
- **[nektos/act](https://github.com/nektos/act)** (~71k stars, active): the incumbent. Important gotcha for us: **`act --dryrun`/`-n` writes each step's shell script to disk but never executes it** ([issue #1347](https://github.com/nektos/act/issues/1347)), so our Python/bash generation never runs and none of the config/command files materialize. It does *not* give us the preview we want. `act` in *normal* mode would run the generation, but it needs Docker + Ubuntu containers and has no Slurm/self-hosted-runner fidelity. Not a fit.
- **[wrkflw](https://github.com/bahdotsh/wrkflw)** (Rust, ~3.3k stars, v0.8.0 2026-04) — the real newcomer and closest off-the-shelf fit: its **emulation mode runs steps directly on the host with no container and no dispatch**, so our generation actually runs and files appear. Worth keeping as a **cross-check** of our custom harness, not as the primary tool (younger, host-emulation quirks; still runs on your host, not the cluster — which is fine for preview).
- gflows (abandoned), Cirrus CLI (different YAML schema), Dagger (pipeline-as-code rewrite, not a preview), Forgejo/Gitea `act_runner` (server-coupled act) — none fit.

**Static analysis / linters (all lint-only — none expand matrices or evaluate expressions)**
- **[actionlint](https://github.com/rhysd/actionlint)** (~4k stars, active) — gold standard for YAML *correctness*: type-checks `${{ }}`, runs shellcheck on `run:` shell and pyflakes on `run:` python, validates `needs:`/matrix refs/runner labels/cron. It's a type-checker, **not** an evaluator — won't enumerate matrix combos or compute values. Worth adding as an **orthogonal CI gate**; doesn't power the dry run.
- [zizmor](https://github.com/zizmorcore/zizmor) (security), [action-validator](https://github.com/mpalmer/action-validator) (JSON-schema only), check-jsonschema, poutine, ghalint, octoscan — orthogonal/weaker; none help.

**Workflow parsers**
- **[@actions/workflow-parser](https://www.npmjs.com/package/@actions/workflow-parser)** (GitHub official, JS/TS, MIT) — parses + validates and models `workflow_dispatch` inputs, but **does not evaluate expressions and has no matrix/strategy converter at all**. TS-only (needs a Node sidecar from Python).
- Python side: **nothing usable** — `actionlint-py` just wraps the Go binary; `gha`/`gha-workflow`/`actions-workflow-parser` don't exist on PyPI. Realistic Python path is `ruamel.yaml`/`PyYAML` + our own model (which we already effectively have).

**`${{ }}` expression evaluators (the piece people ask for)**
- **[@actions/expressions](https://www.npmjs.com/package/@actions/expressions)** (GitHub official, JS/TS, MIT, v0.3.58 2026-06) — a genuinely standalone, maintained evaluator implementing the full grammar (`fromJSON`/`format`/`contains`/... ; `hashFiles`/`success()`/etc. are runner-supplied, you register them). **This is the only standalone evaluator worth using — but it's JS/TS.**
- **Python: no GHA-dialect evaluator exists** (CEL libs are a *different* grammar; `simpleeval` is Python syntax). We'd have to port or shell out — but we likely don't need it: our workflow expressions are simple and the real logic already lives in Python.

**Matrix expansion (include/exclude semantics)**
- No published spec-perfect standalone library; the correct implementations are embedded in act (Go, ~80 lines), [austenstone/actio](https://github.com/austenstone/actio/blob/main/packages/core/src/passes/expandMatrix.ts) (TS, most faithful, unmaintained), and the `wrkflw-matrix` crate (Rust, minor deviations). **Moot for us** — our matrix isn't a native GHA `strategy.matrix`; it's generated by `generate_sweep_configs.py`, so we already own expansion.

**GitHub's own tooling**
- **No first-party dry-run/validate/matrix-preview** anywhere: `gh workflow run`/`view` have no such flag; the REST `dispatches` endpoint *fires a real run* and only returns HTTP 422 on bad input (validation *with* the side effect we're avoiding).

### Recommendation

1. **Build Layer 1 + Layer 2 on our own generation code** — highest value, no Docker/Node/dispatch, most faithful. Extract the bucketing out of the YAML into a shared module so dry-run and CI share one implementation.
2. Optionally run **wrkflw (emulation mode)** as a sanity cross-check that the workflow steps behave locally.
3. Add **actionlint** in CI as an orthogonal YAML-correctness gate.
4. Only reach for **@actions/expressions** (via a small Node sidecar) if we ever need faithful `${{ }}` evaluation — currently unnecessary.
5. Do **not** rely on `act --dryrun` (writes scripts, never runs them) or wait for a GitHub/`gh` dry-run (doesn't exist).

## Acceptance criteria

- [ ] Layer 1: `make dry-run GEN_CMD="..."` (or the equivalent `utils/matrix_logic/dry_run.py` invocation) prints all six bucketed matrices + a summary (job count, SKUs, conc points, eval-marked entries) from the same `generate-cli-command` CI uses.
- [ ] The six bucketing filters are factored out of `e2e-tests.yml` into a shared module imported by both the workflow and the dry run (no duplication / drift).
- [ ] Layer 2: a `DRY_RUN` path through `benchmark_lib.sh` + scenario scripts + launchers writes `config.yaml`, `*_command.txt`, and `mooncake_config.json` to an output dir while skipping salloc/srun, server launch, and aiperf execution.
- [ ] Nothing is dispatched to GitHub and nothing runs on the clusters.
- [ ] (Optional) actionlint wired into CI as a lint gate.

---
_Research: two-pass investigation (repo pipeline map + tooling landscape survey, cross-verified July 2026). Key sources: [act](https://github.com/nektos/act) / [act #1347](https://github.com/nektos/act/issues/1347), [wrkflw](https://github.com/bahdotsh/wrkflw), [actionlint](https://github.com/rhysd/actionlint), [@actions/expressions](https://www.npmjs.com/package/@actions/expressions), [@actions/workflow-parser](https://www.npmjs.com/package/@actions/workflow-parser), [GHA matrix docs](https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/run-job-variations)._

