Add local "dry run" mode: run all GHA input/script-generation logic without triggering a run #1972

Description

@cquil11

Summary

Add a local "dry run" mode that reproduces everything the e2e benchmark pipeline does to turn a dispatch into runnable work — expand the matrix, fill in the per-job inputs, and generate the config files and command scripts — but stops short of executing anything: no workflow_dispatch, no runner allocation, no server launch, no benchmark/aiperf run, nothing touching the clusters.

The point is to preview exactly what a given generate-cli-command would produce (the bucketed matrices, the per-job config.yaml, the vllm_command.txt / sglang_command.txt / benchmark_command.txt / lmcache_command.txt / mooncake_config.json) before spending a real run on the GPU fleet.

Motivation

Right now the only way to see what a dispatch expands into is to actually fire e2e-tests.yml, which allocates self-hosted Slurm runners and starts jobs on the GPU clusters. That's expensive and slow for what is often just "did I get the matrix / concurrency sweep / command flags right?"

Almost all of the "fill in inputs + create scripts" logic is our own Python and bash, not native GHA steps — which is good news: a dry run can call the same code directly and stay faithful to CI without emulating GitHub at all.

What a dry run must reproduce (grounded inventory)

Mapped from the current pipeline:

| Stage | Where | What happens |
|---|---|---|
| Matrix generation | utils/matrix_logic/generate_sweep_configs.py (generate_full_sweep ~L223-575, mark_eval_entries ~L124-220) | Reads .github/configs/{nvidia,amd}-master.yaml + runners.yaml, filters by model-prefix/precision/framework/runner/seq-lens/conc/tp/ep, expands conc-start..conc-end by step-size, chunks multi-node conc lists, marks eval entries. Emits config JSON to stdout. |
| Matrix bucketing | .github/workflows/e2e-tests.yml get-jobs (~L70-83) | Six inline python3 -c filters split the JSON into single-node / multi-node / eval / multi-node-eval / agentic / multi-node-agentic matrices. This logic lives only in the YAML today. |
| Per-job input fill | benchmark-tmpl.yml / benchmark-multinode-tmpl.yml | Each matrix entry is unpacked into ~93 job inputs and exported as env vars. |
| Runner + script launch | runners/launch_*.sh | Parses runner name, allocates cluster (salloc/srun), mounts image + workspaces, invokes the scenario script under benchmarks/single_node/.... |
| Command/config generation | benchmarks/benchmark_lib.sh (build_replay_cmd ~L1210-1348, run_agentic_replay_and_write_outputs ~L1365-1400) + scenario scripts (e.g. benchmarks/single_node/agentic/*.sh) | Builds the server launch command array and writes sglang_command.txt / vllm_command.txt; builds the full aiperf CLI ($REPLAY_CMD, 40+ flags from env) and writes benchmark_command.txt; then immediately executes the server + benchmark. |

The critical wrinkle: matrix generation is already a clean Python function (trivial to reuse), but script generation and execution are interleaved in bash — the same functions that build $SGLANG_CMD / $REPLAY_CMD also launch the server and run aiperf. So a real "generate but don't run" needs the generation split from the execution.

There is no existing dry-run/preview path in the pipeline (only unrelated --dry-run flags in utils/agentic/sample_proxy_traces.py).

Proposed shape (two layers)

Layer 1 — matrix/config preview (cheap, do first). A CLI (utils/matrix_logic/dry_run.py or make dry-run GEN_CMD="...") that takes the same generate-cli-command, calls generate_sweep_configs.py, and applies the same bucketing to print all six matrices + a summary (models, SKUs, concurrency points, job count, which entries are eval-marked). Requires factoring the six inline python3 -c filters out of e2e-tests.yml into a shared module (e.g. utils/matrix_logic/bucket_configs.py) that both the workflow and the dry run import — so they can't drift.
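
As a rough illustration of the shared bucketing module, here is a minimal sketch of what utils/matrix_logic/bucket_configs.py could look like. The entry flags ("multinode", "eval", "agentic") are hypothetical placeholders: the real predicates are the six inline python3 -c filters in e2e-tests.yml and may use different fields, and this sketch also assumes each entry lands in exactly one bucket, which the real filters may not guarantee.

```python
from typing import Any

Entry = dict[str, Any]

# Bucket names are taken from the issue text; the order here is arbitrary.
BUCKET_NAMES = (
    "single-node",
    "multi-node",
    "eval",
    "multi-node-eval",
    "agentic",
    "multi-node-agentic",
)


def bucket_name(entry: Entry) -> str:
    """Return the bucket for one matrix entry.

    ASSUMPTION: "multinode", "eval" and "agentic" are hypothetical flags.
    The real field names live in the inline filters in e2e-tests.yml.
    """
    multinode = bool(entry.get("multinode", False))
    if entry.get("agentic", False):
        kind = "agentic"
    elif entry.get("eval", False):
        kind = "eval"
    else:
        kind = ""
    if not kind:
        return "multi-node" if multinode else "single-node"
    return f"multi-node-{kind}" if multinode else kind


def bucket_configs(entries: list[Entry]) -> dict[str, list[Entry]]:
    """Split generated entries into the six matrices (always all six keys)."""
    buckets: dict[str, list[Entry]] = {name: [] for name in BUCKET_NAMES}
    for entry in entries:
        buckets[bucket_name(entry)].append(entry)
    return buckets


def summarize(buckets: dict[str, list[Entry]]) -> dict[str, Any]:
    """Small summary for the dry-run printout: total jobs and per-bucket counts."""
    per_bucket = {name: len(items) for name, items in buckets.items()}
    return {"job_count": sum(per_bucket.values()), "per_bucket": per_bucket}
```

If both the get-jobs step and the dry-run CLI import one function like bucket_configs, a change to a filter shows up in the preview and in CI at the same time.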

Layer 2 — script/command preview (the valuable part). Thread a DRY_RUN=1 through benchmark_lib.sh, the scenario scripts, and the launchers so they still build the command arrays and write config.yaml / *_command.txt / mooncake_config.json into an output dir, but skip salloc/srun, the backgrounded server launch, and the aiperf invocation. This makes the generated artifacts inspectable locally without a cluster. Guard the execution lines (server launch, build_replay_cmd → run, run_agentic_replay_and_write_outputs run/tee) behind the flag.
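
The guard pattern for Layer 2 can be sketched as follows. The real code is bash in benchmark_lib.sh and the launchers; this Python version only illustrates the shape (always write the artifact, execute only when DRY_RUN is not set), and the helper names is_dry_run, write_command, and emit_and_maybe_run are hypothetical.

```python
import os
import shlex
import subprocess
from pathlib import Path
from typing import Mapping, Optional


def is_dry_run(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when DRY_RUN=1 is set (the flag name comes from the proposal)."""
    env = os.environ if env is None else env
    return env.get("DRY_RUN", "0") == "1"


def write_command(cmd: list[str], path: Path) -> str:
    """Write the shell-quoted command to an artifact file and return the text."""
    path.parent.mkdir(parents=True, exist_ok=True)
    text = shlex.join(cmd)
    path.write_text(text + "\n")
    return text


def emit_and_maybe_run(
    cmd: list[str],
    artifact: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Generation always happens; execution is skipped in dry-run mode.

    Returns None in dry-run mode, otherwise the command's exit code.
    """
    write_command(cmd, artifact)
    if is_dry_run(env):
        return None
    return subprocess.run(cmd, check=False).returncode
```

The design point is that the write step sits above the guard, so the artifact is identical in both modes and only the final execution differs.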

Net: dry-run GEN_CMD="full-sweep --config-files ... --runner-type h200 --precision fp8" writes a tree of exactly-what-would-run scripts and configs, dispatching nothing.

Library research

I surveyed the landscape for tools that could power this (full report + sources below). Conclusion up front: no off-the-shelf tool does what we want end-to-end. The two hard pieces (matrix expansion and previewing step-generated files) are exactly the parts GitHub keeps server-side, and only partial public reimplementations of them exist. Since our generation logic is already our own Python/bash, building a thin custom harness on top of it is both the most faithful and the least work.

Full local GHA runners

  • nektos/act (~71k stars, active) — the incumbent. Important gotcha for us: act --dryrun/-n writes each step's shell script to disk but never executes them (issue #1347), so our Python/bash generation never runs and none of the config/command files materialize — it does not give us the preview we want. act in normal mode would run the generation, but needs Docker + Ubuntu containers and has no Slurm/self-hosted-runner fidelity. Not a fit.
  • wrkflw (Rust, ~3.3k stars, v0.8.0 2026-04) — the real newcomer and closest off-the-shelf fit: its emulation mode runs steps directly on the host with no container and no dispatch, so our generation actually runs and files appear. Worth keeping as a cross-check of our custom harness, not as the primary tool (younger, host-emulation quirks; still runs on your host, not the cluster — which is fine for preview).
  • gflows (abandoned), Cirrus CLI (different YAML schema), Dagger (pipeline-as-code rewrite, not a preview), Forgejo/Gitea act_runner (server-coupled act) — none fit.

Static analysis / linters (all lint-only — none expand matrices or evaluate expressions)

  • actionlint (~4k stars, active) — gold standard for YAML correctness: type-checks ${{ }}, runs shellcheck on run: shell and pyflakes on run: python, validates needs:/matrix refs/runner labels/cron. It's a type-checker, not an evaluator — won't enumerate matrix combos or compute values. Worth adding as an orthogonal CI gate; doesn't power the dry run.
  • zizmor (security), action-validator (JSON-schema only), check-jsonschema, poutine, ghalint, octoscan — orthogonal/weaker; none help.

Workflow parsers

  • @actions/workflow-parser (GitHub official, JS/TS, MIT) — parses + validates and models workflow_dispatch inputs, but does not evaluate expressions and has no matrix/strategy converter at all. TS-only (needs a Node sidecar from Python).
  • Python side: nothing usable. actionlint-py just wraps the Go binary; gha/gha-workflow/actions-workflow-parser don't exist on PyPI. Realistic Python path is ruamel.yaml/PyYAML + our own model (which we already effectively have).

${{ }} expression evaluators (the piece people ask for)

  • @actions/expressions (GitHub official, JS/TS, MIT, v0.3.58 2026-06) — a genuinely standalone, maintained evaluator implementing the full grammar (fromJSON/format/contains/... ; hashFiles/success()/etc. are runner-supplied, you register them). This is the only standalone evaluator worth using — but it's JS/TS.
  • Python: no GHA-dialect evaluator exists (CEL libs are a different grammar; simpleeval is Python syntax). We'd have to port or shell out — but we likely don't need it: our workflow expressions are simple and the real logic already lives in Python.

Matrix expansion (include/exclude semantics)

  • No published spec-perfect standalone library; the correct implementations are embedded in act (Go, ~80 lines), austenstone/actio (TS, most faithful, unmaintained), and the wrkflw-matrix crate (Rust, minor deviations). Moot for us — our matrix isn't a native GHA strategy.matrix; it's generated by generate_sweep_configs.py, so we already own expansion.

GitHub's own tooling

  • No first-party dry-run/validate/matrix-preview anywhere: gh workflow run/view have no such flag; the REST/GraphQL dispatches endpoint fires a real run and only returns HTTP 422 on bad input (validation with the side effect we're avoiding).

Recommendation

  1. Build Layer 1 + Layer 2 on our own generation code — highest value, no Docker/Node/dispatch, most faithful. Extract the bucketing out of the YAML into a shared module so dry-run and CI share one implementation.
  2. Optionally run wrkflw (emulation mode) as a sanity cross-check that the workflow steps behave locally.
  3. Add actionlint in CI as an orthogonal YAML-correctness gate.
  4. Only reach for @actions/expressions (via a small Node sidecar) if we ever need faithful ${{ }} evaluation — currently unnecessary.
  5. Do not rely on act --dryrun (writes scripts, never runs them) or wait for a GitHub/gh dry-run (doesn't exist).

Acceptance criteria

  • Layer 1: dry-run GEN_CMD="..." prints all six bucketed matrices + a summary (job count, SKUs, conc points, eval-marked entries) from the same generate-cli-command CI uses.
  • The six bucketing filters are factored out of e2e-tests.yml into a shared module imported by both the workflow and the dry run (no duplication / drift).
  • Layer 2: a DRY_RUN path through benchmark_lib.sh + scenario scripts + launchers writes config.yaml, *_command.txt, and mooncake_config.json to an output dir while skipping salloc/srun, server launch, and aiperf execution.
  • Nothing is dispatched to GitHub and nothing runs on the clusters.
  • (Optional) actionlint wired into CI as a lint gate.
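
The Layer 2 criterion could be checked mechanically with something like the sketch below. It assumes one output directory per job, which the issue does not specify, and the helper name missing_artifacts is hypothetical. Which jobs are expected to produce mooncake_config.json is also not stated, so that check is opt-in.

```python
from pathlib import Path

# File names are taken from the issue text.
REQUIRED_FILES = ("config.yaml",)
COMMAND_GLOB = "*_command.txt"
MOONCAKE_FILE = "mooncake_config.json"


def missing_artifacts(job_dir: Path, expect_mooncake: bool = False) -> list[str]:
    """Return the expected dry-run artifacts that are absent from job_dir.

    ASSUMPTION: each job writes into its own directory; the real layout
    of the dry-run output tree is not defined yet.
    """
    missing = [name for name in REQUIRED_FILES if not (job_dir / name).is_file()]
    # At least one of vllm/sglang/benchmark/lmcache *_command.txt should exist.
    if not any(job_dir.glob(COMMAND_GLOB)):
        missing.append(COMMAND_GLOB)
    if expect_mooncake and not (job_dir / MOONCAKE_FILE).is_file():
        missing.append(MOONCAKE_FILE)
    return missing
```

An empty list for every job directory would indicate that the dry run materialized the files the criteria name.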

Research: two-pass investigation (repo pipeline map + tooling landscape survey, cross-verified July 2026). Key sources: act / act #1347, wrkflw, actionlint, @actions/expressions, @actions/workflow-parser, GHA matrix docs.
