Generalize per-subpoint benchmark checkpointing for reruns #1971

Description

@cquil11

Context

PR #1970 currently has an AgentX-specific retry/resume implementation for multi-node agentic runs. It is useful, but it is broader than the PR scope and should be generalized instead of landing as an AgentX-only helper.

The underlying problem is not agentic-specific: some GitHub jobs run multiple independent benchmark subpoints inside one job, usually via CONC_LIST. If an early subpoint succeeds and a later subpoint fails or times out, rerun --failed restarts the whole job from scratch and repeats the subpoints that already succeeded. On self-hosted runners, GitHub can also route the rerun to a different physical runner, so state left on the local filesystem is not enough to resume.

Current implementation to preserve conceptually

The AgentX prototype had three pieces:

  1. Stable artifact identity for reruns

    • In .github/workflows/benchmark-multinode-tmpl.yml, agentic result identity used the stable runner label (inputs.runner) rather than the physical runner.name so artifacts could be found after GitHub rerouted a failed-job rerun.
  2. Per-concurrency result checkpoints

    • benchmarks/multi_node/agentic_srt.sh iterated over CONC_LIST.
    • Before running a concurrency, it attempted to restore a completed result checkpoint.
    • After a successful concurrency, it staged the result into agentic_checkpoints/.
    • Completed concurrencies were skipped on later attempts.
  3. Validated checkpoint helper

    • Prototype file: utils/agentic_checkpoint.py.
    • stage validated the result JSON before writing a checkpoint.
    • restore required both <base>_conc<N>.json and <base>_conc<N>.success.json.
    • Marker payload included schema_version, base_result_filename, result_filename, and concurrency.
    • Validation rejected mismatched conc and results with num_requests_successful <= 0.
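    • Writes were atomic at the checkpoint level: the JSON was copied first and the success marker was written last, so an interrupted write leaves no marker and is never restored.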
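
The three pieces above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: the function names (stage, restore, run_points), the SCHEMA_VERSION value, the use of the base name as base_result_filename, and the run_one callable are all assumptions made for the example. Only the file naming, the marker fields, and the validation rule come from the description above.

```python
import json
import os
import shutil
from pathlib import Path

SCHEMA_VERSION = 1  # assumed value; the issue names the field, not its value


def _load(path: Path):
    """Parse a JSON file, returning None if it is missing or malformed."""
    try:
        return json.loads(path.read_text())
    except (OSError, ValueError):
        return None


def is_valid_result(path: Path, concurrency: int) -> bool:
    """Agentic rule from the issue: matching conc and at least one success."""
    data = _load(path)
    if not isinstance(data, dict):
        return False
    ok = data.get("num_requests_successful")
    return data.get("conc") == concurrency and isinstance(ok, int) and ok > 0


def _names(base: str, concurrency: int):
    stem = f"{base}_conc{concurrency}"
    return f"{stem}.json", f"{stem}.success.json"


def _marker_payload(base: str, concurrency: int) -> dict:
    return {
        "schema_version": SCHEMA_VERSION,
        "base_result_filename": base,  # assumption: the base name is stored as-is
        "result_filename": _names(base, concurrency)[0],
        "concurrency": concurrency,
    }


def stage(result: Path, ckpt_dir: Path, base: str, concurrency: int) -> bool:
    """Checkpoint a validated result: JSON first, success marker last."""
    if not is_valid_result(result, concurrency):
        return False
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    result_name, marker_name = _names(base, concurrency)
    marker = ckpt_dir / marker_name
    marker.unlink(missing_ok=True)  # drop any stale marker before rewriting
    shutil.copyfile(result, ckpt_dir / result_name)
    tmp = marker.with_name(marker.name + ".tmp")
    tmp.write_text(json.dumps(_marker_payload(base, concurrency)))
    os.replace(tmp, marker)  # the marker appears only once fully written
    return True


def restore(ckpt_dir: Path, base: str, concurrency: int, out_dir: Path) -> bool:
    """Copy a completed checkpoint into out_dir; False means 'run it again'."""
    result_name, marker_name = _names(base, concurrency)
    if _load(ckpt_dir / marker_name) != _marker_payload(base, concurrency):
        return False  # marker missing, malformed, or for another point
    if not is_valid_result(ckpt_dir / result_name, concurrency):
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(ckpt_dir / result_name, out_dir / result_name)
    return True


def run_points(conc_list, base, ckpt_dir, out_dir, run_one):
    """Run each concurrency, skipping those with a valid checkpoint."""
    executed = []
    for conc in conc_list:
        if restore(ckpt_dir, base, conc, out_dir):
            continue  # completed on an earlier attempt
        stage(run_one(conc), ckpt_dir, base, conc)  # run_one returns a result path
        executed.append(conc)
    return executed
```

Because restore requires both files and re-validates the JSON, a checkpoint whose marker never landed is treated the same as no checkpoint at all, and the concurrency simply runs again.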

Workflow artifact behavior:

  • Download prior checkpoint artifacts on github.run_attempt > 1.
  • Upload checkpoint artifacts with if: always() so completed subpoints survive later failures.
  • Artifact pattern was keyed by stable result identity and run attempt.
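
As a rough illustration of the keying described above, the artifact name and the lookup pattern for earlier attempts could be derived like this. The "ckpt-" prefix, the sanitizing rule, and all function names are hypothetical; the issue only states that the pattern combined the stable result identity with the run attempt.

```python
import re


def _safe(result_identity: str) -> str:
    """Collapse characters that are awkward in artifact names into dashes."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", result_identity).strip("-")


def checkpoint_artifact_name(result_identity: str, run_attempt: int) -> str:
    """Name for this attempt's upload; unique per attempt, stable per identity."""
    return f"ckpt-{_safe(result_identity)}-attempt{run_attempt}"


def prior_attempt_pattern(result_identity: str) -> str:
    """Glob matching checkpoint artifacts from any attempt of this identity."""
    return f"ckpt-{_safe(result_identity)}-attempt*"


def should_download_prior(run_attempt: int) -> bool:
    """Only reruns (attempt 2 and later) have earlier checkpoints to fetch."""
    return run_attempt > 1
```

Building the identity from the stable runner label rather than the physical runner name is what lets a rerun that lands on a different machine still match the earlier attempt's artifacts.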

Proposed follow-up

Generalize this into a benchmark-level checkpoint facility, probably under utils/benchmark_checkpoint.py or utils/checkpoint_result.py, instead of an AgentX-only implementation.

Likely requirements:

  • Support any multi-node job that runs multiple independent subpoints in one GitHub job.
  • Keep validation pluggable:
    • agentic result: require matching conc and num_requests_successful > 0.
    • fixed-seq result: require matching max_concurrency / processed result expectations.
  • Keep stable artifact identity where reruns can move across physical runners.
  • Keep marker-file semantics so partial writes are never treated as completed checkpoints.
  • Add focused tests for stage/restore and invalid checkpoint rejection.
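
One possible shape for the pluggable validation, sketched under assumptions: the registry, the kind names "agentic" and "fixed_seq", and the function names are invented for illustration. The fixed-seq check covers only the max_concurrency match, since the "processed result expectations" are not specified here.

```python
from typing import Any, Callable, Dict, Mapping

# A validator takes the parsed result JSON and the expected concurrency.
Validator = Callable[[Mapping[str, Any], int], bool]


def validate_agentic(result: Mapping[str, Any], concurrency: int) -> bool:
    """Agentic rule: matching conc and at least one successful request."""
    ok = result.get("num_requests_successful")
    return (
        result.get("conc") == concurrency
        and isinstance(ok, (int, float))
        and ok > 0
    )


def validate_fixed_seq(result: Mapping[str, Any], concurrency: int) -> bool:
    """Fixed-seq rule (partial): only the max_concurrency match is checked."""
    return result.get("max_concurrency") == concurrency


# Hypothetical registry; new result kinds would add an entry here.
VALIDATORS: Dict[str, Validator] = {
    "agentic": validate_agentic,
    "fixed_seq": validate_fixed_seq,
}


def is_valid_checkpoint(kind: str, result: Mapping[str, Any], concurrency: int) -> bool:
    """Dispatch to the validator for this result kind; unknown kinds are errors."""
    try:
        validator = VALIDATORS[kind]
    except KeyError:
        raise ValueError(f"no validator registered for result kind {kind!r}")
    return validator(result, concurrency)
```

Raising on an unknown kind, rather than returning False, keeps a misconfigured job from silently rerunning every subpoint; whether that is the right trade-off is a design choice left open here.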

This was removed from PR #1970 to reduce scope, but the implementation shape above is the working prototype to start from.

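Summary (translated from Chinese)

PR #1970 contains an AgentX-specific retry/resume implementation for multi-node agentic runs. This issue proposes generalizing it into a general-purpose, benchmark-level checkpoint facility that supports any multi-node job running multiple independent subpoints via CONC_LIST within a single GitHub job. The core pieces are: stable artifact identity (so a rerun can still find earlier results after moving to a different physical runner), per-concurrency result checkpoints (subpoints that succeeded are skipped on later reruns), and a validated checkpoint helper (atomic writes, marker-file semantics, pluggable validation logic).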
