Generalize per-subpoint benchmark checkpointing for reruns #1971

## Context

PR #1970 included an AgentX-specific retry/resume implementation for multi-node agentic runs. It is useful, but it goes beyond that PR's scope and should be generalized rather than land as an AgentX-only helper.

The underlying problem is not agentic-specific: some GitHub jobs run multiple independent benchmark subpoints inside one job, usually via `CONC_LIST`. If an early subpoint succeeds and a later subpoint fails or times out, a failed-job rerun (`gh run rerun --failed`) reruns the whole job from scratch, including the subpoints that already succeeded. On self-hosted runners, GitHub can also route the rerun to a different physical runner, so state left on the original runner's local filesystem cannot be relied on.

## Current implementation to preserve conceptually

The AgentX prototype had three pieces:

1. Stable artifact identity for reruns
   - In `.github/workflows/benchmark-multinode-tmpl.yml`, agentic result identity used the stable runner label (`inputs.runner`) rather than the physical `runner.name` so artifacts could be found after GitHub rerouted a failed-job rerun.

2. Per-concurrency result checkpoints
   - `benchmarks/multi_node/agentic_srt.sh` iterated over `CONC_LIST`.
   - Before running a concurrency, it attempted to restore a completed result checkpoint.
   - After a successful concurrency, it staged the result into `agentic_checkpoints/`.
   - Completed concurrencies were skipped on later attempts.

3. Validated checkpoint helper
   - Prototype file: `utils/agentic_checkpoint.py`.
   - `stage` validated the result JSON before writing a checkpoint.
   - `restore` required both `<base>_conc<N>.json` and `<base>_conc<N>.success.json`.
   - Marker payload included `schema_version`, `base_result_filename`, `result_filename`, and `concurrency`.
   - Validation rejected mismatched `conc` and results with `num_requests_successful <= 0`.
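   - Writes were ordered for atomicity: the result JSON was copied first and the success marker written last, so an interrupted write never left a checkpoint that `restore` would accept.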
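
The stage/restore contract above can be sketched roughly as follows. This is not the prototype's code: the function names, `SCHEMA_VERSION = 1`, and the temp-file-plus-rename step are assumptions made for illustration. Only the filenames, the marker fields, and the validation rules come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical; the prototype's actual value is not stated


def _names(base, conc):
    """Result and marker filenames: <base>_conc<N>.json / .success.json."""
    return f"{base}_conc{conc}.json", f"{base}_conc{conc}.success.json"


def _write_atomic(dest, payload):
    """Write to a temp file in the same directory, then rename over dest."""
    fd, tmp = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".tmp")
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    os.replace(tmp, dest)


def validate_result(payload, conc):
    """Raise ValueError unless the result matches conc and has successes."""
    data = json.loads(payload)
    if not isinstance(data, dict) or data.get("conc") != conc:
        raise ValueError(f"result does not match conc={conc}")
    ok = data.get("num_requests_successful")
    if not isinstance(ok, int) or ok <= 0:
        raise ValueError("num_requests_successful must be > 0")


def stage(result_path, ckpt_dir, base, conc):
    """Validate a finished result, then checkpoint it (marker written last)."""
    payload = Path(result_path).read_bytes()
    validate_result(payload, conc)
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    result_name, marker_name = _names(base, conc)
    _write_atomic(ckpt / result_name, payload)
    marker = {
        "schema_version": SCHEMA_VERSION,
        "base_result_filename": base,
        "result_filename": result_name,
        "concurrency": conc,
    }
    _write_atomic(ckpt / marker_name, json.dumps(marker).encode())


def restore(ckpt_dir, base, conc, dest_dir):
    """Copy a completed checkpoint into dest_dir; return False if unusable."""
    result_name, marker_name = _names(base, conc)
    src, marker_path = Path(ckpt_dir) / result_name, Path(ckpt_dir) / marker_name
    if not (src.is_file() and marker_path.is_file()):
        return False  # result without marker (or vice versa) is incomplete
    try:
        marker = json.loads(marker_path.read_bytes())
        payload = src.read_bytes()
        validate_result(payload, conc)
    except ValueError:  # also covers json.JSONDecodeError
        return False
    if not isinstance(marker, dict) or marker.get("concurrency") != conc:
        return False
    if marker.get("result_filename") != result_name:
        return False
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    _write_atomic(dest / result_name, payload)
    return True
```

A caller looping over `CONC_LIST` would try `restore` first and skip the run when it returns `True`, and call `stage` only after a run exits successfully.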

Workflow artifact behavior:

- Download prior checkpoint artifacts on `github.run_attempt > 1`.
- Upload checkpoint artifacts with `if: always()` so completed subpoints survive later failures.
- Artifact pattern was keyed by stable result identity and run attempt.

## Proposed follow-up

Generalize this into a benchmark-level checkpoint facility, probably under `utils/benchmark_checkpoint.py` or `utils/checkpoint_result.py`, instead of an AgentX-only implementation.

Likely requirements:

- Support any multi-node job that runs multiple independent subpoints in one GitHub job.
- Keep validation pluggable:
  - agentic result: require matching `conc` and `num_requests_successful > 0`.
  - fixed-seq result: require matching `max_concurrency` / processed result expectations.
- Keep stable artifact identity where reruns can move across physical runners.
- Keep marker-file semantics so partial writes are never treated as completed checkpoints.
- Add focused tests for stage/restore and invalid checkpoint rejection.
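
One possible shape for the pluggable validation requirement is a small registry keyed by result kind. The registry, the decorator, and the kind names `"agentic"` and `"fixed_seq"` are assumptions for illustration. The fixed-seq check covers only the `max_concurrency` match, since the remaining "processed result expectations" are not specified here.

```python
VALIDATORS = {}  # hypothetical registry: result kind -> validator function


def register(kind):
    """Decorator that registers a validator under a result kind."""
    def decorator(fn):
        VALIDATORS[kind] = fn
        return fn
    return decorator


@register("agentic")
def validate_agentic(result, subpoint):
    """Agentic result: matching conc and at least one successful request."""
    if result.get("conc") != subpoint:
        raise ValueError(f"conc mismatch: expected {subpoint}")
    ok = result.get("num_requests_successful")
    if not isinstance(ok, int) or ok <= 0:
        raise ValueError("num_requests_successful must be > 0")


@register("fixed_seq")
def validate_fixed_seq(result, subpoint):
    """Fixed-seq result: matching max_concurrency (other checks omitted)."""
    if result.get("max_concurrency") != subpoint:
        raise ValueError(f"max_concurrency mismatch: expected {subpoint}")


def validate(kind, result, subpoint):
    """Dispatch to the validator for this kind; unknown kinds are rejected."""
    if kind not in VALIDATORS:
        raise ValueError(f"no validator registered for {kind!r}")
    VALIDATORS[kind](result, subpoint)
```

With this shape, the stage and restore paths stay identical across benchmark types and only the validator differs, which also gives the focused tests a narrow surface: one valid and one invalid payload per kind.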

This was removed from PR #1970 to reduce scope, but the implementation shape above is the working prototype to start from.

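## Summary (originally in Chinese)

PR #1970 included an AgentX-specific retry/resume implementation for multi-node agentic runs. This issue proposes generalizing it into a benchmark-level checkpoint facility that supports any multi-node job running multiple independent subpoints via `CONC_LIST` within a single GitHub job. The core pieces are: stable artifact identity (so a rerun can still find earlier results when it lands on a different physical runner), per-concurrency result checkpoints (subpoints that succeeded are skipped on later reruns), and a validated checkpoint helper (atomic writes, marker-file semantics, pluggable validation logic).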

