Fix double run_id bump in recover/watch reset path by daniel-thom · Pull Request #368 · NatLabRockies/torc

daniel-thom · 2026-06-06T01:48:24Z

Problem

After a failure, the sequence torc recover <id> then torc run <id> advanced the workflow's run_id by 2 instead of 1 — a recovered job's result jumped from run_id=1 to run_id=3, leaving run_id=2 with no results.

Root cause

run_id is incremented in exactly one place: reset_workflow_status (run_id = run_id + 1). The recovery flow called it twice:

- reset_failed_jobs → reset_workflow_status (bump)
- reinitialize_workflow → WorkflowManager::reinitialize → reset_workflow_status (bump)

Every caller of reset_failed_jobs immediately follows it with reinitialize_workflow, so the reset inside reset_failed_jobs was always redundant. (The subsequent torc run <id> does not add a third bump — run_jobs_cmd only re-initializes when the workflow is fully uninitialized, which it isn't after recover.)
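
The double bump can be reproduced with a toy in-memory model. This is a minimal sketch, not torc's actual code: the `Workflow` struct, the function bodies, and the `recover_old` helper are hypothetical stand-ins that only mirror the call chain described above.

```rust
// Toy model of the pre-fix call chain. The struct and the function bodies
// are hypothetical; only the function names come from the PR description.
#[derive(Debug)]
struct Workflow {
    run_id: u32,
}

/// The single place where run_id is incremented.
fn reset_workflow_status(wf: &mut Workflow) {
    wf.run_id += 1;
}

/// Pre-fix behavior: resetting failed jobs also reset the workflow status.
fn reset_failed_jobs_old(wf: &mut Workflow) {
    // (the failed jobs themselves would be reset here)
    reset_workflow_status(wf); // first bump
}

/// Reinitialization resets the workflow status as well.
fn reinitialize_workflow(wf: &mut Workflow) {
    reset_workflow_status(wf); // second bump
}

/// Runs the pre-fix recover sequence and returns the resulting run_id.
fn recover_old(start_run_id: u32) -> u32 {
    let mut wf = Workflow { run_id: start_run_id };
    reset_failed_jobs_old(&mut wf);
    reinitialize_workflow(&mut wf);
    wf.run_id
}
```

Under these assumptions, `recover_old(1)` returns 3, which matches the reported jump from run_id=1 to run_id=3.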

Fix

Drop the redundant reset_workflow_status from reset_failed_jobs; it now only resets the failed jobs, and the single run_id bump comes from the reinitialize_workflow step every caller already performs.

This corrects torc recover (normal + interactive paths) and the identical pattern in torc watch's auto-recovery (watch.rs:1009). Nothing relied on reset_failed_jobs bumping run_id.
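
The fixed sequence might look roughly like the sketch below. It assumes a simplified in-memory model: the `JobStatus` enum, the `Workflow` struct, and the `recover` helper are illustrative and do not reflect torc's real types or signatures.

```rust
// Toy model of the post-fix call chain. All types here are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Done,
    Failed,
    Ready,
}

#[derive(Debug)]
struct Workflow {
    run_id: u32,
    jobs: Vec<JobStatus>,
}

/// The single place where run_id is incremented.
fn reset_workflow_status(wf: &mut Workflow) {
    wf.run_id += 1;
}

/// Post-fix behavior: only the failed jobs are reset; run_id is untouched.
/// Returns the number of jobs that were reset.
fn reset_failed_jobs(wf: &mut Workflow) -> usize {
    let mut count = 0;
    for status in wf.jobs.iter_mut() {
        if *status == JobStatus::Failed {
            *status = JobStatus::Ready;
            count += 1;
        }
    }
    count
}

/// Reinitialization performs the one and only run_id bump.
fn reinitialize_workflow(wf: &mut Workflow) {
    reset_workflow_status(wf);
}

/// The recover reset pair: reset failed jobs, then reinitialize.
fn recover(wf: &mut Workflow) -> usize {
    let reset = reset_failed_jobs(wf);
    reinitialize_workflow(wf);
    reset
}
```

In this model a workflow at run_id 1 with one failed job ends at run_id 2 after `recover`, with that job back in the ready state.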

Testing

- New test test_recover_reset_sequence_bumps_run_id_once runs the actual recover reset pair (reset_failed_jobs + reinitialize_workflow) and asserts run_id advances by exactly 1 (fails with +2 on the old code).
- cargo fmt --check, cargo clippy --all --all-targets --all-features -D warnings, dprint check: all clean.

🤖 Generated with Claude Code

`reset_failed_jobs` reset workflow status (which bumps run_id) and was always immediately followed by `reinitialize_workflow`, which also resets workflow status and bumps run_id. So each recovery bumped run_id twice, leaving a gap (e.g. a recovered job's result jumping from run 1 to run 3).

Drop the redundant `reset_workflow_status` from `reset_failed_jobs`; it now only resets the failed jobs, and the single run_id bump comes from the `reinitialize_workflow` step that every caller already performs.

This fixes `torc recover` and the identical pattern in `torc watch`'s auto-recovery. Adds a test asserting the recover reset sequence advances run_id by exactly 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes an issue where the recovery/watch reset flow advanced a workflow’s run_id twice by removing the redundant workflow-status reset from reset_failed_jobs, and adds a regression test to ensure the reset+reinitialize sequence bumps run_id exactly once.

Changes:

Remove the reset_workflow_status call from reset_failed_jobs so the run_id bump happens only during reinitialize_workflow.
Add a regression test covering the recover reset sequence to assert run_id increments by exactly 1.
Adjust the example CPU-intensive script’s process exit code (currently makes the script report success while exiting non-zero).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/client/commands/recover.rs` | Removes the redundant workflow status reset in the recovery reset path to prevent double `run_id` bumps. |
| `tests/test_workflow_manager.rs` | Adds coverage to ensure `reset_failed_jobs` + `reinitialize_workflow` bumps `run_id` once. |
| `examples/scripts/cpu_intensive.py` | Changes the script’s success-path exit code (currently causes the job to be marked failed). |

- reset_failed_jobs now returns the server-reported updated_count from reset_job_status instead of job_ids.len(), so the logged "reset N job(s)" count is accurate (reset_job_status resets all retryable failed jobs workflow-wide; job_ids only gates the no-op early return).
- Revert an accidental change to examples/scripts/cpu_intensive.py (sys.exit(1) -> sys.exit(0)) that was swept into the first commit; it made the demo job report success while exiting non-zero.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daniel-thom · 2026-06-06T02:27:01Z

Addressed in ab2e62d:

recover.rs — reset_failed_jobs now returns the server's updated_count from reset_job_status instead of job_ids.len(), so the logged count is accurate. (The broader point that job_ids doesn't filter which jobs reset is by design — reset_job_status is a workflow-wide reset of retryable failed jobs — so I left that as-is; the param only gates the no-op early return.)
cpu_intensive.py — that exit-code change was an accidental inclusion of a failure-injection edit; reverted, so it's no longer part of this PR.
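
To illustrate why the server-reported count can differ from job_ids.len(), here is a minimal sketch. It assumes a simplified model in which a job is just a `failed` flag; the `Job` struct and both function signatures are hypothetical and only borrow the names used in the comment above.

```rust
// Hypothetical stand-in for a job record; torc's real types differ.
struct Job {
    failed: bool,
}

/// Stand-in for the server-side reset: clears every failed job in the
/// workflow and reports how many it changed (the "updated_count").
fn reset_job_status(jobs: &mut [Job]) -> usize {
    let mut updated_count = 0;
    for job in jobs.iter_mut().filter(|j| j.failed) {
        job.failed = false;
        updated_count += 1;
    }
    updated_count
}

/// `job_ids` only gates the early return; it does not filter the reset.
fn reset_failed_jobs(jobs: &mut [Job], job_ids: &[u64]) -> usize {
    if job_ids.is_empty() {
        return 0; // nothing requested, so skip the reset entirely
    }
    // Return the count of jobs actually reset, not job_ids.len().
    reset_job_status(jobs)
}
```

With two failed jobs in the workflow and a single entry in `job_ids`, this sketch returns 2, whereas `job_ids.len()` would have reported 1.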

daniel-thom requested a review from Copilot June 6, 2026 02:14

Copilot started reviewing on behalf of daniel-thom June 6, 2026 02:15

Copilot AI reviewed Jun 6, 2026

Comment thread examples/scripts/cpu_intensive.py Outdated

Comment thread src/client/commands/recover.rs

daniel-thom wants to merge 2 commits into main from fix-recover-run-id-double-bump

2 participants
