Fix double run_id bump in recover/watch reset path #368

Open
daniel-thom wants to merge 2 commits into main from fix-recover-run-id-double-bump

@daniel-thom
Collaborator

Problem

After a failure, the sequence torc recover <id> then torc run <id> advanced the workflow's run_id by 2 instead of 1 — a recovered job's result jumped from run_id=1 to run_id=3, leaving run_id=2 with no results.

Root cause

run_id is incremented in exactly one place: reset_workflow_status (run_id = run_id + 1). The recovery flow called it twice:

  1. reset_failed_jobs → reset_workflow_status (bump)
  2. reinitialize_workflow → WorkflowManager::reinitialize → reset_workflow_status (bump)

Every caller of reset_failed_jobs immediately follows it with reinitialize_workflow, so the reset inside reset_failed_jobs was always redundant. (The subsequent torc run <id> does not add a third bump — run_jobs_cmd only re-initializes when the workflow is fully uninitialized, which it isn't after recover.)
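
The call chain above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the `Workflow` struct, its `initialized` flag, the `also_reset_status` switch, and every function body are hypothetical stand-ins that borrow the names from this description; none of it is torc's actual code.

```rust
// Toy model of the recover/run call chain. All types and bodies are
// hypothetical; only the function names come from the PR description.

#[derive(Debug, Default)]
struct Workflow {
    run_id: u32,
    initialized: bool,
}

/// The only place run_id changes in this model.
fn reset_workflow_status(wf: &mut Workflow) {
    wf.run_id += 1;
}

/// `also_reset_status = true` models the old behavior, `false` the fix.
fn reset_failed_jobs(wf: &mut Workflow, also_reset_status: bool) {
    // Resetting the failed jobs themselves is omitted from this sketch.
    if also_reset_status {
        reset_workflow_status(wf); // the redundant bump
    }
}

fn reinitialize_workflow(wf: &mut Workflow) {
    reset_workflow_status(wf); // the bump every caller already gets
    wf.initialized = true;
}

/// Models `torc run`: re-initializes only an uninitialized workflow.
fn run_jobs_cmd(wf: &mut Workflow) {
    if !wf.initialized {
        reinitialize_workflow(wf);
    }
}

/// Runs recover followed by run and returns how far run_id advanced.
fn recover_then_run(also_reset_status: bool) -> u32 {
    let mut wf = Workflow { run_id: 1, initialized: true };
    let before = wf.run_id;
    reset_failed_jobs(&mut wf, also_reset_status);
    reinitialize_workflow(&mut wf);
    run_jobs_cmd(&mut wf);
    wf.run_id - before
}
```

In this model, `recover_then_run(true)` advances run_id by 2 and `recover_then_run(false)` by 1, and the trailing `run_jobs_cmd` call adds nothing in either case because the workflow is already initialized.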

Fix

Drop the redundant reset_workflow_status from reset_failed_jobs; it now only resets the failed jobs, and the single run_id bump comes from the reinitialize_workflow step every caller already performs.

This corrects torc recover (normal + interactive paths) and the identical pattern in torc watch's auto-recovery (watch.rs:1009). Nothing relied on reset_failed_jobs bumping run_id.

Testing

  • New test test_recover_reset_sequence_bumps_run_id_once runs the actual recover reset pair (reset_failed_jobs + reinitialize_workflow) and asserts run_id advances by exactly 1 (fails with +2 on the old code).
  • cargo fmt --check, cargo clippy --all --all-targets --all-features -- -D warnings, dprint check: all clean.

🤖 Generated with Claude Code

`reset_failed_jobs` reset workflow status (which bumps run_id) and was always
immediately followed by `reinitialize_workflow`, which also resets workflow
status and bumps run_id. So each recovery bumped run_id twice, leaving a gap
(e.g. a recovered job's result jumping from run 1 to run 3).

Drop the redundant `reset_workflow_status` from `reset_failed_jobs`; it now
only resets the failed jobs, and the single run_id bump comes from the
`reinitialize_workflow` step that every caller already performs. This fixes
`torc recover` and the identical pattern in `torc watch`'s auto-recovery.

Adds a test asserting the recover reset sequence advances run_id by exactly 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Fixes an issue where the recovery/watch reset flow advanced a workflow’s run_id twice by removing the redundant workflow-status reset from reset_failed_jobs, and adds a regression test to ensure the reset+reinitialize sequence bumps run_id exactly once.

Changes:

  • Remove the reset_workflow_status call from reset_failed_jobs so the run_id bump happens only during reinitialize_workflow.
  • Add a regression test covering the recover reset sequence to assert run_id increments by exactly 1.
  • Adjust the example CPU-intensive script’s process exit code (currently makes the script report success while exiting non-zero).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/client/commands/recover.rs | Removes the redundant workflow status reset in the recovery reset path to prevent double run_id bumps. |
| tests/test_workflow_manager.rs | Adds coverage to ensure reset_failed_jobs + reinitialize_workflow bumps run_id once. |
| examples/scripts/cpu_intensive.py | Changes the script’s success-path exit code (currently causes the job to be marked failed). |

Comment thread examples/scripts/cpu_intensive.py Outdated
Comment thread src/client/commands/recover.rs
- reset_failed_jobs now returns the server-reported updated_count from
  reset_job_status instead of job_ids.len(), so the logged "reset N job(s)"
  count is accurate (reset_job_status resets all retryable failed jobs
  workflow-wide; job_ids only gates the no-op early return).
- Revert an accidental change to examples/scripts/cpu_intensive.py
  (sys.exit(1) -> sys.exit(0)) that was swept into the first commit; it made
  the demo job report success while exiting non-zero.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom
Collaborator Author

Addressed in ab2e62d:

  • recover.rs: reset_failed_jobs now returns the server's updated_count from reset_job_status instead of job_ids.len(), so the logged count is accurate. (The broader point that job_ids doesn't filter which jobs are reset is by design: reset_job_status is a workflow-wide reset of retryable failed jobs, so I left that as-is; the parameter only gates the no-op early return.)
  • cpu_intensive.py: that exit-code change was an accidental inclusion of a failure-injection edit; reverted, so it's no longer part of this PR.
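
To illustrate why job_ids.len() and updated_count can differ, here is a minimal sketch. It assumes a simplified model in which the `Status` enum, the slice-based signatures, and `demo_counts` are hypothetical, and in which every failed job counts as retryable; the real reset_job_status is a server call whose signature is not shown in this PR.

```rust
// Hypothetical model: names mirror the PR discussion, bodies are invented.

#[derive(Clone, Copy, PartialEq)]
enum Status {
    Failed,
    Done,
    Ready,
}

/// Models the server call: resets every failed job in the workflow
/// and reports how many it changed.
fn reset_job_status(jobs: &mut [Status]) -> usize {
    let mut updated_count = 0;
    for status in jobs.iter_mut() {
        if *status == Status::Failed {
            *status = Status::Ready;
            updated_count += 1;
        }
    }
    updated_count
}

/// `job_ids` only gates the no-op early return; it does not filter.
fn reset_failed_jobs(jobs: &mut [Status], job_ids: &[usize]) -> usize {
    if job_ids.is_empty() {
        return 0;
    }
    reset_job_status(jobs)
}

/// Returns (job_ids.len(), count reported by the reset) for one scenario.
fn demo_counts() -> (usize, usize) {
    let mut jobs = [Status::Failed, Status::Done, Status::Failed, Status::Failed];
    let job_ids = [0]; // the caller names a single job
    let reported = reset_failed_jobs(&mut jobs, &job_ids);
    (job_ids.len(), reported)
}
```

Here the caller passes one job id but three failed jobs are reset, so logging job_ids.len() would under-report the work done, while the returned count matches it.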
