Fix double run_id bump in recover/watch reset path#368
Open
daniel-thom wants to merge 2 commits into
Open
Conversation
`reset_failed_jobs` reset workflow status (which bumps run_id) and was always immediately followed by `reinitialize_workflow`, which also resets workflow status and bumps run_id. So each recovery bumped run_id twice, leaving a gap (e.g. a recovered job's result jumping from run 1 to run 3). Drop the redundant `reset_workflow_status` from `reset_failed_jobs`; it now only resets the failed jobs, and the single run_id bump comes from the `reinitialize_workflow` step that every caller already performs. This fixes `torc recover` and the identical pattern in `torc watch`'s auto-recovery. Adds a test asserting the recover reset sequence advances run_id by exactly 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes an issue where the recovery/watch reset flow advanced a workflow’s run_id twice by removing the redundant workflow-status reset from reset_failed_jobs, and adds a regression test to ensure the reset+reinitialize sequence bumps run_id exactly once.
Changes:
- Remove the
reset_workflow_statuscall fromreset_failed_jobsso therun_idbump happens only duringreinitialize_workflow. - Add a regression test covering the recover reset sequence to assert
run_idincrements by exactly 1. - Adjust the example CPU-intensive script’s process exit code (currently makes the script report success while exiting non-zero).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/client/commands/recover.rs |
Removes the redundant workflow status reset in the recovery reset path to prevent double run_id bumps. |
tests/test_workflow_manager.rs |
Adds coverage to ensure reset_failed_jobs + reinitialize_workflow bumps run_id once. |
examples/scripts/cpu_intensive.py |
Changes the script’s success-path exit code (currently causes the job to be marked failed). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- reset_failed_jobs now returns the server-reported updated_count from reset_job_status instead of job_ids.len(), so the logged "reset N job(s)" count is accurate (reset_job_status resets all retryable failed jobs workflow-wide; job_ids only gates the no-op early return). - Revert an accidental change to examples/scripts/cpu_intensive.py (sys.exit(1) -> sys.exit(0)) that was swept into the first commit; it made the demo job report success while exiting non-zero. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Addressed in ab2e62d:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
After a failure, the sequence
torc recover <id>thentorc run <id>advanced the workflow'srun_idby 2 instead of 1 — a recovered job's result jumped fromrun_id=1torun_id=3, leavingrun_id=2with no results.Root cause
run_idis incremented in exactly one place:reset_workflow_status(run_id = run_id + 1). The recovery flow called it twice:reset_failed_jobs→reset_workflow_status(bump)reinitialize_workflow→WorkflowManager::reinitialize→reset_workflow_status(bump)Every caller of
reset_failed_jobsimmediately follows it withreinitialize_workflow, so the reset insidereset_failed_jobswas always redundant. (The subsequenttorc run <id>does not add a third bump —run_jobs_cmdonly re-initializes when the workflow is fully uninitialized, which it isn't after recover.)Fix
Drop the redundant
reset_workflow_statusfromreset_failed_jobs; it now only resets the failed jobs, and the singlerun_idbump comes from thereinitialize_workflowstep every caller already performs.This corrects
torc recover(normal + interactive paths) and the identical pattern intorc watch's auto-recovery (watch.rs:1009). Nothing relied onreset_failed_jobsbumpingrun_id.Testing
test_recover_reset_sequence_bumps_run_id_onceruns the actual recover reset pair (reset_failed_jobs+reinitialize_workflow) and assertsrun_idadvances by exactly 1 (fails with+2on the old code).cargo fmt --check,cargo clippy --all --all-targets --all-features -D warnings,dprint check— clean.🤖 Generated with Claude Code