feat: resilient background job retry & monitoring #1072
Open
Entr0zy wants to merge 1 commit into
Adds production-grade retry logic and execution monitoring for all background jobs, resolving issue rohitdash08#130.

New service (app/services/job_retry.py):
- JobRecord: per-job dataclass tracking run_count, success_count, failure_count, last_run, last_success, last_failure, last_error, status (idle/running/success/failed)
- JobMonitor: thread-safe in-process singleton registry
- retryable(): decorator factory that wraps a job function with exponential-backoff retry (configurable max_retries, backoff_base) and automatically records every execution in the JobMonitor
- _dispatch_due_reminders_job(app): sends all due unsent Reminder rows
- init_job_scheduler(app): registers reminder_dispatch on APScheduler (every 1 min) wrapped with retryable(); stores the scheduler at app.extensions['job_scheduler']

New routes (app/routes/jobs.py):
- GET /jobs/status: JWT required; returns all job records
- POST /jobs/trigger/<job_id>: JWT required; manually fires a job

app/__init__.py:
- Calls init_job_scheduler(app) on startup (skipped when TESTING=true or DISABLE_SCHEDULER is set)

Tests (tests/test_job_retry.py, 20 tests):
- JobMonitor: register/idempotent/all_records/to_dict
- retryable: success, retry-on-failure, recover-on-2nd-attempt, permanent-failure, error-cleared-on-success, timestamps, run_count
- HTTP: GET /jobs/status auth gate, JSON shape, registered job visible, record key structure
- HTTP: POST /jobs/trigger auth gate, 503 when no scheduler
- Dispatch job: runs without error in fresh DB context

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
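For readers who want a concrete picture of the pattern the commit message describes, here is a minimal standard-library sketch of a retry decorator that records into an in-process monitor. It is not the code from app/services/job_retry.py. Several details are assumptions made for illustration: max_retries is treated as the total number of attempts, run_count counts invocations rather than attempts, a permanent failure returns None instead of re-raising, and the injectable sleep parameter and the module-level monitor name are hypothetical.

```python
import functools
import threading
import time
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


@dataclass
class JobRecord:
    # Field names follow the commit message; the defaults are assumptions.
    name: str
    run_count: int = 0
    success_count: int = 0
    failure_count: int = 0
    last_run: Optional[str] = None
    last_success: Optional[str] = None
    last_failure: Optional[str] = None
    last_error: Optional[str] = None
    status: str = "idle"  # idle / running / success / failed


class JobMonitor:
    """In-process registry of JobRecord objects guarded by a lock."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self._records: Dict[str, JobRecord] = {}

    def register(self, name: str) -> JobRecord:
        # Idempotent: registering the same name twice returns the same record.
        with self.lock:
            return self._records.setdefault(name, JobRecord(name=name))

    def all_records(self) -> Dict[str, dict]:
        with self.lock:
            return {name: asdict(rec) for name, rec in self._records.items()}


monitor = JobMonitor()  # hypothetical module-level singleton


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def retryable(name: str, max_retries: int = 3, backoff_base: float = 2.0,
              sleep: Callable[[float], None] = time.sleep):
    """Decorator factory: retry with exponential backoff and record outcomes."""

    def decorator(fn):
        record = monitor.register(name)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with monitor.lock:
                record.run_count += 1
                record.last_run = _now()
                record.status = "running"
            for attempt in range(max_retries):
                try:
                    result = fn(*args, **kwargs)
                except Exception as exc:
                    with monitor.lock:
                        record.last_error = repr(exc)
                    # Wait backoff_base^attempt seconds, but not after the last attempt.
                    if attempt < max_retries - 1:
                        sleep(backoff_base ** attempt)
                else:
                    with monitor.lock:
                        record.success_count += 1
                        record.last_success = _now()
                        record.last_error = None  # cleared on any success
                        record.status = "success"
                    return result
            with monitor.lock:
                record.failure_count += 1
                record.last_failure = _now()
                record.status = "failed"
            return None  # assumption: swallow the error so the scheduler keeps running

        return wrapper

    return decorator
```

Passing sleep as a parameter is a testing convenience in this sketch: a test can substitute a list's append method to capture the wait sequence without actually sleeping.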
/claim #130
What this PR adds
Production-grade resilient background job retry & monitoring, fulfilling all acceptance criteria in issue #130.
Changes
packages/backend/app/services/job_retry.py (new)
- JobRecord: tracks run_count, success_count, failure_count, last_run, last_success, last_failure, last_error, status per job
- JobMonitor: thread-safe in-process singleton registry
- retryable(name, max_retries, backoff_base, app): decorator factory that adds exponential-backoff retry with JobMonitor recording
- _dispatch_due_reminders_job(app): sends all due unsent Reminder rows; returns {"sent": N, "failed": N}
- init_job_scheduler(app): registers reminder_dispatch on APScheduler (every 1 min) with retryable(max_retries=3); stores the scheduler at app.extensions['job_scheduler']

Retry behaviour: attempt 1 → wait backoff_base^0 s → attempt 2 → wait backoff_base^1 s → … → attempt N. On permanent failure: status="failed", last_error set. On any success: last_error cleared.

packages/backend/app/routes/jobs.py (new)
- GET /jobs/status: JWT required; returns all job records
- POST /jobs/trigger/<job_id>: JWT required; manually fires a job

Modified: packages/backend/app/__init__.py
Calls init_job_scheduler(app) after register_routes(app). Skipped when TESTING=true or the DISABLE_SCHEDULER env var is set.

Modified: packages/backend/app/routes/__init__.py
Registers the jobs blueprint at /jobs.

packages/backend/tests/test_job_retry.py (new, 20 tests)
- JobMonitor: register, idempotent, all_records, to_dict
- retryable: success, retry-on-failure, recover-on-2nd-attempt, permanent-failure, error-cleared-on-success, timestamps, run_count
- GET /jobs/status: auth gate, JSON shape, registered job visible, record key structure
- POST /jobs/trigger: auth gate, 503 when no scheduler

Verification
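For anyone who wants to exercise the two new endpoints by hand, the requests could be built along the following lines. This is an illustrative sketch and not the PR's own verification output. The helper names, the base URL, and the "Authorization: Bearer" header scheme are assumptions. The PR only states that a JWT is required, so the exact header format depends on how the app configures its JWT handling.

```python
import urllib.request
from urllib.parse import quote


def build_status_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /jobs/status request."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/jobs/status",
        headers={"Authorization": f"Bearer {token}"},  # assumed header scheme
        method="GET",
    )


def build_trigger_request(base_url: str, token: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /jobs/trigger/<job_id> request."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/jobs/trigger/{quote(job_id, safe='')}",
        headers={"Authorization": f"Bearer {token}"},  # assumed header scheme
        method="POST",
    )


# To actually send a request against a running backend (hypothetical URL):
#   req = build_status_request("http://localhost:5000", "<your JWT>")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode())
```

According to the test list above, a trigger request should come back as 503 when no scheduler is attached, which is the case when the app runs with TESTING=true or DISABLE_SCHEDULER set.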
🤖 Generated with Claude Code