
feat: resilient background job retry & monitoring #1072

Open
Entr0zy wants to merge 1 commit into rohitdash08:main from Entr0zy:feat/job-retry-monitoring-130


Entr0zy commented May 24, 2026

/claim #130

What this PR adds

Adds production-grade retry and monitoring for background jobs, fulfilling all acceptance criteria in issue #130.


Changes

packages/backend/app/services/job_retry.py (new)

| Component | Purpose |
| --- | --- |
| JobRecord | Dataclass tracking run_count, success_count, failure_count, last_run, last_success, last_failure, last_error, status per job |
| JobMonitor | Thread-safe in-process singleton registry; all jobs share one instance |
| retryable(name, max_retries, backoff_base, app) | Decorator factory: wraps any job function with exponential-backoff retry and automatic JobMonitor recording |
| _dispatch_due_reminders_job(app) | Sends all due unsent Reminder rows; returns {"sent": N, "failed": N} |
| init_job_scheduler(app) | Registers reminder_dispatch on APScheduler (every 1 min) with retryable(max_retries=3); stores scheduler at app.extensions['job_scheduler'] |
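
For readers who want a concrete picture of the record and registry described above, here is a minimal sketch. The field names come from the component table; the default values, the `to_dict` serialisation, and the module-level `monitor` instance are assumptions for illustration and may differ from what job_retry.py actually does.

```python
import threading
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    """Execution statistics for one named job."""

    name: str
    run_count: int = 0
    success_count: int = 0
    failure_count: int = 0
    last_run: Optional[datetime] = None
    last_success: Optional[datetime] = None
    last_failure: Optional[datetime] = None
    last_error: Optional[str] = None
    status: str = "idle"  # idle / running / success / failed

    def to_dict(self) -> dict:
        # Assumption: timestamps are serialised as ISO 8601 strings.
        data = asdict(self)
        for key in ("last_run", "last_success", "last_failure"):
            if data[key] is not None:
                data[key] = data[key].isoformat()
        return data


class JobMonitor:
    """Thread-safe in-process registry of JobRecord objects."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, JobRecord] = {}

    def register(self, name: str) -> JobRecord:
        # Idempotent: registering a name twice returns the same record.
        with self._lock:
            return self._records.setdefault(name, JobRecord(name=name))

    def all_records(self) -> List[dict]:
        with self._lock:
            return [record.to_dict() for record in self._records.values()]


# Assumption: the singleton is exposed as one module-level instance.
monitor = JobMonitor()
```

Holding a single lock around both registration and reads keeps the sketch simple; a real implementation might also guard updates to individual records.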

Retry behaviour: attempt 1 → wait backoff_base^0 s → attempt 2 → wait backoff_base^1 s → … → attempt N. On permanent failure: status="failed", last_error set. On any success: last_error cleared.
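
The retry schedule above can be sketched as a small decorator factory. This is an illustration under stated assumptions, not the code in this PR: it treats max_retries as the total number of attempts, counts run_count once per invocation, re-raises the last exception on permanent failure, and keeps a plain dict in place of the JobMonitor record. The `sleep` parameter is a hypothetical hook so the delays can be stubbed in tests, and the `app` argument is omitted.

```python
import functools
import time
from typing import Any, Callable, Dict


def retryable(name: str, max_retries: int = 3, backoff_base: float = 2.0,
              sleep: Callable[[float], None] = time.sleep):
    """Wrap a job so failures are retried with exponential backoff."""

    def decorator(func):
        # Stand-in for the JobMonitor record (assumption for this sketch).
        record: Dict[str, Any] = {
            "name": name, "run_count": 0, "status": "idle", "last_error": None,
        }
        attempts = max(1, max_retries)  # assumption: max_retries = total attempts

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record["run_count"] += 1
            record["status"] = "running"
            for attempt in range(1, attempts + 1):
                try:
                    result = func(*args, **kwargs)
                except Exception as exc:
                    record["last_error"] = str(exc)
                    if attempt == attempts:
                        record["status"] = "failed"
                        raise  # assumption: permanent failures propagate
                    # Wait backoff_base^0, backoff_base^1, ... seconds.
                    sleep(backoff_base ** (attempt - 1))
                else:
                    record["status"] = "success"
                    record["last_error"] = None  # cleared on any success
                    return result

        wrapper.record = record
        return wrapper

    return decorator
```

With backoff_base=2 and three attempts, a job that fails twice waits 1 s and then 2 s before the third attempt.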

packages/backend/app/routes/jobs.py (new)

| Endpoint | Description |
| --- | --- |
| GET /jobs/status | JWT required. Returns all registered job records (status, counts, timestamps, last_error) |
| POST /jobs/trigger/<job_id> | JWT required. Immediately fires a scheduled job by its APScheduler id; 503 if scheduler not running |
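
The logic behind the two endpoints might look roughly like the framework-free sketch below, where each function returns a (body, status code) pair. The JWT check and Flask routing are left out. The response shapes follow the sample output in the Verification section; the error bodies and the 404 for an unknown job id are assumptions. Rescheduling via `job.modify(next_run_time=...)` is a common APScheduler 3.x idiom for running a job immediately, though the PR may do this differently.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple


def job_status(records: List[dict]) -> Tuple[Dict[str, Any], int]:
    """Body for GET /jobs/status, given the monitor's serialised records."""
    return {"job_count": len(records), "jobs": records}, 200


def trigger_job(extensions: Dict[str, Any], job_id: str) -> Tuple[Dict[str, Any], int]:
    """Body for POST /jobs/trigger/<job_id>.

    `extensions` plays the role of app.extensions; the scheduler is
    expected under the 'job_scheduler' key, as described in the PR.
    """
    scheduler = extensions.get("job_scheduler")
    if scheduler is None or not scheduler.running:
        return {"error": "scheduler not running"}, 503

    job = scheduler.get_job(job_id)
    if job is None:
        # Assumption: the PR does not say how unknown ids are handled.
        return {"error": f"job '{job_id}' not found"}, 404

    # Move the next run time to "now" so the scheduler fires the job promptly.
    job.modify(next_run_time=datetime.now(timezone.utc))
    return {"message": f"job '{job_id}' triggered"}, 200
```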

Modified: packages/backend/app/__init__.py

Calls init_job_scheduler(app) after register_routes(app). Skipped when TESTING=true or DISABLE_SCHEDULER env var is set.
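
The skip condition can be sketched as a small predicate. This assumes both flags are read from environment variables and that any non-empty DISABLE_SCHEDULER value counts as "set"; the helper name `scheduler_enabled` is hypothetical, and the actual check in app/__init__.py may read app config instead.

```python
import os
from typing import Mapping


def scheduler_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return False when the scheduler should not be started."""
    # Skip under test runs started with TESTING=true.
    if environ.get("TESTING", "").lower() == "true":
        return False
    # Assumption: any non-empty value disables the scheduler.
    if environ.get("DISABLE_SCHEDULER"):
        return False
    return True
```

The app factory would then call init_job_scheduler(app) only when this returns True.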

Modified: packages/backend/app/routes/__init__.py

Registers jobs blueprint at /jobs.

packages/backend/tests/test_job_retry.py (new — 20 tests)

  • JobMonitor: register, idempotent, all_records, to_dict
  • retryable: success, retry-on-failure, recover-on-2nd-attempt, permanent-failure, error-cleared-on-success, timestamps, run_count
  • HTTP GET /jobs/status: auth gate, JSON shape, registered job visible, record key structure
  • HTTP POST /jobs/trigger: auth gate, 503 when no scheduler
  • Dispatch job: runs without error in fresh DB context

Verification

# Auth-gate tests (no Redis/DB required)
cd packages/backend
TESTING=true pytest tests/test_job_retry.py -v -k "requires_auth or no_scheduler"

# Full test suite
docker compose up -d
pytest tests/test_job_retry.py -v

# Check job health at runtime
curl -H "Authorization: Bearer $TOKEN" http://localhost:5000/jobs/status | jq .
# → {"job_count":1,"jobs":[{"name":"reminder_dispatch","status":"idle","run_count":0,...}]}

# Manually trigger the reminder dispatch job
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  http://localhost:5000/jobs/trigger/reminder_dispatch
# → {"message":"job 'reminder_dispatch' triggered"}

🤖 Generated with Claude Code

Adds production-grade retry logic and execution monitoring for all
background jobs, resolving issue rohitdash08#130.

New service (app/services/job_retry.py):
- JobRecord     — per-job dataclass tracking run_count, success_count,
                  failure_count, last_run, last_success, last_failure,
                  last_error, status (idle/running/success/failed)
- JobMonitor    — thread-safe in-process singleton registry
- retryable()   — decorator factory: wraps job fn with exponential-backoff
                  retry (configurable max_retries, backoff_base) and
                  automatically records every execution in the JobMonitor
- _dispatch_due_reminders_job(app) — sends all due unsent Reminder rows
- init_job_scheduler(app) — registers reminder_dispatch on APScheduler
  (every 1 min) wrapped with retryable(); stores scheduler at
  app.extensions['job_scheduler']

New routes (app/routes/jobs.py):
- GET  /jobs/status           — JWT required; returns all job records
- POST /jobs/trigger/<job_id> — JWT required; manually fires a job

app/__init__.py:
- Calls init_job_scheduler(app) on startup (skipped when TESTING=true
  or DISABLE_SCHEDULER is set)

Tests (tests/test_job_retry.py — 20 tests):
- JobMonitor: register/idempotent/all_records/to_dict
- retryable: success, retry-on-failure, recover-on-2nd-attempt,
  permanent-failure, error-cleared-on-success, timestamps, run_count
- HTTP: GET /jobs/status auth gate, JSON shape, registered job visible,
  record key structure
- HTTP: POST /jobs/trigger auth gate, 503 when no scheduler
- Dispatch job: runs without error in fresh DB context

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>