
feat: add automated Heroku data health equivalency checks #864

Open
PRAteek-singHWY wants to merge 3 commits into OWASP:main from PRAteek-singHWY:feat/149-auto-data-health-checks

Conversation

@PRAteek-singHWY
Contributor

Summary

Fixes #149.

This PR adds an automatic data health check for OpenCRE backups.

In simple terms: we now compare the latest Heroku backup against the last known-good Heroku backup.
If the two are not equivalent, the workflow fails.


Change Details + Verification

1) Canonical comparison for OpenCRE data

I added a canonical comparison layer for the core OpenCRE tables: cre, node, cre_links, and cre_node_links.
The comparison is ID-insensitive, so differences in randomly generated internal UUIDs do not cause false failures.
It also validates link integrity while building the snapshot.

Verification

  • Equivalent datasets with different internal IDs are treated as equal.
  • Real content changes are detected as mismatches.
  • Invalid references are rejected with clear errors.
  • Covered by tests in application/tests/data_health_test.py.
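The ID-insensitive comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the column names (`id`, `name`) and the use of `name` as the natural key are assumptions.

```python
import hashlib
import json

# Assumed internal-ID column names; these are dropped before hashing.
ID_COLUMNS = {"id", "cre_id", "node_id"}

def canonical_digest(cres, links):
    """Digest CRE rows and links while ignoring internal IDs.

    cres:  list of dicts, each with an internal "id" plus content columns.
    links: list of (cre_id, cre_id) pairs referencing internal IDs.
    Raises ValueError if a link references a missing row (link integrity).
    """
    by_id = {row["id"]: row for row in cres}

    def natural_key(internal_id):
        # Rewrite an internal UUID to a content-derived key.
        if internal_id not in by_id:
            raise ValueError(f"dangling link reference: {internal_id}")
        return by_id[internal_id]["name"]

    # Canonicalize rows: drop ID columns, serialize deterministically, sort.
    rows = sorted(
        json.dumps({k: v for k, v in row.items() if k not in ID_COLUMNS},
                   sort_keys=True)
        for row in cres
    )
    # Canonicalize links: express edges via natural keys, sort.
    edges = sorted(
        json.dumps([natural_key(a), natural_key(b)]) for a, b in links
    )
    blob = "\n".join(rows + edges).encode()
    return hashlib.sha256(blob).hexdigest()
```

Two snapshots that differ only in internal UUIDs produce the same digest, while any content or link change alters it, and a link to a missing row fails loudly.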

2) Script to compare two PostgreSQL datasets

I added scripts/check_data_health.py to compare two PostgreSQL DBs directly.
The script prints row counts and digests for both sides, and gives a small diff summary if they differ.
Exit behavior is CI-friendly: 0 when equal, 1 when not equal.

Verification

  • Returns success when datasets are equivalent.
  • Returns failure when datasets are not equivalent.
  • Diff output includes missing/extra sample rows per table.
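The CI-friendly exit-code contract can be illustrated with a minimal sketch. The function name and the per-table (row_count, digest) summary shape are hypothetical; the real script builds its summaries from live PostgreSQL databases.

```python
def compare_summaries(known_good, latest):
    """Compare per-table summaries; return 0 when equal, 1 when not.

    known_good / latest: dicts mapping table name -> (row_count, digest).
    Prints both sides for every table and a short mismatch summary.
    """
    mismatched = []
    for table in sorted(set(known_good) | set(latest)):
        kg = known_good.get(table)
        cur = latest.get(table)
        print(f"{table}: known-good={kg} latest={cur}")
        if kg != cur:
            mismatched.append(table)
    if mismatched:
        print(f"MISMATCH in tables: {', '.join(mismatched)}")
        return 1
    print("Datasets are equivalent.")
    return 0
```

A CI job would pass the return value to `sys.exit()`, so any mismatch fails the step with a non-zero status.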

3) Automated GitHub Action for data health checks

I added .github/workflows/data-health-check.yml (scheduled + manual trigger).
The workflow downloads the known-good backup artifact, captures/downloads the latest Heroku backup, restores both in local Postgres service DBs, and runs the comparison script.
Any mismatch fails the workflow.

Verification

  • Workflow path is implemented as: download -> restore -> compare.
  • Compare step is fail-fast by design (non-zero exit on mismatch).
  • Full upstream end-to-end run is pending maintainer-side setup listed below.
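The download -> restore -> compare flow might look roughly like this. This is a sketch only: the step names, artifact name, Heroku app name, and restore commands are illustrative, not the committed workflow file.

```yaml
# Sketch of .github/workflows/data-health-check.yml (names are illustrative).
name: Data Health Check
on:
  schedule:
    - cron: "0 3 * * *"   # daily scheduled run
  workflow_dispatch:       # manual trigger for maintainers

jobs:
  data-health:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - name: Download known-good backup artifact
        uses: actions/download-artifact@v4
        with:
          name: opencreorg_db_backup
      - name: Download latest Heroku backup
        env:
          HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
        run: heroku pg:backups:download --app <app-name>  # app name is a placeholder
      - name: Restore both backups and compare
        run: |
          pg_restore -d known_good known_good.dump
          pg_restore -d latest latest.dump
          python scripts/check_data_health.py known_good latest  # exits 1 on mismatch
```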

4) Alignment with issue constraints

The issue guidance requires that the known-good dataset be a Heroku backup (not a local SQLite file).
This implementation follows that requirement end-to-end in the PostgreSQL backup flow.

Verification

  • Known-good source: Heroku backup artifact.
  • Current source: latest Heroku backup.
  • Comparison target: PostgreSQL databases restored in CI.

Important Note for Maintainers (Required Before Merge)

I could not complete upstream end-to-end execution from my side due to environment/access limits.
To fully validate and merge safely, these maintainer-side steps are required:

  1. Re-enable the Backup workflow (currently disabled due to inactivity).
  2. Confirm HEROKU_API_KEY is available in Heroku-DB-Backup environment secrets.
  3. Run Backup on main once so a fresh opencreorg_db_backup artifact exists.
  4. Run Data Health Check manually (workflow_dispatch) and confirm expected behavior.
  5. If the run is green and logs look correct, this PR is ready to merge.

@PRAteek-singHWY
Contributor Author

Hi @Pa04rth @northdpole , quick update for #149:

  • Implemented automated Heroku backup equivalency check (data-health-check.yml + compare script + tests).
  • Added follow-up cleanup in commit 5589391 to remove password-like literals flagged by GitGuardian.
  • Current PR checks are green (Lint, Test).

Before merge, please run maintainer-side validation: re-enable Backup, ensure HEROKU_API_KEY, run Backup once on main, then trigger Data Health Check.


