
feat: add automated Heroku data health equivalency checks #864

Open
PRAteek-singHWY wants to merge 3 commits into OWASP:main from PRAteek-singHWY:feat/149-auto-data-health-checks

Conversation

@PRAteek-singHWY
Contributor

Summary

Fixes #149.

This PR adds an automatic data health check for OpenCRE backups.

In simple terms: we now compare the latest Heroku backup against the last known-good Heroku backup.
If the two are not equivalent, the workflow fails.


Change Details + Verification

1) Canonical comparison for OpenCRE data

I added a canonical comparison layer for the core OpenCRE tables: cre, node, cre_links, and cre_node_links.
The comparison is ID-insensitive, so differences in randomly generated internal UUIDs do not cause false failures.
It also validates link integrity while building the snapshot.

Verification

  • Equivalent datasets with different internal IDs are treated as equal.
  • Real content changes are detected as mismatches.
  • Invalid references are rejected with clear errors.
  • Covered by tests in application/tests/data_health_test.py.
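The ID-insensitive comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the column names (`id`, `name`) and the use of `name` as the natural key are assumptions.

```python
import hashlib
import json

# Assumed internal-ID column names; these are dropped before hashing.
ID_COLUMNS = {"id", "cre_id", "node_id"}

def canonical_digest(cres, links):
    """Digest CRE rows and links while ignoring internal IDs.

    cres:  list of dicts, each with an internal "id" plus content columns.
    links: list of (cre_id, cre_id) pairs referencing internal IDs.
    Raises ValueError if a link references a missing row (link integrity).
    """
    by_id = {row["id"]: row for row in cres}

    def natural_key(internal_id):
        # Rewrite an internal UUID to a content-derived key.
        if internal_id not in by_id:
            raise ValueError(f"dangling link reference: {internal_id}")
        return by_id[internal_id]["name"]

    # Canonicalize rows: drop ID columns, serialize deterministically, sort.
    rows = sorted(
        json.dumps({k: v for k, v in row.items() if k not in ID_COLUMNS},
                   sort_keys=True)
        for row in cres
    )
    # Canonicalize links: express edges via natural keys, sort.
    edges = sorted(
        json.dumps([natural_key(a), natural_key(b)]) for a, b in links
    )
    blob = "\n".join(rows + edges).encode()
    return hashlib.sha256(blob).hexdigest()
```

Two snapshots that differ only in internal UUIDs produce the same digest, while any content or link change alters it, and a link to a missing row fails loudly.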

2) Script to compare two PostgreSQL datasets

I added scripts/check_data_health.py to compare two PostgreSQL DBs directly.
The script prints row counts and digests for both sides, and gives a small diff summary if they differ.
Exit behavior is CI-friendly: 0 when equal, 1 when not equal.

Verification

  • Returns success when datasets are equivalent.
  • Returns failure when datasets are not equivalent.
  • Diff output includes missing/extra sample rows per table.
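The CI-friendly exit-code contract can be illustrated with a minimal sketch. The function name and the per-table (row_count, digest) summary shape are hypothetical; the real script builds its summaries from live PostgreSQL databases.

```python
def compare_summaries(known_good, latest):
    """Compare per-table summaries; return 0 when equal, 1 when not.

    known_good / latest: dicts mapping table name -> (row_count, digest).
    Prints both sides for every table and a short mismatch summary.
    """
    mismatched = []
    for table in sorted(set(known_good) | set(latest)):
        kg = known_good.get(table)
        cur = latest.get(table)
        print(f"{table}: known-good={kg} latest={cur}")
        if kg != cur:
            mismatched.append(table)
    if mismatched:
        print(f"MISMATCH in tables: {', '.join(mismatched)}")
        return 1
    print("Datasets are equivalent.")
    return 0
```

A CI job would pass the return value to `sys.exit()`, so any mismatch fails the step with a non-zero status.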

3) Automated GitHub Action for data health checks

I added .github/workflows/data-health-check.yml (scheduled + manual trigger).
The workflow downloads the known-good backup artifact, captures/downloads the latest Heroku backup, restores both in local Postgres service DBs, and runs the comparison script.
Any mismatch fails the workflow.

Verification

  • Workflow path is implemented as: download -> restore -> compare.
  • Compare step is fail-fast by design (non-zero exit on mismatch).
  • Full upstream end-to-end run is pending maintainer-side setup listed below.
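The download -> restore -> compare flow might look roughly like this. This is a sketch only: the step names, artifact name, Heroku app name, and restore commands are illustrative, not the committed workflow file.

```yaml
# Sketch of .github/workflows/data-health-check.yml (names are illustrative).
name: Data Health Check
on:
  schedule:
    - cron: "0 3 * * *"   # daily scheduled run
  workflow_dispatch:       # manual trigger for maintainers

jobs:
  data-health:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - name: Download known-good backup artifact
        uses: actions/download-artifact@v4
        with:
          name: opencreorg_db_backup
      - name: Download latest Heroku backup
        env:
          HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }}
        run: heroku pg:backups:download --app <app-name>  # app name is a placeholder
      - name: Restore both backups and compare
        run: |
          pg_restore -d known_good known_good.dump
          pg_restore -d latest latest.dump
          python scripts/check_data_health.py known_good latest  # exits 1 on mismatch
```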

4) Alignment with issue constraints

The issue guidance requires that the known-good dataset be a Heroku backup (not a local SQLite file).
This implementation follows that requirement end-to-end in the PostgreSQL backup flow.

Verification

  • Known-good source: Heroku backup artifact.
  • Current source: latest Heroku backup.
  • Comparison target: PostgreSQL databases restored in CI.

Important Note for Maintainers (Required Before Merge)

I could not complete upstream end-to-end execution from my side due to environment/access limits.
To fully validate and merge safely, these maintainer-side steps are required:

  1. Re-enable the Backup workflow (currently disabled due to inactivity).
  2. Confirm HEROKU_API_KEY is available in Heroku-DB-Backup environment secrets.
  3. Run Backup on main once so a fresh opencreorg_db_backup artifact exists.
  4. Run Data Health Check manually (workflow_dispatch) and confirm expected behavior.
  5. If the run is green and logs look correct, this PR is ready to merge.

@PRAteek-singHWY
Contributor Author

Hi @Pa04rth @northdpole , quick update for #149:

  • Implemented automated Heroku backup equivalency check (data-health-check.yml + compare script + tests).
  • Added follow-up cleanup in commit 5589391 to remove password-like literals flagged by GitGuardian.
  • Current PR checks are green (Lint, Test).

Before merge, please run maintainer-side validation: re-enable Backup, ensure HEROKU_API_KEY, run Backup once on main, then trigger Data Health Check.


