make --upstream_sync sync gap analysis progressively (#534)#755

Open
PRAteek-singHWY wants to merge 7 commits into OWASP:main from
PRAteek-singHWY:issue-534-upstream-sync-gap-analysis-progressive
Contributor

PRAteek-singHWY commented Feb 22, 2026

Fixes #534

Summary

This PR extends make upstream-sync so it can progressively backfill gap-analysis results from upstream into the local cache, instead of requiring a large standalone local dataset up front.

After syncing the core CRE graph, it:

  • fetches upstream standards
  • requests upstream /map_analysis results per standard pair
  • stores only successful payloads that already contain a result field
  • optionally syncs weak-link subresults when extra > 0
  • skips pairs already cached locally

This keeps the current on-demand local fallback behavior intact for pairs that are not prefilled.
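The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name, the injected fetch callables, the in-memory dict caches, the use of ordered pairs, and the assumed shape of result (a mapping from key to an entry with an extra count) are all hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, Optional, Tuple

Pair = Tuple[str, str]
WeakKey = Tuple[str, str, str]


def backfill_gap_analysis(
    standards: Iterable[str],
    fetch_map_analysis: Callable[[str, str], Optional[dict]],
    fetch_weak_links: Callable[[str, str, str], Optional[dict]],
    cache: Dict[Pair, dict],
    weak_cache: Dict[WeakKey, dict],
) -> int:
    """Cache completed upstream results for uncached pairs; return how many were stored."""
    stored = 0
    # Assumption: pairs are ordered, so (A, B) and (B, A) are distinct entries.
    for a, b in permutations(sorted(set(standards)), 2):
        if (a, b) in cache:
            continue  # already cached locally
        payload = fetch_map_analysis(a, b)
        if not isinstance(payload, dict) or "result" not in payload:
            continue  # e.g. only job_id came back; leave it to the on-demand path
        cache[(a, b)] = payload
        stored += 1
        result = payload["result"]
        if not isinstance(result, dict):
            continue
        # Assumed shape: result maps a key to an entry carrying an "extra" count.
        for key, entry in result.items():
            if isinstance(entry, dict) and entry.get("extra", 0) > 0:
                weak = fetch_weak_links(a, b, key)
                if isinstance(weak, dict) and "result" in weak:
                    weak_cache[(a, b, key)] = weak
    return stored
```

Passing the fetchers in as callables keeps the sketch free of network code and makes the skip rules easy to exercise with fakes.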

Why

Issue #534 highlights that local gap-analysis data is too large and impractical to use reliably in normal local development.

The goal here is to make local gap analysis progressively usable after upstream sync by caching only the map-analysis data that is actually available from upstream, instead of expecting a monolithic full download first.

Behavior notes

  • only payloads that already contain a result field are stored locally
  • if upstream returns only a job_id, that pair is skipped and remains available through the existing on-demand local path
  • already cached gap-analysis pairs are skipped
  • weak-link data is only requested for entries where extra > 0
  • progressive sync is bounded by default to avoid unexpectedly large request volume during normal upstream sync
  • CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS=0 still allows full sync when explicitly desired
  • the pair limit now applies to attempted missing pairs, which makes request volume more predictable even when upstream responses are incomplete
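One way to read the cap semantics described above is sketched below. Only the environment variable name comes from this PR. The helper names, the default value of 25, and the rejection of negative values are assumptions made for illustration.

```python
import os
from itertools import islice
from typing import Iterable, List, Mapping, Optional, Set, Tuple

Pair = Tuple[str, str]
ENV_VAR = "CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS"
HYPOTHETICAL_DEFAULT_CAP = 25  # placeholder; the real default is not stated here


def read_pair_cap(
    environ: Mapping[str, str] = os.environ,
    default: int = HYPOTHETICAL_DEFAULT_CAP,
) -> Optional[int]:
    """Return the pair cap, or None when 0 requests an unbounded full sync."""
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        return default
    value = int(raw)
    if value < 0:
        # Assumption: negative values are treated as a configuration error.
        raise ValueError(f"{ENV_VAR} must be >= 0, got {value}")
    return None if value == 0 else value


def select_pairs_to_attempt(
    all_pairs: Iterable[Pair], cached: Set[Pair], cap: Optional[int]
) -> List[Pair]:
    """Apply the cap to attempted missing pairs, not to successfully stored ones."""
    missing = (pair for pair in all_pairs if pair not in cached)
    return list(missing) if cap is None else list(islice(missing, cap))
```

Because the cap is applied before any request is made, the number of upstream calls per sync stays bounded even if many responses come back with only a job_id.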

Validation

./venv/bin/python -m pytest -q application/tests/cre_main_test.py -k download_graph_from_upstream
./venv/bin/python -m pytest -q application/tests/gap_analysis_db_test.py
./venv/bin/python -m pytest -q application/tests/cre_main_test.py

Screenshots

Example used in the screenshots below:
SAMM compared against ASVS

Before: On-Demand Gap Analysis

Before progressive upstream backfill, requesting SAMM vs ASVS returns a job_id, meaning the result is not yet prefetched locally and falls back to background/on-demand computation.
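The distinction between a pending job_id response and a completed result response could be expressed as a small classifier like the one below. This is a sketch under assumptions: the function name and the returned labels are hypothetical, and the example payload values are made up.

```python
from typing import Any


def classify_map_analysis_payload(payload: Any) -> str:
    """Label a /map_analysis response as 'cache', 'pending', or 'skip'."""
    if isinstance(payload, dict) and "result" in payload:
        return "cache"  # completed result: safe to store locally
    if isinstance(payload, dict) and "job_id" in payload:
        return "pending"  # still computing: fall back to the on-demand path
    return "skip"  # error, empty, or unexpected shape


# Illustrative payloads only; real responses may carry more fields.
print(classify_map_analysis_payload({"job_id": "1234"}))
print(classify_map_analysis_payload({"result": {}}))
```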

image

Before UI: Waiting For On-Demand Gap Analysis

This screenshot shows the SAMM vs ASVS map-analysis view before progressive upstream backfill is available locally. The analysis has been requested, but the result is still pending and the UI remains in its loading state.

image

After UI: Cached Local Gap Analysis

After sync, the same SAMM vs ASVS pair is available directly in the Map Analysis UI from the local cache.

image

Notes For Maintainers

This PR is intentionally focused on progressive cache backfill during upstream sync.

It does not replace the existing local on-demand computation path, and it does not assume that every upstream pair is immediately available.

A few implementation choices are intentional here:

  • progressive backfill is integrated into upstream-sync because #534 specifically asks for opportunistic upstream loading during sync
  • only completed upstream result payloads are cached locally; incomplete upstream responses are skipped
  • the sync is bounded by default to avoid unexpectedly heavy request volume during normal contributor workflows
  • CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS=0 still allows full sync when explicitly desired
  • the default cap is intended as a conservative operational guardrail and can be adjusted if a different default is preferred

Contributor Author

PRAteek-singHWY commented Feb 22, 2026

Hi @Pa04rth, @northdpole,

This PR follows the progressive upstream-sync direction discussed in #534, with the goal of making local gap analysis more practical without requiring a large standalone dataset up front.

It keeps the existing core graph sync flow, then opportunistically backfills available upstream map-analysis results and weak-link subresults into the local cache, while skipping already-cached pairs and preserving the current on-demand fallback behavior for anything not prefetched.

I also tightened the behavior and validation around this flow to keep the sync bounded and more predictable during normal local use.

If this direction needs to be adjusted to better match the intended long-term approach for upstream sync and gap analysis, I’d be happy to refine it further based on your guidance. Tagging @Pa04rth as well for visibility.

PRAteek-singHWY force-pushed the issue-534-upstream-sync-gap-analysis-progressive branch from 2351c6c to e871da8 on March 5, 2026 01:37
Development

Successfully merging this pull request may close these issues.

make --upstream_sync also sync the gap analysis graph progressively