Skip to content

feat: add negative data cache for repeatedly not-found data IDs (PE-8985)#630

Merged
djwhitt merged 12 commits intodevelopfrom
PE-8985-negative-data-cache
Mar 8, 2026
Merged

feat: add negative data cache for repeatedly not-found data IDs (PE-8985)#630
djwhitt merged 12 commits intodevelopfrom
PE-8985-negative-data-cache

Conversation

@djwhitt
Copy link
Copy Markdown
Collaborator

@djwhitt djwhitt commented Mar 6, 2026

Summary

  • Adds a two-phase negative data cache that detects data IDs consistently missing over configurable count and duration thresholds, then short-circuits future requests with 404 responses
  • Reduces unnecessary upstream load (peers, gateways, chunk reconstruction) from repeated requests for non-existent data
  • Integrates into both raw and standard data handlers with check/record-miss/evict lifecycle
  • Adds 6 Prometheus metrics (4 counters + 2 gauges) for observability
  • Configurable via 5 new env vars (NEGATIVE_CACHE_*), enabled by default

Test plan

  • yarn test:file src/data/negative-data-cache.test.ts — 10 unit tests pass
  • yarn test — full suite passes (1486 pass, 0 fail)
  • yarn build — compiles cleanly
  • yarn lint:check — no issues
  • yarn duplicate:check — no new duplicates

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a NegativeDataCache feature: new NEGATIVE_CACHE_* env configs, header constant, Prometheus metrics, a NegativeDataCache class (miss-tracker + negative-cache with LRU, TTL, promotion/eviction), unit tests, and wiring into system and data handlers to short‑circuit repeated misses and evict on success.

Changes

Cohort / File(s) Summary
Docs & Config
docs/envs.md, src/config.ts, src/constants.ts
Added NEGATIVE_CACHE_* environment variables and defaults; added header name X-AR-IO-Negative-Cache.
Negative Cache Implementation
src/data/negative-data-cache.ts, src/data/negative-data-cache.test.ts
New NegativeDataCache class: miss-tracker + negative-cache (LRU, TTL), promotion/eviction, metrics, logging; comprehensive unit tests.
Metrics
src/metrics.ts
Added metrics: negative_cache_hits_total, negative_cache_misses_total, negative_cache_promotions_total, negative_cache_evictions_total, negative_cache_size, miss_tracker_size.
Handler integration
src/routes/data/handlers.ts, src/routes/data/index.ts
Threaded optional negativeDataCache into handlers; short-circuit 404 on negative-cache hit, call recordMiss on failure paths, and evict on successful retrievals.
System wiring
src/system.ts
Instantiated and exported negativeDataCache using NEGATIVE_CACHE_* config values and logger.

Sequence Diagram

sequenceDiagram
    actor Client
    participant DataHandler as Data Handler
    participant NegCache as NegativeDataCache
    participant MissTracker as Miss Tracker (LRU)
    participant NegCacheStore as Negative Cache (LRU)
    participant Metrics as Metrics Registry

    Client->>DataHandler: Request data (ID)
    DataHandler->>NegCache: isNegativelyCached(ID)?
    NegCache->>NegCacheStore: Check entry
    alt Cache Hit
        NegCacheStore-->>NegCache: true
        NegCache->>Metrics: Increment hits_total
        NegCache-->>DataHandler: true
        DataHandler-->>Client: 404 + X-AR-IO-Negative-Cache
    else Cache Miss
        NegCacheStore-->>NegCache: false
        NegCache->>Metrics: Increment misses_total
        NegCache-->>DataHandler: false
        DataHandler->>DataHandler: Attempt retrieval
        alt Retrieval Fails
            DataHandler->>NegCache: recordMiss(ID)
            NegCache->>MissTracker: Update/create miss entry
            alt Promotion thresholds met
                MissTracker-->>NegCache: promote
                NegCache->>NegCacheStore: Add to negative cache
                NegCache->>Metrics: Increment promotions_total
            end
            DataHandler-->>Client: 404
        else Retrieval Succeeds
            DataHandler->>NegCache: evict(ID)
            NegCache->>NegCacheStore: Remove if present
            NegCache->>Metrics: Increment evictions_total
            DataHandler-->>Client: 200 + data
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • hlolli
  • dtfiedler
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: adding a negative data cache feature for handling repeatedly not-found data IDs with a ticket reference.
Description check ✅ Passed The description is well-related to the changeset, covering the implementation details, integration points, observability additions, and comprehensive testing performed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch PE-8985-negative-data-cache

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
src/system.ts (1)

224-232: Add TSDoc for the shared negative-cache singleton.

This export is now part of the app-wide system surface, so it should document that it is process-wide and shared across handlers.

As per coding guidelines, "Add or improve TSDoc comments when modifying code to enhance documentation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/system.ts` around lines 224 - 232, Add a TSDoc block above the exported
negativeDataCache constant (the NegativeDataCache singleton) that clearly states
this instance is process-wide, shared across all request handlers/modules, and
that its behavior is controlled by the listed config options
(NEGATIVE_CACHE_ENABLED, NEGATIVE_CACHE_MAX_SIZE, NEGATIVE_CACHE_TTL_MS,
NEGATIVE_CACHE_MISS_COUNT_THRESHOLD, NEGATIVE_CACHE_MISS_THRESHOLD_MS); include
brief notes on its purpose (caching negative lookup results), intended lifetime,
and any concurrency/side-effect considerations so callers understand that this
is a global singleton used by the application.
src/data/negative-data-cache.test.ts (1)

155-191: Avoid asserting on prom-client internals.

These checks reach into Counter.hashMap, which is not a public API. Prefer the public metric getters/registry so this test does not break on a prom-client upgrade.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.test.ts` around lines 155 - 191, The test is
inspecting prom-client internals (Counter.hashMap) which is fragile; instead
retrieve metric values via the prom-client public API (register.getSingleMetric
or the metric's .get() method) for the counters used in this file
(metrics.negativeCacheHitsTotal, metrics.negativeCacheMissesTotal,
metrics.negativeCachePromotionsTotal, metrics.negativeCacheEvictionsTotal).
Update the assertions to call the public getter (e.g.,
register.getSingleMetric('negative_cache_hits_total')?.get() or
metrics.negativeCacheHitsTotal.get()) and extract the value for the no-label
series (or matching label set) to compute deltas after calling createCache,
cache.isNegativelyCached, cache.recordMiss, and cache.evict, rather than
reaching into .hashMap.
src/data/negative-data-cache.ts (1)

18-116: Add TSDoc for the new public cache API.

This introduces a reusable class with a non-trivial constructor contract and three public entry points, but none of the promotion/eviction semantics are documented. A short TSDoc block on the class plus isNegativelyCached, recordMiss, and evict would make the threshold/TTL behavior much easier to consume correctly. As per coding guidelines, **/*.{ts,tsx,js}: Add or improve TSDoc comments when modifying code to enhance documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.ts` around lines 18 - 116, Add TSDoc to the
NegativeDataCache class and its public API (constructor, isNegativelyCached,
recordMiss, evict) describing the constructor contract (parameters: enabled,
maxSize, ttlMs, missCountThreshold, missDurationMs, optional now) and the
semantics: isNegativelyCached returns whether an id is in the negative cache
(and increments metrics), recordMiss updates the missTracker
(firstSeenAt/lastSeenAt/count), promotes an id into negativeCache when count >=
missCountThreshold AND now - firstSeenAt >= missDurationMs (then increments
promotion metric and logs), and evict removes id from negativeCache and
missTracker (increments eviction metric and logs); also mention LRU cache
behavior, TTL for negativeCache, and that methods are no-ops when enabled is
false where applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/data/negative-data-cache.ts`:
- Around line 60-71: In isNegativelyCached(id: string) replace
negativeCache.has(id) with a negativeCache.get(id) lookup so access updates
recency and prunes stale entries; treat a truthy get result as a hit (increment
metrics.negativeCacheHitsTotal and return true) and for misses increment
metrics.negativeCacheMissesTotal and return false; after any get that may have
removed stale entries, call updateGauges() only when negativeCache.size has
actually changed (compare before/after) to keep the negativeCacheSize gauge in
sync.

In `@src/routes/data/handlers.ts`:
- Line 1421: When evicting the original id you must also evict the
manifest-resolved target to avoid stale 404s: after calling
negativeDataCache?.evict(id) add logic to check for the manifest-resolved id
(the value used in sendManifestResponse(), e.g. resolvedId or however the
resolved target is named in the handler) and call
negativeDataCache?.evict(resolvedId) as well (guarding for undefined). Update
the code path that serves manifests (sendManifestResponse) to ensure the
resolved identifier used there is the same symbol you evict so both the request
id and its resolved manifest id are removed from the negative cache.
- Around line 849-850: The handler is learning misses for any exception from
getDataAttributes()/getData() by unconditionally calling
negativeDataCache.recordMiss(id) and setting span http.status_code to 404;
change this so you only treat authoritative "not found" responses as
negative-cacheable: in the catch blocks around getDataAttributes() and getData()
check the error type or status (e.g., an explicit NotFoundError, a 404 HTTP
error, or the repository layer's sentinel for missing entities) and only then
call negativeDataCache.recordMiss(id) and span.setAttribute('http.status_code',
404); for all other exceptions (transient DB errors, timeouts, 5xxs) rethrow or
handle without recording a miss. Ensure the same conditional logic is applied to
the other similar catch sites that currently call recordMiss(id).

---

Nitpick comments:
In `@src/data/negative-data-cache.test.ts`:
- Around line 155-191: The test is inspecting prom-client internals
(Counter.hashMap) which is fragile; instead retrieve metric values via the
prom-client public API (register.getSingleMetric or the metric's .get() method)
for the counters used in this file (metrics.negativeCacheHitsTotal,
metrics.negativeCacheMissesTotal, metrics.negativeCachePromotionsTotal,
metrics.negativeCacheEvictionsTotal). Update the assertions to call the public
getter (e.g., register.getSingleMetric('negative_cache_hits_total')?.get() or
metrics.negativeCacheHitsTotal.get()) and extract the value for the no-label
series (or matching label set) to compute deltas after calling createCache,
cache.isNegativelyCached, cache.recordMiss, and cache.evict, rather than
reaching into .hashMap.

In `@src/data/negative-data-cache.ts`:
- Around line 18-116: Add TSDoc to the NegativeDataCache class and its public
API (constructor, isNegativelyCached, recordMiss, evict) describing the
constructor contract (parameters: enabled, maxSize, ttlMs, missCountThreshold,
missDurationMs, optional now) and the semantics: isNegativelyCached returns
whether an id is in the negative cache (and increments metrics), recordMiss
updates the missTracker (firstSeenAt/lastSeenAt/count), promotes an id into
negativeCache when count >= missCountThreshold AND now - firstSeenAt >=
missDurationMs (then increments promotion metric and logs), and evict removes id
from negativeCache and missTracker (increments eviction metric and logs); also
mention LRU cache behavior, TTL for negativeCache, and that methods are no-ops
when enabled is false where applicable.

In `@src/system.ts`:
- Around line 224-232: Add a TSDoc block above the exported negativeDataCache
constant (the NegativeDataCache singleton) that clearly states this instance is
process-wide, shared across all request handlers/modules, and that its behavior
is controlled by the listed config options (NEGATIVE_CACHE_ENABLED,
NEGATIVE_CACHE_MAX_SIZE, NEGATIVE_CACHE_TTL_MS,
NEGATIVE_CACHE_MISS_COUNT_THRESHOLD, NEGATIVE_CACHE_MISS_THRESHOLD_MS); include
brief notes on its purpose (caching negative lookup results), intended lifetime,
and any concurrency/side-effect considerations so callers understand that this
is a global singleton used by the application.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 670458fc-cd4b-4077-95d4-94249e7b30f8

📥 Commits

Reviewing files that changed from the base of the PR and between 0197e8e and 248a8d3.

📒 Files selected for processing (9)
  • docs/envs.md
  • src/config.ts
  • src/constants.ts
  • src/data/negative-data-cache.test.ts
  • src/data/negative-data-cache.ts
  • src/metrics.ts
  • src/routes/data/handlers.ts
  • src/routes/data/index.ts
  • src/system.ts

Comment thread src/data/negative-data-cache.ts
Comment thread src/routes/data/handlers.ts Outdated
Comment thread src/routes/data/handlers.ts
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 99.04153% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.82%. Comparing base (047d413) to head (080cb4a).
⚠️ Report is 12 commits behind head on develop.

Files with missing lines Patch % Lines
src/data/negative-data-cache.ts 98.61% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #630      +/-   ##
===========================================
+ Coverage    76.63%   76.82%   +0.19%     
===========================================
  Files          104      105       +1     
  Lines        35572    35885     +313     
  Branches      2587     2618      +31     
===========================================
+ Hits         27259    27569     +310     
- Misses        8278     8281       +3     
  Partials        35       35              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a negative data cache to reduce repeated upstream work for consistently missing data IDs by short-circuiting future requests with 404s, with new configuration and Prometheus observability.

Changes:

  • Introduces NegativeDataCache with miss-tracking → promotion → TTL-based negative caching.
  • Integrates negative-cache check / miss recording / eviction into raw and standard data handlers.
  • Adds new env vars, response header, and Prometheus counters/gauges for the feature.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/system.ts Instantiates and exports the NegativeDataCache singleton from config.
src/routes/data/index.ts Wires negativeDataCache into handler factories.
src/routes/data/handlers.ts Uses negative cache to short-circuit 404s, record misses, and evict on success.
src/metrics.ts Adds negative-cache counters/gauges.
src/data/negative-data-cache.ts Implements two-phase miss tracker + negative cache with TTL and metrics/logging.
src/data/negative-data-cache.test.ts Adds unit tests for promotion, eviction, TTL, LRU behavior, and metrics.
src/constants.ts Adds X-AR-IO-Negative-Cache header name constant.
src/config.ts Adds NEGATIVE_CACHE_* env var configuration.
docs/envs.md Documents new negative-cache env vars.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/routes/data/handlers.ts Outdated
message: error.message,
stack: error.stack,
});
negativeDataCache?.recordMiss(id);
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negativeDataCache.recordMiss(id) runs on any exception from getDataAttributes(). Since CompositeDataAttributesSource can throw on upstream errors (not just “not found”), this can poison the negative cache and return persistent 404s for IDs that do exist. Only record a miss when the error is a definite not-found signal (e.g., 404/status code or a dedicated error type).

Suggested change
negativeDataCache?.recordMiss(id);
const isNotFoundError =
error?.status === 404 ||
error?.statusCode === 404 ||
error?.code === 'NOT_FOUND';
if (isNotFoundError) {
negativeDataCache?.recordMiss(id);
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in the current code — both getDataAttributes() catch blocks explicitly avoid calling recordMiss() with the comment: "Don't record a negative cache miss here — getDataAttributes throwing indicates a transient infrastructure error, not that the data doesn't exist." (lines 902-904 and 1386-1388).

Comment thread src/routes/data/handlers.ts Outdated
message: error.message,
stack: error.stack,
});
negativeDataCache?.recordMiss(id);
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negativeDataCache.recordMiss(id) is called for any getData() exception (excluding AbortError). If getData() fails transiently (network, circuit breaker, peer outages), this will incorrectly negative-cache the ID and short-circuit subsequent requests with 404. Limit recordMiss to confirmed “not found” failures (e.g., a typed not-found error or explicit 404 response) and don’t negative-cache generic retrieval failures.

Suggested change
negativeDataCache?.recordMiss(id);
if (
error?.status === 404 ||
error?.statusCode === 404 ||
error?.name === 'NotFoundError' ||
error?.code === 'DATA_NOT_FOUND'
) {
negativeDataCache?.recordMiss(id);
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not changing this — same rationale as the earlier CodeRabbit discussion. The promotion thresholds (10 misses over 4 hours) provide sufficient safety margin against sporadic transient errors. Fast re-promotion after eviction uses short initial TTLs with exponential backoff, and evict-on-success self-heals when data becomes available. Adding error-type discrimination would require threading error semantics through the composite data source layers for marginal benefit.

this.updateGauges();
}

private updateGauges(): void {
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateGauges() reports cache sizes via LRUCache.size, but expired TTL entries may still be counted until the cache is accessed/purged (lru-cache only evicts stale entries opportunistically). This can make negativeCacheSize/missTrackerSize misleading. Consider purging stale entries before reading size (e.g., purgeStale()) or enabling an approach that keeps the gauge accurate with TTL-based eviction.

Suggested change
private updateGauges(): void {
private updateGauges(): void {
// Ensure TTL-expired entries are removed before reporting sizes
this.negativeCache.purgeStale();
this.missTracker.purgeStale();

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not changing this. With ttlResolution: 0, staleness is checked on every get() call, so expired entries are pruned on access. The gauge will be slightly overstated between accesses but purgeStale() has a meaningful performance cost (iterates the full cache) and updateGauges() is called on every recordMiss() and evict(). The marginal metric accuracy improvement doesn't justify the overhead.

Comment thread src/data/negative-data-cache.ts Outdated
Comment on lines +105 to +118
this.negativeCache.set(id, true);
this.missTracker.delete(id);
metrics.negativeCachePromotionsTotal.inc();
this.log.info('ID promoted to negative cache', { id });
}

this.updateGauges();
}

evict(id: string): void {
if (this.negativeCache.has(id)) {
metrics.negativeCacheEvictionsTotal.inc();
this.log.info('ID evicted from negative cache', { id });
}
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This emits info-level logs on every promotion/eviction. Depending on traffic patterns, this can become high-volume and add significant log noise/cost. Consider lowering to debug, sampling, or gating behind a separate verbosity flag while keeping metrics as the primary signal.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this at info. The promotion thresholds (10 misses over 4 hours by default) make these events infrequent enough that info is appropriate — these are operationally significant events that operators should see without enabling debug logging.

Comment thread src/data/negative-data-cache.test.ts Outdated
Comment on lines +155 to +163
it('metrics increment correctly', () => {
const hitsBefore =
(metrics.negativeCacheHitsTotal as any).hashMap[''].value ?? 0;
const missesBefore =
(metrics.negativeCacheMissesTotal as any).hashMap[''].value ?? 0;
const promotionsBefore =
(metrics.negativeCachePromotionsTotal as any).hashMap[''].value ?? 0;
const evictionsBefore =
(metrics.negativeCacheEvictionsTotal as any).hashMap[''].value ?? 0;
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics assertions read prom-client internals via (counter as any).hashMap, which is not part of the public API and is prone to break across prom-client versions. Prefer using the public metric APIs (e.g., Counter.get()/register.getSingleMetric().get()) to fetch the current value so the test remains stable.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed — the tests no longer use (counter as any).hashMap. They use the public .get() API (see getValue helper in the current test file).

Comment thread src/routes/data/handlers.ts Outdated
Comment on lines 849 to 851
negativeDataCache?.recordMiss(id);
span.setAttribute('http.status_code', 404);
sendNotFound(res);
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negativeDataCache.recordMiss(id) is called on any exception from getDataAttributes(). That catch can represent transient upstream/DB failures (not a confirmed 404), so this can promote valid IDs into the negative cache and force 404s for the TTL. Consider recording a miss only when the failure is definitively “not found” (e.g., error has a 404 status / specific not-found error type), and avoid negative-caching on generic/internal errors (potentially return 5xx instead).

Suggested change
negativeDataCache?.recordMiss(id);
span.setAttribute('http.status_code', 404);
sendNotFound(res);
const status = error?.statusCode ?? error?.status;
const code = error?.code;
const isNotFoundError =
status === 404 ||
code === 'NOT_FOUND' ||
code === 'NotFoundError';
if (isNotFoundError) {
negativeDataCache?.recordMiss(id);
span.setAttribute('http.status_code', 404);
sendNotFound(res);
} else {
span.setAttribute('http.status_code', 502);
res
.status(502)
.send('Upstream error while retrieving data attributes');
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — already addressed in the current code. The getDataAttributes() catch blocks do not call recordMiss().

Comment on lines 1004 to 1010
log.warn('Unable to retrieve contiguous data:', {
dataId: id,
message: error.message,
stack: error.stack,
});
negativeDataCache?.recordMiss(id);
sendNotFound(res);
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negativeDataCache.recordMiss(id) is invoked for any getData() failure (other than AbortError). getData() can fail for transient reasons (timeouts, upstream errors), and negative-caching those failures can incorrectly short-circuit future requests with 404s. Record misses only for confirmed not-found cases (e.g., explicit not-found error / 404), and avoid negative-caching generic retrieval failures.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — intentionally not changing. See discussion on the CodeRabbit thread for full rationale.

djwhitt and others added 2 commits March 7, 2026 13:00
…985)

Two-phase cache that tracks IDs consistently missing over configurable
count and duration thresholds, then short-circuits future requests with
404 responses. Reduces unnecessary upstream load from repeated requests
for non-existent data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
has() does not update LRU recency, so frequently checked negative IDs
could be evicted despite being hot. get() properly updates recency and
prunes stale TTL entries on access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@djwhitt djwhitt force-pushed the PE-8985-negative-data-cache branch from b50f7e0 to f1bba3b Compare March 7, 2026 19:00
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/data/negative-data-cache.test.ts (1)

155-192: Consider using a more stable approach for metric assertions.

The test accesses the internal hashMap structure of prom-client counters (e.g., (metrics.negativeCacheHitsTotal as any).hashMap[''].value). This relies on prom-client internals that could change between versions.

Consider using the public .get() method on the metric if available, or accepting this as a pragmatic approach for internal testing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.test.ts` around lines 155 - 192, The test reads
prom-client internals (hashMap) to assert metric deltas; replace those fragile
accesses by calling the public get() on each counter
(metrics.negativeCacheHitsTotal.get(), metrics.negativeCacheMissesTotal.get(),
metrics.negativeCachePromotionsTotal.get(),
metrics.negativeCacheEvictionsTotal.get()) and extract the matching metric value
(e.g., find the entry with labels {} or label '' then use its .value) to compute
before/after deltas for assertions; update the four places in the test
(hitsBefore/hitsAfter, missesBefore/missesAfter,
promotionsBefore/promotionsAfter, evictionsBefore/evictionsAfter) to use this
stable .get() approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/data/negative-data-cache.test.ts`:
- Around line 155-192: The test reads prom-client internals (hashMap) to assert
metric deltas; replace those fragile accesses by calling the public get() on
each counter (metrics.negativeCacheHitsTotal.get(),
metrics.negativeCacheMissesTotal.get(),
metrics.negativeCachePromotionsTotal.get(),
metrics.negativeCacheEvictionsTotal.get()) and extract the matching metric value
(e.g., find the entry with labels {} or label '' then use its .value) to compute
before/after deltas for assertions; update the four places in the test
(hitsBefore/hitsAfter, missesBefore/missesAfter,
promotionsBefore/promotionsAfter, evictionsBefore/evictionsAfter) to use this
stable .get() approach.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8bc66ab8-83ff-4ae6-ba95-0ff9eefd166f

📥 Commits

Reviewing files that changed from the base of the PR and between b50f7e0 and f1bba3b.

📒 Files selected for processing (9)
  • docs/envs.md
  • src/config.ts
  • src/constants.ts
  • src/data/negative-data-cache.test.ts
  • src/data/negative-data-cache.ts
  • src/metrics.ts
  • src/routes/data/handlers.ts
  • src/routes/data/index.ts
  • src/system.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/routes/data/handlers.ts
  • docs/envs.md
  • src/config.ts

Inject time into lru-cache via perf.now and ttlResolution: 0 so TTL
tests don't rely on setTimeout. Use public prom-client API for metrics
assertions instead of internal hashMap access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/data/negative-data-cache.ts (2)

18-42: Add TSDoc for the exported cache API.

This introduces a new exported class and public methods, but none of the promotion/lookup/eviction behavior is documented yet. Please add TSDoc on the class and each public method so the thresholds and side effects are discoverable.

As per coding guidelines, **/*.{ts,tsx,js}: Add or improve TSDoc comments when modifying code to enhance documentation.

Also applies to: 62-124

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.ts` around lines 18 - 42, Add TSDoc comments to
the exported NegativeDataCache class and every public method (including the
constructor and any methods that perform promotion, lookup, add, eviction or
clearing) describing purpose, parameters, return values, thresholds
(missCountThreshold, missDurationMs, ttlMs, maxSize), side effects (when entries
are promoted to negativeCache, when missTracker is incremented, eviction
behavior) and any important usage notes; specifically document the
NegativeDataCache class, its constructor signature, and each public method so
callers can discover the cache semantics, threshold behavior and observable side
effects.

54-59: Remove hardcoded ttlResolution: 0 from production cache and use the default or make it configurable.

The ttlResolution: 0 setting forces lru-cache to check time on every staleness test, reducing performance. Per the lru-cache documentation, ttlResolution: 0 is "theoretically unnecessary" and the default ttlResolution: 1 is recommended for production caches. The commit message shows this setting was added for deterministic testing with an injected now() function, but it applies unconditionally to the production code path as well.

Either:

  1. Use the default ttlResolution (1) in production and only set to 0 in tests, or
  2. Make ttlResolution a constructor parameter (like now already is) to allow test control without hardcoding in production.

Since isNegativelyCached() runs on the request path, the performance overhead is avoidable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.ts` around lines 54 - 59, The LRUCache
instantiation in the NegativeDataCache (this.negativeCache) currently hardcodes
ttlResolution: 0 which hurts production performance; change the constructor for
NegativeDataCache to accept an optional ttlResolution (or default to
undefined/1) alongside the existing now param and pass that value into the
LRUCache options (or omit the option so the library default is used), ensuring
tests can pass 0 while production uses the default resolution; update the
constructor signature and the LRUCache options creation in the class that
initializes this.negativeCache to reference the new ttlResolution parameter
instead of the hardcoded 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/data/negative-data-cache.ts`:
- Around line 116-123: evict() still performs has/delete and updates gauges even
when the negative cache is disabled, causing unnecessary per-request overhead;
modify the evict(id: string) method to return early when the cache is disabled
(same condition used by isNegativelyCached()/recordMiss()), so that it skips
checking this.negativeCache.has(id), this.negativeCache.delete(id),
this.missTracker.delete(id) and this.updateGauges() when disabled; ensure the
early-return check references the same disabled flag or helper used elsewhere to
keep behavior consistent.

---

Nitpick comments:
In `@src/data/negative-data-cache.ts`:
- Around line 18-42: Add TSDoc comments to the exported NegativeDataCache class
and every public method (including the constructor and any methods that perform
promotion, lookup, add, eviction or clearing) describing purpose, parameters,
return values, thresholds (missCountThreshold, missDurationMs, ttlMs, maxSize),
side effects (when entries are promoted to negativeCache, when missTracker is
incremented, eviction behavior) and any important usage notes; specifically
document the NegativeDataCache class, its constructor signature, and each public
method so callers can discover the cache semantics, threshold behavior and
observable side effects.
- Around line 54-59: The LRUCache instantiation in the NegativeDataCache
(this.negativeCache) currently hardcodes ttlResolution: 0 which hurts production
performance; change the constructor for NegativeDataCache to accept an optional
ttlResolution (or default to undefined/1) alongside the existing now param and
pass that value into the LRUCache options (or omit the option so the library
default is used), ensuring tests can pass 0 while production uses the default
resolution; update the constructor signature and the LRUCache options creation
in the class that initializes this.negativeCache to reference the new
ttlResolution parameter instead of the hardcoded 0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 52db6677-7981-40a5-a007-9c99b681e6eb

📥 Commits

Reviewing files that changed from the base of the PR and between f1bba3b and b20a683.

📒 Files selected for processing (2)
  • src/data/negative-data-cache.test.ts
  • src/data/negative-data-cache.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/data/negative-data-cache.test.ts

Comment thread src/data/negative-data-cache.ts
…ent errors

The miss tracker had no TTL, so stale entries could persist indefinitely
and cause incorrect promotions on a single new miss. Adding a TTL equal
to missDurationMs ensures entries expire naturally. Also stops recording
negative cache misses when getDataAttributes throws, since exceptions
indicate transient infrastructure errors, not missing data. Bumps the
default NEGATIVE_CACHE_MISS_THRESHOLD_MS from 4h to 8h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/routes/data/handlers.ts (2)

1440-1456: ⚠️ Potential issue | 🟠 Major

Evict the manifest request ID on the index-resolution success path.

When resolveFromIndex() + sendManifestResponse() succeeds, this branch returns without clearing any existing miss-tracker state for id. The direct getData() success path already self-heals at Line 1478, but this success path does not, so a later failure can still promote a now-valid manifest ID into the negative cache.

Suggested change
          if (
            await sendManifestResponse({
              log,
              req,
              res,
              dataAttributesSource,
              dataSource,
              requestAttributes,
              rateLimiter,
              paymentProcessor,
              parentSpan: span,
              ...manifestResolution,
            })
          ) {
+            negativeDataCache?.evict(id);
             span.setAttribute('http.status_code', res.statusCode);
             span.addEvent('Manifest response sent successfully');
             // Manifest response successfully sent
             return;
          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routes/data/handlers.ts` around lines 1440 - 1456, On the successful
index-resolution + sendManifestResponse() path in handlers.ts (the branch that
currently returns after span.addEvent('Manifest response sent successfully')),
add the same negative-cache eviction used on the getData() success path so the
manifest request ID is removed from the miss-tracker; locate the getData()
success code to copy the exact eviction call (e.g., missTracker.evict(id) or
evictMissingManifest(id)) and invoke it just before returning from the
resolveFromIndex()/sendManifestResponse() success branch.

890-907: ⚠️ Potential issue | 🟠 Major

Don't turn transient attribute-store failures into 404s.

The new comment says getDataAttributes() throwing is an infrastructure error, but Line 905 and Line 1389 still return Not found. That hides real 5xx conditions and can send false 404s for data that would otherwise be retrievable. Re-throw here or surface a 500 instead of calling sendNotFound(res).

Suggested change
-          span.setAttribute('http.status_code', 404);
-          sendNotFound(res);
-          return;
+          throw error;

Apply the same change to both catch blocks.

Also applies to: 1374-1391

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routes/data/handlers.ts` around lines 890 - 907, The catch blocks that
handle errors from getDataAttributes() currently log the error but then mark the
span with http.status_code=404 and call sendNotFound(res), which masks transient
infrastructure failures as 404s; instead, for the catch blocks around
getDataAttributes() (referenced by getDataAttributes,
span.setAttribute('http.status_code', ...), and sendNotFound), either re-throw
the error or send a 500 response: set the span http.status_code to 500, record
the exception, log the error (keep existing log.error), and call
sendServerError(res) or res.status(500).send(...) rather than sendNotFound(res);
make the same change in both catch blocks mentioned in the comment so transient
errors surface as 5xx rather than false 404s.
🧹 Nitpick comments (2)
src/routes/data/handlers.ts (1)

796-812: Document the new negativeDataCache dependency on the exported handlers.

createRawDataHandler and createDataHandler both gained an optional parameter that changes 404 behavior, but the exported factories still have no TSDoc describing the short-circuit / record / evict lifecycle. Please add a brief doc block or shared interface comment.

As per coding guidelines, **/*.{ts,tsx,js}: Add or improve TSDoc comments when modifying code to enhance documentation.

Also applies to: 1271-1289

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routes/data/handlers.ts` around lines 796 - 812, Add a TSDoc comment to
the exported handler factories (createRawDataHandler and createDataHandler) that
documents the new optional negativeDataCache parameter and its lifecycle:
explain that when provided the handler will short-circuit 404 / negative
lookups, record negative entries on misses, and evict/refresh entries when
writes or TTLs occur; mention it is optional, how it alters 404 behavior, and
any expected interface/semantics of NegativeDataCache (e.g., record/evict/check
methods). Place the doc block immediately above each factory export so callers
and generated typings clearly convey the short-circuit/record/evict behavior.
src/data/negative-data-cache.ts (1)

18-63: Add TSDoc for the public cache contract.

This new exported class has non-obvious behavior around promotion thresholds, TTLs, and what evict() clears, but there’s no API-level documentation yet. A short TSDoc block on the class and its public methods would make the handler integration and tests much easier to reason about.

As per coding guidelines, **/*.{ts,tsx,js}: Add or improve TSDoc comments when modifying code to enhance documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/negative-data-cache.ts` around lines 18 - 63, Add a TSDoc block for
the exported class NegativeDataCache and TSDoc for each public method (including
constructor, evict, and any other public API) that documents the cache contract:
explain the enabled flag behavior, how missTracker and negativeCache interact
(promotion threshold via missCountThreshold and time window via missDurationMs),
what TTLs (ttlMs for negativeCache and missDurationMs for missTracker) control,
what evict() clears (both missTracker and negativeCache) and any side effects,
and note the purpose of the now callback parameter; place the docs immediately
above the class declaration and each public method so callers and tests can
understand promotion, TTL and eviction semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/routes/data/handlers.ts`:
- Around line 1440-1456: On the successful index-resolution +
sendManifestResponse() path in handlers.ts (the branch that currently returns
after span.addEvent('Manifest response sent successfully')), add the same
negative-cache eviction used on the getData() success path so the manifest
request ID is removed from the miss-tracker; locate the getData() success code
to copy the exact eviction call (e.g., missTracker.evict(id) or
evictMissingManifest(id)) and invoke it just before returning from the
resolveFromIndex()/sendManifestResponse() success branch.
- Around line 890-907: The catch blocks that handle errors from
getDataAttributes() currently log the error but then mark the span with
http.status_code=404 and call sendNotFound(res), which masks transient
infrastructure failures as 404s; instead, for the catch blocks around
getDataAttributes() (referenced by getDataAttributes,
span.setAttribute('http.status_code', ...), and sendNotFound), either re-throw
the error or send a 500 response: set the span http.status_code to 500, record
the exception, log the error (keep existing log.error), and call
sendServerError(res) or res.status(500).send(...) rather than sendNotFound(res);
make the same change in both catch blocks mentioned in the comment so transient
errors surface as 5xx rather than false 404s.

---

Nitpick comments:
In `@src/data/negative-data-cache.ts`:
- Around line 18-63: Add a TSDoc block for the exported class NegativeDataCache
and TSDoc for each public method (including constructor, evict, and any other
public API) that documents the cache contract: explain the enabled flag
behavior, how missTracker and negativeCache interact (promotion threshold via
missCountThreshold and time window via missDurationMs), what TTLs (ttlMs for
negativeCache and missDurationMs for missTracker) control, what evict() clears
(both missTracker and negativeCache) and any side effects, and note the purpose
of the now callback parameter; place the docs immediately above the class
declaration and each public method so callers and tests can understand
promotion, TTL and eviction semantics.

In `@src/routes/data/handlers.ts`:
- Around line 796-812: Add a TSDoc comment to the exported handler factories
(createRawDataHandler and createDataHandler) that documents the new optional
negativeDataCache parameter and its lifecycle: explain that when provided the
handler will short-circuit 404 / negative lookups, record negative entries on
misses, and evict/refresh entries when writes or TTLs occur; mention it is
optional, how it alters 404 behavior, and any expected interface/semantics of
NegativeDataCache (e.g., record/evict/check methods). Place the doc block
immediately above each factory export so callers and generated typings clearly
convey the short-circuit/record/evict behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b2679502-4327-4026-8f24-65b3d3fe9607

📥 Commits

Reviewing files that changed from the base of the PR and between b20a683 and c2a5c62.

📒 Files selected for processing (5)
  • docs/envs.md
  • src/config.ts
  • src/data/negative-data-cache.test.ts
  • src/data/negative-data-cache.ts
  • src/routes/data/handlers.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/envs.md
  • src/config.ts
  • src/data/negative-data-cache.test.ts

djwhitt and others added 8 commits March 7, 2026 13:57
Remove dead `lastSeenAt` field from MissTrackerEntry (never read). Use
`delete()` return value instead of `has()` + `delete()` in evict() for
a single atomic check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cache

Reduce duration threshold default to 5 minutes so bursts trigger promotion.
After first promotion, a single retry failure immediately re-promotes with
doubled TTL (capped at configurable max). Eviction on successful retrieval
clears promotion history entirely so the ID starts fresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoid unnecessary LRU delete and gauge updates on every successful
data fetch when the feature is off.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…es during outages

Suppress negative cache promotions when the system-wide failure rate exceeds
80%, preventing transient errors (network timeouts, upstream 500s) from
causing long-lived false 404s. Also gate recordMiss on dataAttributes being
absent, so IDs with known local metadata are never negatively cached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@djwhitt djwhitt merged commit efe8cab into develop Mar 8, 2026
4 checks passed
@djwhitt djwhitt deleted the PE-8985-negative-data-cache branch March 8, 2026 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants