Skip to content

fix: account fallback for TWAP parts aged out of /by_uids (COW-989)#89

Draft
lgahdl wants to merge 3 commits into
developfrom
luizhatem/cow-989-f11-twap-parts-aged-out-fallback
Draft

fix: account fallback for TWAP parts aged out of /by_uids (COW-989)#89
lgahdl wants to merge 3 commits into
developfrom
luizhatem/cow-989-f11-twap-parts-aged-out-fallback

Conversation

@lgahdl
Copy link
Copy Markdown
Contributor

@lgahdl lgahdl commented Jun 1, 2026

Summary

  • Add fetchOwnerOrderStatuses(chainId, owner) to orderbookClient.ts — calls /account/{owner}/orders and returns a Map<uid, OrderStatusInfo>
  • In C2 stale sweep: after fetchOrderStatusByUids runs for past-validTo candidates, identify UIDs not returned by the API, resolve the generator owner from the DB, group missed UIDs by owner, and call fetchOwnerOrderStatuses once per unique owner (with ORDERBOOK_HTTP_TIMEOUT_MS timeout)
  • Missed UIDs found via the account fallback get their true status written into staleStatuses before the upsert — preventing fulfilled TWAP parts from being permanently recorded as "expired"
  • If the account fallback fails (timeout or API error), the UID still defaults to "expired" — no regression from current behavior

Root cause

The CoW Orderbook API's /orders/by_uids endpoint has a shorter retention window than /account/{owner}/orders. TWAP parts that were filled close to their validTo can age out of /by_uids before C2's stale sweep processes them. Without the fallback, those parts get written as status = "expired" even though they were "fulfilled".

Test plan

  • pnpm typecheck passes (verified locally)
  • Happy path: stale UID returned by /by_uids → status used directly (unchanged)
  • Fallback path: stale UID missing from /by_uids → account fallback called, fulfilled status used
  • Fallback timeout: account fetch times out → UID defaults to "expired" (graceful degradation)

🤖 Generated with Claude Code

When /orders/by_uids returns nothing for a stale candidateDiscreteOrder
(near or past validTo), fall back to /account/{owner}/orders before
defaulting to "expired". Groups missed UIDs by owner so only one account
fetch per unique owner is needed. Prevents fulfilled TWAP parts from
being recorded as "expired" when the API has aged them out of /by_uids.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@linear-code
Copy link
Copy Markdown

linear-code Bot commented Jun 1, 2026

COW-989

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgahdl lgahdl marked this pull request as draft June 2, 2026 13:50
@lgahdl lgahdl changed the base branch from main to develop June 2, 2026 13:52
@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

Indexer review — TWAP parts aged out of /by_uids: account fallback

Good fix for the aged-out TWAP parts — the per-owner fallback pattern is exactly right.

Timeout budget is too tight for paginated owners:

const ownerStatuses = await withTimeout(
  fetchOwnerOrderStatuses(chainId, owner),
  ORDERBOOK_HTTP_TIMEOUT_MS,   // 10 s
  "c2:stale:accountFallback",
);

fetchOwnerOrderStatuses calls fetchAccountOrders, which is a paginated function — multiple sequential HTTP requests, each with their own per-page timeouts. For an owner with even 2–3 pages of order history the outer 10 s budget (ORDERBOOK_HTTP_TIMEOUT_MS) is almost certainly too tight. The existing constant for this pattern is BOOTSTRAP_OWNER_FETCH_TIMEOUT_MS = 30_000, which was introduced precisely for full-account fetches like fetchAccountOrders.

Suggest changing to:

const ownerStatuses = await withTimeout(
  fetchOwnerOrderStatuses(chainId, owner),
  BOOTSTRAP_OWNER_FETCH_TIMEOUT_MS,  // 30 s
  "c2:stale:accountFallback",
);

The fallback already silently swallows failures (catch {}), so timeout semantics are forgiving — but using too short a budget means the fallback silently fails for any owner with a moderate order history, and the affected TWAP parts will still be written as expired rather than fulfilled.

@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

Missing docs: the /account/{owner}/orders fallback for stale TWAP parts is not mentioned anywhere in docs/architecture.md.

The new logic in CandidateConfirmer silently falls back to /account/{owner}/orders when /by_uids returns nothing for TWAP parts near or past validTo. Without a note in the docs, an operator debugging why a TWAP part shows fulfilled instead of expired won't know the fallback exists — or that it depends on the /account endpoint being reachable.

Suggested addition to the candidate_discrete_order section or the Known Limitations section of docs/architecture.md, e.g.:

TWAP parts can age out of the /orders/by_uids retention window before CandidateConfirmer processes them. In that case the handler falls back to GET /account/{owner}/orders to recover the final status. If the fallback also fails (timeout or 5xx), the part is recorded as expired rather than its true terminal status.

@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

The account-endpoint fallback for aged-out TWAP UIDs is a good idea, but fetchOwnerOrderStatuses doesn't write anything to order_uid_cache after it fetches.

Compare with fetchOrderStatusByUids, which after getting fresh data does:

if (newTerminal.length > 0) {
  await cacheUidStatuses(context, chainId, newTerminal);
}

fetchOwnerOrderStatuses skips this entirely. Concretely: if a TWAP part is fulfilled and the account endpoint returns that status, C2 will write the correct fulfilled row — but the next time any sweep runs for the same UID (e.g. C3 or another C2 pass before the row is committed), it will hit /account/{owner}/orders again rather than reading from the cache.

The fix is straightforward — fetchOwnerOrderStatuses currently doesn't receive a context parameter (which cacheUidStatuses needs). Either thread context through, or have the caller do the caching after merging results into staleStatuses. The second option is lower friction given the current signature.

One secondary point: fetchOwnerOrderStatuses is a module-level export but the cacheUidStatuses helper is file-private, so if you want the function to self-cache you'll need to either expose cacheUidStatuses or accept context as a parameter — both are small changes.

@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

Code review — TWAP fallback via /account/{owner}/orders (COW-989)

ORDERBOOK_HTTP_TIMEOUT_MS (10 s) vs. a paginated endpoint

fetchOwnerOrderStatuses calls fetchAccountOrders, which is a paginated loop:

while (true) {
  const response = await fetchWithTimeout(url, undefined, ORDERBOOK_HTTP_TIMEOUT_MS, "ob:account");
  // ...
  if (page.length < PAGE_LIMIT) break;
  offset += page.length;
}

Each individual page fetch is independently capped at ORDERBOOK_HTTP_TIMEOUT_MS (10 s), but the loop itself has no aggregate wall-clock cap. An owner with, say, 3,000 orders across 3 pages could take up to 3 × 10 s = 30 s if the API is slow, all within the single withTimeout(..., ORDERBOOK_HTTP_TIMEOUT_MS, "c2:stale:accountFallback") outer guard which fires at 10 s.

The result: the outer withTimeout fires at 10 s during the first slow page, throwing TimeoutError and leaving staleStatuses unchanged for those UIDs — they will default to "expired", which is the intended fallback. So the outer guard does its job.

BOOTSTRAP_OWNER_FETCH_TIMEOUT_MS (30 s) — should it be used here?

BOOTSTRAP_OWNER_FETCH_TIMEOUT_MS is documented as "the whole per-owner bootstrap fetch in C4 (account pagination + by_uids refresh)". The OwnerBackfill handler uses it for fetchComposableOrders, which is a heavier call (pagination + signature decoding + DB lookups). The stale-candidate fallback here is lighter (pagination only, no decoding), and it runs inside a live block handler on every block that has stale candidates — latency is more critical here than in the one-shot OwnerBackfill. Using ORDERBOOK_HTTP_TIMEOUT_MS (10 s) is tighter and more appropriate for this context. The naming concern is moot.

The real issue: the outer timeout fires mid-pagination and silently drops results

If the outer withTimeout fires during page 2 of a 3-page owner, the pages already fetched are discarded — staleStatuses gets nothing for that owner and all their missed UIDs default to "expired". This is the documented acceptable fallback (the comment says "Fallback failed — these UIDs will default to 'expired'"), but it means a high-volume TWAP owner will consistently fail the fallback and their fulfilled-but-aged-out parts will always be recorded as "expired" rather than "fulfilled". This is a latent data accuracy issue for TWAP owners with large order histories, not a correctness regression from the current code (which already defaults to "expired" without any fallback).

Recommendation: Pass an aggregate timeout to fetchAccountOrders so the pagination loop itself is bounded, rather than relying on the outer guard to interrupt mid-page. Alternatively, add a maxPages parameter so the fallback only fetches the first N pages (most recent orders are most likely to contain the TWAP parts near their validTo). For the common case a single page of 1,000 orders is probably sufficient.

No blocking bugs. The starvation/data accuracy concern for large owners is pre-existing; this PR does not make it worse and adds the fallback path that didn't exist before. The 10 s outer timeout is correctly chosen for a block handler context.

@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

Documentation / CI review

The /account/{owner}/orders fallback behavior is not documented anywhere in docs/. Must fix before merge.

This PR adds a silent, partial-failure-tolerant fallback: when CandidateConfirmer's stale sweep calls /orders/by_uids for past-validTo candidates and the API returns nothing (UID has aged out of the retention window), it falls back to /account/{owner}/orders once per unique owner, and populates the stale status map with whatever the account endpoint returns.

This behavior is invisible at the API surface — a caller just sees status: "fulfilled" instead of status: "expired" — but operators debugging TWAP accuracy issues need to know:

  1. The fallback exists and silently times out (ORDERBOOK_HTTP_TIMEOUT_MS) if the account endpoint is slow.
  2. If the fallback fails (timeout or HTTP error), the UID defaults to "expired" — this is by design but surprising without documentation.
  3. There is no log line indicating the fallback fired successfully; only a catch-block swallow for failures. A console.log or cowLog for the success path would help ops diagnose whether the fallback is being triggered in production.

Recommended addition to docs/architecture.md — in the CandidateConfirmer block handler description, add a note such as:

For stale candidates (past validTo) not found by /orders/by_uids, CandidateConfirmer falls back to /account/{owner}/orders (one request per unique owner) to recover TWAP parts that have aged out of the /by_uids retention window. If the fallback times out or the API returns an error, the candidate defaults to "expired". This fallback is partial-failure tolerant: a timeout on one owner does not affect other owners in the same block.

Recommended addition to docs/api-reference.md — under the GET /api/sync-progress or discrete order status documentation, a sentence noting that "expired" status on a TWAP part can in rare cases be corrected retroactively to "fulfilled" if the indexer was offline or the fallback timed out around the validTo boundary.

orderbookClient.tsfetchOwnerOrderStatuses is well-documented inline. The JSDoc on the new function clearly explains the fallback motivation. The docs gap is purely in docs/.

Overall: One actionable item before merge — add a short note to docs/architecture.md (CandidateConfirmer section) describing the fallback behavior, its timeout, and its graceful degradation. Consider also adding a cowLog on the success path so ops can confirm the fallback is firing in production.

@lgahdl
Copy link
Copy Markdown
Contributor Author

lgahdl commented Jun 2, 2026

Review: Account fallback for TWAP parts aged out of /by_uids (COW-989)

fetchOwnerOrderStatuses does not call cacheUidStatuses — cache miss for terminal results

fetchOrderStatusByUids (the UID-based path) caches newly-terminal orders via cacheUidStatuses(context, chainId, newTerminal) at the end of its fetch. fetchOwnerOrderStatuses does not have a context parameter and therefore cannot write to cow_cache.order_uid_cache.

In practice, after the stale sweep these UIDs are written to discrete_order with a terminal status, so C3 (eq(discreteOrder.status, "open")) will never pick them up again — the uncached status doesn't cause repeated API calls for the same row. However, if a UID is resolved via the account fallback but then somehow ends up back in an "open" state (e.g. due to a Ponder reorg replay), the cache will be empty and the fallback fetch will run again.

More importantly, the symmetry with fetchOrderStatusByUids is broken: both functions resolve terminal statuses from the API, but only one caches them. If you want to add caching, fetchOwnerOrderStatuses needs a context parameter. The PR description acknowledges the no-regression guarantee — just flagging this as a known gap for a follow-up.

No inArray guard on the DB query

.where(inArray(conditionalOrderGenerator.eventId, generatorIds))

generatorIds is derived from [...new Set(missed.map((c) => c.generatorId))]. If missed is non-empty (it is, that's the guard condition), generatorIds has at least one element. Drizzle handles inArray with a single element correctly. No issue.

withTimeout label for the account fallback

The label "c2:stale:accountFallback" is good for tracing. The timeout uses ORDERBOOK_HTTP_TIMEOUT_MS (10s) — appropriate for a single owner fetch (which may be paginated internally via fetchAccountOrders). Confirm that fetchAccountOrders itself respects a per-request timeout or that 10s is sufficient for the largest expected owner history. If an owner has thousands of orders across many pages, the paginated fetch could exceed 10s.

ownerKey = owner.toLowerCase() as Hex

The owner from the DB is stored via .toLowerCase() at insert time (convention confirmed in code-patterns.md). The .toLowerCase() call is therefore a no-op but not harmful. No issue.

Logic correctness

The fallback result merges into staleStatuses before the upsert rows are built — the stale sweep correctly uses the augmented map. Fallback failures (timeout/error) leave staleStatuses unchanged, so those UIDs default to "expired" — no regression from current behavior. This is correct.

The main actionable item is the missing cacheUidStatuses call, which is a deliberate tradeoff due to the missing context param. Document it with a // TODO: cache terminal results when context is threaded through comment inside the function.

… fallback (COW-989)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant