Skip to content

fix: mark stake transactions correctly when they land more than one block after broadcast#278

Open
Alann27 wants to merge 6 commits into
stagingfrom
bug/fix-transactions-wait
Open

fix: mark stake transactions correctly when they land more than one block after broadcast#278
Alann27 wants to merge 6 commits into
stagingfrom
bug/fix-transactions-wait

Conversation

@Alann27
Copy link
Copy Markdown
Collaborator

@Alann27 Alann27 commented Apr 20, 2026

Summary

  • Fixes a race condition where stake transactions that landed two or more blocks after broadcast were being marked as failure in the middleman DB even though the chain accepted them.
  • Replaces the one-shot post-waitForNextBlock check with a Temporal activity-retry-driven poll across the full TX_EXPIRATION_BLOCKS window (~30 min at pocket's ~1-min block time)
  • Makes unexpected errors (RPC outages, bugs) fail the workflow loudly so the tx stays Pending for human triage instead of being silently converted to Failure

Root cause

broadcastTxSync confirms mempool acceptance, not block inclusion — a tx may land one, two, or more blocks later depending on backlog. The previous flow did:

  1. getBlockHeight() → broadcast → waitForNextBlock(height) (waits for height + 1)
  2. verifyTransaction(hash, height) — one shot across Tiers 1–4

When the tx landed at height + 2 (or later), every tier failed at height + 1:

  • Tier 1 (tx_index) hadn't caught up
  • Tier 2 (REST) still returned 404
  • Tier 3 (block scan) iterated 30 heights but every comet.block(h) past chain head threw and silently continued
  • Tier 4 (supplier state) didn't find the supplier on-chain yet (it landed at height + 2 too)

Resulting DB write: status=failure, code=∅, log=∅, consumedFee=0 — exactly what the user reported.

What changed

apps/middleman-workflows/src/activities/index.ts

  • verifyTransaction now throws ApplicationFailure.retryable('Transaction not found on-chain', 'TX_NOT_FOUND', …) when the hash is not on-chain, returning [success, code, gasUsed] only when found
  • New checkSupplierOnChain(operatorAddress) activity for the Tier 4 positive-only fallback, extracted out of verifyTransaction

apps/middleman-workflows/src/workflows/ExecuteTransaction.ts

  • verifyTransaction moved to its own proxyActivities group with { initialInterval: "60s", backoffCoefficient: 1, maximumAttempts: TX_EXPIRATION_BLOCKS } — one attempt per ~block for the full expiration window
  • try/catch around the activity; isTxNotFoundFailure(err) helper discriminates retries-exhausted (ApplicationFailure.type === 'TX_NOT_FOUND') from any other failure, which is rethrown so the workflow fails and the tx stays Pending
  • Tier 4 (checkSupplierOnChain) runs only after TX_NOT_FOUND retries are exhausted; a positive result short-circuits to success, a missing supplier stays Failure
  • code is now persisted on every verification update, and log distinguishes: chain rejection (verification failed with code N), supplier-state fallback success, and true TX_NOT_FOUND after retries

apps/middleman-workflows/src/lib/blockchain/index.ts

  • Tier 3 block scan capped at the current chain head (Math.min(height + maxBlocks, latestHeight)); future heights no longer consume the scan budget by throwing and continue-ing. The workflow-level retry policy handles waiting for new blocks.

Behavior matrix

Scenario DB result
Tx found, chain code 0 Success
Tx found, chain code ≠ 0 Failure, log = "verification failed with code N"
Tx not found after 30 retries, supplier on-chain Success, log = "verified via supplier state fallback"
Tx not found after 30 retries, supplier not on-chain Failure, log = "tx not found on-chain after 30 retries"
RPC unreachable / unexpected error anywhere in verify Workflow fails; tx marked as Failure

Alann27 and others added 5 commits April 8, 2026 23:00
…dress (#274)

* Added migrate button in keys page and key detail to enable providers migrate keys to another address group.

* Added address group migration workflow

* added confirmation step to bulk migrate modal, added success step to individual migrate key, and some other enhancements

* refactored `migrateKeysToAddressGroup` and showing migrate to group button for all keys

* Fix sidebar and overlay pages not accounting for notification banner height
…ncy (#276)

- Remove transactionsToNodes relation from getNodesByUser to fix PostgreSQL statement timeout that crashed the overview and import suppliers pages
- Load supplier transactions on demand in the detail panel via GetNode, skipping the fetch when transactions are already provided by the caller
- Replace lastUpdatedHeight for the suppliers table Height column instead of deriving it from stake transactions; add info tooltip explaining the field
- Add csvHeader field to CsvColumnDef so JSX column headers export correctly
- Fetch user nodes once in the unstake page and pass data down to step components, eliminating redundant useQuery calls across steps
- Add copyToClipboard utility with execCommand fallback for non-HTTPS contexts and migrate all navigator.clipboard.writeText usages to it
…lock after broadcast

- Poll verifyTransaction across the full TX_EXPIRATION_BLOCKS window via Temporal's activity retry policy (one attempt per ~block) instead of a single post-waitForNextBlock check, so txs that land two or more blocks after broadcast are no longer mis-marked as failure
- Make verifyTransaction throw retriable TX_NOT_FOUND when the hash is not on-chain, and split the supplier-state fallback into a dedicated checkSupplierOnChain activity that only runs after retries are exhausted (positive-only: missing supplier stays Failure)
- Discriminate the verify catch with isTxNotFoundFailure so RPC outages and other unexpected errors fail the workflow loudly (tx stays Pending) rather than being silently converted to Failure on insufficient evidence
- Cap the Tier 3 block scan at the current chain head so future heights stop consuming the scan budget; the workflow-level retry drives forward progress instead
- Persist code on every verification update and write a log that distinguishes chain rejection ("verification failed with code N"), supplier fallback success, and true TX_NOT_FOUND after retries
@Alann27 Alann27 requested a review from jorgecuesta April 20, 2026 18:35
@Alann27 Alann27 self-assigned this Apr 20, 2026
@Alann27 Alann27 added bug Something isn't working enhancement New feature or request labels Apr 20, 2026
@jorgecuesta
Copy link
Copy Markdown
Contributor

Workflow fail, then what? will retry?

@Alann27
Copy link
Copy Markdown
Collaborator Author

Alann27 commented Apr 20, 2026

Workflow fail, then what? will retry?

@jorgecuesta since it remains in a Pending state, it will be retried in the next workflow run. If an unhandled error persists, it will continue retrying indefinitely. We may want to revise this behavior—for example, by marking the transaction as Failure after it has been retried a certain number of times.

@jorgecuesta
Copy link
Copy Markdown
Contributor

@Alann27 I think yes, we should mark as failed after all those retries+supplier check.

Copy link
Copy Markdown
Contributor

@jorgecuesta jorgecuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a thorough pass on the tx-landing fix and the migrate-keys flow. Inline comments on what I'd consider blocking.

@@ -127,20 +139,58 @@ export async function ExecuteTransaction(args: TransactionArgs) {
await waitForNextBlock(txHeight);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The txHeight passed here comes from line 78 (const txHeight = await getBlockHeight()), which is captured BEFORE executeTransaction. If the chain advances during the broadcast — perfectly possible under load with variable block time — txHeight is stale.

waitForNextBlock(txHeight) checks currentHeight >= txHeight + 1. Since we already advanced, it returns immediately without waiting — exactly the wait the rest of the fix relies on. verifyTransaction then goes to look for the tx in the same block where it was included, where the tx_index probably hasn't indexed it yet.

The retries with 60s backoff eventually recover, but we're burning minutes of overhead that waitForNextBlock was meant to avoid.

Fix: re-capture the height post-broadcast before calling waitForNextBlock, or wait for txHeight + 2 to have a one-block margin.

} catch {
if (operatorAddress) {
try {
const supplierExists = await checkSupplierOnChain(operatorAddress);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkSupplierOnChain only checks whether a supplier exists for the operator address — it doesn't validate ownership. Since this path is the fallback after verifyTransaction exhausts its retries (30 attempts × 60s ≈ 30min), we land here precisely in suspicious scenarios: tx was never found on-chain via hash nor block scan.

Failure case: another operator (or the same owner via a parallel route) stakes the same operator address while our original tx was lost or failed. Tier 4 sees a supplier exists with that operator and returns true. The workflow then marks the tx as Success, code = 0, and fires createNewNodesFromTransaction + notifyProviderOfStakedAddresses below, creating node records in our DB that point to a supplier we don't own.

Minimum: have checkSupplierOnChain receive the expected ownerAddress and validate supplier.ownerAddress === expectedOwner before returning true. If owner doesn't match, return false (or an explicit differentOwner signal) and leave the tx as Failure.

parsed.targetAddressGroupId,
)

await TriggerAddressGroupMigrationSchedule()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The await TriggerAddressGroupMigrationSchedule() doesn't inspect the ActionResult. If the schedule doesn't exist (fresh instance), Temporal is down, or the trigger returns { success: false, error: ... }, this action still returns success: true with the migrated count. The operator sees "X migrated" and assumes the remediation flow is running, but the keys sit in the new address group with nothing to re-stake them.

const trigger = await TriggerAddressGroupMigrationSchedule()
if (!trigger.success) {
  throw new Error(trigger.error.message)
}

Even better: move the trigger inside the same DB transaction so that if the schedule fails, the address group change doesn't stick.


for (const chunkIds of chunks) {
const keys = await tx
.select({ id: keysTable.id, state: keysTable.state, remediationHistory: keysTable.remediationHistory })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SELECT against keysTable inside the transaction doesn't take a lock (FOR UPDATE). The CASE expression built in historyCase embeds the current remediationHistory of each key and then writes it back.

If two migrations run in parallel over overlapping IDs (Temporal worker handling two distinct workflows, the operator clicking "Migrate" twice quickly, or the schedule re-firing before the previous run finishes), both read the same history snapshot, compute their CASE expression, and the second write clobbers the first. A history entry is lost silently.

Fix: .for('update') on the select.

if (stateResetIds.length > 0) {
await tx
.update(keysTable)
.set({ state: KeyState.Staked })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second UPDATE only filters by inArray(id, stateResetIds), not by state. stateResetIds was computed from the in-memory snapshot of the prior SELECT — there's no guarantee the state is still AttentionNeeded/RemediationFailed at the time of this UPDATE.

If a concurrent workflow (SupplierStatus, SupplierRemediation) advanced the state of one of those keys between the SELECT and this UPDATE, you blindly reset it back to Staked, which can revert a legitimate transition (e.g. an in-flight Unstake).

Fix: add the guard to the WHERE:

.where(and(
  inArray(keysTable.id, stateResetIds),
  inArray(keysTable.state, STATES_REQUIRING_RESET),
))

Combined with FOR UPDATE on the SELECT above it's fully covered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants