fix: mark stake transactions correctly when they land more than one block after broadcast#278
fix: mark stake transactions correctly when they land more than one block after broadcast#278Alann27 wants to merge 6 commits into
Conversation
…dress (#274) * Added migrate button in keys page and key detail to enable providers migrate keys to another address group. * Added address group migration workflow * added confirmation step to bulk migrate modal, added success step to individual migrate key, and some other enhancements * refactored `migrateKeysToAddressGroup` and showing migrate to group button for all keys * Fix sidebar and overlay pages not accounting for notification banner height
…ncy (#276) - Remove transactionsToNodes relation from getNodesByUser to fix PostgreSQL statement timeout that crashed the overview and import suppliers pages - Load supplier transactions on demand in the detail panel via GetNode, skipping the fetch when transactions are already provided by the caller - Replace lastUpdatedHeight for the suppliers table Height column instead of deriving it from stake transactions; add info tooltip explaining the field - Add csvHeader field to CsvColumnDef so JSX column headers export correctly - Fetch user nodes once in the unstake page and pass data down to step components, eliminating redundant useQuery calls across steps - Add copyToClipboard utility with execCommand fallback for non-HTTPS contexts and migrate all navigator.clipboard.writeText usages to it
…lock after broadcast
- Poll verifyTransaction across the full TX_EXPIRATION_BLOCKS window via Temporal's activity retry policy (one attempt per ~block) instead of a single post-waitForNextBlock check, so txs that land two or more blocks after broadcast are no longer mis-marked as failure
- Make verifyTransaction throw retriable TX_NOT_FOUND when the hash is not on-chain, and split the supplier-state fallback into a dedicated checkSupplierOnChain activity that only runs after retries are exhausted (positive-only: missing supplier stays Failure)
- Discriminate the verify catch with isTxNotFoundFailure so RPC outages and other unexpected errors fail the workflow loudly (tx stays Pending) rather than being silently converted to Failure on insufficient evidence
- Cap the Tier 3 block scan at the current chain head so future heights stop consuming the scan budget; the workflow-level retry drives forward progress instead
- Persist code on every verification update and write a log that distinguishes chain rejection ("verification failed with code N"), supplier fallback success, and true TX_NOT_FOUND after retries
|
Workflow fail, then what? will retry? |
@jorgecuesta since it remains in a Pending state, it will be retried in the next workflow run. If an unhandled error persists, it will continue retrying indefinitely. We may want to revise this behavior—for example, by marking the transaction as Failure after it has been retried a certain number of times. |
|
@Alann27 I think yes, we should mark as failed after all those retries+supplier check. |
d999d44 to
b269662
Compare
jorgecuesta
left a comment
There was a problem hiding this comment.
Did a thorough pass on the tx-landing fix and the migrate-keys flow. Inline comments on what I'd consider blocking.
| @@ -127,20 +139,58 @@ export async function ExecuteTransaction(args: TransactionArgs) { | |||
| await waitForNextBlock(txHeight); | |||
There was a problem hiding this comment.
The txHeight passed here comes from line 78 (const txHeight = await getBlockHeight()), which is captured BEFORE executeTransaction. If the chain advances during the broadcast — perfectly possible under load with variable block time — txHeight is stale.
waitForNextBlock(txHeight) checks currentHeight >= txHeight + 1. Since we already advanced, it returns immediately without waiting — exactly the wait the rest of the fix relies on. verifyTransaction then goes to look for the tx in the same block where it was included, where the tx_index probably hasn't indexed it yet.
The retries with 60s backoff eventually recover, but we're burning minutes of overhead that waitForNextBlock was meant to avoid.
Fix: re-capture the height post-broadcast before calling waitForNextBlock, or wait for txHeight + 2 to have a one-block margin.
| } catch { | ||
| if (operatorAddress) { | ||
| try { | ||
| const supplierExists = await checkSupplierOnChain(operatorAddress); |
There was a problem hiding this comment.
checkSupplierOnChain only checks whether a supplier exists for the operator address — it doesn't validate ownership. Since this path is the fallback after verifyTransaction exhausts its retries (30 attempts × 60s ≈ 30min), we land here precisely in suspicious scenarios: tx was never found on-chain via hash nor block scan.
Failure case: another operator (or the same owner via a parallel route) stakes the same operator address while our original tx was lost or failed. Tier 4 sees a supplier exists with that operator and returns true. The workflow then marks the tx as Success, code = 0, and fires createNewNodesFromTransaction + notifyProviderOfStakedAddresses below, creating node records in our DB that point to a supplier we don't own.
Minimum: have checkSupplierOnChain receive the expected ownerAddress and validate supplier.ownerAddress === expectedOwner before returning true. If owner doesn't match, return false (or an explicit differentOwner signal) and leave the tx as Failure.
| parsed.targetAddressGroupId, | ||
| ) | ||
|
|
||
| await TriggerAddressGroupMigrationSchedule() |
There was a problem hiding this comment.
The await TriggerAddressGroupMigrationSchedule() doesn't inspect the ActionResult. If the schedule doesn't exist (fresh instance), Temporal is down, or the trigger returns { success: false, error: ... }, this action still returns success: true with the migrated count. The operator sees "X migrated" and assumes the remediation flow is running, but the keys sit in the new address group with nothing to re-stake them.
const trigger = await TriggerAddressGroupMigrationSchedule()
if (!trigger.success) {
throw new Error(trigger.error.message)
}Even better: move the trigger inside the same DB transaction so that if the schedule fails, the address group change doesn't stick.
|
|
||
| for (const chunkIds of chunks) { | ||
| const keys = await tx | ||
| .select({ id: keysTable.id, state: keysTable.state, remediationHistory: keysTable.remediationHistory }) |
There was a problem hiding this comment.
The SELECT against keysTable inside the transaction doesn't take a lock (FOR UPDATE). The CASE expression built in historyCase embeds the current remediationHistory of each key and then writes it back.
If two migrations run in parallel over overlapping IDs (Temporal worker handling two distinct workflows, the operator clicking "Migrate" twice quickly, or the schedule re-firing before the previous run finishes), both read the same history snapshot, compute their CASE expression, and the second write clobbers the first. A history entry is lost silently.
Fix: .for('update') on the select.
| if (stateResetIds.length > 0) { | ||
| await tx | ||
| .update(keysTable) | ||
| .set({ state: KeyState.Staked }) |
There was a problem hiding this comment.
This second UPDATE only filters by inArray(id, stateResetIds), not by state. stateResetIds was computed from the in-memory snapshot of the prior SELECT — there's no guarantee the state is still AttentionNeeded/RemediationFailed at the time of this UPDATE.
If a concurrent workflow (SupplierStatus, SupplierRemediation) advanced the state of one of those keys between the SELECT and this UPDATE, you blindly reset it back to Staked, which can revert a legitimate transition (e.g. an in-flight Unstake).
Fix: add the guard to the WHERE:
.where(and(
inArray(keysTable.id, stateResetIds),
inArray(keysTable.state, STATES_REQUIRING_RESET),
))Combined with FOR UPDATE on the SELECT above it's fully covered.
Summary
failurein the middleman DB even though the chain accepted them.waitForNextBlockcheck with a Temporal activity-retry-driven poll across the fullTX_EXPIRATION_BLOCKSwindow (~30 min at pocket's ~1-min block time)Pendingfor human triage instead of being silently converted toFailureRoot cause
broadcastTxSyncconfirms mempool acceptance, not block inclusion — a tx may land one, two, or more blocks later depending on backlog. The previous flow did:getBlockHeight()→ broadcast →waitForNextBlock(height)(waits forheight + 1)verifyTransaction(hash, height)— one shot across Tiers 1–4When the tx landed at
height + 2(or later), every tier failed atheight + 1:tx_index) hadn't caught upcomet.block(h)past chain head threw and silentlycontinuedheight + 2too)Resulting DB write:
status=failure, code=∅, log=∅, consumedFee=0— exactly what the user reported.What changed
apps/middleman-workflows/src/activities/index.tsverifyTransactionnow throwsApplicationFailure.retryable('Transaction not found on-chain', 'TX_NOT_FOUND', …)when the hash is not on-chain, returning[success, code, gasUsed]only when foundcheckSupplierOnChain(operatorAddress)activity for the Tier 4 positive-only fallback, extracted out ofverifyTransactionapps/middleman-workflows/src/workflows/ExecuteTransaction.tsverifyTransactionmoved to its ownproxyActivitiesgroup with{ initialInterval: "60s", backoffCoefficient: 1, maximumAttempts: TX_EXPIRATION_BLOCKS }— one attempt per ~block for the full expiration windowtry/catcharound the activity;isTxNotFoundFailure(err)helper discriminates retries-exhausted (ApplicationFailure.type === 'TX_NOT_FOUND') from any other failure, which is rethrown so the workflow fails and the tx staysPendingcheckSupplierOnChain) runs only afterTX_NOT_FOUNDretries are exhausted; a positive result short-circuits to success, a missing supplier staysFailurecodeis now persisted on every verification update, andlogdistinguishes: chain rejection (verification failed with code N), supplier-state fallback success, and true TX_NOT_FOUND after retriesapps/middleman-workflows/src/lib/blockchain/index.tsMath.min(height + maxBlocks, latestHeight)); future heights no longer consume the scan budget by throwing andcontinue-ing. The workflow-level retry policy handles waiting for new blocks.Behavior matrix
SuccessFailure,log = "verification failed with code N"Success,log = "verified via supplier state fallback"Failure,log = "tx not found on-chain after 30 retries"Failure