feat(core): implement TAPI backoff and retry handling#1154

Draft
abueide wants to merge 21 commits into master from feature/tapi-backoff-core

Conversation

@abueide abueide commented Mar 9, 2026

Summary

Implements TAPI error handling specification for the React Native SDK. Adds proper retry logic for rate limits (429), transient errors (5xx), and handling of permanent errors (4xx).

Background

Currently the SDK has minimal error handling: failed events stay in the queue and are retried indefinitely. This causes several issues:

  • No backoff for transient errors
  • No handling of 429 rate limits
  • Permanent errors (400, 404) never get dropped and block the queue
  • No retry limits

Solution

Added a RetryManager that handles retry logic for both rate limits and transient errors.

RetryManager - Manages retry state for uploads

  • Handles 429 rate limit responses using server-provided Retry-After wait time
  • Handles transient errors (5xx, network failures) using exponential backoff
  • Three states: READY, RATE_LIMITED, BACKING_OFF
  • Blocks all uploads when in RATE_LIMITED or BACKING_OFF state
  • Tracks retry count and enforces max retry limits
  • State persisted across app restarts
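The state machine described above can be sketched roughly as follows. The class shape and names are illustrative only; the real implementation also persists state across restarts and enforces the maxRateLimitDuration / maxTotalBackoffDuration caps:

```typescript
// Simplified sketch of the RetryManager described above; names and
// signatures are illustrative, not the SDK's actual API.
type RetryState = 'READY' | 'RATE_LIMITED' | 'BACKING_OFF';

class RetryManager {
  private state: RetryState = 'READY';
  private retryCount = 0;
  private blockedUntil = 0; // epoch ms

  constructor(
    private maxRetryCount = 10,
    private baseBackoffInterval = 0.5, // seconds
    private maxBackoffInterval = 30,   // seconds
    private jitterPercent = 25,
  ) {}

  /** Gate checked by sendEvents(); true only when uploads may proceed. */
  canRetry(): boolean {
    if (this.state === 'READY') return true;
    if (Date.now() >= this.blockedUntil) {
      this.state = 'READY'; // rate-limit/backoff window expired
      return true;
    }
    return false;
  }

  /** 429: wait exactly as long as the server's Retry-After says. */
  handle429(retryAfterSeconds: number): void {
    this.retryCount += 1;
    this.state = 'RATE_LIMITED';
    this.blockedUntil = Date.now() + retryAfterSeconds * 1000;
  }

  /** 5xx / network failure: exponential backoff with jitter.
   *  Uses the pre-increment retryCount, giving 0.5s, 1s, 2s, 4s, ... */
  handleTransientError(): void {
    const exp = this.baseBackoffInterval * 2 ** this.retryCount;
    const capped = Math.min(exp, this.maxBackoffInterval);
    const jitter = 1 + (Math.random() * 2 - 1) * (this.jitterPercent / 100);
    this.retryCount += 1;
    this.state = 'BACKING_OFF';
    this.blockedUntil = Date.now() + capped * jitter * 1000;
  }

  exceededMaxRetries(): boolean {
    return this.retryCount >= this.maxRetryCount;
  }

  reset(): void {
    this.state = 'READY';
    this.retryCount = 0;
    this.blockedUntil = 0;
  }
}
```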

How it works with flush policies

Flush policies (TimerFlushPolicy, CountFlushPolicy) trigger flushes based on their criteria (time interval, event count). The RetryManager acts as a gate in sendEvents():

async sendEvents(events) {
  // Check if blocked by rate limit or backoff
  if (!await retryManager.canRetry()) return;

  // Upload batches in parallel
  const batches = chunk(events, 100, 500 * 1024); // ≤100 events or ~500 KB per batch
  const results = await Promise.all(batches.map(batch => uploadBatch(batch)));
  
  // Aggregate errors from all batches
  const aggregation = aggregateErrors(results);

  // Update retry manager based on errors
  if (aggregation.has429) {
    await retryManager.handle429(aggregation.longestRetryAfter);
  }
  if (aggregation.hasTransientError) {
    await retryManager.handleTransientError();
  }

  // Remove successful and permanently failed events
  if (aggregation.successfulMessageIds.length > 0) {
    await queuePlugin.dequeueByMessageIds(aggregation.successfulMessageIds);
    await retryManager.reset();
  }
  if (aggregation.permanentErrorMessageIds.length > 0) {
    await queuePlugin.dequeueByMessageIds(aggregation.permanentErrorMessageIds);
  }
}

When uploads are blocked:

  • Flush policies continue triggering at their normal interval
  • sendEvents() checks retryManager.canRetry() and returns early if blocked
  • Next flush attempt occurs on the next policy trigger
  • Once the backoff/rate limit expires, the flush succeeds

Error classification

| Status Code | Classification | Action |
| --- | --- | --- |
| 200 | Success | Dequeue events, reset retry count |
| 429 | Rate limit | Keep in queue, increment retry count, wait for Retry-After |
| 5xx, 408, 410, 460 | Transient | Keep in queue, increment retry count, exponential backoff |
| 400, 401, 403, 404 | Permanent | Drop events immediately |
| Network error | Transient | Keep in queue, increment retry count, exponential backoff |
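The table can be sketched as a small classifier. This is a simplified illustration; the actual classifyError() also consults statusCodeOverrides and the configured default4xxBehavior / default5xxBehavior:

```typescript
type ErrorClassification = 'success' | 'rateLimit' | 'transient' | 'permanent';

// Illustrative classifier matching the table above (null = network
// failure with no HTTP response). Not the SDK's actual classifyError().
function classifyStatus(status: number | null): ErrorClassification {
  if (status === null) return 'transient';             // network error
  if (status >= 200 && status < 300) return 'success';
  if (status === 429) return 'rateLimit';
  if (status >= 500 || [408, 410, 460].includes(status)) return 'transient';
  return 'permanent';                                   // remaining 4xx
}
```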

Implementation details

Error aggregation - When multiple batches are uploaded in parallel and multiple fail:

  • Retry count increments by 1 (not by the number of failed batches)
  • For multiple 429s, uses the longest Retry-After value
  • Successful events are dequeued even if some batches failed
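A sketch of the aggregation step, with a hypothetical BatchResult shape (field names are illustrative, not the SDK's actual types):

```typescript
// Hypothetical per-batch upload result and the aggregation over
// parallel batches; field names are illustrative.
interface BatchResult {
  status: number | null;       // null for a network failure
  retryAfterSeconds?: number;  // parsed from Retry-After on a 429
  messageIds: string[];
}

interface ErrorAggregation {
  has429: boolean;
  hasTransientError: boolean;
  longestRetryAfter: number;
  successfulMessageIds: string[];
  permanentErrorMessageIds: string[];
}

function aggregateErrors(results: BatchResult[]): ErrorAggregation {
  const agg: ErrorAggregation = {
    has429: false,
    hasTransientError: false,
    longestRetryAfter: 0,
    successfulMessageIds: [],
    permanentErrorMessageIds: [],
  };
  for (const r of results) {
    const s = r.status;
    if (s !== null && s >= 200 && s < 300) {
      agg.successfulMessageIds.push(...r.messageIds);
    } else if (s === 429) {
      agg.has429 = true;
      // Longest Retry-After wins across parallel batches.
      agg.longestRetryAfter = Math.max(agg.longestRetryAfter, r.retryAfterSeconds ?? 0);
    } else if (s === null || s >= 500 || [408, 410, 460].includes(s)) {
      agg.hasTransientError = true; // events stay queued for retry
    } else {
      agg.permanentErrorMessageIds.push(...r.messageIds); // dropped
    }
  }
  return agg;
}
```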

MessageId-based dequeue - Events are removed from the queue by messageId rather than by object reference, so removal stays stable when events are re-chunked into different batches.
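In sketch form (a hypothetical helper, not the queue plugin's actual signature):

```typescript
// Illustrative messageId-based dequeue: filter the queue by id rather
// than by object reference, so re-chunked copies still match.
function dequeueByMessageIds(
  queue: { messageId: string }[],
  ids: string[],
): { messageId: string }[] {
  const drop = new Set(ids);
  return queue.filter(e => !drop.has(e.messageId));
}
```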

Flush serialization - Concurrent flush calls are serialized to prevent uploading the same events multiple times.

Thread safety - State machine updates use atomic dispatch to prevent race conditions from async interleaving.
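One common way to get this serialization is to chain each flush onto the previous one's promise. The sketch below shows the pattern, not the SDK's exact code; the same chaining idea can serialize state-machine updates ("atomic dispatch") so parallel batch failures cannot interleave:

```typescript
// Sketch of serializing concurrent async calls by chaining each task
// onto a shared promise tail.
class FlushSerializer {
  private tail: Promise<void> = Promise.resolve();

  run(task: () => Promise<void>): Promise<void> {
    // Each call waits for the previous one, even if it rejected.
    const next = this.tail.then(task, task);
    this.tail = next.catch(() => undefined); // keep the chain alive on errors
    return next;
  }
}
```

With this in place, a second flush triggered while one is in flight simply queues behind it instead of re-uploading the same events.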

Configuration

The feature is opt-in via the httpConfig parameter:

const client = createClient({
  writeKey: 'YOUR_WRITE_KEY',
  httpConfig: {
    rateLimitConfig: {
      enabled: true,
      maxRetryCount: 10,
      maxRetryInterval: 1800,      // max 30min retry-after
      maxRateLimitDuration: 86400, // give up after 24 hours
    },
    backoffConfig: {
      enabled: true,
      maxRetryCount: 10,
      baseBackoffInterval: 0.5,
      maxBackoffInterval: 30,
      maxTotalBackoffDuration: 86400,
      jitterPercent: 25,
      default4xxBehavior: 'drop',
      default5xxBehavior: 'retry',
      statusCodeOverrides: {},
    },
  },
});

If httpConfig is not provided, behavior is unchanged from current implementation.

Testing

All tests pass (69 test suites, 422 tests). Added comprehensive tests for retry manager state transitions, backoff calculation, and error aggregation logic.

@abueide abueide changed the title from "feat: TAPI-compliant error handling with parallel batch processing" to "TAPI error handling with parallel batches" on Mar 9, 2026
abueide and others added 17 commits March 9, 2026 15:19
Add the UploadStateMachine component for managing global rate limiting
state for 429 responses, along with supporting types and config validation.

Components:
- RateLimitConfig, UploadStateData, HttpConfig types
- validateRateLimitConfig with SDD-specified bounds
- UploadStateMachine with canUpload/handle429/reset/getGlobalRetryCount
- Core test suite (10 tests) and test helpers
- backoff/index.ts barrel export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improvements to UploadStateMachine:
- Add comprehensive JSDoc comments for all public methods
- Add input validation in handle429() for negative/large retryAfterSeconds
- Add logging when transitioning from RATE_LIMITED to READY
- Add edge case tests for negative, zero, and very large retry values
- Fix linting issues (template literal expressions, unsafe assignments)

All 13 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add global transient error backoff manager that replaces the per-batch
BatchUploadManager. Uses same exponential backoff formula from the SDD
but tracks state globally rather than per-batch, which aligns with the
RN SDK's queue model where batch identities are ephemeral.

Components:
- BackoffConfig, BackoffStateData types
- Expanded HttpConfig to include backoffConfig
- validateBackoffConfig with SDD-specified bounds
- BackoffManager with canRetry/handleTransientError/reset/getRetryCount
- Core test suite (12 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Manager

Improvements to BackoffManager:
- Add comprehensive JSDoc comments for all public methods
- Add logging when transitioning from BACKING_OFF to READY
- Fix template literal expression errors with unknown types
- Add edge case tests for multiple status codes and long durations
- Fix linting issues (unsafe assignments in mocks)

All 14 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CRITICAL FIX: Fixed off-by-one error in exponential backoff calculation
- Was using newRetryCount (retry 1 → 2^1 = 2s), now uses state.retryCount (retry 1 → 2^0 = 0.5s)
- Now correctly implements SDD progression: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s

DOCUMENTATION: Added design rationale for global vs per-batch backoff
- Documented deviation from SDD's per-batch approach
- Explained RN SDK architecture constraints (no stable batch identities)
- Provided rationale: equivalent in practice during TAPI outages

TESTS: Updated all tests to verify SDD-compliant behavior
- Fixed exponential progression test (was testing wrong values)
- Added comprehensive SDD formula validation test (7 retry progression)
- Fixed jitter test to match new 0.5s base delay
- All 15 tests pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add classifyError and parseRetryAfter functions for TAPI error handling,
plus production-ready default HTTP configuration per the SDD.

Components:
- ErrorClassification type
- classifyError() with SDD precedence: overrides -> 429 special -> defaults -> permanent
- parseRetryAfter() supporting seconds and HTTP-date formats
- defaultHttpConfig with SDD defaults (rate limit + backoff configs)
- maxPendingEvents preserved (used by analytics.ts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Improvements:
- Add comprehensive JSDoc comments for classifyError and parseRetryAfter
- Create comprehensive test suite (33 tests) covering all edge cases
- Test SDD-specified error code behavior (408, 410, 429, 460, 501, 505)
- Test override precedence and default behaviors
- Test Retry-After parsing (seconds and HTTP-date formats)
- Test edge cases (negative codes, invalid inputs, past dates)

All 33 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add validation to reject negative seconds in parseRetryAfter()
- Update test to verify negative values are handled correctly
- Negative strings fall through to date parsing (acceptable behavior)

All 33 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Refactor UploadStateMachine.handle429() to use atomic dispatch
- Refactor BackoffManager.handleTransientError() to use atomic dispatch
- Prevents lost retry count increments from concurrent errors
- Takes longest wait time when already rate limited (most conservative)

Fixes async interleaving issues when multiple batches fail in parallel.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ggregation

Parallel batch processing (PR #3):
- Upload batches in parallel with Promise.all() for performance
- Collect structured BatchResult for each batch
- Aggregate errors by type (success, 429, transient, permanent)
- Extract messageIds from each batch

State machine integration (PR #4):
- Check canUpload() and canRetry() gates before sending
- Handle 429 ONCE per flush with longest retry-after
- Handle transient errors ONCE per flush
- Dequeue events by messageId (success + permanent)
- Reset state machines on success
- Comprehensive logging

Key design decisions:
- Single retry increment per flush (not per batch)
- Longest 429 wins (most conservative)
- MessageId-based dequeue (stable across re-chunking)
- Parallel processing preserved (performance)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Move SDD document to notes/
- Move sdk-research/ to notes/
- Move tapi-*.md files to notes/
- Move e2e-cli to notes/ (testing only)
- Add notes/ to .gitignore

These are working notes/research and should not be in version control.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add httpConfig?: HttpConfig to Config type
- Remove unused checkResponseForErrors import
- Pass correct flattened config to classifyError()
- TypeScript now compiles cleanly

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…yManager

- Merge duplicate state machine code into single RetryManager
- handle429() uses server-provided Retry-After wait time
- handleTransientError() uses calculated exponential backoff
- Single unified state machine with READY/RATE_LIMITED/BACKING_OFF states
- Reduces code duplication (~400 LOC to ~250 LOC)
- All tests passing
Will clean up wiki/ in a separate PR
- Fix strict-boolean-expressions for nullable config checks
- Fix nullable string check in QueueFlushingPlugin
- Run prettier on all modified files
@abueide abueide force-pushed the feature/tapi-backoff-core branch from ead5bfa to c7ba971 on March 9, 2026 at 20:19
abueide and others added 3 commits March 9, 2026 15:24
These have been replaced by RetryManager which consolidates both
into a single state machine. Added comprehensive RetryManager tests
covering both 429 rate limiting and transient error scenarios.
- Update RetryManager class comment to not say 'unified'
- Update backoff/index.ts to export RetryManager only
- X-Retry-Count header on all uploads
- Event age pruning (drop events older than maxTotalBackoffDuration)
- Ignore 429 during BACKING_OFF state
- Server-side httpConfig from CDN with default merge and validation
- Config validation with relational clamping constraints
- Retry strategy (eager/lazy) for concurrent batch wait consolidation
- Auto-flush on retry ready with timer management
- Comprehensive JSDoc documenting architecture deviations from SDD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abueide abueide changed the title from "TAPI error handling with parallel batches" to "feat(core): implement TAPI backoff and retry handling" on Mar 9, 2026
Replace `!!id` with explicit nullish/empty check to satisfy
@typescript-eslint/strict-boolean-expressions rule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>