feat(core): implement TAPI backoff and retry handling#1154

Draft
abueide wants to merge 21 commits into master from feature/tapi-backoff-core

Conversation

@abueide abueide commented Mar 9, 2026

Summary

Implements TAPI error handling specification for the React Native SDK. Adds proper retry logic for rate limits (429), transient errors (5xx), and handling of permanent errors (4xx).

Background

Currently the SDK has minimal error handling: failed events stay in the queue and are retried indefinitely. This causes several issues:

  • No backoff for transient errors
  • No handling of 429 rate limits
  • Permanent errors (400, 404) never get dropped and block the queue
  • No retry limits

Solution

Added a RetryManager that handles retry logic for both rate limits and transient errors.

RetryManager - Manages retry state for uploads

  • Handles 429 rate limit responses using server-provided Retry-After wait time
  • Handles transient errors (5xx, network failures) using exponential backoff
  • Three states: READY, RATE_LIMITED, BACKING_OFF
  • Blocks all uploads when in RATE_LIMITED or BACKING_OFF state
  • Tracks retry count and enforces max retry limits
  • State persisted across app restarts
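The state machine described above can be sketched roughly as follows. The class shape and names are illustrative only; the real implementation also persists state across restarts and enforces the maxRateLimitDuration / maxTotalBackoffDuration caps:

```typescript
// Simplified sketch of the RetryManager described above; names and
// signatures are illustrative, not the SDK's actual API.
type RetryState = 'READY' | 'RATE_LIMITED' | 'BACKING_OFF';

class RetryManager {
  private state: RetryState = 'READY';
  private retryCount = 0;
  private blockedUntil = 0; // epoch ms

  constructor(
    private maxRetryCount = 10,
    private baseBackoffInterval = 0.5, // seconds
    private maxBackoffInterval = 30,   // seconds
    private jitterPercent = 25,
  ) {}

  /** Gate checked by sendEvents(); true only when uploads may proceed. */
  canRetry(): boolean {
    if (this.state === 'READY') return true;
    if (Date.now() >= this.blockedUntil) {
      this.state = 'READY'; // rate-limit/backoff window expired
      return true;
    }
    return false;
  }

  /** 429: wait exactly as long as the server's Retry-After says. */
  handle429(retryAfterSeconds: number): void {
    this.retryCount += 1;
    this.state = 'RATE_LIMITED';
    this.blockedUntil = Date.now() + retryAfterSeconds * 1000;
  }

  /** 5xx / network failure: exponential backoff with jitter.
   *  Uses the pre-increment retryCount, giving 0.5s, 1s, 2s, 4s, ... */
  handleTransientError(): void {
    const exp = this.baseBackoffInterval * 2 ** this.retryCount;
    const capped = Math.min(exp, this.maxBackoffInterval);
    const jitter = 1 + (Math.random() * 2 - 1) * (this.jitterPercent / 100);
    this.retryCount += 1;
    this.state = 'BACKING_OFF';
    this.blockedUntil = Date.now() + capped * jitter * 1000;
  }

  exceededMaxRetries(): boolean {
    return this.retryCount >= this.maxRetryCount;
  }

  reset(): void {
    this.state = 'READY';
    this.retryCount = 0;
    this.blockedUntil = 0;
  }
}
```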

How it works with flush policies

Flush policies (TimerFlushPolicy, CountFlushPolicy) trigger flushes based on their criteria (time interval, event count). The RetryManager acts as a gate in sendEvents():

async sendEvents(events) {
  // Check if blocked by rate limit or backoff
  if (!await retryManager.canRetry()) return;

  // Upload batches in parallel
  const batches = chunk(events, 100, 500 * 1024); // ≤100 events or ~500 KB per batch
  const results = await Promise.all(batches.map(batch => uploadBatch(batch)));
  
  // Aggregate errors from all batches
  const aggregation = aggregateErrors(results);

  // Update retry manager based on errors
  if (aggregation.has429) {
    await retryManager.handle429(aggregation.longestRetryAfter);
  }
  if (aggregation.hasTransientError) {
    await retryManager.handleTransientError();
  }

  // Remove successful and permanently failed events
  if (aggregation.successfulMessageIds.length > 0) {
    await queuePlugin.dequeueByMessageIds(aggregation.successfulMessageIds);
    await retryManager.reset();
  }
  if (aggregation.permanentErrorMessageIds.length > 0) {
    await queuePlugin.dequeueByMessageIds(aggregation.permanentErrorMessageIds);
  }
}

When uploads are blocked:

  • Flush policies continue triggering at their normal interval
  • sendEvents() checks retryManager.canRetry() and returns early if blocked
  • Next flush attempt occurs on the next policy trigger
  • Once the backoff/rate limit expires, the flush succeeds

Error classification

| Status Code | Classification | Action |
| --- | --- | --- |
| 200 | Success | Dequeue events, reset retry count |
| 429 | Rate limit | Keep in queue, increment retry count, wait for Retry-After |
| 5xx, 408, 410, 460 | Transient | Keep in queue, increment retry count, exponential backoff |
| 400, 401, 403, 404 | Permanent | Drop events immediately |
| Network error | Transient | Keep in queue, increment retry count, exponential backoff |
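The table can be sketched as a small classifier. This is a simplified illustration; the actual classifyError() also consults statusCodeOverrides and the configured default4xxBehavior / default5xxBehavior:

```typescript
type ErrorClassification = 'success' | 'rateLimit' | 'transient' | 'permanent';

// Illustrative classifier matching the table above (null = network
// failure with no HTTP response). Not the SDK's actual classifyError().
function classifyStatus(status: number | null): ErrorClassification {
  if (status === null) return 'transient';             // network error
  if (status >= 200 && status < 300) return 'success';
  if (status === 429) return 'rateLimit';
  if (status >= 500 || [408, 410, 460].includes(status)) return 'transient';
  return 'permanent';                                   // remaining 4xx
}
```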

Implementation details

Error aggregation - When multiple batches are uploaded in parallel and multiple fail:

  • Retry count increments by 1 (not by the number of failed batches)
  • For multiple 429s, uses the longest Retry-After value
  • Successful events are dequeued even if some batches failed
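A sketch of the aggregation step, with a hypothetical BatchResult shape (field names are illustrative, not the SDK's actual types):

```typescript
// Hypothetical per-batch upload result and the aggregation over
// parallel batches; field names are illustrative.
interface BatchResult {
  status: number | null;       // null for a network failure
  retryAfterSeconds?: number;  // parsed from Retry-After on a 429
  messageIds: string[];
}

interface ErrorAggregation {
  has429: boolean;
  hasTransientError: boolean;
  longestRetryAfter: number;
  successfulMessageIds: string[];
  permanentErrorMessageIds: string[];
}

function aggregateErrors(results: BatchResult[]): ErrorAggregation {
  const agg: ErrorAggregation = {
    has429: false,
    hasTransientError: false,
    longestRetryAfter: 0,
    successfulMessageIds: [],
    permanentErrorMessageIds: [],
  };
  for (const r of results) {
    const s = r.status;
    if (s !== null && s >= 200 && s < 300) {
      agg.successfulMessageIds.push(...r.messageIds);
    } else if (s === 429) {
      agg.has429 = true;
      // Longest Retry-After wins across parallel batches.
      agg.longestRetryAfter = Math.max(agg.longestRetryAfter, r.retryAfterSeconds ?? 0);
    } else if (s === null || s >= 500 || [408, 410, 460].includes(s)) {
      agg.hasTransientError = true; // events stay queued for retry
    } else {
      agg.permanentErrorMessageIds.push(...r.messageIds); // dropped
    }
  }
  return agg;
}
```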

MessageId-based dequeue - Events are removed from the queue by messageId rather than by object reference, so removal stays stable when events are re-chunked into different batches.
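In sketch form (a hypothetical helper, not the queue plugin's actual signature):

```typescript
// Illustrative messageId-based dequeue: filter the queue by id rather
// than by object reference, so re-chunked copies still match.
function dequeueByMessageIds(
  queue: { messageId: string }[],
  ids: string[],
): { messageId: string }[] {
  const drop = new Set(ids);
  return queue.filter(e => !drop.has(e.messageId));
}
```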

Flush serialization - Concurrent flush calls are serialized to prevent uploading the same events multiple times.

Thread safety - State machine updates use atomic dispatch to prevent race conditions from async interleaving.
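One common way to get this serialization is to chain each flush onto the previous one's promise. The sketch below shows the pattern, not the SDK's exact code; the same chaining idea can serialize state-machine updates ("atomic dispatch") so parallel batch failures cannot interleave:

```typescript
// Sketch of serializing concurrent async calls by chaining each task
// onto a shared promise tail.
class FlushSerializer {
  private tail: Promise<void> = Promise.resolve();

  run(task: () => Promise<void>): Promise<void> {
    // Each call waits for the previous one, even if it rejected.
    const next = this.tail.then(task, task);
    this.tail = next.catch(() => undefined); // keep the chain alive on errors
    return next;
  }
}
```

With this in place, a second flush triggered while one is in flight simply queues behind it instead of re-uploading the same events.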

Configuration

The feature is opt-in via the httpConfig parameter:

const client = createClient({
  writeKey: 'YOUR_WRITE_KEY',
  httpConfig: {
    rateLimitConfig: {
      enabled: true,
      maxRetryCount: 10,
      maxRetryInterval: 1800,      // max 30min retry-after
      maxRateLimitDuration: 86400, // give up after 24 hours
    },
    backoffConfig: {
      enabled: true,
      maxRetryCount: 10,
      baseBackoffInterval: 0.5,
      maxBackoffInterval: 30,
      maxTotalBackoffDuration: 86400,
      jitterPercent: 25,
      default4xxBehavior: 'drop',
      default5xxBehavior: 'retry',
      statusCodeOverrides: {},
    },
  },
});

If httpConfig is not provided, behavior is unchanged from current implementation.

Testing

All tests pass (69 test suites, 422 tests). Added comprehensive tests for retry manager state transitions, backoff calculation, and error aggregation logic.

@abueide abueide changed the title from "feat: TAPI-compliant error handling with parallel batch processing" to "TAPI error handling with parallel batches" on Mar 9, 2026
abueide and others added 17 commits March 9, 2026 15:19
Add the UploadStateMachine component for managing global rate limiting
state for 429 responses, along with supporting types and config validation.

Components:
- RateLimitConfig, UploadStateData, HttpConfig types
- validateRateLimitConfig with SDD-specified bounds
- UploadStateMachine with canUpload/handle429/reset/getGlobalRetryCount
- Core test suite (10 tests) and test helpers
- backoff/index.ts barrel export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improvements to UploadStateMachine:
- Add comprehensive JSDoc comments for all public methods
- Add input validation in handle429() for negative/large retryAfterSeconds
- Add logging when transitioning from RATE_LIMITED to READY
- Add edge case tests for negative, zero, and very large retry values
- Fix linting issues (template literal expressions, unsafe assignments)

All 13 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add global transient error backoff manager that replaces the per-batch
BatchUploadManager. Uses same exponential backoff formula from the SDD
but tracks state globally rather than per-batch, which aligns with the
RN SDK's queue model where batch identities are ephemeral.

Components:
- BackoffConfig, BackoffStateData types
- Expanded HttpConfig to include backoffConfig
- validateBackoffConfig with SDD-specified bounds
- BackoffManager with canRetry/handleTransientError/reset/getRetryCount
- Core test suite (12 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Manager

Improvements to BackoffManager:
- Add comprehensive JSDoc comments for all public methods
- Add logging when transitioning from BACKING_OFF to READY
- Fix template literal expression errors with unknown types
- Add edge case tests for multiple status codes and long durations
- Fix linting issues (unsafe assignments in mocks)

All 14 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CRITICAL FIX: Fixed off-by-one error in exponential backoff calculation
- Was using newRetryCount (retry 1 → 2^1 = 2s), now uses state.retryCount (retry 1 → 2^0 = 0.5s)
- Now correctly implements SDD progression: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s

DOCUMENTATION: Added design rationale for global vs per-batch backoff
- Documented deviation from SDD's per-batch approach
- Explained RN SDK architecture constraints (no stable batch identities)
- Provided rationale: equivalent in practice during TAPI outages

TESTS: Updated all tests to verify SDD-compliant behavior
- Fixed exponential progression test (was testing wrong values)
- Added comprehensive SDD formula validation test (7 retry progression)
- Fixed jitter test to match new 0.5s base delay
- All 15 tests pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add classifyError and parseRetryAfter functions for TAPI error handling,
plus production-ready default HTTP configuration per the SDD.

Components:
- ErrorClassification type
- classifyError() with SDD precedence: overrides -> 429 special -> defaults -> permanent
- parseRetryAfter() supporting seconds and HTTP-date formats
- defaultHttpConfig with SDD defaults (rate limit + backoff configs)
- maxPendingEvents preserved (used by analytics.ts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Improvements:
- Add comprehensive JSDoc comments for classifyError and parseRetryAfter
- Create comprehensive test suite (33 tests) covering all edge cases
- Test SDD-specified error code behavior (408, 410, 429, 460, 501, 505)
- Test override precedence and default behaviors
- Test Retry-After parsing (seconds and HTTP-date formats)
- Test edge cases (negative codes, invalid inputs, past dates)

All 33 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add validation to reject negative seconds in parseRetryAfter()
- Update test to verify negative values are handled correctly
- Negative strings fall through to date parsing (acceptable behavior)

All 33 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Refactor UploadStateMachine.handle429() to use atomic dispatch
- Refactor BackoffManager.handleTransientError() to use atomic dispatch
- Prevents lost retry count increments from concurrent errors
- Takes longest wait time when already rate limited (most conservative)

Fixes async interleaving issues when multiple batches fail in parallel.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ggregation

Parallel batch processing (PR #3):
- Upload batches in parallel with Promise.all() for performance
- Collect structured BatchResult for each batch
- Aggregate errors by type (success, 429, transient, permanent)
- Extract messageIds from each batch

State machine integration (PR #4):
- Check canUpload() and canRetry() gates before sending
- Handle 429 ONCE per flush with longest retry-after
- Handle transient errors ONCE per flush
- Dequeue events by messageId (success + permanent)
- Reset state machines on success
- Comprehensive logging

Key design decisions:
- Single retry increment per flush (not per batch)
- Longest 429 wins (most conservative)
- MessageId-based dequeue (stable across re-chunking)
- Parallel processing preserved (performance)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Move SDD document to notes/
- Move sdk-research/ to notes/
- Move tapi-*.md files to notes/
- Move e2e-cli to notes/ (testing only)
- Add notes/ to .gitignore

These are working notes/research and should not be in version control.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add httpConfig?: HttpConfig to Config type
- Remove unused checkResponseForErrors import
- Pass correct flattened config to classifyError()
- TypeScript now compiles cleanly

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…yManager

- Merge duplicate state machine code into single RetryManager
- handle429() uses server-provided Retry-After wait time
- handleTransientError() uses calculated exponential backoff
- Single unified state machine with READY/RATE_LIMITED/BACKING_OFF states
- Reduces code duplication (~400 LOC to ~250 LOC)
- All tests passing
Will clean up wiki/ in a separate PR
- Fix strict-boolean-expressions for nullable config checks
- Fix nullable string check in QueueFlushingPlugin
- Run prettier on all modified files
@abueide abueide force-pushed the feature/tapi-backoff-core branch from ead5bfa to c7ba971 on March 9, 2026 at 20:19
abueide and others added 3 commits March 9, 2026 15:24
These have been replaced by RetryManager which consolidates both
into a single state machine. Added comprehensive RetryManager tests
covering both 429 rate limiting and transient error scenarios.
- Update RetryManager class comment to not say 'unified'
- Update backoff/index.ts to export RetryManager only
- X-Retry-Count header on all uploads
- Event age pruning (drop events older than maxTotalBackoffDuration)
- Ignore 429 during BACKING_OFF state
- Server-side httpConfig from CDN with default merge and validation
- Config validation with relational clamping constraints
- Retry strategy (eager/lazy) for concurrent batch wait consolidation
- Auto-flush on retry ready with timer management
- Comprehensive JSDoc documenting architecture deviations from SDD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abueide abueide changed the title from "TAPI error handling with parallel batches" to "feat(core): implement TAPI backoff and retry handling" on Mar 9, 2026
Replace `!!id` with explicit nullish/empty check to satisfy
@typescript-eslint/strict-boolean-expressions rule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>