feat(core): implement TAPI backoff and retry handling #1154
Draft
Add the UploadStateMachine component for managing global rate limiting state for 429 responses, along with supporting types and config validation.
Components:
- RateLimitConfig, UploadStateData, HttpConfig types
- validateRateLimitConfig with SDD-specified bounds
- UploadStateMachine with canUpload/handle429/reset/getGlobalRetryCount
- Core test suite (10 tests) and test helpers
- backoff/index.ts barrel export
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Improvements to UploadStateMachine:
- Add comprehensive JSDoc comments for all public methods
- Add input validation in handle429() for negative/large retryAfterSeconds
- Add logging when transitioning from RATE_LIMITED to READY
- Add edge case tests for negative, zero, and very large retry values
- Fix linting issues (template literal expressions, unsafe assignments)
All 13 tests pass. Ready for review.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add global transient error backoff manager that replaces the per-batch BatchUploadManager. Uses same exponential backoff formula from the SDD but tracks state globally rather than per-batch, which aligns with the RN SDK's queue model where batch identities are ephemeral.
Components:
- BackoffConfig, BackoffStateData types
- Expanded HttpConfig to include backoffConfig
- validateBackoffConfig with SDD-specified bounds
- BackoffManager with canRetry/handleTransientError/reset/getRetryCount
- Core test suite (12 tests)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…Manager
Improvements to BackoffManager:
- Add comprehensive JSDoc comments for all public methods
- Add logging when transitioning from BACKING_OFF to READY
- Fix template literal expression errors with unknown types
- Add edge case tests for multiple status codes and long durations
- Fix linting issues (unsafe assignments in mocks)
All 14 tests pass. Ready for review.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

CRITICAL FIX: Fixed off-by-one error in exponential backoff calculation
- Was using newRetryCount (retry 1 → 2^1 = 2s), now uses state.retryCount (retry 1 → 2^0 = 0.5s)
- Now correctly implements SDD progression: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s
DOCUMENTATION: Added design rationale for global vs per-batch backoff
- Documented deviation from SDD's per-batch approach
- Explained RN SDK architecture constraints (no stable batch identities)
- Provided rationale: equivalent in practice during TAPI outages
TESTS: Updated all tests to verify SDD-compliant behavior
- Fixed exponential progression test (was testing wrong values)
- Added comprehensive SDD formula validation test (7 retry progression)
- Fixed jitter test to match new 0.5s base delay
- All 15 tests pass
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add classifyError and parseRetryAfter functions for TAPI error handling, plus production-ready default HTTP configuration per the SDD.
Components:
- ErrorClassification type
- classifyError() with SDD precedence: overrides -> 429 special -> defaults -> permanent
- parseRetryAfter() supporting seconds and HTTP-date formats
- defaultHttpConfig with SDD defaults (rate limit + backoff configs)
- maxPendingEvents preserved (used by analytics.ts)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tion
Improvements:
- Add comprehensive JSDoc comments for classifyError and parseRetryAfter
- Create comprehensive test suite (33 tests) covering all edge cases
- Test SDD-specified error code behavior (408, 410, 429, 460, 501, 505)
- Test override precedence and default behaviors
- Test Retry-After parsing (seconds and HTTP-date formats)
- Test edge cases (negative codes, invalid inputs, past dates)
All 33 tests pass. Ready for review.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add validation to reject negative seconds in parseRetryAfter()
- Update test to verify negative values are handled correctly
- Negative strings fall through to date parsing (acceptable behavior)
All 33 tests pass. Ready for review.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Refactor UploadStateMachine.handle429() to use atomic dispatch
- Refactor BackoffManager.handleTransientError() to use atomic dispatch
- Prevents lost retry count increments from concurrent errors
- Takes longest wait time when already rate limited (most conservative)
Fixes async interleaving issues when multiple batches fail in parallel.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…ggregation
Parallel batch processing (PR #3):
- Upload batches in parallel with Promise.all() for performance
- Collect structured BatchResult for each batch
- Aggregate errors by type (success, 429, transient, permanent)
- Extract messageIds from each batch
State machine integration (PR #4):
- Check canUpload() and canRetry() gates before sending
- Handle 429 ONCE per flush with longest retry-after
- Handle transient errors ONCE per flush
- Dequeue events by messageId (success + permanent)
- Reset state machines on success
- Comprehensive logging
Key design decisions:
- Single retry increment per flush (not per batch)
- Longest 429 wins (most conservative)
- MessageId-based dequeue (stable across re-chunking)
- Parallel processing preserved (performance)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Move SDD document to notes/
- Move sdk-research/ to notes/
- Move tapi-*.md files to notes/
- Move e2e-cli to notes/ (testing only)
- Add notes/ to .gitignore
These are working notes/research and should not be in version control.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add httpConfig?: HttpConfig to Config type
- Remove unused checkResponseForErrors import
- Pass correct flattened config to classifyError()
- TypeScript now compiles cleanly
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…yManager
- Merge duplicate state machine code into single RetryManager
- handle429() uses server-provided Retry-After wait time
- handleTransientError() uses calculated exponential backoff
- Single unified state machine with READY/RATE_LIMITED/BACKING_OFF states
- Reduces code duplication (~400 LOC to ~250 LOC)
- All tests passing

Will clean up wiki/ in a separate PR

- Fix strict-boolean-expressions for nullable config checks
- Fix nullable string check in QueueFlushingPlugin
- Run prettier on all modified files
ead5bfa to c7ba971
These have been replaced by RetryManager which consolidates both into a single state machine. Added comprehensive RetryManager tests covering both 429 rate limiting and transient error scenarios.
- Update RetryManager class comment to not say 'unified'
- Update backoff/index.ts to export RetryManager only

- X-Retry-Count header on all uploads
- Event age pruning (drop events older than maxTotalBackoffDuration)
- Ignore 429 during BACKING_OFF state
- Server-side httpConfig from CDN with default merge and validation
- Config validation with relational clamping constraints
- Retry strategy (eager/lazy) for concurrent batch wait consolidation
- Auto-flush on retry ready with timer management
- Comprehensive JSDoc documenting architecture deviations from SDD
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `!!id` with explicit nullish/empty check to satisfy @typescript-eslint/strict-boolean-expressions rule. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Mar 10, 2026
Summary
Implements the TAPI error handling specification for the React Native SDK. Adds retry logic for rate limits (429) and transient errors (5xx), and handling for permanent errors (4xx).
Background
Currently the SDK has minimal error handling - failed events stay in the queue and retry indefinitely. This causes several issues:
Solution
Added a RetryManager that handles retry logic for both rate limits and transient errors.
RetryManager - Manages retry state for uploads
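The commit messages in this PR describe the backoff progression as 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, computed from the retry count before it is incremented. A minimal sketch of such a calculation follows. The config shape, field names, and the 300s cap are assumptions for illustration, and jitter is left out.

```typescript
// Hypothetical config shape; field names are illustrative, not the SDK's.
interface BackoffSketchConfig {
  baseDelaySeconds: number; // assumed 0.5, matching the progression above
  maxDelaySeconds: number; // assumed cap, chosen only for this example
}

// Delay before the next retry, given how many retries have already happened.
// Uses the count *before* incrementing, so the first retry waits base * 2^0.
function backoffDelaySeconds(
  retryCount: number,
  config: BackoffSketchConfig
): number {
  const uncapped = config.baseDelaySeconds * Math.pow(2, retryCount);
  return Math.min(uncapped, config.maxDelaySeconds);
}

const sketchConfig: BackoffSketchConfig = {
  baseDelaySeconds: 0.5,
  maxDelaySeconds: 300,
};

// Delays for retry counts 0 through 6: 0.5, 1, 2, 4, 8, 16, 32 seconds.
const progression = [0, 1, 2, 3, 4, 5, 6].map((n) =>
  backoffDelaySeconds(n, sketchConfig)
);
```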
How it works with flush policies
Flush policies (TimerFlushPolicy, CountFlushPolicy) trigger flushes based on their criteria (time interval, event count). The RetryManager acts as a gate in sendEvents():
When uploads are blocked:
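A minimal sketch of the gate idea, using the READY/RATE_LIMITED/BACKING_OFF states named in the commit messages. The `waitUntilMs` field, the function names, and the choice to keep the retry count until a successful upload are assumptions, not the PR's actual code.

```typescript
type RetryState = "READY" | "RATE_LIMITED" | "BACKING_OFF";

interface RetryStateData {
  state: RetryState;
  waitUntilMs: number; // hypothetical: timestamp until which uploads are blocked
  retryCount: number;
}

// Returns the (possibly updated) state and whether an upload may proceed now.
function checkGate(
  data: RetryStateData,
  nowMs: number
): { data: RetryStateData; allowed: boolean } {
  if (data.state === "READY") {
    return { data, allowed: true };
  }
  if (nowMs >= data.waitUntilMs) {
    // The wait has elapsed: go back to READY. The retry count is kept here
    // on the assumption that it is reset only after a successful upload.
    return { data: { ...data, state: "READY" }, allowed: true };
  }
  return { data, allowed: false };
}

// Hypothetical sendEvents-style caller: a flush policy may trigger this at
// any time, but the upload is skipped while the gate is closed.
function sendEventsSketch(
  data: RetryStateData,
  nowMs: number,
  upload: () => void
): RetryStateData {
  const gate = checkGate(data, nowMs);
  if (gate.allowed) {
    upload();
  }
  return gate.data;
}
```

In this sketch a blocked flush is a no-op: events remain queued and the next policy-triggered flush checks the gate again.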
Error classification
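The commit messages give the classification precedence as overrides, then the 429 special case, then defaults, then permanent, and mention Retry-After parsing for both seconds and HTTP-date formats. The sketch below illustrates that ordering. The type names, the override config shape, the default rule (5xx is transient, everything else permanent), and clamping past dates to zero are all assumptions.

```typescript
type ErrorKind = "rate_limit" | "transient" | "permanent";
type OverrideBehavior = "retry" | "drop";

interface ClassifySketchConfig {
  // Hypothetical per-status overrides, checked before anything else.
  statusCodeOverrides: Record<number, OverrideBehavior>;
}

function classifyErrorSketch(
  status: number,
  config: ClassifySketchConfig
): ErrorKind {
  const override = config.statusCodeOverrides[status];
  if (override !== undefined) {
    return override === "retry" ? "transient" : "permanent";
  }
  if (status === 429) return "rate_limit";
  // Assumed default: 5xx responses are retryable.
  if (status >= 500 && status <= 599) return "transient";
  return "permanent";
}

// Parses a Retry-After value into seconds; returns undefined if unusable.
// A leading minus sign fails the digits-only check, so negative seconds
// are never accepted as a delay.
function parseRetryAfterSketch(
  value: string,
  nowMs: number
): number | undefined {
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10);
  }
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) {
    return undefined;
  }
  // Assumption: dates in the past mean "retry now".
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}
```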
Implementation details
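Error aggregation - When multiple batches are uploaded in parallel and more than one fails, errors are aggregated per flush: a 429 is handled once using the longest Retry-After value, and transient errors are handled once, so the retry count increments once per flush rather than once per batch.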
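A minimal sketch of per-flush aggregation, based on the design decisions listed in the commit messages (longest 429 wins, one retry increment per flush, dequeue on success and permanent failure). The result and summary shapes are hypothetical.

```typescript
// Hypothetical per-batch result; the PR's BatchResult may look different.
type BatchResultSketch =
  | { kind: "success"; messageIds: string[] }
  | { kind: "rate_limit"; messageIds: string[]; retryAfterSeconds: number }
  | { kind: "transient"; messageIds: string[] }
  | { kind: "permanent"; messageIds: string[] };

interface FlushSummary {
  dequeueIds: string[]; // delivered or permanently failed, so safe to drop
  longestRetryAfterSeconds: number | null; // null when no batch hit a 429
  hadTransient: boolean;
}

function aggregateResults(results: BatchResultSketch[]): FlushSummary {
  const summary: FlushSummary = {
    dequeueIds: [],
    longestRetryAfterSeconds: null,
    hadTransient: false,
  };
  for (const result of results) {
    switch (result.kind) {
      case "success":
      case "permanent":
        summary.dequeueIds.push(...result.messageIds);
        break;
      case "rate_limit":
        // Keep only the most conservative (longest) wait.
        summary.longestRetryAfterSeconds = Math.max(
          summary.longestRetryAfterSeconds ?? 0,
          result.retryAfterSeconds
        );
        break;
      case "transient":
        summary.hadTransient = true;
        break;
    }
  }
  return summary;
}
```

The caller would then report at most one 429 and at most one transient error to the retry state per flush, however many batches failed.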
MessageId-based dequeue - Events are removed from the queue by messageId instead of object reference, stable across re-chunking.
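To illustrate why the identifier matters, here is a small sketch with a hypothetical event shape. Removal still works when the uploaded objects are copies of the queued ones, which matching by object reference would miss.

```typescript
// Hypothetical queued event; only messageId matters for this sketch.
interface QueuedEvent {
  messageId: string;
  payload: unknown;
}

// Returns a new queue without the events whose messageId was handled.
function dequeueByMessageId(
  queue: QueuedEvent[],
  handledIds: string[]
): QueuedEvent[] {
  const handled = new Set(handledIds);
  return queue.filter((event) => !handled.has(event.messageId));
}

const queue: QueuedEvent[] = [
  { messageId: "m1", payload: { name: "first" } },
  { messageId: "m2", payload: { name: "second" } },
  { messageId: "m3", payload: { name: "third" } },
];

// Simulate re-chunking: the uploaded batch holds copies, not the same objects.
const uploadedBatch = [queue[0], queue[2]].map((event) => ({ ...event }));
const remaining = dequeueByMessageId(
  queue,
  uploadedBatch.map((event) => event.messageId)
);
```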
Flush serialization - Concurrent flush calls are serialized to prevent uploading the same events multiple times.
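One common way to serialize async calls is to chain each one onto the previous promise. The sketch below shows that pattern. The class name is hypothetical, and this is not necessarily how the PR implements it.

```typescript
class SerializedFlusher {
  private tail: Promise<void> = Promise.resolve();
  private readonly doFlush: () => Promise<void>;

  constructor(doFlush: () => Promise<void>) {
    this.doFlush = doFlush;
  }

  // Each call starts only after every earlier call has settled.
  flush(): Promise<void> {
    const run = this.tail.then(() => this.doFlush());
    // Swallow rejections on the chain itself so one failed flush does not
    // block later ones; the caller still sees the failure through `run`.
    this.tail = run.catch(() => undefined);
    return run;
  }
}
```

With this shape, two overlapping `flush()` calls never read the same queue contents at the same time, because the second call's work does not begin until the first has finished.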
Thread safety - State machine updates use atomic dispatch to prevent race conditions from async interleaving.
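As a sketch of what "atomic dispatch" can mean in single-threaded JavaScript: the next state is computed from the current state inside one synchronous update function, instead of being read, awaited on, and written back later. The tiny store and the state shape below are hypothetical stand-ins for whatever store the SDK uses.

```typescript
interface RetryStateSketch {
  state: "READY" | "RATE_LIMITED" | "BACKING_OFF";
  waitUntilMs: number;
  retryCount: number;
}

// Minimal synchronous store: dispatch applies the update in one step, so no
// other async task can observe or overwrite a half-finished transition.
class TinyStore<T> {
  private value: T;

  constructor(initial: T) {
    this.value = initial;
  }

  dispatch(update: (current: T) => T): void {
    this.value = update(this.value);
  }

  get(): T {
    return this.value;
  }
}

function handle429Sketch(
  store: TinyStore<RetryStateSketch>,
  retryAfterSeconds: number,
  nowMs: number
): void {
  store.dispatch((current) => ({
    state: "RATE_LIMITED",
    // Take the longest wait if a later deadline is already recorded.
    waitUntilMs: Math.max(current.waitUntilMs, nowMs + retryAfterSeconds * 1000),
    // Derived from the current value, so concurrent callers cannot lose
    // an increment.
    retryCount: current.retryCount + 1,
  }));
}
```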
Configuration
The feature is opt-in via the `httpConfig` parameter. If `httpConfig` is not provided, behavior is unchanged from the current implementation.
Testing
All tests pass (69 test suites, 422 tests). Added comprehensive tests for retry manager state transitions, backoff calculation, and error aggregation logic.