From 25759c44f50db9b2926f8e3dbe6596219a219c89 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Fri, 17 Oct 2025 16:03:30 -0700 Subject: [PATCH 1/8] add exponential backoff to withTransactions API --- .../tests/README.md | 6 ++++++ .../transactions-convenient-api.md | 21 ++++++++++++++++--- 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index a797a3182f..905c14a9ba 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -41,8 +41,14 @@ If possible, drivers should implement these tests without requiring the test run the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some private API or using a mock timer. +### Retry Backoff is Enforced + +Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms +and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. + ## Changelog +- 2025-10-17: Added Backoff test. - 2024-09-06: Migrated from reStructuredText to Markdown. - 2024-02-08: Converted legacy tests to unified format. - 2021-04-29: Remove text about write concern timeouts from prose test. diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 7d1864391a..fa7db0a09d 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -99,7 +99,8 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er "TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again, drivers MUST document that the callback may be invoked multiple times (i.e. 
one additional time per retry attempt) and MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded, -drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. +drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying, +drivers MUST implement an exponential backoff with jitter following the algorithm described below. If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at any point, the driver MUST NOT retry and MUST allow `withTransaction` to propagate the error to its caller. @@ -129,7 +130,13 @@ This method should perform the following sequence of actions: 1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, jump back to step two. + less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: + 1. jitter is a random float between [0, 1) + 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed + 3. BACKOFF_INITIAL is 1ms + 4. BACKOFF_MAX is 500ms + + Then, jump back to step two. 3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return immediately. 
@@ -154,11 +161,18 @@ This method should perform the following sequence of actions: This method can be expressed by the following pseudo-code: ```typescript +var BACKOFF_INITIAL = 1 // 1ms initial backoff +var BACKOFF_MAX = 500 // 500ms max backoff withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch + var retry = 0 retryTransaction: while (true) { + if (retry > 0): + sleep(Math.random() * min(BACKOFF_INITIAL * (1.25**retry), + BACKOFF_MAX)) + retry += 1 this.startTransaction(options); // may throw on error try { @@ -324,7 +338,7 @@ exceed the user's original intention for `maxTimeMS`. The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent callbacks. -A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that +A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly; however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur @@ -356,6 +370,7 @@ provides an implementation of a technique already described in the MongoDB 4.0 d ([DRIVERS-488](https://jira.mongodb.org/browse/DRIVERS-488)). ## Changelog +- 2025-10-17: withTransaction applies exponential backoff when retrying. - 2024-09-06: Migrated from reStructuredText to Markdown. 
From bdcd2ef0d8ba136d88515865c9228614dbd32173 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Mon, 20 Oct 2025 10:10:42 -0700 Subject: [PATCH 2/8] run pre-commit --- .../tests/README.md | 2 +- .../transactions-convenient-api.md | 27 +++++++++++-------- 2 files changed, 17 insertions(+), 12 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index 905c14a9ba..dab2dfc544 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -44,7 +44,7 @@ private API or using a mock timer. ### Retry Backoff is Enforced Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms -and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. +and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. ## Changelog diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index fa7db0a09d..13c187adc5 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -99,7 +99,7 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er "TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again, drivers MUST document that the callback may be invoked multiple times (i.e. one additional time per retry attempt) and MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded, -drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. 
When retrying, +drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying, drivers MUST implement an exponential backoff with jitter following the algorithm described below. If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at @@ -129,17 +129,21 @@ This method should perform the following sequence of actions: 6. If the callback reported an error: 1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. + 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: - 1. jitter is a random float between [0, 1) - 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed - 3. BACKOFF_INITIAL is 1ms - 4. BACKOFF_MAX is 500ms - - Then, jump back to step two. + less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: + + 1. jitter is a random float between \[0, 1) + 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed + 3. BACKOFF_INITIAL is 1ms + 4. BACKOFF_MAX is 500ms + + Then, jump back to step two. + 3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return immediately. + 4. Otherwise, propagate the callback's error to the caller of `withTransaction` and return immediately. 7. If the ClientSession is in the "no transaction", "transaction aborted", or "transaction committed" state, assume the callback intentionally aborted or committed the transaction and return immediately. 
@@ -338,8 +342,8 @@ exceed the user's original intention for `maxTimeMS`. The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent callbacks. -A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that -`withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly; +A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate +that `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly; however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur regularly in a healthy application, as opposed to UnknownTransactionCommitResult, which would typically be caused by an @@ -370,7 +374,8 @@ provides an implementation of a technique already described in the MongoDB 4.0 d ([DRIVERS-488](https://jira.mongodb.org/browse/DRIVERS-488)). ## Changelog -- 2025-10-17: withTransaction applies exponential backoff when retrying. + +- 2025-10-17: withTransaction applies exponential backoff when retrying. - 2024-09-06: Migrated from reStructuredText to Markdown. 
From 48890a2c5b1b9f0e3f1b7ada38bd4b85329482e6 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Mon, 20 Oct 2025 16:07:36 -0700 Subject: [PATCH 3/8] add design rational for backoff --- .../transactions-convenient-api.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 13c187adc5..0c32812be4 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -356,6 +356,16 @@ non-configurable default and is intentionally twice the value of MongoDB 4.0's d parameter (60 seconds). Applications that desire longer retry periods may call `withTransaction` additional times as needed. Applications that desire shorter retry periods should not use this method. +### Backoff Benefits + +Previously, the driver would retry transactions immediately, which is fine for low levels of contention. But, as the +server load increases, immediate retries can result in retry storms, unnecessarily further overloading the server. + +Exponential backoff is a well-researched and accepted backoff strategy that is simple to implement. A low initial backoff +(1 millisecond) and growth value (1.25x) were chosen specifically to mitigate latency in low levels of contention. +Empirical evidence suggests that a 500-millisecond max backoff ensured that a transaction did not wait so long as to +exceed the 120-second timeout and reduced load spikes.
+ ## Backwards Compatibility The specification introduces a new method on the ClientSession class and does not introduce any backward breaking From 71ba1babe797f00e06b8ad98f442a097ed720e78 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Thu, 23 Oct 2025 10:19:51 -0700 Subject: [PATCH 4/8] fix prose test --- source/transactions-convenient-api/tests/README.md | 8 ++++++-- .../transactions-convenient-api.md | 11 +++++++---- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index dab2dfc544..78e3a3eafc 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -43,8 +43,12 @@ private API or using a mock timer. ### Retry Backoff is Enforced -Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms -and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. +Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3 +retries. Ensure that: + +- 3 backoffs occurred +- each backoff was greater than or equal to 0 +- the total operation time took more than the sum of the individual backoffs ## Changelog diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 0c32812be4..df679859ec 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -138,7 +138,7 @@ This method should perform the following sequence of actions: 3. BACKOFF_INITIAL is 1ms 4. BACKOFF_MAX is 500ms - Then, jump back to step two. + Append this sleep duration to a list for testing purposes. Then, jump back to step two. 3. 
If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return @@ -170,12 +170,15 @@ var BACKOFF_MAX = 500 // 500ms max backoff withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch - var retry = 0 + var retry = 0; + this._transaction_retry_backoffs = []; // for testing purposes retryTransaction: while (true) { if (retry > 0): - sleep(Math.random() * min(BACKOFF_INITIAL * (1.25**retry), - BACKOFF_MAX)) + var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), + BACKOFF_MAX) + this._transaction_retry_backoffs.push(backoff) + sleep(backoff) retry += 1 this.startTransaction(options); // may throw on error From b6026066bec1077b353e7a1db163c2fb14b8f6c1 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Mon, 27 Oct 2025 15:07:36 -0700 Subject: [PATCH 5/8] fix pseudocode --- .../transactions-convenient-api.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index df679859ec..565523716a 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -131,14 +131,16 @@ This method should perform the following sequence of actions: [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. 2. 
If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: + less than 120 seconds, calculate the backoff value to be + `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: 1. jitter is a random float between \[0, 1) 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed 3. BACKOFF_INITIAL is 1ms 4. BACKOFF_MAX is 500ms - Append this sleep duration to a list for testing purposes. Then, jump back to step two. + If the time elapsed thus far plus the backoff value would not exceed 120 seconds, then sleep for the backoff + value and jump back to step two, otherwise, raise last known error. 3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return @@ -171,20 +173,23 @@ withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch var retry = 0; - this._transaction_retry_backoffs = []; // for testing purposes retryTransaction: while (true) { - if (retry > 0): + if (retry > 0) { var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), - BACKOFF_MAX) - this._transaction_retry_backoffs.push(backoff) - sleep(backoff) + BACKOFF_MAX); + if (Date.now() + backoff - startTime >= 120000) { + throw last_error; + } + sleep(backoff); + } retry += 1 this.startTransaction(options); // may throw on error try { callback(this); } catch (error) { + var last_error = error; if (this.transactionState == STARTING || this.transactionState == IN_PROGRESS) { this.abortTransaction(); From 057fbbf1b25cc5dc88ac1a8788ee336d2491de77 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Tue, 28 Oct 2025 14:46:54 -0700 Subject: [PATCH 
6/8] fix test --- source/transactions-convenient-api/tests/README.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index 78e3a3eafc..4e0dd59f3c 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -43,12 +43,9 @@ private API or using a mock timer. ### Retry Backoff is Enforced -Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3 -retries. Ensure that: - -- 3 backoffs occurred -- each backoff was greater than or equal to 0 -- the total operation time took more than the sum of the individual backoffs +Drivers should test that retries within `withTransaction` do not occur immediately. Optionally, set BACKOFF_INITIAL to a +higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries. Check that the total +time for all retries exceeded 1.25 seconds. ## Changelog From 42e4d94146cee87c3ef66539479185c3d417803c Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Wed, 29 Oct 2025 15:38:06 -0700 Subject: [PATCH 7/8] account for CSOT / timeoutMS in algorithm --- .../transactions-convenient-api.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 565523716a..6fb5114074 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -131,7 +131,7 @@ This method should perform the following sequence of actions: [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. 2. 
If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, calculate the backoff value to be + less than 120 seconds, calculate the backoffMS to be `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: 1. jitter is a random float between \[0, 1) 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed @@ -139,8 +139,8 @@ This method should perform the following sequence of actions: 3. BACKOFF_INITIAL is 1ms 4. BACKOFF_MAX is 500ms - If the time elapsed thus far plus the backoff value would not exceed 120 seconds, then sleep for the backoff - value and jump back to step two, otherwise, raise last known error. + If timeoutMS is set and remainingTimeMS < backoffMS, or timeoutMS is not set and elapsed time + backoffMS > 120 + seconds, then raise the last known error. Otherwise, sleep for backoffMS and jump back to step two. 3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return @@ -178,7 +178,10 @@ withTransaction(callback, options) { if (retry > 0) { var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX); - if (Date.now() + backoff - startTime >= 120000) { + if (timeoutMS == null) { + timeoutMS = 120000; + } + if (Date.now() + backoff - startTime >= timeoutMS) { throw last_error; } sleep(backoff); From c11aef816018cf5f2732408c9b8287088a3a8c69 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Wed, 29 Oct 2025 17:05:00 -0700 Subject: [PATCH 8/8] add more details to tests --- .../tests/README.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index 4e0dd59f3c..1181fdbb23 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -44,8 +44,23 @@ private API or
using a mock timer. ### Retry Backoff is Enforced Drivers should test that retries within `withTransaction` do not occur immediately. Optionally, set BACKOFF_INITIAL to a -higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries. Check that the total -time for all retries exceeded 1.25 seconds. +higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries like so: + +```json +{ + "configureFailPoint": "failCommand", + "mode": { + "times": 30 + }, + "data": { + "failCommands": ["commitTransaction"], + "errorCode": 24 + } +} +``` + +Additionally, let the callback for the transaction be a simple `insertOne` command. Check that the total time for all +retries exceeded 1.25 seconds. ## Changelog
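
Review note: the backoff rule that the series converges on (patches 5/8 and 7/8) can be sketched in isolation as below. This is a minimal sketch under stated assumptions, not the spec's or any driver's implementation. The helper names `computeBackoffMS` and `shouldSleepAndRetry` and the explicit `jitter` parameter are hypothetical and exist only so the arithmetic can be checked deterministically; the constants (1ms initial, 500ms max, 1.25 growth, 120000ms default limit) and the `>=` comparison are taken from the pseudo-code in the patch.

```typescript
// Constants as given in the patch; the *_MS suffix is added here for clarity.
const BACKOFF_INITIAL_MS = 1;            // 1ms initial backoff
const BACKOFF_MAX_MS = 500;              // 500ms max backoff
const BACKOFF_GROWTH = 1.25;             // growth factor per retry
const DEFAULT_RETRY_TIMEOUT_MS = 120000; // used when timeoutMS is not set

// Hypothetical helper: jitter * min(BACKOFF_INITIAL * 1.25**retry, BACKOFF_MAX).
// `jitter` is expected to be in [0, 1); it is a parameter only to make tests deterministic.
function computeBackoffMS(retry: number, jitter: number = Math.random()): number {
  const uncapped = BACKOFF_INITIAL_MS * Math.pow(BACKOFF_GROWTH, retry);
  return jitter * Math.min(uncapped, BACKOFF_MAX_MS);
}

// Hypothetical helper: mirrors the pseudo-code check
// `if (Date.now() + backoff - startTime >= timeoutMS) throw last_error;`
// and returns true when the driver would sleep and retry instead of raising.
function shouldSleepAndRetry(
  elapsedMS: number,
  backoffMS: number,
  timeoutMS?: number
): boolean {
  const limitMS = timeoutMS ?? DEFAULT_RETRY_TIMEOUT_MS;
  return elapsedMS + backoffMS < limitMS;
}
```

Because jitter is drawn from \[0, 1), any individual backoff may be arbitrarily close to zero; the cap of 500ms is first reached at retry 28 under these constants, since 1.25**27 is still below 500.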