DRIVERS-1934: withTransaction API retries too frequently #1851
| callbacks. | ||
|
|
||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that | ||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that |
Maybe I read it wrong, or maybe it's a typo? Not sure.
| [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. | ||
| 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is | ||
| less than 120 seconds, jump back to step two. | ||
| less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: |
I know Python uses `**` as the exponentiation operator, but I don't believe that is standard across other programming languages. Is it clear that it means exponentiation? Is there a different preferred symbol? (I know math commonly uses `^`, but in code that typically means bitwise XOR, so I felt that could be confusing.)
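One way to sidestep the notation question is to read the quoted formula as ordinary code. The following is a minimal Python sketch, not the spec's wording. It assumes `jitter` is a uniform random value in [0, 1), that `retry` counts from 0, and that the constants are the 1 ms initial and 500 ms max values mentioned elsewhere in this review; the quoted diff line does not confirm any of these. The name `backoff_ms` is hypothetical.

```python
import random

BACKOFF_INITIAL_MS = 1.0   # assumed: initial backoff of 1 ms
BACKOFF_MAX_MS = 500.0     # assumed: backoff cap of 500 ms
GROWTH = 1.25              # growth factor taken from the quoted formula


def backoff_ms(retry, jitter=None):
    """Return jitter * min(BACKOFF_INITIAL * GROWTH**retry, BACKOFF_MAX) in ms."""
    if jitter is None:
        jitter = random.random()  # assumed: uniform in [0, 1)
    # pow() spells out the exponentiation that `**` denotes in the formula.
    uncapped = BACKOFF_INITIAL_MS * pow(GROWTH, retry)
    return jitter * min(uncapped, BACKOFF_MAX_MS)
```

Passing `jitter=1.0` gives the largest possible sleep for a given retry, which is handy when reasoning about worst-case delay.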
| ### Retry Backoff is Enforced | ||
|
|
||
| Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms | ||
| and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. |
This test would not work as described because jitter is non-deterministic.
An alternative test would be to use a failpoint to fail the transaction X times and then assert the overall time is larger than some threshold.
Yeah, I realized that when implementing it. I set the fail point to fail 3 times and that seems to be consistently working (and without jitter, failing 3 times would still cause the success to happen within 500ms).
Update: it was consistent locally but super flaky on Atlas. I modified the `withTransaction` code to append backoff values to make this easier to test, and the test is now more stable.
|
|
||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that | ||
| `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly; | ||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate |
I suggest we add a section in the Design Rationale to cover why we think it's beneficial to introduce backoff, and another section explaining why we chose the parameters 1ms initial, 1.25 growth, and 500ms max.
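For reference while drafting that rationale, the parameters named above imply a specific jitter-free schedule. The sketch below is only arithmetic on the stated values (1 ms initial, 1.25 growth, 500 ms max), assuming the exponent starts at 0; it does not justify the values. The function names are hypothetical.

```python
BACKOFF_INITIAL_MS = 1.0   # 1 ms initial, as stated in the comment above
BACKOFF_MAX_MS = 500.0     # 500 ms max, as stated in the comment above
GROWTH = 1.25              # growth factor, as stated in the comment above


def max_backoff_ms(retry):
    """Upper bound on the sleep for a retry, i.e. the formula with jitter = 1."""
    return min(BACKOFF_INITIAL_MS * GROWTH ** retry, BACKOFF_MAX_MS)


def first_capped_retry():
    """Smallest retry index (assumed 0-based) whose uncapped backoff reaches the max."""
    retry = 0
    while BACKOFF_INITIAL_MS * GROWTH ** retry < BACKOFF_MAX_MS:
        retry += 1
    return retry
```

Under these assumptions `first_capped_retry()` returns 28, so the 500 ms cap only takes effect after a long run of consecutive failures; earlier retries sleep for at most a few tens of milliseconds.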
|
|
||
| ### Retry Backoff is Enforced | ||
|
|
||
| Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3 |
3 here was a bit of an arbitrary number. The most important part of this value is that it's greater than 1.
3 just felt small enough to be a quick test but big enough to conclude that backoff is consistently happening.
If folks have more opinions on this number, I'm not attached to 3.
I'm not sure the _transaction_retry_backoffs concept is viable for testing here since:
- many languages don't have a way to implement it without making the attribute public
- It's an unbounded list that can grow forever.
- there's no prior art for something like this in the driver as far as I know.
Instead I'd suggest my test from earlier where we fail the transaction X times and assert the run time is greater than some threshold T. X should be large enough to reduce false positives where the test fails due to jitter resulting in a small delay for every retry.
We can calculate T by recording the command failed+succeeded events, summing their duration, and adding a fixed constant.
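The threshold-based check described above can be sketched without exposing any driver internals. This is a simulation only, under stated assumptions: `FakeClock`, `with_transaction_sim`, and `backoff_observed` are hypothetical names, the backoff constants are the 1 ms / 1.25 / 500 ms values discussed in this review, and the fail point and command monitoring events of a real driver test are replaced by a simulated clock and a list of durations.

```python
import random


class FakeClock:
    """Simulated clock (ms) so the sketch runs instantly and repeatably."""

    def __init__(self):
        self.now_ms = 0.0

    def advance(self, ms):
        self.now_ms += ms


def with_transaction_sim(clock, fail_times, command_ms, jitter=random.random):
    """Hypothetical stand-in for withTransaction.

    Fails `fail_times` times, sleeping with backoff between attempts, and
    returns the per-command durations that command events would report.
    """
    durations = []
    retry = 0
    while True:
        clock.advance(command_ms)      # time spent in the command itself
        durations.append(command_ms)   # what a failed/succeeded event would carry
        if retry >= fail_times:
            return durations           # this attempt succeeded
        # Assumed formula: jitter * min(1 ms * 1.25**retry, 500 ms).
        clock.advance(jitter() * min(1.0 * 1.25 ** retry, 500.0))
        retry += 1


def backoff_observed(elapsed_ms, durations, constant_ms):
    """True if the run took longer than T = sum of command durations + constant."""
    return elapsed_ms > sum(durations) + constant_ms


clock = FakeClock()
durations = with_transaction_sim(clock, fail_times=3, command_ms=2.0,
                                 jitter=lambda: 0.5)
print(backoff_observed(clock.now_ms, durations, constant_ms=1.0))
```

With real jitter the total backoff varies from run to run, so the fixed constant has to stay well below the expected total backoff for X failures. That is why X needs to be large enough to avoid false failures when jitter happens to be small on every retry.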