Base: scylla-4.x
4.x: Stabilize AdvancedShardAwarenessIT #676
Conversation
Seems like some logs somehow leak between test methods. I've got to fix that too.
Force-pushed from 2c1d3b7 to e701f0d
Force-pushed from e701f0d to 63e75ac
This test class has recently been failing for various reasons. The following modifications were made:
- Instead of waiting a fixed amount of time, each test now waits until all pools are initialized before checking the logs, or until the test times out (see the sketch after this list).
- Tolerance for additional reconnections was slightly increased. It is still set much lower than what appears when running the scenario without advanced shard awareness.
- The configured size of channel pools was halved in the tests that used to create over 200 channels in total. This should alleviate occasional bind exceptions.
- The reconnection delays were increased. They were intentionally low to cause congestion, but that was too much for GitHub Actions.
- Scanning for patterns in the logs is now done on a copy, to avoid exceptions related to concurrent modification.
- Commented out the general checks for "Reconnection attempt complete". Those are already covered by more specific patterns that include the cluster's IP, and the general version seems to collide with logs from other tests. This should not be happening, but that's a separate, larger issue.
- Added full node IP addresses to log patterns to avoid matching logs produced by a different cluster.
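A minimal sketch of the first and fifth points above, assuming Awaitility is available on the test classpath; `awaitAllPoolsInitialized`, `countMatches`, and `capturedLogs` are illustrative names, not the actual test helpers:

```java
import static org.awaitility.Awaitility.await;

import com.datastax.oss.driver.api.core.CqlSession;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class StabilizationSketch {

  // Wait until every node reports the expected number of open channels,
  // instead of sleeping for a fixed amount of time before checking the logs.
  static void awaitAllPoolsInitialized(CqlSession session, int expectedChannelsPerNode) {
    await()
        .atMost(Duration.ofMinutes(2))
        .until(
            () ->
                session.getMetadata().getNodes().values().stream()
                    .allMatch(node -> node.getOpenConnections() >= expectedChannelsPerNode));
  }

  // Scan a snapshot of the captured log lines so that appenders writing
  // concurrently do not trigger a ConcurrentModificationException mid-scan.
  static long countMatches(List<String> capturedLogs, Pattern pattern) {
    List<String> snapshot = new ArrayList<>(capturedLogs);
    return snapshot.stream().filter(line -> pattern.matcher(line).find()).count();
  }
}
```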
Force-pushed from 63e75ac to fe07a8c
Looks green now.
public void should_initialize_all_channels(boolean reuseAddress) {
  int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
  int expectedChannelsPerNode = 6;
  String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
This is even better than what we discussed: you do not enforce the test to use some predefined IP prefix, but make an actual IP part of the regex pattern
Yes. A hardcoded IP would also work, but it could collide with something eventually.
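A small sketch of that approach; the class and method names are illustrative, only the "Reconnection attempt complete" message text comes from the test, and the exact ordering of message text and IP in the real log lines is an assumption:

```java
import java.util.regex.Pattern;

class NodeLogPatterns {
  // Quote the node's IP so its dots match literally, then embed it in the
  // pattern so log lines produced by a different cluster cannot match.
  static Pattern reconnectionCompleteFor(String nodeIp) {
    return Pattern.compile("Reconnection attempt complete.*" + Pattern.quote(nodeIp));
  }
}

// Usage with the IP obtained from CCM, as in the test above:
// Pattern p = NodeLogPatterns.reconnectionCompleteFor(CCM_RULE.getCcmBridge().getNodeIpAddress(1));
```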
  .withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
  .withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
- .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
+ .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)
I see you also reduced the number of channels to just 6. Is that intentional?
Yes. In should_initialize_all_channels I made it 6 because the exact number does not matter that much; it's mainly a sanity check that the basic thing works. Here (should_see_mismatched_shard) it is reduced to 33: less than 64, but still enough to be sure that without advanced shard awareness there is a pretty high chance of landing on the wrong shard several times.
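As a side note, a minimal sketch of the rounding behavior that the `poolLocalSizeSetting` comment above refers to; this formula is an assumption about how the per-node pool size relates to the shard count, not the driver's actual implementation:

```java
class PoolSizeRounding {
  // A configured pool size that is not divisible by the shard count is rounded
  // up to the next multiple, so channels can be spread evenly across shards.
  static int roundedPoolSize(int configuredPoolLocalSize, int shardCount) {
    int perShard = (configuredPoolLocalSize + shardCount - 1) / shardCount; // ceiling division
    return perShard * shardCount;
  }
}

// PoolSizeRounding.roundedPoolSize(4, 3) == 6, matching expectedChannelsPerNode in the test.
```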
@Bouncheck, I have a couple of questions:
Why is currently unclear. The first sighting seems to be a GitHub Actions run after pushing this.
Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions that try to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone. Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness tests. They could also be making port collisions or timeouts slightly more likely.
Before merging this, let's evaluate #682.