
Conversation


@Bouncheck Bouncheck commented Oct 2, 2025

This test class has recently been failing for various reasons. The following modifications were made:

  • Instead of waiting a fixed amount of time, each affected test now awaits until all pools are initialized (or the test times out) before checking the logs.
  • The tolerance for additional reconnections was slightly increased. It is still set to a much lower number than what appears when running the scenario without advanced shard awareness.
  • The configured size of the channel pools was halved in the tests that used to create over 200 total channels. This should alleviate occasional bind exceptions.
  • The reconnection delays were increased. They were intentionally low to cause congestion, but that was too much for GitHub Actions.
  • Scanning the logs for patterns is now done on a copy to avoid exceptions related to concurrent modification.
  • Commented out the general checks for "Reconnection attempt complete". Those are
    already covered by more specific patterns that include the cluster's IP, and
    it seems the general version collides with logs from other tests.
    That should not be happening, but it's a separate, larger issue.
  • Added full node IP addresses to the log patterns to avoid matching
    logs caused by a different cluster.
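The first bullet replaces fixed sleeps with a bounded poll. A minimal sketch of that pattern in plain JDK; `awaitUntil` and its signature are illustrative stand-ins, not necessarily the helper the tests actually use:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class AwaitUtil {
  /**
   * Polls the condition until it is true or the timeout elapses.
   * Returns the final value of the condition, so callers can assert on it
   * instead of assuming a fixed sleep was long enough.
   */
  public static boolean awaitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        return condition.getAsBoolean();
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return condition.getAsBoolean();
      }
    }
    return true;
  }
}
```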
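The copy-before-scan change can be illustrated as follows; the list and pattern here are assumptions, the only point is the defensive copy taken before iterating:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogScanExample {
  /**
   * Counts log lines matching the pattern. Iterating over a snapshot avoids
   * a ConcurrentModificationException when driver threads keep appending to
   * the live list while the test is scanning it.
   */
  public static long countMatches(List<String> liveLogs, Pattern pattern) {
    List<String> snapshot = new ArrayList<>(liveLogs); // defensive copy
    return snapshot.stream().filter(line -> pattern.matcher(line).find()).count();
  }
}
```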

@Bouncheck Bouncheck self-assigned this Oct 2, 2025
@Bouncheck Bouncheck marked this pull request as draft October 2, 2025 02:10

Bouncheck commented Oct 2, 2025

Seems like some logs somehow leak between test methods; I'll have to fix that too.
Also, the channel pool can sometimes be null when the await condition is checked.
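The null-pool case above means the await condition has to tolerate a missing entry. A hedged sketch, where the map stands in for the driver's per-node pools (names are hypothetical):

```java
import java.util.Map;

public class PoolCheck {
  /**
   * Null-safe readiness check: during session init the pool for a node may
   * not exist yet, so the polled condition must treat a null entry as
   * "not ready" instead of throwing a NullPointerException.
   */
  public static boolean poolHasChannels(Map<String, Integer> openChannelsByNode, String node, int expected) {
    Integer open = openChannelsByNode.get(node); // null until the pool is created
    return open != null && open >= expected;
  }
}
```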


Bouncheck commented Oct 2, 2025

Some of the reconnections come from RandomTokenVnodesIT, which should not be running at all: from what I can see it has no annotation tying it to the Scylla backend and also has a ScyllaSkip annotation.
This was incorrect; the reconnections are for the cluster created for the next test, which is DefaultMetadataTabletMapIT.
RandomTokenVnodesIT started taking a few seconds longer even though it's marked for skipping, so there was a slight change there too, but that does not make it the suspect here yet.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 2c1d3b7 to e701f0d Compare October 2, 2025 13:01
@nikagra nikagra self-requested a review October 2, 2025 13:06
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from e701f0d to 63e75ac Compare October 2, 2025 13:44
@Bouncheck Bouncheck changed the title Stabilize AdvancedShardAwarenessIT 4.x: Stabilize AdvancedShardAwarenessIT Oct 2, 2025
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 63e75ac to fe07a8c Compare October 2, 2025 14:05
@nikagra nikagra requested a review from dkropachev October 2, 2025 14:29
@Bouncheck Bouncheck marked this pull request as ready for review October 2, 2025 14:32
@Bouncheck

Looks green now.
Created #678 to follow up on root cause.

public void should_initialize_all_channels(boolean reuseAddress) {
int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
int expectedChannelsPerNode = 6;
String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
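The rounding mentioned in the comment above is to the next multiple of the shard count, so each shard gets the same number of channels. A hypothetical helper showing just the arithmetic, not the driver's actual code path:

```java
public class ShardRounding {
  /**
   * Rounds a requested per-node pool size up to the next multiple of the
   * shard count, e.g. 4 channels across 3 shards becomes 6 (2 per shard).
   */
  public static int roundUpToShardMultiple(int requestedPoolSize, int shardCount) {
    return ((requestedPoolSize + shardCount - 1) / shardCount) * shardCount;
  }
}
```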
This is even better than what we discussed: you don't force the test to use a predefined IP prefix, but make the actual IP part of the regex pattern


Yes. A hardcoded IP would also work, but it could collide with something eventually.
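A sketch of the idea being discussed: build the pattern from the node IP the test resolves at runtime, escaping the dots so they match literally. The helper name is illustrative; the message text matches the "Reconnection attempt complete" logs mentioned above:

```java
import java.util.regex.Pattern;

public class NodePattern {
  /**
   * Builds a log pattern anchored to one node's IP, so matches produced by a
   * different cluster's nodes are excluded. Pattern.quote escapes the dots
   * in the address so they are not treated as regex wildcards.
   */
  public static Pattern reconnectionComplete(String nodeIp) {
    return Pattern.compile("Reconnection attempt complete.*" + Pattern.quote(nodeIp));
  }
}
```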

.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)

I see you also reduced the number of channels to just 6. Is that intentional?

@Bouncheck Bouncheck Oct 2, 2025
Yes. In should_initialize_all_channels I made it 6 because the number does not matter that much there; it's mainly a sanity check that the basics work.

Here (should_see_mismatched_shard) it is reduced to 33: fewer than 64, but still enough to be confident that without advanced shard awareness connections have a pretty high chance of landing on the wrong shard several times.
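Back-of-the-envelope arithmetic behind "pretty high chance", under the simplifying assumption that without advanced shard awareness each connection lands on a uniformly random shard (an illustrative model, not the driver's behavior):

```java
public class ShardOdds {
  /**
   * Expected number of connections landing on the wrong shard when each of
   * n connections independently picks one of s shards uniformly at random:
   * n * (s - 1) / s. For 33 connections and 3 shards that is 22.
   */
  public static double expectedMismatches(int connections, int shards) {
    return connections * (shards - 1) / (double) shards;
  }
}
```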

@dkropachev

@Bouncheck, I have a couple of questions:

  1. Why and when exactly did it start failing? It was not like that two weeks ago.
  2. Why is it not failing on 6.1.5 but failing on other Scylla versions? What is the difference?


Bouncheck commented Oct 2, 2025

@Bouncheck, I have a couple of questions:

1. Why and when exactly did it start failing? It was not like that two weeks ago.

The why is currently unclear. The first sighting seems to be the GitHub Actions run after pushing
ab2665f
However, the final run visible under the PR does not have the same failures:
https://github.com/scylladb/java-driver/actions/runs/17986407382
I don't see anything significant in that commit that would explain the failures right now.

2. Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?

Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions trying to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone.
It could also be a mix: something incorrect in the test code that only surfaced due to a change on the server side.

Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness tests. They could also be making port collisions or timeouts slightly more likely.

@Bouncheck

Bouncheck commented Oct 2, 2025

Before merging this, let's evaluate #682.
I think switching from hardcoded sleeps to awaits and using more specific patterns are still worthwhile changes, but maybe I should not be increasing the reconnection tolerance here.
