
Conversation


@Bouncheck Bouncheck commented Oct 2, 2025

This test class has recently been failing for various reasons. The following modifications were made:

  • Instead of waiting a fixed amount of time, each affected test now awaits until all pools are initialized (or the test times out) before checking the logs.
  • The tolerance for additional reconnections was slightly increased. It is still set to a much lower number than what appears when running the scenario without advanced shard awareness.
  • The configured size of the channel pools was halved in the tests that used to create over 200 total channels. This should alleviate occasional bind exceptions.
  • The reconnection delays were increased. They were intentionally low to cause congestion, but that was too much for GitHub Actions.
  • Scanning the logs for patterns is now done on a copy to avoid exceptions related to concurrent modification.
  • Commented out the general checks for "Reconnection attempt complete". Those are
    already covered by more specific patterns that include the cluster's IP, and
    it seems the general version collides with logs from other tests.
    That should not be happening, but it's a separate, larger issue.
  • Added full node IP addresses to the log patterns to avoid matching
    logs caused by a different cluster.
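The first bullet replaces fixed sleeps with a bounded poll. A minimal sketch of that pattern in plain JDK; `awaitUntil` and its signature are illustrative stand-ins, not necessarily the helper the tests actually use:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class AwaitUtil {
  /**
   * Polls the condition until it is true or the timeout elapses.
   * Returns the final value of the condition, so callers can assert on it
   * instead of assuming a fixed sleep was long enough.
   */
  public static boolean awaitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        return condition.getAsBoolean();
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return condition.getAsBoolean();
      }
    }
    return true;
  }
}
```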
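The copy-before-scan change can be illustrated as follows; the list and pattern here are assumptions, the only point is the defensive copy taken before iterating:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogScanExample {
  /**
   * Counts log lines matching the pattern. Iterating over a snapshot avoids
   * a ConcurrentModificationException when driver threads keep appending to
   * the live list while the test is scanning it.
   */
  public static long countMatches(List<String> liveLogs, Pattern pattern) {
    List<String> snapshot = new ArrayList<>(liveLogs); // defensive copy
    return snapshot.stream().filter(line -> pattern.matcher(line).find()).count();
  }
}
```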

@Bouncheck Bouncheck self-assigned this Oct 2, 2025
@Bouncheck Bouncheck marked this pull request as draft October 2, 2025 02:10

Bouncheck commented Oct 2, 2025

Seems like some logs somehow leak between test methods; I'll have to fix that too.
Also, the channel pool can sometimes be null when the await condition is checked.
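The null-pool case above means the await condition has to tolerate a missing entry. A hedged sketch, where the map stands in for the driver's per-node pools (names are hypothetical):

```java
import java.util.Map;

public class PoolCheck {
  /**
   * Null-safe readiness check: during session init the pool for a node may
   * not exist yet, so the polled condition must treat a null entry as
   * "not ready" instead of throwing a NullPointerException.
   */
  public static boolean poolHasChannels(Map<String, Integer> openChannelsByNode, String node, int expected) {
    Integer open = openChannelsByNode.get(node); // null until the pool is created
    return open != null && open >= expected;
  }
}
```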


Bouncheck commented Oct 2, 2025

Some of the reconnections come from RandomTokenVnodesIT, which should not be running at all: from what I can see it has no annotation tying it to the Scylla backend and also has a ScyllaSkip annotation.
This was incorrect; the reconnections are for the cluster created for the next test, which is DefaultMetadataTabletMapIT.
RandomTokenVnodesIT started taking a few seconds longer even though it's marked for skipping, so there was a slight change there too, but that does not make it the suspect here yet.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 2c1d3b7 to e701f0d Compare October 2, 2025 13:01
@nikagra nikagra self-requested a review October 2, 2025 13:06
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from e701f0d to 63e75ac Compare October 2, 2025 13:44
@Bouncheck Bouncheck changed the title Stabilize AdvancedShardAwarenessIT 4.x: Stabilize AdvancedShardAwarenessIT Oct 2, 2025
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 63e75ac to fe07a8c Compare October 2, 2025 14:05
@nikagra nikagra requested a review from dkropachev October 2, 2025 14:29
@Bouncheck Bouncheck marked this pull request as ready for review October 2, 2025 14:32
@Bouncheck

Looks green now.
Created #678 to follow up on root cause.

public void should_initialize_all_channels(boolean reuseAddress) {
int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
int expectedChannelsPerNode = 6;
String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
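The rounding mentioned in the comment above is to the next multiple of the shard count, so each shard gets the same number of channels. A hypothetical helper showing just the arithmetic, not the driver's actual code path:

```java
public class ShardRounding {
  /**
   * Rounds a requested per-node pool size up to the next multiple of the
   * shard count, e.g. 4 channels across 3 shards becomes 6 (2 per shard).
   */
  public static int roundUpToShardMultiple(int requestedPoolSize, int shardCount) {
    return ((requestedPoolSize + shardCount - 1) / shardCount) * shardCount;
  }
}
```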
This is even better than what we discussed: you don't force the test to use a predefined IP prefix, but make the actual IP part of the regex pattern


Yes. A hardcoded IP would also work, but it could collide with something eventually.
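A sketch of the idea being discussed: build the pattern from the node IP the test resolves at runtime, escaping the dots so they match literally. The helper name is illustrative; the message text matches the "Reconnection attempt complete" logs mentioned above:

```java
import java.util.regex.Pattern;

public class NodePattern {
  /**
   * Builds a log pattern anchored to one node's IP, so matches produced by a
   * different cluster's nodes are excluded. Pattern.quote escapes the dots
   * in the address so they are not treated as regex wildcards.
   */
  public static Pattern reconnectionComplete(String nodeIp) {
    return Pattern.compile("Reconnection attempt complete.*" + Pattern.quote(nodeIp));
  }
}
```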

.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)

I see you also reduced the number of channels to just 6. Is that intentional?

@Bouncheck Bouncheck Oct 2, 2025
Yes. In should_initialize_all_channels I made it 6 because the number does not matter that much there; it's mainly a sanity check that the basics work.

Here (should_see_mismatched_shard) it is reduced to 33: fewer than 64, but still enough to be confident that without advanced shard awareness connections have a pretty high chance of landing on the wrong shard several times.
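Back-of-the-envelope arithmetic behind "pretty high chance", under the simplifying assumption that without advanced shard awareness each connection lands on a uniformly random shard (an illustrative model, not the driver's behavior):

```java
public class ShardOdds {
  /**
   * Expected number of connections landing on the wrong shard when each of
   * n connections independently picks one of s shards uniformly at random:
   * n * (s - 1) / s. For 33 connections and 3 shards that is 22.
   */
  public static double expectedMismatches(int connections, int shards) {
    return connections * (shards - 1) / (double) shards;
  }
}
```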

@dkropachev

@Bouncheck, I have a couple of questions:

  1. Why and when exactly did it start failing? It was not like that two weeks ago.
  2. Why is it not failing on 6.1.5 but failing on other Scylla versions? What is the difference?


Bouncheck commented Oct 2, 2025

@Bouncheck, I have a couple of questions:

1. Why and when exactly did it start failing? It was not like that two weeks ago.

The why is currently unclear. The first sighting seems to be the GitHub Actions run after pushing
ab2665f
However, the final run visible under the PR does not have the same failures:
https://github.com/scylladb/java-driver/actions/runs/17986407382
I don't see anything significant in that commit that would explain the failures right now.

2. Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?

Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions trying to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone.
It could also be a mix: something incorrect in the test code that only surfaced due to a change on the server side.

Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness tests. They could also be making port collisions or timeouts slightly more likely.

@Bouncheck

Bouncheck commented Oct 2, 2025

Before merging this, let's evaluate #682.
I think switching from hardcoded sleeps to awaits and using more specific patterns are still worthwhile changes, but maybe I should not be increasing the reconnection tolerance here.
