Skip to content

Conversation

ywangd
Copy link
Member

@ywangd ywangd commented Oct 20, 2025

For allocation simulation, allocation decisions should not be THROTTLE for moveShards and balance. Allocate unassigned shards should not see THROTTLE unless there is prior shard assignment. This PR makes this statement explicit by fixing test allocation decider and HasFrozenAllocationDecider as well as adding relevant assertions.

Relates: ES-12955

Other than HasFrozenCacheAllocationDecider, allocation decision should
not be THROTTLE unless there is prior shard assignment/movement. This PR
makes this statement explicit by adding relevant assertions and fixing
test allocation decider.

Relates: ES-12955
@ywangd ywangd added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v9.3.0 labels Oct 20, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member Author

ywangd commented Oct 20, 2025

Initially, I wanted to ensure that no THROTTLE can be returned at all unless there is prior shard movement. But HasFrozenCacheAllocationDecider is not compatible with the idea. We also have existing code explicitly checking for it. So it seems that we have to live with this exception. As a result, I added assertions and fixed RandomAllocationDecider to make this exception more explicit. I am open to suggestions.

Comment on lines 712 to 713
// 2. Or, a partial searchable snapshot index due to HasFrozenCacheAllocationDecider
assert allocation.isSimulating() == false || indexMetadata.isPartialSearchableSnapshot()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could check that the THROTTLE decision is actually from the HasFrozenCacheAllocationDecider. But it's not really feasible. Instead, we can check it's a frozen index. Not really ideal, but seems to be the best we can do.

Copy link
Contributor

@nicktindall nicktindall Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of putting knowledge of specific deciders in here, as discussed, I wonder if HasFrozenAllocationDecider is abusing THROTTLE anyway. My reasoning was:

In the canAllocateUnassigned logic we'll wait for the outcome of a lower-weight THROTTLE when we have higher-weight YES so that would mean this decider can defer the allocation of a partial searchable snapshot while a non-frozen node was initialising doesn't it?
I would think we'd return NO until the setting was loaded?

It seems like returning THROTTLED while we wait to find out if the node can host frozen shards could mean we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we defer allocating the shard to another node that we KNOW can host frozen shards, while waiting to find out if an initialising node is a frozen node.

That is true. I also wonder whether it is intentional since otherwise the shards will be assigned to an existing node only to relocate later momentarily. That said, after reading through code around HasFrozenAllocationDecider, I believe it is OK to return NO since it should not change the beahviour. That is because we always call allocateExistingUnassignedShards before calling into the allocator. This method uses SearchableSnapshotAllocator to allocate unassigne shards first. It also checks cache status and reject allocation when cache state is still being fetched. It in turns places the shard in the ignored list and BalancedShardAllocator ignores them as well. In short, the allocation decision for paritially mounted shards are made earlier in a different place so that it should not matter for the decider to return NO.

Comment on lines 72 to 77
if (allocation.isSimulating() && relocatingShards >= 2) {
// This branch should no longer run after https://github.com/elastic/elasticsearch/pull/134786
assert false : "allocation simulation should have returned earlier and not hit throttling";
// BalancedShardAllocator is prone to perform unnecessary moves when cluster_concurrent_rebalance is set to high values (>2).
// (See https://github.com/elastic/elasticsearch/issues/87279)
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can potentially remove this branch in future. For now, adding the assertion seems to be a good first step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we can't confidently get rid of this code, then we aren't really sure we're not throttling with the changes. This should at least have a planned follow up action that is more certain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is safe to remove. My usage of "potential" may be confusing. It does not imply less confidence, was just bad wording.

Adding assert false ensures it does not happen in tests and is equivalent to deprecation. I'd prefer to have it first to run through CIs for sometime before finally removing it. It is also because the existing comment has an issue linked to it #87279 and it'd be great to spend some time separately to see whether it can be closed while removing the branch. I raised ES-13282 specifically for deleting this branch.

Copy link
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this change LGTM pending the question of whether we can make HasFrozenCacheAllocationDecider return NO while we're initialising the node so that we can remove the special knowledge of it from here?

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this task is to gain confidence that the THROTTLE decision is not encountered, but right now this patch doesn't provide that confidence. This isn't in the hot-path, so I don't think we need to rush the change. I'd rather sort out the Decider before proceeding.

// happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE
// when there is no existing relocation.
// Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE.
shardBalanced = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify the code by making this

assert ....
if (completeEarlyOnShardAssignmentChange && routingNodes.getRelocatingShardCount() > 0) {
    return true;
}
continue;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. We still want to track shardBalanced when it is not simulation.

Comment on lines 72 to 77
if (allocation.isSimulating() && relocatingShards >= 2) {
// This branch should no longer run after https://github.com/elastic/elasticsearch/pull/134786
assert false : "allocation simulation should have returned earlier and not hit throttling";
// BalancedShardAllocator is prone to perform unnecessary moves when cluster_concurrent_rebalance is set to high values (>2).
// (See https://github.com/elastic/elasticsearch/issues/87279)
// Above allocator is used in DesiredBalanceComputer. Since we do not move actual shard data during calculation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we can't confidently get rid of this code, then we aren't really sure we're not throttling with the changes. This should at least have a planned follow up action that is more certain.

@ywangd ywangd marked this pull request as draft October 20, 2025 23:12
@ywangd ywangd marked this pull request as ready for review October 21, 2025 07:17
@ywangd ywangd changed the title Tighten when THROTTLE decision can be returned Tighten on when THROTTLE decision can be returned Oct 21, 2025
@ywangd
Copy link
Member Author

ywangd commented Oct 21, 2025

@nicktindall @DiannaHohensee The PR is ready for another look. In summary, I have taken Nick's suggestion for returning NO from HasFrozenAllocationDecider.

I spent quite some time reading through relevant code for searchable snapshot indices allocation. I believe it is safe to do so.

  1. Unassigned searchable snapshot shards are allocated elsewhere, see also this comment.
  2. For relocations, throttle is effectively a NO and unlike unassigned shards, there is no urgency for them.

Comment on lines -86 to +99
default -> STILL_FETCHING;
case FETCHING -> allocation.isSimulating() ? NO_STILL_FETCHING : STILL_FETCHING;
case UNKNOWN -> UNKNOWN_NODE;
Copy link
Member Author

@ywangd ywangd Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasFrozenAllocationDecider does not differentiate between fetching node state (FETCHING) and no-such-node (UNKNOWN). The later can only happen when a node leaves the cluster which seems more justified to say NO. In any case, they do not really impact SearchableSnapshotAllocator which is what matters for allocating unassigned searchable snapshot shards.
Btw, I can also undo the split if it is considered unrelated.

Comment on lines 488 to 498
if (hasChanges == false && info.lastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_THROTTLED) {
// Unassigned ignored shards must be based on the provided set of ignoredShards
assert ignoredShards.contains(discardAllocationStatus(shard))
|| ignoredShards.stream().filter(ShardRouting::primary).anyMatch(primary -> primary.shardId().equals(shard.shardId()))
: "ignored shard "
+ shard
+ " unexpectedly has THROTTLE status and no counterpart in the provided ignoredShards set "
+ ignoredShards;
// Simulation could not progress due to missing information in any of the deciders.
// Currently, this could happen if `HasFrozenCacheAllocationDecider` is still fetching the data.
// Progress would be made after the followup reroute call.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about HasFrozenCacheAllocationDecider had me concerned for some time. Adding this assertion to make it clear that throttled status can only come from DesiredBalanceInput#ignoredShards, i.e. they are not produced by BalancedShardAllocator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants