Fix ML tests failing with "no shards available" #136800

jan-elastic · 2025-10-20T08:54:30Z

Disclaimer: I'm not entirely sure this is the root cause of the failing tests, because I haven't been able to reproduce it in a controlled environment.

Important observations:

The tests do not fail in isolation. For example, MlWithSecurityIT test {yaml=ml/trained_model_cat_apis/Test cat trained models} reads from the index pattern ml-stats-* but never creates an index that matches it. In isolation it therefore always finds no matching indices, and the test succeeds.
Therefore, when the test fails, an ml-stats-000001 index must be lingering from a previous test. However, all tests start by wiping Elasticsearch, including all indices (see: ESRestTestCase#wipeCluster).

I think the following is the root cause of the failing tests:

TrainedModelStatsService collects inference stats and writes them to Elasticsearch via a scheduler that triggers every second.

This leads to the following failing sequence of events:

Some test executes an infer trained model call, which schedules a stats write that will happen within the next 0 to 1 seconds.
The test finishes.
A new test starts and wipes the cluster, including the ML stats index.
The scheduled stats write fires and recreates the ML stats index; while the index is being created, it is briefly in a state in which reads from it fail.
Concurrently, the new test reads from this index and hits the "all shards failed" failure.
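
The sequence above can be replayed deterministically with a small sketch. Everything in it is a hypothetical stand-in: `FakeCluster`, `BufferedStatsWriter`, and `StaleFlushDemo` are illustrative names, not Elasticsearch classes, and the scheduler tick is simulated by calling `flush()` by hand. It only models the ordering problem (a buffered write outliving the wipe), not the shard allocation state that makes the read fail.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for a cluster: just a set of index names.
class FakeCluster {
    final Set<String> indices = new HashSet<>();

    // Models the wipe at the start of each test: deletes all indices.
    void wipe() {
        indices.clear();
    }
}

// Hypothetical stand-in for a service that buffers stats and writes them later.
// Writing auto-creates the stats index if it does not exist.
class BufferedStatsWriter {
    static final String STATS_INDEX = "ml-stats-000001";

    private final Queue<String> pending = new ArrayDeque<>();
    private final FakeCluster cluster;

    BufferedStatsWriter(FakeCluster cluster) {
        this.cluster = cluster;
    }

    // Called when a test runs inference: the stats are only queued, not written.
    void queueStats(String modelId) {
        pending.add(modelId);
    }

    // A real service would call this from a scheduler; here the caller
    // invokes it by hand so that the ordering is deterministic.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        cluster.indices.add(STATS_INDEX);
        pending.clear();
    }
}

class StaleFlushDemo {
    // Replays the failing sequence and reports whether the stats index
    // exists after the wipe.
    static boolean indexLingersAfterWipe() {
        FakeCluster cluster = new FakeCluster();
        BufferedStatsWriter writer = new BufferedStatsWriter(cluster);

        writer.queueStats("model-a"); // first test runs inference
        cluster.wipe();               // second test starts and wipes the cluster
        writer.flush();               // delayed write from the first test fires

        return cluster.indices.contains(BufferedStatsWriter.STATS_INDEX);
    }

    public static void main(String[] args) {
        System.out.println("index lingers after wipe: " + indexLingersAfterWipe());
    }
}
```

In this model the index exists again after the wipe, which matches the observation that ml-stats-000001 appears in a test that never creates it.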

The following should fix it (in theory):

MachineLearning#cleanUpFeature clears the trained model stats queue.
Note that before each test, the feature reset API is called before all indices are deleted, which is the correct order for this fix to work.
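
A minimal sketch of the idea behind the fix, under the same simplifying assumptions: `StatsQueueSketch`, `clearQueue()`, and `ResetOrderDemo` are hypothetical names chosen for illustration, not the actual TrainedModelStatsService API, and the scheduler tick is again a manual `flush()` call.

```java
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the stats service; not the real TrainedModelStatsService.
class StatsQueueSketch {
    static final String STATS_INDEX = "ml-stats-000001";

    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final Set<String> indices;

    StatsQueueSketch(Set<String> indices) {
        this.indices = indices;
    }

    void queueStats(String modelId) {
        pending.add(modelId);
    }

    // Hypothetical hook that a feature reset would call: drop unwritten stats.
    void clearQueue() {
        pending.clear();
    }

    // Simulated scheduler tick: writes pending stats, creating the index if needed.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        indices.add(STATS_INDEX);
        pending.clear();
    }
}

class ResetOrderDemo {
    // Runs "queue stats, (optional reset), wipe, scheduler tick" and reports
    // whether the stats index exists afterwards.
    static boolean indexLingers(boolean resetBeforeWipe) {
        Set<String> indices = new HashSet<>();
        StatsQueueSketch stats = new StatsQueueSketch(indices);

        stats.queueStats("model-a"); // previous test ran inference
        if (resetBeforeWipe) {
            stats.clearQueue();      // feature reset clears the queue
        }
        indices.clear();             // wipe: delete all indices
        stats.flush();               // scheduler tick after the wipe

        return indices.contains(StatsQueueSketch.STATS_INDEX);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + indexLingers(false));
        System.out.println("with reset:    " + indexLingers(true));
    }
}
```

Without the reset the index reappears after the wipe; with the reset nothing is left to write, so the tick is a no-op. If the reset ran only after the wipe, a tick between the two could still recreate the index, which is presumably why the order matters.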

Fixes: #62699 #121726 #123034 #123200 #124168 #125641 #125642 #134257 #125750 #125909 #126299 #127625 #131364 #133440

elasticsearchmachine · 2025-10-20T09:01:25Z

Pinging @elastic/ml-core (Team:ML)

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java

davidkyle

LGTM

* fix debug output in TransportGetDataFrameAnalyticsStatsAction
* clear TrainedModelStatsService's queue upon MachineLearning reset
* unmute tests
* rename ResetAuditorActions -> ResetMlComponentsAction
* Move clearing stats queue to reset action

elasticsearchmachine added the needs:triage and v9.3.0 labels Oct 20, 2025

jan-elastic added the >test, :ml, and Team:ML labels and removed the needs:triage label Oct 20, 2025

jan-elastic force-pushed the fix-ml-tests-no-shards-available branch from f4200c7 to 4b0ed69 October 20, 2025 09:04

jan-elastic commented Oct 20, 2025

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java (outdated, resolved)

jan-elastic requested a review from davidkyle October 20, 2025 09:06

jan-elastic added 5 commits October 20, 2025 16:24

fix debug output in TransportGetDataFrameAnalyticsStatsAction

63ed5af

clear TrainedModelStatsService's queue upon MachineLearning reset

ff1a00b

unmute tests

38e1262

rename ResetAuditorActions -> ResetMlComponentsAction

46bc323

Move clearing stats queue to reset action

3a929ab

jan-elastic force-pushed the fix-ml-tests-no-shards-available branch from 4b0ed69 to 3a929ab October 20, 2025 14:24

davidkyle approved these changes Oct 20, 2025

jan-elastic enabled auto-merge (squash) October 20, 2025 15:00

jan-elastic merged commit 4b89f22 into elastic:main Oct 20, 2025
34 checks passed
