Contributor

@jan-elastic jan-elastic commented Oct 20, 2025

Disclaimer: I'm not entirely sure this is the root cause of the failing tests, because I haven't been able to reproduce it in a controlled environment.

Important observation:

I think the following is the root cause of the failing tests:

  • TrainedModelStatsService collects inference stats and writes them to Elasticsearch via a scheduler that triggers every second.

This leads to the following failing sequence of events:

  • A test executes an infer trained model call, which queues a stats write that will happen within the next 0 to 1 seconds.
  • The test finishes.
  • A new test starts and wipes the cluster, including the ML stats index.
  • The queued stats write fires and recreates the ML stats index; while the index is being created, it is briefly in a state in which reads from it fail.
  • Concurrently, the new test reads from this index, which leads to the "all shards failed" failure.

The following should fix it (in theory):

  • MachineLearning#cleanUpFeature clears the trained model stats queue.
  • Note that before each test, the feature reset API is called before deleting all indices, which is the correct order for this fix to work.
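To illustrate why the ordering matters, here is a minimal, self-contained sketch of the idea. It is not the actual Elasticsearch code: `StatsQueueSketch` and all of its methods are hypothetical stand-ins, the "index" is just an in-memory list, and `flush()` plays the role of the once-per-second scheduler tick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for a stats service; NOT the real TrainedModelStatsService. */
class StatsQueueSketch {
    // Stats documents queued by inference calls, waiting for the next scheduler tick.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for the ML stats index: non-empty means "the index exists".
    private final List<String> written = new ArrayList<>();

    /** Called when an inference request produces stats. */
    void queueStats(String doc) {
        pending.add(doc);
    }

    /** What the periodic scheduler would run; writing a document implicitly (re)creates the index. */
    synchronized int flush() {
        int count = 0;
        String doc;
        while ((doc = pending.poll()) != null) {
            written.add(doc);
            count++;
        }
        return count;
    }

    /** Reset hook (assumed to run before indices are deleted): drop anything still queued. */
    void clearQueue() {
        pending.clear();
    }

    /** Test cleanup: delete the index. */
    synchronized void wipeIndex() {
        written.clear();
    }

    synchronized boolean indexExists() {
        return !written.isEmpty();
    }
}
```

In this sketch, calling `queueStats`, then `wipeIndex`, then `flush` leaves `indexExists()` returning true, which mirrors the stale write recreating the index in the middle of the next test. Calling `clearQueue` before `wipeIndex` makes the later `flush` a no-op, so the index stays deleted.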

Fixes: #62699 #121726 #123034 #123200 #124168 #125641 #125642 #134257 #125750 #125909 #126299 #127625 #131364 #133440

@elasticsearchmachine elasticsearchmachine added the needs:triage and v9.3.0 labels Oct 20, 2025
@jan-elastic jan-elastic added the >test, :ml, and Team:ML labels and removed the needs:triage label Oct 20, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jan-elastic jan-elastic force-pushed the fix-ml-tests-no-shards-available branch from f4200c7 to 4b0ed69 October 20, 2025 09:04
@jan-elastic jan-elastic requested a review from davidkyle October 20, 2025 09:06
@jan-elastic jan-elastic force-pushed the fix-ml-tests-no-shards-available branch from 4b0ed69 to 3a929ab October 20, 2025 14:24
Member

@davidkyle davidkyle left a comment

LGTM

@jan-elastic jan-elastic enabled auto-merge (squash) October 20, 2025 15:00
@jan-elastic jan-elastic merged commit 4b89f22 into elastic:main Oct 20, 2025
34 checks passed
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Oct 24, 2025
* fix debug output in TransportGetDataFrameAnalyticsStatsAction

* clear TrainedModelStatsService's queue upon MachineLearning reset

* unmute tests

* rename ResetAuditorActions -> ResetMlComponentsAction

* Move clearing stats queue to reset action

Labels

:ml, Team:ML, >test, v9.3.0

Development

Successfully merging this pull request may close these issues.

[CI] DeleteExpiredDataIT testDeleteExpiredDataWithStandardThrottle fails with "all shards failed"

3 participants