Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/changelog/136759.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 136759
summary: Avoid counting snapshot failures twice in SLM
area: ILM+SLM
type: bug
issues: []
Original file line number Diff line number Diff line change
Expand Up @@ -496,7 +496,8 @@ public ClusterState execute(ClusterState currentState) throws Exception {
final SnapshotLifecyclePolicyMetadata.Builder newPolicyMetadata = SnapshotLifecyclePolicyMetadata.builder(policyMetadata);
SnapshotLifecycleStats newStats = snapMeta.getStats();

if (registeredSnapshots.contains(snapshotId) == false) {
final boolean snapshotIsRegistered = registeredSnapshots.contains(snapshotId);
if (snapshotIsRegistered == false) {
logger.warn(
"Snapshot [{}] not found in registered set after snapshot completion. This means snapshot was"
+ " recorded as a failure by another snapshot's cleanup run.",
Expand Down Expand Up @@ -563,7 +564,11 @@ public ClusterState execute(ClusterState currentState) throws Exception {
exception.map(SnapshotLifecycleTask::exceptionToString).orElse(null)
)
);
newPolicyMetadata.incrementInvocationsSinceLastSuccess();
// If the snapshot was not registered, it means it was already counted as a failure by another snapshot's cleanup run
// so we should not increment the invocationsSinceLastSuccess again.
if (snapshotIsRegistered) {
newPolicyMetadata.incrementInvocationsSinceLastSuccess();
}
} else {
newStats = newStats.withTakenIncremented(policyName);
newPolicyMetadata.setLastSuccess(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -528,7 +528,8 @@ public void testCleanUpRegisteredInitiatedByFailure() throws Exception {
inferredFailureSnapshot,
snapshotInfoSuccess.snapshotId(),
snapshotInfoFailure1.snapshotId(),
snapshotInfoFailure2.snapshotId()
snapshotInfoFailure2.snapshotId(),
initiatingSnapshot
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing with my other changes. I think this test was originally wrong; since this WriteJobStatus runs for initiatingSnapshot, it should also be present in the registered snapshots. Please correct me if my reasoning is wrong!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, the test was passing before because of this exact bug that you are fixing.

)
);
var inProgress = Map.of(policyId, List.of(stillRunning));
Expand Down