Restarting the ingestor causes downtime #8411
Replies: 10 comments
-
To survive an ingester outage you need to ensure that at least a quorum of ingesters is healthy at any one time. With --receive.replication-factor=1 the quorum of writes that must succeed is 1. That means you cannot tolerate any outages, which is why you are seeing errors on rollout. Try setting a higher replication factor.
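For example, a minimal sketch of the router side, assuming a three-replica ingestor StatefulSet (the hashring file path is illustrative):

```
thanos receive \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3
# ...plus the usual --grpc-address / --remote-write.address flags
```

With a replication factor of 3 the write quorum is 2, so one ingester replica can restart while the router keeps accepting remote writes.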
-
I have set --receive.replication-factor=1 because I can tolerate some data loss. I don't have a problem losing a few minutes of metrics (the time needed to restart the pod), and I don't want to bring the quorum factor into play. The issue arises after the ingestor becomes available again: even though the ingestor is up and running, I'm not receiving metrics from the restart time onward. The error:
suggests that there is no connectivity between the receive-router and the ingestor, or at least that the ingestor does not reply to the receive-router's requests. However, I don't understand why this happens, given that before the restart the receive-router was able to contact the ingestor. I also don't understand why the ingestor stops accepting metrics from all the Prometheus sources, including those that do not send data to the restarted ingestor pod.
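For reference, these are the kinds of checks I use to confirm the restarted pod is back behind the headless Service (names taken from the forwarding error reported below; adjust to your namespace and Service names):

```
# Is the restarted pod published again behind the headless Service?
kubectl -n thanos-ns get endpoints thanos-receive-ingestor-default

# Does the per-pod DNS name the router forwards to still resolve?
kubectl -n thanos-ns run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local
```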
-
I think your analysis is correct, but the solution is incorrect. You must set a replication factor greater than 1.
-
Please check this: #7274
-
Warning: the quorum computation was changed in version 0.37. Quorum is now 1 for a replication factor of 2, so a write with RF=2 succeeds as soon as a single ingester acknowledges it!
-
I have tried setting --receive.replication-factor=2, but this doesn't solve my problem. The forward fails with the same error.
-
Is your hashring dynamic?
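By dynamic I mean: does the router read the hashring from a file that it watches and reloads while running, or from a static flag that is only read at startup? Roughly (endpoint name copied from the error in this thread):

```
# Static: hashring passed directly as a flag, only read when the router starts
--receive.hashrings='[{"hashring":"default","endpoints":["thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:10901"]}]'

# Dynamic: hashring file that the router watches and periodically re-reads
--receive.hashrings-file=/etc/thanos/hashrings.json
--receive.hashrings-file-refresh-interval=1m
```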
-
Seems like the issue is somewhere in the configuration, so I am converting this into a discussion.
-
I'm not sure my case is a configuration problem. I've seen several similar issues and tried the recommended solutions, but they don't seem to work. In the end, I managed to stabilize it by adding a probe that restarts the receive-router based on the results of queries to Thanos Query. This keeps things stable, but I believe the underlying problem deserves analysis.
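A rough sketch of the kind of check the probe runs (simplified; the Thanos Query address, port, and PromQL expression here are placeholders, not my exact configuration):

```
#!/bin/sh
# Liveness-style check: exit non-zero (so Kubernetes restarts the receive-router)
# when Thanos Query has no sample newer than 5 minutes.
FRESH=$(curl -sf "http://thanos-query.thanos-ns.svc:10902/api/v1/query" \
  --data-urlencode 'query=max(timestamp(up)) > time() - 300' \
  | jq '.data.result | length')
[ "${FRESH:-0}" -gt 0 ]
```

Wired up as an exec livenessProbe on the receive-router pod, it forces a restart whenever data stops flowing end to end.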
-
It may be helpful to specify which Helm chart you are using and which version of the chart.
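For example (release name and namespace are placeholders):

```
helm list -n thanos-ns                  # shows the chart name and chart/app version
helm get values <release> -n thanos-ns  # shows the values the release was installed with
```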
-
Hi, in my environment I use Thanos installed on Kubernetes. In particular, I have a series of Prometheus instances that send metrics to a central Thanos via the receive-router, which then distributes the metrics to the ingestors (deployed as Kubernetes StatefulSets and reachable through a headless Service). Under normal conditions everything works correctly. The problem arises when one of the ingestor replicas is restarted: in that case the receive-router fails to send the metrics, returning the following error:
caller=handler.go:600 level=debug component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="forwarding request to endpoint {thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:10901 thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:19391 }: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
The consequence is that all metrics from that moment onward are lost until the receive-router is restarted.
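The problem is easy to reproduce by restarting a single ingestor replica and watching the router logs (the receive-router Deployment name below is just a placeholder for whatever yours is called):

```
kubectl -n thanos-ns rollout restart statefulset/thanos-receive-ingestor-default
kubectl -n thanos-ns logs deploy/thanos-receive-router -f | grep "failed to handle request"
```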
Thanos version: 0.37.3
Receive Router configuration:
Receive Ingestor configuration: