Restarting the ingestor causes downtime #8411
Replies: 10 comments
-
To survive an ingester outage you need to ensure that at least a quorum of ingesters is healthy at any one time. With --receive.replication-factor=1 the quorum of writes that must succeed is 1. That means you cannot tolerate any outages, which is why you are seeing errors on rollout. Try setting a higher replication factor.
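For example, a minimal sketch of the router side, assuming a three-replica ingestor StatefulSet (the hashring file path is illustrative):

```
thanos receive \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3
# ...plus the usual --grpc-address / --remote-write.address flags
```

With a replication factor of 3 the write quorum is 2, so one ingester replica can restart while the router keeps accepting remote writes.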
-
I have set --receive.replication-factor=1 because I can tolerate some data loss. I don't have a problem losing a few minutes of metrics (the time needed to restart the pod), and I don't want to bring the quorum factor into play. The issue arises after the ingestor becomes available again: even though the ingestor is up and running, I'm not receiving metrics from the restart time onward. The error:
suggests that there is no connectivity between the receive-router and the ingestor, or at least that the ingestor does not reply to the receive-router's requests. However, I don't understand why this happens, given that before the restart the receive-router was able to contact the ingestor. I also don't understand why the ingestor stops accepting metrics from all the Prometheus sources, including those that do not send data to the restarted ingestor pod.
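For reference, these are the kinds of checks I use to confirm the restarted pod is back behind the headless Service (names taken from the forwarding error reported below; adjust to your namespace and Service names):

```
# Is the restarted pod published again behind the headless Service?
kubectl -n thanos-ns get endpoints thanos-receive-ingestor-default

# Does the per-pod DNS name the router forwards to still resolve?
kubectl -n thanos-ns run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local
```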
-
I think your analysis is correct, but the solution is incorrect. You must set a replication factor greater than 1.
-
Please check this: #7274
-
Warning: the quorum computation was changed in version 0.37. Quorum is now 1 for a replication factor of 2, so a write with RF=2 succeeds as soon as a single ingester acknowledges it!
-
I have tried setting --receive.replication-factor=2, but this doesn't solve my problem. The forward fails with the same error.
-
Is your hashring dynamic?
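By dynamic I mean: does the router read the hashring from a file that it watches and reloads while running, or from a static flag that is only read at startup? Roughly (endpoint name copied from the error in this thread):

```
# Static: hashring passed directly as a flag, only read when the router starts
--receive.hashrings='[{"hashring":"default","endpoints":["thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:10901"]}]'

# Dynamic: hashring file that the router watches and periodically re-reads
--receive.hashrings-file=/etc/thanos/hashrings.json
--receive.hashrings-file-refresh-interval=1m
```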
-
Seems like the issue is somewhere in the configuration, so I am converting this into a discussion.
-
I'm not sure my case is a configuration problem. I've seen several similar issues and tried the recommended solutions, but they don't seem to work. In the end, I managed to stabilize it by adding a probe that restarts the receive-router based on the results of queries to Thanos Query. This keeps things stable, but I believe the underlying problem deserves analysis.
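A rough sketch of the kind of check the probe runs (simplified; the Thanos Query address, port, and PromQL expression here are placeholders, not my exact configuration):

```
#!/bin/sh
# Liveness-style check: exit non-zero (so Kubernetes restarts the receive-router)
# when Thanos Query has no sample newer than 5 minutes.
FRESH=$(curl -sf "http://thanos-query.thanos-ns.svc:10902/api/v1/query" \
  --data-urlencode 'query=max(timestamp(up)) > time() - 300' \
  | jq '.data.result | length')
[ "${FRESH:-0}" -gt 0 ]
```

Wired up as an exec livenessProbe on the receive-router pod, it forces a restart whenever data stops flowing end to end.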
-
It may be helpful to specify which Helm chart you are using and which version of the chart.
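For example (release name and namespace are placeholders):

```
helm list -n thanos-ns                  # shows the chart name and chart/app version
helm get values <release> -n thanos-ns  # shows the values the release was installed with
```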
-
Hi, in my environment I use Thanos installed on Kubernetes. In particular, I have a series of Prometheus instances that send metrics to a central Thanos via the receive-router, which then distributes the metrics to the ingestors (deployed as Kubernetes StatefulSets and reachable through a headless Service). Under normal conditions everything works correctly. The problem arises when one of the ingestor replicas is restarted: in that case the receive-router fails to send the metrics, returning the following error:
caller=handler.go:600 level=debug component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="forwarding request to endpoint {thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:10901 thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos-ns.svc.cluster.local:19391 }: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
The consequence is that all metrics from that moment onward are lost until the receive-router is restarted.
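The problem is easy to reproduce by restarting a single ingestor replica and watching the router logs (the receive-router Deployment name below is just a placeholder for whatever yours is called):

```
kubectl -n thanos-ns rollout restart statefulset/thanos-receive-ingestor-default
kubectl -n thanos-ns logs deploy/thanos-receive-router -f | grep "failed to handle request"
```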
Thanos version: 0.37.3
Receive Router configuration:
Receive Ingestor configuration: