fix(stream): propagate backpressure instead of silently dropping dial requests #6223
Conversation
When opening hundreds of streams per second, dial requests were being silently dropped because `try_send()` would fail when the channel was full, causing `Control::open_stream()` to hang indefinitely.

This commit implements backpressure propagation as suggested by maintainers:

1. Increased the `dial_sender` channel buffer from 0 (rendezvous) to 32 to handle burst traffic without blocking on every request.
2. Replaced `try_send()` with `send().await` in the dial path to propagate backpressure when more than 32 dial requests are pending, instead of silently dropping them.
3. Refactored `Shared::sender()` to return a cloned `dial_sender`, allowing the async send operation to happen outside the mutex lock to prevent deadlocks.

With these changes:

- Up to 32 concurrent dial requests queue immediately
- Additional requests block until space is available (backpressure)
- Errors are properly propagated instead of failing silently
- No more indefinite hangs on dropped dial requests

Signed-off-by: sneax <[email protected]>
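A minimal sketch of the dial path after this change, assuming `futures::channel::mpsc` and an illustrative `DialRequest` type rather than the actual `protocols/stream` internals:

```rust
use futures::{channel::mpsc, SinkExt, StreamExt};

// Illustrative stand-in for the real dial request type.
struct DialRequest(u32);

#[tokio::main]
async fn main() {
    // Buffer of 32: up to 32 dial requests queue without blocking.
    let (mut dial_sender, mut dial_receiver) = mpsc::channel::<DialRequest>(32);

    tokio::spawn(async move {
        for i in 0..100 {
            // Before: `try_send(req)` failed silently once the channel was
            // full. `send(..).await` instead parks the caller until a slot
            // frees up, propagating backpressure rather than dropping requests.
            dial_sender
                .send(DialRequest(i))
                .await
                .expect("receiver alive");
        }
    });

    // The receiving side drains requests; in the real code this drives dialing.
    while let Some(DialRequest(i)) = dial_receiver.next().await {
        let _ = i;
    }
}
```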
elenaf9 left a comment
Thanks for opening the PR @sneaxhuh.
protocols/stream/src/shared.rs
Outdated
  }

- pub(crate) fn sender(&mut self, peer: PeerId) -> mpsc::Sender<NewStream> {
+ pub(crate) fn sender(&mut self, peer: PeerId) -> (mpsc::Sender<NewStream>, Option<(PeerId, mpsc::Sender<PeerId>)>) {
Can we make this function async and do the send inside the function, instead of returning the Sender?
please have a look now!
Thanks for the quick follow-up.
Why is it not possible to make the sender fn async?
`Shared::sender` needs `&mut self`, and we only have that while holding the `MutexGuard`. If we make the whole function async, the guard stays alive across an `.await`, which blocks other parts of the code that need `Shared`, and sometimes the compiler won't even allow it. So we grab what we need under the lock, drop the guard, and then do the async send afterwards.
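A minimal sketch of that pattern, with `Shared`, its field name, and a placeholder `PeerId` assumed for illustration:

```rust
use std::sync::{Arc, Mutex};

use futures::{channel::mpsc, SinkExt};

type PeerId = u64; // stand-in for libp2p's PeerId

struct Shared {
    dial_sender: mpsc::Sender<PeerId>,
}

async fn open_stream(shared: Arc<Mutex<Shared>>, peer: PeerId) {
    // Clone the sender while holding the lock; the temporary guard is
    // dropped at the end of this statement, before any await point.
    let mut sender = shared.lock().unwrap().dial_sender.clone();

    // The send happens without the lock held, so other tasks can use
    // `Shared` while we wait for channel capacity.
    sender.send(peer).await.expect("dial task alive");
}
```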
Yes, you're right, we'd need to manually return an `impl Future<Output = mpsc::Sender<NewStream>>` and clone the `dial_sender` before returning the async block.
However, the larger issue with this is that each clone of the `dial_sender` increases the channel's capacity by one. So effectively we'd create an unbounded channel and lose backpressure.
Still, I agree that we shouldn't hold the lock while blocking on the future.
So I guess the way to do this would be to create a `Shared::poll_sender(PeerId, &mut Context<'_>) -> Poll<mpsc::Sender>` function and poll that in `Control::open_stream` with `poll_fn(|cx| Shared::lock(...).poll_sender(cx))` or something like that. Then we don't need to clone `dial_sender`.
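A small, self-contained demonstration of the capacity point above (`futures::channel::mpsc` gives each sender one guaranteed slot on top of the buffer):

```rust
use futures::channel::mpsc;

fn main() {
    // Buffer of 0 (rendezvous), but every sender still owns one slot.
    let (mut a, _rx) = mpsc::channel::<u32>(0);
    let mut b = a.clone();

    a.try_send(1).unwrap(); // ok: uses sender `a`'s guaranteed slot
    b.try_send(2).unwrap(); // ok: uses sender `b`'s guaranteed slot
    assert!(a.try_send(3).is_err()); // full: buffer 0, both slots taken

    // Cloning a sender per dial request therefore grows capacity without
    // bound, which is why handing out clones defeats backpressure.
}
```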
I tried implementing the `poll_sender` approach but ran into waker-handling issues with the mutex: tasks would hang because the waker registered during one lock acquisition didn't properly wake on subsequent polls. I switched to an unbounded channel approach instead. It fixes the silent-drop problem and avoids the mutex/waker complexity. Would you prefer I continue debugging the `poll_sender` approach, or is the unbounded solution acceptable?
My bad, you already said we don't want an unbounded channel. I tried using `poll_fn` with `poll_ready()` to drop the lock between polls: `poll_fn(|cx| Shared::lock(&self.shared).poll_send_dial(cx, peer)).await`.
However, when `poll_ready()` returns `Poll::Pending`, the waker is registered while holding the mutex. When the channel becomes ready and tries to wake the task, the waker needs to acquire the same mutex to make progress, causing a deadlock. Is there anything I am missing about how to handle this? Guidance would be much appreciated.
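For reference, a self-contained sketch of the shape described above, with `Shared`, `poll_send_dial`, and a placeholder `PeerId` assumed for illustration. Note that with a `std::sync::Mutex` the guard is released each time the `poll_fn` closure returns, including on `Pending`, so the lock is not held while the task is parked:

```rust
use std::{
    sync::{Arc, Mutex},
    task::{Context, Poll},
};

use futures::{channel::mpsc, future::poll_fn};

type PeerId = u64; // stand-in for libp2p's PeerId

struct Shared {
    dial_sender: mpsc::Sender<PeerId>,
}

impl Shared {
    fn poll_send_dial(
        &mut self,
        cx: &mut Context<'_>,
        peer: PeerId,
    ) -> Poll<Result<(), mpsc::SendError>> {
        // `poll_ready` reserves a slot, or registers `cx`'s waker and
        // returns `Pending` while the channel is full.
        match self.dial_sender.poll_ready(cx) {
            Poll::Ready(Ok(())) => Poll::Ready(self.dial_sender.start_send(peer)),
            Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
            Poll::Pending => Poll::Pending,
        }
    }
}

async fn send_dial(shared: Arc<Mutex<Shared>>, peer: PeerId) -> Result<(), mpsc::SendError> {
    // The guard lives only for the duration of each individual poll: it is
    // taken when the closure runs and dropped when it returns, so waking
    // the task does not contend with a lock held across an await point.
    poll_fn(|cx| shared.lock().unwrap().poll_send_dial(cx, peer)).await
}
```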
This reverts commit f06ec8c.
Fixes #6157