feat(proto): Better NAT probe retry intervals #623

Merged

flub merged 3 commits into main from flub/nat-probe-retry-timing on May 6, 2026
Conversation

@flub (Collaborator) commented Apr 29, 2026

Description

This improves the NAT probing intervals: probing now starts at 33ms by
default and backs off exponentially, capping at 2s intervals. There
are now 7 probe attempts spanning a period of 4 seconds.
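The schedule described above can be sketched roughly as follows. This is a hypothetical reconstruction for illustration, not the actual noq code; `retry_delay` and the constants are made-up names:

```rust
use std::time::Duration;

// Illustrative sketch (not the actual noq implementation): the retry
// interval starts at 33ms, doubles on each attempt, and caps at 2s.
const INITIAL_DELAY: Duration = Duration::from_millis(33);
const MAX_DELAY: Duration = Duration::from_secs(2);

fn retry_delay(attempt: u32) -> Duration {
    INITIAL_DELAY
        .saturating_mul(1u32 << attempt.min(16))
        .min(MAX_DELAY)
}

fn main() {
    // Delays: 33ms, 66ms, 132ms, 264ms, 528ms, 1056ms, 2s (capped),
    // totalling ~4.1s across 7 attempts.
    let total: Duration = (0..7).map(retry_delay).sum();
    assert_eq!(total, Duration::from_millis(4079));
    println!("total probing window: {total:?}");
}
```

With these numbers the 7 attempts add up to roughly 4 seconds, matching the window quoted in the description.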

Fixes #569

Breaking Changes

n/a

Notes & open questions

4s is a bit longer than I initially thought, but I think it's
fine. Maybe we could remove 1 or even 2 further attempts, though removing 1
only gets you down to 2s. Alternatively the interval cap could be lowered
a bit and the initial interval raised a tiny bit. But really I'm only
writing this down for a nice git history; I think the current tradeoff
is a good start.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.

@github-actions Bot commented Apr 29, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/623/docs/noq/

Last updated: 2026-05-06T08:23:27Z

@flub (Collaborator, Author) commented Apr 29, 2026

Patchbay is sad until n0-computer/iroh#4213 is merged

@github-actions Bot commented Apr 29, 2026

Performance Comparison Report

4759df8196f8770ebe6760aeb1d3b80c2abce083 - artifacts

Raw Benchmarks (localhost)

| Scenario | noq | upstream | Delta | CPU (avg/max) |
| --- | --- | --- | --- | --- |
| large-single | 5837.7 Mbps | 7930.8 Mbps | -26.4% | 97.0% / 98.6% |
| medium-concurrent | 5561.1 Mbps | 7758.9 Mbps | -28.3% | 95.7% / 97.5% |
| medium-single | 4109.1 Mbps | 4620.5 Mbps | -11.1% | 95.7% / 98.1% |
| small-concurrent | 3909.1 Mbps | 5112.4 Mbps | -23.5% | 97.4% / 99.6% |
| small-single | 3572.8 Mbps | 4687.6 Mbps | -23.8% | 96.2% / 98.4% |

Netsim Benchmarks (network simulation)

| Condition | noq | upstream | Delta |
| --- | --- | --- | --- |
| ideal | 3114.8 Mbps | 4062.0 Mbps | -23.3% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.5% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |

Summary

noq is 23.0% slower on average

---
6e2ebb5cea895e55eb52ea1e4a4974959bb51291 - artifacts

Raw Benchmarks (localhost)

| Scenario | noq | upstream | Delta | CPU (avg/max) |
| --- | --- | --- | --- | --- |
| large-single | 5878.5 Mbps | 7903.9 Mbps | -25.6% | 96.5% / 151.0% |
| medium-concurrent | 5515.6 Mbps | 7641.0 Mbps | -27.8% | 96.1% / 150.0% |
| medium-single | 3826.4 Mbps | 4749.5 Mbps | -19.4% | 97.7% / 149.0% |
| small-concurrent | 3833.1 Mbps | 5299.1 Mbps | -27.7% | 93.4% / 102.0% |
| small-single | 3562.7 Mbps | 4847.0 Mbps | -26.5% | 94.8% / 102.0% |

Netsim Benchmarks (network simulation)

| Condition | noq | upstream | Delta |
| --- | --- | --- | --- |
| ideal | 3114.5 Mbps | 4064.6 Mbps | -23.4% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.5% |
| lossy | 69.9 Mbps | 67.3 Mbps | +3.8% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |

Summary

noq is 24.8% slower on average

---
99fb41ccfc9a6f28f6db544a8ecf4a56fe7f8d24 - artifacts

Raw Benchmarks (localhost)

| Scenario | noq | upstream | Delta | CPU (avg/max) |
| --- | --- | --- | --- | --- |
| large-single | 5417.7 Mbps | 7885.4 Mbps | -31.3% | 94.7% / 99.0% |
| medium-concurrent | 5434.2 Mbps | 7681.9 Mbps | -29.3% | 95.3% / 101.0% |
| medium-single | 3864.8 Mbps | 4749.0 Mbps | -18.6% | 98.8% / 152.0% |
| small-concurrent | 3830.3 Mbps | 5494.3 Mbps | -30.3% | 90.9% / 99.2% |
| small-single | 3457.7 Mbps | 4890.4 Mbps | -29.3% | 93.3% / 103.0% |

Netsim Benchmarks (network simulation)

| Condition | noq | upstream | Delta |
| --- | --- | --- | --- |
| ideal | 3075.0 Mbps | 4004.5 Mbps | -23.2% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |

Summary

noq is 27.1% slower on average

@n0bot n0bot Bot added this to iroh Apr 29, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh Apr 29, 2026
@flub flub moved this from 🚑 Needs Triage to 👀 In review in iroh Apr 29, 2026
@flub flub added this to the iroh v1.0.0-rc.0 milestone Apr 29, 2026
Comment on lines +439 to +440

```rust
pub(crate) fn queue_retries(&mut self, ipv6: bool) {
    self.attempt += 1;
```
Member

We now have both `attempt` as well as `ProbeState::Active(remaining)`.

Attempts are counted up; the remaining attempts for each probe are counted down.

Can we make it so that we only have `attempt`? I think all `remaining` values are basically the same across all probes, unless one of them is IPv6 on an IPv4-only socket.

Collaborator Author

The major design choice is to keep sending probes to all remote candidates at the same time, rather than keeping a timer for each probe individually. You can gain a new address mid-round: e.g. if the REACH_OUT frames are spread over two packets you'll get new probes mid-round. In this case the retry timers are all set from the first probe, not from the later-added ones. This is also the reason to keep `attempt` outside of `remaining`: they track very distinct numbers.

The other choice would be to track the delay for each probe individually. This would technically be better, but the cost is that we need to set the timer a lot more often. The way I see this working is to have the `retry_delay` function iterate over all the probes, compute the delay for each and return the shortest delay. Then when you queue retries you do the same: iterate over all probes and retry all those whose timer has expired. And if we do all that we wouldn't have to keep two fields around.

I was kind of shying away from that 2nd version as I worried it might just be too much. But then maybe as we improve our incremental probing we'll end up having to do that anyway, so maybe it is better to go straight for that? What do you think after this whole explanation?
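For the record, the per-probe-timer design described above could look roughly like this. This is a hypothetical sketch under the assumptions stated in the comment; the names (`Probe`, `backoff`, `retry_delay`, `queue_retries`) are illustrative and not noq's actual API:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the "second version": each probe carries its own
// deadline, retry_delay returns the shortest time until any probe is due,
// and queue_retries retries exactly the expired ones.
struct Probe {
    next_retry: Instant,
    attempt: u32,
}

fn backoff(attempt: u32) -> Duration {
    Duration::from_millis(33)
        .saturating_mul(1u32 << attempt.min(16))
        .min(Duration::from_secs(2))
}

// The single timer would be armed with the earliest deadline across probes.
fn retry_delay(probes: &[Probe], now: Instant) -> Option<Duration> {
    probes
        .iter()
        .map(|p| p.next_retry.saturating_duration_since(now))
        .min()
}

// On timer expiry, retry every probe whose deadline has passed.
fn queue_retries(probes: &mut [Probe], now: Instant) {
    for p in probes.iter_mut().filter(|p| p.next_retry <= now) {
        p.attempt += 1;
        p.next_retry = now + backoff(p.attempt);
        // actually (re)sending the probe would happen here
    }
}

fn main() {
    let now = Instant::now();
    let mut probes = vec![
        Probe { next_retry: now, attempt: 0 },
        // A candidate learned mid-round, due 500ms later.
        Probe { next_retry: now + Duration::from_millis(500), attempt: 0 },
    ];
    queue_retries(&mut probes, now);
    assert_eq!(probes[0].attempt, 1);
    assert_eq!(probes[1].attempt, 0);
    println!("next timer: {:?}", retry_delay(&probes, now));
}
```

The cost the comment mentions is visible here: every call to `queue_retries` changes some deadlines, so the timer has to be re-armed via `retry_delay` far more often than with a single shared schedule.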

Member

Although it reminds me of the PTO (or perhaps because of that) I'd say the second version is overkill.

I still don't think it's necessary to keep remaining around. IMO it can be a basic "succeeded/not succeeded".
When we receive the REACH_OUT on the second packet, well then we just sent the probes a little later.
If we receive it after the first retry attempt, well then we'll have one less retry with those probes, that's fine, we're not being very precise with the retry timing here anyways (after all, we don't do the second version!).

I can see how `attempt` and `remaining` could be separate values, but IMO the complexity is not worth it. The only effect of having them separate seems to be sending the later-arriving REACH_OUT candidates for potentially a couple more retries.
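The simpler scheme suggested here could be sketched roughly as follows. Again a hypothetical illustration, not noq's actual code: one shared attempt counter with a fixed budget, and per-probe state reduced to "succeeded or not", so late-arriving candidates just join whatever shared retries remain:

```rust
// Illustrative sketch only; MAX_ATTEMPTS, Probe and Prober are made-up names.
const MAX_ATTEMPTS: u32 = 7;

struct Probe {
    succeeded: bool,
}

struct Prober {
    attempt: u32,
    probes: Vec<Probe>,
}

impl Prober {
    // Returns false once the shared retry budget is exhausted.
    fn queue_retries(&mut self) -> bool {
        if self.attempt >= MAX_ATTEMPTS {
            return false;
        }
        self.attempt += 1;
        for _probe in self.probes.iter().filter(|p| !p.succeeded) {
            // resending this probe would happen here
        }
        true
    }
}

fn main() {
    let mut prober = Prober {
        attempt: 0,
        probes: vec![Probe { succeeded: false }, Probe { succeeded: true }],
    };
    while prober.queue_retries() {}
    assert_eq!(prober.attempt, MAX_ATTEMPTS);
}
```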

Member

@matheus23 left a comment

I like this a lot - but I'd prefer to fix the canonicalization issue ASAP, if possible, and then get the iroh PR with the noq update in, and then see if this gets through patchbay tests fine.

@flub flub enabled auto-merge May 6, 2026 08:45
@flub flub disabled auto-merge May 6, 2026 09:00
@flub flub enabled auto-merge May 6, 2026 09:00
Member

@Frando left a comment

(did not actually review)

@flub flub added this pull request to the merge queue May 6, 2026
Merged via the queue into main with commit 262697c May 6, 2026
64 of 65 checks passed
@flub flub deleted the flub/nat-probe-retry-timing branch May 6, 2026 09:16
@github-project-automation github-project-automation Bot moved this from 👀 In review to ✅ Done in iroh May 6, 2026


Development

Successfully merging this pull request may close these issues.

Improve NAT Traversal probing intervals

3 participants