feat(proto): Better NAT probe retry intervals#623
Conversation
This improves the NAT probing intervals: it now starts at 33ms by default and does an exponential backoff capping at 2s intervals. There are now 7 probe attempts spanning a period 4 seconds.
|
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/623/docs/noq/ Last updated: 2026-05-06T08:23:27Z |
|
Patchbay is sad until n0-computer/iroh#4213 is merged |
Performance Comparison Report
|
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5837.7 Mbps | 7930.8 Mbps | -26.4% | 97.0% / 98.6% |
| medium-concurrent | 5561.1 Mbps | 7758.9 Mbps | -28.3% | 95.7% / 97.5% |
| medium-single | 4109.1 Mbps | 4620.5 Mbps | -11.1% | 95.7% / 98.1% |
| small-concurrent | 3909.1 Mbps | 5112.4 Mbps | -23.5% | 97.4% / 99.6% |
| small-single | 3572.8 Mbps | 4687.6 Mbps | -23.8% | 96.2% / 98.4% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3114.8 Mbps | 4062.0 Mbps | -23.3% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.5% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.0% slower on average
6e2ebb5cea895e55eb52ea1e4a4974959bb51291 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5878.5 Mbps | 7903.9 Mbps | -25.6% | 96.5% / 151.0% |
| medium-concurrent | 5515.6 Mbps | 7641.0 Mbps | -27.8% | 96.1% / 150.0% |
| medium-single | 3826.4 Mbps | 4749.5 Mbps | -19.4% | 97.7% / 149.0% |
| small-concurrent | 3833.1 Mbps | 5299.1 Mbps | -27.7% | 93.4% / 102.0% |
| small-single | 3562.7 Mbps | 4847.0 Mbps | -26.5% | 94.8% / 102.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3114.5 Mbps | 4064.6 Mbps | -23.4% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.5% |
| lossy | 69.9 Mbps | 67.3 Mbps | +3.8% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 24.8% slower on average
99fb41ccfc9a6f28f6db544a8ecf4a56fe7f8d24 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5417.7 Mbps | 7885.4 Mbps | -31.3% | 94.7% / 99.0% |
| medium-concurrent | 5434.2 Mbps | 7681.9 Mbps | -29.3% | 95.3% / 101.0% |
| medium-single | 3864.8 Mbps | 4749.0 Mbps | -18.6% | 98.8% / 152.0% |
| small-concurrent | 3830.3 Mbps | 5494.3 Mbps | -30.3% | 90.9% / 99.2% |
| small-single | 3457.7 Mbps | 4890.4 Mbps | -29.3% | 93.3% / 103.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3075.0 Mbps | 4004.5 Mbps | -23.2% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 27.1% slower on average
| pub(crate) fn queue_retries(&mut self, ipv6: bool) { | ||
| self.attempt += 1; |
There was a problem hiding this comment.
We now have both attempt as well as ProbeState::Active(remaining).
Attempts are counted up, the remaining attempts for each probe are counted down.
Can we make it so that we only have attempt? I think all remaining values are basically the same across all probes, unless one of them is IPv6 in an ipv4-only socket.
There was a problem hiding this comment.
The major design choice is to keep sending probes to all remote candidates at the same time, rather than keep a timer for each probe individually. You can gain a new address mid-round: e.g. if the REACH_OUT frames are spread over two packets you'll get new probes mid-round. In this case the retry timers are all set from the first probe, not from later added ones. This is also the reason to keep attempt outside of the remaining, because they track very distinct numbers.
The other choice would be to track the delay for each probe individually. This would be technically be better, but the cost is that we need to set the timer a lot more. The way I see this working is to have the retry_delay function iterate over all the probes, compute the delay for each and return the shortest delay. Then when you queue retries you do the same: iterate over all probes and retry all those for which the timer has expired. And if we do all that we wouldn't have to keep two fields around.
I was kind of shying away from that 2nd version as I worried it might just be too much. But then maybe as we improve our incremental probing we'll end up having to do that anyway so maybe it is better to go straight for that? What do you think after this whole explanation?
There was a problem hiding this comment.
Although it reminds me of the PTO (or perhaps because of that) I'd say the second version is overkill.
I still don't think it's necessary to keep remaining around. IMO it can be a basic "succeeded/not succeeded".
When we receive the REACH_OUT on the second packet, well then we just sent the probes a little later.
If we receive it after the first retry attempt, well then we'll have one less retry with those probes, that's fine, we're not being very precise with the retry timing here anyways (after all, we don't do the second version!).
I can see how attempt and remaining could be separate values, but IMO the complexity is not worth it. The only effect of having them separate seems to be sending the later-arriving REACH_OUT candidates for potentially a couple more retires.
matheus23
left a comment
There was a problem hiding this comment.
I like this a lot - but I'd prefer to fix the canonicalization issue ASAP, if possible, and then get the iroh PR with the noq update in, and then see if this gets through patchbay tests fine.
Description
This improves the NAT probing intervals: it now starts at 33ms by
default and does an exponential backoff capping at 2s intervals. There
are now 7 probe attempts spanning a period 4 seconds.
Fixes #569
Breaking Changes
n/a
Notes & open questions
4s is a bit longer than I initially thought, but I think it's
fine. Maybe we could remove 1 or even 2 further attempts. Removing 1
gets you to 2s though. Alternatively the interval cap could be lowered
a bit and the initial interval a tiny bit higher. But really I'm only
writing this down for nice git history, I think the current tradeoff
is a good start.
Change checklist