Skip to content

perf(connection): raise MAX_TRANSMIT_SEGMENTS to 40 and MAX_TRANSMIT_DATAGRAMS to 80#636

Open
poka-IT wants to merge 1 commit inton0-computer:mainfrom
poka-IT:warren-perf-tier1-clean
Open

perf(connection): raise MAX_TRANSMIT_SEGMENTS to 40 and MAX_TRANSMIT_DATAGRAMS to 80#636
poka-IT wants to merge 1 commit inton0-computer:mainfrom
poka-IT:warren-perf-tier1-clean

Conversation

@poka-IT
Copy link
Copy Markdown

@poka-IT poka-IT commented May 5, 2026

Description

Raises MAX_TRANSMIT_SEGMENTS from 10 to 40 and MAX_TRANSMIT_DATAGRAMS from 20 to 80 in noq/src/connection.rs.

At MTU 1280 this means 51.2 KB per sendmsg(UDP_SEGMENT) call instead of 12.8 KB. The Linux kernel hard limit UDP_MAX_SEGMENTS is 64, so 40 stays comfortably within bounds.

On uplink-saturated benchmarks I measured a meaningful throughput improvement on a Hetzner CCX23 box (single iperf3 -P 1, MTU 1280). I noticed this while looking at quinn issue 1572 and the ETHZ NSG 2024 thesis section 5.4 ("Patching Quinn"), which mentions that raising similar constants doubles msquic-style throughput in their bench.

Breaking Changes

None. Internal constants only.

Notes & open questions

Memory usage per drive call grows by 4x (40 vs 10 segments pre-allocated), which is an acceptable tradeoff for the throughput gain on saturated workloads. Lower-traffic connections still allocate on demand and should not see a difference.

Happy to add a benchmark to bench/ if useful, but the change is self-contained and the rationale matches the existing comment.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant. (No new behavior, constants only.)
  • All breaking changes documented.

…DATAGRAMS to 80

At MTU 1280 this means 51.2 KB per sendmsg(UDP_SEGMENT) call instead of
12.8 KB. The Linux kernel hard limit UDP_MAX_SEGMENTS is 64, so 40 stays
within bounds.

On uplink-saturated benchmarks I measured a meaningful throughput
improvement on a Hetzner CCX23 box (single iperf3 -P 1, MTU 1280).
@n0bot n0bot Bot added this to iroh May 5, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh May 5, 2026
@flub
Copy link
Copy Markdown
Collaborator

flub commented May 5, 2026

Thanks for the PR, could you post your benchmarks from before and after, including instruction on how to recreate them? These kind of changes tend to be delicate, so we'll want to try this out in some scenarios and see how comparable the results are.

@poka-IT
Copy link
Copy Markdown
Author

poka-IT commented May 5, 2026

Thanks for taking a look. Here's the data, with enough context to reproduce it.

Hardware / network setup

Two Hetzner CCX23 dedicated x86 VMs in fsn1-dc14:

  • 4 vCPU dedicated, 16 GB RAM
  • iperf3 VPS to VPS direct: ~16-17 Gbps symmetric (REF, no tunnel)
  • Sysctl on both sides:
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.netdev_max_backlog = 5000
    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    
  • Tunnel MTU 1280, GSO enabled, no other QUIC tuning beyond defaults
  • Stack on top of noq: a small QUIC VPN that runs N=32 noq::Connections per client, raw datagrams (RFC 9221), no streams. Exit side spawns one pump task per Connection. Client side has a single uplink task that dispatches by 5-tuple hash to one of the N Connections.

iperf3 driver: iperf3 -P 8 -t 30 -c <target> (8 parallel TCP streams), 5 runs uplink + 3 runs downlink.

Numbers

Bench A is vanilla noq 0.18.0. Bench B is the same source tree with the patched constants. Different VM pair, different day, same provisioning script and sysctls. Cross-session, not paired same-session, so see the variance caveat at the bottom.

Uplink (client to exit), 5x30s, P=8:

Run A (10/20) B (40/80)
1 1676 1953
2 1956 1898
3 1911 1814
4 1909 1853
5 1869 1932
avg 1864 Mbps 1890 Mbps

Downlink (exit to client), 3x30s, P=8:

Run A (10/20) B (40/80)
1 1949 2399
2 1924 2494
3 1948 2368
avg 1940 Mbps 2420 Mbps (+24.7%)

Within-session stddev: B uplink ±46 Mbps (2.5%), B downlink ±53 Mbps (2.2%). Tight enough that the +25% downlink is well above noise.

CPU client: ~210-230% (≈2.2 cores). RSS: 27 MB (A) vs 30 MB (B), so the 4x pre-alloc growth per drive call is small enough that it doesn't show up materially at N=32 connections.

Why it's asymmetric

Uplink shows no real change because our client-side architecture has a single sequential pump task. It's already drive-bound on its own loop, not on sendmsg capacity. Bumping the segment cap doesn't help when only one task is calling poll_transmit at a time.

Downlink benefits because the exit side has 32 parallel pump tasks, each driving its own Connection. Each one's sendmsg(UDP_SEGMENT) payload jumps from 12.8 KB (10 x 1280) to 51.2 KB (40 x 1280). That's where the ~25% comes from.

So the gain visible here is heavily workload-dependent: it shows up when several Connections drive transmit concurrently and the link has headroom. A single Connection doing bulk send on a saturated 25 GbE box (your typical bench setup) would probably show a different shape. Happy to run that variant if you want a more noq-native scenario.

Reproduce

Patch (already in this PR):

-const MAX_TRANSMIT_DATAGRAMS: usize = 20;
+const MAX_TRANSMIT_DATAGRAMS: usize = 80;
-const MAX_TRANSMIT_SEGMENTS: NonZeroUsize = NonZeroUsize::new(10).expect("known");
+const MAX_TRANSMIT_SEGMENTS: NonZeroUsize = NonZeroUsize::new(40).expect("known");

Apply against noq-v0.18.0. We pin via [patch.crates-io] in our workspace Cargo.toml:

[patch.crates-io]
noq       = { path = "vendor/noq-fork/noq" }
noq-proto = { path = "vendor/noq-fork/noq-proto" }
noq-udp   = { path = "vendor/noq-fork/noq-udp" }

(Patching all three is required, otherwise iroh-relay pulls in two copies of noq-proto types and the workspace fails to compile.)

Bench loop (runs on the client VM, exit running iperf3 -s on :49200):

for i in 1 2 3 4 5; do
  iperf3 -J -c <exit-tunnel-ip> -p 49200 -t 30 -P 8 \
    | jq -r '.end.sum_sent.bits_per_second / 1e6'
  sleep 1
done

Repeat with -R for downlink. The full multi-run wrapper we use parses min/max/stddev from iperf3 JSON via jq if that shape is useful.

Caveats I noticed

The two benches above are cross-session (different VM pair, different day). Hetzner inter-session network capacity varies a lot: on a follow-up run, the same provisioning gave a REF that dropped from 16.9 Gbps to 7.4 Gbps downlink, and the tunnel throughput scaled down with it. Anything below ~10% delta needs paired same-session runs to be trustworthy. The +25% downlink here is well above that threshold, but I would not claim a stable absolute Mbps number across hardware. If a paired same-session A/B with a flag-toggled binary would carry more weight for you, I can run that.

Memory: at N=32 Connections, the additional pre-allocated transmit space per drive call (4x growth) added ~3 MB total RSS in our setup. On something with thousands of Connections it would be more visible.

Regressions

We have 11 multi-conn QUIC integration tests covering the pump and accept loops; all pass unchanged after the patch. No new test added for the constants since there's no behavior change to assert, but happy to add a memory-footprint sanity test if you want one.

Let me know if you want flamegraphs, raw iperf3 JSON, or a different shape of bench (single Connection, lossy link, low-bandwidth path). I can also run the same A/B on a kernel-WireGuard pairing for sanity check if that's useful.

@flub
Copy link
Copy Markdown
Collaborator

flub commented May 6, 2026

If I followed your description it seems these benches are from using your entire stack and tests a TCP connection established by i3perf which itself is tunnelled into a QUIC connection using the QUIC datagram extension. Unfortunately I have no idea what your entire stack is, you haven't told me yet :).

I think ideally we'd be able to write a small self-contained perf tool in rust that only uses noq with no other components involved to compare the performance of these.

There is already a perf binary in the noq source tree that does some of this, but it may do things differently on your setup and only tests streams I think. Ideally you can demonstrate the per difference already with this, but if you need to make adjustments to how the connections are run and if the QUIC datagrams are important then maybe it is still a good start to modify things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🚑 Needs Triage

Development

Successfully merging this pull request may close these issues.

2 participants