perf(fanout): optimize publish fan-out performance#20

Open
bherbruck wants to merge 4 commits into main from develop

Conversation

@bherbruck
Contributor

Summary

  • Optimize publish fan-out with Arc<str> and thread-local deduplication
  • Add pre-serialization for QoS 1/2 PUBLISH packets to avoid redundant encoding
  • Implement zero-copy message routing with SharedWriter for improved throughput

Test plan

  • Run cargo test to verify all tests pass
  • Run cargo bench or load test to verify performance improvements
  • Test with multiple subscribers to verify fan-out behavior

🤖 Generated with Claude Code

bherbruck and others added 4 commits December 28, 2025 01:28
… dedup

Two key optimizations for high-fanout scenarios:

1. Thread-local HashMap for deduplication
   - Reuse allocation across publishes instead of allocating per-publish
   - Eliminates ~6000 HashMap allocations per minute under load

2. Arc<str> for Publish.topic
   - Cloning topic for each subscriber is now O(1) instead of O(n)
   - For 2000 subscribers, eliminates 2000 string clones per publish

Based on analysis comparing VibeMQ to FlashMQ's hot-path optimizations.
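The two optimizations above can be sketched together; the `Publish` struct, `fan_out` function, and `DEDUP` thread-local below are illustrative stand-ins, not VibeMQ's actual types:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical Publish type: Arc<str> makes each per-subscriber topic clone
// a refcount bump instead of a heap copy of the string.
struct Publish {
    topic: Arc<str>,
    payload: Vec<u8>,
}

thread_local! {
    // Reused across publishes on this thread: cleared between calls, so the
    // backing allocation is kept instead of being rebuilt per publish.
    static DEDUP: RefCell<HashMap<u64, ()>> = RefCell::new(HashMap::new());
}

fn fan_out(msg: &Publish, subscriber_ids: &[u64]) -> usize {
    DEDUP.with(|seen| {
        let mut seen = seen.borrow_mut();
        seen.clear(); // drops entries, keeps capacity
        let mut delivered = 0;
        for &id in subscriber_ids {
            if seen.insert(id, ()).is_none() {
                // O(1) per subscriber: only the Arc refcount is incremented.
                let _topic: Arc<str> = Arc::clone(&msg.topic);
                delivered += 1;
            }
        }
        delivered
    })
}
```

With 2000 subscribers this turns 2000 `String` clones per publish into 2000 atomic increments, and the dedup map's allocation is amortized across every publish handled by the thread.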

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extend CachedPublish to support all QoS levels with in-place byte
patching for first_byte (dup/qos/retain) and packet_id fields.

Key changes:
- Add CachedPublish struct with offset tracking for patchable fields
- Change InflightMessage from struct to enum (Cached/Full variants)
- Update route_message to use CachedPublish for all QoS levels
- Update retry and session resume to use cached writes with dup=true
- Add optimization-guide.md documenting FlashMQ analysis
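The in-place patching idea can be sketched as follows; the struct shape and offsets are illustrative (the real `CachedPublish` records its offsets at encode time), but the fixed-header bit layout is MQTT's actual `0011 DQQR` (type=3, DUP, QoS, RETAIN):

```rust
// Sketch: patch dup/qos/retain and the packet identifier directly in a
// pre-serialized PUBLISH buffer instead of re-encoding the whole packet.
struct CachedPublish {
    bytes: Vec<u8>,
    first_byte_offset: usize, // 0 for the MQTT fixed header
    packet_id_offset: usize,  // position of the 2-byte packet identifier
}

impl CachedPublish {
    fn patched(&self, dup: bool, qos: u8, retain: bool, packet_id: u16) -> Vec<u8> {
        let mut out = self.bytes.clone();
        // PUBLISH fixed header: 0011 DQQR
        out[self.first_byte_offset] =
            0x30 | ((dup as u8) << 3) | ((qos & 0x03) << 1) | retain as u8;
        out[self.packet_id_offset..self.packet_id_offset + 2]
            .copy_from_slice(&packet_id.to_be_bytes());
        out
    }
}
```

Retry and session resume then only flip the DUP bit and stamp a fresh packet id rather than serializing the packet again.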

Performance improvement (QoS 2 fan-out benchmark):
- Mean latency: 274ms → 216ms (21% faster)
- P99 latency: 536ms → 458ms (14% faster)
- Memory: 457MB → 405MB (11% reduction)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major optimizations for high fan-out scenarios:

- Add RawPublish for zero-copy fan-out using incoming wire bytes
  - Patches first_byte and packet_id without re-encoding
  - Falls back to CachedPublish when protocol version differs

- Add SharedWriter for direct buffer writes
  - Bypasses channel overhead for message delivery
  - Notification coalescing (only notify on empty→non-empty)
  - Reduces wake-ups during burst fan-out

- Eliminate redundant session lookups in route_message
  - Use SharedWriter.protocol_version() instead of session lookup
  - Saves ~4000 lock acquisitions per 2000-subscriber fan-out

- Reduce buffer sizes from 4KB to 2KB
  - SharedWriter, buffer pool, WebSocket buffers

- Store Arc<CachedPublish> for atomic clone instead of allocation
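The notification-coalescing idea can be sketched with std types only; the `SharedWriter` below is a simplified stand-in (a `wakeups` counter replaces the real wake primitive, and locking details are elided):

```rust
use std::sync::Mutex;

// Sketch: the flusher is only woken on the empty -> non-empty transition;
// writes that land while a wake-up is already pending piggyback on it.
struct SharedWriter {
    buf: Mutex<Vec<u8>>,
    wakeups: Mutex<u64>, // instrumentation for this sketch
}

impl SharedWriter {
    fn new() -> Self {
        SharedWriter { buf: Mutex::new(Vec::new()), wakeups: Mutex::new(0) }
    }

    fn write(&self, bytes: &[u8]) {
        let mut buf = self.buf.lock().unwrap();
        let was_empty = buf.is_empty();
        buf.extend_from_slice(bytes);
        drop(buf);
        if was_empty {
            // Coalesced: at most one wake-up per burst.
            *self.wakeups.lock().unwrap() += 1;
        }
    }

    fn flush(&self) -> Vec<u8> {
        std::mem::take(&mut *self.buf.lock().unwrap())
    }
}
```

During a burst fan-out, thousands of writes collapse into a single wake-up of the writer task, which then flushes the accumulated buffer in one pass.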

Results on QoS 2 fan-out (100 pub, 2000 sub):
- Latency P50: 272ms → 122ms (-55%)
- CPU max: 245% → 186% (-24%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Box large Publish types in enum variants to reduce size differences:
  - ResendInfo::Full in connect.rs
  - RetryInfo::Full in qos.rs
  - InflightMessage::Full in session/mod.rs
  - OutboundMessage::Packet in broker/mod.rs
- Fix unnecessary mutable references in decoder.decode() calls
- Use struct initializer syntax instead of field reassignment
- Apply cargo fmt formatting fixes
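The boxing change can be illustrated with hypothetical types (field layout and sizes are made up for the sketch): an enum is as large as its largest variant, so boxing the big payload keeps every `InflightMessage` pointer-sized-ish regardless of which variant it holds.

```rust
// Illustrative only: a large in-memory PUBLISH representation.
struct BigPublish {
    topic: String,
    payload: Vec<u8>,
    properties: [u8; 128], // stand-in for MQTT 5 properties
}

enum InflightMessage {
    Cached(u16),           // small variant: just a packet id here
    Full(Box<BigPublish>), // boxed: the enum holds a pointer, not 170+ bytes
}
```

Without the `Box`, every `Cached` entry in a queue would still occupy the full `BigPublish` footprint.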

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>