perf(fanout): optimize publish fan-out performance#20

Open
bherbruck wants to merge 4 commits into main from develop

Conversation

@bherbruck
Contributor

Summary

  • Optimize publish fan-out with Arc<str> and thread-local deduplication
  • Add pre-serialization for QoS 1/2 PUBLISH packets to avoid redundant encoding
  • Implement zero-copy message routing with SharedWriter for improved throughput

Test plan

  • Run cargo test to verify all tests pass
  • Run cargo bench or load test to verify performance improvements
  • Test with multiple subscribers to verify fan-out behavior

🤖 Generated with Claude Code

bherbruck and others added 4 commits December 28, 2025 01:28
… dedup

Two key optimizations for high-fanout scenarios:

1. Thread-local HashMap for deduplication
   - Reuse allocation across publishes instead of allocating per-publish
   - Eliminates ~6000 HashMap allocations per minute under load

2. Arc<str> for Publish.topic
   - Cloning topic for each subscriber is now O(1) instead of O(n)
   - For 2000 subscribers, eliminates 2000 string clones per publish

Based on analysis comparing VibeMQ to FlashMQ's hot-path optimizations.
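The two optimizations above can be sketched together; the `Publish` struct, `fan_out` function, and `DEDUP` thread-local below are illustrative stand-ins, not VibeMQ's actual types:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical Publish type: Arc<str> makes each per-subscriber topic clone
// a refcount bump instead of a heap copy of the string.
struct Publish {
    topic: Arc<str>,
    payload: Vec<u8>,
}

thread_local! {
    // Reused across publishes on this thread: cleared between calls, so the
    // backing allocation is kept instead of being rebuilt per publish.
    static DEDUP: RefCell<HashMap<u64, ()>> = RefCell::new(HashMap::new());
}

fn fan_out(msg: &Publish, subscriber_ids: &[u64]) -> usize {
    DEDUP.with(|seen| {
        let mut seen = seen.borrow_mut();
        seen.clear(); // drops entries, keeps capacity
        let mut delivered = 0;
        for &id in subscriber_ids {
            if seen.insert(id, ()).is_none() {
                // O(1) per subscriber: only the Arc refcount is incremented.
                let _topic: Arc<str> = Arc::clone(&msg.topic);
                delivered += 1;
            }
        }
        delivered
    })
}
```

With 2000 subscribers this turns 2000 `String` clones per publish into 2000 atomic increments, and the dedup map's allocation is amortized across every publish handled by the thread.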

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extend CachedPublish to support all QoS levels with in-place byte
patching for first_byte (dup/qos/retain) and packet_id fields.

Key changes:
- Add CachedPublish struct with offset tracking for patchable fields
- Change InflightMessage from struct to enum (Cached/Full variants)
- Update route_message to use CachedPublish for all QoS levels
- Update retry and session resume to use cached writes with dup=true
- Add optimization-guide.md documenting FlashMQ analysis
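The in-place patching idea can be sketched as follows; the struct shape and offsets are illustrative (the real `CachedPublish` records its offsets at encode time), but the fixed-header bit layout is MQTT's actual `0011 DQQR` (type=3, DUP, QoS, RETAIN):

```rust
// Sketch: patch dup/qos/retain and the packet identifier directly in a
// pre-serialized PUBLISH buffer instead of re-encoding the whole packet.
struct CachedPublish {
    bytes: Vec<u8>,
    first_byte_offset: usize, // 0 for the MQTT fixed header
    packet_id_offset: usize,  // position of the 2-byte packet identifier
}

impl CachedPublish {
    fn patched(&self, dup: bool, qos: u8, retain: bool, packet_id: u16) -> Vec<u8> {
        let mut out = self.bytes.clone();
        // PUBLISH fixed header: 0011 DQQR
        out[self.first_byte_offset] =
            0x30 | ((dup as u8) << 3) | ((qos & 0x03) << 1) | retain as u8;
        out[self.packet_id_offset..self.packet_id_offset + 2]
            .copy_from_slice(&packet_id.to_be_bytes());
        out
    }
}
```

Retry and session resume then only flip the DUP bit and stamp a fresh packet id rather than serializing the packet again.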

Performance improvement (QoS 2 fan-out benchmark):
- Mean latency: 274ms → 216ms (21% faster)
- P99 latency: 536ms → 458ms (14% faster)
- Memory: 457MB → 405MB (11% reduction)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major optimizations for high fan-out scenarios:

- Add RawPublish for zero-copy fan-out using incoming wire bytes
  - Patches first_byte and packet_id without re-encoding
  - Falls back to CachedPublish when protocol version differs

- Add SharedWriter for direct buffer writes
  - Bypasses channel overhead for message delivery
  - Notification coalescing (only notify on empty→non-empty)
  - Reduces wake-ups during burst fan-out

- Eliminate redundant session lookups in route_message
  - Use SharedWriter.protocol_version() instead of session lookup
  - Saves ~4000 lock acquisitions per 2000-subscriber fan-out

- Reduce buffer sizes from 4KB to 2KB
  - SharedWriter, buffer pool, WebSocket buffers

- Store Arc<CachedPublish> for atomic clone instead of allocation
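The notification-coalescing idea can be sketched with std types only; the `SharedWriter` below is a simplified stand-in (a `wakeups` counter replaces the real wake primitive, and locking details are elided):

```rust
use std::sync::Mutex;

// Sketch: the flusher is only woken on the empty -> non-empty transition;
// writes that land while a wake-up is already pending piggyback on it.
struct SharedWriter {
    buf: Mutex<Vec<u8>>,
    wakeups: Mutex<u64>, // instrumentation for this sketch
}

impl SharedWriter {
    fn new() -> Self {
        SharedWriter { buf: Mutex::new(Vec::new()), wakeups: Mutex::new(0) }
    }

    fn write(&self, bytes: &[u8]) {
        let mut buf = self.buf.lock().unwrap();
        let was_empty = buf.is_empty();
        buf.extend_from_slice(bytes);
        drop(buf);
        if was_empty {
            // Coalesced: at most one wake-up per burst.
            *self.wakeups.lock().unwrap() += 1;
        }
    }

    fn flush(&self) -> Vec<u8> {
        std::mem::take(&mut *self.buf.lock().unwrap())
    }
}
```

During a burst fan-out, thousands of writes collapse into a single wake-up of the writer task, which then flushes the accumulated buffer in one pass.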

Results on QoS 2 fan-out (100 pub, 2000 sub):
- Latency P50: 272ms → 122ms (-55%)
- CPU max: 245% → 186% (-24%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Box large Publish types in enum variants to reduce size differences:
  - ResendInfo::Full in connect.rs
  - RetryInfo::Full in qos.rs
  - InflightMessage::Full in session/mod.rs
  - OutboundMessage::Packet in broker/mod.rs
- Fix unnecessary mutable references in decoder.decode() calls
- Use struct initializer syntax instead of field reassignment
- Apply cargo fmt formatting fixes
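The boxing change can be illustrated with hypothetical types (field layout and sizes are made up for the sketch): an enum is as large as its largest variant, so boxing the big payload keeps every `InflightMessage` pointer-sized-ish regardless of which variant it holds.

```rust
// Illustrative only: a large in-memory PUBLISH representation.
struct BigPublish {
    topic: String,
    payload: Vec<u8>,
    properties: [u8; 128], // stand-in for MQTT 5 properties
}

enum InflightMessage {
    Cached(u16),           // small variant: just a packet id here
    Full(Box<BigPublish>), // boxed: the enum holds a pointer, not 170+ bytes
}
```

Without the `Box`, every `Cached` entry in a queue would still occupy the full `BigPublish` footprint.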

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>