infra(Tier 1): event dedup + per-provider latency metrics + docker host networking #110

@0xfandom

Context

The latency infrastructure audit (April 2026) identified three quick wins that add short-term operational visibility and remove a known, avoidable overhead. This issue bundles them into a single workstream.

Scope — Tier 1 wins

1. Event dedup + first_from{provider} metric

Currently three node providers subscribe to the same filter (config/nodes.yaml), but there is no cross-provider deduplication. vec.dedup() inside engine.rs only collapses address lists, not event identity. As a result, we cannot measure which provider delivers each event first.

  • New module: crates/ingestion/src/dedup.rs
  • Dedup key: (block_number, log_index) with TTL (e.g., 60 s)
  • New Prometheus counter: aether_event_first_from_total{provider="alchemy_ws"|"quicknode_ws"|"local_reth"}
  • Increment on first-arrival only
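
A minimal sketch of the first-arrival logic described above, using only the standard library. The struct layout, the `now` parameter, and the `first_from_total` helper are assumptions for illustration; only the `EventDedup::on_event` name, the `(block_number, log_index)` key, and the TTL come from this issue. The in-memory counter map stands in for the Prometheus counter, which would be incremented at the same point.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Dedup key from the issue: (block_number, log_index).
type EventKey = (u64, u64);

/// Hypothetical shape for the dedup module; field names are illustrative.
pub struct EventDedup {
    ttl: Duration,
    /// First-arrival time per event key, used for TTL expiry.
    seen: HashMap<EventKey, Instant>,
    /// Stand-in for aether_event_first_from_total{provider=...}.
    first_from: HashMap<String, u64>,
}

impl EventDedup {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, seen: HashMap::new(), first_from: HashMap::new() }
    }

    /// Returns true only for the first arrival of `key` within the TTL window.
    /// `now` is passed in (an assumption) so expiry can be tested without sleeping.
    pub fn on_event(&mut self, key: EventKey, provider: &str, now: Instant) -> bool {
        // Evict entries whose age has reached the TTL. This is a linear scan,
        // which is fine for a sketch but may need a smarter structure in production.
        let ttl = self.ttl;
        self.seen
            .retain(|_, first_seen| now.saturating_duration_since(*first_seen) < ttl);

        if self.seen.contains_key(&key) {
            return false; // duplicate from a slower provider
        }
        self.seen.insert(key, now);
        // Count the win for the provider that delivered the event first.
        *self.first_from.entry(provider.to_string()).or_insert(0) += 1;
        true
    }

    /// Number of events for which `provider` was the first to deliver.
    pub fn first_from_total(&self, provider: &str) -> u64 {
        self.first_from.get(provider).copied().unwrap_or(0)
    }
}
```

Calling `on_event` from each provider's receive path and forwarding the event downstream only when it returns `true` would give both deduplication and the per-provider win count in one place.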

2. Per-provider event-stream latency histogram

aether_detection_latency_ms (crates/grpc-server/src/metrics.rs:27-42) is unlabeled, so its aggregate tail latency hides provider-specific degradation.

  • New labeled histogram: aether_event_latency_ms{provider, event_type} with buckets [1, 5, 10, 25, 50, 100, 250, 500] ms
  • Measurement: block.timestamp → local receive time
  • Wire through crates/ingestion/src/node_pool.rs (latency_ms already tracked at lines 33, 59-63 but never exported)
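
The measurement and bucketing above can be sketched with the standard library alone. This assumes block.timestamp is a Unix timestamp in whole seconds and that the buckets are inclusive upper bounds (the usual Prometheus `le` convention); the function names `event_latency_ms` and `bucket_index` are hypothetical, and a real implementation would hand the value to the metrics library rather than compute the bucket itself.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Bucket upper bounds in milliseconds, as listed in this issue.
const BUCKETS_MS: [u64; 8] = [1, 5, 10, 25, 50, 100, 250, 500];

/// Milliseconds from the block timestamp (assumed: whole seconds since the
/// Unix epoch) to the local receive time. Returns 0 if the local clock is
/// behind the block timestamp, so clock skew never produces a negative value.
pub fn event_latency_ms(block_timestamp_secs: u64, received_at: SystemTime) -> u64 {
    let block_time = UNIX_EPOCH + Duration::from_secs(block_timestamp_secs);
    received_at
        .duration_since(block_time)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Index of the first bucket whose upper bound is >= `latency_ms`.
/// A result equal to `BUCKETS_MS.len()` means the value falls in the
/// overflow (+Inf) bucket.
pub fn bucket_index(latency_ms: u64) -> usize {
    BUCKETS_MS.partition_point(|&upper| upper < latency_ms)
}
```

Because the block timestamp carries only one-second resolution, values above the 500 ms bound all land in the overflow bucket in this sketch; whether the listed buckets cover the observed range is worth checking during the replay run.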

3. Docker production network mode

deploy/docker/docker-compose.yml uses the default bridge network, which adds ~100 µs per packet compared with host mode.

  • Add network_mode: host for the production compose variant (keep dev compose with bridge for port-mapping clarity)
  • Verify port bindings still resolve without a ports: block
  • Smoke test metrics + gRPC UDS after change

Acceptance criteria

  • /metrics exposes aether_event_first_from_total with all three provider labels populated during a replay run
  • /metrics exposes aether_event_latency_ms histogram with non-zero values per provider
  • Grafana panel added: "Winning provider share" (stacked by label, over 1h)
  • Grafana panel added: "Event latency p50/p95/p99 per provider"
  • docker compose -f deploy/docker/docker-compose.prod.yml up starts with host networking; services reachable
  • Unit tests for EventDedup::on_event covering first-arrival, duplicate, TTL expiry

Files touched (expected)

  • crates/ingestion/src/dedup.rs (new)
  • crates/ingestion/src/lib.rs (re-export)
  • crates/ingestion/src/subscription.rs (wire dedup into broadcast path)
  • crates/grpc-server/src/metrics.rs (new labeled histogram + counter)
  • crates/grpc-server/src/provider.rs (call-site for dedup + latency emit)
  • deploy/docker/docker-compose.prod.yml (network_mode)
  • deploy/grafana/dashboards/*.json (panel definitions)

Out of scope

  • Private mempool integration (deferred, under discussion)
  • Socket-level tuning (tracked in Tier 2 issue)
