Consolidate develop → main: V3 virtual-reserves fix + mempool/deploy line + decoder/cache + observability#185
Merged
Conversation
Adds PgLedger backed by sqlx::PgPool with fire-and-forget tokio spawns on the engine hot path. Wires insert_arb on every ARB PUBLISHED and insert_pool on every register_pool. Bootstrap from DATABASE_URL falls back to NoopLedger when unset.
Replaces fire-and-forget tokio::spawn per write with a single dedicated
writer task draining a bounded mpsc::channel(1024). Every Ledger trait
method becomes a non-awaiting try_send; saturation drops the row and
bumps a metric instead of fanning out unbounded background tasks while
Postgres is slow.
Adds LedgerMetrics, registered on the engine's existing prometheus
Registry via the new EngineMetrics::registry() accessor so a single
/metrics endpoint emits both engine and ledger families:
aether_ledger_writes_total{op, result} Counter
aether_ledger_drops_total{op} Counter
aether_ledger_queue_depth Gauge
aether_ledger_write_latency_ms{op} Histogram
The writer task observes per-op latency from dequeue to query
completion, decrements queue_depth on dequeue, and tags every write
ok/err. A failing write logs and drops — never propagates to the
engine.
Channel size sized for ~3 s of bursty inserts at 200 arbs/s peak, so
saturation is the alert signal that Postgres is the bottleneck. The
sqlx pool stays bounded at 8 connections, so even if every slot stalls
the channel still drops cleanly rather than blocking.
Refs PR #115 follow-up; PR-1 of the 3-PR plan to close issue #95.
…relation Three reviewer-blockers from PR #119 round 1: 1. arbs.protocols was bound via format!("{:?}", h.protocol) while pool_registry.protocol used the pinned protocol_label. Renaming a ProtocolType variant would silently rewrite the JSONB array but not the TEXT column, breaking joins. protocol_label is now pub and used on both writers — single source of truth for on-disk protocol names. 2. Clock-authority policy. The migration declares event-time columns (arbs.ts, inclusion_results.resolved_at) CLIENT-SET; the previous PgLedger queries omitted both bindings, so the schema's DEFAULT now() silently fired and back-tests skewed by the dequeue lag. Adds ts: DateTime<Utc> to NewArb and resolved_at: DateTime<Utc> to InclusionUpdate, binds both, and updates the inclusion upsert branch to set resolved_at = EXCLUDED.resolved_at instead of now(). 3. arb_id was Uuid::new_v4() per row, so log lines (which print ArbOpportunity::id) had no join key into the trade ledger. Derives arb_id deterministically from opp.id via UUIDv5 against a stable namespace, and emits arb_id as a structured field on the ARB PUBLISHED log line so grep <id> logs/* | psql works. Adds chrono to aether-grpc-server deps and the v5 feature to the uuid workspace dep. Refs PR #119 review.
…e + dead-code doc Forward-looking items from PR #119 review: - Writer was sequential on a pool of 8 connections — every conn but one sat idle while INSERTs serialised on the dispatcher's await. Adds a Semaphore(LEDGER_MAX_INFLIGHT=8) gate around tokio::spawn fan-out so the pool actually runs at capacity. Pool size and inflight cap are named constants tuned in lockstep. - Channel-capacity comment math was wrong (1024 / 200 ≈ 5.12 s, not ~3 s). Also surfaces the constants in the connect log line so operators can see the tuning at startup. - Connect acquire_timeout 5 s → 2 s. Misconfigured DATABASE_URL should fail fast and degrade to NoopLedger via ledger_from_env, not stall engine boot. Logs already cover the reason. - u256_to_decimal: replace .unwrap_or(BigDecimal::from(0)) with .expect. U256::to_string is a base-10 digit sequence which BigDecimal::from_str accepts by definition; a failure here means the alloy/bigdecimal contract changed. Failing loudly beats a silent zero in arb economics. - Documents update_inclusion as engine-side dead code (Go owns inclusion writes today) so future readers do not look for a Rust caller. Refs PR #119 review.
PR-2 of the trade-ledger plan. Mirrors the Rust PgLedger surface so a
unified /metrics scrape across both binaries surfaces a single set of
aether_ledger_* families, and brings 'rows on a fork run' from
half-functional (Rust side only) to end-to-end.
internal/db/ledger_pg.go — new
- PgLedger on pgxpool. Hot path is non-blocking try-send onto a bounded
chan (cap 1024), drained by a dispatcher goroutine that fans out via
a counting semaphore (limit 8) so the pgx pool runs at capacity
without acquire-queueing. Saturation drops the row and bumps
aether_ledger_drops_total{op}; never blocks the executor.
- 2 s connect timeout so a misconfigured DATABASE_URL fails fast and
LedgerFromEnv falls back to NoopLedger instead of stalling boot.
- Idempotent INSERTs for bundles + ON CONFLICT upsert for inclusion +
pnl_daily delta accumulation (multiple writers can contribute to the
same day without lost updates; NUMERIC(78,0) preserves U256 economics
losslessly).
internal/db/metrics.go — new
- LedgerMetrics over the default Prometheus registry. Names mirror the
Rust LedgerMetrics::register surface exactly:
aether_ledger_writes_total{op, result}
aether_ledger_drops_total{op}
aether_ledger_queue_depth
aether_ledger_write_latency_ms{op}
internal/db/ledger.go — extended
- ArbIDNamespace + BundleIDNamespace constants. ArbIDNamespace is
byte-identical to the Rust ARB_ID_NAMESPACE in
crates/grpc-server/src/engine.rs so log↔DB join keys are symmetric
across the gRPC boundary; a TestArbIDNamespaceMatchesRust pin guards
drift.
- ArbIDFromOppID(string) -> uuid.UUID and BundleIDFor(uuid, block)
-> uuid.UUID for deterministic ids. Same (arb, block) pair always
produces the same bundle_id so resubmission ON CONFLICT is naturally
idempotent.
cmd/executor/main.go — wired
- LedgerFromEnv constructed once at startup, deferred Close() flushes
in-flight writes on shutdown.
- processArb gains a db.Ledger param. After bundle build, derives
arb_db_id + bundle_id deterministically and emits both as structured
fields on the submission log line so operators can grep id -> psql
WHERE arb_id = ... straight from logs.
- Shadow path persists the bundle row (IsShadow=true, no builders).
- Live path persists bundle + per-builder inclusion rows. Critical
semantic note: Included on these rows reflects builder *acceptance*
for next-block inclusion, not on-chain inclusion — the future
GetBundleStats poll loop UPSERTs the same (bundle_id, builder) row
with included_block + landed_tx_hash populated.
- pnl_daily inline roll-up: bundle_count bumps every submit; on
acceptance realized_profit_wei + inclusion_count accumulate. Adds
~30 LOC to keep the table populated during fork runs without a
separate cron.
Tests
- Existing 14 processArb / 2 consumeArbStream call sites updated to
pass db.NewNoopLedger().
- New deterministic-id tests pin the ArbIDNamespace and
BundleIDFor / ArbIDFromOppID semantics.
- go test ./... all green.
go.mod — adds github.com/jackc/pgx/v5/pgxpool.
bundles.builders is JSONB (per migration 0001) but the previous bind sent a Go []string straight through, which pgx maps to Postgres text[] — every live insert would fail with 'column "builders" is of type jsonb but expression is of type text[]'. Self-review caught it before any user-facing run hit the path. Marshal []string → []byte JSON via encoding/json and bind with an explicit $8::jsonb cast so the wire format matches the column type regardless of nil / empty Builders slices. Refs PR #122 self-review (CRIT 1).
The Rust engine and Go executor each own one half of the trade-ledger
write surface and both run fire-and-forget through their own bounded
mpsc → writer task. There is no cross-process ordering between Rust's
'ARB PUBLISHED → insert_arb' and Go's 'bundle signed → insert_bundle';
under load the Go bundle insert lands first and the immediate FK check
fails. Self-review flagged this as CRIT 2 — it would surface as a
steady stream of bundle drops on every busy block and mask real ledger
health.
Drops bundles_arb_id_fkey via an idempotent migration so both writers
can race freely. Trade-off: a transient Rust connection blip can
produce an orphan bundle row; that is already metered as
aether_ledger_writes_total{op="insert_arb",result="err"}, and
downstream LEFT JOIN queries treat the NULL arb side as informative.
Comment in the migration spells out the future path (coordinator or
reconciliation worker) for re-adding the FK once cross-process
ordering exists.
Refs PR #122 self-review (CRIT 2).
Two issues caught in self-review: 1. Close() called wg.Wait() unconditionally. A wedged Postgres would hang executor shutdown forever. Now caps the drain at ledgerCloseDrainTimeout (5s), and on timeout cancels the writer dispatcher's context so any in-flight queries abort cleanly before pool.Close() runs. 2. dispatch() used ctx for both the dispatcher loop and per-op runs, shared with the caller's ctx. Caller cancellation would also kill in-flight queries, defeating Close()'s drain. Splits the dispatcher onto an independent context, owned by PgLedger and only cancelled from Close() on timeout. 3. Reorganises dispatch() so defer inflight.Wait() runs on every exit path (channel close, ctx cancel, future error branches). Previously a ctx-cancel return path skipped the wait and left dangling writer goroutines holding pool connections after the dispatcher had already reported wg.Done(). Refs PR #122 self-review (CRIT 3, MED 1).
… loop `inclusion_results.included` is the on-chain outcome the schema's `WHERE included` partial index expects, not the JSON-RPC ACK from a builder. The previous wiring set Included=r.Success on the submit-time row, which would make dashboards think every accepted bundle landed — silently inflating inclusion-rate metrics by an order of magnitude. Submit-time row now writes Included=false unconditionally with the builder's error string preserved on failure. The future GetBundleStats poll loop UPSERTs the same (bundle_id, builder) row with the real on-chain truth (Included=true, IncludedBlock, LandedTxHash) once the target block lands. Dashboards can distinguish "builder rejected" (Error != NULL, Included=false) from "never landed" (Error=NULL, Included=false) on day one. Drops the included-branch from the pnl_daily inline roll-up for the same reason: realized_profit_wei + inclusion_count are deferred to the poll loop. bundle_count still bumps every submit so the table stays useful for "how many bundles did we attempt" queries during fork runs. Refs PR #122 self-review (HIGH 2).
…pace doc Bundle of small self-review nits: - enqueue() flips Inc/send order so the gauge can't briefly read negative when the dispatcher's Dec() lands between a slow enqueuer's send and its Inc(). Failed sends now Revert via Dec(), keeping the gauge consistent with actual channel depth. - write_latency_ms histogram adds 0.1 / 0.25 ms buckets so local- Postgres inserts (~150-300 µs) show up at p50 instead of being flattened into the 0.5 ms bucket. - gasSpentApprox switches from big.Float intermediate to direct big.Int arithmetic so the value round-trips into NUMERIC(78,0) without precision drift across cumulative pnl_daily updates. - BundleIDNamespace gains a doc note explaining why no parallel Rust pin test exists today (Rust does not write bundles), and what to add if a future chain-backfill reconciliation worker on the Rust side starts producing bundle ids. Refs PR #122 self-review (MED 2, MED 4, NIT histogram, NIT gas).
…ync builders WS-1.6 / WS-1.7 partial — implements the config-only half of #71. Pool registry: 3 → 76 curated high-TVL Ethereum mainnet pools tiered hot/warm/cold across all 6 supported protocols (UniV2, UniV3, Sushi, Curve, Balancer V2, Bancor V3). Auto-population to the full 5,000+ target tracked separately in #120 (discovery pipeline + per-builder Prometheus metrics). Builders: add Eden + Rsync to builders.yaml using the existing api_key auth pattern. Default to auth_type "none" so executor startup stays clean for default deployments — operators flip to api_key + set EDEN_API_KEY / RSYNC_API_KEY in .env to activate. Tests: relax engine bootstrap_pools assertion from exact-3 to >=3 (anchor USDC/WETH pools) so the test stays robust as pools.toml grows. Update internal/config builders test to expect the four configured builders. Refs: #71, #120
First step toward the V3 / Curve / Balancer mempool post-state simulator tracked under issue #124. Adds a pure analytical predictor on UniswapV3Pool that returns: V3PostState { new_sqrt_price_x96 // pool price after victim swap, Q96 units new_liquidity // unchanged for single-tick swaps amount_out // victim's output amount, post-fee single_tick // true here; cross-tick math is a follow-up } The math reuses the same constant-liquidity formula as the existing `compute_swap_within_tick` (which only returned amount_out and discarded the new sqrt_price). The new method returns both, so the mempool post-state simulator can update its graph-edge cache without an extra RPC round-trip. Single-tick assumption is hard-coded for now — `single_tick = true` always. The follow-up commit detects when a swap would cross an initialised tick boundary and clears the flag so the caller can fall back to an EVM fork-replay. Tests: - predict_post_state_none_for_zero_amount - predict_post_state_none_for_unknown_token - predict_post_state_none_for_uninitialised_pool - predict_post_state_token0_to_token1_lowers_sqrt_price - predict_post_state_token1_to_token0_raises_sqrt_price - predict_post_state_amount_out_matches_compute_swap_within_tick (parity guard against the legacy code path) Net: +207 lines, no behaviour change to existing callers.
Wire `single_tick` to a real check instead of the hard-coded `true` from
the previous commit. The post-swap `sqrt_price_x96` is projected to a
tick index via `sqrt_price_x96_to_tick` (f64-precision; one tick is a
~1 bp multiplicative step which f64's ~15-digit mantissa resolves with
margin). The result is compared to the active tick bucket
`[bucket_low, bucket_low + tick_spacing)` aligned with `i32::div_euclid`
so negative ticks behave correctly.
When the post-swap tick lands outside that bucket, `single_tick` is
flipped to `false`. The caller is then expected to fall back to an EVM
fork-replay (next commit set) rather than trust the analytical values —
which is also why the function still returns the cross-tick estimate
instead of `None`: a low-precision answer is more useful to the
fallback path than no answer at all.
Tests:
- predict_post_state_large_swap_crosses_tick_bucket single_tick=false
- predict_post_state_token0_to_token1_lowers_sqrt_price (small swap, single_tick=true)
- predict_post_state_token1_to_token0_raises_sqrt_price (small swap, single_tick=true)
- sqrt_price_x96_to_tick_at_unity_is_zero
- sqrt_price_x96_to_tick_handles_zero_input
- new fixture `setup_v3_pool_mid_bucket` seats sqrt_price halfway
through the active tick bucket so a small swap stays inside and a
large one crosses out (sitting on a tick boundary trips the check
on any direction of price movement, which is mathematically correct
but useless for testing the bucket logic itself)
The legacy `setup_v3_pool` fixture (sqrt_price corresponding to
~tick 199985 but `tick: 0` field) is kept as-is so the older
get_amount_out tests do not need to move; the
`predict_post_state_amount_out_matches_compute_swap_within_tick` parity
guard uses it deliberately to prove the math is identical to the
legacy single-tick path regardless of bucket alignment.
Net: +96 lines on top of commit 36160ec, no public API change.
Reuses the existing on-chain-faithful Newton iteration for D / get_y
(no math changes there) and exposes the post-swap balances + output
amount as a struct the mempool post-state simulator can feed into the
graph-edge cache.
CurvePostState {
i, j // input / output token indices
new_balance_in // = balances[i] + amount_in
new_balance_out // = balances[j] - amount_out (fee retained)
amount_out // user-visible payout, post-fee
analytical // mirrors V3 single_tick — true on convergence
}
Pinned at 2-coin pools to match the existing get_amount_out semantics;
3-coin / tricrypto / metapool variants return `None` so the caller can
fall through to the existing `protocol_unsupported` skip path until a
proper N-coin predictor lands.
Tests:
- predict_post_state_none_for_zero_amount
- predict_post_state_none_for_unknown_token
- predict_post_state_none_for_uninitialised_pool
- predict_post_state_balances_shift_correctly (direction sanity)
- predict_post_state_amount_out_matches_get_amount_out (parity guard)
- predict_post_state_reverse_direction
- predict_post_state_3coin_pool_unsupported
Net: +118 lines, no behaviour change to existing callers. The fee-
retention model (pool keeps the fee, so new_balance_out > y_new) is
documented inline since it is non-obvious and will surface on the
graph-edge update side once the cache lands.
Wraps the existing weighted-product math from `get_amount_out` and
exposes the post-swap balances + output amount as a struct the mempool
post-state simulator can feed into the graph-edge cache.
BalancerPostState {
new_balance0 // = balance0 ± input/output
new_balance1 // = balance1 ∓ output/input
amount_out // user payout, post-fee
analytical // true on equal-weight (exact),
// false on unequal-weight (Taylor approx)
}
The `analytical` flag mirrors V3's `single_tick`: equal-weight (50/50)
pools use the exact constant-product formula and pass `true`; unequal-
weight pools (80/20 etc.) use a first-order Taylor approximation in the
existing `get_amount_out` and pass `false` so the caller can fall back
to an EVM fork-replay rather than trust the approximation under heavy
weight skew.
Tests:
- predict_post_state_none_for_zero_amount
- predict_post_state_none_for_unknown_token
- predict_post_state_none_for_uninitialised_pool
- predict_post_state_balances_shift_correctly_equal_weight
- predict_post_state_unequal_weight_signals_approximation
- predict_post_state_amount_out_matches_get_amount_out
- predict_post_state_reverse_direction
Net: +118 lines, no behaviour change to existing callers. Composable /
stable / weighted-N-token (N>2) pools fall through to `None` and remain
out of scope for this issue.
Introduces the live pool-state cache the V3 / Curve / Balancer mempool
post-state simulator needs to read accurate per-protocol state.
Population on bootstrap + refresh on PoolEvent lands in the next
commit; this commit limits scope to the type + the empty engine field
+ the accessor so the diff is reviewable in isolation.
New types in `crates/pools/src/lib.rs`:
pub enum PoolState {
UniswapV2(UniswapV2Pool),
UniswapV3(UniswapV3Pool),
SushiSwap(UniswapV2Pool),
Curve(CurvePool),
Balancer(BalancerPool),
}
pub type PoolStateCache = Arc<DashMap<Address, Arc<PoolState>>>;
pub fn new_pool_state_cache() -> PoolStateCache;
`AetherEngine` gains a `pool_states: PoolStateCache` field and a
`pool_states(&self) -> &PoolStateCache` accessor. Marked `dead_code`
until the mempool decode pipeline (currently on PR #118) wires
SimContext to read from it, or the parity-test commit on this branch
exercises it.
The cache wraps each entry in an outer `Arc` so readers can clone a
snapshot cheaply while the engine writes new entries in place. DashMap
shards keys, so writers updating pool A don't block readers reading
pool B — important once 1000+ pools are in flight on the hot path.
Distinct from `pool_registry` (static metadata only); this cache owns
the *mutable* protocol state required by `predict_post_state` on each
pool family.
Tests:
- pool_state_cache_starts_empty
- pool_state_cache_round_trips_v3
- pool_state_protocol_dispatch_covers_every_variant
Net: +155 lines, no behaviour change in the production hot path —
nothing reads from the cache yet.
Wires the V2-family branches of bootstrap and event handling into the PoolStateCache introduced in 961ae56. Bootstrap-time `getReserves()` results and live `PoolEvent::ReserveUpdate` events now both upsert a `PoolState::UniswapV2` (or `PoolState::SushiSwap`) entry in the cache alongside the existing graph-edge update. Both writes share a small `build_v2_pool_state` helper on `AetherEngine` so the protocol → variant mapping is in one place. The helper takes a `&PoolMetadata` reference and constructs a fresh `UniswapV2Pool` with the configured fee + tokens, then calls `update_state(reserve0, reserve1)` to seed the live reserves. Cache writes are idempotent: a repeated update for the same pool replaces the prior `Arc<PoolState>` rather than accumulating, so readers always see the latest reserves with no stale fan-out. V3 / Curve / Balancer cache population stays out of this commit: - V3 needs a parallel `liquidity()` RPC call that is not currently part of `fetch_initial_reserves` (slot0 alone returns sqrt_price + tick + observability fields, no liquidity); landing that needs a second concurrent RPC fan-out and is its own commit. - Curve needs `A()` + per-coin `balances(i)` calls and an ABI that does not yet exist in the engine. - Balancer needs Vault-side `getPoolTokens()` + `getNormalizedWeights()` plus a separate ABI surface. Tests: - test_v2_pool_state_cache_populated_on_reserve_update — registers a real WETH/USDC V2 pool, dispatches a ReserveUpdate, asserts the cache entry's reserves match the event payload exactly. - test_sushiswap_pool_state_cache_uses_sushiswap_variant — same shape against a Sushi pool, asserts the variant routing is correct (we intentionally distinguish Sushi from V2 in the cache so the dispatcher does not need a second protocol lookup). Net: +97 lines on engine.rs, no behaviour change for existing graph-edge consumers — cache writes are additive.
Wires the V3 branches of bootstrap and event handling into the
PoolStateCache. After this commit V3 pools land in the cache with
sqrt_price_x96 + tick at bootstrap and the full triple
(sqrt_price + liquidity + tick) on every PoolEvent::V3Update.
Two structural changes feed the V3 path:
1. PoolMetadata grows a `tick_spacing: Option<i32>` field so the
cache can construct a valid `UniswapV3Pool` (which needs
tick_spacing to know where the active tick bucket ends). Non-V3
pools store `None`; the field is ignored on the graph-edge update
path that does not care about ticks. A new
`register_pool_with_tick_spacing` constructor accepts the value;
the original `register_pool` delegates with `None` so existing
callers (PoolEvent::PoolCreated auto-register, tests) compile
unchanged.
2. `fetch_initial_reserves` decodes the int24 tick from slot0 bytes
32..64 alongside the existing sqrt_price_x96 read. The decode is
sign-extended manually rather than via `slot0Call::abi_decode_returns`
to keep the existing concurrent-RPC fan-out shape; the result is
threaded through `ReserveResult::V3` to the cache writer.
Bootstrap V3 cache entries are seeded with `liquidity = 0` because
slot0 alone does not expose liquidity — a separate `liquidity()` RPC
would require a second concurrent fan-out and is deferred. Real
liquidity arrives via `PoolEvent::V3Update` once block-level event
ingestion is online; until then `predict_post_state` on a
zero-liquidity entry correctly returns `None`.
Tests:
- test_v3_pool_state_cache_populated_on_v3_update — registers a
real USDC/WETH 0.05% V3 pool, dispatches a V3Update event with
sqrt_price = Q96 + a non-zero liquidity, asserts the cached
UniswapV3Pool has matching sqrt + liquidity + tick + tick_spacing.
- existing test_pool_metadata_fee_factor updated for the new
`tick_spacing: None` field.
Net: +141 / -7 on engine.rs. No behaviour change on the existing
graph-edge path; cache writes are additive. Curve / Balancer cache
population needs new ABI surface (Curve.A() + balances(i),
BalancerV2 Vault.getPoolTokens + getNormalizedWeights) and lands in
follow-up commits.
Extends the bootstrap RPC fan-out with Curve + Balancer V2 branches so
the post-state cache covers all four protocols the predictor side
already supports. After this commit, `fetch_initial_reserves` writes
`PoolState::Curve` and `PoolState::Balancer` entries for every
2-coin / 2-token pool registered in `config/pools.toml`.
Curve:
- 3 sequential RPCs per pool: A() → balances(0) → balances(1)
- 2-coin pinned (matches the existing `predict_post_state` scope)
- amplification cast U256 → u64 with saturation; on-chain values
fit in u64 today (Curve pool A is typically a few hundred to a
few thousand)
Balancer V2:
- 3 sequential RPCs per pool:
pool.getPoolId() → bytes32 pool id
vault.getPoolTokens(poolId) → (tokens, balances, _)
pool.getNormalizedWeights() → uint256[] (e18 fixed)
- Vault address hard-coded to the canonical
`0xBA12222222228d8Ba445958a75a0704d566BF2C8`
- 2-token pinned (`predict_post_state` only handles equal-weight
and unequal-weight 2-token shapes today)
- weights cast U256 → u64 with saturation; weights below 1e18 fit
in u64 fine
Both branches drop the pool with a `Skipped` result on any RPC or
decode failure so a single bad pool never tanks the whole bootstrap
fan-out. Graph-edge population for Curve / Balancer stays out of
this commit — the existing graph code path doesn't have Curve /
Balancer edge construction yet, and threading those onto the cache
write would scope-creep this PR. The cache entry alone is what the
mempool post-state simulator needs to call `predict_post_state`.
Tests:
- test_curve_pool_state_cache_can_be_populated_directly
- test_balancer_pool_state_cache_can_be_populated_directly
These exercise the cache shape (variant + state fields round-trip)
without driving the RPC fan-out, which would require an Anvil mock.
End-to-end RPC parity is covered by the live mainnet shadow run that
ships separately.
Net: +180 / -2 on engine.rs.
…metric
Closes the analytical-vs-EVM dispatch loop the V3 / Curve / Balancer
predictors set up via their `single_tick` / `analytical` confidence
flags. Two pieces:
1. `predict_post_state_with_fallback(state, token_in, amount_in, on_fallback)`
in `aether-pools` crate. Dispatches over `PoolState`, runs the
typed analytical predictor for the variant, and:
- returns `Some(UnifiedPostState::*)` when confidence is high
- returns `None` and invokes `on_fallback(reason)` when confidence
is low — the caller is expected to wire `on_fallback` to a
metric bump and (eventually) to a real EVM fork-replay path
The function takes the fallback callback as a closure rather than
pulling in `aether_grpc_server::EngineMetrics` directly, so the
pools crate stays free of engine deps.
New unified return type `UnifiedPostState` wraps `V3PostState`,
`CurvePostState`, `BalancerPostState` so callers (mempool decode
pipeline, future analytics) have one match to write graph-edge
updates against. Pool-family-specific fields stay on their own
variants; conflating V3 sqrt_price into reserve0/reserve1 would
lose precision the post-state graph update needs.
2. `aether_sim_evm_fallback_total{reason}` counter in `EngineMetrics`
with stable label set documented inline:
- v3_tick_crossed
- curve_unconverged
- balancer_unequal_weight
- unknown_protocol
Plus `inc_sim_evm_fallback(reason)` for callers and
`sim_evm_fallback_count(reason)` for tests + Grafana queries.
A real EVM fork-replay implementation (calling `EvmSimulator` with
the victim's swap pre-applied and reading the affected pool's
post-state) is its own substantial feature and is not part of this
commit. The fallback path here is a documented stub: it bumps the
metric so dashboards show how often the analytical predictor is
insufficient, then returns `None` so the caller skips the candidate
rather than acting on stale state.
V2 / Sushi pools route through `unknown_protocol` because the V2
analytical predictor lives in the mempool decode pipeline (separate
branch, PR #118) and is trivially exact — wrapping it through this
fallback path would add no value.
Tests:
- predict_with_fallback_v3_returns_state_for_single_tick
- predict_with_fallback_v3_escalates_on_tick_cross
- predict_with_fallback_curve_returns_state_for_converged_pool
- predict_with_fallback_balancer_unequal_weight_escalates
- predict_with_fallback_v2_routes_to_unknown_protocol
Net: +271 lines across pools + grpc-server. No public API removed,
no behaviour change in the existing graph-edge or detection path.
The revm-backed fork backend issues one-shot `eth_getStorageAt` / `eth_getBalance` requests via AlloyDB's HTTP transport. `ETH_RPC_URL`, however, is shared with the streaming subscription path (newHeads, logs, pending tx) which requires a `wss://` / `ws://` scheme for the persistent connection. Without normalisation, `connect_http` on a wss URL fails every detected cycle with `Transport error: builder error for url (wss://...)` and floods the log on each block. Mempool tracking (analytical predictors) is unaffected, but the regular block-by-block arb validation path is silently broken. Fix: add `normalize_to_http_scheme` which rewrites `wss://` -> `https://` and `ws://` -> `http://` literally, leaving host / path / query intact. Apply at the boundary between `config.rpc_url` and the `ProviderBuilder::connect_http` call. Major providers (Alchemy, Infura, QuickNode) expose both transports on the same hostname so the rewritten URL routes to the correct endpoint. Operators keep a single env var. Unknown schemes (`ipc://`, `file://`, etc.) pass through unchanged so downstream parser surfaces a clear error rather than a silent miscast. Also dedents `sim_evm_fallback_total` doc-list continuation lines from 24 spaces to 4 to satisfy `-D clippy::doc_overindented_list_items` -- required to keep `cargo clippy --workspace -- -D warnings` green on rust 1.94. Tests: 6 unit tests covering wss, ws, https, http, unknown-scheme, and query-string preservation. 488 workspace tests, clippy clean.
Adds a gating layer between Bellman-Ford cycle detection and EVM simulation. Catches corruption signatures (stale graph edges with zero reserves, repeated identical profit_factor across many sibling cycles, f64-overflow profit values) that the existing detector + simulator silently passed through, polluting the candidate stream and wasting 100-500 ms of revm fork sim per bogus cycle. ## Background A 1h 20m live mainnet capture on PR #118's branch surfaced cycles reporting `expected_net_eth = 21,539,349,978.775558` ETH — 21.5 billion ETH per cycle, the same value repeated across 444 cycles within a single block. Root cause: at least one pool in the registry had stale reserves on its price-graph edge; Bellman-Ford walked that edge through many path permutations and every cycle inherited the same astronomical implied rate. revm sim then ran ~37 k attempts in 80 minutes against these poisoned candidates, each call costing one or more RPC reads to Alchemy and pushing the provider into rate-limit territory. The detector was correct; the graph snapshot was wrong; the existing pipeline had no defense between the two. ## What ships `crates/grpc-server/src/cycle_gating.rs` (new) — five gates, each keyed to a corruption signature observed in real or plausible runs, plus one soft-warn band for operator audit: 1. **TVL** — every edge in the cycle must have `reserve_in >= GatingConfig::min_reserve_f64` (default 1e6 wei). Empty pools cannot produce real arbs regardless of the rate the graph thinks they have. 2. **Multi-cycle fingerprint** — `profit_factor` values quantised to 1e-6 and bucketed. When >= `fingerprint_min_cluster` cycles land in the same bucket (default 5), the cluster is dropped — the signature of a single corrupt edge feeding many paths. 3. **Hard sanity cap** — `profit_factor > 100.0` (10000%) or non-finite (NaN, +/-inf) is dropped immediately. UST's peak depeg was 90%, so this band is unequivocally math broken. 4. **Soft warn** — `profit_factor > 0.5` logs a `warn` with cycle details for operator audit; never drops on its own. Surfaces suspicious but possibly real depeg / launch events. 5. **Post-sim revm cross-check** — after fork simulation, compare `sim_result.profit_wei` against the detector's `expected_net_wei`. If the two differ by more than `revm_profit_mismatch_threshold` (default 0.5 fractional), trust revm and drop the candidate. Catches the residual case where the local graph was stale at detection time but the pre- sim gates could not see it. Each drop bumps `aether_cycle_gate_dropped_total{reason="..."}` with a label drawn from a fixed set of four (`profit_factor_impossible`, `reserves_too_low`, `fingerprint_cluster`, `revm_contradicts`), so dashboards stay enumerable without label churn. ## Engine integration `AetherEngine` cycle loop (`engine.rs` ~line 1325) builds the fingerprint index once per detection pass via `build_fingerprint_index` (O(N) batch) so the per-cycle gate is O(1) — without this the multi-cycle gate would be O(N) per cycle, O(N^2) across the batch, blowing the 3 ms detection budget on dense graphs. Post-sim gate fires immediately after `if !sim_result.success` so it costs at most one HashMap probe + one f64 ratio per simulated cycle. `EngineConfig` gains a `gating: GatingConfig` field with strict production defaults; tests that exercise the detection cycle with synthetic graphs (`add_edge` without seeded reserves) override with `GatingConfig::permissive()` so they assert detection behaviour rather than gating behaviour. Four existing test fixtures updated accordingly; one test `test_engine_custom_config` left untouched since it inspects `min_profit_threshold_wei` only. ## Tests `crates/grpc-server/src/cycle_gating.rs` ships 12 unit tests covering: - `fingerprint_bucket` quantisation correctness, including NaN / +inf / -inf collapsing into a single bucket (so the fingerprint gate also catches the f64-overflow signature even when the hard cap somehow does not). - `build_fingerprint_index` skipping unprofitable cycles. - Pre-sim gate behaviour for impossible profit, NaN profit, low reserves, fingerprint cluster, and normal arbs. - Post-sim gate behaviour for matching profits, zero actual vs nonzero expected, large mismatch (the 21B vs 1e15 case), and the `expected == 0` no-op pass. 488 workspace tests pass. `cargo clippy --workspace -- -D warnings` clean. ## Out of scope - Edge freshness tracking (gate 2 of the original design doc): requires adding `last_update_block: u64` to `PriceEdge` and propagating block numbers through ~30 call sites in `engine.rs` + tests. Filed for a separate PR. - Cross-pool agreement check: needs a new pool-registry iteration helper. Filed for a separate PR. - Configurable thresholds via env / config file: defaults are calibrated for the blue-chip pool registry shipped in `config/pools.toml`. Operators monitoring long-tail or freshly-launched pools may need to relax `min_reserve_f64` and `profit_factor_impossible` once we expose the knobs. ## Verification plan After merge, re-run the 10-min capture script from PR #118's branch (`scripts/mempool_capture.sh`). Expect: - `aether_cycle_gate_dropped_total{reason="reserves_too_low"}` or `{reason="fingerprint_cluster"}` ticking up where the 21B-ETH cycles were previously surviving. - `aether_pending_arb_candidates_total` numbers landing in the realistic 1-200 bps profit band instead of the `gt_200bps" >> 7000` we saw. - `aether_simulations_run_total` rate dropping proportionally, with the saved RPC budget freed for the cycles that survive gating.
…t-state sim + V3/Balancer wiring ## Summary - Lands [E5] live mempool tracking — Phase A → E end to end (#117). Pipeline subscribes to Alchemy `alchemy_pendingTransactions`, decodes router calldata, runs **analytical V2 post-state simulation** through Bellman-Ford, and emits Prometheus metrics. Strictly log-only — no executor wiring, no bundle build, no submission. - `MEMPOOL_TRACKING=1` gates the entire path; binaries boot identically when unset. - Co-runs Flashbots MEV-Share SSE consumer in the Go monitor; counts hint volume and calldata/log enrichment for cross-source coverage analysis. - Validated on live mainnet: 13 V2 swaps decoded, 2 ran sim end-to-end (`no_profitable_cycle` — expected on the 3-pool WETH-USDC registry), 4899 MEV-Share hints, 0 errors, p99 detection latency `< 0.5 ms`. ## Phase breakdown | Phase | Crate / file | What | |---|---|---| | **A** — Pending-tx ingestion | `crates/ingestion/src/mempool.rs` | `MempoolSource` trait + `AlchemyMempool` impl. WS subscribe with `toAddress` filter on 6 routers (UniV2, UniV3 SwapRouter / SwapRouter02, Sushi, Curve registry, Balancer Vault). Per-source dedup, lock-free broadcast channel. Reuses `node_pool.rs` reconnect FSM. | | **B** — Calldata decoder | `crates/pools/src/router_decoder.rs` | `alloy::sol!` ABIs for V2 router (`swapExactTokensForTokens` + 5 fee-on-transfer variants), V3 single-hop (`exactInputSingle`, `exactOutputSingle`). Decodes to `(token_in, token_out, amount_in)`. Long-tail unsupported selectors emit `aether_pending_decode_errors_total{reason}`. | | **C-lite** | `crates/grpc-server/src/mempool_pipeline.rs` | Decode loop, metric emission, structured logs. | | **C-full (Path A — analytical V2)** | same file | Full post-state cycle scan without revm. Predict V2 reserves with constant-product math (`dy = dx*0.997*y / (x + dx*0.997)`, `x' = x + dx`, `y' = y - dy`). Clone graph, mutate the 2 affected edges, run Bellman-Ford `detect_from_affected`, bucket profit (`<10bps / 10-50 / 50-200 / >200bps`), emit `aether_pending_arb_candidates_total{router, profit_bucket}`. Skip reasons: `protocol_unsupported`, `token_*_unknown`, `pool_not_registered`, `graph_edge_missing`, `reserves_zero`, `no_profitable_cycle`. Path B (full revm fork sim for V3/Curve/Balancer victims) is a follow-up issue — analytical covers V2/Sushi which dominate retail mempool volume at ~50 µs per tx vs 5–20 ms for revm. | | **D** — MEV-Share consumer | `cmd/monitor/mev_share.go` | SSE stream from `https://mev-share.flashbots.net`. Decodes hints (`tx_hash`, optional calldata, optional logs). Emits `aether_mev_share_hints_total`, `_with_calldata_total`, `_with_logs_total`, `_errors_total`. | | **E** — Metrics + Grafana | `deploy/docker/grafana/dashboards/mempool.json` | 9-panel dashboard: pending DEX rate (1m), decode success %, hints/s, hints-with-logs %, decode failures by reason, arb candidates rate, arb candidates by profit bucket, sim skipped by reason, MEV-Share stream health. | ## Files Changed (Phase C-full slice — `e227e73`) | File | Change | |---|---| | `crates/grpc-server/src/main.rs` | Build `SimContext` (pool registry + token index + snapshot manager + BF detector), pass to `spawn_mempool_pipeline`. Logs `sim=true` at startup when registry is wired. | | `crates/grpc-server/src/mempool_pipeline.rs` | New `SimContext` struct, `try_post_state_scan` function, `predict_v2_post_state` math helper, `profit_bucket` mapper. 7 unit tests covering the math edges (zero reserves, zero input, fee_factor zero, normal V2 swap, profit bucket boundaries). | | `crates/grpc-server/src/metrics.rs` | `aether_pending_arb_candidates_total{router, profit_bucket}` `IntCounterVec`, `aether_pending_arb_sim_skipped_total{reason}` `IntCounterVec`, `inc_pending_arb_candidates` and `inc_pending_arb_sim_skipped` helper methods. | | `deploy/docker/grafana/dashboards/mempool.json` | 3 new panels for Phase C-full signal: arb candidates 1m rate stat, arb candidates by profit bucket time series, sim skipped by reason time series. | ## Acceptance Criteria (#117) - [x] `MEMPOOL_TRACKING=1` gates the path end-to-end; unset = identical boot - [x] Alchemy `alchemy_pendingTransactions` WS subscription with router filter - [x] V2 / Sushi calldata decode (95%+ within supported routers; long-tail selectors logged) - [x] Post-state cycle scan emits `aether_pending_arb_candidates_total` (analytical Path A; verified end-to-end) - [x] MEV-Share SSE consumer emits hint counters + calldata/log enrichment % - [x] Grafana dashboard renders 9 charts including arb candidates and sim skipped - [x] Zero executor submission — no bundle build, no `eth_sendBundle` call from this path - [x] `cargo build --workspace --release` clean - [x] `cargo clippy --workspace --all-targets --release -- -D warnings` clean - [x] `cargo test --workspace --release` — 436 passed, 0 failed - [x] `go build ./...`, `go vet ./...` clean - [x] `go test ./... -race -count=1` — all packages ok - [x] `forge test` — 59 passed, 2 skipped, 0 failed - [x] `docker compose config` validates - [x] Live mainnet smoke: pipeline produced signal end-to-end with zero errors over a 5-min sample Deferred to follow-up issues (out of scope per #117 itself): - 95% decode rate gate — needs UniV3 multicall (universal-router) + Curve + 1inch decoders - Non-zero candidates over a 1-hour run — needs pool registry expansion (current `config/pools.toml` ships 3 pools; a non-trivial fraction of pending V2 swaps skip on `token_*_unknown` against this narrow set, exactly as expected) - First-seen latency delta histogram (Alchemy ↔ MEV-Share) - `MEMPOOL_TRACKING` README + docs site update - Path B revm fork sim for V3/Curve/Balancer victims - Backrun bundle construction — explicitly out of scope per #117 ("keep this scaffold log-only") ## Live mainnet evidence (5-min sample, ETH mainnet) ``` mempool decoded: 13 UniswapV2 swaps sim ran end-to-end: 2 (graph clone → V2 prediction → BF scan → no_profitable_cycle) sim skipped: 11 (token_in_unknown / token_out_unknown — 3-pool registry too narrow) decode errors: 1 (unknown_selector long-tail) mev-share hints: 4899 mev-share with calldata: 1799 (37%) mev-share with logs: 1655 (34%) mev-share errors: 0 engine detection p99: < 0.5 ms ``` Two end-to-end sim runs prove the pipeline plumbs Alchemy → decode → token_index → pool_registry → snapshot → V2 math → graph mutation → Bellman-Ford → metric emission against live mainnet traffic. Zero candidates is the correct signal on a registry of 3 WETH-USDC pools — that pair is the most-arbed on Ethereum and retail-sized swaps don't crack the 30 bps fee on UniV2 vs Sushi vs UniV3. ## Test plan - [x] `cargo build --workspace --release` — `Finished release profile [optimized] target(s) in 9.61s` - [x] `cargo clippy --workspace --all-targets --release -- -D warnings` — clean - [x] `cargo test --workspace --release` — 436 passed, 0 failed, 0 ignored - [x] `go build ./...` — clean - [x] `go vet ./...` — clean - [x] `go test ./... -race -count=1` — `executor`, `monitor`, `pooldiscovery`, `internal/config`, `internal/risk` all `ok` - [x] `forge test` — 59 passed, 2 skipped (4 suites, 159 ms) - [x] `docker compose -f deploy/docker/docker-compose.yml config` — valid - [x] Live mainnet 5-min run: rust + monitor + Prometheus + Grafana — pipeline produced the signal table above with 0 errors ## Operator usage ```bash set -a && source .env && set +a MEMPOOL_TRACKING=1 \ MEMPOOL_WS_URL="wss://eth-mainnet.g.alchemy.com/v2/${ALCHEMY_API_KEY}" \ RUST_METRICS_PORT=9092 \ RUST_LOG=info,aether_grpc_server::mempool_pipeline=info \ ./target/release/aether-rust MEMPOOL_TRACKING=1 METRICS_PORT=9094 ./bin/aether-monitor ``` Sample after a soak window: ```bash curl -s http://localhost:9092/metrics | grep -E "^aether_pending_(dex_tx|decode_errors|arb_candidates|arb_sim_skipped)_total" curl -s http://localhost:9094/metrics | grep "^aether_mev_share" ``` Grafana: `http://localhost:3000/d/aether-mempool/aether-mempool-testing-scaffold`. Closes #117
…bodies Strategy-class analysis (backrun/frontrun/sandwich), builder fan-out matrix, and the staged issue bodies for Phase 1 Part A/B/C of the public-mempool backrun execution path.
ValidatedArb gains ArbSource enum + victim_tx_hash + target_block; existing block-driven publishers default to BlockDriven. Add four new metrics (validation_latency, validated, rejected, sim_concurrent) + RAII guard so the gauge stays balanced on panic.
Generic over CacheDB<DB> so it consumes either RpcForkedState (prod) or ForkedState (tests). Applies victim via transact_commit, then arb via transact, measures ERC20 balance delta. Six reject reasons mapped 1:1 to the metric label set landed in 01149de.
SimContext gains arb_publisher + backrun fields. main.rs builds the validator config from env (AETHER_EXECUTOR_ADDRESS gate, defaults for WETH/profit_token/sim_concurrency). try_post_state_scan hands the best profitable cycle to run_backrun_validation; the call site short-circuits with sim_error until DynProvider plumbing lands.
BackrunValidatorConfig.provider is set from engine.rpc_provider(). run_backrun_validation replaces the short-circuit with a real RpcForkedState build + validate_backrun_rpc call. On accept it constructs a ProtoValidatedArb (source=MempoolBackrun) and publishes on the shared arb broadcast. Reject path bumps the per-reason metric with the exact RejectReason the simulator returned.
BuildMempoolBackrunBundle ships [victim_hash, our_arb] envelope with revertingTxHashes carrying only the arb hash (victim hash leak would risk adverse fill). bundles_built/submitted/included counters now carry a source label split block_driven vs mempool_backrun. Mempool priorityFee = max(suggested, basefee*10%, env-floor) so we don't lose the backrun auction with last-block's 50th percentile. processArb branches by ArbSource and uses the new BundleConstructor path.
Adds the third layer of the public-mempool backrun execution path:
mempool-specific risk gates (min profit, max tip share, victim
freshness, per-target-block dedup + in-flight cap) and a shadow-mode
JSON forensics dump so Stage A can run without any ETH at risk.
Gates run AFTER the shared PreflightCheck so block-driven limits still
fire first; rejections are bucketed in aether_mempool_risk_rejected_total
{reason}. Shadow path bumps aether_executor_bundles_shadow_blocked_total
{source} and writes a per-arb JSON to
${AETHER_REPORTS_DIR:-reports}/shadow_mempool_<ts>/bundles/<arb_id>.json
with the schema downstream tooling pins (envelope, risk_decisions,
expected_gross/net_profit_wei, base/priority fees, flashloan provider).
Synthetic sol!-encoded tests can mask encoder-vs-decoder drift because both sides use the same Rust macro. Pin the decoder against three real mainnet pending-tx payloads to catch that class of bug. Each fixture is the raw eth_getTransactionByHash → input bytes of a real Universal Router call, embedded as a hex constant: - 0xfbb854...8996 (UR V2, ops=[V2_SWAP_EXACT_IN, SWEEP]): WETH → arbitrary token; asserts V2 surfacing with canonical WETH as token_in and zero fee_bps. - 0x87a841...4d4e33 (UR V1.2, ops=[WRAP_ETH, V3_SWAP_EXACT_OUT, UNWRAP_WETH]): asserts the V3 exact-out path flip lands WETH on the token_in side and surfaces the 1% (10000 bps) fee tier. - 0x637397...0b1f (UR V2, ops=[V2_SWAP_EXACT_IN, UNWRAP_WETH]): token → WETH; asserts WRAP/UNWRAP opcodes are dropped silently and only the swap survives. All three fixtures use the production decode_pending_many entry point, so any future drift in the dispatcher, opcode parser, or sol! ABI shape regresses these tests.
The E5 fixture sweep showed that Universal Router V2 traffic is now dominated by opcode 0x10 (V4_SWAP), which wraps a Uniswap V4 PoolManager action stream. Roughly four out of every five UR V2 pending txs the captures touched had V4_SWAP at the top level. The prior catch-all skip dropped them indistinguishably from other unknown opcodes, hiding their volume in the dashboards. V4 PoolManager decode is a substantial separate workstream — new PoolKey shape, hooks-aware pricing, per-action ABI tables — and is out of scope for the current V2/V3 backrun target. This PR is deliberately minimal: - Add a named V4_SWAP = 0x10 constant in ur_commands with a doc comment that explains why it is currently un-decoded and what full support would entail. - Route the opcode through an explicit match arm that emits a debug-level trace event labelled 'ur_v4_swap_unsupported' so operators can size the missed-volume from logs ahead of any formal metric migration. - Unit test that confirms a mixed [V4_SWAP, V3_SWAP_EXACT_IN] call silently drops the V4 record and surfaces the V3 swap intact.
`2_500u64 * 10u64.pow(18)` = 2.5e21 overflows u64::MAX (1.84e19) and panics the test under debug-mode arithmetic checks. Lift the constant into U256 arithmetic so the literal evaluates safely.
`prewarm_state` fired N + M unbounded parallel `eth_getCode` / `eth_getStorageAt` calls via `join_all`. On free-tier RPC providers (notably Alchemy at 25 CU/s) this collapsed into 429 cascades: the last shadow run logged 721 throttle events, leaving the fork DB without bytecode or reserves and forcing 100% of simulations to revert against missing state. Replace `join_all` with `futures::StreamExt::buffer_unordered` driven by a shared `tokio::sync::Semaphore`. Both the code-fetch and storage- fetch streams acquire permits from the same gate, so the combined in-flight request count is bounded by a single ceiling. Default is 8; override via `AETHER_PREWARM_MAX_CONCURRENT`. Includes unit tests for env-var override, garbage-input fallback, zero-rejection (would deadlock with a 0-permit semaphore), and a runtime assertion that the semaphore actually caps peak concurrency. No new dependencies. No public API change beyond exporting `DEFAULT_PREWARM_MAX_CONCURRENT`.
Contract bytecode is content-addressable and immutable for the lifetime of an address. Today's pre-warm flow re-fetches every byte on every restart via `eth_getCode`, burning the most expensive RPC budget on data that hasn't changed since deployment. Last shadow run attributed ~95% of `eth_getCode` calls to addresses we'd already queried in earlier blocks. Introduce `BytecodeCache`: a two-tier (in-memory + disk) cache backed by `redb` (pure-Rust, ACID, no C dependencies). Disk schema has two tables — `addr_to_hash` and `hash_to_code` — so multiple addresses sharing the same bytecode (proxies, factories) deduplicate on disk. CREATE2 re-deploys at the same address that change the code hash transparently overwrite the cache entry; dangling `addr_to_hash` entries with no payload self-heal by falling back to RPC. Module is self-contained in this PR. Wiring into `prewarm_state` / `mempool_pipeline` ships separately to keep the diff reviewable.
Plumbs the persistent `BytecodeCache` introduced in the prior commit through to `prewarm_state` so the mempool-shadow and block-driven fork pre-warm steps both consult it before issuing `eth_getCode`. Freshly fetched bytecode is persisted back so subsequent block cycles serve cache hits. Wiring path: - `Engine` opens the cache at startup from `AETHER_BYTECODE_CACHE_PATH` (unset = disabled, identical to today's behaviour). - `Engine::bytecode_cache()` exposes the handle to `main.rs`. - `SimContext::with_bytecode_cache()` stores it on the mempool pipeline context. - `prewarm_state` partitions code addresses into cache hits and misses; only misses hit RPC. Two unit tests cover the new code path: one verifies that a fully warm cache lets `prewarm_state` succeed against an unreachable RPC endpoint, the other guards the `None`-cache backwards-compatible shape.
The engine already subscribes to every monitored pool's `Sync` event and decodes the reserve pair on every block. Today we discard the decoded values after updating the price-graph edge and pay a fresh `eth_getStorageAt` round-trip per V2 pool on every pre-warm cycle to refetch the very same numbers that just arrived for free over the WebSocket. Introduce `V2ReservesCache`: a lock-free `DashMap` keyed by pool address holding the latest `(reserve0, reserve1, block_number)`. The engine's `handle_pool_update` now records every UniV2 / Sushi `ReserveUpdate`; `prewarm_state` partitions V2 addresses into hits (synthesise packed slot 8 locally) and misses (RPC). Stale entries older than `V2_RESERVES_MAX_LAG_BLOCKS` (1 block) fall back to RPC so reorgs / WS reconnects can't poison the cache. Slot 8 layout: `reserve0 | (reserve1 << 112)`. We deliberately leave `blockTimestampLast` zero because `Sync` does not carry it and UniV2 swap math does not depend on it during quote derivation (only the post-swap `_update` writes the field). Consumers needing a live timestamp must still fall through to RPC. Wiring path mirrors the bytecode cache (G2-wire): - `Engine` owns the cache and exposes `v2_reserves_cache()`. - `handle_pool_update` writes on every V2 `ReserveUpdate`. - `SimContext::with_v2_reserves_cache()` carries it to the mempool pre-warm side. - `prewarm_state` takes an optional `&V2ReservesCache`. `main.rs` plumbs the engine handle into the mempool `SimContext` so both pre-warm paths read from the same WS-fed instance.
Caches landed in #173/#174/#178 had no Prometheus instrumentation; only a debug log line reported cycle outcomes. Adds five IntCounters so the shadow-mode dashboard can read RPC-reduction ratios directly. - aether_prewarm_bytecode_cache_hits_total - aether_prewarm_bytecode_rpc_fetches_total - aether_prewarm_v2_reserves_cache_hits_total - aether_prewarm_v2_reserves_cache_stale_total - aether_prewarm_v2_reserves_cache_missing_total Implementation surfaces a PrewarmStats struct from prewarm_state and splits V2ReservesCache::get_fresh into a V2CacheLookup enum so the stale vs missing distinction reaches the counter without changing the fast path. Engine and mempool refresher both call record_prewarm_stats after every cycle.
Demo telemetry showed the WS-fed V2 reserves cache stuck at 0 hits even after 50+ blocks: 94% of detector lookups landed on pools the WS Sync writer had never observed. Active high-volume pools (which Sync often) sit in equilibrium and rarely appear in arb cycles; low-liquidity pools (which the detector picks) Sync rarely. Population pattern and access pattern barely overlap. Fix: after every successful eth_getStorageAt in the pre-warm storage stream, decode the packed slot via the new unpack_slot8 helper and write the (reserve0, reserve1, target_block) snapshot back into the cache. The next pre-warm cycle for the same pool then hits locally. This converts every RPC fetch into a one-shot lazy populate. After one warm-up block the steady-state miss rate collapses to "pools the detector has not yet looked at," which is bounded by the registry size.
Adds 8 panels covering every cache counter shipped in the recent pre-warm work: - bytecode cache hit ratio gauge (cumulative) - v2 reserves cache hit ratio gauge (cumulative) - RPC calls avoided in the last 5 minutes (headline win metric) - mempool refresher warm-pool gauge - bytecode hits vs RPC fetches per minute timeseries - v2 hits / stale / missing per minute timeseries - mempool pre-warm refresh latency p50/p95/p99 - mempool pre-warm inject hit / miss per minute Auto-loaded via the existing /var/lib/grafana/dashboards mount; the 30-second provisioning poll picks it up without a Grafana restart.
…utput Demo telemetry showed every mempool-backrun validation failing with reason="victim_reverted" but no further detail — the revert output bytes were destructured with `..` and dropped at the match site. Triaging required guessing whether the cause was an ERC20 allowance failure, slippage protection, deadline expiry, or a custom DEX error. Adds `decode_revert_reason(&[u8]) -> String` covering the two Solidity-standard revert shapes: - `Error(string)` (selector 0x08c379a0) → original ASCII message - `Panic(uint256)` (selector 0x4e487b71) → "Panic(0xNN)" - Custom errors → "custom_error_0x<selector>" for 4byte.directory lookup - Empty / short → stable marker tokens Wires it into both revert paths (victim + arb leg) so the existing `debug!` lines carry a `reason=...` field and `output_hex=...` for post-hoc replay. No metric label changes — the reason strings are unbounded so they only land in logs. Four unit tests cover the standard shapes plus the empty-revert edge case that `require()` with no message produces.
README had drifted ~6 weeks behind develop. Sweep covers everything that landed (or is in flight) since the April mermaid-diagrams commit: - DEX list extended to 1inch v6 and the Uniswap Universal Router - Repository structure adds `crates/integration-tests`, `cmd/pooldiscovery`, `cmd/reconciler`, `config/executor.yaml`, staging / replay pool configs, and the new scripts (`canary.py`, `mempool_*`, `historical_replay_e2e.sh`, `staging_test.sh`, `db_migrate.sh`, `test_integration.sh`) - Architecture diagram adds the mempool decoder + post-state predictor + revm backrun validator alongside the block-driven detector, plus the cache tier (bytecode disk cache, V2 reserves WS cache, per-protocol pool-state cache) - New "Mempool Backrun" section documents the shadow / canary / live rollout flow and links the runbook subdocs - New "Cache Layers" section sketches the RPC-reduction stack - "Configuration" gains a key environment-variables table covering the mempool, cache, sim, ledger, and OTel knobs - Monitoring section drops PagerDuty / Telegram / Discord — alerts are Slack-only via `cmd/monitor/alerter.go` - Documentation links pick up `docs/runbook/`, `docs/research/`, `docs/demo/` - Security section calls out Ownable2Step ownership rotation and the `setPaused` circuit breaker added in the April contract refactor Pure documentation change. No code paths touched.
Collapses N eth_getStorageAt round-trips into a single eth_call against the canonical Multicall3 deployment. Each batch costs 26 CU regardless of N, so it crosses over the per-pool path at N>=2 and saves ~95% of Alchemy CU at typical batch sizes. Also drops the in-flight RPC count from N to 1, which is the real win under burst load. Three new counters surface the impact: aether_prewarm_multicall_batches_total aether_prewarm_multicall_v2_slots_total aether_prewarm_multicall_fallbacks_total
Production overlay (docker-compose.prod.yml) layered over the base compose: health checks on aether-rust/aether-go, restart: unless-stopped across the stack, bounded json-file logging, aether-go waits for aether-rust healthy, and the operational env wired for a shadow-mode mempool-backrun run. Mounts the forge artifact into aether-rust so the simulator can inject the AetherExecutor bytecode (the container otherwise only sees config/). Adds .env.example documenting every required var, remote-deploy.sh to rsync + build + bring the stack up on a remote host, and wget in the rust image for the metrics health check.
The public-mempool backrun path either detected phantom cycles or reverted after detection: it used three inconsistent states for one arb (block-N snapshot + analytical victim overlay for detection/sizing, chain HEAD for the sim fork, real victim replay) and skipped the cycle_gating layer the block-driven path relies on. - pin the backrun fork to the detected block (was new_at_latest / HEAD); AETHER_BACKRUN_FORK_LATEST=1 restores latest for Anvil forks - run build_fingerprint_index + gate_pre_sim before sim and gate_post_sim before publish, in both the single-leg and Bancor multihop scans - sort detected cycles by profit_factor and hand the validator the best one that survives gating (was profitable.first()) - cycle_to_swap_steps picks the min-weight edge Bellman-Ford scored on (was the first matching edge), so the executed pool matches the scored one - publish the optimizer-sized flashloan_amount, not the fixed default Also checkpoints in-flight mempool/backrun WIP (cycle_gating, hot_token, pool_admission, fee_on_transfer, router_decoder, predictors) these build on.
…ig into go Three issues surfaced bringing the stack up on the shared remote host: - aether-rust crash-looped with `GLIBC_2.38 not found`: the binary built on rust:1.94.1-slim (glibc 2.38) cannot run on debian:bookworm-slim (2.36). Bump the runtime stage to debian:trixie-slim. - grafana could not bind host port 3000 (owned by another project on the box); remap to 3001 via a `!override` ports entry. - aether-go crash-looped on a missing config/executor.yaml: the base compose mounts config/ into rust only. Mount it into aether-go and set AETHER_CONFIG_DIR=/app/config.
The engine reads AETHER_POOLS_CONFIG or falls back to the RELATIVE "config/pools.toml" (main.rs:86). In the container, workdir is / and the registry is mounted at /app/config, so the relative path resolved to /config/pools.toml -> not found -> 0 pools loaded -> every mempool swap was dropped not_in_registry -> 0 backrun candidates. The base compose set AETHER_NODES_CONFIG absolute but omitted the pools path. Set it explicitly (now loads 118 pools). Also thread AETHER_MEMPOOL_MIN_PROFIT_WEI through to both services so shadow runs can lower the pre-sim profit gate for observation.
The deployed branch's revm backrun validator published 0 arbs. Root causes, now fixed: - EIP-3607: the in-sim caller defaulted to the executor address (which gets executor bytecode injected), so revm rejected every arb leg as sim_error. Enable revm `optional_eip3607` + set `disable_eip3607` so the synthetic caller is accepted; default a code-free searcher caller. - 429s / stalls: the fork provider was a bare connect_http. Wrap it in an alloy RetryBackoffLayer (self-throttle to AETHER_RPC_CUPS + exponential 429 backoff retry) plus a per-request reqwest timeout, so Alchemy rate-limits no longer propagate into revm as DB errors and a cold fetch can no longer stall the synchronous fork ~16s and starve the sim semaphore. One provider site feeds the block-driven sim, the mempool validator, and the prewarm refresher. - Observability: classify revm DB errors as rpc_transport vs sim_error, add sim_timeout, and add a per-sim bounded retry that rebuilds the fork on a transport error and relabels a slow failure as sim_timeout. - Missed arbs: generalize exact input sizing to every protocol via poolstate_quote so phantom cycles drop pre-sim and genuine multi-venue arbs are sized correctly (was UniswapV2/Sushi-only -> blind fallback). - Deploy: expose the RPC/sim tuning knobs in the prod overlay and .env.example. Verified: cargo clippy --workspace clean; simulator 64 pass; grpc-server 73 pass (only the pre-existing localhost:8545 provider-shutdown test fails). compose config validates.
…r emits main.rs built the Alchemy mempool source with `new(cfg)` instead of `with_metrics(...)`, leaving MempoolIngestMetrics unregistered. The EIP-2718 victim raw-tx re-encode gate then dropped a corrupted txs[0] capture with only a warn!, and the backrun-funnel "Raw re-encode mismatches" panel was permanently empty — masking a bundle-integrity blind spot. Register the metrics on the engine's shared registry and hand the source a handle. Also document the new rpc_transport / sim_timeout / revm_contradicts reject labels in the funnel dashboard panel descriptions.
…address Registry expansion to increase cross-DEX arb frequency. All 10 additions verified on-chain (token0/token1/fee for UniV3; symbol + coins(uint256) for Curve): weETH/WETH (UniV3 0.05%/0.01% + Curve), USDC/USDT 0.05% UniV3 (unlocks the stable triangle — pair was single-venue), WBTC/cbBTC + USDC/cbBTC UniV3, USDe/USDC (UniV3 + Curve), crvUSD/USDC + crvUSD/USDT Curve. Also fixes the wstETH/WETH 0.01% entry: prior address 0x109830a6...0d70F has no contract code on mainnet (a lookalike) and produced a dead edge; corrected to the real pool 0x109830a1...7dAa. 118 -> 128 pools, parses clean, no duplicate addresses.
UniswapV3 edges carry a synthetic (reserve_in, reserve_out) = (1.0, spot_price) seed (V3 has no x*y reserves). Two gates misread that 1.0 as dust and silently removed deep V3 pools: - cycle_gating.rs reserves_too_low (1e6 raw-reserve floor) dropped any cycle whose hop was served only by V3 pools (e.g. WBTC/cbBTC, USDC/cbBTC). Now protocol-aware: V3 edges pass when live (reserve_in > 0.0); V2/Sushi/Curve/Balancer keep the raw-reserve floor. - price_graph.rs min_liquidity_weth filter converted the synthetic 1.0 to a ~0 human-WETH reserve and set filtered=true, removing WETH-paired V3 edges from Bellman-Ford detection entirely. Now skips V3 synthetic edges. Corrupt-placeholder rejection (reserve_in == 0.0) still holds for every protocol. Adds regression tests on both sides.
…pot) V3 price-graph edges were seeded with synthetic constant-product reserves (reserve_in, reserve_out) = (1.0, spot). The optimizer's const-product profit function dy = dx*fee*y/(x + dx*fee) ran on those, but x=1.0 makes the curve infinitely shallow — output saturates at y for any real input, so the optimal input and net profit for every cycle containing a V3 hop were wrong. Bellman-Ford detection was unaffected: the edge weight depends only on the marginal rate y/x = spot. Seed V3 edges with virtual constant-product reserves from L and sqrtPrice: x_v = L * 2^96 / sqrtPx96 (token0 raw) y_v = L * sqrtPx96 / 2^96 (token1 raw) Const-product on (x_v, y_v) is algebraically identical to the V3 single-tick swap math (compute_swap_within_tick); y_v/x_v = spot keeps the edge weight, and therefore detection, bit-for-bit unchanged. Only the curve depth — hence profit sizing — is corrected. - add aether_pools::uniswap_v3::virtual_reserves helper (+ exactness tests) - engine bootstrap now also fetches liquidity(); seed virtual reserves on the bootstrap and live V3Update paths - mempool scorer unified_to_post_reserves builds virtual reserves from the post-state sqrtPrice + liquidity - thread V3 liquidity through historical::PoolState for the replay and profit-scorer graphs and state_to_graph_reserves - exclude V3 from the optimizer wei input-size cap: its liquidity field is sqrt-liquidity L, not a wei amount (edge_caps_optimizer_input + tests)
Consolidate the divergent `+` line (24/7 deploy stack, mempool backrun correctness fixes — fork-pin, gating, sizing, loan — 10-pool registry, protocol-aware V3 liquidity gating, 429 hardening) on top of develop's decoder + sim-cache work. Conflict resolutions: - router_decoder.rs: both branches independently implemented the Universal Router decoder; kept develop's (richer: V4_SWAP named-skip, real-mainnet fixtures, more opcodes/tests). `+`'s UR addresses retained in mempool.rs. - fork.rs: kept develop's prewarm cache infra (bytecode + WS V2-reserves + Multicall3 batch); kept `+`'s inject_code_only factoring (stale-reserves fix). - mempool_backrun.rs: kept `+`'s richer revert decoder + selector-in-result + fork integration tests. - mempool_pipeline.rs: unioned SimContext fields/builders (develop's bytecode_cache/v2_reserves_cache + `+`'s hot_tokens/gating); prewarm refresh warms bytecode only (inject_code_only consumer) on develop's cache-aware API. - README: kept develop's tree.
Brings the UniswapV3 virtual-reserves rate fix (size V3 hops with virtual constant-product reserves from L+sqrtPrice instead of the (1.0, spot) seed; exclude V3 L from the optimizer wei cap). Auto-merged cleanly with develop's sim-cache + SimContext changes.
…o develop Brings main's 3 unique commits — grafana/prometheus/alertmanager observability, live mainnet shadow-mode runner + WS transport preference, and the fat-LTO release profile — so the eventual develop->main PR is a clean fast-forward. Conflict: main.rs pools-config bootstrap — kept main's `explicit_pools_config` form (required by the common empty-registry fail-fast below).
The + branch's from_parts test constructor predates develop's PrewarmStats field; add stats: PrewarmStats::default() so the merged tree compiles under --all-targets.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Brings
mainup to date with the consolidateddevelopline.develophad diverged frommain(110+ commits ahead, 0 behind after this consolidation) and the best code was split across three lines plus an in-flight fix — all now integrated and gated.What's included
(x_v = L·2^96/√P, y_v = L·√P/2^96)instead of the(1.0, spot)seed (infinitely-shallow curve → wrong V3 profit/optimal-input). Detection weight unchanged (y_v/x_v = spot). Also excludes V3Lfrom the optimizer's wei input cap (sqrt-liquidity ≠ wei).+) — 24/7 docker deploy stack + remote deploy, mempool backrun correctness (fork-pin, cycle gating, sizing/loan), 10-pool verified registry (+ dead-wstETH fix), protocol-aware V3 liquidity gating, 429 sim hardening.develop) — Universal Router / 1inch v6 / UniV2 calldata decoder coverage; persistent bytecode cache + WS-fed V2 reserves cache + Multicall3 batch prewarm.main's unique commits) — grafana dashboards + prometheus alerts + alertmanager routing, live mainnet shadow-mode runner + WS transport preference, fat-LTO release profile.Conflict resolutions (notable)
router_decoder.rs: both lines independently implemented the Universal Router decoder — kept develop's (richer: V4_SWAP named-skip, real-mainnet fixtures, more opcodes/tests).fork.rs: develop's prewarm cache infra ++'sinject_code_onlyfactoring (stale-reserves fix).mempool_pipeline.rs: unionedSimContext(develop's caches ++'s hot-tokens/gating); prewarm warms bytecode only on the cache-aware API.mempool_backrun.rs:+'s richer revert decoder + selector-in-result + fork integration tests.main.rs: keptexplicit_pools_configform (required by the empty-registry fail-fast).Verification (full workspace gate — all green)
cargo clippy --workspace --all-targetsclean;cargo test --workspaceall pass, 0 failed.go vet ./...clean;go test ./...5 packages ok, 0 fail.forge test59 passed, 0 failed, 2 skipped.