[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests #2154
gimballock wants to merge 40 commits into
11b2560 to
88d8d1d
Compare
|
The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:
|
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |
The first row here shows the results of the convergence time test for the default case (6 spm).
The convergence times are between 10 and 25 minutes, with total failures to converge (within quiet_window_secs of simulated time) occurring 17% of the time!
The next few rows describe the results for faster share rates: the most extreme times shrink (p99 falls from 25m to 15m) and the total-failure cases generally disappear around 30 spm.
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better; ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |
These two metrics (proximity to true hashrate and post-convergence adjustments) show a similar trend:
lots of undesired behavior in the extreme cases (top 10%, top 1%) of the 6 shares/min case, which is alleviated at higher share rates.
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 | 12 | 30 | 60 | 120 |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |
These tables show how often, and how quickly, vardiff responds to an unexpected change in hashrate, where the change is a proportional increase or decrease of anywhere from 5% to 50%.
The first table looks specifically at a 50% drawdown: at 6 spm, a full 30% of the time vardiff fails to adjust within 5 minutes. The next few rows show that the situation worsens at higher share rates; at 120 spm, 91% of the trials failed to adjust within 5 minutes.
The second table shows that this effect is basically the same for hashrate changes in the opposite direction, and also that changes of lesser magnitude trigger a fire within 5 minutes less often.
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`
Convergence rate: the fraction of trials that converge may fall at most 0.01 (one percentage point) below the baseline.
Convergence p90: the p90 convergence time may be at most 10% above the baseline's.
Settled accuracy: the p50 and p90 errors may each be at most 15% above the baseline's.
Jitter p50/p95: p50 may exceed the baseline by at most 0.02 fires/min (absolute); p95 by at most 25% (relative).
...etc.
You see the pattern: there are lots of magic thresholds in this portion of the code that are arbitrarily chosen at this point and fair game for analysis.
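As a rough illustration of the three comparison shapes in the doc comment above, here is a minimal sketch. The `Tolerance` enum, its variant names, and the constants are hypothetical and not taken from the PR; only the inequalities and the numeric slacks come from the quoted thresholds.

```rust
/// How a metric may move relative to its baseline.
/// Hypothetical type for illustration; not the PR's actual representation.
#[derive(Debug, Clone, Copy)]
enum Tolerance {
    /// Higher is better: passes when `current >= baseline - slack`.
    AtLeastMinus(f64),
    /// Lower is better, absolute slack: passes when `current <= baseline + slack`.
    AtMostPlus(f64),
    /// Lower is better, multiplicative slack: passes when `current <= baseline * factor`.
    AtMostTimes(f64),
}

impl Tolerance {
    /// Returns true when `current` is within budget of `baseline`.
    fn passes(self, baseline: f64, current: f64) -> bool {
        match self {
            Tolerance::AtLeastMinus(slack) => current >= baseline - slack,
            Tolerance::AtMostPlus(slack) => current <= baseline + slack,
            Tolerance::AtMostTimes(factor) => current <= baseline * factor,
        }
    }
}

// Slack values copied from the quoted doc comment.
const CONVERGENCE_RATE: Tolerance = Tolerance::AtLeastMinus(0.01);
const CONVERGENCE_P90: Tolerance = Tolerance::AtMostTimes(1.10);
const JITTER_P50: Tolerance = Tolerance::AtMostPlus(0.02);
const JITTER_P95: Tolerance = Tolerance::AtMostTimes(1.25);
```

One reason to keep the absolute variant: with a baseline of exactly zero, a multiplicative budget allows no movement at all, which is presumably why the jitter p50 check is absolute.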
5cbed7c to
85d6f8b
Compare
|
after some optimization I got
|
I'm so excited to see other people nerding out on vardiff with me! Thank you! A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.
2d10f57 to
414afbb
Compare
211bc98 to
2a88fde
Compare
|
Some learnings I had, @gimballock
|
63a19d0 to
a18c3a3
Compare
|
Thanks for these insights @adammwest, especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of
Phase 1: Basic metrics + simulation harness
Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff
Phase 2: Decomposed pipeline model
I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down: the components aren't truly independent. There's a sequential
The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.
Phase 3: Aggregate fitness metric
To your point about "how you combine all metrics into a final value": we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression
Phase 4: Realistic operating conditions
After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and
Current direction
I've backed off from prioritizing convergence speed after seeing overcorrection in practice.
The current focus is on:
On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better |
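To make the three-stage pipeline (Estimator → Boundary → UpdateRule) mentioned above concrete, here is a minimal sketch of how such a composition could be wired. All trait signatures and the three toy stage types (`CountPerTick`, `RatioBand`, `FullRetarget`) are assumptions for illustration; the PR's real traits differ (for example, a later commit notes that its `Estimator::snapshot` takes `&self`).

```rust
/// Stage 1: turns share arrivals into a realized shares-per-minute figure.
trait Estimator {
    fn observe(&mut self, shares: u32);
    fn snapshot(&mut self, elapsed_secs: f64) -> f64;
}

/// Stage 2: decides whether the deviation from target justifies a retarget.
trait Boundary {
    fn should_fire(&mut self, realized_spm: f64, target_spm: f64) -> bool;
}

/// Stage 3: computes the new hashrate estimate once the boundary fires.
trait UpdateRule {
    fn next_hashrate(&self, current: f64, realized_spm: f64, target_spm: f64) -> f64;
}

/// Toy estimator: plain share count since the last tick.
struct CountPerTick { pending: u32 }
impl Estimator for CountPerTick {
    fn observe(&mut self, shares: u32) { self.pending += shares; }
    fn snapshot(&mut self, elapsed_secs: f64) -> f64 {
        let spm = f64::from(self.pending) * 60.0 / elapsed_secs;
        self.pending = 0;
        spm
    }
}

/// Toy boundary: fires when realized/target leaves [low, high].
struct RatioBand { low: f64, high: f64 }
impl Boundary for RatioBand {
    fn should_fire(&mut self, realized_spm: f64, target_spm: f64) -> bool {
        let ratio = realized_spm / target_spm;
        ratio < self.low || ratio > self.high
    }
}

/// Toy update rule: full retarget, scaling the estimate by the rate ratio.
struct FullRetarget;
impl UpdateRule for FullRetarget {
    fn next_hashrate(&self, current: f64, realized_spm: f64, target_spm: f64) -> f64 {
        current * realized_spm / target_spm
    }
}

/// The composed algorithm: each stage can be swapped independently.
struct Composed<E, B, U> {
    estimator: E,
    boundary: B,
    update: U,
    hashrate: f64,
    target_spm: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Composed<E, B, U> {
    fn observe(&mut self, shares: u32) { self.estimator.observe(shares); }

    /// Runs one vardiff tick; returns the new hashrate estimate if it fired.
    fn tick(&mut self, elapsed_secs: f64) -> Option<f64> {
        let realized = self.estimator.snapshot(elapsed_secs);
        if !self.boundary.should_fire(realized, self.target_spm) {
            return None;
        }
        self.hashrate = self.update.next_hashrate(self.hashrate, realized, self.target_spm);
        Some(self.hashrate)
    }
}
```

With a 6 spm target, a tick that saw 6 shares stays inside a `[0.5, 1.5]` band and does not fire, while a tick that saw 12 shares fires and doubles the estimate.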
|
Hi, this is quite a detailed analysis and great decomposition of relevant parts. Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs Would be interesting to see how that algorithm performs in your benchmarks. |
Thanks for the info, @paratoxicdev. I will add this to my investigation. I know that is an SV1-native pool, but I will see which bits of his research cross over to the SV2 context and see if it's competitive!
|
Here is a breakdown of the calibration comparisons I made between the real-hashrate tests and the simulation results. They confirm the predictions, with the understanding that with this algorithm responsiveness scales with the age of the connection. That suggests including this detail in the simulation so the metrics are more directly comparable:
006363a to
a58132b
Compare
… rejection Add ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS required by sv2-apps pool code when rejecting extended channel open requests on pools configured for standard jobs only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator assumes one observe() call per tick (60s), but the pool calls increment_shares_since_last_update() once per share arrival. With 12+ shares per tick, each observe(1) applies a full EWMA decay, collapsing the rate estimate toward 1.0 regardless of actual throughput. This causes the algorithm to consistently see under-performance and ease. CumulativeCounter simply accumulates shares and computes realized SPM at snapshot time from the total count ÷ elapsed time — immune to the per-share vs per-tick calling convention difference. Retains AcceleratingPartialRetarget and AsymmetricCusumBoundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator previously applied EWMA decay on every observe() call, assuming one call per tick. Production callers invoke observe(1) per share arrival (~12× per tick), causing the rate to collapse toward 1.0. Fix: observe() now accumulates into a pending counter. The EWMA decay is applied once per snapshot() call (one per vardiff tick), making the estimator produce identical results whether called as observe(12) once or observe(1) twelve times. Uses AtomicU64/AtomicU32 for interior mutability in snapshot() (the Estimator trait requires &self) to satisfy Send+Sync bounds. Also documents the calling convention contract in the Estimator trait: implementations MUST handle both per-share and per-tick observe patterns. Production VardiffState uses CumulativeCounter (immune to this issue) until EwmaEstimator is validated in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
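The fix described in the commit message above (accumulate in `observe()`, decay once per `snapshot()`) can be sketched as follows. This is a simplified illustration under stated assumptions: the type name `PendingEwma`, the field names, and the smoothing formula `alpha = 1 - exp(-tick/tau)` are hypothetical, and it uses `&mut self` rather than the atomics the commit mentions.

```rust
/// EWMA share-rate estimator that defers all decay to `snapshot`.
/// Hypothetical sketch; not the PR's EwmaEstimator.
struct PendingEwma {
    /// Smoothing time constant in seconds.
    tau_secs: f64,
    /// Shares observed since the last snapshot.
    pending: u64,
    /// Smoothed shares per minute; `None` until the first snapshot.
    rate_spm: Option<f64>,
}

impl PendingEwma {
    fn new(tau_secs: f64) -> Self {
        PendingEwma { tau_secs, pending: 0, rate_spm: None }
    }

    /// Only accumulates, so per-share and per-tick callers behave identically.
    fn observe(&mut self, shares: u64) {
        self.pending += shares;
    }

    /// Applies exactly one decay step per tick and returns the smoothed rate.
    fn snapshot(&mut self, tick_secs: f64) -> f64 {
        let tick_spm = self.pending as f64 * 60.0 / tick_secs;
        self.pending = 0;
        let alpha = 1.0 - (-tick_secs / self.tau_secs).exp();
        let next = match self.rate_spm {
            None => tick_spm, // seed with the first observation
            Some(prev) => prev + alpha * (tick_spm - prev),
        };
        self.rate_spm = Some(next);
        next
    }
}
```

Because `observe` never touches the smoothed rate, calling `observe(12)` once or `observe(1)` twelve times before a snapshot yields the same result, which is the property the commit message describes.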
… is fixed EwmaEstimator is now safe for per-share observe() calls (pending shares accumulate, decay applied once per snapshot). Swap back from CumulativeCounter to get the EWMA's temporal smoothing benefits: better noise rejection and more stable estimates under Poisson variance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_time simulation
Port ckpool's vardiff algorithm (dual-window adaptive EWMA with per-share decay_time updates) to the three-stage pipeline.
Key finding: the correct way to port a continuous-time EMA to a tick-based framework is to simulate per-share updates within each tick, not to apply time-bias correction factors. While tuning gets CkpoolRemedy within ~5% of FullRemedy's comprehensive fitness, it yields no Pareto improvement: the dual-window switching adds complexity without outperforming a single-window EWMA(120s).
New components:
- CkpoolEstimator: per-share decay_time() simulation with dual-window adaptive switching and configurable fast-threshold
- HysteresisGate: binary fire/no-fire boundary with data gate + dead band
- CkpoolRetarget: full retarget with oscillation guard
- TimeBiasEwmaEstimator: single-window EWMA with time-bias correction
Grid registrations: ckpool(), ckpool_remedy(), ckpool_remedy_ft(n), ckpool_narrow_hyst(), ckpool_with(), time_bias_remedy()
Also updates PID_INVESTIGATION.md with structural analysis of why PID fails (stage conflation makes the 41% dead zone undiagnosable).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e0213c8 to
6aa27a9
Compare
The upstream commit df4e764 added ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS, which our branch already defined. Remove the duplicate to fix the build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The baseline was stale — generated before the EwmaEstimator promotion (f4cd687) changed VardiffState's internal composition. Regenerated with default 1000 trials × 80 cells at the canonical seed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the analysis, @gimballock! Why not consider share-based vardiff, if you've used it before? Since you're also Rust-based, I would have a look at the vardiff.rs + decay.rs files with Claude. The type system has allowed much better encapsulation and reasoning about behaviour, and I think that'll be easier to compare than the C code from ckpool. I do have to disagree with some of the things your Claude said:
Each channel has their own Vardiff so there's no shared data structure with contention. The pool is overall O(shares), where a couple of floating point operations for share based vardiff are completely dwarfed by the cost to deserialize a share from the wire and then do a double sha256 hash for nonce validation (which is done no matter what Vardiff you use). I'm not that caught up on SV2, so please correct me if that's not what it does.
So the ckpool algo not only specifies an estimator (exponentially weighted moving average, see decay.rs) but also a boundary (when to adjust difficulty). The two main parts which make up the boundary are the HYSTERESIS bounds and MIN_WINDOW_RATIO.
/// Minimum window ratio before considering adjustment.
/// Fraction of expected time (or shares) per window.
/// Derived from ckpool: 240s / 300s window = 0.8
const MIN_WINDOW_RATIO: f64 = 0.8;
/// Only decrease difficulty when rate drops below this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_LOW: f64 = 0.5;
/// Only increase difficulty when rate exceeds this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_HIGH: f64 = 1.33;
#[derive(Debug, Clone)]
pub(crate) struct Vardiff {
period: Duration,
window: Duration,
min_shares_for_adjustment: u32,
min_time_for_adjustment: Duration,
dsps: DecayingAverage,
current_diff: Difficulty,
old_diff: Difficulty,
first_share: Option<Instant>,
last_diff_change: Instant,
shares_since_change: u32,
min_diff: Option<Difficulty>,
max_diff: Option<Difficulty>,
diff_change_job_id: Option<JobId>,
}
So not only is the responsiveness tuneable, you can see it working on real mining machines in the wild: https://stats.ckpool.org/. From experience I can say that it responds beautifully to any scale of hashrate, be it a CPU miner, a bitaxe, or 1 EH/s rental hashrate. I really like the theoretical work you're doing and it has deepened my understanding (I now know words like estimator and boundary) of what I originally just copied from ckpool. I feel like this algo should be the benchmark that any new vardiff is compared with, since it's proven to work with real hashrate. That's just my two cents, like the work you're doing!
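The boundary described in this comment (a data gate plus a hysteresis band) can be sketched roughly as below. The three constants are copied from the quoted snippet; the `Decision` enum and the `decide` function are hypothetical, and the sketch gates on elapsed time only, ignoring the share-count gate that the quoted struct also carries.

```rust
use std::time::Duration;

/// Minimum fraction of the window that must elapse before adjusting.
const MIN_WINDOW_RATIO: f64 = 0.8;
/// Only decrease difficulty when rate drops below this fraction of target.
const HYSTERESIS_LOW: f64 = 0.5;
/// Only increase difficulty when rate exceeds this fraction of target.
const HYSTERESIS_HIGH: f64 = 1.33;

/// Outcome of one boundary evaluation (hypothetical type).
#[derive(Debug, PartialEq)]
enum Decision {
    Hold,
    Decrease,
    Increase,
}

/// `rate_ratio` is the realized share rate divided by the target share rate.
fn decide(rate_ratio: f64, elapsed: Duration, window: Duration) -> Decision {
    // Data gate: not enough of the window has been observed yet.
    if elapsed.as_secs_f64() < MIN_WINDOW_RATIO * window.as_secs_f64() {
        return Decision::Hold;
    }
    // Dead band: ratios inside [LOW, HIGH] never trigger a change.
    if rate_ratio < HYSTERESIS_LOW {
        Decision::Decrease
    } else if rate_ratio > HYSTERESIS_HIGH {
        Decision::Increase
    } else {
        Decision::Hold
    }
}
```

Note that the gate and the band are independent knobs: the gate limits how early a decision may be taken, while the band limits how small a deviation may trigger one.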
…h SPM
Replace the static AsymmetricCusumBoundary with AdaptivePoissonCusum, which selects the boundary based on the miner's configured share rate:
- Below SPM 10: PoissonCI (prevents overshoot on sparse data)
- At SPM 10+: AsymmetricCUSUM (fast reaction with abundant evidence)
This eliminates the FullRemedy vs VardiffState trade-off: PoissonCI was better at low SPM (bitaxe/small miners) while CUSUM was better at high SPM (large hashrate). The adaptive boundary gets both.
Also caps AcceleratingPartialRetarget η at 0.4 (was 0.6). The lower cap prevents cold-start overshoot while still accelerating convergence after step changes. Parameter sweep (compare_out7/) confirmed this is the Pareto-optimal cap.
New components:
- AdaptivePoissonCusum: SPM-threshold boundary selector
- GuardedAccelRetarget: acceleration only after first direction reversal (experimental, not used in production)
- sweep-adaptive.rs: parameter sweep binary for boundary tuning
Simulation results vs previous production (comprehensive fitness):
SPM 4: 0.706 vs 0.636 (+11%)
SPM 8: 0.771 vs 0.737 (+5%)
SPM 10: 0.783 vs 0.774 (+1%)
SPM 15: 0.810 vs 0.803 (+1%)
SPM 20: 0.894 vs 0.882 (+1%)
SPM 25: 0.903 vs 0.894 (+1%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…estimator equivalence test Adds two follow-up investigations to CKPOOL_INVESTIGATION.md: 1. Hysteresis boundary sweep: tested band widths from [0.5,1.33] through [0.9,1.1] with varying data gates and update rules. Conclusion: hysteresis achieves excellent reaction rates (96-100%) but at 10× jitter cost vs statistical boundaries. No parameterization achieves competitive comprehensive fitness. 2. Estimator equivalence test (revised): paired CkpoolEstimator with the production-tuned boundary (AdaptivePoissonCusum) and update (AcceleratingPartialRetarget). Result: NOT equivalent — the per-share decay_time() simulation is noisier per tick than batch EWMA, and this difference is amplified by CUSUM's aggressive firing. EwmaEstimator(120s) is genuinely better, not just simpler. Also adds ckpool estimator + hysteresis variants to compare-algorithms binary for reproducibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@paratoxicdev You're right on both points: I should have posted my actual findings rather than a raw Claude transcript.
On scalability: You're correct. Each channel owns its own Vardiff: no shared state, no contention. The FP cost of
On "strictly better": Also wrong. The reality is more nuanced than either of us stated; I'll explain below.
What I found after reading your
I ported ckpool's algorithm and benchmarked it across SPM 4–30. Full writeup in commit 6aa27a9, updated in b77eff8.
On your code specifically:
Why we didn't use ckpool's components in production: I tested all three stages of the pipeline (estimator, boundary, update) independently:
Estimator: After correct porting via per-share simulation,
The per-share
Boundary: I swept hysteresis from your native [0.5, 1.33] through [0.9, 1.1] with varying data gates:
Narrower bands get excellent reaction rates (96–100%, better than anything else), but at 5–10× jitter cost. The fundamental issue: hysteresis fires whenever the rate ratio crosses the band, with no evidence accumulation. Statistical boundaries (PoissonCI, CUSUM) require cumulative evidence before firing, so they distinguish real changes from noise.
Update: ckpool's full retarget (η=1.0) overshoots in the tick framework. Our
What we ended up with: The investigation led to a new production composition that adapts its boundary strategy based on the miner's share rate: conservative (PoissonCI) for low-SPM miners where data is sparse, aggressive (CUSUM) for high-SPM miners where evidence is abundant. This is analogous to what ckpool does with its dual-window adaptive switching, just at the boundary layer instead of the estimator layer. Your algorithm's idea of "be conservative when data is sparse, aggressive when data is abundant" is exactly right; we just implement it differently.
|
|
If you have any idea or suggestions to improve the comparison, let me know. |
The field was documented as a share count but actually used as an SPM threshold — align the name with the semantics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean extraction of the best-performing vardiff algorithm from the simulation framework in stratum-mining#2154, with all test scaffolding, traits, and alternative algorithm implementations removed.
The previous VardiffState used a fixed time-dependent threshold ladder and full retarget. This produced:
- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)
The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):
- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06), half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because loosening is free but tightening rejects in-flight shares
Breaking: adds private fields to VardiffState (previously all-pub). Requires channels_sv2 major version bump. Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add AsymmetricPoissonCI: applies a tighten_multiplier to the PoissonCI
threshold when the miner is over-performing, reflecting the asymmetric
cost of tightening (rejects in-flight shares) vs loosening (free).
Parameter sweep results (500 trials/cell, SPM 4-30):
Comprehensive fitness at low SPM (where PoissonCI is active):
symmetric (t=1.0): SPM4=0.706, SPM6=0.744, SPM8=0.758
t=1.5: SPM4=0.743, SPM6=0.771, SPM8=0.784
t=2.0: SPM4=0.799, SPM6=0.821, SPM8=0.801
t=3.0: SPM4=0.858, SPM6=0.872, SPM8=0.882
t=3.0 matches CUSUM's tighten_multiplier and is the clear winner
(+21% at SPM 4, +17% at SPM 6). SPM 10+ is unchanged (CUSUM active).
New files:
- AsymmetricPoissonCI in boundary.rs
- sweep-asymmetric-poisson.rs: parameter sweep binary
- asymmetric_poisson_sweep.md: sweep results
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
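The asymmetric threshold described in this commit message might look roughly like the sketch below. The commit does not spell out the PoissonCI formula, so the base threshold `z * sqrt(expected)` (a normal approximation to a Poisson interval) is an assumption, as are the type name `AsymPoissonSketch` and its fields; only the idea of multiplying the threshold by `tighten_multiplier` when the miner over-performs comes from the message.

```rust
/// Poisson-style boundary with a stricter bar for tightening.
/// Hypothetical sketch; not the PR's AsymmetricPoissonCI.
struct AsymPoissonSketch {
    /// Width of the confidence band in standard deviations (assumed form).
    z: f64,
    /// Factor applied to the threshold when the miner over-performs.
    tighten_multiplier: f64,
}

impl AsymPoissonSketch {
    /// Threshold on |observed - expected| share counts for this window.
    fn threshold(&self, expected: f64, observed: f64) -> f64 {
        // A Poisson count with mean `expected` has std dev sqrt(expected).
        let base = self.z * expected.sqrt();
        if observed > expected {
            // Over-performing: a fire would tighten, which is the costly direction.
            base * self.tighten_multiplier
        } else {
            // Under-performing: a fire would loosen, which is cheap.
            base
        }
    }

    fn should_fire(&self, expected: f64, observed: f64) -> bool {
        (observed - expected).abs() > self.threshold(expected, observed)
    }
}
```

With `z = 2`, `tighten_multiplier = 3` and 36 expected shares, a shortfall of 16 shares fires while a surplus of 16 does not; the surplus has to exceed 36 before the boundary tightens.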
…Pareto-optimal vardiff configs Names every composed algorithm by its three parts instead of opaque labels (FullRemedy/VardiffState), and uses that to drive a maximin-based search for balanced vardiff algorithms. Naming (drift-proof): - Add code() to the Estimator/Boundary/UpdateRule traits, implemented on every concrete type from its live params. naming::triple_name composes the display name "Estimator / Boundary / Update"; sanitize_filename derives a filesystem-safe form. AlgorithmSpec constructors now derive their names from the actual parts, so a name can never drift from what runs. Analysis tooling: - metrics: EqualWeightFitness — same 6 sub-metrics as OperationalFitness but uniform 1/6 weighting, so no cluster is privileged. - bin/radar-chart: SVG radar with maximin sort, best-in-class hull, and per-axis direction arrows. Defaults to the three headline contenders; VARDIFF_RADAR_FULL=1 plots the broad historical set. - bin/sweep-balanced, sweep-estimators, sweep-signpersist, sweep-signpersist-cotuned, sweep-voladapt: maximin-scored parameter sweeps. New boundary: - VolatilityAdaptiveBoundary: PoissonCI floor scaled by observed-vs-Poisson share-rate volatility. Kept as a documented negative result — it loosens during the drop it should catch, so it underperforms (see sweep-voladapt). - SignPersistenceCusumBoundary: gave it a param-distinctive code(). Two Pareto-optimal configs (AlgorithmSpec::balanced / react_priority): - balanced(): Ewma90 / AdaptPC-spm8[sensitive CUSUM] / Accel-0.3-0.6-0.2 — maximin 0.551, the best worst-axis characterized; beats production on both small-drop reaction and convergence. - react_priority(): Ewma90 / SignPersist / Accel — react-10% 0.696, far above the ~0.54 ceiling fixed boundaries hit, for fast failing-ASIC detection at the cost of convergence. 
The sweeps establish a ~0.55 maximin ceiling for the three-stage architecture: small-drop reaction and convergence trade against each other on a shared agility budget, confirmed from four independent directions (smoothing estimators, reactive boundary, volatility-adaptive boundary, sign-persistence). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…SPM collapse Per-SPM analysis revealed react_priority() (bare SignPersistenceCusumBoundary) collapses at low SPM: at 4 SPM its jitter/step-safety/convergence axes all fall to ~0 (maximin 0.000). Sparse-data Poisson noise produces spurious same-sign residual runs that trip the sign-persistence discount, causing constant false fires. balanced() was already immune — its AdaptivePoissonCusum dual-mode uses the conservative PoissonCI below the SPM threshold. - Generalize AdaptivePoissonCusum into AdaptiveBoundary<B: Boundary>: PoissonCI below spm_threshold, an arbitrary aggressive boundary B at/above. The CUSUM pairing is preserved as `type AdaptivePoissonCusum = AdaptiveBoundary<AsymmetricCusumBoundary>` with its original new/with_params, so all existing call sites are unchanged. - Add AdaptiveSignPersist alias + ::sign_persist constructor wrapping SignPersistenceCusumBoundary in the same low-SPM guard. - Repoint react_priority() at AdaptiveSignPersist(spm=8). Result: react_priority maximin at 4 SPM 0.000 → 0.292 (jitter 0.31→0.80, step 0.00→0.29, conv 0.00→0.60); high-SPM behavior unchanged (PoissonCI only engages below 8 SPM); aggregate maximin now competitive with balanced. The dual-mode code() always shows the high boundary's own code(), so names read "Adapt-spm8[AsymCusum-...]" / "Adapt-spm8[SignPersist-...]". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
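The dual-mode selection this commit generalizes (a conservative boundary below `spm_threshold`, an arbitrary aggressive boundary `B` at or above it) can be sketched as follows. The `Boundary` trait signature and the `RelativeBand` stand-in are hypothetical; in particular, `RelativeBand` with a fixed 50% tolerance merely plays the role of the conservative PoissonCI here and is not how the PR computes it.

```rust
/// Minimal boundary interface (assumed signature, for illustration only).
trait Boundary {
    fn should_fire(&mut self, expected_shares: f64, observed_shares: f64) -> bool;
}

/// Stand-in boundary: fires when the relative deviation exceeds `tolerance`.
struct RelativeBand {
    tolerance: f64,
}

impl Boundary for RelativeBand {
    fn should_fire(&mut self, expected_shares: f64, observed_shares: f64) -> bool {
        ((observed_shares - expected_shares) / expected_shares).abs() > self.tolerance
    }
}

/// Picks a boundary from the miner's configured share rate.
struct AdaptiveBoundary<B: Boundary> {
    configured_spm: f64,
    spm_threshold: f64,
    /// Used below the threshold, where data is sparse.
    conservative: RelativeBand,
    /// Used at or above the threshold, where evidence is abundant.
    aggressive: B,
}

impl<B: Boundary> AdaptiveBoundary<B> {
    fn new(configured_spm: f64, spm_threshold: f64, aggressive: B) -> Self {
        AdaptiveBoundary {
            configured_spm,
            spm_threshold,
            conservative: RelativeBand { tolerance: 0.5 },
            aggressive,
        }
    }
}

impl<B: Boundary> Boundary for AdaptiveBoundary<B> {
    fn should_fire(&mut self, expected_shares: f64, observed_shares: f64) -> bool {
        if self.configured_spm < self.spm_threshold {
            self.conservative.should_fire(expected_shares, observed_shares)
        } else {
            self.aggressive.should_fire(expected_shares, observed_shares)
        }
    }
}
```

Making the aggressive side generic is what lets the same low-SPM guard wrap different inner boundaries, which is the refactor the commit describes.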
VardiffState had been pointed at the tuned EWMA stack (EwmaEstimator(120) + AdaptivePoissonCusum + AcceleratingPartialRetarget) — scope creep that bundled an algorithm change into what is meant to be the diagnostic simulation framework. Revert production to the real upstream classic algorithm by delegating to composed::classic_composed (CumulativeCounter + StepFunction classic table + FullRetargetWithClamp). Delegating to the existing classic_composed factory (rather than re-spelling Composed::new) makes production and the sim's reference construction literally identical, so the monolith and ClassicComposed are now fire-for-fire equivalent by construction — restoring the truth of the "Cumul / Step / FullClamp*" name and the equivalence doc comment in grid.rs. The composed/ pipeline and clock injection stay as diagnostic infrastructure the sim drives; any production algorithm change ships as a separate, clean commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add LogErrorRegret, the principled replacement for the six-axis
EqualWeightFitness radar. In the natural error coordinate e =
ln(H_est/H_true) it decomposes behavior into:
- regret_over / regret_under: linear time-avg |e| split by sign (linear,
not quadratic — a quadratic loss is structurally blind to small
persistent degradation, the failing-ASIC case)
- effort_up / effort_down: Σ(Δln D)² over fires, split by direction
(tightening is costly, easing ~free)
Computed from the universal trajectory (current_hashrate_before,
new_hashrate, fired), so it works for every algorithm including the
opaque production monolith. Unit-tested on the over/under sign split.
EqualWeightFitness is marked deprecated (kept only until the maximin
sweep bins migrate to regret/effort scoring). The old radar-chart bin and
its generated SVGs are removed.
docs/THEORY.md (§1–§10) records the derivation: the conservation law, why
every fitness "axis" is one trade-off, the linear-vs-quadratic loss-shape
decision, and the incumbent-reference (vs hull) normalization choice.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
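A minimal sketch of the decomposition this commit describes, under several assumptions that the message does not confirm: ticks are equally spaced (so the time average is a plain mean), the error in a tick is measured on the estimate in force before any fire, a positive error counts as "over", and difficulty is proportional to the hashrate estimate so that Δln D equals the log ratio of the estimates. The `Tick` and `Regret` types are hypothetical.

```rust
/// One vardiff tick of the trajectory (hypothetical record layout).
struct Tick {
    hashrate_before: f64,
    hashrate_after: f64,
    fired: bool,
}

/// The four components named in the commit message.
#[derive(Debug, Default)]
struct Regret {
    regret_over: f64,
    regret_under: f64,
    effort_up: f64,
    effort_down: f64,
}

/// Computes regret and effort against a constant true hashrate.
fn log_error_regret(ticks: &[Tick], true_hashrate: f64) -> Regret {
    let mut r = Regret::default();
    if ticks.is_empty() {
        return r;
    }
    for t in ticks {
        // e = ln(H_est / H_true): positive when the estimate is too high.
        let e = (t.hashrate_before / true_hashrate).ln();
        if e > 0.0 {
            r.regret_over += e;
        } else {
            r.regret_under += -e;
        }
        if t.fired {
            // Assumes difficulty scales with the estimate, so this is Δln D.
            let d = (t.hashrate_after / t.hashrate_before).ln();
            if d > 0.0 {
                r.effort_up += d * d;
            } else {
                r.effort_down += d * d;
            }
        }
    }
    // Linear time average of |e| (equal tick durations assumed).
    let n = ticks.len() as f64;
    r.regret_over /= n;
    r.regret_under /= n;
    r
}
```

The regret terms are linear in |e| while the effort terms are quadratic in the step size, matching the loss shapes stated in the commit message.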
Tooling built on the §10 LogErrorRegret metric to find and present the new Pareto champions: - regret-effort: empirical validation of the conservation-law theory (binding? δ²-cancel? real frontier?) from raw trajectories. - regret-radar: the 5-axis principled radar (tracking, gentleness, detection, over-diff safety, tighten-care), anchored on ClassicComposed (real upstream vardiff). Familiar-metrics companion panel uses a per-axis log-ratio scale so contenders separate despite the anchor being a degenerate outlier on jitter/accuracy. - sweep-regret / sweep-regret-big: regret/effort-scored parameter sweeps (big = 9216 configs, parallel via std::thread::scope over run_cell_with_algorithm, bit-identical to a serial grid). - confirm-champions: high-trial tie-break on the sweep's flat top cluster plus edge-extension probes. - champion-weights: corner pressure-test + weight-sensitivity (re-scores one simulation pass under a weight grid for free), showing the new champion is a robust interior optimum, not a degenerate never-tighten corner, and validating the §10 3:1 over:under weight. - trajectory-plot: the plain-language comparison — estimate chasing truth over one timeline (cold-start ramp → settle → aged −10% drop), making ramp-up time and detection latency visible in a single frame. Includes an oracle reference line (same τ=150 estimator, no control policy) that decomposes the settle-phase offset into irreducible noise vs policy bias, and a fire-raster strip (mark height ∝ |Δln D|) that shows each algorithm's "few+violent vs many+gentle" retarget character and its reactivity to the small aged drop. New champion: Ewma150 / AsymCusum-s0.2-t6 / Accel-0.2-0.8-0.05 (~15% better cost than the prior balanced/react_priority champions, 100% detection). .gitignore generalized to cover the new bins' generated reports and charts (regenerate via cargo run; analysis lives in commits). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Improve on the two weaknesses the trajectory plot highlighted for the
interim AsymCusum champion: slow cold-start ramp (~34 min) and a
persistent settle-phase under-difficulty offset (−10.5%). Both are the
same thing — reluctance to tighten — and both are fixed by one mechanism.
Swap the boundary to AdaptiveSignPersist: the sign-persistence discount
relaxes the fire threshold on consecutive same-sign residuals AFTER the
tighten multiplier, so a *persistent* under-difficulty (cold start, settle
bias) progressively lowers the tighten bar and fires frequent small
corrections, while a one-off spike keeps full tighten-reluctance —
death-spiral safety preserved.
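
The sign-persistence discount described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `AdaptiveSignPersist`: the struct, its field names, the linear per-repeat discount, and the convention that a positive residual means "shares arriving faster than target" are all assumptions made for the example.

```rust
/// Hypothetical sketch of a sign-persistence fire threshold.
/// Field names and the linear discount shape are illustrative assumptions.
struct SignPersistThreshold {
    base: f64,                // fire threshold on |residual| for loosening moves
    tighten_mult: f64,        // extra reluctance applied to tightening moves
    discount_per_repeat: f64, // relaxation per consecutive same-sign residual
    max_discount: f64,        // cap on the total relaxation
    run_len: u32,             // consecutive same-sign residuals seen so far
    last_sign: i8,
}

impl SignPersistThreshold {
    fn new(base: f64, tighten_mult: f64, discount_per_repeat: f64, max_discount: f64) -> Self {
        Self { base, tighten_mult, discount_per_repeat, max_discount, run_len: 0, last_sign: 0 }
    }

    /// Returns true when this residual should trigger a retarget.
    /// Assumption: residual > 0 means under-difficulty (a tighten is needed).
    fn observe(&mut self, residual: f64) -> bool {
        let sign: i8 = if residual > 0.0 { 1 } else if residual < 0.0 { -1 } else { 0 };
        if sign != 0 && sign == self.last_sign {
            self.run_len += 1;
        } else {
            self.run_len = 0;
        }
        self.last_sign = sign;

        let mut threshold = self.base;
        if sign > 0 {
            threshold *= self.tighten_mult; // tighten multiplier first
        }
        // The persistence discount is applied AFTER the tighten multiplier.
        let discount = (self.discount_per_repeat * f64::from(self.run_len)).min(self.max_discount);
        threshold *= 1.0 - discount;
        residual.abs() > threshold
    }
}
```

With `base = 0.1`, `tighten_mult = 6.0`, and a 10% discount per repeat, a single +0.35 residual stays below the 0.6 tighten bar, while the sixth consecutive +0.35 residual crosses the relaxed bar of about 0.30.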
Found and validated with the §10 harness: a trajectory spike eliminated
the trend/confidence/gentle-frequent alternatives (VolatilityAdaptive
loosened the wrong way), then sweep-signpersist-regret (973 configs, 581
beat the AsymCusum champion) → confirm-signpersist (2000 trials + weight
grid) pinned the optimum at d=0.06 and proved it weight-robust at the §10
3:1 over:under weight (probes d<0.06 win only at an ungrounded ≤2:1).
New champion, promoted as AlgorithmSpec::champion():
Ewma150 / AdaptiveSignPersist[s0.3,f0.05,t6,d0.06,dm0.6,spm6]
/ Accel-0.2-0.8-0.05
Trajectory @1000 trials vs the interim AsymCusum champion: ramp-up
34→15 min, detect latency 12→9 min, settle gap −10.5%→−7.2%, detection
holds 100%. regret_under 0.096→0.087, regret_over flat.
- grid.rs: add champion() constructor.
- regret-radar / trajectory-plot: plot the SignPersist champion alongside
the interim AsymCusum one so the gains are visible at each step.
- sweep-signpersist-regret, confirm-signpersist: the new analysis bins.
- .gitignore: generalize confirm_*.md.
The settle gap is reduced, not closed (−7.2% vs the −0.8% oracle floor);
remaining daylight is a known, lower-priority follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… result
Investigate idea stratum-mining#1 (cold-start warm-up) to recover the champion's policy-imposed ramp time. The trajectory oracle readout (added here) shows the τ=150 estimator alone converges in ~2-3 min while the cautious champion takes ~15 min, so ~12 min of ramp is policy, not estimator, and in principle recoverable.
Add WarmupBoundary<B>: returns threshold 0 (fire on any deviation) until the realized rate first lands within `converge_band` of target, then one-way latches and delegates to the inner boundary B forever after. The rationale is sound: a fresh connection has no in-flight work to protect, so the death-spiral caution that justifies the steady-state policy only costs ramp time there. Latch behavior is unit-tested.
RESULT: rejected as wired. The trajectory spike looked great (ramp 15→7 min, settle/detection unchanged), but confirm-warmup (1000 trials, WarmupBoundary<AdaptiveSignPersist>, converge_band swept) shows it REGRESSES the §10 steady-state cost by up to 16.5%. The no-warmup champion wins outright; every band raises regret_over. Root cause: cold start is excluded from the cost, but the Stable/Step scenarios start at truth with Poisson-noisy early windows, and warm-up can't tell a genuine cold start from a noisy first window at correct difficulty, so it fires aggressively on noise and books over-difficulty regret. Salvaging it needs arming only on a large *sustained* initial deviation (a redesign).
Champion is UNCHANGED. The primitive (tested, reusable) and the confirmation bin are kept as a reproducible negative result; the spike itself is not wired into any shipped config.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
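
The one-way latch described in the commit message above (threshold 0 until the realized rate first lands within `converge_band`, then delegate to the inner boundary) could look roughly like this. The `Boundary` trait, its `threshold` signature, `FixedBoundary`, and the name `WarmupBoundarySketch` are hypothetical stand-ins, not the crate's actual types.

```rust
/// Hypothetical boundary interface; the real trait in the sim crate may differ.
trait Boundary {
    fn threshold(&mut self, realized_rate: f64, target_rate: f64) -> f64;
}

/// Trivial inner boundary used only for the illustration.
struct FixedBoundary(f64);

impl Boundary for FixedBoundary {
    fn threshold(&mut self, _realized_rate: f64, _target_rate: f64) -> f64 {
        self.0
    }
}

/// Sketch of the warm-up wrapper: fire on any deviation until first convergence.
struct WarmupBoundarySketch<B: Boundary> {
    inner: B,
    converge_band: f64, // relative band around target, e.g. 0.1 = ±10%
    latched: bool,
}

impl<B: Boundary> WarmupBoundarySketch<B> {
    fn new(inner: B, converge_band: f64) -> Self {
        Self { inner, converge_band, latched: false }
    }
}

impl<B: Boundary> Boundary for WarmupBoundarySketch<B> {
    fn threshold(&mut self, realized_rate: f64, target_rate: f64) -> f64 {
        // One-way latch: once set it is never cleared.
        if !self.latched && (realized_rate / target_rate - 1.0).abs() <= self.converge_band {
            self.latched = true;
        }
        if self.latched {
            self.inner.threshold(realized_rate, target_rate)
        } else {
            0.0 // warm-up: any deviation fires
        }
    }
}
```

The sketch also makes the reported failure mode easy to see: a session that starts at the correct difficulty but draws a noisy first window is indistinguishable from a cold start until the latch sets.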
Investigate whether the SignPersist champion's persistent ~−7% settle-phase under-difficulty offset (the gap to the trajectory oracle line) is recoverable via an estimator-side debias.
Add DebiasEstimator<E>: scales the inner estimator's h_estimate by a fixed `bias` (>1 lifts the belief), leaving the raw realized rate untouched. Unit-tested. The hypothesis: a tighten-reluctant boundary equilibrates with an under-difficulty offset; lifting the belief should move that equilibrium toward truth.
RESULT: settles the question. The gap is the deliberate optimum, not a defect. confirm-debias swept bias ∈ [1.0, 1.25] on the champion (1000 trials), tracking both §10 cost and the settle gap. The debias works mechanically (gap dials −6.9% → 0 near bias 1.10 → +21% at 1.25), but §10 cost rises MONOTONICALLY from bias=1.0: regret_under falls while regret_over rises faster under the 3:1 weight. bias=1.0 is the cost minimum. So the champion correctly declines to track dead-on: sitting slightly under-difficulty is cheaper than the over-difficulty risk, and the oracle reaches −0.8% only by firing every tick (effort the cost penalizes). Triangulated with champion-weights and the trajectory oracle line; changing the gap now requires changing the WEIGHTS, not the algorithm.
Reframe the trajectory plot to match: relabel the "oracle" line as the "accuracy ceiling (cost-blind)", an accuracy bound rather than a target, and add a green "cost-optimal settle (§10)" corridor at the champion's own settle level, so the champion visibly sits IN the objective's optimum rather than appearing to fall short of the cost-blind ceiling.
Champion UNCHANGED. Primitive (tested, reusable) + confirmation bin kept as a reproducible result.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
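
As a rough illustration of the debias wrapper described in the commit message above, the sketch below scales only the hashrate belief and passes the realized rate through. The `Estimator` trait, its two methods, `ConstEstimator`, and the name `DebiasSketch` are assumptions for the example; the crate's real estimator interface is not shown in this PR text.

```rust
/// Hypothetical estimator interface, assumed for illustration only.
trait Estimator {
    fn h_estimate(&self) -> f64;    // believed hashrate
    fn realized_rate(&self) -> f64; // raw observed share rate
}

/// Stand-in inner estimator with fixed outputs.
struct ConstEstimator {
    h: f64,
    rate: f64,
}

impl Estimator for ConstEstimator {
    fn h_estimate(&self) -> f64 {
        self.h
    }
    fn realized_rate(&self) -> f64 {
        self.rate
    }
}

/// Sketch of the wrapper: bias > 1 lifts the belief, bias = 1 is a no-op.
struct DebiasSketch<E: Estimator> {
    inner: E,
    bias: f64,
}

impl<E: Estimator> Estimator for DebiasSketch<E> {
    fn h_estimate(&self) -> f64 {
        self.inner.h_estimate() * self.bias
    }
    fn realized_rate(&self) -> f64 {
        self.inner.realized_rate() // left untouched
    }
}
```

Because the wrapper is a pure scale on the belief, sweeping `bias` is a one-parameter experiment, which matches the sweep over [1.0, 1.25] reported above.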
Add METRIC_DERIVATION.md: a standalone, skeptic-facing account of WHY the metric is regret(linear, sign-split) + effort(direction-split) + an explicit detection axis, per scenario class, with 3:1 directional weights.
Where THEORY.md is the chronological lab notebook, this is the cleaned-up proof: each claim is labeled PROVEN (algebra/probability from the plant identity), EMPIRICAL (named simulation), or VALUES (declared judgment), and the argument is carried as much by the three falsified hypotheses (δ²-cancellation, single-scalar E=J_opt/J, sufficiency of ∫e²) as by the surviving ones.
Includes the detection non-recoverability theorem (§7), the quadratic-blindness lemma (§5.3), the weight-robustness result, the confirm-debias "−7% gap is optimal" demonstration, and an explicit falsification checklist (§11) plus an epistemic-status table (§12). All citations verified against the tree (poisson_floor metrics.rs:864; commits 31a9dbc / a1d3fa7 / 70fcb26).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>



Adds a deterministic in-process simulation framework that characterizes
any `Vardiff` implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.
The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.
The finding that motivates this
Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:
Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):
Higher share rates make the algorithm less responsive, counter to
expectations. The mechanism is mechanical (post-convergence
`delta_time` grows indefinitely, diluting the post-step signal in the
cumulative window) and surfaced clearly in the data despite taking
months of deployment observation to half-articulate. Full numbers and
analysis in `vardiff_baseline.md`.
This isn't a fix. It's the measurement that lets the fix be evaluated.
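
The dilution effect can be worked through with a small calculation. The sketch below assumes the window has been accumulating at exactly the target rate since the last fire and ignores Poisson noise; the function name and parameters are illustrative, not part of the framework.

```rust
/// Expected relative deviation of the cumulative realized rate from target,
/// `minutes_since_step` after a step change of relative size `step`
/// (e.g. -0.5 for a -50% drop), when the window had already accumulated
/// `quiet_minutes` at the target rate before the step.
///
/// Derivation: shares = r*T + (1 + step)*r*t over T + t minutes, so
/// realized/target - 1 = step * t / (T + t).
fn diluted_deviation(quiet_minutes: f64, minutes_since_step: f64, step: f64) -> f64 {
    step * minutes_since_step / (quiet_minutes + minutes_since_step)
}
```

Five minutes after a -50% step, a window that had been quiet for 5 minutes shows an expected deviation of -25%, while one that had been quiet for 55 minutes shows only about -4.2%. The longer the algorithm has gone without firing, the smaller the step looks.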
What's in the PR
3 commits, ~2000 LOC plus baseline data:
- `feat(vardiff): inject Clock trait + add_shares trait method`
  Minimum API additions to `channels_sv2` for testability and
  simulation performance. Production behavior unchanged: existing
  constructors default to `SystemClock`, and the new trait method has a
  default implementation that keeps existing impls compiling.
- `feat(vardiff_sim): in-process simulation framework`
  New crate at `sv2/channels-sv2/sim/`. Per-tick Poisson share
  sampling, five behavioral metrics with percentile distributions,
  50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
  binary for baseline generation, regression test asserting against
  a committed baseline.
- `data(vardiff_sim): design doc + baseline characterization`
  The design proposal documenting metric definitions and tolerance
  policy, plus the measured baseline as both TOML (consumed by the
  regression test) and Markdown (for human review).
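
To illustrate what deterministic per-tick Poisson share sampling might look like, here is a dependency-free sketch. It is not the sim crate's code: the SplitMix64 generator, Knuth's multiplication sampler, the function names, and the tick length parameter are all choices made for this example.

```rust
/// Small deterministic PRNG (SplitMix64) so the example needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1) using the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's multiplication method; adequate for the moderate lambdas used here.
fn poisson(rng: &mut SplitMix64, lambda: f64) -> u32 {
    let limit = (-lambda).exp();
    let mut k = 0u32;
    let mut p = 1.0f64;
    loop {
        p *= rng.next_f64();
        if p <= limit {
            return k;
        }
        k += 1;
    }
}

/// Share counts per tick for a miner submitting `shares_per_min` on average.
fn simulate_shares(seed: u64, shares_per_min: f64, tick_secs: f64, ticks: usize) -> Vec<u32> {
    let lambda = shares_per_min * tick_secs / 60.0; // expected shares per tick
    let mut rng = SplitMix64(seed);
    (0..ticks).map(|_| poisson(&mut rng, lambda)).collect()
}
```

Calling `simulate_shares(42, 6.0, 60.0, 1000)` twice yields identical vectors, which is the property that lets a committed baseline be reproduced at the same seed.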
What the framework measures
Five behavioral attributes, each as a distribution across 1000
independent trials per cell:
Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
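
One way to picture the tolerance policy is a three-way verdict per metric. The sketch below is an assumption-heavy illustration: the text only states that 10-25% deltas are reported without being asserted, so treating deltas under 10% as a silent pass and deltas over 25% as a failure is inferred, and the `abs_floor` guard for near-zero baselines is a hypothetical addition. The real per-metric tolerances live in the design doc.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,   // small delta: nothing to flag
    Report, // mid-range delta: shown to the reviewer, not asserted
    Fail,   // large delta: the regression test would fail
}

/// Classify a metric by relative change against the baseline.
/// ASSUMPTION: 10% and 25% cut-offs taken from the stated mid-range.
/// `abs_floor` (hypothetical) keeps the ratio finite for baselines near zero.
fn classify(baseline: f64, current: f64, abs_floor: f64) -> Verdict {
    let denom = baseline.abs().max(abs_floor);
    let delta = (current - baseline).abs() / denom;
    if delta < 0.10 {
        Verdict::Pass
    } else if delta <= 0.25 {
        Verdict::Report
    } else {
        Verdict::Fail
    }
}
```

For example, a p90 jitter moving from 0.200 to 0.230 fires per minute is a 15% change and would be reported for human judgment rather than failing the test.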
How to run
From `sv2/channels-sv2/sim/`:
What this enables
For any future vardiff proposal:
- `Vardiff` impl
- `cargo run --release --bin generate-baseline` to produce comparable measurements
No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."
Where to look in this PR
- rationale): `sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md`
- workflow): `sv2/channels-sv2/sim/README.md`
- `sv2/channels-sv2/sim/vardiff_baseline.md`
What this PR is NOT
- `VardiffState` behavior is unchanged. The only public-API additions are
  `Vardiff::add_shares` (with a default impl) and the `Clock` trait.
  Production code defaults to `SystemClock` and behaves identically to before.
- data suggests 12-30 spm is the operational sweet spot, but this
  PR doesn't touch any defaults.
- a GitHub Action to be a true CI gate. Follow-up.
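
The two API additions follow a common testability pattern, sketched below under stated assumptions. None of these signatures are taken from `channels_sv2`: `now_secs`, `MockClock`, `Window`, `ShareCounterSketch`, and `increment_shares` are hypothetical names chosen to show clock injection and a default trait method that keeps existing implementations compiling.

```rust
use std::cell::Cell;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical clock abstraction; the real trait's method set may differ.
trait Clock {
    fn now_secs(&self) -> u64;
}

/// Production default: wall-clock time.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Simulation clock advanced manually, so hours of simulated time cost nothing.
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn new(start: u64) -> Self {
        Self { now: Cell::new(start) }
    }
    fn advance(&self, secs: u64) {
        self.now.set(self.now.get() + secs);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.get()
    }
}

/// Anything that measures elapsed time takes the clock as a dependency.
struct Window<'a> {
    clock: &'a dyn Clock,
    started_at: u64,
}

impl<'a> Window<'a> {
    fn new(clock: &'a dyn Clock) -> Self {
        Self { clock, started_at: clock.now_secs() }
    }
    fn elapsed_secs(&self) -> u64 {
        self.clock.now_secs().saturating_sub(self.started_at)
    }
}

/// Adding a method WITH a default body does not break existing impls.
trait ShareCounterSketch {
    fn increment_shares(&mut self);
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares();
        }
    }
}

/// An "existing" implementation that only knows about single increments.
struct Counter(u32);

impl ShareCounterSketch for Counter {
    fn increment_shares(&mut self) {
        self.0 += 1;
    }
}
```

An implementation that cares about simulation speed can override `add_shares` with a single addition, while untouched implementations inherit the loop.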
Open follow-ups
- `cargo test --release --lib -- --ignored` into CI on PRs
  touching `vardiff/*` or the sim crate.
- `channels_sv2` 5.0.0 → 5.1.0 once the workspace lockfile
  situation allows (the trait-method addition is technically a
  minor-version semver change). TODO comment in `Cargo.toml` tracks
  this.
- surfaces the problem; fixing it is a separate proposal that this
  framework will be the right tool to evaluate.
Test plan
- `cargo test -p channels_sv2 --lib vardiff`: 17 tests, all pass
- `cargo test` from `sv2/channels-sv2/sim/`: 53 fast unit tests
- `cargo test --release --lib -- --ignored` from sim/: slow
  regression test passes against committed baseline
- `cargo run --release --bin generate-baseline`: reproduces the
  committed `vardiff_baseline.toml` byte-for-byte at the same seed