[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests by gimballock · Pull Request #2154 · stratum-mining/stratum

gimballock · 2026-05-13T21:06:17Z

Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.

The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.

The finding that motivates this

Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:

Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):

share/min	sensitivity
6	0.70
12	0.55
30	0.33
60	0.16
120	0.09

Higher share rates make the algorithm less responsive, counter to
expectations. The cause is structural: after convergence, delta_time
grows without bound, so the post-step signal is diluted in the
cumulative window. The data surfaces this clearly, even though months
of deployment observation had only half-articulated it. Full numbers
and analysis are in vardiff_baseline.md.
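
The dilution effect can be illustrated with a little arithmetic. This is a minimal sketch, assuming the algorithm compares the average share rate over the whole window since the last fire against the target and fires once the relative deviation reaches a fixed threshold. The function name and the 15% threshold are illustrative assumptions, and Poisson noise is ignored.

```rust
/// Minutes after a rate step until the cumulative window's deviation
/// from the pre-step rate reaches `threshold` (noise ignored).
///
/// `window_age_min`: minutes accumulated at the old rate before the step.
/// `step`: fractional rate change, e.g. -0.5 for a 50% drop.
/// `threshold`: relative deviation that triggers a fire (assumed fixed).
fn minutes_to_detect(window_age_min: f64, step: f64, threshold: f64) -> Option<f64> {
    // After t minutes at the new rate the window average deviates by
    // |step| * t / (window_age_min + t), which approaches |step| as t grows.
    let magnitude = step.abs();
    if magnitude <= threshold {
        return None; // the deviation can never reach the threshold
    }
    // Solve |step| * t / (age + t) = threshold for t.
    Some(threshold * window_age_min / (magnitude - threshold))
}

fn main() {
    for age in [5.0, 15.0, 60.0, 240.0] {
        match minutes_to_detect(age, -0.5, 0.15) {
            Some(t) => println!("window age {age} min: detectable after {t:.1} min"),
            None => println!("window age {age} min: never detectable"),
        }
    }
}
```

Under these assumptions the detection delay grows linearly with the age of the window, which is the dilution described above.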

This isn't a fix. It's the measurement that lets the fix be evaluated.

What's in the PR

3 commits, ~2000 LOC plus baseline data:

- feat(vardiff): inject Clock trait + add_shares trait method
  Minimum API additions to channels_sv2 for testability and
  simulation performance. Production behavior is unchanged: existing
  constructors default to SystemClock, and the new trait method has a
  default implementation that keeps existing impls compiling.
- feat(vardiff_sim): in-process simulation framework
  New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
  sampling, five behavioral metrics with percentile distributions,
  50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
  binary for baseline generation, regression test asserting against
  a committed baseline.
- data(vardiff_sim): design doc + baseline characterization
  The design proposal documenting metric definitions and tolerance
  policy, plus the measured baseline as both TOML (consumed by the
  regression test) and Markdown (for human review).
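
To illustrate the first commit, here is a minimal sketch of clock injection plus a defaulted bulk-add method. Only the names Clock, SystemClock, Vardiff, add_shares and increment_shares_since_last_update come from this thread; the signatures, `MockClock`, `Counter` and `now_secs` are hypothetical and may not match channels_sv2.

```rust
use std::cell::Cell;
use std::time::{SystemTime, UNIX_EPOCH};

/// Source of "now" as seconds since the Unix epoch (assumed signature).
trait Clock {
    fn now_secs(&self) -> u64;
}

/// Production clock: reads the wall clock.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Simulation clock (hypothetical): only moves when the test advances it.
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn new(start: u64) -> Self {
        Self { now: Cell::new(start) }
    }
    fn advance(&self, secs: u64) {
        self.now.set(self.now.get() + secs);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.get()
    }
}

trait Vardiff {
    fn increment_shares_since_last_update(&mut self);

    /// Bulk variant with a default body, so existing impls keep compiling.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares_since_last_update();
        }
    }
}

/// Toy implementation (hypothetical) that only counts shares in a window.
struct Counter<'a> {
    clock: &'a dyn Clock,
    window_start: u64,
    shares: u32,
}

impl<'a> Counter<'a> {
    fn new(clock: &'a dyn Clock) -> Self {
        Self { clock, window_start: clock.now_secs(), shares: 0 }
    }

    fn shares_per_minute(&self) -> f64 {
        let elapsed = self.clock.now_secs().saturating_sub(self.window_start);
        if elapsed == 0 {
            return 0.0;
        }
        f64::from(self.shares) * 60.0 / elapsed as f64
    }
}

impl Vardiff for Counter<'_> {
    fn increment_shares_since_last_update(&mut self) {
        self.shares += 1;
    }
}

fn main() {
    let clock = MockClock::new(0);
    let mut counter = Counter::new(&clock);
    counter.add_shares(30); // one call instead of 30 per-share calls
    clock.advance(300); // five simulated minutes pass instantly
    println!("simulated rate: {} shares/min", counter.shares_per_minute());
    println!("wall clock: {} s since epoch", SystemClock.now_secs());
}
```

Because the bulk method has a default body, adding it does not break existing implementors, which is what keeps the change additive.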

What the framework measures

Five behavioral attributes, each as a distribution across 1000
independent trials per cell:

Metric	Better is	What it tells you
Convergence time	Smaller	How fast the algorithm settles after cold start
Settled accuracy	Smaller	How close to truth the algorithm lands
Steady-state jitter	Smaller	How often it fires on noise post-settle
Reaction time	Smaller	How fast it responds to genuine load changes
Reaction sensitivity	≈ 1 for real Δ, ≈ 0 for noise	Whether it distinguishes signal from noise

Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.

How to run

From sv2/channels-sv2/sim/:

# Fast unit tests (~1 second)
cargo test

# Generate a fresh baseline (~5-15 seconds)
cargo run --release --bin generate-baseline

# Run the slow regression test (~5-15 seconds; #[ignore]-d by default)
cargo test --release --lib -- --ignored

What this enables

For any future vardiff proposal:

- Implement the new algorithm as a Vardiff impl
- Run cargo run --release --bin generate-baseline to produce comparable
  measurements
- Diff against the committed baseline
- Make the case with numbers

No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."

Where to look in this PR

- Design proposal (architecture, metric definitions, tolerance
  rationale): sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md
- Crate README (usage, output interpretation, baseline-update
  workflow): sv2/channels-sv2/sim/README.md
- The current algorithm's measured baseline:
  sv2/channels-sv2/sim/vardiff_baseline.md

What this PR is NOT

- Not an algorithm change. VardiffState behavior is unchanged.
  The only public-API additions are Vardiff::add_shares (with a
  default impl) and the Clock trait. Production code defaults to
  SystemClock and behaves identically to before.
- Not a recommendation about share rate defaults. The baseline
  data suggests 12-30 spm is the operational sweet spot, but this
  PR doesn't touch any defaults.
- Not a CI workflow. The regression test works locally but needs
  a GitHub Action to be a true CI gate. Follow-up.

Open follow-ups

- Wire cargo test --release --lib -- --ignored into CI on PRs
  touching vardiff/* or the sim crate.
- Bump channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
  situation allows (the trait-method addition is technically a
  minor-version semver change). TODO comment in Cargo.toml tracks
  this.
- Investigate the reactivity-degrades-with-rate finding. The framework
  surfaces the problem; fixing it is a separate proposal that this
  framework will be the right tool to evaluate.

Test plan

- cargo test -p channels_sv2 --lib vardiff: 17 tests, all pass
- cargo test from sv2/channels-sv2/sim/: 53 fast unit tests
- cargo test --release --lib -- --ignored from sim/: slow
  regression test passes against committed baseline
- cargo run --release --bin generate-baseline: reproduces the
  committed vardiff_baseline.toml byte-for-byte at the same seed

gimballock · 2026-05-13T22:23:53Z

The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:

- We can play dice with share-received events to simulate running the vardiff algorithms over arbitrary ranges of time. To do that we need to mock SystemTime::now() and add a way to bulk-add new shares.
- With fake-time simulations we can run large-scale vardiff trials of whatever metrics we want and contrast them against correlated attributes like target shares-per-minute.
  - I was interested in convergence time, steady-state jitter, and convergence accuracy.
  - Responsiveness to external change is also a key capability (how fast vardiff adjusts to a 50% spike or dip in hashrate).
- With these reproducible test results compiled into a profile, we can use integration tests to lock in established performance thresholds and ratchet up the expectations if we find better algorithms.
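
As a rough illustration of "playing dice" with share arrivals, the sketch below draws a Poisson-distributed share count for each simulated tick from a seeded generator. It is not the sim crate's code: `SplitMix64`, `sample_poisson` and `simulate_ticks` are hypothetical names, and Knuth's multiplication method is simply one standard way to sample a Poisson count for small means.

```rust
/// SplitMix64: a tiny deterministic PRNG, so a seed reproduces a trial exactly.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's multiplication method: number of events in one tick with mean
/// `lambda`. Suitable for small means; exp(-lambda) underflows for huge ones.
fn sample_poisson(rng: &mut SplitMix64, lambda: f64) -> u32 {
    let limit = (-lambda).exp();
    let mut product = rng.next_f64();
    let mut count = 0;
    while product > limit {
        count += 1;
        product *= rng.next_f64();
    }
    count
}

/// One simulated trial: the share count for each of `ticks` ticks.
fn simulate_ticks(seed: u64, shares_per_tick: f64, ticks: usize) -> Vec<u32> {
    let mut rng = SplitMix64(seed);
    (0..ticks)
        .map(|_| sample_poisson(&mut rng, shares_per_tick))
        .collect()
}

fn main() {
    // 12 shares/min with 60-second ticks: a mean of 12 shares per tick.
    let counts = simulate_ticks(7, 12.0, 10);
    println!("shares per tick: {counts:?}");
}
```

Each count can then be handed to the algorithm in one bulk call while a fake clock advances by one tick, so hours of simulated time cost only a few thousand iterations.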

gimballock · 2026-05-13T23:26:06Z

+| share/min | rate | p10 | p50 | p90 | p99 |
+| --- | --- | --- | --- | --- | --- |
+| 6 | 83.3% | 10m | 12m | 21m | 25m |
+| 12 | 95.4% | 10m | 10m | 20m | 25m |
+| 30 | 99.5% | 10m | 10m | 15m | 25m |
+| 60 | 100.0% | 10m | 10m | 10m | 20m |
+| 120 | 100.0% | 10m | 10m | 10m | 15m |


The first row shows the convergence-time results for the default case (6 spm).
Convergence times range from 10 to 25 minutes, and trials fail to converge at all (within quiet_window_secs of simulated time) 17% of the time!

The next rows show the results for faster share rates. The most extreme times shrink (p99 falls from 25m to 15m) and the total-failure cases all but disappear around 30 spm.

gimballock · 2026-05-14T13:35:13Z

+## Settled accuracy (stable load, post-convergence)
+
+`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.
+
+| share/min | p10 | p50 | p90 | p99 |
+| --- | --- | --- | --- | --- |
+| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
+| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
+| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
+| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
+| 120 | 0.0% | 0.0% | 0.0% | 0.0% |
+
+## Steady-state jitter (fires per minute)
+
+Post-convergence rate of vardiff fires. Smaller is better — ideal is zero under stable load.
+
+| share/min | p50 | p90 | p99 | mean |
+| --- | --- | --- | --- | --- |
+| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
+| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
+| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
+| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
+| 120 | 0.000 | 0.000 | 0.000 | 0.000 |


These two metrics (proximity to true hashrate and post-convergence adjustments) show a similar trend: a lot of undesired behavior in the extreme cases (top 10%, top 1%) at 6 shares/min, which is alleviated at higher share rates.

gimballock · 2026-05-14T13:55:16Z

+## Reaction time to a 50% drop (step at 15 min)
+
+| share/min | reacted | p10 | p50 | p90 | p99 |
+| --- | --- | --- | --- | --- | --- |
+| 6 | 69.7% | 1m | 3m | 5m | 5m |
+| 12 | 54.8% | 1m | 3m | 5m | 5m |
+| 30 | 32.6% | 2m | 4m | 5m | 5m |
+| 60 | 16.3% | 3m | 5m | 5m | 5m |
+| 120 | 8.6% | 4m | 5m | 5m | 5m |
+
+## Reaction sensitivity (P[fire within 5 min of step change])
+
+| Δ% | 6 | 12 | 30 | 60 | 120 |
+| --- | --- | --- | --- | --- | --- |
+| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
+| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
+| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
+| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
+| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
+| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
+| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
+| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |


These tables show how long it takes for vardiff to respond to an unexpected change in hashrate, where the change is a proportional increase or decrease anywhere from 5% to 50%.

The first table looks specifically at a 50% drop. At 6 spm, vardiff fails to adjust within 5 min a full 30% of the time. The following rows show that the situation worsens at higher share rates: at 120 spm, 91% of the trials failed to adjust within 5 min.

The second table shows that the effect is broadly similar for hashrate changes in the opposite direction, and that changes of lesser magnitude trigger a response less often within the 5-minute window.

gimballock · 2026-05-14T14:24:46Z

+//! - **Convergence rate**: `current >= baseline - 0.01`
+//! - **Convergence p90**: `current <= baseline * 1.10`
+//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
+//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
+//! - **Jitter p95**: `current <= baseline * 1.25`
+//! - **Reaction rate**: `current >= baseline - 0.02`
+//! - **Reaction p50**: `current <= baseline * 1.20`
+//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
+//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`


Convergence rate: the fraction of trials that converge may fall at most 1 percentage point below the baseline
Convergence p90: the p90 convergence time (the slowest 10% of trials) may be at most 10% above the baseline
Settled accuracy: the p50 and p90 errors may be at most 15% above the baseline
Jitter p50/p95: p50 may exceed the baseline by at most 0.02 fires/min (absolute); p95 may be at most 25% above the baseline
...etc.

You see the pattern: there are lots of magic thresholds in this portion of the code. They are arbitrarily chosen at this point and fair game for analysis.
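
The three shapes of tolerance in the list above (absolute floor, absolute ceiling, multiplicative ceiling) can be sketched as follows. The `Tolerance` enum and `check` helper are hypothetical names, not the regression test's actual code, and the "current" values in `main` are made up for illustration.

```rust
/// How far `current` may drift from `baseline` before a check fails.
#[derive(Clone, Copy)]
enum Tolerance {
    /// Higher is better: current >= baseline - slack.
    AtLeastMinus(f64),
    /// Lower is better: current <= baseline + slack.
    AtMostPlus(f64),
    /// Lower is better: current <= baseline * factor.
    AtMostTimes(f64),
}

impl Tolerance {
    fn passes(self, baseline: f64, current: f64) -> bool {
        match self {
            Tolerance::AtLeastMinus(slack) => current >= baseline - slack,
            Tolerance::AtMostPlus(slack) => current <= baseline + slack,
            Tolerance::AtMostTimes(factor) => current <= baseline * factor,
        }
    }
}

/// Returns an error naming the cell and metric when the tolerance is exceeded.
fn check(
    cell: &str,
    metric: &str,
    tol: Tolerance,
    baseline: f64,
    current: f64,
) -> Result<(), String> {
    if tol.passes(baseline, current) {
        Ok(())
    } else {
        Err(format!("{cell} / {metric}: baseline {baseline}, current {current}"))
    }
}

fn main() {
    // Baselines come from the 6 spm rows above; "current" values are invented.
    let checks = [
        ("6 spm", "convergence rate", Tolerance::AtLeastMinus(0.01), 0.833, 0.80),
        ("6 spm", "convergence p90 (min)", Tolerance::AtMostTimes(1.10), 21.0, 23.0),
        ("6 spm", "jitter p50", Tolerance::AtMostPlus(0.02), 0.0, 0.01),
    ];
    for (cell, metric, tol, baseline, current) in checks {
        match check(cell, metric, tol, baseline, current) {
            Ok(()) => println!("ok   {cell} / {metric}"),
            Err(msg) => println!("FAIL {msg}"),
        }
    }
}
```

A multiplicative ceiling gives no slack at all when the baseline is zero, which is why an absolute tolerance suits metrics such as jitter p50.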

adammwest · 2026-05-18T11:47:37Z

after some optimization I got
https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md
with the Bayesian model

Method	Result
Bayesian model	Good, Best result
Kalman filters	Good
Jurik moving averages	Good but artifacts on hashrate changes
Thompson sampling	Bad

gimballock · 2026-05-18T12:54:53Z

I'm so excited to see other people nerding out on vardiff with me! Thank you!

A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.

adammwest · 2026-05-20T15:15:09Z

Some learnings I had, @gimballock:

- This task is hard; I think the current implementation is already optimized to a degree.
- The most critical thing is the fitness function. There are currently many metrics, and how you combine them into a final value determines how good any algorithm looks. Because it's a summary, there are ways to game it.
- Use every value in the TOML file; any values you leave out will naturally degrade.
- There are many ways to combine many numbers, which lead to slightly better or slightly worse performance.
- One thing that helped was separating fitness into two categories, improvement and regression. Even better is separating these per group (e.g. stable, coldstart, ...) so you can decompose the value.
- Normalize each metric/group; otherwise the more numerous or larger numbers will dominate.
- There are many cases where you can get a good score, but the fitness prefers optimizing one variable or a set of variables at the expense of others.
- Grid parameter sweeps usually discretize the domain, so improvement is bounded only by the range of each dimension and the number of queries. You need to constantly increase queries or shrink ranges, and you are usually limited by time. For this reason I prefer random-restart hill climbing, which I find is generally pretty good when you don't make assumptions about the data.
- If you have too many parameters to optimize you can overfit and end up just gaming the test.
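
One way to read the decomposition and normalization advice above is sketched below. It is an illustration under assumptions, not the scoring used by either branch: `Sample`, `GroupScore` and `score_by_group` are hypothetical, and the per-metric `scale` is whatever normalization unit one chooses for that metric.

```rust
use std::collections::BTreeMap;

/// One baseline-vs-candidate comparison for a single metric.
struct Sample {
    group: &'static str, // e.g. "stable", "coldstart"
    baseline: f64,
    current: f64,
    scale: f64, // normalization unit chosen per metric
    lower_is_better: bool,
}

/// Improvement and regression are kept apart so one cannot hide the other.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct GroupScore {
    improvement: f64,
    regression: f64,
}

fn score_by_group(samples: &[Sample]) -> BTreeMap<&'static str, GroupScore> {
    let mut sums: BTreeMap<&'static str, (GroupScore, usize)> = BTreeMap::new();
    for s in samples {
        // Normalized change, so metrics with large raw values don't dominate.
        let mut delta = (s.current - s.baseline) / s.scale;
        if s.lower_is_better {
            delta = -delta; // after this, positive always means "better"
        }
        let (score, count) = sums.entry(s.group).or_default();
        if delta >= 0.0 {
            score.improvement += delta;
        } else {
            score.regression -= delta;
        }
        *count += 1;
    }
    // Average within each group, so groups with more metrics don't dominate.
    sums.into_iter()
        .map(|(group, (score, count))| {
            let n = count as f64;
            let averaged = GroupScore {
                improvement: score.improvement / n,
                regression: score.regression / n,
            };
            (group, averaged)
        })
        .collect()
}

fn main() {
    // Invented numbers: faster convergence bought with a lower reaction rate.
    let samples = [
        Sample { group: "coldstart", baseline: 20.0, current: 10.0, scale: 20.0, lower_is_better: true },
        Sample { group: "reaction", baseline: 0.5, current: 0.25, scale: 0.5, lower_is_better: false },
    ];
    for (group, score) in score_by_group(&samples) {
        println!("{group}: {score:?}");
    }
}
```

A single summed score would report this candidate as neutral, while the decomposed view shows one group improving and another regressing.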

gimballock · 2026-05-21T17:46:37Z

Thanks for these insights @adammwest — especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of
how the approach has matured:

Phase 1: Basic metrics + simulation harness

Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff
algorithm. This gave us reproducible, large-scale trials (50 cells × 1000 trials) against correlated attributes like target shares-per-minute.

Phase 2: Decomposed pipeline model

I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
implementations at each slot for the best composite.

This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down — the components aren't truly independent. There's a sequential
data flow: the estimator needs to communicate its belief to the boundary ("should we respond?") and to the update rule. Additionally, since vardiff triggers on a timer rather than on share arrival, the decision rule needs
to call back to the estimator to update state when adjustments occur. I also dropped the "statistic" component as it wasn't pulling its weight.

The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.
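
For readers who have not opened the diff, a minimal sketch of what a three-stage composition could look like follows. The stage names Estimator, Boundary, UpdateRule and CumulativeCounter appear in this PR, but every signature here is an assumption, and `RelativeThreshold`, `FullRetarget` and `Pipeline` are hypothetical stand-ins.

```rust
/// Stage 1: turns share arrivals into a rate estimate.
trait Estimator {
    fn observe(&mut self, shares: u64);
    /// Realized shares per minute over `elapsed_secs`.
    fn snapshot(&self, elapsed_secs: u64) -> f64;
    fn reset(&mut self);
}

/// Stage 2: decides whether the estimate is far enough from target to act.
trait Boundary {
    fn should_fire(&mut self, realized_spm: f64, target_spm: f64) -> bool;
}

/// Stage 3: computes the new hashrate estimate once the boundary fires.
trait UpdateRule {
    fn next_hashrate(&self, hashrate: f64, realized_spm: f64, target_spm: f64) -> f64;
}

struct CumulativeCounter {
    shares: u64,
}

impl Estimator for CumulativeCounter {
    fn observe(&mut self, shares: u64) {
        self.shares += shares;
    }
    fn snapshot(&self, elapsed_secs: u64) -> f64 {
        if elapsed_secs == 0 {
            return 0.0;
        }
        self.shares as f64 * 60.0 / elapsed_secs as f64
    }
    fn reset(&mut self) {
        self.shares = 0;
    }
}

/// Fires when the relative deviation from target reaches a fixed fraction.
struct RelativeThreshold(f64);

impl Boundary for RelativeThreshold {
    fn should_fire(&mut self, realized_spm: f64, target_spm: f64) -> bool {
        ((realized_spm - target_spm) / target_spm).abs() >= self.0
    }
}

/// Scales the hashrate estimate by the realized/target share-rate ratio.
struct FullRetarget;

impl UpdateRule for FullRetarget {
    fn next_hashrate(&self, hashrate: f64, realized_spm: f64, target_spm: f64) -> f64 {
        hashrate * realized_spm / target_spm
    }
}

struct Pipeline<E, B, U> {
    estimator: E,
    boundary: B,
    update: U,
    target_spm: f64,
    hashrate: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Pipeline<E, B, U> {
    /// Called on the vardiff timer, not per share.
    fn tick(&mut self, elapsed_secs: u64) -> Option<f64> {
        let realized = self.estimator.snapshot(elapsed_secs);
        if !self.boundary.should_fire(realized, self.target_spm) {
            return None;
        }
        self.hashrate = self
            .update
            .next_hashrate(self.hashrate, realized, self.target_spm);
        self.estimator.reset(); // the decision feeds back into the estimator
        Some(self.hashrate)
    }
}

fn main() {
    let mut pipeline = Pipeline {
        estimator: CumulativeCounter { shares: 0 },
        boundary: RelativeThreshold(0.15),
        update: FullRetarget,
        target_spm: 6.0,
        hashrate: 100.0,
    };
    pipeline.estimator.observe(15); // 15 shares in 5 minutes = 3 spm
    println!("after tick: {:?}", pipeline.tick(300));
}
```

The `reset` call inside `tick` is the feedback path mentioned above: the stages are replaceable, but data still flows through them in sequence.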

Phase 3: Aggregate fitness metric

To your point about "how you combine all metrics into a final value" — we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
raised: rather than optimizing one metric at the expense of others, we can define a weighted composite that represents our desired tradeoff. The regression baseline locks in the full vector of metrics so we catch
regressions in any dimension, not just the aggregate.

Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression
can't hide behind a stable-state improvement, but making this more explicit in the scoring would help.

Phase 4: Realistic operating conditions

After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and
network slowdowns on an established channel is valuable functionality — even though in practice many hashrate changes currently cause miner reconnections (which resets vardiff anyway). We've been doing live testing with
physical miners on testnet4 and confirmed this pattern: when vardiff ramps difficulty too aggressively, it can interact with firmware timeout behaviors in ways that force reconnections, making reactivity testing harder
than expected.

Current direction

I've backed off from prioritizing convergence speed after seeing overcorrection in practice. The current focus is on:

- Stability under steady-state (minimal oscillation once converged)
- Reasonable reactivity (detect genuine changes within 2–3 retarget windows, not 1)
- Asymmetric cost awareness: difficulty increases are more disruptive than decreases. An overshoot upward causes difficulty-too-low share rejections (wasted miner work), while an undershoot downward just means slightly more shares than optimal (cheap). The AsymmetricCusumBoundary encodes this: it requires stronger evidence before raising difficulty than lowering it. We can now actually measure the impact via the share-rejection metrics (shares_rejected_total{reason="difficulty-too-low"}) that were recently added to the pool's monitoring (sv2-apps PR Docs: Channel Factory #491).
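
To make "stronger evidence before raising difficulty" concrete, here is a minimal two-sided CUSUM sketch with different decision limits per direction. It is a textbook tabular CUSUM under assumed parameters, not the AsymmetricCusumBoundary from this PR; the struct, its fields and the numeric limits are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Hold,
    RaiseDifficulty,
    LowerDifficulty,
}

/// Two-sided CUSUM over standardized per-tick share counts.
struct AsymmetricCusum {
    slack: f64,       // drift allowance per tick, in standard deviations
    raise_limit: f64, // evidence needed to raise difficulty (larger)
    lower_limit: f64, // evidence needed to lower difficulty (smaller)
    fast: f64,        // accumulated evidence that shares arrive too fast
    slow: f64,        // accumulated evidence that shares arrive too slowly
}

impl AsymmetricCusum {
    fn new(slack: f64, raise_limit: f64, lower_limit: f64) -> Self {
        Self { slack, raise_limit, lower_limit, fast: 0.0, slow: 0.0 }
    }

    /// Feed one tick's share count; `expected` is the target count per tick.
    fn observe(&mut self, shares: u32, expected: f64) -> Decision {
        // Standardize with the Poisson standard deviation sqrt(expected).
        let z = (f64::from(shares) - expected) / expected.sqrt();
        self.fast = (self.fast + z - self.slack).max(0.0);
        self.slow = (self.slow - z - self.slack).max(0.0);
        let decision = if self.fast >= self.raise_limit {
            Decision::RaiseDifficulty
        } else if self.slow >= self.lower_limit {
            Decision::LowerDifficulty
        } else {
            Decision::Hold
        };
        if decision != Decision::Hold {
            self.fast = 0.0;
            self.slow = 0.0;
        }
        decision
    }
}

/// Feeds the same count every tick and reports when the boundary fires.
fn ticks_until_fire(
    cusum: &mut AsymmetricCusum,
    shares: u32,
    expected: f64,
    max_ticks: u32,
) -> Option<(u32, Decision)> {
    (1..=max_ticks).find_map(|tick| match cusum.observe(shares, expected) {
        Decision::Hold => None,
        decision => Some((tick, decision)),
    })
}

fn main() {
    // Target of 9 shares per tick; equally sized deviations in each direction.
    let mut slow = AsymmetricCusum::new(0.5, 6.0, 3.0);
    println!("too few shares:  {:?}", ticks_until_fire(&mut slow, 3, 9.0, 100));
    let mut fast = AsymmetricCusum::new(0.5, 6.0, 3.0);
    println!("too many shares: {:?}", ticks_until_fire(&mut fast, 15, 9.0, 100));
}
```

With equal-sized deviations, the lowering decision is reached in fewer ticks than the raising decision, which is the asymmetry described above.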

On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better normalization approaches.

gimballock · 2026-05-30T18:07:18Z

I've been working on a simple proxy to validate the vardiff responsiveness findings from the simulations. My first test aimed to confirm the assessment that the existing vardiff algorithm is slow to respond to hashrate changes after it has already converged.

Test Setup:

I configured an S21 miner → tproxy (vardiff disabled) → shape-proxy (controllable share rate) → SRI pool. The shape-proxy maintains a smoothed share rate based on current pool difficulty and can selectively drop shares to simulate hashrate changes on command.

Methodology:

I ran two parallel instances of the unmodified SRI pool server (main branch) and triggered a 50% share rate drop via API command. I then measured how long it took for the pool's hashrate estimate and share acceptance rate to adjust to the new rate.

Results:

The pool required approximately 40 minutes to complete the initial hashrate adjustment, whereas the simulation predicted that 70% of miners would fail to adjust at all and that those that did would take 5m. So although it may look like a 20× discrepancy between simulation and observed behavior, it's more of a confirmation that the adjustment takes a very long time.

I will investigate if there is any discrepancy and re-calibrate the simulation if so. The attached image shows the resulting hashrate and accepted share rate for both trials.

The blue lines are the pool hashrate. The initial spike right after startup is the convergence spike; the hashrate then settles and stays flat for not quite an hour before I activate the 50% hashrate drop, which is immediately visible as a drop in the share rate. The share rate stays flat until vardiff eventually fires the adjustment, allowing the share rate to return to normal.

gimballock · 2026-05-30T18:19:04Z

Here you can see the previous 12h of consistent hashrate w/ zero rejected shares as evidence that the 'smoothing' of the hashrate via my proxy is stable enough to test against.

paratoxicdev · 2026-06-01T23:25:40Z

Hi, this is quite a detailed analysis and great decomposition of relevant parts.

Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool

I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs

Would be interesting to see how that algorithm performs in your benchmarks.

gimballock · 2026-06-02T15:37:38Z

Thanks for the info @paratoxicdev, I will add this to my investigation. I know that is an SV1-native pool, but I will see which bits of his research cross over to the SV2 context and whether it's competitive!

gimballock · 2026-06-02T15:45:04Z

Here is a breakdown of the calibration comparisons I made between the real hashrate tests and the simulation results. They confirm the predictions, with the understanding that with this algorithm responsiveness depends on the age of the connection. I suggest including this detail in the simulation so the metrics are more directly comparable:

Vardiff Calibration Summary: Simulation vs Real Miner Testing

  Algorithm under test: Classic Parametric vardiff (VardiffState from stratum-core, branch vardiff/parametric-thresholds)

  Hardware: Antminer S21 (~200 TH/s), testnet4
  Pool config: shares_per_minute = 6 and shares_per_minute = 20
  Tool: shape-proxy with Step{1.0→0.5, at_secs:300} and Track{1.0} reset

  ---
  Structural findings confirmed
  
  ┌────────────────────────────┬───────────────────────────────────────────────────────┬────────────────────────────────────────────────────────┬───────┐
  │          Property          │                    Sim prediction                     │                    Real observation                    │ Match │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Steady-state jitter        │ 0.000 fires/min                                       │ Zero fires during 30+ min stable operation (both       │ Yes   │
  │                            │                                                       │ rates)                                                 │       │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Per-fire step magnitude    │ Deterministic: realized_spm / target_spm ratio        │ -16.7% consistently (5 consecutive fires)              │ Yes   │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Fire cadence after counter │ Exactly 300s (15% threshold at delta_time≥300)        │ 22:39→22:44→22:49→22:54→22:59 (5-min cadence)          │ Yes   │
  │  reset                     │                                                       │                                                        │       │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Overshoot after staircase  │ p99 = 69% at 6 spm                                    │ ~60% overshoot → share flood → oscillation             │ Yes   │
  │ descent                    │                                                       │                                                        │       │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Sensitivity depends on     │ Implicit in algorithm (accumulating counter never     │ 5-min counter: 4.4 min reaction; 51-min counter: 51.8  │ Yes   │
  │ counter age                │ resets except on fire)                                │ min reaction                                           │       │
  ├────────────────────────────┼───────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼───────┤
  │ Algorithm symmetric for    │ Sim shows similar reaction rates both directions      │ Confirmed: similar timescales for step-down and        │ Yes   │
  │ ±50%                       │                                                       │ step-up                                                │       │
  └────────────────────────────┴───────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┴───────┘

  ---
  Reaction time: -50% step (hashrate halved)
  
  ┌────────────────────────────────────┬──────────────────────────────────────────────────────────┬──────────────────────────────────┐
  │             Condition              │                      Sim prediction                      │           Real result            │
  ├────────────────────────────────────┼──────────────────────────────────────────────────────────┼──────────────────────────────────┤
  │ 6 spm, fresh counter (~5 min)      │ 3.7% fire within 5 min; those that fire: p50=5m          │ Slot 3: 4.4 min, Slot 4: 8.0 min │
  ├────────────────────────────────────┼──────────────────────────────────────────────────────────┼──────────────────────────────────┤
  │ 6 spm, settled counter (~51 min)   │ ~96% don't fire within 5 min (tail extends indefinitely) │ Slot 3: 51.8 min                 │
  ├────────────────────────────────────┼──────────────────────────────────────────────────────────┼──────────────────────────────────┤
  │ 20 spm, moderate counter (~27 min) │ 14.2% fire within 5 min                                  │ Slot 3: 6.9 min, Slot 4: 8.4 min │
  └────────────────────────────────────┴──────────────────────────────────────────────────────────┴──────────────────────────────────┘

  ---
  Reaction time: +50% step (return to full rate)
  
  ┌───────────────────────────┬──────────────────────────────────┬──────────────────────────────────┐
  │         Condition         │          Sim prediction          │           Real result            │
  ├───────────────────────────┼──────────────────────────────────┼──────────────────────────────────┤
  │ 6 spm, 51-min counter     │ Deep in tail (>96% non-reactive) │ Slot 3: 51.8 min                 │
  ├───────────────────────────┼──────────────────────────────────┼──────────────────────────────────┤
  │ 6 spm, 68-min counter     │ Deep in tail                     │ Slot 4: 15.4 min                 │
  ├───────────────────────────┼──────────────────────────────────┼──────────────────────────────────┤
  │ 20 spm, 20-21 min counter │ Higher spm improves detection    │ Slot 3: 5.9 min, Slot 4: 6.3 min │
  └───────────────────────────┴──────────────────────────────────┴──────────────────────────────────┘

  ---
  Key insight: counter age is the dominant variable
  
  The sim's test design (step at t=15min, 5 min after cold-start convergence) always tests with a fresh counter. This produces the "3.7% react within 5 min"
  figure. In reality, a pool that hasn't fired in hours has a massive accumulated counter that dilutes any step signal — explaining 40-minute to 2.5-hour
  response times observed in earlier tests.

  At 20 spm (20-21 min counter) vs 6 spm (51-min counter), reaction time improves ~8x (51.8 min → 5.9 min). Besides the younger counter, more shares per
  minute means the new rate accumulates statistical weight faster against the counter history.

  ---
  Simulation validity assessment
  
  The simulation is trustworthy for relative algorithm comparisons. It correctly models:
  - The threshold table mechanics (fire/no-fire decisions)
  - The deterministic staircase behavior post-fire
  - The zero-jitter steady state
  - The share-rate dependence on detection speed
  - The fundamental sensitivity-decay-over-time flaw

  Gap to address: The sim should add a "counter age" axis (test steps at t=30m, t=60m, t=120m) to characterize the tail distribution, which is where
  real-world pools spend most of their time. The current 5-minute observation window and 15-minute step timing understate the algorithm's poor real-world
  responsiveness.

gimballock · 2026-06-03T21:53:50Z

OK, here is evidence that the top algorithm (deployed side-by-side with the current vardiff) is immune to the age-dependence effect. This reproduces in real life the results predicted by the simulation.

I started a mining channel against both pools and let them mine overnight.
In the morning I cut the hashrate in half for both pools at roughly the same time and waited until both pools' vardiffs finished adjusting to the new hashrate. The annotated image below shows the results from the Grafana dashboard.

Not only did the current vardiff algorithm take several hours to respond, the response it eventually made was in the wrong direction! The new algorithm, by contrast, responded in a few minutes and settled on the correct value.

… rejection Add ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS required by sv2-apps pool code when rejecting extended channel open requests on pools configured for standard jobs only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

EwmaEstimator assumes one observe() call per tick (60s), but the pool calls increment_shares_since_last_update() once per share arrival. With 12+ shares per tick, each observe(1) applies a full EWMA decay, collapsing the rate estimate toward 1.0 regardless of actual throughput. This causes the algorithm to consistently see under-performance and ease. CumulativeCounter simply accumulates shares and computes realized SPM at snapshot time from the total count ÷ elapsed time — immune to the per-share vs per-tick calling convention difference. Retains AcceleratingPartialRetarget and AsymmetricCusumBoundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

EwmaEstimator previously applied EWMA decay on every observe() call, assuming one call per tick. Production callers invoke observe(1) per share arrival (~12× per tick), causing the rate to collapse toward 1.0. Fix: observe() now accumulates into a pending counter. The EWMA decay is applied once per snapshot() call (one per vardiff tick), making the estimator produce identical results whether called as observe(12) once or observe(1) twelve times. Uses AtomicU64/AtomicU32 for interior mutability in snapshot() (the Estimator trait requires &self) to satisfy Send+Sync bounds. Also documents the calling convention contract in the Estimator trait: implementations MUST handle both per-share and per-tick observe patterns. Production VardiffState uses CumulativeCounter (immune to this issue) until EwmaEstimator is validated in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
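
The calling-convention fix described in this commit message can be sketched as follows. This is an illustration under assumptions, not the EwmaEstimator source: `PendingEwma`, its fields and `decay_per_call` are hypothetical, and the load-then-store in `snapshot` assumes a single thread calls it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// EWMA over per-tick share counts; decay is applied once per snapshot.
struct PendingEwma {
    alpha: f64,           // weight of the newest tick, in (0, 1]
    pending: AtomicU64,   // shares seen since the last snapshot
    rate_bits: AtomicU64, // current estimate, stored as f64 bits
}

impl PendingEwma {
    fn new(alpha: f64, initial_rate: f64) -> Self {
        Self {
            alpha,
            pending: AtomicU64::new(0),
            rate_bits: AtomicU64::new(initial_rate.to_bits()),
        }
    }

    /// Cheap on the share path: accumulate only, no decay here.
    fn observe(&self, shares: u64) {
        self.pending.fetch_add(shares, Ordering::Relaxed);
    }

    /// Called once per tick: folds the tick's total into the estimate.
    /// Takes &self, so the state lives in atomics.
    fn snapshot(&self) -> f64 {
        let shares = self.pending.swap(0, Ordering::Relaxed) as f64;
        let old = f64::from_bits(self.rate_bits.load(Ordering::Relaxed));
        let new = self.alpha * shares + (1.0 - self.alpha) * old;
        self.rate_bits.store(new.to_bits(), Ordering::Relaxed);
        new
    }
}

/// The failure mode, for contrast: one full decay step per observe(1) call.
fn decay_per_call(alpha: f64, initial_rate: f64, calls: u32) -> f64 {
    (0..calls).fold(initial_rate, |rate, _| alpha * 1.0 + (1.0 - alpha) * rate)
}

fn main() {
    let bulk = PendingEwma::new(0.25, 6.0);
    bulk.observe(12);

    let per_share = PendingEwma::new(0.25, 6.0);
    for _ in 0..12 {
        per_share.observe(1);
    }

    println!("bulk:           {}", bulk.snapshot());
    println!("per share:      {}", per_share.snapshot());
    println!("decay per call: {:.3}", decay_per_call(0.25, 6.0, 12));
}
```

Both calling patterns yield the same estimate because only the pending total reaches the decay step, whereas decaying on every call drags the rate toward 1.0 regardless of throughput.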

… is fixed EwmaEstimator is now safe for per-share observe() calls (pending shares accumulate, decay applied once per snapshot). Swap back from CumulativeCounter to get the EWMA's temporal smoothing benefits: better noise rejection and more stable estimates under Poisson variance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…_time simulation

Port ckpool's vardiff algorithm (dual-window adaptive EWMA with per-share decay_time updates) to the three-stage pipeline. Key finding: the correct way to port a continuous-time EMA to a tick-based framework is to simulate per-share updates within each tick, not to apply time-bias correction factors. While tuning gets CkpoolRemedy within ~5% of FullRemedy's comprehensive fitness, it yields no Pareto improvement: the dual-window switching adds complexity without outperforming a single-window EWMA(120s).

New components:
- CkpoolEstimator: per-share decay_time() simulation with dual-window adaptive switching and configurable fast-threshold
- HysteresisGate: binary fire/no-fire boundary with data gate + dead band
- CkpoolRetarget: full retarget with oscillation guard
- TimeBiasEwmaEstimator: single-window EWMA with time-bias correction

Grid registrations: ckpool(), ckpool_remedy(), ckpool_remedy_ft(n), ckpool_narrow_hyst(), ckpool_with(), time_bias_remedy()

Also updates PID_INVESTIGATION.md with structural analysis of why PID fails (stage conflation makes the 41% dead zone undiagnosable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The upstream commit df4e764 added ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS, which our branch already defined. Remove the duplicate to fix the build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The baseline was stale — generated before the EwmaEstimator promotion (f4cd687) changed VardiffState's internal composition. Regenerated with default 1000 trials × 80 cells at the canonical seed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

paratoxicdev · 2026-06-04T21:39:17Z

Thanks for the analysis @gimballock!

Why not consider share-based vardiff, if you've used it before?

Since you're also Rust-based I would have a look at the vardiff.rs + decay.rs files with Claude. The type system has allowed much better encapsulation and reasoning about behaviour and I think that'll be easier to compare than the C code from ckpool.

I do have to disagree with some of the things your Claude said:

Where timer-based genuinely wins: Scalability. In SV2 with multiplexed channels, evaluating vardiff on every share means the hot path (share validation → channel dispatch → vardiff state update) gets heavier per share. Timer-based decouples evaluation from the ingestion path: shares go into a counter, evaluation happens on a separate cadence.

Each channel has its own Vardiff, so there is no shared data structure and no contention. The pool's work is O(shares) overall, and the couple of floating-point operations that share-based vardiff adds are completely dwarfed by the cost of deserializing a share from the wire and then doing a double SHA-256 hash for nonce validation (which happens no matter which vardiff you use). I'm not that caught up on SV2, so please correct me if that's not what it does.

The real answer: Timer-based with a well-tuned boundary (PoissonCI or CUSUM) is strictly better for pool architecture. The "responsiveness" argument for share-based evaluation is illusory: it's the boundary's evidence threshold that gates reaction time, not how often you ask the question. Asking more often with less data per ask doesn't help.

So the ckpool algo not only specifies an estimator (exponentially weighted moving average, see decay.rs) but also a boundary (when to adjust difficulty). The two main parts which make up the boundary are the HYSTERESIS bounds and MIN_WINDOW_RATIO.

/// Minimum window ratio before considering adjustment.
/// Fraction of expected time (or shares) per window.
/// Derived from ckpool: 240s / 300s window = 0.8
const MIN_WINDOW_RATIO: f64 = 0.8;

/// Only decrease difficulty when rate drops below this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_LOW: f64 = 0.5;

/// Only increase difficulty when rate exceeds this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_HIGH: f64 = 1.33;

#[derive(Debug, Clone)]
pub(crate) struct Vardiff {
    period: Duration,
    window: Duration,
    min_shares_for_adjustment: u32,
    min_time_for_adjustment: Duration,
    dsps: DecayingAverage,
    current_diff: Difficulty,
    old_diff: Difficulty,
    first_share: Option<Instant>,
    last_diff_change: Instant,
    shares_since_change: u32,
    min_diff: Option<Difficulty>,
    max_diff: Option<Difficulty>,
    diff_change_job_id: Option<JobId>,
}
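As an illustration of how a data gate and a dead band combine into a boundary, here is a minimal sketch. It reuses the three constants quoted above, but the `decide` function, the `Decision` enum, and the way the gate is applied are assumptions for the example; the real vardiff.rs also tracks share counts, min/max difficulty, and more.

```rust
use std::time::Duration;

const MIN_WINDOW_RATIO: f64 = 0.8;
const HYSTERESIS_LOW: f64 = 0.5;
const HYSTERESIS_HIGH: f64 = 1.33;

#[derive(Debug, PartialEq)]
pub enum Decision {
    Hold,
    Decrease,
    Increase,
}

/// Hypothetical boundary check.
/// `rate_ratio` is the observed share rate divided by the target share rate.
pub fn decide(rate_ratio: f64, elapsed: Duration, window: Duration) -> Decision {
    // Data gate: do nothing until enough of the window has elapsed.
    if elapsed.as_secs_f64() < MIN_WINDOW_RATIO * window.as_secs_f64() {
        return Decision::Hold;
    }
    // Dead band: only act when the ratio leaves [HYSTERESIS_LOW, HYSTERESIS_HIGH].
    if rate_ratio < HYSTERESIS_LOW {
        Decision::Decrease
    } else if rate_ratio > HYSTERESIS_HIGH {
        Decision::Increase
    } else {
        Decision::Hold
    }
}
```

In this sketch a ratio of 2.0 is ignored at 100 s into a 300 s window and triggers an increase at 250 s, while a ratio of 1.0 never triggers anything.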

So not only is the responsiveness tuneable, you can see its responsiveness working on real mining machines in the wild: https://stats.ckpool.org/. From experience I can say that it responds beautifully to any scale of hashrate, be it a cpu miner, a bitaxe, or 1 EH/s rental hashrate.

I really like the theoretical work you're doing, and it has deepened my understanding (I now know words like estimator and boundary) of what I originally just copied from ckpool. I feel like this algo should be the benchmark that any new vardiff is compared with, since it's proven to work with real hashrate.

That's just my two cents. I like the work you're doing!

…h SPM

Replace the static AsymmetricCusumBoundary with AdaptivePoissonCusum, which selects the boundary based on the miner's configured share rate:
- Below SPM 10: PoissonCI (prevents overshoot on sparse data)
- At SPM 10+: AsymmetricCUSUM (fast reaction with abundant evidence)

This eliminates the FullRemedy vs VardiffState trade-off: PoissonCI was better at low SPM (bitaxe/small miners) while CUSUM was better at high SPM (large hashrate). The adaptive boundary gets both.

Also caps AcceleratingPartialRetarget η at 0.4 (was 0.6). The lower cap prevents cold-start overshoot while still accelerating convergence after step changes. Parameter sweep (compare_out7/) confirmed this is the Pareto-optimal cap.

New components:
- AdaptivePoissonCusum: SPM-threshold boundary selector
- GuardedAccelRetarget: acceleration only after first direction reversal (experimental, not used in production)
- sweep-adaptive.rs: parameter sweep binary for boundary tuning

Simulation results vs previous production (comprehensive fitness):
SPM 4: 0.706 vs 0.636 (+11%)
SPM 8: 0.771 vs 0.737 (+5%)
SPM 10: 0.783 vs 0.774 (+1%)
SPM 15: 0.810 vs 0.803 (+1%)
SPM 20: 0.894 vs 0.882 (+1%)
SPM 25: 0.903 vs 0.894 (+1%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…estimator equivalence test

Adds two follow-up investigations to CKPOOL_INVESTIGATION.md:

1. Hysteresis boundary sweep: tested band widths from [0.5,1.33] through [0.9,1.1] with varying data gates and update rules. Conclusion: hysteresis achieves excellent reaction rates (96-100%) but at 10× jitter cost vs statistical boundaries. No parameterization achieves competitive comprehensive fitness.

2. Estimator equivalence test (revised): paired CkpoolEstimator with the production-tuned boundary (AdaptivePoissonCusum) and update (AcceleratingPartialRetarget). Result: NOT equivalent. The per-share decay_time() simulation is noisier per tick than batch EWMA, and this difference is amplified by CUSUM's aggressive firing. EwmaEstimator(120s) is genuinely better, not just simpler.

Also adds ckpool estimator + hysteresis variants to compare-algorithms binary for reproducibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gimballock · 2026-06-06T16:57:14Z

@paratoxicdev You're right on both points — I should have posted my actual findings rather than a raw Claude transcript.

On scalability: You're correct. Each channel owns its own Vardiff — no shared state, no contention. The FP cost of decay_time() is trivial next to deserialize + SHA256d. Timer-based in SRI is an architectural choice (the pool already runs a 60s tick loop), not a performance win.

On "strictly better": Also wrong. The reality is more nuanced than either of us stated — I'll explain below.

What I found after reading your vardiff.rs + decay.rs:

I ported ckpool's algorithm and benchmarked it across SPM 4–30. Full writeup in commit 6aa27a9, updated in b77eff8.

On your code specifically:

Your single-window DecayingAverage is simpler than ckpool's dual-window. My sweep confirmed — once τ_long is shortened to match our tick cadence, the dual-window switching adds nothing. You already knew this.
Your time_bias uses time_since_first (miner lifetime), not time since last retarget. This is critical — for any miner running >5 minutes, bias ≈ 1.0 and it's effectively inert. When I naively reset bias on each retarget (as our pipeline's on_fire() does), I got 1 / (1 - e^(-60/300)) ≈ 5.5× amplification on the first tick after every difficulty change — catastrophic overshoot (177–275% settled accuracy). The fix was to simulate per-share decay_time() calls within each tick, letting the EMA warm up organically.
Your value_at() (read-without-mutate) is a clean pattern — prevents observation from altering state.
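The amplification figure quoted above follows directly from the warm-up factor of an exponential average. A short sketch of the arithmetic, assuming the bias has the standard form 1 − e^(−t/τ) and that the estimate is divided by it (the function names are made up for the example):

```rust
/// Warm-up fraction of an EMA with time constant `window_secs`
/// after `elapsed_secs` of data: 1 - e^(-t/tau).
pub fn time_bias(elapsed_secs: f64, window_secs: f64) -> f64 {
    1.0 - (-elapsed_secs / window_secs).exp()
}

/// Factor by which dividing by the bias scales the raw EMA value.
pub fn amplification(elapsed_secs: f64, window_secs: f64) -> f64 {
    1.0 / time_bias(elapsed_secs, window_secs)
}
```

`amplification(60.0, 300.0)` evaluates to about 5.52, matching the ≈ 5.5× seen on the first 60 s tick after a reset. Once the elapsed time is many multiples of the window, the factor approaches 1 and the correction becomes inert.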

Why we didn't use ckpool's components in production:

I tested all three stages of the pipeline (estimator, boundary, update) independently:

Estimator: After correct porting via per-share simulation, CkpoolEstimator initially appeared equivalent to EwmaEstimator(120s) — they matched within ~5% under a conservative boundary (PoissonCI). But when I paired both with our production-tuned boundary (which uses CUSUM for high-SPM miners), the ckpool estimator underperformed significantly:

Composition	SPM 4	SPM 8	SPM 12	SPM 20	SPM 30
EwmaEstimator(120s) + tuned boundary	0.689	0.768	0.787	0.882	0.869
CkpoolEstimator(60,300) + same boundary	0.698	0.638	0.696	0.777	0.858
CkpoolEstimator(60,120) + same boundary	0.598	0.688	0.764	0.805	0.858

The per-share decay_time() simulation running N decay steps per tick introduces more per-tick noise than a single batch EWMA update. Under the conservative PoissonCI boundary this noise is masked (the boundary requires large deviations to fire). Under CUSUM (which fires on smaller accumulated deviations), the extra noise triggers more false fires → higher jitter → lower fitness. So the evaluation cadence isn't purely scheduling — the numerical stability of the estimate depends on how the information is processed.

Boundary: I swept hysteresis from your native [0.5, 1.33] through [0.9, 1.1] with varying data gates:

Boundary	SPM 4	SPM 10	SPM 20	SPM 30	Jitter (SPM 10)
AdaptivePoissonCusum (production)	0.689	0.774	0.882	0.869	0.060/min
Hyst [0.7, 1.3]	0.621	0.695	0.666	0.608	0.043/min
Hyst [0.8, 1.2]	0.572	0.675	0.723	0.726	0.152/min
Hyst [0.85, 1.15]	0.578	0.634	0.706	0.730	0.254/min
Hyst [0.5, 1.33] (native)	0.630	0.542	0.508	0.504	0.005/min

Narrower bands get excellent reaction rates (96–100%, better than anything else), but at 5–10× jitter cost. The fundamental issue: hysteresis fires whenever the rate ratio crosses the band, with no evidence accumulation. Statistical boundaries (PoissonCI, CUSUM) require cumulative evidence before firing, so they distinguish real changes from noise.
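The difference between firing on a single band crossing and firing on accumulated evidence can be shown with a toy comparison. This is a textbook two-sided CUSUM on the rate ratio, not the AsymmetricCusumBoundary from the PR; the slack `k`, threshold `h`, and both function names are assumptions for illustration.

```rust
/// Hysteresis: fires as soon as any single ratio leaves [low, high].
pub fn hysteresis_fires(ratios: &[f64], low: f64, high: f64) -> bool {
    ratios.iter().any(|&r| r < low || r > high)
}

/// Two-sided CUSUM on the deviation of the ratio from 1.0.
/// `k` is the per-sample slack, `h` the evidence threshold.
pub fn cusum_fires(ratios: &[f64], k: f64, h: f64) -> bool {
    let (mut up, mut down) = (0.0_f64, 0.0_f64);
    for &r in ratios {
        let dev = r - 1.0;
        // Evidence accumulates only while the deviation exceeds the slack,
        // and drains back toward zero otherwise.
        up = (up + dev - k).max(0.0);
        down = (down - dev - k).max(0.0);
        if up > h || down > h {
            return true;
        }
    }
    false
}
```

A one-tick spike such as `[1.0, 1.0, 1.4, 1.0, 1.0]` crosses a [0.8, 1.2] band immediately but does not reach `h = 0.5` with `k = 0.1`, whereas a sustained ratio of 1.25 accumulates enough evidence to fire on the fourth sample.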

Update: ckpool's full retarget (η=1.0) overshoots in the tick framework. Our AcceleratingPartialRetarget (η ramps 0.2→0.4) achieves the same convergence speed without overshoot.
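One way to picture a partial retarget with a ramping η is a step in log-difficulty space. The sketch below is an assumption-heavy illustration: the struct name, the log-space update, and the fixed additive ramp are all hypothetical, and only the 0.2 → 0.4 range and the η = 1.0 "full retarget" case come from the discussion above.

```rust
/// Hypothetical partial retarget: move a fraction `eta` of the way
/// (in log space) from the current difficulty to the ideal one.
pub struct AccelRetarget {
    eta_start: f64,
    eta_cap: f64,
    eta_step: f64,
    eta: f64,
}

impl AccelRetarget {
    pub fn new(eta_start: f64, eta_cap: f64, eta_step: f64) -> Self {
        Self { eta_start, eta_cap, eta_step, eta: eta_start }
    }

    pub fn eta(&self) -> f64 {
        self.eta
    }

    /// Apply one retarget, then ramp eta toward the cap for the next fire.
    /// With eta = 1.0 this reduces to a full retarget (result == ideal).
    pub fn fire(&mut self, current: f64, ideal: f64) -> f64 {
        let next = current * (ideal / current).powf(self.eta);
        self.eta = (self.eta + self.eta_step).min(self.eta_cap);
        next
    }

    /// Assumed reset hook, e.g. once the estimate has settled.
    pub fn reset(&mut self) {
        self.eta = self.eta_start;
    }
}
```

Because every step covers less than the full log gap when η < 1, the difficulty in this sketch approaches the ideal value from one side without crossing it, which is the non-overshoot property the comparison refers to.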

What we ended up with:

The investigation led to a new production composition that adapts its boundary strategy based on the miner's share rate — conservative (PoissonCI) for low-SPM miners where data is sparse, aggressive (CUSUM) for high-SPM miners where evidence is abundant. This is analogous to what ckpool does with its dual-window adaptive switching, just at the boundary layer instead of the estimator layer. Your algorithm's idea of "be conservative when data is sparse, aggressive when data is abundant" is exactly right — we just implement it differently.

SPM	Old VardiffState	New Production	Improvement
4	0.636	0.706	+11%
8	0.737	0.771	+5%
12	0.787	0.793	+1%
20	0.882	0.894	+1%
30	0.869	0.874	+1%

CkpoolEstimator is preserved in the simulation grid as a benchmark contender — it runs alongside every other algorithm in our characterization suite and any future changes are measured against it.

gimballock · 2026-06-06T17:01:01Z

If you have any ideas or suggestions to improve the comparison, let me know.

The field was documented as a share count but actually used as an SPM threshold — align the name with the semantics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Clean extraction of the best-performing vardiff algorithm from the simulation framework in stratum-mining#2154, with all test scaffolding, traits, and alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder and full retarget. This produced:
- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):
- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06), half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub). Requires channels_sv2 major version bump. Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add AsymmetricPoissonCI: applies a tighten_multiplier to the PoissonCI threshold when the miner is over-performing, reflecting the asymmetric cost of tightening (rejects in-flight shares) vs loosening (free).

Parameter sweep results (500 trials/cell, SPM 4-30). Comprehensive fitness at low SPM (where PoissonCI is active):
symmetric (t=1.0): SPM4=0.706, SPM6=0.744, SPM8=0.758
t=1.5: SPM4=0.743, SPM6=0.771, SPM8=0.784
t=2.0: SPM4=0.799, SPM6=0.821, SPM8=0.801
t=3.0: SPM4=0.858, SPM6=0.872, SPM8=0.882

t=3.0 matches CUSUM's tighten_multiplier and is the clear winner (+21% at SPM 4, +17% at SPM 6). SPM 10+ is unchanged (CUSUM active).

New files:
- AsymmetricPoissonCI in boundary.rs
- sweep-asymmetric-poisson.rs: parameter sweep binary
- asymmetric_poisson_sweep.md: sweep results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
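A rough sketch of what an asymmetric Poisson confidence boundary might look like. It uses a normal approximation (threshold = z·√expected) purely as an assumption; the PR's PoissonCI may compute its interval differently, and the struct and field names here are hypothetical apart from `tighten_multiplier`.

```rust
/// Hypothetical asymmetric Poisson boundary using a normal approximation.
pub struct AsymPoissonCi {
    pub z: f64,                  // width of the base interval in standard deviations
    pub tighten_multiplier: f64, // extra evidence demanded before tightening
}

impl AsymPoissonCi {
    /// `observed` and `expected` are share counts over the same window.
    pub fn fires(&self, observed: f64, expected: f64) -> bool {
        // For a Poisson count the standard deviation is sqrt(expected).
        let base = self.z * expected.sqrt();
        let dev = observed - expected;
        if dev > 0.0 {
            // Over-performing miner: firing would tighten (raise difficulty),
            // so require a larger deviation.
            dev > base * self.tighten_multiplier
        } else {
            // Under-performing miner: loosening is cheap, use the base threshold.
            -dev > base
        }
    }
}
```

With `z = 2.0`, `tighten_multiplier = 3.0`, and 100 expected shares, a surplus of 30 shares does not fire while a deficit of 30 does; the symmetric case (`tighten_multiplier = 1.0`) fires on both.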

…Pareto-optimal vardiff configs Names every composed algorithm by its three parts instead of opaque labels (FullRemedy/VardiffState), and uses that to drive a maximin-based search for balanced vardiff algorithms. Naming (drift-proof): - Add code() to the Estimator/Boundary/UpdateRule traits, implemented on every concrete type from its live params. naming::triple_name composes the display name "Estimator / Boundary / Update"; sanitize_filename derives a filesystem-safe form. AlgorithmSpec constructors now derive their names from the actual parts, so a name can never drift from what runs. Analysis tooling: - metrics: EqualWeightFitness — same 6 sub-metrics as OperationalFitness but uniform 1/6 weighting, so no cluster is privileged. - bin/radar-chart: SVG radar with maximin sort, best-in-class hull, and per-axis direction arrows. Defaults to the three headline contenders; VARDIFF_RADAR_FULL=1 plots the broad historical set. - bin/sweep-balanced, sweep-estimators, sweep-signpersist, sweep-signpersist-cotuned, sweep-voladapt: maximin-scored parameter sweeps. New boundary: - VolatilityAdaptiveBoundary: PoissonCI floor scaled by observed-vs-Poisson share-rate volatility. Kept as a documented negative result — it loosens during the drop it should catch, so it underperforms (see sweep-voladapt). - SignPersistenceCusumBoundary: gave it a param-distinctive code(). Two Pareto-optimal configs (AlgorithmSpec::balanced / react_priority): - balanced(): Ewma90 / AdaptPC-spm8[sensitive CUSUM] / Accel-0.3-0.6-0.2 — maximin 0.551, the best worst-axis characterized; beats production on both small-drop reaction and convergence. - react_priority(): Ewma90 / SignPersist / Accel — react-10% 0.696, far above the ~0.54 ceiling fixed boundaries hit, for fast failing-ASIC detection at the cost of convergence. 
The sweeps establish a ~0.55 maximin ceiling for the three-stage architecture: small-drop reaction and convergence trade against each other on a shared agility budget, confirmed from four independent directions (smoothing estimators, reactive boundary, volatility-adaptive boundary, sign-persistence). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…SPM collapse

Per-SPM analysis revealed react_priority() (bare SignPersistenceCusumBoundary) collapses at low SPM: at 4 SPM its jitter/step-safety/convergence axes all fall to ~0 (maximin 0.000). Sparse-data Poisson noise produces spurious same-sign residual runs that trip the sign-persistence discount, causing constant false fires. balanced() was already immune: its AdaptivePoissonCusum dual-mode uses the conservative PoissonCI below the SPM threshold.

- Generalize AdaptivePoissonCusum into AdaptiveBoundary<B: Boundary>: PoissonCI below spm_threshold, an arbitrary aggressive boundary B at/above. The CUSUM pairing is preserved as `type AdaptivePoissonCusum = AdaptiveBoundary<AsymmetricCusumBoundary>` with its original new/with_params, so all existing call sites are unchanged.
- Add AdaptiveSignPersist alias + ::sign_persist constructor wrapping SignPersistenceCusumBoundary in the same low-SPM guard.
- Repoint react_priority() at AdaptiveSignPersist(spm=8).

Result: react_priority maximin at 4 SPM 0.000 → 0.292 (jitter 0.31→0.80, step 0.00→0.29, conv 0.00→0.60); high-SPM behavior unchanged (PoissonCI only engages below 8 SPM); aggregate maximin now competitive with balanced.

The dual-mode code() always shows the high boundary's own code(), so names read "Adapt-spm8[AsymCusum-...]" / "Adapt-spm8[SignPersist-...]".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
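The generalization described in this commit (a generic wrapper plus a type alias that keeps old call sites compiling) might look roughly like the sketch below. The trait shape, the placeholder threshold values, and the `with_high` constructor are assumptions for illustration; only the `AdaptiveBoundary<B>` / `AdaptivePoissonCusum` naming and the "Adapt-spm8[...]" code format come from the commit message.

```rust
/// Simplified stand-in for the PR's Boundary trait (assumed shape).
pub trait Boundary {
    fn threshold(&self, spm: u32) -> f64;
    fn code(&self) -> String;
}

pub struct PoissonCi;
pub struct AsymmetricCusum;

impl Boundary for PoissonCi {
    fn threshold(&self, _spm: u32) -> f64 { 0.5 } // placeholder value
    fn code(&self) -> String { "PoissonCI".to_string() }
}

impl Boundary for AsymmetricCusum {
    fn threshold(&self, _spm: u32) -> f64 { 0.2 } // placeholder value
    fn code(&self) -> String { "AsymCusum".to_string() }
}

/// Conservative PoissonCI below `spm_threshold`, boundary `B` at or above it.
pub struct AdaptiveBoundary<B: Boundary> {
    spm_threshold: u32,
    low: PoissonCi,
    high: B,
}

impl<B: Boundary> AdaptiveBoundary<B> {
    pub fn with_high(spm_threshold: u32, high: B) -> Self {
        Self { spm_threshold, low: PoissonCi, high }
    }
}

impl<B: Boundary> Boundary for AdaptiveBoundary<B> {
    fn threshold(&self, spm: u32) -> f64 {
        if spm < self.spm_threshold {
            self.low.threshold(spm)
        } else {
            self.high.threshold(spm)
        }
    }

    /// The name shows the high boundary's own code, as in the commit message.
    fn code(&self) -> String {
        format!("Adapt-spm{}[{}]", self.spm_threshold, self.high.code())
    }
}

/// The alias keeps the original name usable at existing call sites.
pub type AdaptivePoissonCusum = AdaptiveBoundary<AsymmetricCusum>;

impl AdaptiveBoundary<AsymmetricCusum> {
    pub fn new(spm_threshold: u32) -> Self {
        Self::with_high(spm_threshold, AsymmetricCusum)
    }
}
```

Wrapping a different aggressive boundary then only needs another alias and constructor, without touching the low-SPM guard.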

VardiffState had been pointed at the tuned EWMA stack (EwmaEstimator(120) + AdaptivePoissonCusum + AcceleratingPartialRetarget) — scope creep that bundled an algorithm change into what is meant to be the diagnostic simulation framework. Revert production to the real upstream classic algorithm by delegating to composed::classic_composed (CumulativeCounter + StepFunction classic table + FullRetargetWithClamp). Delegating to the existing classic_composed factory (rather than re-spelling Composed::new) makes production and the sim's reference construction literally identical, so the monolith and ClassicComposed are now fire-for-fire equivalent by construction — restoring the truth of the "Cumul / Step / FullClamp*" name and the equivalence doc comment in grid.rs. The composed/ pipeline and clock injection stay as diagnostic infrastructure the sim drives; any production algorithm change ships as a separate, clean commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add LogErrorRegret, the principled replacement for the six-axis EqualWeightFitness radar. In the natural error coordinate e = ln(H_est/H_true) it decomposes behavior into:
- regret_over / regret_under: linear time-avg |e| split by sign (linear, not quadratic; a quadratic loss is structurally blind to small persistent degradation, the failing-ASIC case)
- effort_up / effort_down: Σ(Δln D)² over fires, split by direction (tightening is costly, easing ~free)

Computed from the universal trajectory (current_hashrate_before, new_hashrate, fired), so it works for every algorithm including the opaque production monolith. Unit-tested on the over/under sign split.

EqualWeightFitness is marked deprecated (kept only until the maximin sweep bins migrate to regret/effort scoring). The old radar-chart bin and its generated SVGs are removed.

docs/THEORY.md (§1–§10) records the derivation: the conservation law, why every fitness "axis" is one trade-off, the linear-vs-quadratic loss-shape decision, and the incumbent-reference (vs hull) normalization choice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
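The regret/effort decomposition can be sketched directly from the two formulas in this commit message. The function signatures, the equal-length-tick assumption, and the convention that e > 0 counts as "over" are assumptions for the example; the PR computes these from its own trajectory records.

```rust
#[derive(Debug, Default)]
pub struct Regret {
    pub over: f64,  // time-averaged e where the estimate is above truth
    pub under: f64, // time-averaged |e| where the estimate is below truth
}

#[derive(Debug, Default)]
pub struct Effort {
    pub up: f64,   // sum of (delta ln D)^2 over tightening fires
    pub down: f64, // sum of (delta ln D)^2 over easing fires
}

/// Linear regret in e = ln(h_est / h_true), split by sign.
/// Assumes one sample per tick and equal-length ticks.
pub fn regret(h_est: &[f64], h_true: &[f64]) -> Regret {
    assert_eq!(h_est.len(), h_true.len());
    let mut r = Regret::default();
    if h_est.is_empty() {
        return r;
    }
    for (est, truth) in h_est.iter().zip(h_true) {
        let e = (est / truth).ln();
        if e > 0.0 {
            r.over += e;
        } else {
            r.under += -e;
        }
    }
    let n = h_est.len() as f64;
    r.over /= n;
    r.under /= n;
    r
}

/// Effort over fires, each given as (difficulty_before, difficulty_after).
pub fn effort(fires: &[(f64, f64)]) -> Effort {
    let mut eff = Effort::default();
    for &(before, after) in fires {
        let step = (after / before).ln();
        if step > 0.0 {
            eff.up += step * step;
        } else {
            eff.down += step * step;
        }
    }
    eff
}
```

Because the regret term is linear in |e|, a small persistent offset contributes in proportion to its size and duration, rather than being squared away as it would under a quadratic loss.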

Tooling built on the §10 LogErrorRegret metric to find and present the new Pareto champions: - regret-effort: empirical validation of the conservation-law theory (binding? δ²-cancel? real frontier?) from raw trajectories. - regret-radar: the 5-axis principled radar (tracking, gentleness, detection, over-diff safety, tighten-care), anchored on ClassicComposed (real upstream vardiff). Familiar-metrics companion panel uses a per-axis log-ratio scale so contenders separate despite the anchor being a degenerate outlier on jitter/accuracy. - sweep-regret / sweep-regret-big: regret/effort-scored parameter sweeps (big = 9216 configs, parallel via std::thread::scope over run_cell_with_algorithm, bit-identical to a serial grid). - confirm-champions: high-trial tie-break on the sweep's flat top cluster plus edge-extension probes. - champion-weights: corner pressure-test + weight-sensitivity (re-scores one simulation pass under a weight grid for free), showing the new champion is a robust interior optimum, not a degenerate never-tighten corner, and validating the §10 3:1 over:under weight. - trajectory-plot: the plain-language comparison — estimate chasing truth over one timeline (cold-start ramp → settle → aged −10% drop), making ramp-up time and detection latency visible in a single frame. Includes an oracle reference line (same τ=150 estimator, no control policy) that decomposes the settle-phase offset into irreducible noise vs policy bias, and a fire-raster strip (mark height ∝ |Δln D|) that shows each algorithm's "few+violent vs many+gentle" retarget character and its reactivity to the small aged drop. New champion: Ewma150 / AsymCusum-s0.2-t6 / Accel-0.2-0.8-0.05 (~15% better cost than the prior balanced/react_priority champions, 100% detection). .gitignore generalized to cover the new bins' generated reports and charts (regenerate via cargo run; analysis lives in commits). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@1000

Improve on the two weaknesses the trajectory plot highlighted for the interim AsymCusum champion: slow cold-start ramp (~34 min) and a persistent settle-phase under-difficulty offset (−10.5%). Both are the same thing — reluctance to tighten — and both are fixed by one mechanism. Swap the boundary to AdaptiveSignPersist: the sign-persistence discount relaxes the fire threshold on consecutive same-sign residuals AFTER the tighten multiplier, so a *persistent* under-difficulty (cold start, settle bias) progressively lowers the tighten bar and fires frequent small corrections, while a one-off spike keeps full tighten-reluctance — death-spiral safety preserved. Found and validated with the §10 harness: a trajectory spike eliminated the trend/confidence/gentle-frequent alternatives (VolatilityAdaptive loosened the wrong way), then sweep-signpersist-regret (973 configs, 581 beat the AsymCusum champion) → confirm-signpersist (2000 trials + weight grid) pinned the optimum at d=0.06 and proved it weight-robust at the §10 3:1 over:under weight (probes d<0.06 win only at an ungrounded ≤2:1). New champion, promoted as AlgorithmSpec::champion(): Ewma150 / AdaptiveSignPersist[s0.3,f0.05,t6,d0.06,dm0.6,spm6] / Accel-0.2-0.8-0.05 Trajectory @1000 trials vs the interim AsymCusum champion: ramp-up 34→15 min, detect latency 12→9 min, settle gap −10.5%→−7.2%, detection holds 100%. regret_under 0.096→0.087, regret_over flat. - grid.rs: add champion() constructor. - regret-radar / trajectory-plot: plot the SignPersist champion alongside the interim AsymCusum one so the gains are visible at each step. - sweep-signpersist-regret, confirm-signpersist: the new analysis bins. - .gitignore: generalize confirm_*.md. The settle gap is reduced, not closed (−7.2% vs the −0.8% oracle floor); remaining daylight is a known, lower-priority follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… result Investigate idea stratum-mining#1 (cold-start warm-up) to recover the champion's policy-imposed ramp time. The trajectory oracle readout (added here) shows the τ=150 estimator alone converges in ~2-3 min while the cautious champion takes ~15 min — so ~12 min of ramp is policy, not estimator, and in principle recoverable. Add WarmupBoundary<B>: returns threshold 0 (fire on any deviation) until the realized rate first lands within `converge_band` of target, then one-way latches and delegates to the inner boundary B forever after. The rationale is sound — a fresh connection has no in-flight work to protect, so the death-spiral caution that justifies the steady-state policy only costs ramp time there. Latch behavior is unit-tested. RESULT: rejected as wired. The trajectory spike looked great (ramp 15→7 min, settle/detection unchanged), but confirm-warmup (1000 trials, WarmupBoundary<AdaptiveSignPersist>, converge_band swept) shows it REGRESSES the §10 steady-state cost by up to 16.5% — the no-warmup champion wins outright, every band raises regret_over. Root cause: cold start is excluded from the cost, but the Stable/Step scenarios start at truth with Poisson-noisy early windows, and warm-up can't tell a genuine cold start from a noisy first window at correct difficulty — so it fires aggressively on noise and books over-difficulty regret. Salvaging it needs arming only on a large *sustained* initial deviation (a redesign). Champion is UNCHANGED. The primitive (tested, reusable) and the confirmation bin are kept as a reproducible negative result; the spike itself is not wired into any shipped config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
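The one-way latch described for WarmupBoundary can be illustrated as follows. The trait shape, the `Fixed` inner boundary, and the use of `&mut self` are assumptions for the sketch; it shows only the latch mechanics, not the rejected wiring into the champion config.

```rust
/// Simplified stand-in for the PR's Boundary trait (assumed shape).
/// `rate_ratio` is the realized share rate divided by the target rate.
pub trait Boundary {
    fn threshold(&mut self, rate_ratio: f64) -> f64;
}

/// Trivial inner boundary with a constant threshold, for illustration only.
pub struct Fixed(pub f64);

impl Boundary for Fixed {
    fn threshold(&mut self, _rate_ratio: f64) -> f64 {
        self.0
    }
}

pub struct WarmupBoundary<B: Boundary> {
    inner: B,
    converge_band: f64,
    latched: bool,
}

impl<B: Boundary> WarmupBoundary<B> {
    pub fn new(inner: B, converge_band: f64) -> Self {
        Self { inner, converge_band, latched: false }
    }
}

impl<B: Boundary> Boundary for WarmupBoundary<B> {
    fn threshold(&mut self, rate_ratio: f64) -> f64 {
        // Latch the first time the realized rate lands within the band of target.
        if !self.latched && (rate_ratio - 1.0).abs() <= self.converge_band {
            self.latched = true;
        }
        if self.latched {
            // After latching, always delegate to the inner boundary.
            self.inner.threshold(rate_ratio)
        } else {
            // Before latching, fire on any deviation.
            0.0
        }
    }
}
```

The latch never releases, so a later large deviation is handled by the inner boundary's normal threshold rather than by warm-up behavior.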

Investigate whether the SignPersist champion's persistent ~−7% settle-phase under-difficulty offset (the gap to the trajectory oracle line) is recoverable via an estimator-side debias. Add DebiasEstimator<E>: scales the inner estimator's h_estimate by a fixed `bias` (>1 lifts the belief), leaving the raw realized rate untouched. Unit-tested. The hypothesis: a tighten-reluctant boundary equilibrates with an under-difficulty offset; lifting the belief should move that equilibrium toward truth. RESULT: settles the question — the gap is the deliberate optimum, not a defect. confirm-debias swept bias ∈ [1.0, 1.25] on the champion (1000 trials), tracking both §10 cost and the settle gap. The debias works mechanically (gap dials −6.9% → 0 near bias 1.10 → +21% at 1.25), but §10 cost rises MONOTONICALLY from bias=1.0: regret_under falls while regret_over rises faster under the 3:1 weight. bias=1.0 is the cost minimum. So the champion correctly declines to track dead-on — sitting slightly under-difficulty is cheaper than the over-difficulty risk, and the oracle reaches −0.8% only by firing every tick (effort the cost penalizes). Triangulated with champion-weights and the trajectory oracle line; changing the gap now requires changing the WEIGHTS, not the algorithm. Reframe the trajectory plot to match: relabel the "oracle" line as the "accuracy ceiling (cost-blind)" — an accuracy bound, not a target — and add a green "cost-optimal settle (§10)" corridor at the champion's own settle level, so the champion visibly sits IN the objective's optimum rather than appearing to fall short of the cost-blind ceiling. Champion UNCHANGED. Primitive (tested, reusable) + confirmation bin kept as a reproducible result. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add METRIC_DERIVATION.md — a standalone, skeptic-facing account of WHY the metric is regret(linear, sign-split) + effort(direction-split) + an explicit detection axis, per scenario class, with 3:1 directional weights. Where THEORY.md is the chronological lab notebook, this is the cleaned-up proof: each claim is labeled PROVEN (algebra/probability from the plant identity), EMPIRICAL (named simulation), or VALUES (declared judgment), and the argument is carried as much by the three falsified hypotheses (δ²-cancellation, single-scalar E=J_opt/J, sufficiency of ∫e²) as by the surviving ones. Includes the detection non-recoverability theorem (§7), the quadratic-blindness lemma (§5.3), the weight-robustness result, the confirm-debias "−7% gap is optimal" demonstration, and an explicit falsification checklist (§11) plus an epistemic-status table (§12). All citations verified against the tree (poisson_floor metrics.rs:864; commits 31a9dbc / a1d3fa7 / 70fcb26). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06

This was referenced May 13, 2026

replace vardiff hardcoded threshold ladder with parametric noise floor #2148

Closed

[experiment] Apply new error estimate math to vardiff algo stratum-mining/sv2-apps#488

Closed

gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from 5cbed7c to 85d6f8b Compare May 17, 2026 14:48

gimballock changed the title ~~feat(vardiff): add in-process simulation framework + baseline regression tests~~ [Draft] feat(vardiff): add in-process simulation framework + baseline regression tests May 17, 2026

gimballock force-pushed the vardiff/simulation-framework branch from 2d10f57 to 414afbb Compare May 19, 2026 14:25

plebhash mentioned this pull request May 19, 2026

consider smaller vardiff cycles stratum-mining/sv2-apps#396

Closed

gimballock force-pushed the vardiff/simulation-framework branch 4 times, most recently from 211bc98 to 2a88fde Compare May 20, 2026 15:09

gimballock force-pushed the vardiff/simulation-framework branch from 63a19d0 to a18c3a3 Compare May 21, 2026 14:28

gimballock mentioned this pull request Jun 1, 2026

feat(test-tools): add shape-proxy SV2 share-gating proxy stratum-mining/sv2-apps#536

Closed

6 tasks

gimballock force-pushed the vardiff/simulation-framework branch from 006363a to a58132b Compare June 3, 2026 04:51

Eric Price and others added 5 commits June 4, 2026 12:57

gimballock force-pushed the vardiff/simulation-framework branch from e0213c8 to 6aa27a9 Compare June 4, 2026 16:59

Eric Price and others added 3 commits June 4, 2026 13:00

style: apply rustfmt to vardiff modules

adab1ee

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vnprc mentioned this pull request Jun 5, 2026

Socratic Seminar 53 TriangleBitDevs/TriangleBitDevs.github.io#55

Closed

Eric Price and others added 2 commits June 5, 2026 20:58

refactor(vardiff): rename transition_shares to spm_threshold

007dcb4

gimballock mentioned this pull request Jun 9, 2026

feat(vardiff): replace threshold-ladder with adaptive EWMA algorithm #2188

Open

4 tasks

Eric Price and others added 9 commits June 18, 2026 00:04




3 participants
