
Conversation

@thebigjc commented Sep 3, 2025

Raidbots was struggling today, so I tried running simc locally and noticed that it tries to hit the same target error for every profileset, even when it is already clear that a profileset is not going to be the 'winner'. If the user cares about finding the 'best' configuration across a number of profilesets, this speeds things up greatly.

Implements statistical confidence-based early elimination for profilesets to avoid simulating clearly inferior gear combinations. Uses confidence intervals to compare candidates against the current best, with configurable safety margins.

Options:

  • profileset_cull=1 to enable
  • profileset_cull_verbose=N for verbosity control
  • profileset_cull_method={t_test (default), ci}
  • profileset_cull_min_iterations=N (default 100; minimum 30, below which the t-test breaks down)
  • profileset_cull_alpha=N (default 0.05, used as the alpha for the t-test)
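
A minimal usage sketch (a hypothetical input-file snippet; only the profileset_cull options come from this PR):

# enable early culling with the default t-test method
profileset_cull=1
profileset_cull_method=t_test
profileset_cull_alpha=0.05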

@renanthera (Member) commented Sep 8, 2025

I've always wanted this, but never sat down to actually get this going. Thanks for your work on this PR so far!

I find the naming of variables in this commit a little dense and hard to parse. I've often seen this sort of technique referred to as "profileset culling", or something like that.

Instead of checking whether the target-error CIs are disjoint (which is more or less what Raidbots does), consider a two-sample t-test to determine whether the means differ; that may be a better strategy. I would also like the option to use a variety of culling methods, if only for the ability to experiment with alternate schemes.

Much of the code could become a member function of find_best_state_t, which would allow culling strategies to be more modular.
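
(For illustration, a minimal sketch of such a test; this is not the PR's implementation. It assumes each profileset exposes its mean, the standard error of that mean, and its iteration count, and it uses the normal approximation to the t distribution, which is reasonable once 100+ iterations have run:)

#include <cmath>

// One-tailed Welch-style test: approximate p-value for H1 "best_mean > cand_mean".
double welch_one_tailed_p( double best_mean, double best_se, unsigned best_n,
                           double cand_mean, double cand_se, unsigned cand_n )
{
  double var_sum = best_se * best_se + cand_se * cand_se;
  if ( var_sum <= 0.0 || best_n < 2 || cand_n < 2 )
    return 1.0;  // degenerate input: never cull

  double t = ( best_mean - cand_mean ) / std::sqrt( var_sum );

  // Upper-tail probability of a standard normal at t.
  return 0.5 * std::erfc( t / std::sqrt( 2.0 ) );
}

// A candidate would be culled when welch_one_tailed_p( ... ) < alpha.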

Comment on lines 1025 to 1042
if ( sim.find_best.enabled )
{
  obj[ "find_best_best" ] = ( sim.find_best.best_name == profileset->name() );
  if ( sim.find_best.best_name == profileset->name() )
  {
    obj[ "find_best_best_error" ] = sim.find_best.best_error;
  }
}

Contributor

Is there any particular reason why this is repeated for each profileset? Seems like attaching it to the baseline sim would make more sense. Also, there should probably be some indication that the find_best mode is on in the HTML report.

Comment on lines 141 to 144
if ( profile_sim->find_best_eliminated )
{
  fmt::print( stderr, "\nProfileset '{}' early-stopped: {}\n", set.name(), profile_sim->find_best_reason );
}
Contributor

This doesn't look like an early stop. You eliminate a profileset by telling simc that it ran enough iterations, not that the sim failed.

double abs_error = ( current_error / 100.0 ) * current_mean;

// ensure enough iterations
if ( work_queue->progress().current_iterations >= s.min_iterations )
@vituscze (Contributor) commented Sep 8, 2025

Is it necessary to keep calling work_queue->progress() multiple times? Also, does it work correctly when strict_work_queue is on? In that case the children don't share the same work queue.

double best_mean = 0.0; // mean of current best
double best_error = 0.0; // absolute half-width (same units as mean)
unsigned best_iterations = 0; // iterations when last updated
bool best_precision_satisfied = false; // winner precision threshold reached
Contributor

This doesn't seem to do anything. You only ever write into it.

bool best_precision_satisfied = false; // winner precision threshold reached
// Configuration copied from options (stored here for child sims to read)
int min_iterations = 500; // minimum iterations before evaluating elimination
double winner_precision = -1.0; // percent relative error threshold (same unit as target_error/current_error)
Contributor

This is only used with best_precision_satisfied (see above) and as such doesn't do anything.

}

// If no best yet, promote self
if ( s.best_name.empty() )
Contributor

What about the baseline sim? Seems like a waste to ignore it when it typically finishes (with full precision) before any of the profilesets even start.

Author

Agreed. I've changed the approach to start from the baseline sim.

Comment on lines 2298 to 2300
double safety = s.elim_safety_margin_frac > 0 ? s.elim_safety_margin_frac * s.best_mean : 0.0;
double candidate_upper = current_mean + abs_error + safety;
double best_lower = s.best_mean - s.best_error;
Contributor

As kate mentioned, this could really use a proper statistical test. Especially the safety margin seems pretty ad-hoc.

Author

There are some valid reasons to compare using a CI, but it's very (probably too) conservative. It was the easiest to implement given what you already had, but it's not the best it could be. The biggest barrier to the t-test is that we don't have a good statistics library. I'll add in a quick implementation, but it would be easier if we brought in Boost.Math. If you'd rather do that, let me know after you see the next patch.

Comment on lines 2324 to 2345
work_queue->unlock();
interrupt();
Contributor

work_queue uses a reentrant lock partly so that you can interrupt the sim without having to let the lock go.
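
(A generic illustration of the reentrancy point, not simc's actual work_queue API: a recursive mutex lets a thread that already holds the lock acquire it again, so an interrupt can be issued without releasing the lock first:)

#include <mutex>

std::recursive_mutex queue_mutex;  // hypothetical stand-in for the queue's lock

void interrupt()
{
  // Safe even if the caller already holds queue_mutex on this thread.
  std::lock_guard<std::recursive_mutex> lock( queue_mutex );
  // ... flag remaining work as cancelled ...
}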

Author

Thank you!

Comment on lines 3871 to 3876
add_option( opt_bool( "find_best", find_best.enabled ) );
add_option( opt_string( "find_best_metric", find_best_metric_str ) );
add_option( opt_int( "find_best_min_iterations", find_best.min_iterations ) );
add_option( opt_float( "find_best_winner_precision", find_best.winner_precision ) );
add_option( opt_float( "find_best_elim_safety_margin", find_best.elim_safety_margin_frac ) );
add_option( opt_int( "find_best_verbose", find_best.verbose ) );
Contributor

A minor point, but maybe a more descriptive name would be better for the options (and the whole mode). "Finding best" is kinda what the sim does normally, so this doesn't really provide any more info about what's going on.

Author

I'm going to adopt profileset_cull as the prefix for everything going forward.

@thebigjc (Author) commented Sep 8, 2025

Thanks for the great feedback! I'll share some thoughts and another patch in the next day or so!

@thebigjc (Author)

Okay - I cleaned up the approach, added the t-test, and made it the default. It's a little messy to pick between the CI approach and the t-test approach, but I've tried to keep it clean enough that it could be refactored into something more 'strategy pattern'-like if we want to go down that path.

Also added the HTML and JSON report outputs.

I like the t-test approach - it is more 'optimistic' than the CI approach. One 'theoretical' concern is that we are doing a lot of repeated comparisons, so we might want to either make alpha smaller by default or take an approach that is more theoretically sound for this kind of sequential arm search. (Interestingly enough, that ends up looking more like the confidence interval approach, but with a different method for calculating the bounds.)

Let me know if you have any other feedback here.

@thebigjc changed the title from "Add find_best feature for profileset early elimination" to "Add profileset_cull feature for profileset early elimination" on Sep 10, 2025
@thebigjc (Author)

Oh, I added a quick and dirty t-test into the code. If we'd rather use Boost.Math I can change that, but as I mentioned above I didn't see that anywhere in the project so I didn't want to add a dependency.

@vituscze (Contributor) left a comment

I'll go over the code in more detail later, so I'm just commenting on some stuff that stood out when I took a quick look.

Regarding the statistical test: I played with it some more after writing the initial comment and while the CI overlap testing is pretty conservative (and perhaps not entirely statistically sound) it might not necessarily be a bad thing. After trying it with some dummy data, when the 95% CIs just barely didn't overlap, a proper statistical test (a one-tailed Welch's t-test) gave me p-values of around 0.002-0.003. So yeah, I agree with your point: if we're gonna be culling potentially thousands of profilesets, a lower alpha is probably warranted (I'm thinking 0.005?).

The general approach of simc is to keep the dependencies pretty light, so we'd rather avoid bringing in Boost.Math for a single statistical test.

@thebigjc (Author)

> Regarding the statistical test: I played with it some more after writing the initial comment and while the CI overlap testing is pretty conservative (and perhaps not entirely statistically sound) it might not necessarily be a bad thing. After trying it with some dummy data, when the 95% CIs just barely didn't overlap, a proper statistical test (a one-tailed Welch's t-test) gave me p-values of around 0.002-0.003. So yeah, I agree with your point: if we're gonna be culling potentially thousands of profilesets, a lower alpha is probably warranted (I'm thinking 0.005?).

The 'traditional' method would be to take a single alpha (0.05) and divide it by the number of comparisons (profilesets, or arms, k); this is called a 'Bonferroni correction' and gives a much smaller value than you're suggesting (0.05 / 1000 -> 0.00005). As you saw, this is a VERY conservative number and wouldn't prune many arms, if any. Other classical approaches refine this, but have requirements that don't fit our 'sequential' model of testing each profileset (arm); Holm-Bonferroni, for example, requires all the p-values to exist and be sorted.
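
(For reference, the correction itself is just a division; a minimal sketch:)

#include <algorithm>
#include <cstddef>

// Bonferroni correction: to hold the family-wise error rate at `alpha`
// across k comparisons, each individual test runs at alpha / k.
double bonferroni_alpha( double alpha, std::size_t k )
{
  return alpha / static_cast<double>( std::max<std::size_t>( k, 1 ) );
}

// bonferroni_alpha( 0.05, 1000 ) == 0.00005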

There is a whole sub-field of stats here (SAVI, safe anytime-valid inference) we could get into for the exact type of test that is right here if we want to be very 'tight', or we could start with the CI approach, which is a little hand-wavy, and add in the 'more correct' approach later.

Or we could do the research and add in the best option now!

@thebigjc (Author)

We don't need to pull in the library, but https://github.com/gostevehoward/confseq is an example of the 'theoretically correct' way to do this.

Comment on lines 4062 to 4068
if ( util::str_compare_ci( value, "ci" ) )
  sim->profileset_cull.method = sim_t::profileset_cull_state_t::CI_OVERLAP;
else if ( util::str_compare_ci( value, "t_test" ) )
  sim->profileset_cull.method = sim_t::profileset_cull_state_t::T_TEST;
else {
  sim->error( "Invalid profileset_cull_method '{}', valid options: ci, t_test", value );
  return false;
Contributor

There are some enum helper functions that can automatically derive a parser (i.e. string -> enum function) when given an enum -> string function. Not really a big deal, but might be convenient in case more methods are added down the line.
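
(A hypothetical sketch of that idea, not simc's actual helper functions: given an enum-to-string function and the enum's contiguous range, a string-to-enum parser falls out by scanning the range:)

#include <optional>
#include <string_view>

template <typename Enum, typename ToString>
std::optional<Enum> parse_enum( std::string_view s, Enum first, Enum last, ToString to_string )
{
  for ( int i = static_cast<int>( first ); i <= static_cast<int>( last ); ++i )
  {
    auto e = static_cast<Enum>( i );
    if ( s == to_string( e ) )  // compare against each enumerator's name
      return e;
  }
  return std::nullopt;
}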

Author

To use those, I'd have to move the enum out of its inline declaration and touch a bunch of extra files. I cleaned this up by moving the parsing out of line here. If you want the full change (move the enum declaration to sc_enum.hpp, etc.) I can do that too, but it seemed like more than you intended.

@thebigjc (Author)

PR comments addressed. Let me know if you have any other feedback!

Implements statistical confidence-based early elimination for profilesets
to avoid simulating clearly inferior gear combinations. Uses Bayesian
confidence intervals to compare candidates against the current best,
with configurable safety margins.

Key features:
- Early elimination based on statistical confidence intervals
- Default verbosity 0 (quiet), always shows new best discoveries
- Configurable minimum iterations and winner precision thresholds
- Thread-safe implementation with mutex protection
- Integration with existing profileset workflow

Options:
- find_best=1 to enable
- find_best_verbose=N for verbosity control
- find_best_min_iterations=N (default 500)
- find_best_winner_precision=N (default inherits target_error)
- find_best_safety_margin=N (default 0.001)

@thebigjc (Author)

Rebased

@vituscze (Contributor) left a comment

Finally had some time to go over it in more detail and I don't think I have any other comments. I'll try to play with it to get a feel for the culling with some real profiles and I'll poke another dev to take a look.

@thebigjc (Author)

Thanks for the update!


double mean_total=0;
int mean_count=0;
double current_standard_error = 0.0;
@nyterage (Member) commented Oct 21, 2025

The value of this variable is never set outside of single_actor_batch sims, so the t-test always sees a standard error of 0, which breaks the t_test culling logic.

Author

Hmm, that's a good catch, but I'm not exactly sure how to test a fix.

I haven't used multi-actor simulation with profilesets. I see how to get a valid se in the other (!single_actor_batch) arm, but I'm not really sure what the right behavior would be. Do you have a good test case I could validate against?

Alternatively, I can run some multi-actor tests and just figure it out.

Author

It looks like there are already hacks for the multi-actor metrics case. Perhaps we should just fail if culling is used with multi-actor sims? Happy to take a suggestion.

@nyterage (Member) commented Oct 21, 2025

The only reason I noticed is that it came up in my testing. It doesn't even have to be multi-actor: just a base profile without single_actor_batch=1 set, plus the profilesets.

My test case was as follows:

  • Basic Unholy Death Knight profile
  • 46k talent-combination profilesets

The best solution is probably throwing an error in the is_multiactor_metric check if it's trying to use culling, and setting the variable's value in the else branch.

Member

Thinking on the above comment a bit more, it might be better to just set the variable whether or not the sim is multi-actor, as throwing an error for multi-actor would prevent culling from working on Augmentation sims.

Author

Thanks - I'll have some time on the weekend to work on this. I'll make a change and push a patch once I have it.
