Add ~~find_best~~ profileset_cull feature for profileset early elimination #10597
Base branch: thewarwithin
Conversation
I've always wanted this, but never sat down to actually get it going. Thanks for your work on this PR so far!

I find the naming of variables in this commit a little dense and hard to parse; I've often seen this sort of technique referred to as "profileset culling" or the like. Instead of checking whether the target-error CIs are disjoint (which is more or less what is occurring in Raidbots), consider a two-sample t-test to determine whether the means are not equal; that may be a better strategy. I would also like the option to use a variety of culling methods, if only for the ability to experiment with alternate schemes. Much of the code could become a member function of
engine/report/json/report_json.cpp (outdated)
```cpp
if ( sim.find_best.enabled )
{
  obj[ "find_best_best" ] = ( sim.find_best.best_name == profileset->name() );
  if ( sim.find_best.best_name == profileset->name() )
  {
    obj[ "find_best_best_error" ] = sim.find_best.best_error;
  }
}
```
Is there any particular reason why this is repeated for each profileset? Seems like attaching it to the baseline sim would make more sense. Also, there should probably be some indication that the find_best mode is on in the HTML report.
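A hedged sketch of that suggestion, reusing the `obj[ ... ] =` JSON-output style from the snippet above; the `root` object, the nested-indexing usage, and the field names are illustrative assumptions, not the PR's actual output:

```cpp
// Emit the culling summary once, attached to the top-level JSON node, instead
// of repeating the best-name comparison for every profileset entry.
if ( sim.find_best.enabled )
{
  auto cull = root[ "profileset_cull" ];  // assumed nested-object support
  cull[ "enabled" ]    = true;
  cull[ "best_name" ]  = sim.find_best.best_name;
  cull[ "best_error" ] = sim.find_best.best_error;
}
```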
engine/sim/profileset.cpp (outdated)
```cpp
if ( profile_sim->find_best_eliminated )
{
  fmt::print( stderr, "\nProfileset '{}' early-stopped: {}\n", set.name(), profile_sim->find_best_reason );
}
```
This doesn't look like an early stop. You eliminate a profileset by telling simc that it ran enough iterations, not that the sim failed.
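A rough sketch of that idea; `shrink_work_to_current()` is a hypothetical helper standing in for whatever `work_queue_t` mechanism would actually cap the remaining iterations, and `should_cull` stands for the result of the statistical comparison:

```cpp
// Cull by letting the child sim finish "successfully" at its current
// iteration count, rather than reporting it as a failed run.
if ( should_cull )
{
  profile_sim->find_best_eliminated = true;  // record the decision for reporting
  shrink_work_to_current( *profile_sim );    // hypothetical: hand out no more work
  profile_sim->interrupt();                  // wake waiting threads so the sim drains
}
```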
engine/sim/sim.cpp (outdated)
```cpp
double abs_error = ( current_error / 100.0 ) * current_mean;

// ensure enough iterations
if ( work_queue->progress().current_iterations >= s.min_iterations )
```
Is it necessary to keep asking work_queue->progress() multiple times? Also, does it work correctly when strict_work_queue is on, in which case the children don't share the same work queue.
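On the first point, a minimal sketch of taking one snapshot and reusing it (assuming `progress()` returns simc's `sim_progress_t` by value):

```cpp
// Query the queue once per evaluation pass instead of re-locking it for
// every individual check.
auto progress = work_queue->progress();
if ( progress.current_iterations >= s.min_iterations )
{
  // ... evaluate elimination using progress.current_iterations ...
}
```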
engine/sim/sim.hpp (outdated)
```cpp
double best_mean = 0.0;                 // mean of current best
double best_error = 0.0;                // absolute half-width (same units as mean)
unsigned best_iterations = 0;           // iterations when last updated
bool best_precision_satisfied = false;  // winner precision threshold reached
```
This doesn't seem to do anything. You only ever write into it.
engine/sim/sim.hpp (outdated)
```cpp
bool best_precision_satisfied = false;  // winner precision threshold reached
// Configuration copied from options (stored here for child sims to read)
int min_iterations = 500;        // minimum iterations before evaluating elimination
double winner_precision = -1.0;  // percent relative error threshold (same unit as target_error/current_error)
```
This is only used with best_precision_satisfied (see above) and as such doesn't do anything.
engine/sim/sim.cpp (outdated)
```cpp
}

// If no best yet, promote self
if ( s.best_name.empty() )
```
What about the baseline sim? Seems like a waste to ignore it when it typically finishes (with full precision) before any of the profilesets even start.
Agreed. I've changed the approach to start from the baseline sim.
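A minimal sketch of that change, assuming the `profileset_cull_state_t` fields shown in the sim.hpp snippets above plus a mutex member (the commit message mentions mutex protection); how the baseline's mean and CI half-width are extracted is left abstract:

```cpp
#include <mutex>

// Promote the fully converged baseline result as the initial "best", so the
// very first profilesets already have a comparison target.
void seed_from_baseline( sim_t::profileset_cull_state_t& s,
                         double baseline_mean, double baseline_abs_error,
                         unsigned baseline_iterations )
{
  std::lock_guard<std::mutex> guard( s.mutex );  // state is shared across child sims
  s.best_name       = "baseline";
  s.best_mean       = baseline_mean;
  s.best_error      = baseline_abs_error;
  s.best_iterations = baseline_iterations;
}
```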
engine/sim/sim.cpp (outdated)
```cpp
double safety = s.elim_safety_margin_frac > 0 ? s.elim_safety_margin_frac * s.best_mean : 0.0;
double candidate_upper = current_mean + abs_error + safety;
double best_lower = s.best_mean - s.best_error;
```
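For context, the comparison these bounds feed amounts to the following predicate (a restatement of the snippet with the final check spelled out, not the PR's verbatim code):

```cpp
// Cull the candidate when even its optimistic upper bound (mean + CI
// half-width + safety margin) cannot reach the best's pessimistic lower
// bound, i.e. the two confidence intervals are disjoint.
bool ci_overlap_cull( double current_mean, double abs_error,
                      double best_mean, double best_error, double safety )
{
  double candidate_upper = current_mean + abs_error + safety;
  double best_lower      = best_mean - best_error;
  return candidate_upper < best_lower;
}
```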
As kate mentioned, this could really use a proper statistical test. Especially the safety margin seems pretty ad-hoc.
There are some valid reasons to compare using a CI, but it's very (probably too) conservative. It was the easiest to implement given what you already had, but it's not the best it could be. The biggest barrier on the t-test is that we don't have a good statistics library - I'll add in a quick implementation, but it would be easier if we brought in Boost.Math. If you'd rather do that, let me know after you see the next patch.
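For reference, a dependency-free one-tailed Welch's t-test can stay quite small. This is a hedged sketch rather than the PR's actual implementation; it approximates the t distribution's tail with the normal distribution, which is reasonable at the iteration counts simc runs (hundreds or more):

```cpp
#include <cmath>

struct welch_result
{
  double t;             // test statistic
  double df;            // Welch-Satterthwaite effective degrees of freedom
  double p_one_tailed;  // approximate one-tailed p-value under equal means
};

// One-tailed Welch's t-test for H1: mean_a > mean_b, with unequal variances.
welch_result welch_t_test( double mean_a, double stddev_a, double n_a,
                           double mean_b, double stddev_b, double n_b )
{
  double var_a = ( stddev_a * stddev_a ) / n_a;  // squared standard error of a
  double var_b = ( stddev_b * stddev_b ) / n_b;

  double t = ( mean_a - mean_b ) / std::sqrt( var_a + var_b );

  double df = ( var_a + var_b ) * ( var_a + var_b ) /
              ( var_a * var_a / ( n_a - 1 ) + var_b * var_b / ( n_b - 1 ) );

  // Normal approximation to the upper tail: P( T > t ) ~= 0.5 * erfc( t / sqrt(2) ),
  // accurate for df in the hundreds or more.
  double p = 0.5 * std::erfc( t / std::sqrt( 2.0 ) );
  return { t, df, p };
}
```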
engine/sim/sim.cpp (outdated)
```cpp
work_queue->unlock();
interrupt();
```
work_queue uses a reentrant lock partly so that you can interrupt the sim without having to let the lock go.
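Restating that observation against the snippet above, the unlock/interrupt dance can collapse to:

```cpp
// The work queue's lock is reentrant, so the sim can be interrupted while
// the lock is still held; there is no need to release it first.
work_queue->lock();
// ... decide, under the lock, that this profileset should be culled ...
interrupt();  // safe even though we still hold the lock
work_queue->unlock();
```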
Thank you!
engine/sim/sim.cpp (outdated)
```cpp
add_option( opt_bool( "find_best", find_best.enabled ) );
add_option( opt_string( "find_best_metric", find_best_metric_str ) );
add_option( opt_int( "find_best_min_iterations", find_best.min_iterations ) );
add_option( opt_float( "find_best_winner_precision", find_best.winner_precision ) );
add_option( opt_float( "find_best_elim_safety_margin", find_best.elim_safety_margin_frac ) );
add_option( opt_int( "find_best_verbose", find_best.verbose ) );
```
A minor point, but maybe a more descriptive name would be better for the options (and the whole mode). "Finding best" is kinda what the sim does normally, so this doesn't really provide any more info about what's going on.
I'm going to adopt profileset_cull as the prefix for everything going forward.
Thanks for the great feedback! I'll share some thoughts and another patch in the next day or so!
Force-pushed from 992953d to 11d49c0
Okay - I cleaned up the approach, added the t-test, and made it the default. It's a little messy to pick between the CI approach and the t-test approach, but I've tried to keep it clean enough that it could be refactored into something more 'strategy pattern'-like if we want to go down that path. Also added the HTML and JSON report outputs.

I like the t-test approach - it is more 'optimistic' than the CI approach. One 'theoretical' concern is that we are doing a lot of repeated comparisons, so we might want to either make alpha smaller by default, or take an approach that is more theoretically sound for this kind of sequential arm search. (Interestingly enough, that ends up looking more like the confidence-interval approach, but with a different method for calculating the bounds.)

Let me know if you have any other feedback here.
Oh, I added a quick and dirty t-test into the code. If we'd rather use Boost.Math I can change that, but as I mentioned above I didn't see that anywhere in the project so I didn't want to add a dependency.
I'll go over the code in more detail later, so I'm just commenting on some stuff that stood out when I took a quick look.
Regarding the statistical test: I played with it some more after writing the initial comment and while the CI overlap testing is pretty conservative (and perhaps not entirely statistically sound) it might not necessarily be a bad thing. After trying it with some dummy data, when the 95% CIs just barely didn't overlap, a proper statistical test (a one-tailed Welch's t-test) gave me p-values of around 0.002-0.003. So yeah, I agree with your point: if we're gonna be culling potentially thousands of profilesets, a lower alpha is probably warranted (I'm thinking 0.005?).
The general approach of simc is to keep the dependencies pretty light, so we'd rather avoid bringing in Boost.Math for a single statistical test.
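As a sanity check on those numbers, using the Welch sketch from the earlier comment: when two equal-width 95% CIs just barely fail to overlap, the means sit 2 * 1.96 standard errors apart, so t = 3.92 / sqrt(2), roughly 2.77, and the one-tailed normal-approximation p-value comes out around 0.0028, squarely in the quoted range. The values below are synthetic:

```cpp
#include <cmath>
#include <cstdio>

// Assumes the welch_t_test() sketch from the earlier comment is in scope.
int main()
{
  double n  = 10000.0;              // iterations per sample
  double se = 1.0;                  // per-sample standard error of the mean
  double sd = se * std::sqrt( n );  // sample stddev implying that SE

  // Sample A sits 2 * 1.96 SEs above sample B: their 95% CIs exactly touch.
  auto r = welch_t_test( 2.0 * 1.96 * se, sd, n, 0.0, sd, n );
  std::printf( "t = %.3f, p = %.4f\n", r.t, r.p_one_tailed );  // ~2.772, ~0.0028
  return 0;
}
```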
The 'traditional' method would be to take a single alpha (0.05) and divide it by the number of comparisons (profilesets, or arms, k); this is called a 'Bonferroni correction', and it gives a much smaller value than you're suggesting (0.05 / 1000 -> 0.00005). As you saw, this is a VERY conservative number and wouldn't prune many, if any, arms. Other classical approaches refine this, but have requirements that don't fit our 'sequential' model of testing each profileset (arm); Holm-Bonferroni, for example, requires all the p-values to exist and be sorted.

There is a whole sub-field of stats here (SAVI) we could get into for the exact type of test that is right if we want to be very 'tight'. Or we could start with the CI approach, which is a little hand-wavy, and add in the 'more correct' approach later. Or we could do the research and add in the best option now!
We don't need to pull in the library, but https://github.com/gostevehoward/confseq is an example of the 'theoretically correct' way to do this.
engine/sim/sim.cpp (outdated)
```cpp
if ( util::str_compare_ci( value, "ci" ) )
  sim->profileset_cull.method = sim_t::profileset_cull_state_t::CI_OVERLAP;
else if ( util::str_compare_ci( value, "t_test" ) )
  sim->profileset_cull.method = sim_t::profileset_cull_state_t::T_TEST;
else
{
  sim->error( "Invalid profileset_cull_method '{}', valid options: ci, t_test", value );
  return false;
```
There are some enum helper functions that can automatically derive a parser (i.e. string -> enum function) when given an enum -> string function. Not really a big deal, but might be convenient in case more methods are added down the line.
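A generic illustration of that pattern (not simc's actual helpers, whose signatures may differ): given an enum -> string function, the parser falls out of a loop over the enum's contiguous range.

```cpp
#include <string_view>

// Derive a string -> enum parser from an enum -> string function.
// A real helper would compare case-insensitively, as the snippet above does.
template <typename Enum, int First, int Last, const char* ( *ToString )( Enum )>
Enum parse_enum( std::string_view value, Enum fallback )
{
  for ( int i = First; i <= Last; ++i )
  {
    auto e = static_cast<Enum>( i );
    if ( value == ToString( e ) )
      return e;
  }
  return fallback;
}
```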
To use those, I'd have to move the enum out of being declared inline, and touch a bunch of extra files. I cleaned this up by moving the parsing out of line here. If you want the full change (move the enum declaration to sc_enum.hpp, etc) I can do that too, but it seemed like more than you intended.
PR comments addressed. Let me know if you have any other feedback!
Implements statistical confidence-based early elimination for profilesets to avoid simulating clearly inferior gear combinations. Uses Bayesian confidence intervals to compare candidates against the current best, with configurable safety margins.

Key features:
- Early elimination based on statistical confidence intervals
- Default verbosity 0 (quiet), always shows new best discoveries
- Configurable minimum iterations and winner precision thresholds
- Thread-safe implementation with mutex protection
- Integration with existing profileset workflow

Options:
- find_best=1 to enable
- find_best_verbose=N for verbosity control
- find_best_min_iterations=N (default 500)
- find_best_winner_precision=N (default inherits target_error)
- find_best_safety_margin=N (default 0.001)

(cherry picked from commit d4cf4eea35ecbea4a39cd755ebe35abe7bc4e028)
(cherry picked from commit a5113b7f59c0221046ffa0b38d278bc90cf32dc1)
(cherry picked from commit 010eead3109828e57c9d88b5f6ba1338b8ce3f9d)
(cherry picked from commit d1e3e536466fdb0d86de5d4256194d146a423346)
(cherry picked from commit 12472f597fdf91cb43d18d8dbec1c53fd9b3eaa8)
(cherry picked from commit 7de1c5a7b1e26a4ea6012274f4fef4e4409f3d0a)
(cherry picked from commit c61fcd4e3d93c97fedee0582af60290554a76f7a)
…N, and simplified iteration counts. Cleaned up method parsing
(cherry picked from commit 84d2402bb345925c1d6c46bbfda93d95090b3547)
(cherry picked from commit c3ac34bb71cb1d08cf631de46318859d451eab58)
Force-pushed from a3e0bfb to bd87830
Rebased
Finally had some time to go over it in more detail and I don't think I have any other comments. I'll try to play with it to get a feel for the culling with some real profiles and I'll poke another dev to take a look.
Thanks for the update!
```cpp
double mean_total = 0;
int mean_count = 0;
double current_standard_error = 0.0;
```
The value of this variable is never set outside of single_actor_batch sims, so the t-test always sees a standard error of 0, breaking the t_test culling logic.
Hmm, that's a good catch, but I'm not exactly sure how to test a fix. I haven't used multi-actor simulation with profilesets. I see how to get a valid SE in the other arm of !single_actor_batch, but I'm not really sure what the right behavior would be. Do you have a good test case I could validate against?

Alternatively, I can try some multi-actor tests and just figure it out.
It looks like there are already hacks in the case of multi-actor metrics. Perhaps we should just fail if culling is used with multi-actor sims? Happy to take a suggestion.
The only reason I noticed is that it came up in my testing. It doesn't even have to be multi-actor: just a base profile without single_actor_batch=1 set, plus the profilesets. My test case was as follows:

- a basic Unholy Death Knight profile
- 46k talent-combination profilesets

The best solution is probably throwing an error in the is_multiactor_metric check if it's trying to use culling, and setting the variable value in the else branch, though.
Thinking on the above comment a bit more, it might be better to just set the variable regardless of whether it's multi-actor or not, as throwing an error for multi-actor would just prevent it from working on Augmentation sims.
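A hedged sketch of that direction, assuming simc's extended sample data exposes the standard error of the mean as `mean_std_dev`; the aggregate-metric source in the multi-actor arm is a guess, not verified against the codebase:

```cpp
// Always populate the standard error, whether or not single_actor_batch is
// set, so the t-test never silently sees an SE of 0.
double current_standard_error = 0.0;
if ( sim.single_actor_batch )
{
  // Per-actor metric: SE of the current actor's DPS mean.
  const auto& dps = sim.player_no_pet_list[ sim.current_index ]->collected_data.dps;
  current_standard_error = dps.mean_std_dev;
}
else
{
  // Multi-actor: fall back to the aggregate raid metric rather than skipping.
  current_standard_error = sim.raid_dps.mean_std_dev;
}
```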
Thanks - I'll have some time on the weekend to work on this. I'll make a change and push a patch once I have it.
Raidbots was struggling today, so I tried running simc locally and noticed it was trying to hit the same target error for each run, even when it's clear that a profileset is not going to be the 'winner'. If the user cares about finding the 'best' configuration over a number of profilesets, this will speed it up greatly.
Implements statistical confidence-based early elimination for profilesets to avoid simulating clearly inferior gear combinations. Uses confidence intervals to compare candidates against the current best, with configurable safety margins.
Options:
- find_best=1 to enable
- find_best_verbose=N for verbosity control
- find_best_min_iterations=N (default 500)
- find_best_winner_precision=N (default inherits target_error)
- find_best_safety_margin=N (default 0.001)