fix: support distribution bug #64
```python
    result["Support Score"], bins=bins, include_lowest=True, right=False
)
bin_width = 0.2 / 5  # 0.04
bins = np.arange(0.0, 1.0 + bin_width, bin_width)  # 0.00 ... 1.00
```
Don't the bin width and bins depend on the number of data points? I'm not following the 0.2 / 5 logic here.
Yeah, that's true, but in our case we want fixed-width bins for consistency with the charts we show in the API and model cards.
The initial implementation was correct, but it used a larger bin width (0.1 steps), which aggregated the scores over wider intervals and made the distribution look different from the API/model-card charts.
So now we're shrinking the bin width to a finer scale (0.04) so it matches how the charts in the experiments are plotted, and also because we're feeding the per-bin intervals to the FE; see the sketch after this list.
- We want the x-axis ticks to stay at 0.0, 0.2, 0.4, …, 1.0 for readability.
- Between each pair of ticks (e.g. 0.0–0.2), we want 5 histogram bars.
- That means each bin needs to cover 0.2 / 5 = 0.04 of support score.
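
Here's a minimal sketch of that arithmetic, assuming numpy; the example scores are made up:

```python
import numpy as np

# Ticks stay at 0.0, 0.2, ..., 1.0; 5 bars between each pair of ticks -> 0.04-wide bins.
bin_width = 0.2 / 5                                # 0.04
bins = np.arange(0.0, 1.0 + bin_width, bin_width)  # edges 0.00, 0.04, ..., 1.00

# Made-up scores; np.histogram counts how many fall into each 0.04-wide bin.
scores = np.array([0.03, 0.05, 0.41, 0.42, 0.97])
counts, edges = np.histogram(scores, bins=bins)
assert len(counts) == 25  # 5 bars per tick interval x 5 tick intervals
```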
```python
institution_id=cfg.institution_id,
automl_run_id=cfg.model.run_id,
modeling_dataset_name=cfg.datasets.silver["modeling"].train_table_path,
modeling_df=df_test,  # this path expects a dataframe
```
I thought that this function pulls the data directly from the AutoML run? Why do we need to also add a dataframe as an input here?
Oops, nevermind, you're right that this expects a dataframe. Good catch!!
Hey @Mesh-ach, removing the sampling makes sense here, but I'm struggling to follow why we need to refactor the source code and the binning logic. How has that been creating the discrepancy we're seeing between the table and the training artifact?
Since we aren't using templates anymore for PDP, I'm curious to see if this fixes any discrepancies seen in our training PDP pipeline as well. We don't need to make changes in the scripts in our pipeline, right?
https://github.com/datakind/edvise/blob/develop/src/edvise/scripts/training_h2o.py
https://github.com/datakind/edvise/blob/develop/src/edvise/scripts/predictions_h2o.py
That's why I thought to change the plotting code itself instead of adjusting the template notebooks to use the data directly from experiments. As long as we aren't sampling within the PDP notebooks when creating the support distribution table, I think we should be fine. Also, I assume PDP uses the same function, right? This change ensures the data we generate for the FE support distribution table closely matches how seaborn plots the histogram we log in the experiments.
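
A small sketch of that consistency point, assuming the experiment chart is a seaborn histogram; the DataFrame and score values here are illustrative:

```python
import numpy as np
import pandas as pd
import seaborn as sns

bin_width = 0.2 / 5  # 0.04
bins = np.arange(0.0, 1.0 + bin_width, bin_width)

# Hypothetical support scores for a cohort.
result = pd.DataFrame({"Support Score": np.random.default_rng(0).uniform(0.0, 1.0, 500)})

# Counts fed to the FE table, computed on the fixed bin edges...
counts, edges = np.histogram(result["Support Score"], bins=bins)

# ...line up with the experiment chart, which uses the same bin width and range.
ax = sns.histplot(data=result, x="Support Score", binwidth=bin_width, binrange=(0.0, 1.0))
```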
I made a couple of changes in this PR that should fix the support score distribution problem:
I removed the test sampling step. We originally added it because SHAP computation was expensive, but that's no longer an issue with h2o (thanks @vishpillai123). The problem with sampling was that we fed the sampled data directly into the support score distribution generator, which skewed the output because it no longer reflected the natural distribution of the data.
I refactored the support score distribution function. Previously, it tried to create binned output with a groupby and then counted students in 0.2 increments. Now we retrieve the bin results directly from np.histogram, which is more accurate and removes the need to recompute bin classes. I also made the bin width smaller, so we now use 0.04 increments.
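
As a rough sketch of what that refactor looks like (illustrative only; the function name `support_score_distribution_table` and the column names are hypothetical, not the actual edvise API):

```python
import numpy as np
import pandas as pd

def support_score_distribution_table(scores: pd.Series, bin_width: float = 0.2 / 5) -> pd.DataFrame:
    """Sketch: bin support scores with np.histogram instead of a groupby over 0.2 increments."""
    bins = np.arange(0.0, 1.0 + bin_width, bin_width)
    counts, edges = np.histogram(scores.dropna(), bins=bins)
    return pd.DataFrame(
        {
            "bin_start": np.round(edges[:-1], 2),
            "bin_end": np.round(edges[1:], 2),
            "count": counts,
        }
    )

# Example usage with made-up scores:
# support_score_distribution_table(pd.Series([0.03, 0.05, 0.41, 0.42, 0.97]))
```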
Overall, initial tests show this works really well and closely matches how we currently generate the histogram support score plot in experiments. I’ve added an image showing the output for Collin County using this updated code.
*actual image from experiments*
