Conversation


@Mesh-ach Mesh-ach commented Nov 25, 2025

I made a couple of changes in this PR that should fix the support score distribution problem:

  1. I removed the test sampling step. We originally added it because SHAP computation was expensive, but that's no longer an issue with h2o, thanks @vishpillai123. The problem with sampling was that we fed the sampled data directly into the support score distribution generator, so the output no longer reflected the natural distribution of the data.

  2. I refactored the support score distribution function. Previously it built binned output with a groupby and then counted students in 0.2 increments. Now we take the bin results directly from np.histogram, which is more accurate and removes the need to recompute bin classes. I also made the bin width smaller, so we now use 0.04 increments (a rough before/after sketch follows below).
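For concreteness, a rough before/after sketch of that refactor (toy data; variable names are illustrative, not the exact code in this PR):

```python
import numpy as np
import pandas as pd

# Toy stand-in for result["Support Score"]
scores = pd.Series([0.13, 0.14, 0.37, 0.55, 0.91])

# Before (roughly): bin with pd.cut, then count students per bin via a groupby
coarse_bins = np.arange(0.0, 1.0 + 0.2, 0.2)
binned = pd.cut(scores, bins=coarse_bins, include_lowest=True, right=False)
old_counts = scores.groupby(binned, observed=False).size()

# After: np.histogram returns counts and bin edges in one call, so there
# is no need to recompute bin classes from interval labels afterwards
counts, edges = np.histogram(scores, bins=np.arange(0.0, 1.0 + 0.04, 0.04))
```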

Overall, initial tests show this works really well and closely matches how we currently generate the histogram support score plot in experiments. I’ve added an image showing the output for Collin County using this updated code.

[Image: support score distribution for Collin County generated with the updated code]

[Image: actual histogram from experiments]


@Mesh-ach Mesh-ach changed the base branch from main to develop November 25, 2025 03:52
@Mesh-ach Mesh-ach changed the title Fix support distribution bug fix: support distribution bug Nov 25, 2025
@Mesh-ach Mesh-ach marked this pull request as ready for review November 25, 2025 04:04
result["Support Score"], bins=bins, include_lowest=True, right=False
)
bin_width = 0.2 / 5 # 0.04
bins = np.arange(0.0, 1.0 + bin_width, bin_width) # 0.00 ... 1.00
Collaborator

Don't the bin width and bins depend on the number of data points? I'm not understanding the 0.2 / 5 logic here.

Collaborator Author

Yeah, that's true, but in our case we want a fixed-width bin for consistency with the charts we show in the API and model cards.

The initial implementation was correct, but it used a larger bin width (0.1 steps), which aggregated the scores over wider intervals and made the distribution look different across the API/model cards.

So now we shrink the bin width to a finer scale (0.04) so it matches how the charts in the experiments are plotted, and since we feed the per-bin intervals for each data point to the FE anyway:

  • We want the x-axis ticks to stay at 0.0, 0.2, 0.4, …, 1.0 for readability.
  • Between each pair of ticks (e.g. 0.0–0.2), we want 5 histogram bars.
  • That means each bin needs to cover 0.2 / 5 = 0.04 of support score.
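
A quick numeric check of that arithmetic in plain numpy (nothing project-specific assumed):

```python
import numpy as np

bin_width = 0.2 / 5                                # 0.04
bins = np.arange(0.0, 1.0 + bin_width, bin_width)  # 26 edges -> 25 bins
ticks = np.arange(0.0, 1.2, 0.2)                   # 0.0, 0.2, ..., 1.0

# Every major tick lands on a bin edge (every 5th edge), so the five
# bars between consecutive ticks tile each 0.2 interval exactly
assert np.allclose(bins[::5], ticks)
```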

institution_id=cfg.institution_id,
automl_run_id=cfg.model.run_id,
modeling_dataset_name=cfg.datasets.silver["modeling"].train_table_path,
modeling_df=df_test, # this path expects a dataframe
Collaborator

I thought this function pulled the data directly from the AutoML run? Why do we also need to pass a dataframe here?

Collaborator

Oops, never mind, you're right that this expects a dataframe. Good catch!!

Collaborator

@vishpillai123 vishpillai123 left a comment

hey @Mesh-ach, removing the sampling makes sense here, but I'm struggling to follow why we need to refactor the source code and the binning logic. How has that been creating the discrepancy we're seeing between the table and the training artifact?

Since we aren't using templates anymore for PDP, I'm curious to see if this fixes any discrepancies seen in our training PDP pipeline as well. We don't need to make changes in the scripts in our pipeline, right?

https://github.com/datakind/edvise/blob/develop/src/edvise/scripts/training_h2o.py
https://github.com/datakind/edvise/blob/develop/src/edvise/scripts/predictions_h2o.py


Mesh-ach commented Dec 1, 2025

> Since we aren't using templates anymore for PDP, I'm curious to see if this fixes any discrepancies seen in our training PDP pipeline as well. We don't need to make changes in the scripts in our pipeline, right?

That's why I changed the plotting code itself instead of adjusting the template notebooks to use the data directly from experiments. As long as we aren't sampling within the PDP notebooks when creating the support distribution table, I think we should be fine. Also, I assume PDP uses the same function, right?

This change ensures the data we generate for the FE support distribution table closely matches how seaborn plots the histogram we log in experiments.
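
For reference, a hypothetical version of the experiment-side seaborn plot with the same fixed bins (the actual plotting code in experiments may differ):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic support scores; stands in for the real test-set output
scores = np.random.default_rng(0).beta(2, 5, size=1_000)

# Fixed 0.04-wide bins over [0, 1], so the logged histogram and the
# FE support distribution table describe identical intervals
ax = sns.histplot(x=scores, binwidth=0.04, binrange=(0.0, 1.0))
ax.set_xticks(np.arange(0.0, 1.2, 0.2))
plt.savefig("support_score_distribution.png")
```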

@vishpillai123 vishpillai123 merged commit caaa105 into develop Dec 1, 2025
6 checks passed
@vishpillai123 vishpillai123 deleted the Fix-SupportDistribution-Bug branch December 1, 2025 19:40