fix: treat pandas Boolean as numeric in clinical_kernel #590
Merged
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 0 |
| Duplication | 0 |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #590 +/- ##
=======================================
Coverage 98.46% 98.46%
=======================================
Files 38 38
Lines 3713 3715 +2
Branches 480 481 +1
=======================================
+ Hits 3656 3658 +2
Misses 27 27
Partials 30 30
Checklist
What does this implement/fix? Explain your changes
Fixes the two divergent Boolean-handling bugs in `sksurv/kernels/clinical.py` described in #589:

- `clinical_kernel` silent drop. `_get_continuous_and_ordinal_array` used `x.select_dtypes(include=[np.number])`, which excludes `np.bool_` because numpy treats it as a sibling of `np.number` rather than a subclass. Boolean columns matched neither the numeric nor the `object`/`category` filter, were silently dropped, and the resulting matrix was biased by `(n_features - n_bool_cols) / n_features` because normalization still used `x.shape[1]`.
  Fix: `select_dtypes(include=[np.number, "bool"])`.
- `ClinicalKernelTransform.fit` TypeError. `_prepare_by_column_dtype` uses `pandas.api.types.is_numeric_dtype`, which returns `True` for Boolean, but `col.max() - col.min()` then failed with `TypeError: numpy boolean subtract` on numpy ≥ 1.25.
  Fix: cast Boolean columns to `np.uint8` before computing the range.

Both fixes align the pandas path with the policy already established in this codebase: `pandas.api.types.is_numeric_dtype(bool)` is `True`. After this change, `clinical_kernel(df)` and `clinical_kernel(df.astype({col: 'uint8'}))` produce identical kernel matrices, and `ClinicalKernelTransform().fit(df)` no longer crashes.

Tests
Added `TestClinicalKernel.test_bool_column_treated_as_numeric` asserting:

- `clinical_kernel` returns the same matrix for a Boolean column and its `uint8` equivalent.
- `ClinicalKernelTransform().fit` produces matching `_numeric_ranges` and `X_fit_` for both dtypes.
- The Boolean column is listed in `_numeric_columns`, not `_nominal_columns`.

Local verification:

- `pytest tests/` (including slow tests): 966 passed, 48 skipped, 0 failed.
- `pytest --doctest-modules --pyargs sksurv.kernels`: 1 passed.
- `ruff check` on the whole repo: clean.
- `pre-commit run` on changed files: all hooks pass.

Behavior change note
Users whose pipelines previously called `clinical_kernel` on a pandas frame containing Boolean columns will see different (corrected) kernel values after this change. The previous values were silently biased, so the change is a bug fix rather than an API change, but it may be worth flagging in the changelog.
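
For reviewers who want to see the dtype behaviour behind both fixes in isolation, here is a minimal sketch using only numpy and pandas. It is not the sksurv code: the toy frame and its column names (`age`, `treated`, `stage`) are hypothetical, and the range computation only mimics the `col.max() - col.min()` step described above.

```python
import numpy as np
import pandas as pd

# np.bool_ sits outside the np.number hierarchy, so filtering on
# np.number alone does not match Boolean columns.
assert not issubclass(np.bool_, np.number)

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [61.0, 48.0, 75.0],
        "treated": [True, False, True],
        "stage": pd.Categorical(["I", "II", "I"]),
    }
)

# Old filter: the Boolean column is not selected.
numeric_only = df.select_dtypes(include=[np.number])
# Filter with "bool" added: the Boolean column is kept.
numeric_and_bool = df.select_dtypes(include=[np.number, "bool"])

print(list(numeric_only.columns))
print(list(numeric_and_bool.columns))

# pandas classifies bool as numeric, which is the policy the fixes follow.
assert pd.api.types.is_numeric_dtype(df["treated"])

# Casting to uint8 first gives max() - min() ordinary integer semantics.
col = df["treated"].astype(np.uint8)
value_range = col.max() - col.min()
print(int(value_range))
```

The cast to `uint8` keeps the values 0 and 1, so the computed range matches what a user would get after converting the column by hand.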