
Conversation

@emersodb (Collaborator) commented Nov 3, 2025

PR Type

Refactor

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g7a81q

This PR addresses a TODO in the clustering code that called for reconsidering how the normalization functions are structured.

Tests Added

Tests were added for the functions materially touched in this refactor.


normalized_data = np.empty((matrix.shape[0], 0))

# Apply QuantileTransformer to each column and concatenate the results

Literally no idea why this was being done per column...


normalized_data = np.empty((matrix.shape[0], 0))

# Apply MinMaxScaler to each column and concatenate the results

Literally no idea why this was being done per column...
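The per-column pattern the snippets above refer to is indeed redundant: sklearn's scalers already fit each feature independently, so a single fit_transform over the whole matrix produces the same result as the old loop. A minimal sketch illustrating the equivalence (names here are illustrative, not the toolkit's actual code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

matrix = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Old pattern: transform each column separately and concatenate the results.
per_column = np.empty((matrix.shape[0], 0))
for i in range(matrix.shape[1]):
    column = MinMaxScaler().fit_transform(matrix[:, i : i + 1])
    per_column = np.concatenate((per_column, column), axis=1)

# New pattern: a single fit_transform over the full matrix. Scalers in
# sklearn compute statistics per feature, so this is equivalent.
whole_matrix = MinMaxScaler().fit_transform(matrix)

assert np.allclose(per_column, whole_matrix)
```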

@lotif (Collaborator) left a comment

Awesome refactor! LGTM!

Base automatically changed from dbe/more_trainer_todos to main November 6, 2025 15:37

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

This pull request refactors the clustering normalization system to consolidate normalization logic using a newly renamed DataAndKeyNormalizationType enum (formerly KeyScalingType). The changes unify the dual-path numerical normalization (MINMAX vs QUANTILE) into a single normalized array approach, update function signatures in clustering.py to accept and propagate the normalization type parameter, and rename/repurpose the get_normalized_numerical_columns utility function. Additionally, unit tests are added for normalization utilities, and a minor typo in an enum docstring is corrected.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • clustering.py normalization workflow refactoring: The reworked normalization flow consolidating child-parent data into a single step with subsequent parent scaling requires careful verification of the logic, particularly the transition from dual-array returns to single normalized_data array.
  • Function signature propagation: Trace the data_and_key_normalization parameter flow through _pair_clustering → _prepare_cluster_data → get_normalized_numerical_columns to ensure consistent propagation and default values.
  • Enum rename impact: Verify that all references to KeyScalingType have been replaced with DataAndKeyNormalizationType across all affected files, and confirm the breaking change is intentional.
  • Test adequacy: Review new tests in test_clustering.py to ensure they cover both normalization paths (MINMAX and QUANTILE) through the consolidated get_normalized_numerical_columns function with appropriate tolerances and seeded RNG management.
  • Error handling: Validate the new ValueError for unsupported data_and_key_normalization types is properly raised in all relevant code paths.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 76.92%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately describes the main refactoring work: addressing a normalization TODO in clustering.py by restructuring how the normalization functions are organized.
  • Description check ✅ Passed: The description follows the template with PR Type (Refactor), a short description referencing the ClickUp ticket and the normalization TODO, and mentions tests added. All required sections are present.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d10f6 and 7ac7cd8.

📒 Files selected for processing (4)
  • src/midst_toolkit/models/clavaddpm/clustering.py (9 hunks)
  • src/midst_toolkit/models/clavaddpm/dataset.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/enumerations.py (2 hunks)
  • tests/unit/models/clavaddpm/test_clustering.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/models/clavaddpm/test_clustering.py (3)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
src/midst_toolkit/models/clavaddpm/clustering.py (3)
  • _min_max_normalize_sklearn (758-769)
  • _quantile_normalize_sklearn (740-755)
  • get_normalized_numerical_columns (339-374)
src/midst_toolkit/models/clavaddpm/enumerations.py (1)
  • DataAndKeyNormalizationType (105-109)
src/midst_toolkit/models/clavaddpm/clustering.py (1)
src/midst_toolkit/models/clavaddpm/enumerations.py (1)
  • DataAndKeyNormalizationType (105-109)
🪛 Ruff (0.14.3)
src/midst_toolkit/models/clavaddpm/clustering.py

369-369: Avoid specifying long messages outside the exception class

(TRY003)


502-502: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: run-code-check
  • GitHub Check: build
  • GitHub Check: unit-tests
  • GitHub Check: integration-tests
🔇 Additional comments (13)
src/midst_toolkit/models/clavaddpm/dataset.py (1)

118-118: LGTM! Formatting improvement.

The blank line improves readability by visually separating the variable declaration from the loop logic.

src/midst_toolkit/models/clavaddpm/enumerations.py (2)

19-19: LGTM! Typo fix.

Corrected "Possioble" to "Possible" in the docstring.


105-109: LGTM! Enum rename clarifies intent.

The rename from KeyScalingType to DataAndKeyNormalizationType better reflects that this enum governs normalization for both data and primary keys during clustering. The updated docstring accurately describes this broader scope.

Note: This is a breaking API change. Ensure all references to KeyScalingType have been updated throughout the codebase.
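A minimal sketch of what the renamed enum might look like, assuming the MINMAX and QUANTILE members discussed in this review; the actual definition lives in enumerations.py and may differ:

```python
from enum import Enum


class DataAndKeyNormalizationType(Enum):
    """Possible normalization types applied to both numerical data and
    primary keys during clustering (sketch; member values are assumptions)."""

    MINMAX = "minmax"
    QUANTILE = "quantile"
```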

tests/unit/models/clavaddpm/test_clustering.py (3)

12-29: LGTM! Well-structured test.

The test correctly validates _quantile_normalize_sklearn with deterministic input and expected output. The use of set_all_random_seeds(42) ensures reproducibility, and the cleanup with unset_all_random_seeds() prevents test pollution.


32-49: LGTM! Appropriate tolerance for deterministic normalization.

The test correctly validates _min_max_normalize_sklearn. The stricter tolerance (atol=1e-8) is appropriate since min-max normalization is fully deterministic.


52-83: LGTM! Comprehensive test of normalization paths.

The test correctly validates get_normalized_numerical_columns for both MINMAX and QUANTILE normalization types, including verification that the parent_scale factor is properly applied to the parent data columns after normalization.
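The test pattern described above can be sketched as follows, substituting numpy/sklearn equivalents for the toolkit's private helpers and seed utilities (all names here are stand-ins, not the actual test code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer


def test_min_max_normalize() -> None:
    matrix = np.array([[1.0, 5.0], [2.0, 7.0], [4.0, 9.0]])
    normalized = MinMaxScaler().fit_transform(matrix)
    expected = np.array([[0.0, 0.0], [1.0 / 3.0, 0.5], [1.0, 1.0]])
    # Min-max scaling is fully deterministic, so a tight tolerance is safe.
    assert np.allclose(normalized, expected, atol=1e-8)


def test_quantile_normalize() -> None:
    np.random.seed(42)  # stand-in for set_all_random_seeds(42)
    matrix = np.random.rand(20, 2)
    normalized = QuantileTransformer(n_quantiles=10).fit_transform(matrix)
    # Quantile transformation preserves shape and maps values into [0, 1].
    assert normalized.shape == matrix.shape
    assert normalized.min() >= 0.0 and normalized.max() <= 1.0


test_min_max_normalize()
test_quantile_normalize()
```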

src/midst_toolkit/models/clavaddpm/clustering.py (7)

21-21: LGTM! Import updated for renamed enum.

The import correctly reflects the enum rename from KeyScalingType to DataAndKeyNormalizationType.


159-186: LGTM! Well-documented signature update.

The addition of the data_and_key_normalization parameter with a sensible default (DataAndKeyNormalizationType.MINMAX) maintains backward compatibility. The updated docstrings clearly explain the purpose of parent_scale, key_scale, and the new normalization parameter.


339-374: LGTM! Efficient single-path normalization.

The refactored function eliminates redundant computation by applying only the requested normalization method instead of computing both MINMAX and QUANTILE. The logic is clear:

  1. Merge child and parent numerical data
  2. Apply selected normalization
  3. Scale parent data by parent_scale
  4. Return single normalized array

The error handling for unrecognized normalization methods is appropriate.

Note: Static analysis flags the exception message at line 369 as too long (TRY003), but this is a minor style preference and the message is clear and informative.
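The four-step flow above can be sketched as follows; the function and parameter names mirror the review discussion, but the exact signature is an assumption rather than the toolkit's code:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer


def get_normalized_numerical_columns(
    child_data: np.ndarray,
    parent_data: np.ndarray,
    parent_scale: float,
    normalization: str = "minmax",
) -> np.ndarray:
    # 1. Merge child and parent numerical columns side by side.
    matrix = np.concatenate((child_data, parent_data), axis=1)

    # 2. Apply only the requested normalization method.
    if normalization == "minmax":
        normalized = MinMaxScaler().fit_transform(matrix)
    elif normalization == "quantile":
        normalized = QuantileTransformer(
            n_quantiles=min(matrix.shape[0], 1000)
        ).fit_transform(matrix)
    else:
        raise ValueError(f"Unrecognized normalization type: {normalization}")

    # 3. Scale the parent columns by parent_scale.
    normalized[:, child_data.shape[1] :] *= parent_scale

    # 4. Return the single normalized array.
    return normalized
```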


425-456: LGTM! Signature and documentation updated consistently.

The data_and_key_normalization parameter is properly added with a sensible default, and the docstring clearly explains all parameters including the updated normalization approach.


488-503: LGTM! Unified normalization logic.

The refactor successfully consolidates the normalization workflow:

  • Single call to get_normalized_numerical_columns with the selected method (lines 488-493)
  • Consistent normalization applied to parent primary key data (lines 495-502)
  • Appropriate error handling for unrecognized normalization types

Note: Static analysis flags the exception message at line 502 as too long (TRY003), but the message is clear and helpful for debugging.


740-755: LGTM! Simplified normalization helper.

The function now directly returns the transformed matrix instead of accumulating results per column. This simplification improves readability and performance.


758-769: LGTM! Simplified normalization helper.

The function now directly returns the transformed matrix, eliminating unnecessary complexity. The simplification aligns with the broader refactoring goals.



@emersodb emersodb merged commit e095d56 into main Nov 6, 2025
6 checks passed
@emersodb emersodb deleted the dbe/clustering_todo branch November 6, 2025 15:54


3 participants