
Conversation

@emersodb (Collaborator) commented Nov 3, 2025

PR Type

Refactor

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g7a81q

This PR addresses a TODO in the clustering code that called for reconsidering how the normalization functions are structured.

Tests Added

Tests were added for the functions materially touched in this refactor.


normalized_data = np.empty((matrix.shape[0], 0))

# Apply QuantileTransformer to each column and concatenate the results

Literally no idea why this was being done per column...


normalized_data = np.empty((matrix.shape[0], 0))

# Apply MinMaxScaler to each column and concatenate the results

Literally no idea why this was being done per column...
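The per-column pattern the snippets above refer to is indeed redundant: sklearn's scalers already fit each feature independently, so a single fit_transform over the whole matrix produces the same result as the old loop. A minimal sketch illustrating the equivalence (names here are illustrative, not the toolkit's actual code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

matrix = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Old pattern: transform each column separately and concatenate the results.
per_column = np.empty((matrix.shape[0], 0))
for i in range(matrix.shape[1]):
    column = MinMaxScaler().fit_transform(matrix[:, i : i + 1])
    per_column = np.concatenate((per_column, column), axis=1)

# New pattern: a single fit_transform over the full matrix. Scalers in
# sklearn compute statistics per feature, so this is equivalent.
whole_matrix = MinMaxScaler().fit_transform(matrix)

assert np.allclose(per_column, whole_matrix)
```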

@lotif (Collaborator) left a comment

Awesome refactor! LGTM!

Base automatically changed from dbe/more_trainer_todos to main November 6, 2025 15:37

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

This pull request refactors the clustering normalization system to consolidate normalization logic using a newly renamed DataAndKeyNormalizationType enum (formerly KeyScalingType). The changes unify the dual-path numerical normalization (MINMAX vs QUANTILE) into a single normalized array approach, update function signatures in clustering.py to accept and propagate the normalization type parameter, and rename/repurpose the get_normalized_numerical_columns utility function. Additionally, unit tests are added for normalization utilities, and a minor typo in an enum docstring is corrected.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • clustering.py normalization workflow refactoring: The reworked normalization flow consolidating child-parent data into a single step with subsequent parent scaling requires careful verification of the logic, particularly the transition from dual-array returns to single normalized_data array.
  • Function signature propagation: Trace the data_and_key_normalization parameter flow through _pair_clustering → _prepare_cluster_data → get_normalized_numerical_columns to ensure consistent propagation and default values.
  • Enum rename impact: Verify that all references to KeyScalingType have been replaced with DataAndKeyNormalizationType across all affected files, and confirm the breaking change is intentional.
  • Test adequacy: Review new tests in test_clustering.py to ensure they cover both normalization paths (MINMAX and QUANTILE) through the consolidated get_normalized_numerical_columns function with appropriate tolerances and seeded RNG management.
  • Error handling: Validate the new ValueError for unsupported data_and_key_normalization types is properly raised in all relevant code paths.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 76.92%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately describes the main refactoring work: addressing a normalization TODO in clustering.py by restructuring how the normalization functions are organized.
  • Description check ✅ Passed: The description follows the template with PR Type (Refactor), a short description referencing the ClickUp ticket and the normalization TODO, and mentions tests added. All required sections are present.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d10f6 and 7ac7cd8.

📒 Files selected for processing (4)
  • src/midst_toolkit/models/clavaddpm/clustering.py (9 hunks)
  • src/midst_toolkit/models/clavaddpm/dataset.py (1 hunks)
  • src/midst_toolkit/models/clavaddpm/enumerations.py (2 hunks)
  • tests/unit/models/clavaddpm/test_clustering.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/models/clavaddpm/test_clustering.py (3)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
src/midst_toolkit/models/clavaddpm/clustering.py (3)
  • _min_max_normalize_sklearn (758-769)
  • _quantile_normalize_sklearn (740-755)
  • get_normalized_numerical_columns (339-374)
src/midst_toolkit/models/clavaddpm/enumerations.py (1)
  • DataAndKeyNormalizationType (105-109)
src/midst_toolkit/models/clavaddpm/clustering.py (1)
src/midst_toolkit/models/clavaddpm/enumerations.py (1)
  • DataAndKeyNormalizationType (105-109)
🪛 Ruff (0.14.3)
src/midst_toolkit/models/clavaddpm/clustering.py

369-369: Avoid specifying long messages outside the exception class

(TRY003)


502-502: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: run-code-check
  • GitHub Check: build
  • GitHub Check: unit-tests
  • GitHub Check: integration-tests
🔇 Additional comments (13)
src/midst_toolkit/models/clavaddpm/dataset.py (1)

118-118: LGTM! Formatting improvement.

The blank line improves readability by visually separating the variable declaration from the loop logic.

src/midst_toolkit/models/clavaddpm/enumerations.py (2)

19-19: LGTM! Typo fix.

Corrected "Possioble" to "Possible" in the docstring.


105-109: LGTM! Enum rename clarifies intent.

The rename from KeyScalingType to DataAndKeyNormalizationType better reflects that this enum governs normalization for both data and primary keys during clustering. The updated docstring accurately describes this broader scope.

Note: This is a breaking API change. Ensure all references to KeyScalingType have been updated throughout the codebase.
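A minimal sketch of what the renamed enum might look like, assuming the MINMAX and QUANTILE members discussed in this review; the actual definition lives in enumerations.py and may differ:

```python
from enum import Enum


class DataAndKeyNormalizationType(Enum):
    """Possible normalization types applied to both numerical data and
    primary keys during clustering (sketch; member values are assumptions)."""

    MINMAX = "minmax"
    QUANTILE = "quantile"
```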

tests/unit/models/clavaddpm/test_clustering.py (3)

12-29: LGTM! Well-structured test.

The test correctly validates _quantile_normalize_sklearn with deterministic input and expected output. The use of set_all_random_seeds(42) ensures reproducibility, and the cleanup with unset_all_random_seeds() prevents test pollution.


32-49: LGTM! Appropriate tolerance for deterministic normalization.

The test correctly validates _min_max_normalize_sklearn. The stricter tolerance (atol=1e-8) is appropriate since min-max normalization is fully deterministic.


52-83: LGTM! Comprehensive test of normalization paths.

The test correctly validates get_normalized_numerical_columns for both MINMAX and QUANTILE normalization types, including verification that the parent_scale factor is properly applied to the parent data columns after normalization.
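The test pattern described above can be sketched as follows, substituting numpy/sklearn equivalents for the toolkit's private helpers and seed utilities (all names here are stand-ins, not the actual test code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer


def test_min_max_normalize() -> None:
    matrix = np.array([[1.0, 5.0], [2.0, 7.0], [4.0, 9.0]])
    normalized = MinMaxScaler().fit_transform(matrix)
    expected = np.array([[0.0, 0.0], [1.0 / 3.0, 0.5], [1.0, 1.0]])
    # Min-max scaling is fully deterministic, so a tight tolerance is safe.
    assert np.allclose(normalized, expected, atol=1e-8)


def test_quantile_normalize() -> None:
    np.random.seed(42)  # stand-in for set_all_random_seeds(42)
    matrix = np.random.rand(20, 2)
    normalized = QuantileTransformer(n_quantiles=10).fit_transform(matrix)
    # Quantile transformation preserves shape and maps values into [0, 1].
    assert normalized.shape == matrix.shape
    assert normalized.min() >= 0.0 and normalized.max() <= 1.0


test_min_max_normalize()
test_quantile_normalize()
```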

src/midst_toolkit/models/clavaddpm/clustering.py (7)

21-21: LGTM! Import updated for renamed enum.

The import correctly reflects the enum rename from KeyScalingType to DataAndKeyNormalizationType.


159-186: LGTM! Well-documented signature update.

The addition of the data_and_key_normalization parameter with a sensible default (DataAndKeyNormalizationType.MINMAX) maintains backward compatibility. The updated docstrings clearly explain the purpose of parent_scale, key_scale, and the new normalization parameter.


339-374: LGTM! Efficient single-path normalization.

The refactored function eliminates redundant computation by applying only the requested normalization method instead of computing both MINMAX and QUANTILE. The logic is clear:

  1. Merge child and parent numerical data
  2. Apply selected normalization
  3. Scale parent data by parent_scale
  4. Return single normalized array

The error handling for unrecognized normalization methods is appropriate.

Note: Static analysis flags the exception message at line 369 as too long (TRY003), but this is a minor style preference and the message is clear and informative.
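The four-step flow above can be sketched as follows; the function and parameter names mirror the review discussion, but the exact signature is an assumption rather than the toolkit's code:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer


def get_normalized_numerical_columns(
    child_data: np.ndarray,
    parent_data: np.ndarray,
    parent_scale: float,
    normalization: str = "minmax",
) -> np.ndarray:
    # 1. Merge child and parent numerical columns side by side.
    matrix = np.concatenate((child_data, parent_data), axis=1)

    # 2. Apply only the requested normalization method.
    if normalization == "minmax":
        normalized = MinMaxScaler().fit_transform(matrix)
    elif normalization == "quantile":
        normalized = QuantileTransformer(
            n_quantiles=min(matrix.shape[0], 1000)
        ).fit_transform(matrix)
    else:
        raise ValueError(f"Unrecognized normalization type: {normalization}")

    # 3. Scale the parent columns by parent_scale.
    normalized[:, child_data.shape[1] :] *= parent_scale

    # 4. Return the single normalized array.
    return normalized
```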


425-456: LGTM! Signature and documentation updated consistently.

The data_and_key_normalization parameter is properly added with a sensible default, and the docstring clearly explains all parameters including the updated normalization approach.


488-503: LGTM! Unified normalization logic.

The refactor successfully consolidates the normalization workflow:

  • Single call to get_normalized_numerical_columns with the selected method (lines 488-493)
  • Consistent normalization applied to parent primary key data (lines 495-502)
  • Appropriate error handling for unrecognized normalization types

Note: Static analysis flags the exception message at line 502 as too long (TRY003), but the message is clear and helpful for debugging.


740-755: LGTM! Simplified normalization helper.

The function now directly returns the transformed matrix instead of accumulating results per column. This simplification improves readability and performance.


758-769: LGTM! Simplified normalization helper.

The function now directly returns the transformed matrix, eliminating unnecessary complexity. The simplification aligns with the broader refactoring goals.



@emersodb emersodb merged commit e095d56 into main Nov 6, 2025
6 checks passed
@emersodb emersodb deleted the dbe/clustering_todo branch November 6, 2025 15:54


3 participants