
Conversation


@emersodb emersodb commented Nov 3, 2025

PR Type

Refactor

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g7h7h6

Refactors the grouping helper functions in clustering.py and adds documentation. The goal was to make these functions easier to read. The added tests also provide useful context on how the functions work.

Tests Added

Tests for the functions that were touched.


@lotif lotif left a comment


Approved with minor comments. Thanks for dealing with the conflicts and adding tests!

Base automatically changed from dbe/clustering_todo to main November 6, 2025 15:54

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

The changes introduce two new public helper functions to the clustering module: group_data_by_group_id_as_dict and group_data_by_id. These functions replace internal grouping logic previously used in _pair_clustering. The refactoring updates internal data grouping to rely on a dict-based helper and adjusts downstream usage accordingly. The implementation switches from OrderedDict to defaultdict for grouping storage. Docstrings are updated to clarify parameters, and comprehensive tests are added for the new public functions.
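The defaultdict-based grouping described above can be sketched as follows. The function name and body here are illustrative assumptions mirroring the walkthrough, not the repository's actual implementation of group_data_by_group_id_as_dict:

```python
from collections import defaultdict

import numpy as np


def group_rows_by_column(data: np.ndarray, column_index: int) -> dict[int, list[np.ndarray]]:
    """Group rows of a 2D array by the integer values in one column (illustrative sketch)."""
    groups: dict[int, list[np.ndarray]] = defaultdict(list)
    for row in data:
        # defaultdict creates the empty list on first access, removing the
        # explicit "if key not in dict" bookkeeping OrderedDict required
        groups[int(row[column_index])].append(row)
    return dict(groups)


data = np.array([[1, 10], [2, 20], [1, 30]])
grouped = group_rows_by_column(data, column_index=0)
# grouped[1] holds the two rows whose first column is 1
```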

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

  • New public API signatures: Verify the parameter names, types, and return value shapes of group_data_by_group_id_as_dict and group_data_by_id are documented and semantically correct
  • Behavior preservation: Confirm that the refactored _pair_clustering produces identical results using the new group_data_by_id(..., sort_by_column_value=True) approach
  • Data structure change: Validate that replacing OrderedDict with defaultdict maintains the expected ordering and grouping semantics in the output
  • Test coverage: Review the new test cases in test_clustering.py to ensure they exercise both functions with representative data shapes and edge cases
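On the data-structure point above: since Python 3.7, plain dict (and therefore defaultdict) preserves insertion order, so swapping out OrderedDict keeps the first-seen ordering of group keys. A quick check:

```python
from collections import OrderedDict, defaultdict

keys = [3, 1, 2, 1]

od = OrderedDict()
dd = defaultdict(list)
for k in keys:
    od.setdefault(k, []).append(k)
    dd[k].append(k)

# Both preserve first-insertion order of keys on Python 3.7+
print(list(od))  # [3, 1, 2]
print(list(dd))  # [3, 1, 2]
```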

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title directly describes the main refactoring work: improving grouping code and removing a TODO in clustering.py, which aligns with the PR's core purpose.
  • Description check ✅ Passed — The description follows the template structure with PR Type, Short Description (including the ClickUp ticket), and Tests Added sections completed with relevant details.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)

672-694: Consider numpy-based grouping for better performance.

The current row-by-row iteration is correct but could be optimized using numpy operations for larger datasets. However, given typical clustering dataset sizes, the current implementation is readable and acceptable.

If performance becomes a concern, consider using numpy's sorting and splitting operations:

import numpy as np


def group_data_by_group_id_as_dict(
    data_to_be_grouped: np.ndarray, column_index_to_group_by: int
) -> dict[int, list[np.ndarray]]:
    """..."""
    if len(data_to_be_grouped) == 0:
        return {}

    # Sort by the grouping column so rows with the same key are contiguous
    sort_indices = np.argsort(data_to_be_grouped[:, column_index_to_group_by])
    sorted_data = data_to_be_grouped[sort_indices]
    sorted_keys = sorted_data[:, column_index_to_group_by]

    # Find group boundaries: the first index of each distinct key
    unique_keys, split_indices = np.unique(sorted_keys, return_index=True)
    groups = np.split(sorted_data, split_indices[1:])

    # Convert to a dict keyed by the integer group id
    grouped_data_dict = {}
    for key, group in zip(unique_keys, groups):
        group_id = _parse_numpy_number_as_int(key)
        grouped_data_dict[group_id] = list(group)

    return grouped_data_dict

697-719: Consider clarifying the return type in the docstring.

The function returns a numpy array with dtype=object (necessary for ragged groups), which might not be immediately obvious to users. Consider adding a note about this in the docstring.

Add a note to the Returns section:

     Returns:
-        Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``.
+        Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``.
+        The returned array has dtype=object since groups may have different lengths.
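The dtype=object behavior flagged above can be demonstrated in isolation: groups of different lengths cannot be stacked into a regular 2D array, so an explicit object array is the usual workaround. This standalone example only illustrates the numpy mechanics, not the clustering module's code:

```python
import numpy as np

# Two groups with different numbers of rows ("ragged" groups)
group_a = np.array([[1, 10], [1, 30]])  # shape (2, 2)
group_b = np.array([[2, 20]])           # shape (1, 2)

# np.array([group_a, group_b]) cannot produce a regular array here,
# so build an object array and fill it element by element.
grouped = np.empty(2, dtype=object)
grouped[0] = group_a
grouped[1] = group_b

print(grouped.dtype)  # object
print(grouped[0].shape, grouped[1].shape)  # (2, 2) (1, 2)
```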
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e095d56 and 26baedd.

📒 Files selected for processing (2)
  • src/midst_toolkit/models/clavaddpm/clustering.py (4 hunks)
  • tests/unit/models/clavaddpm/test_clustering.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/models/clavaddpm/test_clustering.py (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)
  • group_data_by_group_id_as_dict (672-694)
  • group_data_by_id (697-719)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-tests
  • GitHub Check: run-code-check
  • GitHub Check: unit-tests
🔇 Additional comments (7)
src/midst_toolkit/models/clavaddpm/clustering.py (4)

5-5: LGTM!

The switch from OrderedDict to defaultdict is appropriate for the grouping use case and simplifies the code.


236-236: LGTM!

The use of group_data_by_id with sort_by_column_value=True is correct and maintains the expected ordering of groups by foreign key values.
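As a rough illustration of what a sort_by_column_value flag implies (the real signature lives in clustering.py; this standalone sketch only mirrors the behavior described above):

```python
from collections import defaultdict

import numpy as np


def group_by_key(data: np.ndarray, column: int, sort_by_column_value: bool = False) -> list[np.ndarray]:
    """Group rows by one column; optionally order the groups by key value (illustrative)."""
    buckets: dict[int, list[np.ndarray]] = defaultdict(list)
    for row in data:
        buckets[int(row[column])].append(row)
    # Either sort groups by the key's value, or keep first-seen insertion order
    keys = sorted(buckets) if sort_by_column_value else list(buckets)
    return [np.array(buckets[k]) for k in keys]


data = np.array([[5, 1], [2, 2], [5, 3]])
groups = group_by_key(data, column=0, sort_by_column_value=True)
# Groups are ordered by key value: the key-2 group comes before the key-5 group
```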


316-316: LGTM!

The updated docstring clearly specifies that this is the foreign key index in the child data.


321-321: LGTM!

The use of group_data_by_group_id_as_dict is appropriate here, as the dict-based return value allows efficient lookup of group data by group ID.

tests/unit/models/clavaddpm/test_clustering.py (3)

8-9: LGTM!

The imports for the new grouping functions are correctly added.


88-152: LGTM!

Comprehensive test coverage for group_data_by_id with:

  • Multiple data configurations (foreign key in different positions)
  • Both sorting modes (sorted and unsorted)
  • Validation of both structure and specific values
  • Proper random seed management

155-179: LGTM!

Excellent test coverage for group_data_by_group_id_as_dict that validates:

  • Dict structure with correct keys
  • Group sizes for each key
  • Specific values within groups
  • Proper random seed management

@emersodb emersodb merged commit 5758c67 into main Nov 6, 2025
6 of 7 checks passed
@emersodb emersodb deleted the dbe/another_clustering_todo branch November 6, 2025 16:47