
Conversation


@emersodb emersodb commented Nov 3, 2025

PR Type

Refactor

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868g7h7h6

Refactors the grouping helper functions in clustering.py and adds documentation. The goal was to make these functions easier to read. The added tests also provide useful context on how the functions work.

Tests Added

Tests for the functions that were touched.


@lotif lotif left a comment


Approved with minor comments. Thanks for dealing with the conflicts and adding tests!

Base automatically changed from dbe/clustering_todo to main November 6, 2025 15:54

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

The changes introduce two new public helper functions to the clustering module: group_data_by_group_id_as_dict and group_data_by_id. These functions replace internal grouping logic previously used in _pair_clustering. The refactoring updates internal data grouping to rely on a dict-based helper and adjusts downstream usage accordingly. The implementation switches from OrderedDict to defaultdict for grouping storage. Docstrings are updated to clarify parameters, and comprehensive tests are added for the new public functions.
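The defaultdict-based grouping described above can be sketched as follows. The function name and body here are illustrative assumptions mirroring the walkthrough, not the repository's actual implementation of group_data_by_group_id_as_dict:

```python
from collections import defaultdict

import numpy as np


def group_rows_by_column(data: np.ndarray, column_index: int) -> dict[int, list[np.ndarray]]:
    """Group rows of a 2D array by the integer values in one column (illustrative sketch)."""
    groups: dict[int, list[np.ndarray]] = defaultdict(list)
    for row in data:
        # defaultdict creates the empty list on first access, removing the
        # explicit "if key not in dict" bookkeeping OrderedDict required
        groups[int(row[column_index])].append(row)
    return dict(groups)


data = np.array([[1, 10], [2, 20], [1, 30]])
grouped = group_rows_by_column(data, column_index=0)
# grouped[1] holds the two rows whose first column is 1
```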

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

  • New public API signatures: Verify the parameter names, types, and return value shapes of group_data_by_group_id_as_dict and group_data_by_id are documented and semantically correct
  • Behavior preservation: Confirm that the refactored _pair_clustering produces identical results using the new group_data_by_id(..., sort_by_column_value=True) approach
  • Data structure change: Validate that replacing OrderedDict with defaultdict maintains the expected ordering and grouping semantics in the output
  • Test coverage: Review the new test cases in test_clustering.py to ensure they exercise both functions with representative data shapes and edge cases
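On the data-structure point above: since Python 3.7, plain dict (and therefore defaultdict) preserves insertion order, so swapping out OrderedDict keeps the first-seen ordering of group keys. A quick check:

```python
from collections import OrderedDict, defaultdict

keys = [3, 1, 2, 1]

od = OrderedDict()
dd = defaultdict(list)
for k in keys:
    od.setdefault(k, []).append(k)
    dd[k].append(k)

# Both preserve first-insertion order of keys on Python 3.7+
print(list(od))  # [3, 1, 2]
print(list(dd))  # [3, 1, 2]
```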

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title directly describes the main refactoring work: improving grouping code and removing a TODO in clustering.py, which aligns with the PR's core purpose.
  • Description check ✅ Passed — The description follows the template structure with PR Type, Short Description (including the ClickUp ticket), and Tests Added sections completed with relevant details.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)

672-694: Consider numpy-based grouping for better performance.

The current row-by-row iteration is correct but could be optimized using numpy operations for larger datasets. However, given typical clustering dataset sizes, the current implementation is readable and acceptable.

If performance becomes a concern, consider using numpy's sorting and splitting operations:

import numpy as np


def group_data_by_group_id_as_dict(
    data_to_be_grouped: np.ndarray, column_index_to_group_by: int
) -> dict[int, list[np.ndarray]]:
    """..."""
    if len(data_to_be_grouped) == 0:
        return {}

    # Sort by the grouping column so rows with the same key are contiguous
    sort_indices = np.argsort(data_to_be_grouped[:, column_index_to_group_by])
    sorted_data = data_to_be_grouped[sort_indices]
    sorted_keys = sorted_data[:, column_index_to_group_by]

    # Find group boundaries: the first index of each distinct key
    unique_keys, split_indices = np.unique(sorted_keys, return_index=True)
    groups = np.split(sorted_data, split_indices[1:])

    # Convert to a dict keyed by the integer group id
    grouped_data_dict = {}
    for key, group in zip(unique_keys, groups):
        group_id = _parse_numpy_number_as_int(key)
        grouped_data_dict[group_id] = list(group)

    return grouped_data_dict

697-719: Consider clarifying the return type in the docstring.

The function returns a numpy array with dtype=object (necessary for ragged groups), which might not be immediately obvious to users. Consider adding a note about this in the docstring.

Add a note to the Returns section:

     Returns:
-        Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``.
+        Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``.
+        The returned array has dtype=object since groups may have different lengths.
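The dtype=object behavior flagged above can be demonstrated in isolation: groups of different lengths cannot be stacked into a regular 2D array, so an explicit object array is the usual workaround. This standalone example only illustrates the numpy mechanics, not the clustering module's code:

```python
import numpy as np

# Two groups with different numbers of rows ("ragged" groups)
group_a = np.array([[1, 10], [1, 30]])  # shape (2, 2)
group_b = np.array([[2, 20]])           # shape (1, 2)

# np.array([group_a, group_b]) cannot produce a regular array here,
# so build an object array and fill it element by element.
grouped = np.empty(2, dtype=object)
grouped[0] = group_a
grouped[1] = group_b

print(grouped.dtype)  # object
print(grouped[0].shape, grouped[1].shape)  # (2, 2) (1, 2)
```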
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e095d56 and 26baedd.

📒 Files selected for processing (2)
  • src/midst_toolkit/models/clavaddpm/clustering.py (4 hunks)
  • tests/unit/models/clavaddpm/test_clustering.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/models/clavaddpm/test_clustering.py (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)
  • group_data_by_group_id_as_dict (672-694)
  • group_data_by_id (697-719)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration-tests
  • GitHub Check: run-code-check
  • GitHub Check: unit-tests
🔇 Additional comments (7)
src/midst_toolkit/models/clavaddpm/clustering.py (4)

5-5: LGTM!

The switch from OrderedDict to defaultdict is appropriate for the grouping use case and simplifies the code.


236-236: LGTM!

The use of group_data_by_id with sort_by_column_value=True is correct and maintains the expected ordering of groups by foreign key values.
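As a rough illustration of what a sort_by_column_value flag implies (the real signature lives in clustering.py; this standalone sketch only mirrors the behavior described above):

```python
from collections import defaultdict

import numpy as np


def group_by_key(data: np.ndarray, column: int, sort_by_column_value: bool = False) -> list[np.ndarray]:
    """Group rows by one column; optionally order the groups by key value (illustrative)."""
    buckets: dict[int, list[np.ndarray]] = defaultdict(list)
    for row in data:
        buckets[int(row[column])].append(row)
    # Either sort groups by the key's value, or keep first-seen insertion order
    keys = sorted(buckets) if sort_by_column_value else list(buckets)
    return [np.array(buckets[k]) for k in keys]


data = np.array([[5, 1], [2, 2], [5, 3]])
groups = group_by_key(data, column=0, sort_by_column_value=True)
# Groups are ordered by key value: the key-2 group comes before the key-5 group
```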


316-316: LGTM!

The updated docstring clearly specifies that this is the foreign key index in the child data.


321-321: LGTM!

The use of group_data_by_group_id_as_dict is appropriate here, as the dict-based return value allows efficient lookup of group data by group ID.

tests/unit/models/clavaddpm/test_clustering.py (3)

8-9: LGTM!

The imports for the new grouping functions are correctly added.


88-152: LGTM!

Comprehensive test coverage for group_data_by_id with:

  • Multiple data configurations (foreign key in different positions)
  • Both sorting modes (sorted and unsorted)
  • Validation of both structure and specific values
  • Proper random seed management

155-179: LGTM!

Excellent test coverage for group_data_by_group_id_as_dict that validates:

  • Dict structure with correct keys
  • Group sizes for each key
  • Specific values within groups
  • Proper random seed management

@emersodb emersodb merged commit 5758c67 into main Nov 6, 2025
6 of 7 checks passed
@emersodb emersodb deleted the dbe/another_clustering_todo branch November 6, 2025 16:47