-
Notifications
You must be signed in to change notification settings - Fork 1
Some refactors to improve the grouping code and remove the todo in clustering.py #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ly sizeable refactor of the various transformations in the dataset file and adding a number of tests.
lotif
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with minor comments. Thanks for dealing with the conflicts and adding tests!
📝 WalkthroughWalkthroughThe changes introduce two new public helper functions to the clustering module: Estimated code review effort🎯 2 (Simple) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)
672-694: Consider numpy-based grouping for better performance.The current row-by-row iteration is correct but could be optimized using numpy operations for larger datasets. However, given typical clustering dataset sizes, the current implementation is readable and acceptable.
If performance becomes a concern, consider using numpy's sorting and splitting operations:
def group_data_by_group_id_as_dict( data_to_be_grouped: np.ndarray, column_index_to_group_by: int ) -> dict[int, list[np.ndarray]]: """...""" if len(data_to_be_grouped) == 0: return {} # Sort by grouping column sort_indices = np.argsort(data_to_be_grouped[:, column_index_to_group_by]) sorted_data = data_to_be_grouped[sort_indices] sorted_keys = sorted_data[:, column_index_to_group_by] # Find group boundaries unique_keys, split_indices = np.unique(sorted_keys, return_index=True) groups = np.split(sorted_data, split_indices[1:]) # Convert to dict grouped_data_dict = {} for key, group in zip(unique_keys, groups): group_id = _parse_numpy_number_as_int(key) grouped_data_dict[group_id] = [row for row in group] return grouped_data_dict
697-719: Consider clarifying the return type in the docstring.The function returns a numpy array with
dtype=object(necessary for ragged groups), which might not be immediately obvious to users. Consider adding a note about this in the docstring.Add a note to the Returns section:
Returns: - Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``. + Numpy array of the data grouped by values in the column with index ``column_index_to_group_by``. + The returned array has dtype=object since groups may have different lengths.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/midst_toolkit/models/clavaddpm/clustering.py(4 hunks)tests/unit/models/clavaddpm/test_clustering.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/models/clavaddpm/test_clustering.py (2)
src/midst_toolkit/models/clavaddpm/clustering.py (2)
group_data_by_group_id_as_dict(672-694)group_data_by_id(697-719)src/midst_toolkit/common/random.py (2)
set_all_random_seeds(11-55)unset_all_random_seeds(58-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-tests
- GitHub Check: run-code-check
- GitHub Check: unit-tests
🔇 Additional comments (7)
src/midst_toolkit/models/clavaddpm/clustering.py (4)
5-5: LGTM!The switch from
OrderedDicttodefaultdictis appropriate for the grouping use case and simplifies the code.
236-236: LGTM!The use of
group_data_by_idwithsort_by_column_value=Trueis correct and maintains the expected ordering of groups by foreign key values.
316-316: LGTM!The updated docstring clearly specifies that this is the foreign key index in the child data.
321-321: LGTM!The use of
group_data_by_group_id_as_dictis appropriate here, as the dict-based return value allows efficient lookup of group data by group ID.tests/unit/models/clavaddpm/test_clustering.py (3)
8-9: LGTM!The imports for the new grouping functions are correctly added.
88-152: LGTM!Comprehensive test coverage for
group_data_by_idwith:
- Multiple data configurations (foreign key in different positions)
- Both sorting modes (sorted and unsorted)
- Validation of both structure and specific values
- Proper random seed management
155-179: LGTM!Excellent test coverage for
group_data_by_group_id_as_dictthat validates:
- Dict structure with correct keys
- Group sizes for each key
- Specific values within groups
- Proper random seed management
PR Type
Refactor
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868g7h7h6
Refactoring the grouping helper functions in clustering.py and adding documentation. The goal was to provide some refactors to make them easier to read. Tests add a lot of contex to how they work as well.
Tests Added
Tests for the functions that were touched.