Extend multiple data drift univariate methods to multivariate #353

alvarolopez merged 2 commits into main
Conversation
Pull request overview
This pull request extends the BhattacharyyaDistance detector from univariate to multivariate data support. The implementation now uses the mathematically correct Bhattacharyya distance formula (-log(BC)) instead of the coefficient complement (1 - BC), and leverages multidimensional histograms to handle multivariate distributions.
Changes:
- Extended BhattacharyyaDistance to support multivariate data by changing from UnivariateData to MultivariateData
- Updated _calculate_bins_values in base class to use histogramdd for both univariate and multivariate cases
- Modified Bhattacharyya distance calculation to use the proper distance formula with logarithm
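The changes above can be summarized in a minimal sketch: build shared per-dimension bin edges over both samples, histogram each sample with `np.histogramdd`, normalize to probabilities, and compute `-log(BC)`. The function name, normalization details, and edge handling here are illustrative assumptions, not the actual frouros implementation:

```python
import numpy as np

def bhattacharyya_distance(X_ref, X, num_bins=10):
    """Illustrative multivariate Bhattacharyya distance via histogramdd."""
    # Promote 1D inputs to a single-feature 2D shape
    if X_ref.ndim == 1:
        X_ref = X_ref.reshape(-1, 1)
        X = X.reshape(-1, 1)
    # Shared bin edges per dimension spanning both samples
    stacked = np.vstack((X_ref, X))
    min_edge = stacked.min(axis=0)
    max_edge = stacked.max(axis=0)
    bins = [
        np.linspace(min_edge[i], max_edge[i], num_bins + 1)
        for i in range(X_ref.shape[1])
    ]
    X_ref_hist, _ = np.histogramdd(X_ref, bins=bins)
    X_hist, _ = np.histogramdd(X, bins=bins)
    # Normalize counts to probabilities
    X_ref_p = X_ref_hist / X_ref_hist.sum()
    X_p = X_hist / X_hist.sum()
    # Bhattacharyya coefficient, clipped away from 0 to avoid log(0)
    bc = np.sum(np.sqrt(X_ref_p * X_p))
    bc = np.clip(bc, np.finfo(float).eps, 1.0)
    return -np.log(bc)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(500, 2))
d_same = bhattacharyya_distance(a, rng.normal(0, 1, size=(500, 2)))
d_shift = bhattacharyya_distance(a, rng.normal(3, 1, size=(500, 2)))
```

As expected, two samples from the same 2D Gaussian yield a much smaller distance than a sample shifted by three standard deviations.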
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| frouros/detectors/data_drift/batch/distance_based/bhattacharyya_distance.py | Changed statistical type to MultivariateData, updated distance formula to use -log(BC), and modified documentation/examples for multivariate support |
| frouros/detectors/data_drift/batch/distance_based/base.py | Refactored _calculate_bins_values to handle both 1D and multivariate data using histogramdd, added statistical_type parameter to BaseDistanceBasedBins constructor |
| frouros/detectors/data_drift/batch/distance_based/psi.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/detectors/data_drift/batch/distance_based/hi_normalized_complement.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/detectors/data_drift/batch/distance_based/hellinger_distance.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/tests/integration/test_data_drift.py | Updated expected values to reflect new distance formula, added multivariate test cases, cleaned up import formatting |
| frouros/tests/integration/test_callback.py | Updated expected distance value for BhattacharyyaDistance |
| README.md | Updated capability indicator from "U" (univariate) to "M" (multivariate) |
```
:param num_bins: number of bins per dimension in which to
    divide probabilities, defaults to 10
```
The docstring parameter description should clarify that for multivariate data, num_bins represents the number of bins per dimension, which means the total number of bins in the histogram grows exponentially with dimensionality (num_bins^d where d is the number of dimensions). This is an important consideration for users when choosing the num_bins parameter for high-dimensional data.
```diff
-:param num_bins: number of bins per dimension in which to
-    divide probabilities, defaults to 10
+:param num_bins: number of bins per dimension used to build the histogram.
+    For multivariate data with d dimensions this results in ``num_bins**d``
+    total bins, so large values of ``num_bins`` in high dimensions can be
+    computationally expensive. Defaults to 10.
```
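The exponential growth mentioned in the suggested docstring is easy to quantify. This small illustrative calculation shows the dense cell count and approximate float64 memory footprint per dimensionality:

```python
import numpy as np

# Illustrative: a dense histogramdd grid has num_bins ** d cells
num_bins = 10
cells = {d: num_bins ** d for d in (1, 2, 5, 8)}
# Approximate memory of a dense float64 grid, in megabytes
mem_mb = {d: n * np.dtype(np.float64).itemsize / 1e6 for d, n in cells.items()}
# 5 dimensions already give 100,000 cells; 8 give 100 million (~800 MB)
```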
```python
# Add a new axis if X_ref and X are 1D
if X_ref.ndim == 1:
    X_ref = X_ref[:, np.newaxis]
    X = X[:, np.newaxis]
```
Rebinding X_ref and X with a new axis here is not actually an in-place modification: the assignments create new array views that exist only within the function scope, so the caller's arrays are untouched. The pattern is nevertheless easy to misread as mutation. Consider using X_ref.reshape(-1, 1) (or np.atleast_2d followed by a transpose) to make the intent of "treat 1D input as a single feature column" explicit.
```diff
-# Add a new axis if X_ref and X are 1D
-if X_ref.ndim == 1:
-    X_ref = X_ref[:, np.newaxis]
-    X = X[:, np.newaxis]
+# Ensure 2D shape if X_ref and X are 1D (treat as single feature)
+if X_ref.ndim == 1:
+    X_ref = X_ref.reshape(-1, 1)
+    X = X.reshape(-1, 1)
```
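For what it's worth, both spellings behave identically for a contiguous 1D array: each produces a view of shape `(n, 1)` without copying, so the suggestion is purely about readability. A quick illustrative check:

```python
import numpy as np

x = np.arange(4.0)
a = x[:, np.newaxis]   # inserted-axis view, shape (4, 1)
b = x.reshape(-1, 1)   # same shape; arguably clearer intent

# Both are views of the original contiguous array, not copies
assert a.shape == b.shape == (4, 1)
assert np.shares_memory(x, a)
assert np.shares_memory(x, b)
assert np.array_equal(a, b)
```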
```python
min_edge = np.min(np.vstack((X_ref, X)), axis=0)
max_edge = np.max(np.vstack((X_ref, X)), axis=0)
bins = [
    np.linspace(min_edge[i], max_edge[i], num_bins + 1)
    for i in range(X_ref.shape[1])
]
```
When min_edge equals max_edge for a dimension (all values in that dimension are the same), np.linspace will create bins where all edges are the same value. This will cause all data points in that dimension to fall outside the bin range or into a single bin, which may lead to unexpected behavior. Consider adding a check for constant dimensions and handling them appropriately, such as using a small range around the constant value or skipping that dimension.
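One way to handle the degenerate case is to pad constant dimensions with a small symmetric range before building the edges. This is a hypothetical helper (the name `safe_bin_edges` and the `pad` value are illustrative, not part of frouros):

```python
import numpy as np

def safe_bin_edges(X_ref, X, num_bins=10, pad=1e-8):
    """Hypothetical sketch: widen constant dimensions slightly so that
    np.linspace produces distinct, usable bin edges."""
    stacked = np.vstack((X_ref, X))
    min_edge = stacked.min(axis=0).astype(float)
    max_edge = stacked.max(axis=0).astype(float)
    # Detect dimensions where every value is identical
    constant = min_edge == max_edge
    min_edge[constant] -= pad
    max_edge[constant] += pad
    return [
        np.linspace(min_edge[i], max_edge[i], num_bins + 1)
        for i in range(stacked.shape[1])
    ]

# A dimension where every value is 5.0 still gets usable edges
edges = safe_bin_edges(np.full((4, 1), 5.0), np.full((3, 1), 5.0))
```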
```python
# Use absolute value to avoid negative zero values
bhattacharyya = np.abs(-np.log(bc))

return bhattacharyya
```
Using np.abs() on the result of -np.log(bc) is unnecessary because -np.log(x) for x in (0, 1] always produces a non-negative value. The logarithm of a value between 0 and 1 is negative, so negating it produces a positive value. The np.abs() call may mask potential issues if bc somehow becomes greater than 1, which would indicate a problem with the normalization. Consider removing np.abs() or adding an assertion to verify bc is in the valid range.
```diff
-# Use absolute value to avoid negative zero values
-bhattacharyya = np.abs(-np.log(bc))
+bhattacharyya = -np.log(bc)
 return bhattacharyya
```
```python
X_ref_hist, _ = np.histogramdd(X_ref, bins=bins)
X_hist, _ = np.histogramdd(X, bins=bins)
```
Using histogramdd with a fixed number of bins per dimension can lead to exponential growth in the number of bins as dimensionality increases. For example, with num_bins=10 and 5 dimensions, the histogram will have 10^5 = 100,000 bins, most of which may be empty. This can lead to performance issues and memory concerns for high-dimensional data. Consider adding a warning or validation for high-dimensional inputs, or implementing an adaptive binning strategy for multivariate data.
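A validation along these lines could look like the following sketch. The `MAX_TOTAL_BINS` threshold and the function name are illustrative assumptions, not existing frouros code:

```python
import warnings

MAX_TOTAL_BINS = 1_000_000  # illustrative threshold, not from frouros

def check_histogram_size(num_dims, num_bins):
    """Hypothetical guard: warn when the dense histogram grid would be huge."""
    total = num_bins ** num_dims
    if total > MAX_TOTAL_BINS:
        warnings.warn(
            f"histogramdd with {num_bins} bins over {num_dims} dimensions "
            f"allocates {total:,} cells; consider fewer bins or an "
            "adaptive/sparse binning strategy.",
            RuntimeWarning,
        )
    return total
```

A caller could invoke this once before building the histograms, letting users opt into high-dimensional grids knowingly instead of hitting a silent memory blow-up.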
```python
# Add small epsilon to avoid log(0)
epsilon = np.finfo(float).eps
X_percents = X_percents + epsilon
Y_percents = Y_percents + epsilon

# Compute Bhattacharyya coefficient
bc = np.sum(np.sqrt(X_percents * Y_percents))
# Clip between [0,1] to avoid numerical errors
bc = np.clip(bc, a_min=0, a_max=1)
```
The clipping operation at line 93 may mask issues with the epsilon addition at lines 87-88. If the histograms are properly normalized and epsilon is not added, the Bhattacharyya coefficient should naturally be in [0,1]. The current approach of adding epsilon to all bins and then clipping could hide the fact that the coefficient exceeds 1 due to improper normalization after epsilon addition. Consider removing the clipping and ensuring proper normalization, or add epsilon only when needed (i.e., when bc equals 0 before taking the log).
```diff
-# Add small epsilon to avoid log(0)
-epsilon = np.finfo(float).eps
-X_percents = X_percents + epsilon
-Y_percents = Y_percents + epsilon
-# Compute Bhattacharyya coefficient
-bc = np.sum(np.sqrt(X_percents * Y_percents))
-# Clip between [0,1] to avoid numerical errors
-bc = np.clip(bc, a_min=0, a_max=1)
+# Compute Bhattacharyya coefficient directly from normalized histograms
+bc = np.sum(np.sqrt(X_percents * Y_percents))
+# Guard against log(0) or log of negative due to numerical issues
+epsilon = np.finfo(float).eps
+if bc <= 0.0:
+    bc = epsilon
```
```python
# Add small epsilon to avoid log(0)
epsilon = np.finfo(float).eps
X_percents = X_percents + epsilon
Y_percents = Y_percents + epsilon

# Compute Bhattacharyya coefficient
bc = np.sum(np.sqrt(X_percents * Y_percents))
# Clip between [0,1] to avoid numerical errors
bc = np.clip(bc, a_min=0, a_max=1)
```
The normalization after adding epsilon to the histograms may not preserve probability mass correctly. When epsilon is added to X_percents and Y_percents at lines 87-88, the total probability mass of each distribution increases beyond 1.0. The sum of X_percents + epsilon will be greater than 1.0, which violates the requirement that probabilities sum to 1. This affects the validity of the Bhattacharyya coefficient calculation. Consider either: (1) re-normalizing after adding epsilon, or (2) adding epsilon conditionally only where needed in the bc calculation.
```diff
-# Add small epsilon to avoid log(0)
-epsilon = np.finfo(float).eps
-X_percents = X_percents + epsilon
-Y_percents = Y_percents + epsilon
-# Compute Bhattacharyya coefficient
-bc = np.sum(np.sqrt(X_percents * Y_percents))
-# Clip between [0,1] to avoid numerical errors
-bc = np.clip(bc, a_min=0, a_max=1)
+# Small epsilon to avoid log(0) when computing the distance
+epsilon = np.finfo(float).eps
+# Compute Bhattacharyya coefficient from normalized histograms
+bc = np.sum(np.sqrt(X_percents * Y_percents))
+# Clip between [epsilon, 1] to avoid numerical errors and log(0)
+bc = np.clip(bc, a_min=epsilon, a_max=1.0)
```
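The mass-inflation point is easy to demonstrate on a toy histogram with disjoint support; this illustrative comparison (not frouros code) contrasts the two epsilon strategies:

```python
import numpy as np

eps = np.finfo(float).eps
p = np.array([0.5, 0.5, 0.0, 0.0])  # normalized histogram with empty bins
q = np.array([0.0, 0.0, 0.5, 0.5])  # disjoint support

# Strategy 1 (original): add eps to every bin, then clip bc.
# The total mass of each "distribution" now exceeds 1 by len(p) * eps.
bc_eps = np.sum(np.sqrt((p + eps) * (q + eps)))

# Strategy 2 (suggested): keep histograms normalized, clip bc to [eps, 1]
bc_clip = np.clip(np.sum(np.sqrt(p * q)), eps, 1.0)

# Disjoint distributions: bc is ~0 either way, and the distance stays finite
distance = -np.log(bc_clip)
```

With clipping applied only to `bc`, the probability vectors remain valid distributions and the distance for fully disjoint supports is large but finite (`-log(eps)`).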
📚 Documentation preview 📚: https://frouros--353.org.readthedocs.build/en/353/