
Extend multiple data drift univariate methods to multivariate#353

Merged
alvarolopez merged 2 commits into main from feature/data-drift-univariate-to-multivariate
Jan 21, 2026

Conversation

@jaime-cespedes-sisniega
Contributor

@jaime-cespedes-sisniega jaime-cespedes-sisniega commented Dec 1, 2024

  • BhattacharyyaDistance

📚 Documentation preview 📚: https://frouros--353.org.readthedocs.build/en/353/

@alvarolopez alvarolopez marked this pull request as ready for review January 21, 2026 10:19
Copilot AI review requested due to automatic review settings January 21, 2026 10:19
@alvarolopez alvarolopez merged commit 4de13a2 into main Jan 21, 2026
4 checks passed
@alvarolopez alvarolopez deleted the feature/data-drift-univariate-to-multivariate branch January 21, 2026 10:19

Copilot AI left a comment


Pull request overview

This pull request extends the BhattacharyyaDistance detector from univariate to multivariate data support. The implementation now uses the mathematically correct Bhattacharyya distance formula (-log(BC)) instead of the coefficient complement (1 - BC), and leverages multidimensional histograms to handle multivariate distributions.

Changes:

  • Extended BhattacharyyaDistance to support multivariate data by changing from UnivariateData to MultivariateData
  • Updated _calculate_bins_values in base class to use histogramdd for both univariate and multivariate cases
  • Modified Bhattacharyya distance calculation to use the proper distance formula with logarithm
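Taken together, the changes above amount to a flow like the following sketch (an illustration of the described approach, not the actual frouros code; the standalone function name and signature are assumptions):

```python
import numpy as np

def bhattacharyya_distance(X_ref, X, num_bins=10):
    """Bhattacharyya distance between two samples via shared-edge histograms.

    Minimal sketch of the approach described in this PR, not the frouros
    implementation itself.
    """
    X_ref, X = np.asarray(X_ref, dtype=float), np.asarray(X, dtype=float)
    if X_ref.ndim == 1:  # treat 1D input as a single feature
        X_ref, X = X_ref.reshape(-1, 1), X.reshape(-1, 1)
    # Shared bin edges per dimension, spanning both samples
    lo = np.minimum(X_ref.min(axis=0), X.min(axis=0))
    hi = np.maximum(X_ref.max(axis=0), X.max(axis=0))
    bins = [np.linspace(lo[d], hi[d], num_bins + 1) for d in range(X_ref.shape[1])]
    # Multidimensional histograms, normalized to probability mass
    p, _ = np.histogramdd(X_ref, bins=bins)
    q, _ = np.histogramdd(X, bins=bins)
    p, q = p / p.sum(), q / q.sum()
    # Distance is -log(BC), not the coefficient complement 1 - BC
    bc = np.clip(np.sum(np.sqrt(p * q)), np.finfo(float).eps, 1.0)
    return -np.log(bc)
```

With this formula, identical samples give a distance of (numerically) zero, and the value grows without bound as the overlap shrinks, unlike the old 1 - BC variant, which was capped at 1.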

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

frouros/detectors/data_drift/batch/distance_based/bhattacharyya_distance.py
    Changed statistical type to MultivariateData, updated the distance formula to use -log(BC), and modified documentation/examples for multivariate support.
frouros/detectors/data_drift/batch/distance_based/base.py
    Refactored _calculate_bins_values to handle both 1D and multivariate data using histogramdd; added a statistical_type parameter to the BaseDistanceBasedBins constructor.
frouros/detectors/data_drift/batch/distance_based/psi.py
    Explicitly set statistical_type to UnivariateData for this univariate-only detector.
frouros/detectors/data_drift/batch/distance_based/hi_normalized_complement.py
    Explicitly set statistical_type to UnivariateData for this univariate-only detector.
frouros/detectors/data_drift/batch/distance_based/hellinger_distance.py
    Explicitly set statistical_type to UnivariateData for this univariate-only detector.
frouros/tests/integration/test_data_drift.py
    Updated expected values to reflect the new distance formula, added multivariate test cases, and cleaned up import formatting.
frouros/tests/integration/test_callback.py
    Updated the expected distance value for BhattacharyyaDistance.
README.md
    Updated the capability indicator from "U" (univariate) to "M" (multivariate).


Comment on lines +17 to +18
    :param num_bins: number of bins per dimension in which to
        divide probabilities, defaults to 10

Copilot AI Jan 21, 2026


The docstring parameter description should clarify that for multivariate data, num_bins represents the number of bins per dimension, which means the total number of bins in the histogram grows exponentially with dimensionality (num_bins^d where d is the number of dimensions). This is an important consideration for users when choosing the num_bins parameter for high-dimensional data.

Suggested change
-    :param num_bins: number of bins per dimension in which to
-        divide probabilities, defaults to 10
+    :param num_bins: number of bins per dimension used to build the histogram.
+        For multivariate data with d dimensions this results in ``num_bins**d``
+        total bins, so large values of ``num_bins`` in high dimensions can be
+        computationally expensive. Defaults to 10.
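A quick back-of-the-envelope check of that growth (illustrative numbers only, assuming a dense float64 histogram):

```python
num_bins = 10
for n_dims in (1, 2, 5, 8):
    total_bins = num_bins ** n_dims
    # Each bin is a float64 count (8 bytes) in the dense histogram
    print(f"{n_dims} dims -> {total_bins:,} bins (~{total_bins * 8 / 1e6:.1f} MB)")
```

At 5 dimensions the grid already holds 100,000 bins, and at 8 dimensions the counts alone take on the order of 800 MB.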

Comment on lines +193 to +196
        # Add a new axis if X_ref and X are 1D
        if X_ref.ndim == 1:
            X_ref = X_ref[:, np.newaxis]
            X = X[:, np.newaxis]

Copilot AI Jan 21, 2026


Reassigning X_ref and X at lines 195-196 only rebinds the local names to new views, so the caller's arrays are not modified, but spelling it with np.newaxis makes the code easy to misread as in-place mutation. Consider X_ref.reshape(-1, 1), which states the single-feature intent directly (note that np.atleast_2d alone would not work here, since it yields shape (1, n) for 1D input).

Suggested change
-        # Add a new axis if X_ref and X are 1D
-        if X_ref.ndim == 1:
-            X_ref = X_ref[:, np.newaxis]
-            X = X[:, np.newaxis]
+        # Ensure 2D shape if X_ref and X are 1D (treat as single feature)
+        if X_ref.ndim == 1:
+            X_ref = X_ref.reshape(-1, 1)
+            X = X.reshape(-1, 1)
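Both spellings produce the same (n, 1) view, and in either case the function-local rebinding leaves the caller's array untouched, which a small check confirms:

```python
import numpy as np

a = np.arange(4.0)
via_newaxis = a[:, np.newaxis]
via_reshape = a.reshape(-1, 1)

# Both spellings yield the same 2D view of the same data
assert via_newaxis.shape == via_reshape.shape == (4, 1)
assert np.array_equal(via_newaxis, via_reshape)

# Rebinding a name inside a function never mutates the caller's array
assert a.ndim == 1
```

So the suggestion is about readability rather than correctness.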

Comment on lines +198 to 203
        min_edge = np.min(np.vstack((X_ref, X)), axis=0)
        max_edge = np.max(np.vstack((X_ref, X)), axis=0)
        bins = [
            np.linspace(min_edge[i], max_edge[i], num_bins + 1)
            for i in range(X_ref.shape[1])
        ]

Copilot AI Jan 21, 2026


When min_edge equals max_edge for a dimension (all values in that dimension are the same), np.linspace will create bins where all edges are the same value. This will cause all data points in that dimension to fall outside the bin range or into a single bin, which may lead to unexpected behavior. Consider adding a check for constant dimensions and handling them appropriately, such as using a small range around the constant value or skipping that dimension.
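One hedged way to guard a constant dimension, along the lines the reviewer suggests, is to widen degenerate edges by a small pad; `safe_edges` below is a hypothetical helper, not part of frouros:

```python
import numpy as np

def safe_edges(lo, hi, num_bins, pad=1e-8):
    # Hypothetical guard: if the dimension is constant, widen the range
    # slightly so the bins have nonzero width and the value falls inside.
    if lo == hi:
        lo, hi = lo - pad, hi + pad
    return np.linspace(lo, hi, num_bins + 1)

x = np.full(100, 3.0)  # constant feature
edges = safe_edges(x.min(), x.max(), num_bins=10)
hist, _ = np.histogram(x, bins=edges)
assert hist.sum() == 100  # every sample lands in a bin
```

Without the pad, all edges collapse to the same value and the histogram no longer partitions the data meaningfully.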

Comment on lines +96 to 99
        # Use absolute value to avoid negative zero values
        bhattacharyya = np.abs(-np.log(bc))

        return bhattacharyya

Copilot AI Jan 21, 2026


Using np.abs() on the result of -np.log(bc) is unnecessary because -np.log(x) for x in (0, 1] always produces a non-negative value. The logarithm of a value between 0 and 1 is negative, so negating it produces a positive value. The np.abs() call may mask potential issues if bc somehow becomes greater than 1, which would indicate a problem with the normalization. Consider removing np.abs() or adding an assertion to verify bc is in the valid range.

Suggested change
-        # Use absolute value to avoid negative zero values
-        bhattacharyya = np.abs(-np.log(bc))
-
-        return bhattacharyya
+        bhattacharyya = -np.log(bc)
+
+        return bhattacharyya
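A few spot checks of the reviewer's claim (plain numpy, independent of the frouros code):

```python
import numpy as np

# -log(bc) is already non-negative for bc in (0, 1]
for bc in (1e-12, 0.25, 0.5, 1.0):
    assert -np.log(bc) >= 0.0

# The only quirk at bc == 1.0 is a negative zero, which compares equal to 0.0
assert -np.log(1.0) == 0.0

# A range check surfaces normalization bugs instead of silently hiding them
bc = 1.5  # a value like this would indicate broken normalization
assert not (0.0 < bc <= 1.0)
```

The negative-zero case the original comment guards against is cosmetic: -0.0 compares equal to 0.0 everywhere it matters.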

Comment on lines +205 to +206
        X_ref_hist, _ = np.histogramdd(X_ref, bins=bins)
        X_hist, _ = np.histogramdd(X, bins=bins)

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using histogramdd with a fixed number of bins per dimension can lead to exponential growth in the number of bins as dimensionality increases. For example, with num_bins=10 and 5 dimensions, the histogram will have 10^5 = 100,000 bins, most of which may be empty. This can lead to performance issues and memory concerns for high-dimensional data. Consider adding a warning or validation for high-dimensional inputs, or implementing an adaptive binning strategy for multivariate data.
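A dimensionality guard along these lines could look like the following sketch; `check_histogram_size` and its `max_bins` threshold are hypothetical, not part of frouros:

```python
import warnings

def check_histogram_size(n_dims, num_bins, max_bins=1_000_000):
    # Hypothetical guard: warn before np.histogramdd allocates a huge dense grid
    total_bins = num_bins ** n_dims
    if total_bins > max_bins:
        warnings.warn(
            f"Histogram would allocate {total_bins:,} bins "
            f"({num_bins} per dimension over {n_dims} dimensions); "
            "consider fewer bins or lower dimensionality."
        )
    return total_bins

assert check_histogram_size(2, 10) == 100  # small grid, no warning
```

Adaptive binning (e.g. fewer bins per dimension as dimensionality grows) would be a heavier but more principled alternative.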

Comment on lines +85 to +94
        # Add small epsilon to avoid log(0)
        epsilon = np.finfo(float).eps
        X_percents = X_percents + epsilon
        Y_percents = Y_percents + epsilon

        # Compute Bhattacharyya coefficient
        bc = np.sum(np.sqrt(X_percents * Y_percents))
        # Clip between [0,1] to avoid numerical errors
        bc = np.clip(bc, a_min=0, a_max=1)

Copilot AI Jan 21, 2026


The clipping operation at line 93 may mask issues with the epsilon addition at lines 87-88. If the histograms are properly normalized and epsilon is not added, the Bhattacharyya coefficient should naturally be in [0,1]. The current approach of adding epsilon to all bins and then clipping could hide the fact that the coefficient exceeds 1 due to improper normalization after epsilon addition. Consider removing the clipping and ensuring proper normalization, or add epsilon only when needed (i.e., when bc equals 0 before taking the log).

Suggested change
-        # Add small epsilon to avoid log(0)
-        epsilon = np.finfo(float).eps
-        X_percents = X_percents + epsilon
-        Y_percents = Y_percents + epsilon
-
-        # Compute Bhattacharyya coefficient
-        bc = np.sum(np.sqrt(X_percents * Y_percents))
-        # Clip between [0,1] to avoid numerical errors
-        bc = np.clip(bc, a_min=0, a_max=1)
+        # Compute Bhattacharyya coefficient directly from normalized histograms
+        bc = np.sum(np.sqrt(X_percents * Y_percents))
+        # Guard against log(0) or log of negative due to numerical issues
+        epsilon = np.finfo(float).eps
+        if bc <= 0.0:
+            bc = epsilon

Comment on lines +85 to +93
# Add small epsilon to avoid log(0)
epsilon = np.finfo(float).eps
X_percents = X_percents + epsilon
Y_percents = Y_percents + epsilon

# Compute Bhattacharyya coefficient
bc = np.sum(np.sqrt(X_percents * Y_percents))
# Clip between [0,1] to avoid numerical errors
bc = np.clip(bc, a_min=0, a_max=1)

Copilot AI Jan 21, 2026


The normalization after adding epsilon to the histograms may not preserve probability mass correctly. When epsilon is added to X_percents and Y_percents at lines 87-88, the total probability mass of each distribution increases beyond 1.0. The sum of X_percents + epsilon will be greater than 1.0, which violates the requirement that probabilities sum to 1. This affects the validity of the Bhattacharyya coefficient calculation. Consider either: (1) re-normalizing after adding epsilon, or (2) adding epsilon conditionally only where needed in the bc calculation.

Suggested change
-        # Add small epsilon to avoid log(0)
-        epsilon = np.finfo(float).eps
-        X_percents = X_percents + epsilon
-        Y_percents = Y_percents + epsilon
-
-        # Compute Bhattacharyya coefficient
-        bc = np.sum(np.sqrt(X_percents * Y_percents))
-        # Clip between [0,1] to avoid numerical errors
-        bc = np.clip(bc, a_min=0, a_max=1)
+        # Small epsilon to avoid log(0) when computing the distance
+        epsilon = np.finfo(float).eps
+        # Compute Bhattacharyya coefficient from normalized histograms
+        bc = np.sum(np.sqrt(X_percents * Y_percents))
+        # Clip between [epsilon, 1] to avoid numerical errors and log(0)
+        bc = np.clip(bc, a_min=epsilon, a_max=1.0)
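The mass-inflation effect is easy to verify with plain numpy (illustrative bin counts, independent of the frouros code):

```python
import numpy as np

eps = np.finfo(float).eps

# A properly normalized histogram with 100,000 bins of equal mass
p = np.full(100_000, 1e-5)
assert np.isclose(p.sum(), 1.0)

# Adding eps to every bin inflates total mass by roughly num_bins * eps
inflated = p + eps
assert inflated.sum() > 1.0

# Re-normalizing restores a valid probability distribution
renormalized = inflated / inflated.sum()
assert np.isclose(renormalized.sum(), 1.0)
```

The excess mass is tiny per bin, but it scales with the number of bins, so it matters most in exactly the high-bin-count multivariate regime this PR introduces.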
