Extend multiple data drift univariate methods to multivariate #353

alvarolopez merged 2 commits into main
Conversation
Pull request overview
This pull request extends the BhattacharyyaDistance detector from univariate to multivariate data support. The implementation now uses the mathematically correct Bhattacharyya distance formula (-log(BC)) instead of the coefficient complement (1 - BC), and leverages multidimensional histograms to handle multivariate distributions.
Changes:
- Extended BhattacharyyaDistance to support multivariate data by changing from UnivariateData to MultivariateData
- Updated _calculate_bins_values in base class to use histogramdd for both univariate and multivariate cases
- Modified Bhattacharyya distance calculation to use the proper distance formula with logarithm
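The changes above can be summarized in a minimal sketch: build shared per-dimension bin edges over both samples, histogram each sample with `np.histogramdd`, normalize to probabilities, and compute `-log(BC)`. The function name, normalization details, and edge handling here are illustrative assumptions, not the actual frouros implementation:

```python
import numpy as np

def bhattacharyya_distance(X_ref, X, num_bins=10):
    """Illustrative multivariate Bhattacharyya distance via histogramdd."""
    # Promote 1D inputs to a single-feature 2D shape
    if X_ref.ndim == 1:
        X_ref = X_ref.reshape(-1, 1)
        X = X.reshape(-1, 1)
    # Shared bin edges per dimension spanning both samples
    stacked = np.vstack((X_ref, X))
    min_edge = stacked.min(axis=0)
    max_edge = stacked.max(axis=0)
    bins = [
        np.linspace(min_edge[i], max_edge[i], num_bins + 1)
        for i in range(X_ref.shape[1])
    ]
    X_ref_hist, _ = np.histogramdd(X_ref, bins=bins)
    X_hist, _ = np.histogramdd(X, bins=bins)
    # Normalize counts to probabilities
    X_ref_p = X_ref_hist / X_ref_hist.sum()
    X_p = X_hist / X_hist.sum()
    # Bhattacharyya coefficient, clipped away from 0 to avoid log(0)
    bc = np.sum(np.sqrt(X_ref_p * X_p))
    bc = np.clip(bc, np.finfo(float).eps, 1.0)
    return -np.log(bc)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(500, 2))
d_same = bhattacharyya_distance(a, rng.normal(0, 1, size=(500, 2)))
d_shift = bhattacharyya_distance(a, rng.normal(3, 1, size=(500, 2)))
```

As expected, two samples from the same 2D Gaussian yield a much smaller distance than a sample shifted by three standard deviations.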
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| frouros/detectors/data_drift/batch/distance_based/bhattacharyya_distance.py | Changed statistical type to MultivariateData, updated distance formula to use -log(BC), and modified documentation/examples for multivariate support |
| frouros/detectors/data_drift/batch/distance_based/base.py | Refactored _calculate_bins_values to handle both 1D and multivariate data using histogramdd, added statistical_type parameter to BaseDistanceBasedBins constructor |
| frouros/detectors/data_drift/batch/distance_based/psi.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/detectors/data_drift/batch/distance_based/hi_normalized_complement.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/detectors/data_drift/batch/distance_based/hellinger_distance.py | Explicitly set statistical_type to UnivariateData for univariate-only detector |
| frouros/tests/integration/test_data_drift.py | Updated expected values to reflect new distance formula, added multivariate test cases, cleaned up import formatting |
| frouros/tests/integration/test_callback.py | Updated expected distance value for BhattacharyyaDistance |
| README.md | Updated capability indicator from "U" (univariate) to "M" (multivariate) |
```
:param num_bins: number of bins per dimension in which to
    divide probabilities, defaults to 10
```
The docstring parameter description should clarify that for multivariate data, num_bins represents the number of bins per dimension, which means the total number of bins in the histogram grows exponentially with dimensionality (num_bins^d where d is the number of dimensions). This is an important consideration for users when choosing the num_bins parameter for high-dimensional data.
```diff
-:param num_bins: number of bins per dimension in which to
-    divide probabilities, defaults to 10
+:param num_bins: number of bins per dimension used to build the histogram.
+    For multivariate data with d dimensions this results in ``num_bins**d``
+    total bins, so large values of ``num_bins`` in high dimensions can be
+    computationally expensive. Defaults to 10.
```
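The exponential growth mentioned in the suggested docstring is easy to quantify. This small illustrative calculation shows the dense cell count and approximate float64 memory footprint per dimensionality:

```python
import numpy as np

# Illustrative: a dense histogramdd grid has num_bins ** d cells
num_bins = 10
cells = {d: num_bins ** d for d in (1, 2, 5, 8)}
# Approximate memory of a dense float64 grid, in megabytes
mem_mb = {d: n * np.dtype(np.float64).itemsize / 1e6 for d, n in cells.items()}
# 5 dimensions already give 100,000 cells; 8 give 100 million (~800 MB)
```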
```python
# Add a new axis if X_ref and X are 1D
if X_ref.ndim == 1:
    X_ref = X_ref[:, np.newaxis]
    X = X[:, np.newaxis]
```
Rebinding X_ref and X with a new axis here is not actually an in-place modification: the assignments create new array views that exist only within the function scope, so the caller's arrays are untouched. The pattern is nevertheless easy to misread as mutation. Consider using X_ref.reshape(-1, 1) (or np.atleast_2d followed by a transpose) to make the intent of "treat 1D input as a single feature column" explicit.
```diff
-# Add a new axis if X_ref and X are 1D
-if X_ref.ndim == 1:
-    X_ref = X_ref[:, np.newaxis]
-    X = X[:, np.newaxis]
+# Ensure 2D shape if X_ref and X are 1D (treat as single feature)
+if X_ref.ndim == 1:
+    X_ref = X_ref.reshape(-1, 1)
+    X = X.reshape(-1, 1)
```
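For what it's worth, both spellings behave identically for a contiguous 1D array: each produces a view of shape `(n, 1)` without copying, so the suggestion is purely about readability. A quick illustrative check:

```python
import numpy as np

x = np.arange(4.0)
a = x[:, np.newaxis]   # inserted-axis view, shape (4, 1)
b = x.reshape(-1, 1)   # same shape; arguably clearer intent

# Both are views of the original contiguous array, not copies
assert a.shape == b.shape == (4, 1)
assert np.shares_memory(x, a)
assert np.shares_memory(x, b)
assert np.array_equal(a, b)
```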
```python
min_edge = np.min(np.vstack((X_ref, X)), axis=0)
max_edge = np.max(np.vstack((X_ref, X)), axis=0)
bins = [
    np.linspace(min_edge[i], max_edge[i], num_bins + 1)
    for i in range(X_ref.shape[1])
]
```
When min_edge equals max_edge for a dimension (all values in that dimension are the same), np.linspace will create bins where all edges are the same value. This will cause all data points in that dimension to fall outside the bin range or into a single bin, which may lead to unexpected behavior. Consider adding a check for constant dimensions and handling them appropriately, such as using a small range around the constant value or skipping that dimension.
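One way to handle the degenerate case is to pad constant dimensions with a small symmetric range before building the edges. This is a hypothetical helper (the name `safe_bin_edges` and the `pad` value are illustrative, not part of frouros):

```python
import numpy as np

def safe_bin_edges(X_ref, X, num_bins=10, pad=1e-8):
    """Hypothetical sketch: widen constant dimensions slightly so that
    np.linspace produces distinct, usable bin edges."""
    stacked = np.vstack((X_ref, X))
    min_edge = stacked.min(axis=0).astype(float)
    max_edge = stacked.max(axis=0).astype(float)
    # Detect dimensions where every value is identical
    constant = min_edge == max_edge
    min_edge[constant] -= pad
    max_edge[constant] += pad
    return [
        np.linspace(min_edge[i], max_edge[i], num_bins + 1)
        for i in range(stacked.shape[1])
    ]

# A dimension where every value is 5.0 still gets usable edges
edges = safe_bin_edges(np.full((4, 1), 5.0), np.full((3, 1), 5.0))
```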
```python
# Use absolute value to avoid negative zero values
bhattacharyya = np.abs(-np.log(bc))

return bhattacharyya
```
Using np.abs() on the result of -np.log(bc) is unnecessary because -np.log(x) for x in (0, 1] always produces a non-negative value. The logarithm of a value between 0 and 1 is negative, so negating it produces a positive value. The np.abs() call may mask potential issues if bc somehow becomes greater than 1, which would indicate a problem with the normalization. Consider removing np.abs() or adding an assertion to verify bc is in the valid range.
```diff
-# Use absolute value to avoid negative zero values
-bhattacharyya = np.abs(-np.log(bc))
+bhattacharyya = -np.log(bc)
 return bhattacharyya
```
```python
X_ref_hist, _ = np.histogramdd(X_ref, bins=bins)
X_hist, _ = np.histogramdd(X, bins=bins)
```
Using histogramdd with a fixed number of bins per dimension can lead to exponential growth in the number of bins as dimensionality increases. For example, with num_bins=10 and 5 dimensions, the histogram will have 10^5 = 100,000 bins, most of which may be empty. This can lead to performance issues and memory concerns for high-dimensional data. Consider adding a warning or validation for high-dimensional inputs, or implementing an adaptive binning strategy for multivariate data.
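A validation along these lines could look like the following sketch. The `MAX_TOTAL_BINS` threshold and the function name are illustrative assumptions, not existing frouros code:

```python
import warnings

MAX_TOTAL_BINS = 1_000_000  # illustrative threshold, not from frouros

def check_histogram_size(num_dims, num_bins):
    """Hypothetical guard: warn when the dense histogram grid would be huge."""
    total = num_bins ** num_dims
    if total > MAX_TOTAL_BINS:
        warnings.warn(
            f"histogramdd with {num_bins} bins over {num_dims} dimensions "
            f"allocates {total:,} cells; consider fewer bins or an "
            "adaptive/sparse binning strategy.",
            RuntimeWarning,
        )
    return total
```

A caller could invoke this once before building the histograms, letting users opt into high-dimensional grids knowingly instead of hitting a silent memory blow-up.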
```python
# Add small epsilon to avoid log(0)
epsilon = np.finfo(float).eps
X_percents = X_percents + epsilon
Y_percents = Y_percents + epsilon

# Compute Bhattacharyya coefficient
bc = np.sum(np.sqrt(X_percents * Y_percents))
# Clip between [0,1] to avoid numerical errors
bc = np.clip(bc, a_min=0, a_max=1)
```
The clipping operation at line 93 may mask issues with the epsilon addition at lines 87-88. If the histograms are properly normalized and epsilon is not added, the Bhattacharyya coefficient should naturally be in [0,1]. The current approach of adding epsilon to all bins and then clipping could hide the fact that the coefficient exceeds 1 due to improper normalization after epsilon addition. Consider removing the clipping and ensuring proper normalization, or add epsilon only when needed (i.e., when bc equals 0 before taking the log).
```diff
-# Add small epsilon to avoid log(0)
-epsilon = np.finfo(float).eps
-X_percents = X_percents + epsilon
-Y_percents = Y_percents + epsilon
-# Compute Bhattacharyya coefficient
-bc = np.sum(np.sqrt(X_percents * Y_percents))
-# Clip between [0,1] to avoid numerical errors
-bc = np.clip(bc, a_min=0, a_max=1)
+# Compute Bhattacharyya coefficient directly from normalized histograms
+bc = np.sum(np.sqrt(X_percents * Y_percents))
+# Guard against log(0) or log of negative due to numerical issues
+epsilon = np.finfo(float).eps
+if bc <= 0.0:
+    bc = epsilon
```
```python
# Add small epsilon to avoid log(0)
epsilon = np.finfo(float).eps
X_percents = X_percents + epsilon
Y_percents = Y_percents + epsilon

# Compute Bhattacharyya coefficient
bc = np.sum(np.sqrt(X_percents * Y_percents))
# Clip between [0,1] to avoid numerical errors
bc = np.clip(bc, a_min=0, a_max=1)
```
The normalization after adding epsilon to the histograms may not preserve probability mass correctly. When epsilon is added to X_percents and Y_percents at lines 87-88, the total probability mass of each distribution increases beyond 1.0. The sum of X_percents + epsilon will be greater than 1.0, which violates the requirement that probabilities sum to 1. This affects the validity of the Bhattacharyya coefficient calculation. Consider either: (1) re-normalizing after adding epsilon, or (2) adding epsilon conditionally only where needed in the bc calculation.
```diff
-# Add small epsilon to avoid log(0)
-epsilon = np.finfo(float).eps
-X_percents = X_percents + epsilon
-Y_percents = Y_percents + epsilon
-# Compute Bhattacharyya coefficient
-bc = np.sum(np.sqrt(X_percents * Y_percents))
-# Clip between [0,1] to avoid numerical errors
-bc = np.clip(bc, a_min=0, a_max=1)
+# Small epsilon to avoid log(0) when computing the distance
+epsilon = np.finfo(float).eps
+# Compute Bhattacharyya coefficient from normalized histograms
+bc = np.sum(np.sqrt(X_percents * Y_percents))
+# Clip between [epsilon, 1] to avoid numerical errors and log(0)
+bc = np.clip(bc, a_min=epsilon, a_max=1.0)
```
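The mass-inflation point is easy to demonstrate on a toy histogram with disjoint support; this illustrative comparison (not frouros code) contrasts the two epsilon strategies:

```python
import numpy as np

eps = np.finfo(float).eps
p = np.array([0.5, 0.5, 0.0, 0.0])  # normalized histogram with empty bins
q = np.array([0.0, 0.0, 0.5, 0.5])  # disjoint support

# Strategy 1 (original): add eps to every bin, then clip bc.
# The total mass of each "distribution" now exceeds 1 by len(p) * eps.
bc_eps = np.sum(np.sqrt((p + eps) * (q + eps)))

# Strategy 2 (suggested): keep histograms normalized, clip bc to [eps, 1]
bc_clip = np.clip(np.sum(np.sqrt(p * q)), eps, 1.0)

# Disjoint distributions: bc is ~0 either way, and the distance stays finite
distance = -np.log(bc_clip)
```

With clipping applied only to `bc`, the probability vectors remain valid distributions and the distance for fully disjoint supports is large but finite (`-log(eps)`).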
📚 Documentation preview 📚: https://frouros--353.org.readthedocs.build/en/353/