Adding distribution overlap metrics #95

@axiomcura

Description

One advantage of EMD is that it is a distance metric: it indicates the amount of work still required for a treated population. However, it does not tell us how much the two single-cell populations (control and treated) overlap.
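To make the gap concrete, here is a minimal 1D sketch, assuming synthetic Gaussian populations and `scipy.stats.wasserstein_distance` as the EMD. The `histogram_overlap` helper and all sample values are hypothetical and only for illustration; they are not part of the current codebase. Both cases use the same shift, so the EMD is roughly equal, yet the overlap differs sharply.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def histogram_overlap(a, b, bins=50):
    """Shared area of two normalized histograms (0 = disjoint, 1 = identical).

    Hypothetical helper used only for this illustration.
    """
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())


rng = np.random.default_rng(0)
n = 5000

# Case A: narrow populations shifted by 1 unit (almost no overlap).
narrow_ctrl = rng.normal(0.0, 0.1, n)
narrow_trt = rng.normal(1.0, 0.1, n)

# Case B: broad populations shifted by the same 1 unit (substantial overlap).
broad_ctrl = rng.normal(0.0, 1.0, n)
broad_trt = rng.normal(1.0, 1.0, n)

emd_narrow = wasserstein_distance(narrow_ctrl, narrow_trt)
emd_broad = wasserstein_distance(broad_ctrl, broad_trt)
overlap_narrow = histogram_overlap(narrow_ctrl, narrow_trt)
overlap_broad = histogram_overlap(broad_ctrl, broad_trt)

print(f"narrow: EMD={emd_narrow:.2f}, overlap={overlap_narrow:.2f}")
print(f"broad:  EMD={emd_broad:.2f}, overlap={overlap_broad:.2f}")
```

The EMD is driven by the shift between the populations, while the overlap also depends on their spread, which is why a second metric seems useful.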

We can add a separate function that estimates the amount of overlap using a traditional logistic regression model or a binary tree. If a classifier cannot tell the two populations apart (AUROC near 0.5), they overlap heavily; if it separates them perfectly (AUROC near 1.0), they do not overlap. This gives a global measure of overlap and makes it easier to interpret how single cells overlap in the on-morphology signature space.

Implementation example:

import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def classifier_auroc(control_df, treated_df, n_splits=5, random_state=42):
    """
    AUROC-based overlap for pre-normalized, feature-selected morphological profiles.

    Parameters
    ----------
    control_df : pd.DataFrame  shape (n_control, n_features)
    treated_df : pd.DataFrame  shape (n_treated, n_features)
    n_splits : int  number of stratified cross-validation folds
    random_state : int  seed for fold shuffling and the classifier

    Returns
    -------
    dict with auroc, overlap, and population sizes
    """
    n_ctrl = len(control_df)
    n_trt = len(treated_df)
    ratio = min(n_ctrl, n_trt) / max(n_ctrl, n_trt)

    # class_weight="balanced" is always applied below; this only flags strong imbalance.
    if ratio < 0.3:
        warnings.warn(
            f"High imbalance detected: {n_ctrl} control vs {n_trt} treated "
            f"(ratio={ratio:.2f}). class_weight='balanced' is applied."
        )

    # Stack both populations and label them: 0 = control, 1 = treated.
    X = np.vstack([control_df.to_numpy(), treated_df.to_numpy()])
    y = np.concatenate([np.zeros(n_ctrl, dtype=int), np.ones(n_trt, dtype=int)])

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    clf = LogisticRegression(
        max_iter=1000,
        random_state=random_state,
        class_weight="balanced",
    )

    # Mean cross-validated AUROC: 0.5 = indistinguishable, 1.0 = fully separable.
    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    # Rescale to [0, 1]: 1 = complete overlap, 0 = perfect separation.
    overlap = 1 - abs(auroc - 0.5) * 2

    return {"auroc": auroc, "overlap": overlap, "n_control": n_ctrl, "n_treated": n_trt}
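A quick sanity check of the AUROC-to-overlap mapping could look like the sketch below. It assumes synthetic Gaussian profiles rather than real single-cell data, and `auroc_overlap` is a hypothetical, compact stand-in for the function above so that the snippet runs on its own. Populations drawn from the same distribution should score close to 1, and well-separated populations close to 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def auroc_overlap(control, treated, n_splits=5, seed=0):
    """Compact, hypothetical version of the AUROC-based overlap score."""
    X = np.vstack([control, treated])
    y = np.concatenate(
        [np.zeros(len(control), dtype=int), np.ones(len(treated), dtype=int)]
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    # 1 = complete overlap (AUROC 0.5), 0 = perfect separation (AUROC 1.0).
    return 1 - abs(auroc - 0.5) * 2


rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=(300, 5))

# Treated cells drawn from the same distribution as control.
treated_same = rng.normal(0.0, 1.0, size=(300, 5))
# Treated cells shifted far away from control in every feature.
treated_shifted = rng.normal(5.0, 1.0, size=(300, 5))

overlap_same = auroc_overlap(control, treated_same)
overlap_shifted = auroc_overlap(control, treated_shifted)

print(f"same distribution:    overlap={overlap_same:.2f}")
print(f"shifted distribution: overlap={overlap_shifted:.2f}")
```

Note that a linear classifier only detects linearly separable differences, so a high overlap from this score does not rule out nonlinear separation.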

However, incorporating the overlap score into the on-Buscar score needs further discussion. Should it be treated separately, or should higher overlap lower the overall on-Buscar score?
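To ground that discussion, the two options could be sketched roughly as follows. Everything here is hypothetical: `report_separately`, `penalized_score`, the `weight` parameter, and the multiplicative penalty are illustrative assumptions, not existing Buscar API or an agreed design.

```python
def report_separately(on_score, overlap):
    """Option 1 (hypothetical): keep the two metrics side by side, unchanged."""
    return {"on_score": on_score, "overlap": overlap}


def penalized_score(on_score, overlap, weight=1.0):
    """Option 2 (hypothetical): higher overlap lowers the on-score.

    `weight` in [0, 1] controls how strongly overlap reduces the score;
    weight=0 leaves the score untouched.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return on_score * (1.0 - weight * overlap)


# Example values are made up for illustration.
separate = report_separately(on_score=0.8, overlap=0.5)
combined = penalized_score(on_score=0.8, overlap=0.5)
print(separate)
print(combined)
```

Reporting the metrics separately keeps each one interpretable on its own, whereas a combined score is easier to rank by but hides which component drove the value.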

Labels: enhancement (New feature or request)
