Multivariate – Cross-Column ID Consistency #15

Description

@hansen-maria

Is your feature request related to a problem? Please describe.
Datasets that contain multiple ID or key columns (e.g. start, end, length, or sequence_id, peptide_id) should have perfectly consistent relationships between them (e.g. length = end - start). When these relationships break down — or when two columns are expected to be identical but aren't — it indicates a data integrity issue that current correlation analysis does not surface clearly.

Describe the solution you'd like
Add a _check_cross_column_id_consistency() function to quality_assessment/relationships.py:

  1. Identify column pairs whose absolute Pearson correlation |r| or Eta² association is ≥ 0.99 (near-perfect association)
  2. For numeric pairs, fit a simple OLS line and check whether the relationship is strictly linear (residuals ≈ 0)
  3. Flag pairs whose association is ≥ 0.99 but whose residuals are non-trivial as "inconsistent ID relationship"
  4. For string columns, detect columns that are identical in values but named differently (potential duplicate columns or derived keys)

Surface the findings in the Relationships section of the Quality Assessment Report.

Describe alternatives you've considered
Manual rule specification (e.g. --assert "length == end - start"). This is more precise but requires the user to know the schema, whereas the automatic approach also catches inconsistencies nobody thought to write a rule for.
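
For comparison, a manual rule check could look something like the sketch below. The helper name `find_rule_violations` and the row-dict representation are hypothetical, and the rule is passed as a Python callable instead of parsing an `--assert` string, which sidesteps evaluating user-supplied expressions.

```python
def find_rule_violations(rows, rule):
    """Return the indices of rows for which `rule(row)` is falsy.

    `rows` is a list of dicts (one per record); `rule` is any callable
    that takes a row and returns True when the row is consistent.
    """
    return [i for i, row in enumerate(rows) if not rule(row)]


rows = [
    {"start": 0, "end": 5, "length": 5},
    {"start": 10, "end": 15, "length": 6},  # length should be 5
]

# The rule from the example: length == end - start
violations = find_rule_violations(rows, lambda r: r["length"] == r["end"] - r["start"])
print(violations)  # [1]
```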

Additional context
A Pearson correlation of exactly 1.0 or −1.0 between two columns that are not expected to be mathematically related (e.g. a sequence ID and a peptide count) is a strong signal that one column was derived from the other, which is a common source of train/test leakage in ML pipelines.

Labels: enhancement (New feature or request)