Multivariate – Cross-Column ID Consistency #15

**Is your feature request related to a problem? Please describe.**
Datasets that contain multiple ID or key columns (e.g. `start`, `end`, `length`, or `sequence_id`, `peptide_id`) should have perfectly consistent relationships between them (e.g. `length = end - start`). When these relationships break down, or when two columns that are expected to be identical are not, it indicates a data integrity issue that the current correlation analysis does not surface clearly.

**Describe the solution you'd like**
Add a `_check_cross_column_id_consistency()` function to `quality_assessment/relationships.py`:
1. Identify column pairs with an absolute Pearson correlation or Eta² association ≥ 0.99 (near-perfect association)
2. For numeric pairs, fit a simple OLS model and check whether the relationship is strictly linear (residuals ≈ 0)
3. Flag pairs whose association is ≥ 0.99 but whose residuals are non-trivial as "inconsistent ID relationship"
4. For string columns, detect columns that hold identical values under different names (potential duplicate columns or derived keys)

Surface the findings in the Relationships section of the Quality Assessment Report.
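
To make the proposal concrete, here is a minimal pure-Python sketch of the steps above. It is an illustration under assumptions, not the intended implementation: the input shape (a dict of column name to list of values), the parameter names `min_corr` and `tol`, and the helpers `pearson` and `ols_residuals` are all hypothetical. It covers only the Pearson case and leaves out Eta² entirely.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ols_residuals(xs, ys):
    """Residuals of a one-variable least-squares fit y = a + b*x.

    Assumes xs is not constant (guaranteed here by the correlation filter).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def check_cross_column_id_consistency(columns, min_corr=0.99, tol=1e-9):
    """Hypothetical sketch. columns: dict of name -> equal-length list.

    Returns a list of (col_a, col_b, finding) tuples.
    """
    findings = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        numeric = all(isinstance(v, (int, float)) for v in xs + ys)
        if numeric:
            # Steps 1-3: near-perfect correlation, but the linear fit is not exact.
            if abs(pearson(xs, ys)) >= min_corr:
                worst = max(abs(e) for e in ols_residuals(xs, ys))
                scale = max(1.0, max(abs(y) for y in ys))  # relative tolerance
                if worst > tol * scale:
                    findings.append((a, b, "inconsistent ID relationship"))
        elif xs == ys:
            # Step 4: same values under two different column names.
            findings.append((a, b, "identical values, different names"))
    return findings


columns = {
    "start": [0, 10, 20, 30, 40],
    "end": [5, 15, 25, 35, 46],  # last row breaks end = start + 5
    "offset": [100, 110, 120, 130, 140],  # exactly start + 100
    "sequence_id": ["a", "b", "c", "d", "e"],
    "peptide_id": ["a", "b", "c", "d", "e"],
}
findings = check_cross_column_id_consistency(columns)
```

On this toy input, `start`/`end` is flagged because one row breaks an otherwise linear relationship, `start`/`offset` passes because the fit is exact, and the two string columns are reported as duplicates. A real implementation would also need to decide how to handle missing values and how to choose the residual tolerance, neither of which is addressed here.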

**Describe alternatives you've considered**
Manual rule specification (e.g. `--assert "length == end - start"`). This is more precise but requires the user to know the schema, whereas the automatic approach can catch unexpected inconsistencies.
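
For comparison, the manual alternative could look roughly like the sketch below. It does not parse an `--assert` string; the rule is passed as a plain Python callable, and the function name `find_rule_violations` and the row-as-dict input shape are assumptions made for illustration.

```python
def find_rule_violations(rows, rule):
    """Return the indices of rows (dicts) for which rule(row) is false."""
    return [i for i, row in enumerate(rows) if not rule(row)]


rows = [
    {"start": 0, "end": 5, "length": 5},
    {"start": 10, "end": 15, "length": 5},
    {"start": 20, "end": 26, "length": 5},  # length disagrees with end - start
]

# The rule from the example above, written as a callable.
violations = find_rule_violations(
    rows, lambda r: r["length"] == r["end"] - r["start"]
)
```

This pinpoints the offending rows exactly, but only for relationships the user already knows to declare.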

**Additional context**
A Pearson correlation of exactly 1.0 or −1.0 between two columns that are not mathematically related (e.g. a sequence ID and a peptide count) is a strong signal that one column was derived from the other, which is a common source of train/test leakage in ML pipelines.

