Skip to content

Fix df_mash_corr_dist silently overwritten with raw Mash distances#35

Open
dingyifei wants to merge 1 commit intoSBRG:mainfrom
dingyifei:2b-fix
Open

Fix df_mash_corr_dist silently overwritten with raw Mash distances#35
dingyifei wants to merge 1 commit intoSBRG:mainfrom
dingyifei:2b-fix

Conversation

@dingyifei
Copy link

Marimo notebooks showed some discrepancies in the filtering process, it appears that the 2b notebook potentially had a bug introduced in commit e2f5398.

Commit message:

In 6 places across cells 14, 20, 23, 32, and 34, df_mash_corr_dist (and related _complete variants) were incorrectly assigned from df_mash_square instead of from themselves. This caused correlation distance matrices to be replaced with raw Mash distances, changing the effective fcluster thresholds by ~30x and producing fragmented clustering results.


    Bug: df_mash_corr_dist silently overwritten with raw Mash distances                                                                                                                                       
   
  Location                                                                                                                                                                                                  
                  
  examples/2b_mash_filtration_and_clustering.ipynb — cells 14, 20, and 23 (introduced in e2f5398, Feb 5 2025, never corrected).

  What happens

  Cell 11 correctly computes the Pearson correlation distance matrix:

  df_mash_corr = df_mash_square.corr()
  df_mash_corr_dist = 1 - df_mash_corr   # ← correct: values in [0, ~1.3]

  Then three subsequent cells overwrite df_mash_corr_dist (and related variables) with the raw Mash square distance matrix instead:

  Cell 14 — filtering to scrubbed strains:
  df_mash_square    = df_mash_square.loc[scrubbed_strains, scrubbed_strains]
  df_mash_corr      = df_mash_corr.loc[scrubbed_strains, scrubbed_strains]
  df_mash_corr_dist = df_mash_square.loc[scrubbed_strains, scrubbed_strains]  # BUG
  #                    ^^^^^^^^^^^^^^ should be df_mash_corr_dist

  Cell 20 — filtering by Mash < 0.05 to reference strains:
  df_mash_square    = df_mash_square.loc[good_strains, good_strains]
  df_mash_corr      = df_mash_corr.loc[good_strains, good_strains]
  df_mash_corr_dist = df_mash_square.loc[good_strains, good_strains]          # BUG
  #                    ^^^^^^^^^^^^^^ should be df_mash_corr_dist

  Cell 23 — subsetting to Complete genomes:
  df_mash_square_complete    = df_mash_square.loc[complete_seqs, complete_seqs]
  df_mash_corr_complete      = df_mash_square.loc[complete_seqs, complete_seqs]  # BUG
  df_mash_corr_dist_complete = df_mash_square.loc[complete_seqs, complete_seqs]  # BUG
  #                             ^^^^^^^^^^^^^^ should be df_mash_corr and df_mash_corr_dist respectively

  The same bug in the iterative loop

  Cells 32 and 34 also propagate it during the small-cluster removal loop:
  df_mash_corr_complete      = remove_bad_strains(df_mash_square_complete, bad_genomes_list)  # BUG
  #                                                ^^^^^^^^^^^^^^^^^^^^^^ should be df_mash_corr_complete

  Why it matters

  sensitivity_analysis() and cluster_corr_dist() in pyphylon/mash.py are designed to operate on correlation distances. They use thresh * dist.max() as the fcluster distance cutoff. The distance metric
  fundamentally changes the clustering behavior:

  ┌─────────────────────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
  │                                         │   Raw Mash distances (buggy)    │ Correlation distances (correct) │
  ├─────────────────────────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
  │ Value range                             │ 0 – 0.045                       │ 0 – 1.3                         │
  ├─────────────────────────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
  │ dist.max()                              │ ~0.045                          │ ~1.3                            │
  ├─────────────────────────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
  │ Effective fcluster thresholds           │ 0.000045 – 0.045                │ 0.0013 – 1.3                    │
  ├─────────────────────────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
  │ Result                                  │ Many small, fragmented clusters │ Fewer, larger clusters          │
  ├─────────────────────────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
  │ Genomes surviving small-cluster pruning │ 373 / 468 (80%)                 │ 465 / 468 (99%)                 │
  └─────────────────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

  The over-fragmentation from using raw Mash distances causes the iterative small-cluster removal loop to discard 95 additional genomes that should have been retained.

  Downstream impact

  The 95 extra removed genomes cascade through the entire pipeline:

  1. 2c (pangenome build): fewer genomes → fewer genes discovered (8,318 vs 11,328)
  2. 3a (CAR classification): smaller accessory genome matrix (1,273 × 355 vs 1,801 × 465)
  3. 4a (NMF): different AIC landscape → 25 phylons instead of the correct 13

  Fix

  Six lines need to change. In each case, the right-hand side should reference the same variable being assigned to (subsetting it in place), not df_mash_square:

  Cell 14:
  df_mash_corr_dist = df_mash_corr_dist.loc[scrubbed_strains, scrubbed_strains]

  Cell 20:
  df_mash_corr_dist = df_mash_corr_dist.loc[good_strains, good_strains]

  Cell 23:
  df_mash_corr_complete      = df_mash_corr.loc[complete_seqs, complete_seqs]
  df_mash_corr_dist_complete = df_mash_corr_dist.loc[complete_seqs, complete_seqs]

  Cells 32 and 34 (both contain the same line):
  df_mash_corr_complete = remove_bad_strains(df_mash_corr_complete, bad_genomes_list)

  After fixing, the pipeline needs to be re-run from 2b onwards since all downstream outputs (2c through 5f) were computed on the reduced genome set.

  Confirmed bugs

  ┌──────┬──────────────────────────────────────────────────────┬────────────────────────────────────────────────┬──────────┐
  │ Cell │                      Code (RHS)                      │                   Should be                    │ Verified │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 14   │ df_mash_square.loc[scrubbed_strains, ...]            │ df_mash_corr_dist.loc[...]                     │ Yes      │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 20   │ df_mash_square.loc[good_strains, ...]                │ df_mash_corr_dist.loc[...]                     │ Yes      │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 23   │ df_mash_corr_complete = df_mash_square.loc[...]      │ df_mash_corr.loc[...]                          │ Yes      │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 23   │ df_mash_corr_dist_complete = df_mash_square.loc[...] │ df_mash_corr_dist.loc[...]                     │ Yes      │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 32   │ remove_bad_strains(df_mash_square_complete, ...)     │ remove_bad_strains(df_mash_corr_complete, ...) │ Yes      │
  ├──────┼──────────────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────┤
  │ 34   │ remove_bad_strains(df_mash_square_complete, ...)     │ remove_bad_strains(df_mash_corr_complete, ...) │ Yes      │
  └──────┴──────────────────────────────────────────────────────┴────────────────────────────────────────────────┴──────────┘

  Confirmed mechanism

  I confirmed that both sensitivity_analysis() and cluster_corr_dist() in pyphylon/mash.py compute the fcluster cutoff as thresh * dist.max(). With raw Mash distances (~0.045 max) vs correlation distances (~1.3 max), the effective
  thresholds differ by ~30x, which would indeed produce drastically more fragmented clusters.

  Proposed fixes are correct

  All six fix lines correctly subset each variable from itself rather than from df_mash_square.

  One caveat on the impact numbers

  The committed notebook has small_clst_limit = 0 (cell 29), which means the small-cluster removal loop is effectively a no-op — no genomes get pruned regardless of the distance metric. The specific claim of "373/468 vs 465/468"
  and "95 extra removed genomes" would only apply if small_clst_limit was set to a positive value during the actual run.

  However, even with small_clst_limit = 0, the bug still affects:
  - The sensitivity analysis elbow detection (cell 24)
  - All cluster assignments (cells 26, 36)
  - The saved df_mash_corr_dist.csv output (cell 46) — which contains raw Mash distances, not correlation distances

  So the bug identification and fixes are fully correct. The downstream cascade analysis is directionally correct but the exact numbers depend on what small_clst_limit was actually used in production.

In 6 places across cells 14, 20, 23, 32, and 34, df_mash_corr_dist
(and related _complete variants) were incorrectly assigned from
df_mash_square instead of from themselves. This caused correlation
distance matrices to be replaced with raw Mash distances, changing
the effective fcluster thresholds by ~30x and producing fragmented
clustering results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant