dcrntn/heimdall

🧠 Project HEIMDALL: GBM Pathogenomics Early Warning System

⚠️ EDUCATIONAL DISCLAIMER (Updated March 2026): This is a home-based learning project. The algorithms, mappings, and outputs of Project HEIMDALL are for educational and research-demonstration purposes only. This software is NOT a medical diagnostic tool and has not been validated for clinical use. It should NEVER be used to inform surgical decisions, biopsy actions, or any form of patient diagnosis.


📌 Project Status: PAUSED

No statistically meaningful findings were produced in this iteration. After completing the full pipeline — morphology extraction, proteome-wide sweep, transcriptome-wide sweep, and multi-omics Ridge fusion — no robust correlations between nuclear morphology features and molecular markers survived rigorous statistical filtering (Spearman ρ, Benjamini-Hochberg FDR < 0.10).

This does not necessarily mean the underlying hypothesis is wrong. It more likely reflects the constraints described below. The project is being paused rather than abandoned.

Why no findings — honest assessment

  • Fast-paced, unfocused development. This pipeline was built and iterated quickly across multiple sessions without a stable experimental design. Scientific discovery at this scale requires sustained, methodical attention that this sprint-style approach did not provide.
  • Small cohort. The CPTAC-GBM dataset provides a limited number of patients with all three data modalities (SM imaging, proteomics, RNAseq) simultaneously available. Low n means low statistical power — real correlations may exist but be undetectable here.
  • Imaging modality pivot mid-project (see section below). The original design was built around MRI radiomics; switching to 2D slide microscopy mid-stream introduced methodological drift that was never fully reconciled.
  • Morphology pipeline sensitivity. The nucleus segmentation and feature extraction pipeline (watershed, HED deconvolution, biological gating) was iteratively improved but never formally validated against ground-truth annotations. Noisy morphology features will suppress real signal.
  • No domain expert input. Pathogenomic research of this kind typically requires a pathologist to validate segmentation quality and select biologically meaningful feature subsets. Doing this solo and computationally introduces unchecked assumptions.
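To make the small-cohort point concrete, here is a rough sketch of how the smallest detectable correlation grows as the cohort shrinks. It uses the Fisher z approximation, which is derived for Pearson r and is only a rough guide for Spearman ρ. The function name and the cohort sizes are illustrative assumptions, not values from this project, and the calculation is for a single test, so a proteome-wide sweep with FDR correction would need even stronger correlations.

```python
import math
from statistics import NormalDist


def min_detectable_r(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |correlation| detectable at the given power (two-sided, one test).

    Fisher z approximation: atanh(r) has standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("need n > 3 for the Fisher z approximation")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))


# Cohort sizes are hypothetical, chosen only to show the trend.
for n in (20, 30, 50, 100):
    print(f"n={n:>3}: |r| >= {min_detectable_r(n):.2f}")
```

Under these assumptions, a cohort of about 30 patients can only reliably detect correlations of roughly 0.5 or stronger before any multiple-testing correction is applied.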

To determine whether this approach is genuinely viable, the project would need a larger multi-site cohort, morphology features validated with pathologist sign-off, a pre-registered analysis plan, and sustained, focused time rather than sprint development.


🔄 Original Concept vs What Was Actually Built

Original plan: Radiogenomics

The project was conceived as a radiogenomics study — correlating MRI imaging features with molecular data. MRI provides 3D volumetric texture, perfusion, and diffusion features that have established literature links to GBM molecular subtypes (IDH status, MGMT methylation, etc.).

Why it changed: MRI data is not publicly accessible

CPTAC-GBM MRI files are held in a restricted-access repository and require a formal data access request through the NCI. The request process was not pursued, so MRI data was unavailable.

What replaced it: Slide Microscopy (Pathogenomics)

The CPTAC-GBM IDC repository contains Slide Microscopy (SM) DICOM files that are publicly accessible. The project was pivoted to extract nuclear morphology features from these 2D H&E-stained whole-slide images instead.

This is a meaningfully different scientific question. Radiogenomics asks: "Does what the tumour looks like on a scanner reflect its molecular state?" Pathogenomics asks: "Does nuclear architecture under a microscope reflect its molecular state?" Both are legitimate research directions but require different validation frameworks, feature engineering approaches, and reference literature. The pivot was made pragmatically rather than scientifically, which is a limitation.


🛡️ The Legend of Heimdall

In Norse mythology, Heimdall is the Watcher of the Gods who guards the Bifröst — the bridge connecting different realms. This project acts as a "Watcher" over multi-omic data, attempting to signal when cellular-level microscopy patterns might bridge to systemic molecular state. The name remains, even if the bridge is still under construction.


🛠️ Environments

Three isolated Conda environments separate concerns cleanly:

| Environment | Role | Key Libraries |
| --- | --- | --- |
| heimdall_base.yml | Data Fetch | cptac, idc-index (v1.2+) |
| heimdall_pixels.yml | Image Processing | OpenCV, pydicom (v3.0+), scikit-image, scipy |
| heimdall_brain.yml | ML & Analysis | scikit-learn, statsmodels, seaborn, XGBoost |

# Build all three environments from project root
conda env create -f environments/heimdall_base.yml
conda env create -f environments/heimdall_pixels.yml
conda env create -f environments/heimdall_brain.yml

🚀 Execution Pipeline

Phase I — Data Fetch (heimdall_base)

Connects to the CPTAC Discovery Cohort API to retrieve synchronised proteomics, RNAseq, and SM DICOM datasets.

conda activate heimdall_base && python scripts/base/heimdall_fetch.py

Phase II — Morphology Extraction (heimdall_pixels)

Script: scripts/pixels/extract_morphology.py

Processes SM DICOM patches per patient. For each patch the pipeline runs:

  1. HED colour deconvolution — isolates the hematoxylin (nuclear) channel
  2. Gaussian denoising → Otsu thresholding → binary closing: produces a clean binary mask
  3. Distance-transform watershed — splits touching/clumped nuclei that naive thresholding treats as one object
  4. Biological gating — filters by area (MIN_AREA/MAX_AREA) and circularity (MIN_CIRC)
  5. Feature extraction — per-nucleus: area, perimeter, circularity, eccentricity, solidity, aspect ratio, mean intensity, entropy
  6. Patient aggregation — patch-level features are averaged to produce one row per patient
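
Steps 1 to 5 above might be sketched as follows with scikit-image and SciPy. This is a minimal illustration, not the code in extract_morphology.py: the function name, the sigma, the 3×3 closing element, and the gate values are all assumptions, and only a subset of the listed features (no aspect ratio or entropy) is computed.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.color import rgb2hed
from skimage.feature import peak_local_max
from skimage.filters import gaussian, threshold_otsu
from skimage.measure import regionprops
from skimage.segmentation import watershed

# Illustrative gates only (assumed values, in pixels); tune per magnification.
MIN_AREA, MAX_AREA, MIN_CIRC = 30, 2000, 0.4


def extract_nuclei(rgb_patch, min_distance=5):
    """Return one feature dict per gated nucleus in an RGB H&E patch."""
    # 1. HED colour deconvolution: channel 0 is hematoxylin.
    hema = rgb2hed(rgb_patch)[..., 0]

    # 2. Gaussian denoising -> Otsu threshold -> binary closing.
    smooth = gaussian(hema, sigma=1.0)
    mask = smooth > threshold_otsu(smooth)
    mask = ndi.binary_closing(mask, structure=np.ones((3, 3), dtype=bool))

    # 3. Distance-transform watershed: one marker per distance peak.
    distance = ndi.distance_transform_edt(mask)
    components, _ = ndi.label(mask)
    peaks = peak_local_max(distance, min_distance=min_distance, labels=components)
    markers = np.zeros(mask.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = watershed(-distance, markers, mask=mask)

    # 4-5. Gate each region by area and circularity, then record features.
    nuclei = []
    for region in regionprops(labels):
        area, perimeter = float(region.area), float(region.perimeter)
        if perimeter == 0:
            continue
        circularity = 4.0 * np.pi * area / perimeter**2
        if not (MIN_AREA <= area <= MAX_AREA) or circularity < MIN_CIRC:
            continue
        nuclei.append({
            "area": area,
            "perimeter": perimeter,
            "circularity": circularity,
            "eccentricity": float(region.eccentricity),
            "solidity": float(region.solidity),
            "mean_intensity": float(hema[tuple(region.coords.T)].mean()),
        })
    return nuclei
```

Circularity here is 4πA/P², which is close to 1 for a round nucleus and falls towards 0 for elongated or ragged objects. Because the perimeter is a pixel-based estimate, small objects can score slightly above 1.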

Output: data/processed/morphology_features.csv

⚠️ Calibration required. Gating parameters (MIN_AREA, MAX_AREA, MIN_CIRC) and the watershed min_distance must be tuned to your cohort's magnification (20x vs 40x). Review validation plots in data/processed/validation/plots/ — red overlays should align with nuclei, not artefacts or stroma.
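
One way to keep the gates comparable across magnifications is to define them in physical units and convert to pixels from the slide's resolution. The sketch below assumes that approach. The helper names, the physical gate values, and the microns-per-pixel figures (roughly 0.25 for 40x and 0.5 for 20x on many scanners) are illustrative assumptions, so read the real pixel spacing from your slide metadata.

```python
def area_gate_in_pixels(area_um2: float, microns_per_pixel: float) -> int:
    """Convert an area gate in square microns to a pixel count."""
    return round(area_um2 / microns_per_pixel**2)


def min_distance_in_pixels(nucleus_diameter_um: float, microns_per_pixel: float) -> int:
    """Watershed peak spacing set to about one nuclear radius, in pixels."""
    return max(1, round(0.5 * nucleus_diameter_um / microns_per_pixel))


# Hypothetical physical gates: 20 to 300 square microns, 8 micron nuclei.
for label, mpp in (("40x", 0.25), ("20x", 0.5)):
    print(
        label,
        "MIN_AREA =", area_gate_in_pixels(20, mpp),
        "MAX_AREA =", area_gate_in_pixels(300, mpp),
        "min_distance =", min_distance_in_pixels(8, mpp),
    )
```

Area gates scale with the square of the resolution, so halving the magnification cuts the pixel gates by a factor of four, while min_distance only halves. Circularity is a ratio and needs no rescaling.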

conda activate heimdall_pixels && python scripts/pixels/extract_morphology.py

Phase III — Multi-Omics Discovery Sweep (heimdall_brain)

Script: scripts/multiomics_sweep.py

Runs a proteome-wide and transcriptome-wide correlation sweep independently, then produces combined outputs.

Per omics layer (proteins and genes separately):

  • Presence filter: drops features with fewer than MIN_PATIENTS non-NaN observations
  • Variance filter: drops near-constant features (CV < 0.01)
  • Median imputation on remaining features
  • Spearman correlation (rank-based, robust to outliers and non-normality) for every (omics feature × morphology feature) pair
  • Benjamini-Hochberg FDR correction across all tests in that layer
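
The per-layer steps above could look roughly like this. It is a sketch under stated assumptions rather than the contents of multiomics_sweep.py: the function names, column names, and the MIN_PATIENTS value are hypothetical, both inputs are assumed to be patient-indexed DataFrames, and the morphology table is assumed to have no missing values. The Benjamini-Hochberg step is written out in NumPy for transparency. statsmodels' multipletests with method="fdr_bh" should give the same adjusted values.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

MIN_PATIENTS = 10  # hypothetical threshold
CV_FLOOR = 0.01    # variance filter from the list above
FDR_ALPHA = 0.10   # FDR threshold quoted in the project status


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q


def sweep_layer(omics: pd.DataFrame, morph: pd.DataFrame) -> pd.DataFrame:
    """Correlate every omics feature with every morphology feature."""
    omics, morph = omics.align(morph, join="inner", axis=0)
    # Presence filter, then variance filter (coefficient of variation).
    omics = omics.loc[:, omics.notna().sum() >= MIN_PATIENTS]
    cv = omics.std() / omics.mean().abs()
    omics = omics.loc[:, cv >= CV_FLOOR]
    # Median imputation on the surviving features.
    omics = omics.fillna(omics.median())

    rows = []
    for feat in omics.columns:
        for m in morph.columns:
            rho, p = spearmanr(omics[feat], morph[m])
            rows.append({"omics_feature": feat, "morph_feature": m, "rho": rho, "p": p})
    res = pd.DataFrame(rows)
    # One FDR correction across all tests in this layer.
    res["q"] = bh_adjust(res["p"].to_numpy())
    res["significant"] = res["q"] < FDR_ALPHA
    return res.sort_values("q").reset_index(drop=True)
```

Calling sweep_layer once on the proteomics table and once on the RNAseq table keeps the FDR correction separate per layer, as described above.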

Outputs:

  • data/processed/proteome_discovery_results.csv
  • data/processed/transcriptome_discovery_results.csv
  • data/processed/multiomics_combined_results.csv
  • Volcano plots, per-layer heatmaps, and a combined overview plot, all saved to data/processed/validation/plots/

conda activate heimdall_brain && python scripts/multiomics_sweep.py

Phase IV — Ridge Fusion Model (heimdall_brain)

Script: scripts/triple_omics_fusion.py

Takes the top morphology-correlated targets (3 genes + 3 proteins, selected via Spearman + FDR) and trains a Leave-One-Out cross-validated RidgeCV model to predict each molecular target purely from nuclear morphology features.

Design decisions:

  • Target selection uses Spearman + BH-FDR, not raw Pearson sorting, to avoid cherry-picking inflated by multiple comparisons
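  • RidgeCV selects alpha internally with its default efficient leave-one-out scheme (a form of Generalised Cross-Validation, GCV), which is computed in closed form for Ridge. No explicit nested LOO loop is needed, and alpha is tuned only on each outer fold's training patients, so the held-out patient does not influence it.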
  • Alpha search space: 50 log-spaced values from 1e-3 to 1e4 (wider than typical defaults)
  • Morphology predictors are filtered by coefficient of variation (CV) before entering the model, which removes near-constant features
  • Per-target output: actual vs predicted scatter plot + morphology feature coefficient bar chart (mean ± SD across LOO folds)
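
A minimal sketch of the outer leave-one-out loop with an inner RidgeCV, assuming a morphology matrix X (patients × features) and one molecular target y. The function name and the returned keys are hypothetical, and the StandardScaler step is an assumption added here so that a single alpha treats all features comparably. It is not confirmed to be part of triple_omics_fusion.py.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 50 log-spaced alphas from 1e-3 to 1e4, as in the design decisions above.
ALPHAS = np.logspace(-3, 4, 50)


def loo_ridge(X, y):
    """Leave-one-out evaluation of RidgeCV for a single molecular target."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    coefs, alphas = [], []
    for train, test in LeaveOneOut().split(X):
        # RidgeCV with cv=None (the default) picks alpha by efficient LOO
        # on the training fold only.
        model = make_pipeline(StandardScaler(), RidgeCV(alphas=ALPHAS))
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])
        ridge = model[-1]
        coefs.append(ridge.coef_)
        alphas.append(ridge.alpha_)
    coefs = np.vstack(coefs)
    return {
        "r2": r2_score(y, preds),
        "mae": mean_absolute_error(y, preds),
        "mean_alpha": float(np.mean(alphas)),
        "n": len(y),
        "coef_mean": coefs.mean(axis=0),  # bar heights in the coefficient chart
        "coef_sd": coefs.std(axis=0),     # error bars across LOO folds
        "predicted": preds,
    }
```

R² is computed once over all held-out predictions, since a per-fold R² is undefined with a single test sample. One caveat worth keeping in mind: if the targets are chosen using all patients before this loop runs, the held-out scores will tend to be optimistic. Repeating the selection inside each fold would avoid that.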

Outputs:

  • data/processed/fusion_leaderboard.csv — R², MAE, mean alpha, n per target
  • Per-target diagnostic plots and a leaderboard bar chart in data/processed/validation/plots/

conda activate heimdall_brain && python scripts/triple_omics_fusion.py

📁 Directory Structure

project_heimdall/
├── data/
│   ├── raw/
│   │   ├── proteomics/       # CPTAC_GBM_proteomics.tsv
│   │   └── genomics/         # CPTAC_GBM_RNAseq.tsv
│   └── processed/
│       ├── patches/           # Per-patient SM image patches
│       ├── morphology_features.csv
│       ├── proteome_discovery_results.csv
│       ├── transcriptome_discovery_results.csv
│       ├── multiomics_combined_results.csv
│       ├── fusion_leaderboard.csv
│       └── validation/
│           └── plots/         # All diagnostic and result plots
├── scripts/
│   ├── base/
│   │   └── heimdall_fetch.py
│   ├── brain/
│   │   ├── heimdall_brain.py
│   │   └── heimdall_discovery.py
│   └── pixels/
│       ├── extract_morphology.py
│       └── heimdall_dcm_patch.py
├── environments/
│   ├── heimdall_base.yml
│   ├── heimdall_pixels.yml
│   └── heimdall_brain.yml
└── README.md

📜 Licensing & Data Governance

Software: MIT License.

Data:

About

Pathogenomics pipeline aligning H&E nuclear morphology with proteogenomic signatures (RNA-seq/Mass-Spec) in Glioblastoma
