🧠 Project HEIMDALL: GBM Pathogenomics Early Warning System

⚠️ EDUCATIONAL DISCLAIMER (Updated March 2026): This is a home-based learning project. The algorithms, mappings, and outputs of Project HEIMDALL are for educational and research-demonstration purposes only. This software is NOT a medical diagnostic tool and has not been validated for clinical use. It should NEVER be used to inform surgical decisions, biopsy actions, or any form of patient diagnosis.

📌 Project Status: PAUSED

No statistically meaningful findings were produced in this iteration. After completing the full pipeline — morphology extraction, proteome-wide sweep, transcriptome-wide sweep, and multi-omics Ridge fusion — no robust correlations between nuclear morphology features and molecular markers survived rigorous statistical filtering (Spearman ρ, Benjamini-Hochberg FDR < 0.10).

This does not necessarily mean the underlying hypothesis is wrong. It more likely reflects the constraints described below. The project is being paused rather than abandoned.

Why no findings — honest assessment

Fast-paced, unfocused development. This pipeline was built and iterated quickly across multiple sessions without a stable experimental design. Scientific discovery at this scale requires sustained, methodical attention that this sprint-style approach did not provide.
Small cohort. The CPTAC-GBM dataset provides only a limited number of patients with all three data modalities (slide microscopy (SM) imaging, proteomics, RNAseq) available at once. Low n means low statistical power: real correlations may exist but be undetectable here.
Imaging modality pivot mid-project (see section below). The original design was built around MRI radiomics; switching to 2D slide microscopy mid-stream introduced methodological drift that was never fully reconciled.
Morphology pipeline sensitivity. The nucleus segmentation and feature extraction pipeline (watershed, HED deconvolution, biological gating) was iteratively improved but never formally validated against ground-truth annotations. Noisy morphology features will suppress real signal.
No domain expert input. Pathogenomic research of this kind typically requires a pathologist to validate segmentation quality and select biologically meaningful feature subsets. Doing this solo and computationally introduces unchecked assumptions.
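
To put a rough number on the small-cohort point above, the sketch below estimates how many patients a two-sided test would need to detect a given correlation. It uses the Fisher z approximation, which is derived for Pearson's r and is only a rough guide for Spearman ρ. The function name and the default alpha and power are illustrative assumptions, not values taken from this project.

```python
import math
from statistics import NormalDist


def n_for_correlation(rho, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect correlation `rho` (two-sided).

    Fisher z approximation: n = ((z_alpha + z_power) / atanh(|rho|))^2 + 3.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    effect = math.atanh(abs(rho))  # Fisher z-transform of the target correlation
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


# Weaker correlations need far more patients than stronger ones.
for rho in (0.3, 0.5, 0.7):
    print(f"rho={rho}: n ~ {n_for_correlation(rho)}")
```

Because the sweeps test many feature pairs at once, the per-test threshold that survives FDR correction is much stricter than 0.05, so the cohort needed in practice is larger than this single-test estimate.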

To determine whether this approach is genuinely viable, the project would need a larger multi-site cohort, validated morphology features with pathologist sign-off, a pre-registered analysis plan, and sustained, focused time rather than sprint development.

🔄 Original Concept vs What Was Actually Built

Original plan: Radiogenomics

The project was conceived as a radiogenomics study — correlating MRI imaging features with molecular data. MRI provides 3D volumetric texture, perfusion, and diffusion features that have established literature links to GBM molecular subtypes (IDH status, MGMT methylation, etc.).

Why it changed: MRI data is not publicly accessible

CPTAC-GBM MRI files are held in a restricted-access repository and require a formal data access request through the NCI. The request process was not pursued, so MRI data was unavailable.

What replaced it: Slide Microscopy (Pathogenomics)

The CPTAC-GBM IDC repository contains Slide Microscopy (SM) DICOM files that are publicly accessible. The project was pivoted to extract nuclear morphology features from these 2D H&E-stained whole-slide images instead.

This is a meaningfully different scientific question. Radiogenomics asks: "Does what the tumour looks like on a scanner reflect its molecular state?" Pathogenomics asks: "Does nuclear architecture under a microscope reflect its molecular state?" Both are legitimate research directions but require different validation frameworks, feature engineering approaches, and reference literature. The pivot was made pragmatically rather than scientifically, which is a limitation.

🛡️ The Legend of Heimdall

In Norse mythology, Heimdall is the Watcher of the Gods who guards the Bifröst — the bridge connecting different realms. This project acts as a "Watcher" over multi-omic data, attempting to signal when cellular-level microscopy patterns might bridge to systemic molecular state. The name remains, even if the bridge is still under construction.

🛠️ Environments

Three isolated Conda environments separate concerns cleanly:

| Environment | Role | Key Libraries |
|---|---|---|
| `heimdall_base.yml` | Data Fetch | `cptac`, `idc-index` (v1.2+) |
| `heimdall_pixels.yml` | Image Processing | `OpenCV`, `pydicom` (v3.0+), `scikit-image`, `scipy` |
| `heimdall_brain.yml` | ML & Analysis | `scikit-learn`, `statsmodels`, `seaborn`, `XGBoost` |

# Build all three environments from project root
conda env create -f environments/heimdall_base.yml
conda env create -f environments/heimdall_pixels.yml
conda env create -f environments/heimdall_brain.yml

🚀 Execution Pipeline

Phase I — Data Fetch (`heimdall_base`)

Retrieves matched proteomics and RNAseq tables for the CPTAC-GBM discovery cohort (via `cptac`) and the corresponding SM DICOM files from the IDC (via `idc-index`).

conda activate heimdall_base && python scripts/base/heimdall_fetch.py
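
Downstream phases can only use patients that have all three modalities. A minimal sketch of that matching step is shown below, assuming both omics tables are indexed by patient ID. The helper name `matched_patients` and the `PT-xx` IDs are hypothetical and do not come from the fetch script.

```python
import pandas as pd


def matched_patients(proteomics, rnaseq, sm_patient_ids):
    """Return the sorted patient IDs present in all three modalities."""
    common = set(proteomics.index) & set(rnaseq.index) & set(sm_patient_ids)
    return sorted(common)


# Toy inputs with made-up IDs; real tables would come from the fetch step.
proteomics = pd.DataFrame({"PROT_A": [0.1, 0.2, 0.3]}, index=["PT-01", "PT-02", "PT-03"])
rnaseq = pd.DataFrame({"GENE_A": [5.0, 6.0]}, index=["PT-02", "PT-03"])
sm_ids = ["PT-03", "PT-04", "PT-02"]

keep = matched_patients(proteomics, rnaseq, sm_ids)
print(f"{len(keep)} patients with all three modalities: {keep}")
```

Logging this count early makes the effective cohort size visible before any statistics are run.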

Phase II — Morphology Extraction (`heimdall_pixels`)

Script: scripts/pixels/extract_morphology.py

Processes SM DICOM patches per patient. For each patch the pipeline runs:

HED colour deconvolution — isolates the hematoxylin (nuclear) channel
Gaussian denoising → Otsu thresholding → binary closing — produces a clean binary mask
Distance-transform watershed — splits touching/clumped nuclei that naive thresholding treats as one object
Biological gating — filters by area (MIN_AREA/MAX_AREA) and circularity (MIN_CIRC)
Feature extraction — per-nucleus: area, perimeter, circularity, eccentricity, solidity, aspect ratio, mean intensity, entropy
Patient aggregation — patch-level features are averaged to produce one row per patient
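
The per-patch steps above can be sketched with scikit-image and SciPy as follows. This is an illustration under stated assumptions, not the contents of extract_morphology.py: the gate values, the Gaussian sigma, the closing radius, and the function name `nucleus_features` are placeholders, and only a subset of the listed features is computed.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import color, filters, measure, morphology, segmentation
from skimage.draw import disk
from skimage.feature import peak_local_max

# Placeholder gates for illustration; the project's tuned values are not stated here.
MIN_AREA, MAX_AREA, MIN_CIRC, MIN_DISTANCE = 30, 2000, 0.4, 5


def nucleus_features(rgb_patch):
    """Segment nuclei in an RGB H&E patch; return one feature dict per gated nucleus."""
    # HED colour deconvolution: channel 0 is hematoxylin (nuclear stain).
    hema = color.rgb2hed(rgb_patch)[..., 0]

    # Gaussian denoising, Otsu thresholding, then binary closing.
    smooth = filters.gaussian(hema, sigma=1.0)
    mask = smooth > filters.threshold_otsu(smooth)
    mask = ndi.binary_closing(mask, structure=morphology.disk(2))

    # Distance-transform watershed: one marker per local maximum of the distance map.
    distance = ndi.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=MIN_DISTANCE)
    markers = np.zeros(mask.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = segmentation.watershed(-distance, markers, mask=mask)

    # Biological gating and per-nucleus feature extraction.
    rows = []
    for region in measure.regionprops(labels):
        perimeter = max(region.perimeter, 1e-6)
        circularity = 4 * np.pi * region.area / perimeter**2
        if MIN_AREA <= region.area <= MAX_AREA and circularity >= MIN_CIRC:
            rows.append({
                "area": float(region.area),
                "perimeter": float(perimeter),
                "circularity": float(circularity),
                "eccentricity": float(region.eccentricity),
                "solidity": float(region.solidity),
            })
    return rows


# Synthetic patch: white background with two hematoxylin-coloured discs.
patch = np.full((64, 110, 3), 255, dtype=np.uint8)
for center in [(30, 30), (30, 80)]:
    rr, cc = disk(center, 12)
    patch[rr, cc] = (133, 127, 191)

rows = nucleus_features(patch)
print(len(rows), "nuclei kept")
```

Circularity is computed as 4πA/P², which is 1 for a perfect circle and falls towards 0 for elongated or ragged objects, so a lower bound on it removes most non-nuclear debris.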

Output: data/processed/morphology_features.csv

⚠️ Calibration required. Gating parameters (MIN_AREA, MAX_AREA, MIN_CIRC) and the watershed min_distance must be tuned to your cohort's magnification (20x vs 40x). Review validation plots in data/processed/validation/plots/ — red overlays should align with nuclei, not artefacts or stroma.
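
As a starting point for that tuning, pixel-based gates can be rescaled geometrically between magnifications: lengths scale linearly with magnification and areas quadratically, while circularity is dimensionless. The sketch below assumes magnification is a fair proxy for pixel spacing, which may not hold across scanners. The helper name and the example values are hypothetical, and the result still needs checking against the validation plots.

```python
def rescale_gates(min_area, max_area, min_distance, from_mag, to_mag):
    """Rescale pixel-based gating parameters from one magnification to another."""
    scale = to_mag / from_mag
    return {
        "MIN_AREA": min_area * scale**2,                        # areas scale quadratically
        "MAX_AREA": max_area * scale**2,
        "min_distance": max(1, round(min_distance * scale)),   # lengths scale linearly
        # MIN_CIRC is scale-invariant and needs no rescaling.
    }


# Example: gates chosen at 40x, converted for 20x slides (values are illustrative).
gates_20x = rescale_gates(min_area=120, max_area=8000, min_distance=10, from_mag=40, to_mag=20)
print(gates_20x)
```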

conda activate heimdall_pixels && python scripts/pixels/extract_morphology.py

Phase III — Multi-Omics Discovery Sweep (`heimdall_brain`)

Script: scripts/multiomics_sweep.py

Runs a proteome-wide and transcriptome-wide correlation sweep independently, then produces combined outputs.

Per omics layer (proteins and genes separately):

Presence filter: drops features with fewer than MIN_PATIENTS non-NaN observations
Variance filter: drops near-constant features (coefficient of variation, CV, below 0.01)
Median imputation on remaining features
Spearman correlation (rank-based, robust to outliers and non-normality) for every (omics feature × morphology feature) pair
Benjamini-Hochberg FDR correction across all tests in that layer
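
The five per-layer steps above can be sketched as a single function. This is a simplified illustration rather than the code in multiomics_sweep.py: the `MIN_PATIENTS` value, the column names, and the synthetic data are assumptions, and both input tables are assumed to be indexed by patient ID.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

MIN_PATIENTS = 10     # placeholder; the project's value is not stated here
CV_THRESHOLD = 0.01


def correlation_sweep(omics, morph, fdr=0.10):
    """Spearman sweep of every omics feature against every morphology feature."""
    shared = omics.index.intersection(morph.index)
    omics, morph = omics.loc[shared], morph.loc[shared]

    # Presence filter, variance (CV) filter, then median imputation.
    omics = omics.loc[:, omics.notna().sum() >= MIN_PATIENTS]
    cv = omics.std() / omics.mean().abs()
    omics = omics.loc[:, cv >= CV_THRESHOLD]
    omics = omics.fillna(omics.median())

    rows = []
    for o in omics.columns:
        for m in morph.columns:
            rho, p = spearmanr(omics[o], morph[m])
            rows.append({"omics_feature": o, "morph_feature": m, "rho": rho, "p_value": p})
    results = pd.DataFrame(rows)

    # Benjamini-Hochberg correction across all tests in this layer.
    reject, q, _, _ = multipletests(results["p_value"], alpha=fdr, method="fdr_bh")
    results["q_value"] = q
    results["significant"] = reject
    return results.sort_values("q_value").reset_index(drop=True)


# Synthetic check: one planted signal, twenty noise features, one constant feature.
rng = np.random.default_rng(0)
patients = [f"PT-{i:02d}" for i in range(40)]
morph = pd.DataFrame({"mean_area": rng.normal(100, 20, 40)}, index=patients)
omics = pd.DataFrame(rng.normal(10, 1, (40, 20)), index=patients,
                     columns=[f"NOISE_{i}" for i in range(20)])
omics["SIGNAL"] = morph["mean_area"] + rng.normal(0, 1, 40)
omics["FLAT"] = 5.0

results = correlation_sweep(omics, morph)
print(results.head(3))
```

One caveat with a CV-based filter: for data centred near zero (for example log-ratio abundances) the mean in the denominator makes CV unstable, so a plain variance threshold may be the safer choice there.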

Outputs:

data/processed/proteome_discovery_results.csv
data/processed/transcriptome_discovery_results.csv
data/processed/multiomics_combined_results.csv
Volcano plots, per-layer heatmaps, combined overview plot — all saved to data/processed/validation/plots/

conda activate heimdall_brain && python scripts/multiomics_sweep.py

Phase IV — Ridge Fusion Model (`heimdall_brain`)

Script: scripts/triple_omics_fusion.py

Takes the top morphology-correlated targets (3 genes + 3 proteins, selected via Spearman + FDR) and trains a Leave-One-Out cross-validated RidgeCV model to predict each molecular target purely from nuclear morphology features.

Design decisions:

Target selection uses Spearman + BH-FDR, not raw Pearson sorting, to avoid cherry-picking inflated by multiple comparisons
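RidgeCV with its default `cv=None` selects alpha by an efficient closed-form leave-one-out scheme (scikit-learn exposes this through `gcv_mode`), so no explicit nested LOO loop is needed and alpha is chosen from training-fold samples only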
Alpha search space: 50 log-uniform values from 1e-3 to 1e4 (wider than typical defaults)
Morphology predictors are filtered by coefficient of variation (not cross-validation) before entering the model
Per-target output: actual vs predicted scatter plot + morphology feature coefficient bar chart (mean ± SD across LOO folds)
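
A minimal sketch of the outer LOO loop around RidgeCV, using the alpha grid described above, is shown below. The `StandardScaler` step, the function name `loo_ridge`, and the synthetic data are assumptions added for illustration and may differ from triple_omics_fusion.py.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ALPHAS = np.logspace(-3, 4, 50)  # 50 log-spaced values from 1e-3 to 1e4


def loo_ridge(X, y):
    """Leave-one-out evaluation of RidgeCV; returns leaderboard-style metrics."""
    preds = np.empty(len(y))
    alphas, coefs = [], []
    for train, test in LeaveOneOut().split(X):
        # Scaler and alpha are both fitted on the training fold only.
        model = make_pipeline(StandardScaler(), RidgeCV(alphas=ALPHAS))
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])
        ridge = model[-1]
        alphas.append(ridge.alpha_)
        coefs.append(ridge.coef_)
    coefs = np.array(coefs)
    return {
        "r2": r2_score(y, preds),
        "mae": mean_absolute_error(y, preds),
        "mean_alpha": float(np.mean(alphas)),
        "coef_mean": coefs.mean(axis=0),   # mean coefficient across LOO folds
        "coef_sd": coefs.std(axis=0),      # spread across LOO folds
        "n": len(y),
    }


# Synthetic example: 30 patients, 4 morphology features, a linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=30)

board = loo_ridge(X, y)
print({k: board[k] for k in ("r2", "mae", "mean_alpha", "n")})
```

R² here is computed once over the pooled held-out predictions, since a per-fold R² is undefined when each fold holds a single sample.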

Outputs:

data/processed/fusion_leaderboard.csv — R², MAE, mean alpha, n per target
Per-target diagnostic plots + leaderboard bar chart in data/processed/validation/plots/

conda activate heimdall_brain && python scripts/triple_omics_fusion.py

📁 Directory Structure

project_heimdall/
├── data/
│   ├── raw/
│   │   ├── proteomics/       # CPTAC_GBM_proteomics.tsv
│   │   └── genomics/         # CPTAC_GBM_RNAseq.tsv
│   └── processed/
│       ├── patches/           # Per-patient SM image patches
│       ├── morphology_features.csv
│       ├── proteome_discovery_results.csv
│       ├── transcriptome_discovery_results.csv
│       ├── multiomics_combined_results.csv
│       ├── fusion_leaderboard.csv
│       └── validation/
│           └── plots/         # All diagnostic and result plots
├── scripts/
│   ├── base/
│   │   └── heimdall_fetch.py
│   ├── brain/
│   │   ├── heimdall_brain.py
│   │   └── heimdall_discovery.py
│   └── pixels/
│       ├── extract_morphology.py
│       └── heimdall_dcm_patch.py
├── environments/
│   ├── heimdall_base.yml
│   ├── heimdall_pixels.yml
│   └── heimdall_brain.yml
└── README.md

📜 Licensing & Data Governance

Software: MIT License.

Data:

Proteomic & Genomic data: CPTAC (National Cancer Institute). Usage subject to NCI data use policy.
Slide Microscopy imaging: Sourced from the NCI Imaging Data Commons (IDC). Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
MRI data (not used): Restricted access — requires formal NCI data access request.
