harmonypy

harmonypy is a Python package for the Harmony algorithm for integrating multiple high-dimensional datasets. It uses a C++ backend (Armadillo) for fast linear algebra, matching the R harmony2 package step-by-step.

This animation shows Harmony aligning three single-cell RNA-seq datasets from different donors. Before Harmony, you can clearly distinguish cells from each of the three donors. After Harmony, the cells from different donors are mixed while preserving the overall shape of the data. See "How to make this animation" for details.

Installation

Install from PyPI (pre-built wheels for Linux and macOS):

pip install harmonypy

Building from source

Building from source requires a C++ compiler, CMake, and a BLAS library:

macOS (uses Apple Accelerate, no extra dependencies):

pip install .

Linux (requires OpenBLAS):

# Debian/Ubuntu
sudo apt install libopenblas-dev cmake

# RHEL/Fedora
sudo dnf install openblas-devel cmake

pip install .

Quick Start

import harmonypy as hm
import pandas as pd

# Load the principal components and metadata
pcs = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep="\t")
meta = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep="\t")

# Run Harmony to correct for batch effects (donor)
harmony_out = hm.run_harmony(pcs, meta, "donor")

# Save corrected PCs (same shape as input)
result = pd.DataFrame(harmony_out.Z_corr, columns=pcs.columns)
result.to_csv("pbmc_3500_pcs_harmony.tsv", sep="\t", index=False)

Usage with Scanpy

import scanpy as sc
import harmonypy as hm

# Load and preprocess your data
adata = sc.read_h5ad("my_data.h5ad")
sc.pp.pca(adata)

# Get PCs from the AnnData object
pcs = adata.obsm['X_pca']
print(pcs.shape)  # (n_cells, n_pcs)

# Run Harmony on the PCA embedding
harmony_out = hm.run_harmony(pcs, adata.obs, "batch")

# Store corrected PCs back in the AnnData object
adata.obsm['X_pca_harmony'] = harmony_out.Z_corr

# Use harmonized PCs for downstream analysis
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.umap(adata)
sc.tl.leiden(adata)

Parameters

run_harmony accepts the same parameters as the R package:

| Parameter | Default | Description |
| --- | --- | --- |
| theta | 2 | Diversity penalty per batch variable |
| sigma | 0.1 | Kernel bandwidth for soft clustering |
| nclust | min(N/30, 100) | Number of clusters |
| max_iter_harmony | 10 | Maximum Harmony iterations |
| max_iter_kmeans | 4 | K-means iterations per Harmony round |
| epsilon_harmony | 1e-2 | Convergence threshold |
| ncores | 0 | BLAS threads (0 = all cores) |
| lamb | None | Ridge penalty (None = auto-estimate) |

The ncores parameter controls BLAS threading (Accelerate on macOS, OpenBLAS on Linux). Default is 0 (use all available cores). Set ncores=1 for single-threaded execution.
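
To make the nclust and ncores defaults concrete, here is a minimal sketch in plain Python. Both helpers (default_nclust, resolve_ncores) are hypothetical names for illustration and are not part of harmonypy. Rounding N/30 to the nearest integer and using os.cpu_count() as "all cores" are assumptions of this sketch; the package may resolve them differently.

```python
import os


def default_nclust(n_cells: int) -> int:
    """Default cluster count from the table: min(N/30, 100).

    Rounding N/30 to the nearest integer is an assumption of this sketch.
    """
    return int(min(round(n_cells / 30), 100))


def resolve_ncores(ncores: int = 0) -> int:
    """Hypothetical helper: treat ncores=0 as 'use every available core'."""
    if ncores == 0:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return ncores


for n_cells in (600, 1500, 3500):
    print(n_cells, default_nclust(n_cells))
```

Under these assumptions, a dataset of 3,500 cells already reaches the cap of 100 clusters, so nclust only varies with N for datasets below roughly 3,000 cells.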

Performance

Running the script in tests/test_harmony.py on an Apple M1 (2022) chip reports:

  Dataset                    Time    RSS delta
  ---------------------- -------- ------------
  Small (3.5k cells)        0.23s     45.2 MB
  Medium (69k cells)        4.76s    262.3 MB
  Large (858k cells)       29.29s   1969.5 MB
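
How tests/test_harmony.py collects these numbers is not shown here. As a rough sketch, wall-clock time and growth in peak resident set size (RSS) for a single call could be measured with the standard library as follows. The measure and peak_rss_mb helpers are hypothetical, the workload is a stand-in, and the resource module is only available on Unix-like systems.

```python
import resource
import sys
import time


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_rss_growth_mb) for one call."""
    rss_before = peak_rss_mb()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, peak_rss_mb() - rss_before


# Stand-in workload; swap in a call such as hm.run_harmony(pcs, meta, "donor").
result, elapsed, rss_delta = measure(lambda n: [i * i for i in range(n)], 1_000_000)
print(f"{elapsed:.2f}s  {rss_delta:.1f} MB")
```

Because this tracks peak RSS, the reported delta is zero whenever the call stays below a peak the process had already reached, so each dataset is best measured in a fresh process.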

Citation

If you use Harmony in your work, please cite the original paper:

Korsunsky, I., Millard, N., Fan, J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019). https://doi.org/10.1038/s41592-019-0619-0

The Supplementary Information PDF provides detailed mathematical descriptions and implementation notes.

To learn more about Harmony 2, please see the preprint here:

Patikas, Nikolaos, Hongcheng Yao, Roopa Madhu, Soumya Raychaudhuri, Martin Hemberg, and Ilya Korsunsky. 2026. Integration of Large, Complex Single-Cell Datasets with Harmony2. bioRxiv. https://doi.org/10.64898/2026.03.16.711825