ldsc-rs is a from-scratch Rust reimplementation of Bulik-Sullivan et al.'s
LDSC — the standard tool for SNP-heritability
(h²) and genetic-correlation (rg) estimation from GWAS summary statistics. All
six original subcommands (munge-sumstats, l2, h2, rg, make-annot,
cts-annot) with numerically identical default output and a ~38× speedup
on LD-score computation (1000 Genomes reference, N=2,490). CLI is compatible
with Python LDSC's --l2/--h2/--rg syntax for drop-in replacement in
pipelines like NCI LDlink.
LDSC computes per-SNP LD scores ℓ_j = Σ_k r²_{jk} (the sum of squared
correlations between SNP j and other SNPs in a window), then regresses GWAS
chi-squares on those LD scores. The slope is proportional to SNP heritability;
two GWAS regressed jointly recover their genetic correlation. The compute
bottleneck is the LD-score step at biobank scale: a custom 50K-individual
reference panel takes 109 minutes in Python LDSC (measured on r6a.8xlarge,
32 vCPU, 240 GiB) and ~6 minutes in Rust --fast-f32 exact mode (measured
358 s on c6a.4xlarge); the standard CountSketch shortcut cuts the time further
but introduces systematic bias into downstream h².
-
Drop-in replacement at the default. Output is numerically identical to Python LDSC when invoked with
--python-compat(max_abs_diff = 0across 18,627 chr22 1000G EUR SNPs verified bit-identical). The h² regression is bit-identical to Python on identical LD-score inputs. ~38× faster than Python on LD-score computation at 1000G; ~18× faster at biobank (Rust--snp-level-masking --fast-f32vs measured Python); ~9× on h²/rg. -
Principled CountSketch (
--sketch d). The fused BED-decode → normalize → scatter-add kernel eliminates the N×c intermediate matrix entirely. Bias is corrected by inverting the renormalized-cosine bias(1−r²)(1−2r²)/dexactly (residual O(1/d²)) — derivation and Monte-Carlo validation indocs/countsketch-math-analysis.md. At biobank N=50K, sketch is 17–24× faster than--fast-f32exact. -
--sketch 1000 --snp-level-maskingis the universal sweet spot. The N×d sweep across N ∈ {503, 10K, 20K, 50K, 100K} (seepreprint/data/dn_sweep_full.csvand Figure 7 of the preprint) shows the accuracy curve is essentially N-independent: a user picking d does not need to know the panel size in advance. At d=1000 with masking on, ldsc-rs delivers Pearson r ≥ 0.993 vs exact LD scores and h² within ~0.003 of per-SNP exact across every panel size tested. Wall time is <15 s even at N=100,000. The masking flag toggles which h² truth cluster is reproduced:--sketch 1000(plain) → matches Python LDSC h² (chunked-window algorithm, the canonical implementation everyone has used since 2014).--sketch 1000 --snp-level-masking→ matches per-SNP exact h² (the LDSC paper's mathematical definition of ℓ_j = Σ_k r²_jk).
For strict per-SNP accuracy (partitioned heritability, fine-mapping), bump to
--sketch 5000for Pearson r ≥ 0.999 at ~30 s wall.And with
--snp-level-maskingon, sketch+mask is ~12% more accurate than Python LDSC at biobank: Python's chunked-window approximation has systematic 10–14% h² bias vs the paper's per-SNP exact definition (verified: Python h² = Rust chunk-exact h² to 4 decimals, both with Δ −0.015 / −0.020 / −0.047 vs per-SNP exact). Sketch+mask gets to the paper definition that Python's algorithm structurally can't.Cross-validated against GCTA, Python LDSC, chunk-exact, and per-SNP exact — full validation in
docs/perf-log.md2026-05-12 / 2026-05-13 entries.
Three implementations exist for computing LD scores from a reference panel — producing slightly different LD-score values that all agree on h² to ~1% on standard EUR-on-EUR setups:
| Tool | Windowing | Bias correction | Speed (1000G N=2,490) |
|---|---|---|---|
GCTA --ld-meanrsq |
Block-with-overlap (two-pass averaging) | --ld-score-adj |
C++ baseline |
| Python LDSC | 50-SNP chunked (undocumented approximation) | unbiased r² | 25 min 49 s |
| ldsc-rs | 200-SNP chunked (default) → per-SNP exact (--snp-level-masking) |
unbiased r² + quadratic CountSketch correction | 41 s exact, ~25 s --sketch 200 |
GCTA was the tool the original LDSC paper used to compute the published 1000G
reference scores; Python LDSC is the canonical implementation everyone has
replicated since 2014; ldsc-rs adds biobank-scale CountSketch + the
--sketch d --snp-level-masking combo that reaches the LDSC paper's exact
mathematical definition at sketch speed. For h² and rg regression itself,
ldsc-rs is bit-identical to Python LDSC given identical LD-score inputs.
For h² estimation specifically (rather than LD-score generation), other tools occupy adjacent niches: BOLT-LMM and REGENIE estimate h² directly from individual-level genotypes via mixed models (more powerful, more compute); ldsc-rs and Python LDSC operate on summary statistics only (no genotypes needed at h² time).
Fastest (no Rust required):
docker run --rm ghcr.io/sharifhsn/ldsc:latest --helpLocal install options:
- Prebuilt binaries from GitHub Releases (see Prebuilt Binaries below).
- Cargo install (requires Rust ≥ 1.85):
cargo install ldsc
Two common workflows. Both produce numerically identical h²/rg output to Python LDSC.
Skip LD-score computation by downloading the standard 1000G EUR reference:
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
tar -jxvf eur_w_ld_chr.tar.bz2
bunzip2 w_hm3.snplist.bz2
# Munge sumstats → standardized format
ldsc munge-sumstats --sumstats my_gwas.txt --N 50000 \
--merge-alleles w_hm3.snplist --out my_trait
# h² (single trait) or rg (pair)
ldsc h2 --h2 my_trait.sumstats.gz --ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ --out my_trait_h2
ldsc rg --rg trait1.sumstats.gz,trait2.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ \
--out trait1_vs_trait2For a biobank-scale reference panel (any N up to 100K+), the recommended
high-accuracy fast mode is --sketch 1000 --snp-level-masking. The
empirical N×d sweep shows this puts h² within ~0.003 of per-SNP exact
regardless of N (because accuracy depends on d alone, not on sample
size), at wall-clock <15 s on a laptop:
# Universal sweet spot: <15s on N up to 100K, full 1.66M SNPs
ldsc l2 --bfile my_biobank_panel \
--ld-wind-kb 1000 --sketch 1000 --snp-level-masking --mmap \
--out my_ld_scores
# Then h² with your custom LD scores
ldsc h2 --h2 my_trait.sumstats.gz \
--ref-ld-chr my_ld_scores --w-ld-chr my_ld_scores \
--out my_trait_h2For strict per-SNP accuracy (partitioned heritability, fine-mapping),
substitute --sketch 5000 (~30 s wall, Pearson r ≥ 0.999 vs exact).
For 1000G-scale references (N≈500-2,500), the speed difference between sketch
and exact is small enough that exact is preferable: ldsc l2 --bfile … --out … --ld-wind-kb 1000 --snp-level-masking --fast-f32.
Speedups below are biobank wall-clock (N=50,000, 1.66M SNPs, AWS c6a.4xlarge
unless noted). The Python LDSC baseline is 6,541 s (109 min) measured on
r6a.8xlarge (32 vCPU, 240 GiB) at --chunk-size 200 --ld-wind-kb 1000.
At d=1000, one flag toggles which "truth" the sketch reproduces (verified
across the N×d sweep with masking on — see
preprint/data/dn_sweep_full.csv):
--sketch 1000(plain) → matches Python LDSC h² (chunked-window algorithm)--sketch 1000 --snp-level-masking→ matches per-SNP exact h² (LDSC paper definition)
Both run in <15 s on a laptop up to N=100K — masking is a post-GEMM cutoff scan with <1% overhead.
| Goal | Use | Time @ N=50K | Time @ N=100K | h² vs per-SNP exact |
|---|---|---|---|---|
| Match Python LDSC bit-for-bit (1000G replication) | --python-compat |
— | — | bit-identical |
| Exact LD scores, paper-canonical math | --snp-level-masking --fast-f32 |
292 s | 1,013 s | truth (ref) |
| Universal sweet spot (Pearson r ≥ 0.993 across all N) | --sketch 1000 --snp-level-masking --mmap |
9.7 s | 14.4 s | within ~0.003 |
| Tighter accuracy (r ≥ 0.996) | --sketch 1600 --snp-level-masking --mmap |
12.1 s | 17.5 s | within ~0.002 |
| Strict per-SNP accuracy (r ≥ 0.999) | --sketch 5000 --snp-level-masking --mmap |
27.4 s | 32.1 s | within ~0.002 |
| Fastest, NOT in truth cluster (QC only) | --sketch 200 --snp-level-masking --mmap |
7.1 s | 12.4 s | drifts by 0.006–0.013 |
See Performance for measured timings. The full N×d sweep
(5 N × 8 d cells = 45 measurements with per-cell Pearson r and BMI h²)
lives at preprint/data/dn_sweep_full.csv with the cohort-validated figure
at preprint/figures/fig7_optimal_d.png.
For evaluation: what changes if you switch from Python LDSC to ldsc-rs?
- At the default: identical to Python LDSC for h²/rg with bit-identical
LD scores when you pass
--python-compattol2. - Without
--python-compat: ldsc-rsl2defaults to--chunk-size 200(vs Python's 50) for ~4× faster GEMM. Chunked windowing changes per-SNP LD scores by ~0.2 mean / ~6 max on chr22 EUR (Pearson r=0.9997). Downstream h² impact is <1% on EUR-on-EUR setups, but grows to 10–14% on biobank-scale multi-population references — see the--snp-level-maskingsection. --sketch dadds the CountSketch approximation; combinable with--snp-level-maskingto recover paper-exact h². See Choosing yourl2mode.
Python LDSC consists of three separate scripts; ldsc-rs consolidates them
into subcommands of a single ldsc binary, while still accepting the
original Python flag syntax:
| Python | Rust |
|---|---|
python munge_sumstats.py --sumstats … --out … |
ldsc munge-sumstats --sumstats … --out … |
python ldsc.py --l2 --bfile … --out … |
ldsc l2 --bfile … --out … (or ldsc --l2 …) |
python ldsc.py --h2 … --ref-ld-chr … |
ldsc h2 --h2 … --ref-ld-chr … (or ldsc --h2 …) |
python ldsc.py --rg … --ref-ld-chr … |
ldsc rg --rg … --ref-ld-chr … (or ldsc --rg …) |
python make_annot.py --bimfile … --bed-file … |
ldsc make-annot --bimfile … --bed-file … |
python ldsc.py --cts-bin … |
ldsc cts-annot … |
--mafin l2: default matches Python (MAF prefilter before LD computation).--n-mindefault: when--n-minis 0, matches Python (90th percentile / 1.5).--yes-really: Rust warns when the LD window spans a whole chromosome and--yes-reallyis not set (Python errors).--chunksizein munge: Polars LazyFrame streaming ignores chunk size for munge.--cts-binworkflow: supported directly byldsc l2and via the separateldsc cts-annotpreprocessor.--return-silly-things/--invert-anyway: accepted for CLI parity but no-ops; ldsc-rs never clips results and always uses a least-squares solver (warning emitted).--no-print-annot: only affects--cts-binoutput (suppresses.annot.gz).
If the binary is invoked via a symlink whose filename contains "munge"
(e.g. munge_sumstats.py), it auto-routes to the munge-sumstats
subcommand. Same for --l2 / --h2 / --rg flag-style invocation:
ln -s /usr/local/bin/ldsc /usr/local/bin/munge_sumstats.py
munge_sumstats.py --sumstats file.txt --out munged # worksstdout output matches the Python format for downstream parsing (the "Total Observed scale h2: ...", Lambda GC, Mean Chi^2, Intercept, Ratio lines that pipelines like NCI LDlink parse).
Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.
Input may be plain, .gz, or .bz2, and tab- or whitespace-delimited.
ldsc munge-sumstats \
--sumstats my_gwas.txt.gz \
--out output_prefix \
[--merge-alleles w_hm3.snplist] \
[--signed-sumstats BETA,0] \
[--N 50000] \
[--info-min 0.9] \
[--maf-min 0.01]Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the
null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
--a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
--merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior.
Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).
Computes LD scores from a PLINK binary file set (.bed/.bim/.fam).
Annotation inputs (.annot) may be plain, .gz, or .bz2.
Tip for European GWAS: Pre-computed 1000G phase 3 LD scores are available from the Broad LDSCORE page. Download
eur_w_ld_chr.tar.bz2; aftertar -jxvf, the inner.l2.ldscore.gzfiles are already gzip-compressed and work directly withldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.
ldsc l2 \
--bfile /path/to/1000G_EUR \
--out out/eur \
--ld-wind-cm 1.0 \
[--annot annotations/BaselineLD.] \
[--extract snplist.txt] \
[--maf 0.01] \
[--keep keep_individuals.txt] \
[--per-allele] \
[--pq-exp 1.0]ldsc l2 warns if the LD window spans an entire chromosome; use --yes-really to silence.
Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb
(physical distance), or --ld-wind-snps (fixed flanking SNP count).
Partitioned LD scores with --annot prefix: accepts either a single {prefix}.annot[.gz|.bz2] file or
per-chromosome {prefix}{chr}.annot[.gz|.bz2] files for each chromosome present in the BIM. Outputs one L2 column
per annotation and corresponding .l2.M / .l2.M_5_50 files.
--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to
apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix.
--no-print-annot suppresses the .annot.gz output produced by --cts-bin.
See Speed/accuracy modes below for --sketch, --snp-level-masking,
--fast-f32, --mmap, and --gpu.
Estimates SNP heritability.
LD score inputs may be plain, .gz, or .bz2.
ldsc h2 \
--h2 trait.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results--ref-ld-chr prefix appends the chromosome number then .l2.ldscore.gz. So
--ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz.
If the chromosome number falls in the middle of the filename, use @ as a placeholder:
--ref-ld-chr ld/chr@_scores → ld/chr1_scores.l2.ldscore.gz, etc.
The same convention applies to --w-ld-chr.
You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior);
--w-ld / --w-ld-chr must point to a single fileset.
Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80,
--samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion),
--print-coefficients (partitioned h2: per-annotation τ and enrichment).
Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for
single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes
<out>.results with overlap-aware proportion/enrichment columns.
Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts
format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report
coefficients for all CTS LD score prefixes in each line.
Estimates genetic correlations across all pairs from a list of summary statistic files.
LD score inputs may be plain, .gz, or .bz2.
ldsc rg \
--rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).
Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30,
--samp-prev / --pop-prev (comma-separated, one value per input file).
Generates 0/1 annotation files from a UCSC BED file or a gene set.
# From a BED file:
ldsc make-annot \
--bimfile my_data.bim \
--bed-file regions.bed \
--annot-file output.annot.gz \
--windowsize 100000
# From a gene set:
ldsc make-annot \
--bimfile my_data.bim \
--gene-set-file immune_genes.txt \
--gene-coord-file ENSG_coord.txt \
--annot-file output.annot.gz \
--windowsize 100000Bins one or more continuous annotations into categories and writes a .annot file
compatible with l2 --annot (Python --cts-bin preprocessing).
ldsc cts-annot \
--bimfile my_data.bim \
--cts-bin DAF.txt,DIST.txt \
--cts-breaks 0.1,0.25,0.4x10,100,1000 \
--cts-names DAF,DIST_TO_GENE \
--annot-file cts.annot.gzFive flags affect the speed/accuracy tradeoff in l2. They compose freely.
Run core matmuls in f32 with f64 accumulation. Halves GEMM bandwidth.
- N=2,490 (1000G): ~1.3× faster vs f64 exact
- N=50,000 (biobank): ~1.85× faster vs f64 exact
- Per-SNP LD score deltas vs f64:
max_abs_diff ≈ 0.008 - Downstream h2/rg: identical to f64 to displayed precision
--sketch automatically enables f32; --fast-f32 is only needed for exact mode.
Conceptually, d is the number of "synthetic individuals" we
compress the N real individuals into. CountSketch randomly assigns
each of the N individuals to one of d buckets with a random ±1 sign,
then sums them up. Each bucket aggregates the genotypes of roughly
N/d real individuals with random sign cancellations. The trick: when
we compute pairwise quantities between SNPs (like the inner products
that make up LD scores), the random signs cancel in expectation, and
the sketched inner products are unbiased estimators of the real ones
with per-pair variance ∝ 1/d. Larger d → tighter LD-score estimates;
smaller d → faster downstream linear algebra (all GEMMs are now in d
dimensions, not N).
The decoupling is the key insight: compute cost scales with N (every individual touches one bucket once, an O(N·c) scatter-add pass), but accuracy scales with d alone (the pairwise SNP inner-product variance is 1/d regardless of N). This is why the optimal d turns out to be the same across panel sizes from 1000 Genomes (N=503) all the way to N=100,000 — accuracy depends on the compression ratio's tightness, not the original count.
The actual mechanism: a CountSketch matrix S ∈ ℝ^(d×N) with one ±1
entry per column (hash + sign chosen per individual) acts on the
N×c genotype chunk G to produce S·G ∈ ℝ^(d×c). The fused
decode-normalize-scatter-add kernel does this in a single O(N·c)
pass without materializing G. After projection, columns are
renormalized (ratio estimator) so ‖x̃‖² = N exactly, and the
sketch bias (1−r²)(1−2r²)/d is corrected by quadratic inversion
(residual O(1/d²)). See
docs/countsketch-math-analysis.md
for derivation and Monte-Carlo validation.
LD-score level accuracy as a function of d, measured across N ∈ {503, 10K,
20K, 50K, 100K} on full-genome 1.66M SNPs with --snp-level-masking
(preprint/data/dn_sweep_full.csv). The curves nearly coincide across N,
so the table is approximately N-invariant:
| d | LD-score Pearson r vs per-SNP exact | h² Δ vs per-SNP exact |
|---|---|---|
| 100 | 0.93 | 0.006–0.034 |
| 200 | 0.97 | 0.006–0.013 |
| 500 | 0.99 | 0.000–0.006 |
| 1,000 (recommended) | 0.994 | 0.000–0.003 |
| 1,600 | 0.996 | 0.000–0.002 |
| 5,000 | 0.999 | 0.000–0.002 |
--sketch automatically enables f32 (CountSketch ±1 entries are exactly
representable in f32). Deterministic (seed=42). Requires 3 ≤ d < N
(ldsc-rs errors if out of bounds); warns if d ≤ 50 (Taylor truncation
- sqrt amplification breaks the bias correction). Incompatible with
--gpu. Works with partitioned annotations.
Plain --sketch d carries downstream h² attenuation bias even with the
quadratic correction — the residual error comes not from the sketch itself
but from the chunked-window approximation that the sketch path inherits.
Combine with --snp-level-masking to fix this.
The optimal d depends on N. Two facts to understand:
-
Cost scales with d slowly until the GEMM crossover
d* ≈ √N.T(d) ≈ T_scatter + T_GEMM × d. Belowd*doubling d costs almost nothing; aboved*cost grows linearly. Empirically:d* ≈ 260at 1000G N=2,490;d* ≈ 2,545at biobank N=50K. So you have a wide "free range" of d at biobank scale and a narrow one at 1000G scale. -
Sketch attenuation bias on h² grows with N at fixed d. Sketch per-pair noise variance is ~2/d regardless of N, but at larger N the true LD-score signal is tighter, so the relative noise impact on the regression is larger. Larger N → need larger d for the same h² accuracy.
Recommendations (combine with --snp-level-masking for the h² rows):
| N (reference panel) | Use | Speedup vs exact-f32 |
|---|---|---|
| ~500 – 2,500 (1000G-scale) | --snp-level-masking --fast-f32 (no sketch) |
1.0× (exact is already fast at this N) |
| ~10,000 | --sketch 1000 --snp-level-masking |
~5× (extrapolated) |
| ~50,000 (biobank) | --sketch 1600 --snp-level-masking |
~17× (measured) |
| ~500,000 (UKBB-scale) | --sketch 5000 --snp-level-masking (untested) |
~30-50× (extrapolated) |
The biobank row is the well-tested one (full validation in
docs/perf-log.md 2026-05-12 entry). UKBB-scale d
follows roughly d ≈ √N. For LD-score-only workflows (visualization,
QC — no h² downstream), --sketch 200 is fine at any N: LD-score
Pearson r ≥ 0.92 regardless of N (per the small-N table above; the per-
pair variance bound 2/d is N-independent for normalized columns).
The LDSC paper defines ℓ_j = Σ_k r²_{jk} over an exact per-SNP window. Python LDSC and ldsc-rs both default to a chunked approximation that uses one common window for all SNPs in a chunk (faster GEMM, slightly inflated LD scores at chunk edges).
--snp-level-masking zeroes r² entries outside each SNP's true per-SNP
window after the GEMM. Negligible cost (O(window + chunk) monotonic scan).
The h² impact depends strongly on (LD-reference N, population match):
| Setup | Windowing gap (chunked vs per-SNP exact) |
|---|---|
| 1000G EUR, N=503 (typical h² use case) | 1–2% h² |
| Biobank synthetic, multi-pop, N=50,000 | 10–14% h² |
The gap grows with N because higher-N r² estimates are tighter, so the
"spurious" pairs at chunk boundaries carry real LD signal not noise. See
docs/perf-log.md 2026-05-12 entry for the full
cross-method h² breakdown.
At d=1600, the masking flag toggles which truth cluster the sketch reproduces — both within 0.001 h² across BMI 2010 / BMI 2018 / Height 2018 at biobank scale:
# Match Python LDSC chunked h² (the canonical implementation), 311× faster
ldsc l2 --bfile … --out … --ld-wind-kb 1000 --sketch 1600 --mmap # ~21s
# Match per-SNP exact h² (the LDSC paper's mathematical definition)
ldsc l2 --bfile … --out … --ld-wind-kb 1000 --sketch 1600 --snp-level-masking --mmap # ~21s
# 1000G: smaller speed advantage; exact (~4× slower) is usually preferred
ldsc l2 --bfile … --out … --ld-wind-kb 1000 --snp-level-masking --fast-f32The two flag combinations run at the same speed (masking is a post-GEMM cutoff scan with negligible cost). Pick which truth you want to reproduce:
- Python LDSC h² if you're replicating published results that used Python LDSC's
chunked-window approximation.
--sketch 1600matches within 0.001. - Per-SNP exact h² (LDSC paper's mathematical definition) if you want the most
defensible h² estimate.
--sketch 1600 --snp-level-maskingmatches within 0.001, and is ~12% MORE ACCURATE than Python LDSC at biobank scale (Python's chunked- window approximation has 10–14% h² bias vs the paper definition).
Cross-validated against per-SNP exact, GCTA, chunk-exact, and Python LDSC at biobank N=50K and at 1000G N=503. At d=1000 the combo is slightly faster (~19 s, ~19×) but Height shifts ~0.003 h², still within the regression SE of ~0.008. At 1000G scale (N=503), the d=480 combo reaches GCTA-quality LD scores (Pearson r=0.994 vs per-SNP exact, lowest max-error of any method).
For HPC deployments with GPFS, Lustre, or other networked filesystems,
--mmap uses memory-mapped I/O instead of buffered reads:
- Zero-copy access for the fused CountSketch path (eliminates ~20GB of memcpy at biobank scale)
- OS-managed readahead via
MADV_SEQUENTIAL— GPFS can prefetch across storage nodes in parallel - Async prefetch via
MADV_WILLNEEDon the next chunk — no reader thread contention - No seek invalidation — unlike
BufReader, mmap'd pages stay resident once faulted
On local SSD with warm cache, --mmap regresses ~15% due to page fault overhead.
Use the default (no flag) for local storage. --mmap is designed for networked filesystems.
Build with --features gpu to enable CUDA-accelerated matrix multiplication via
CubeCL. Requires a CUDA toolkit at build time.
cargo build --release --features gpu
ldsc l2 --bfile … --out … --gpu # f32 compute (default)
ldsc l2 --bfile … --out … --gpu --gpu-f64 # native f64 compute
ldsc l2 --bfile … --out … --gpu --gpu-flex32 # half-precision compute, f32 accumulationPrecision options:
--gpu: Default f32 compute--gpu-f64: Native f64 on GPU (slower but numerically exact)--gpu-flex32: Half-precision compute with f32 accumulation (fastest, slight accuracy loss)--gpu-tile-cols N: Split large window matrices into VRAM-fitting tiles
At 1000G scale (n=2,490), GPU is transfer-bound and slower than CPU. GPU acceleration
targets biobank-scale cohorts (n ≥ 50K) where each chunk's GEMM is large enough for
compute to dominate PCIe transfer. Incompatible with --sketch.
Benchmarks on AWS c6a.4xlarge (AMD EPYC 7R13, 16 vCPU). Static musl binary with mimalloc, AVX2+FMA target features. 1000G timings are hyperfine (1 warmup + 3 timed runs); biobank V0.5.0 sketch+masking timings are single-shot AWS Batch on-demand.
| Mode | Wall time | vs Python | Accuracy |
|---|---|---|---|
| Python LDSC | 25 min 49 s | 1.0× | reference |
| Rust f64 (default) | 41.1 s | ~38× | exact (max_abs_diff = 0 on chr22 EUR with --python-compat) |
Rust f32 (--fast-f32) |
~32 s | ~48× | max_abs_diff ≈ 0.008 chr22 EUR |
Rust sketch (--sketch 200) |
25.4 s | ~61× | Pearson r ≈ 0.93 vs chunk-exact |
Rust sketch (--sketch 50) |
15.3 s | ~101× | Pearson r ≈ 0.81 (QC only) |
Python LDSC at biobank measured on r6a.8xlarge (32 vCPU, 240 GiB, same chunk-size 200 + ld-wind-kb 1000 as Rust); Rust modes measured on c6a.4xlarge (16 vCPU, 28 GiB).
Python LDSC and Rust chunked-exact produce bit-identical h² values at biobank (both use the same chunked-window approximation; the algorithm matches, only the language differs). Both diverge from per-SNP exact — the LDSC paper's mathematical definition — by 10–14% h² because of chunked-window bias.
| Mode | Wall time | vs Python | h² |Δ| vs per-SNP exact (BMI 2010 / BMI 2018 / Height 2018) | h² |Δ| vs Python LDSC |
|---|---|---|---|---|
| Python LDSC | 6,541 s (109 min) | 1.0× | 0.015 / 0.020 / 0.047 (chunked-window bias) | 0 (reference) |
--fast-f32 (chunked-exact, ≡ Python h²) |
358 s | 18× | 0.015 / 0.020 / 0.047 | 0 / 0 / 0 (bit-identical to Python) |
--snp-level-masking --fast-f32 (per-SNP exact, "math truth") |
361 s | 18× | 0 (truth) | 0.015 / 0.020 / 0.047 |
--sketch 1600 (matches Python h² within 0.001) |
21 s | 311× | 0.015 / 0.020 / 0.047 | 0.0001 / 0.0001 / 0.0008 |
--sketch 1600 --snp-level-masking (matches per-SNP exact within 0.001) |
21 s | 311× | 0.0002 / 0.0003 / 0.0005 | 0.015 / 0.020 / 0.047 |
--sketch 1000 (chunked) |
18 s | 363× | 0.015 / 0.020 / 0.044 | 0.0004 / 0.0005 / 0.0031 |
--sketch 200 (default, fastest) |
15 s | 436× | 0.013 / 0.013 / 0.009 | 0.002 / 0.008 / 0.038 |
Cross-method h² validated on BMI 2010, BMI 2018, Height 2018 against
per-SNP exact, GCTA, Python LDSC, and chunk-exact references. Full tables
in docs/perf-log.md 2026-05-12 entry.
Fused CountSketch reads packed BED bytes and scatter-adds directly into the d×c sketch buffer, eliminating the N×c intermediate entirely. Cost is O(N×c) independent of d — d=200 has the same scatter time as d=1000; larger d only matters when d² × c starts contributing meaningfully (≈ d=2000+ at biobank). A 28 GB container is recommended for N=50K.
| Window | Python | Rust f64 | Speedup |
|---|---|---|---|
--ld-wind-kb 100 |
44.2 s | 5.1 s | 8.7× |
--ld-wind-kb 500 |
48.4 s | 6.2 s | 7.8× |
--ld-wind-kb 1000 |
53.7 s | 8.6 s | 6.2× |
--ld-wind-kb 2000 |
61.8 s | 12.9 s | 4.8× |
The ring-buffer algorithm keeps memory bounded by the LD window size, not total SNP count.
Scaling with M (fixed N=2,490, exact-f64):
| M (SNPs) | Est. wall time (AWS EPYC 16v) | BED size | Peak memory |
|---|---|---|---|
| 1.66M | 41 s (measured) | 1 GB | ~100 MB |
| 5M | ~124 s | 3 GB | ~100 MB |
| 10M | ~247 s | 6 GB | ~100 MB |
| 50M | ~1,235 s (~21 min) | 30 GB | ~100 MB |
Scaling with N (fixed M=1.66M, full-genome):
| N (individuals) | Mode | Wall time |
|---|---|---|
| 2,490 (1000G) | exact-f64 | 41.1 s (measured) |
| 2,490 (1000G) | --sketch 200 |
25.4 s (measured) |
| 50,000 (biobank) | exact-f32 (--fast-f32 --global-pass) |
361 s (measured) |
| 50,000 (biobank) | --sketch 1600 --snp-level-masking --mmap |
21 s (measured) |
| 50,000 (biobank) | --sketch 200 --mmap |
15 s (measured) |
At biobank N, GEMM cost is O(N × w × c) per chunk; CountSketch reduces this
to O(N×c) scatter-add, independent of d. BED I/O is sequential and
throughput-bound; --fast-f32 halves BED read bytes per chunk.
| Workflow | Rust | Python | Speedup |
|---|---|---|---|
| munge-sumstats | 3.74 s | 62.65 s | 16.75× |
| h2 | 0.90 s | 7.81 s | 8.68× |
| rg (two traits) | 2.93 s | 28.09 s | 9.59× |
(Dataset: 497 MB Pan-UKBB sumstats input; UKBB.EUR LD scores, 381,831 SNPs after merge.)
Releases include Linux, macOS, and Windows archives that contain ldsc, LICENSE, and README.md.
# Linux (x86_64)
curl -L -o ldsc_linux-x86_64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_linux-x86_64.tar.gz
tar -xzf ldsc_linux-x86_64.tar.gz
./ldsc --help
# macOS (Apple Silicon)
curl -L -o ldsc_macos-aarch64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_macos-aarch64.tar.gz
tar -xzf ldsc_macos-aarch64.tar.gz
./ldsc --help
# Windows (x86_64)
# Download the zip from the release page and extract:
# https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_windows-x86_64.zipImages are published to the GitHub Container Registry on every push to main and for each version tag.
docker pull ghcr.io/sharifhsn/ldsc:latest
# Run with local data mounted
docker run --rm \
-v /path/to/data:/data \
ghcr.io/sharifhsn/ldsc:latest \
h2 --h2 /data/trait.sumstats.gz \
--ref-ld-chr /data/eur_w_ld_chr/ \
--w-ld-chr /data/eur_w_ld_chr/ \
--out /data/resultsVersion tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main
tag and a short-SHA tag (:sha-XXXXXXX).
Requires a Rust toolchain (≥ 1.85; edition 2024 features used).
cargo build --release
# binary: target/release/ldscThe release profile sets opt-level = 3, lto = "thin", codegen-units = 1.
For a fully static binary that runs on any Linux (including CentOS 7 / RHEL 7 HPC clusters),
build with musl and the mimalloc feature:
# Requires the musl target (one-time setup)
rustup target add x86_64-unknown-linux-musl
cargo build --release --features mimalloc --target x86_64-unknown-linux-musl
strip target/x86_64-unknown-linux-musl/release/ldsc
# Copy the single binary to your cluster — no dependencies neededThis produces a static-pie binary with zero shared library dependencies. The mimalloc
feature replaces musl's default allocator with Microsoft's mimalloc,
which eliminates the ~12% performance penalty musl's single-threaded allocator incurs under
rayon parallelism. On AWS EPYC benchmarks, musl + mimalloc is actually 4.5% faster than
the default glibc build (69.9s vs 73.2s on full 1000G).
Without --features mimalloc, a musl build works but is ~12% slower due to allocator contention.
--rayon-threads N: Rayon thread count for jackknife inh2/rg.--polars-threads N: Polars thread count for CSV streaming inmunge-sumstats.--mmap: Memory-mapped BED I/O forl2. Recommended for GPFS/Lustre HPC.
- CountSketch math analysis — rigorous derivation of the CountSketch bias on the renormalized cosine, Monte Carlo validation, and the §15 sketch+masking finding.
- Performance deep-dive — algorithmic complexity for each mode, scaling analysis for dense SNP panels (O(M² × N) with distance-based windows), downstream h2/rg regression accuracy by sketch dimension, and why Python is slow.
- Performance log — append-only log of every perf experiment; the 2026-05-12 entry has the cross-method h² validation (GCTA + Python LDSC + chunk-exact + per-SNP exact + sketch+masking).
- Architecture & source map — module-level code map, key data-flow invariants, and dependency rationale.
- GCTA source audit — what GCTA's
--ld-meanrsqactually does, since GCTA is the tool the original LDSC paper used to compute the published 1000G reference scores.
Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release
workflow, which builds and uploads platform tarballs to GitHub Releases.
cargo release patch
cargo release patch --executeRequires Docker with BuildKit (default since Docker 23):
docker build -t ldsc .The multi-stage Dockerfile uses cargo-chef to
cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed
source files.
If you use ldsc-rs in your research, please cite:
Haason, S. & Khan, Y. (2026). ldsc-rs: Exact and approximate LD Score Regression at biobank scale. bioRxiv.
@article{haason2026ldscrs,
author = {Haason, Sharif and Khan, Yousef},
title = {ldsc-rs: Exact and approximate {LD} {Score} Regression at biobank scale},
journal = {bioRxiv},
year = {2026},
}