A bioinformatics tool for local diploid assembly, variant calling, and methylation analysis from long-read whole-genome sequencing (lrWGS) BAM files using graph-based approaches.
Haplograph is designed for accurate haplotype or meplotype reconstruction and variant calling in local genomic regions. It is particularly useful for complex, highly polymorphic regions such as HLA/MHC. The tool uses graph-based assembly to reconstruct methylation-aware haplotypes and call variants.
- Graph-based Haplotype Assembly: Constructs sequence graphs from aligned reads
- Variant Calling: Generates VCF files with phased variants
- Multiple Output Formats: Supports GFA (Graphical Fragment Assembly), VCF, and FASTA formats
- Germline Haplotype Filtering: Focuses on major haplotypes for cleaner results
- Read-based Phasing: Uses read overlap information for accurate phasing
- Comprehensive Evaluation: Compares results against truth sets
- Flexible Parameters: Configurable window sizes, read support thresholds, and filtering options
- Rust 1.70+ (install from rustup.rs)
- HTSlib development libraries
- Standard bioinformatics tools (samtools, bcftools) for file processing
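Before building, it can help to confirm that the command-line prerequisites are on `PATH`. The following is a minimal sketch, not part of Haplograph; `missing_tools` is a hypothetical helper name, and the check does not cover the HTSlib development libraries, which are not a command.

```bash
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names taken from the prerequisites list above.
missing=$(missing_tools cargo samtools bcftools)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on a single line.
    echo "Missing prerequisites:" $missing >&2
fi
```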
# Clone the repository
git clone https://github.com/broadinstitute/haplograph.git
cd haplograph
# Build the project
cargo build --release
# Install globally (optional)
cargo install --path .
Haplograph provides four main commands for different stages of haplotype analysis:
Extract haplotypes from BAM files and build sequence graphs:
# Basic haplotype extraction
haplograph haplograph \
--alignment-bam input.bam \
--reference-fa reference.fa \
--sampleid SAMPLE001 \
--locus chr6:29943661-29943700 \
--output-prefix output/HLA_A
# With custom parameters
haplograph haplograph \
--alignment-bam input.bam \
--reference-fa reference.fa \
--sampleid SAMPLE001 \
--locus chr6:29943661-29943700 \
--frequency-min 0.05 \
--min-reads 3 \
--window-size 500 \
--primary-only \
--default-file-format gfa \
--output-prefix output/HLA_A \
--verbose
Parameters:
- --alignment-bam: Input BAM file
- --reference-fa: Reference FASTA file
- --sampleid: Sample identifier
- --locus: Genomic region (format: chr:start-end)
- --frequency-min: Minimum allele frequency (default: 0.01)
- --min-reads: Minimum supporting reads (default: 2)
- --window-size: Window size for analysis (default: 100)
- --primary-only: Use only primary alignments
- --default-file-format: Output format (fasta or gfa, default: gfa)
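The `--locus` value follows the `chr:start-end` format, so a malformed region can be caught before a run starts. The sketch below is an illustration only and says nothing about how Haplograph parses the argument internally; `parse_locus` is a hypothetical helper, and it assumes that start and end are plain integers with start not greater than end.

```bash
# Split a locus string of the form chr:start-end into its three parts.
# Prints "chrom start end" on success; returns non-zero on malformed input.
parse_locus() {
    locus=$1
    # Reject strings that lack the ':' and '-' separators.
    case $locus in
        *:*-*) ;;
        *) return 1 ;;
    esac
    # Split on the last ':' so contig names containing ':' still work.
    chrom=${locus%:*}
    range=${locus##*:}
    start=${range%%-*}
    end=${range#*-}
    [ -n "$chrom" ] && [ -n "$start" ] && [ -n "$end" ] || return 1
    # Start and end must be purely numeric.
    case $start$end in
        *[!0-9]*) return 1 ;;
    esac
    # The interval must not be inverted.
    [ "$start" -le "$end" ] || return 1
    printf '%s %s %s\n' "$chrom" "$start" "$end"
}
```

For example, `parse_locus chr6:29943661-29943700` prints the three fields separated by spaces, while `parse_locus chr6:29943700` returns a non-zero status.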
Assemble haplotypes from GFA files:
# Basic assembly
haplograph assemble \
--graph-gfa output/HLA_A.gfa \
--output-prefix output/HLA_A_asm
# Germline-only assembly with specific haplotype count
haplograph assemble \
--graph-gfa output/HLA_A.gfa \
--major-haplotype-only \
--number-of-haplotypes 2 \
--output-prefix output/HLA_A_asm \
--verbose
Parameters:
- --graph-gfa: Input GFA file
- --locus: Genomic region
- --major-haplotype-only: Focus on major haplotypes only
- --number-of-haplotypes: Number of haplotypes to extract (default: 2)
Call variants from assembled haplotypes:
# Basic variant calling
haplograph call \
--gfa-file output/HLA_A_asm.gfa \
--sampleid SAMPLE001 \
--reference-fa reference.fa \
--output-prefix output/HLA_A_variants \
--phase-variants
# With phasing and an explicit haplotype limit
haplograph call \
--gfa-file output/HLA_A_asm.gfa \
--sampleid SAMPLE001 \
--reference-fa reference.fa \
--phase-variants \
--maximum-haplotypes 2 \
--output-prefix output/HLA_A_variants \
--verbose
Parameters:
- --gfa-file: Input GFA file from assembly
- --sampleid: Sample identifier
- --reference-fa: Reference FASTA file
- --phase-variants: Enable variant phasing
- --maximum-haplotypes: Maximum number of haplotypes (default: 2)
Evaluate haplotype accuracy against truth sets:
haplograph evaluate \
--truth-fasta truth_haplotypes.fasta \
--query-fasta output/HLA_A_asm.fasta \
--seq-number 2 \
--output-prefix output/evaluation \
--verbose
Parameters:
- --truth-fasta: Truth haplotype sequences
- --query-fasta: Query haplotype sequences
- --seq-number: Number of sequences to compare (default: 2)
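Since `--seq-number` sets how many sequences are compared, it may be worth checking that each FASTA file holds at least that many records before running the evaluation. This is a minimal sketch under that assumption; `count_fasta_records` and `check_seq_number` are hypothetical helpers, and how `haplograph evaluate` itself handles a shortfall is not stated here.

```bash
# Count FASTA records by counting header lines (those starting with '>').
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Warn and return non-zero when a FASTA file holds fewer records than expected.
check_seq_number() {
    fasta=$1
    expected=$2
    found=$(count_fasta_records "$fasta")
    if [ "$found" -lt "$expected" ]; then
        echo "warning: $fasta has $found record(s), expected at least $expected" >&2
        return 1
    fi
}
```

A call such as `check_seq_number truth_haplotypes.fasta 2` succeeds silently when the file contains two or more records.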
# 1. Extract haplotypes and build graph
haplograph haplograph \
--alignment-bam sample.bam \
--reference-fa hg38.fa \
--sampleid HG002 \
--locus chr6:29943661-29943700 \
--output-prefix output/HLA_A
# 2. Assemble haplotypes
haplograph assemble \
--graph-gfa output/HLA_A.gfa \
--major-haplotype-only \
--number-of-haplotypes 2 \
--output-prefix output/HLA_A_asm
# 3. Call variants
haplograph call \
--gfa-file output/HLA_A_asm.gfa \
--sampleid HG002 \
--reference-fa hg38.fa \
--phase-variants \
--output-prefix output/HLA_A_variants
# 4. Evaluate results (if truth available)
haplograph evaluate \
--truth-fasta truth_HLA_A.fasta \
--query-fasta output/HLA_A_asm.fasta \
--output-prefix output/evaluation

- {prefix}.gfa: Sequence graph in GFA format
- {prefix}.fasta: Haplotype sequences (if fasta format selected)

- {prefix}.fasta: Assembled haplotype sequences
- Contains multiple haplotype sequences with support information

- {prefix}.vcf.gz: Compressed VCF file with called variants
- {prefix}.vcf.gz.tbi: Tabix index for the VCF file
- Includes phased variants with quality scores

- {prefix}.tsv: Evaluation metrics and accuracy scores
- Detailed comparison between truth and query sequences
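When several loci are processed, the extract, assemble, and call steps of the workflow can be generated from a small table instead of being typed by hand. The sketch below only prints the commands (a dry run) and reuses the flags and file-naming pattern from the workflow example; `print_pipeline` and the `output/<name>` prefix layout are assumptions made for illustration.

```bash
# Print the three analysis commands for one locus without running them.
# Arguments: name (label for output prefixes), locus, BAM, reference, sample ID.
print_pipeline() {
    name=$1
    locus=$2
    bam=$3
    ref=$4
    sample=$5
    prefix="output/$name"
    echo "haplograph haplograph --alignment-bam $bam --reference-fa $ref --sampleid $sample --locus $locus --output-prefix $prefix"
    echo "haplograph assemble --graph-gfa $prefix.gfa --major-haplotype-only --number-of-haplotypes 2 --output-prefix ${prefix}_asm"
    echo "haplograph call --gfa-file ${prefix}_asm.gfa --sampleid $sample --reference-fa $ref --phase-variants --output-prefix ${prefix}_variants"
}

# One "name locus" pair per line; add further loci below the first entry.
while read -r name locus; do
    print_pipeline "$name" "$locus" sample.bam hg38.fa HG002
done <<'EOF'
HLA_A chr6:29943661-29943700
EOF
```

Printing rather than executing keeps the sketch safe to try; once the commands look right, the output can be piped to `sh`.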
haplograph/
├── Cargo.toml # Project configuration and dependencies
├── src/
│ ├── main.rs # Command-line interface
│ ├── asm.rs # Assembly and graph traversal
│ ├── call.rs # Variant calling and VCF generation
│ ├── eval.rs # Evaluation and comparison
│ ├── graph.rs # Graph construction
│ ├── hap.rs # Haplotype extraction
│ ├── intervals.rs # Genomic interval processing
│ └── util.rs # Utility functions
├── wdl/ # Workflow Definition Language files
├── example/ # Example data and scripts
├── output/ # Output directory (created during analysis)
└── README.md # This file
- rust-htslib: HTSlib bindings for BAM/VCF file handling
- bio: Bioinformatics algorithms and data structures
- clap: Command-line argument parsing
- anyhow: Error handling
- serde/serde_json: Serialization for GFA annotations
- rayon: Parallel processing
- ndarray: Numerical computing
- flate2: Compression support
- regex: Pattern matching
- csv: CSV file processing
- criterion: Benchmarking
- log/env_logger: Logging
- indicatif: Progress bars
High-coverage data:
haplograph haplograph \
--frequency-min 0.01 \
--min-reads 5 \
--window-size 200 \
--primary-only
Low-coverage data:
haplograph haplograph \
--frequency-min 0.05 \
--min-reads 2 \
--window-size 100
Complex regions (e.g., HLA):
haplograph haplograph \
--frequency-min 0.02 \
--min-reads 2 \
--window-size 100 \
--default-file-format gfa
- Memory usage: For large regions, consider reducing window size
- No variants called: Check read coverage and adjust frequency thresholds
- Graph assembly fails: Ensure sufficient read overlap and adjust parameters
- Use --primary-only for faster processing
- Adjust --window-size based on region complexity
- Use an appropriate --frequency-min for your data type
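The tuning presets listed earlier for high-coverage data, low-coverage data, and complex regions can be kept in one place with a small shell helper. This is a sketch only: `tuning_flags` and the preset names `high`, `low`, and `complex` are labels invented here, while the flag values are copied from the presets in this README.

```bash
# Echo the tuning flags from this README for a given data type.
# The preset names (high, low, complex) are labels invented for this sketch.
tuning_flags() {
    case $1 in
        high)    echo "--frequency-min 0.01 --min-reads 5 --window-size 200 --primary-only" ;;
        low)     echo "--frequency-min 0.05 --min-reads 2 --window-size 100" ;;
        complex) echo "--frequency-min 0.02 --min-reads 2 --window-size 100 --default-file-format gfa" ;;
        *)       echo "unknown preset: $1" >&2; return 1 ;;
    esac
}

# Usage (the unquoted substitution is intentional so the flags split into words):
# haplograph haplograph --alignment-bam input.bam --reference-fa reference.fa \
#     --sampleid SAMPLE001 --locus chr6:29943661-29943700 \
#     --output-prefix output/HLA_A $(tuning_flags complex)
```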
cargo test
cargo test -- --nocapture # With output
cargo doc --open
cargo fmt
cargo clippy
If you use Haplograph in your research, please cite:
Haplograph: A bioinformatics tool for haplotype analysis
Version 0.1.0
https://github.com/broadinstitute/haplograph
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support, please open an issue on the GitHub repository or contact the maintainers.