This repository contains a Snakemake pipeline that:
- Builds graphs (GFA) from FASTA, FASTQ (ggcat+Lighter), VCF (vg), or pre‑existing GFAs.
- Cleans and "bluntifies" the graphs.
- Prepares unidirectional graph representations (sgraph / edgelist).
- Runs benchmarks on several programs (BubbleGun, BubbleFinder, vg snarls, clsd).
- Aggregates results in a single table and produces plots (time / memory).

Requirements:
- Unix‑like OS (Linux, macOS).
- Snakemake (recommended via conda/mamba).
- conda or mamba available in $PATH. (The pipeline uses conda: directives in the rules.)
- Internet access (to download data and binaries).

The following tools are managed automatically by the Snakefile if no binary is specified in datasets.yaml:
- BubbleFinder, cloned and built from GitHub (commit 6ba16c95).
- GetBlunted, precompiled binary downloaded (release v1.0.0).
- clsd, cloned and built (commit c49598fc).
- Lighter, cloned and built (commit d8621db1).

# 1) Clone the repository
git clone https://github.com/algbio/BubbleFinder-experiments.git
cd BubbleFinder-experiments
# 2) (Optional) Create a conda/mamba env with Snakemake version 9
mamba env create -f environment.yml
conda activate bench

Make sure conda or mamba is in $PATH when you run Snakemake.

The datasets.yaml file describes:
- datasets (datasets:),
- builders (how to produce raw GFAs),
- tools and conda environments,
- benchmark programs and their parameters.
Each dataset has a builder field:
- ggcat_from_fasta
  - Input: FASTA/FNA (or .tar.gz archive containing FASTA files).
  - Example: coli3682 (from Zenodo).
- ggcat_from_reads_lighter
  - Input: FASTQ(.gz) → Lighter correction → ggcat.
- vg_from_vcf
  - Input: fa_gz (reference) + vcf_gz → vg construct → GFA.
- gfa_from_url
  - Input: pre‑built GFA downloaded from a URL.
- pggb_from_fasta
  - Input: FASTA, graph built with pggb (pangenome graphs).

Under datasets::
- name: coli3682
  enabled: true
  builder: ggcat_from_fasta
  ...

The enabled field accepts three values:
- enabled: true → used.
- enabled: false → ignored.
- enabled: auto → enabled only if input files can be detected (useful for HG00733, etc.).

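The three values can be read as a small decision rule. The sketch below is one way to express it in Python. The names `is_enabled` and `input_paths` are hypothetical, and because this README does not say how the Snakefile detects input files, plain file existence is used as a stand-in.

```python
from pathlib import Path


def is_enabled(enabled, input_paths):
    """Hypothetical helper: decide whether a dataset should be used.

    `enabled` is the value read from datasets.yaml (True, False, or "auto").
    `input_paths` lists the files the dataset's builder would need.
    """
    if enabled is True:
        return True
    if enabled is False:
        return False
    if enabled == "auto":
        # Assumption: "can be detected" is approximated by
        # "at least one input is listed and every listed file exists".
        return bool(input_paths) and all(Path(p).exists() for p in input_paths)
    raise ValueError(f"unsupported value for enabled: {enabled!r}")
```

With this rule, a dataset marked `auto` is skipped without error on machines where its input files have not been downloaded.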
Global (in defaults.bench.programs):
defaults:
  bench:
    reps: 2
    programs:
      - BubbleGun_gfa        # BubbleGun, with a .GFA file as input
      - sbSPQR_gfa           # BubbleFinder, bidirectional edges
      - vg_snarls_gfa        # vg snarls, with a .GFA file as input
      - clsd_sb              # clsd only handles unidirectional edges
      - sbSPQR_sb            # BubbleFinder, unidirectional edges
      - sbSPQR_snarls_gfa    # BubbleFinder, snarls mode

Per dataset (overrides the global value):
- name: coli3682
  ...
  bench_programs:
    - clsd_sb

See the comments in datasets.yaml for the full list of programs and options.

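The override rule can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the Snakefile's actual code. The helper name `programs_for` and the dataset name `other` are hypothetical, and the configuration is written as plain dictionaries instead of being loaded from datasets.yaml.

```python
def programs_for(dataset, defaults):
    """Hypothetical helper: a per-dataset bench_programs list wins over the global one."""
    if "bench_programs" in dataset:
        return list(dataset["bench_programs"])
    return list(defaults["bench"]["programs"])


# Shortened stand-in for the defaults section of datasets.yaml.
defaults = {"bench": {"reps": 2, "programs": ["BubbleGun_gfa", "sbSPQR_gfa", "clsd_sb"]}}

# coli3682 overrides the global list; "other" (hypothetical) inherits it.
print(programs_for({"name": "coli3682", "bench_programs": ["clsd_sb"]}, defaults))
print(programs_for({"name": "other"}, defaults))
```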
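Dry run (shows the planned jobs and their shell commands without executing anything):
snakemake -n -p

Full run:
snakemake all --use-conda -j 8

- --use-conda: required to create/use the environments defined in config/*.yml.
- -j 8: number of parallel jobs (adapt to your machine / cluster).
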
- Build and clean only the GFA of coli3682:
  snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa
- Run only the benchmarks and aggregation:
  snakemake --use-conda -j 8 results/benchmarks.tsv

For each dataset <name> (e.g. coli3682):
- data/<name>/<name>.*.gfa
  - Raw GFA (...ggcat.fasta.gfa, ...vg.gfa, ...pggb.gfa, etc.).
- data/<name>/<name>.bluntified.gfa (temporary).
- data/<name>/<name>.cleaned.gfa
  - Cleaned GFA (no H‑lines, bluntified).
- data/<name>/<name>.sb.cleaned.gfa
  - Unidirectional graph representation version (if ggcat_force_f is enabled for this dataset).
- data/<name>/<name>.sbspqr.sgraph
  - sgraph used by BubbleFinder/sbSPQR (unidirectional graph representation mode).
- data/<name>/<name>.clsd.edgelist
  - Edgelist for clsd.
- results/.prechecks.ok
  - Marker indicating that prechecks have run.
- results/bench/
  - Per‑program, per‑dataset, per‑rep benchmark TSV files.
  - Name: <program>/<dataset>.t<threads>.rep<rep>.tsv.
- results/prog_out/
  - Raw program outputs (JSON, sgraphs, program‑specific logs).
- results/logs/
  - Detailed logs per step (ggcat, vg, sgraph, bench, etc.).
- results/benchmarks.tsv
  - Aggregated benchmark table (time, memory, signature, etc.).
- results/plots/
  - time_by_dataset_program.png
  - rss_by_dataset_program.png
- results/summary/reruns_planned.tsv
  - Information used for automatic reruns (timeouts, failures).

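The naming scheme of the files under results/bench/ makes it possible to recover the program, dataset, thread count, and repetition from a path alone. The following Python sketch shows one way to do that. `parse_bench_path` is a hypothetical helper, and the thread count and repetition in the example path are made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Matches "<dataset>.t<threads>.rep<rep>.tsv"; the dataset name may itself contain dots.
BENCH_NAME = re.compile(r"^(?P<dataset>.+)\.t(?P<threads>\d+)\.rep(?P<rep>\d+)\.tsv$")


def parse_bench_path(path):
    """Hypothetical helper: split a results/bench/ path into its components."""
    p = PurePosixPath(path)
    m = BENCH_NAME.match(p.name)
    if m is None:
        raise ValueError(f"not a benchmark TSV name: {p.name}")
    return {
        "program": p.parent.name,  # the directory directly above the file
        "dataset": m["dataset"],
        "threads": int(m["threads"]),
        "rep": int(m["rep"]),
    }


print(parse_bench_path("results/bench/clsd_sb/coli3682.t1.rep2.tsv"))
```

A parser like this is handy when inspecting individual runs before they are aggregated into results/benchmarks.tsv.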
In datasets.yaml, under defaults.tools, you can point to pre‑installed binaries to avoid automatic cloning/building:
defaults:
  tools:
    spqr_bin: /path/to/BubbleFinder
    get_blunted: /path/to/get_blunted
    lighter_bin: /path/to/lighter
    clsd_bin: /path/to/clsd

If these paths are not set, the Snakefile will:
- clone/build BubbleFinder, clsd, and Lighter under build/,
- download get_blunted into bin/.

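The fallback logic amounts to "use the configured path if there is one, otherwise use a pipeline-managed binary". A minimal Python sketch of that rule follows. `resolve_tool` is a hypothetical helper, and the fallback paths in the example are placeholders: this README only states that managed tools live under build/ and bin/, not their exact locations.

```python
def resolve_tool(tools, key, fallback):
    """Hypothetical helper: return (path, managed) for one tool.

    `managed` is True when no path is configured, meaning the pipeline
    would have to clone/build or download the binary itself.
    """
    configured = tools.get(key)
    if configured:
        return configured, False
    return fallback, True


# Stand-in for defaults.tools with only one binary configured.
tools = {"spqr_bin": "/path/to/BubbleFinder"}

# The fallback paths below are placeholders, not the Snakefile's real locations.
print(resolve_tool(tools, "spqr_bin", "build/BubbleFinder"))
print(resolve_tool(tools, "get_blunted", "bin/get_blunted"))
```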
Troubleshooting:
- Error in prechecks (prechecks):
  - Check results/logs/* (especially results/logs/bench, results/logs/ggcat, results/logs/vg).
  - Ensure conda or mamba is available.
- Frequent timeouts:
  - Adjust defaults.tools.timeout.seconds in datasets.yaml.
- Problem with a specific dataset:
  - Run a single target, e.g.:
    snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa
- Bluntification issues:
  - A WARN about get_blunted means the pipeline falls back to "naive bluntify" (overlap fields forced to *).
  - Install GetBlunted and set defaults.tools.get_blunted for better behavior.
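
To make the "naive bluntify" fallback concrete: in GFA 1, the overlap is the sixth tab-separated field of an L-line, so forcing it to `*` is a simple rewrite. The Python sketch below illustrates that idea only. `force_star_overlaps` is a hypothetical name, and whatever else the pipeline's fallback does is not described here. Note that the sequences themselves are left untouched, which is why a real bluntifier such as GetBlunted is preferable.

```python
def force_star_overlaps(gfa_lines):
    """Yield GFA lines (without trailing newline) with every L-line overlap set to '*'."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # GFA 1 link line: L, from, from_orient, to, to_orient, overlap, [tags...]
        if fields[0] == "L" and len(fields) >= 6:
            fields[5] = "*"
        yield "\t".join(fields)


example = ["S\t1\tACGT\n", "S\t2\tGTTA\n", "L\t1\t+\t2\t-\t3M\n"]
for out in force_star_overlaps(example):
    print(out)
```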