Pangrowth

pangrowth is an efficient tool for genomic researchers that predicts the openness of a pangenome, estimates the core genome size, and measures pangenome diversity using Hill numbers.

It can analyze fasta sequences using k-mers, as well as any other genomic elements such as genes, CDS, or ORFs, as long as they are provided as either a frequency histogram or a pan-matrix (with columns representing genomes and rows representing items; see panmatrix_ecoli_n50.txt for an example).
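
The relationship between the two input types can be sketched in a few lines of Python. This is only an illustration: it assumes a pan-matrix already loaded in memory as a list of rows (items) with one column per genome, and that the histogram stores, at position i-1, the number of items present in exactly i genomes. The function name and the toy matrix are hypothetical, and the on-disk format of panmatrix_ecoli_n50.txt is not parsed here.

```python
def panmatrix_to_hist(matrix):
    """Count, for each i in 1..n, how many items occur in exactly i genomes.

    matrix: list of rows (items); each row has one entry per genome,
            with a value > 0 meaning the item is present in that genome.
    Returns a list h of length n where h[i-1] is the count for i genomes.
    """
    if not matrix:
        return []
    n_genomes = len(matrix[0])
    hist = [0] * n_genomes
    for row in matrix:
        present = sum(1 for value in row if value > 0)
        if present > 0:  # items absent from every genome are ignored
            hist[present - 1] += 1
    return hist


# Toy pan-matrix (made-up values): 4 items (rows) x 3 genomes (columns).
toy_matrix = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
print(panmatrix_to_hist(toy_matrix))
```

In this toy case two items occur in a single genome, one in two genomes, and one in all three.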

Key features

  • k-mer counting: utilizes a modified version of yak to count k-mers
  • growth/core calculation: computes the exact expected genomic growth/core size in time quadratic in the number of genomes
  • hill numbers: computes Hill numbers from a frequency list
  • colored compacted de Bruijn graph (cdbg): estimates the pangenome diversity of the cdbg using Hill numbers, by combining k-mer and infix equivalents histograms

Install

git clone https://github.com/gi-bielefeld/pangrowth
cd pangrowth
mkdir build
cd build
cmake ..
make 

To plot the results, the following Python libraries are needed: numpy, pandas, matplotlib, scipy and seaborn. You can install them with:

pip install -r requirements.txt

Usage

Histogram from fasta files

./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt
  • pangrowth also accepts a file containing a list of fasta files (one per line), passed with the parameter -i fasta_list.txt
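
One way to produce such a list file is sketched below. The helper name, the default glob pattern, and the output file name are assumptions chosen to match the example command; any other way of writing one path per line should work equally well.

```python
from pathlib import Path


def write_fasta_list(fasta_dir, out_path, pattern="*.fna.gz"):
    """Write the paths of all files matching `pattern` to `out_path`, one per line."""
    paths = sorted(str(p) for p in Path(fasta_dir).glob(pattern))
    Path(out_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Example call (paths are illustrative):
# write_fasta_list("data/fa", "fasta_list.txt")
```

The resulting file can then be passed with -i fasta_list.txt instead of listing the fasta files on the command line.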

To visualize the histogram:

python scripts/plot_hist.py hist.txt hist.pdf

k-mer frequency histogram of 12 ecoli

If you have multiple histograms with different numbers of genomes that you want to compare, you can use:

python scripts/plot_hist.py --norm_x --norm_y=both hist.txt data/hist_ecoli_n50.txt data/hist_ecoli_n200.txt hist_multiple.pdf
  • The flag --norm_x normalizes the x-axis to the interval (0,1].
  • The flag --norm_y allows two types of normalization:
    • multiplicity, which multiplies each histogram value h[i] by its index i (i.e., h[i] * i, so values appearing once remain the same, values appearing twice are doubled, and so on)
    • percentage, which divides each value h[i] by the total sum of h (so that the values sum to 1)
  • --norm_y=both applies both in series.
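
The arithmetic behind these normalizations can be sketched as follows. This is an illustration rather than the code of scripts/plot_hist.py: the function names are hypothetical, the histogram is assumed to be a list whose position i-1 holds h[i], and "both" is assumed to mean multiplicity first, then percentage.

```python
def normalize_x(n_genomes):
    """Map the indices 1..n to the interval (0, 1]."""
    return [i / n_genomes for i in range(1, n_genomes + 1)]


def normalize_multiplicity(hist):
    """Multiply each h[i] by its index i (hist[0] corresponds to i = 1)."""
    return [count * i for i, count in enumerate(hist, start=1)]


def normalize_percentage(hist):
    """Divide each value by the total sum so that the result sums to 1."""
    total = sum(hist)
    return [count / total for count in hist] if total else list(hist)


def normalize_both(hist):
    """Assumed order: multiplicity first, then percentage."""
    return normalize_percentage(normalize_multiplicity(hist))


toy_hist = [2, 1, 1]  # made-up histogram for 3 genomes
print(normalize_x(len(toy_hist)))
print(normalize_multiplicity(toy_hist))
print(normalize_both(toy_hist))
```

Normalizing both axes this way is what makes histograms built from different numbers of genomes comparable on one plot.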

k-mer frequency histogram of multiple ecoli

Pangenome growth from histogram (or pan-matrix)

./pangrowth growth -h data/hist_ecoli_n50.txt > growth.txt
#./pangrowth growth -p data/panmatrix_ecoli_n50.txt > growth.txt
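
As a rough illustration of what an expected growth curve represents, the sketch below averages the pangenome size over all subsets of m genomes, using one closed-form term per histogram bin. It assumes that position i-1 of the histogram holds the number of items present in exactly i of the n genomes. The function name and the toy histogram are hypothetical, and this is not claimed to be pangrowth's implementation or to match its running time.

```python
from math import comb


def expected_growth(hist):
    """Expected number of distinct items in a random subset of m genomes, m = 1..n.

    An item present in i of n genomes is missed by a subset of size m with
    probability C(n - i, m) / C(n, m), so it is seen with the complement.
    """
    n = len(hist)
    growth = []
    for m in range(1, n + 1):
        subsets = comb(n, m)
        expected = 0.0
        for i, count in enumerate(hist, start=1):
            # comb(n - i, m) is 0 whenever m > n - i
            expected += count * (1 - comb(n - i, m) / subsets)
        growth.append(expected)
    return growth


toy_hist = [2, 1, 1]  # made-up histogram for 3 genomes
print(expected_growth(toy_hist))
```

For m = 1 the value equals the average genome size, and for m = n it equals the total number of distinct items.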

To fit the openness and visualize the growth:

python scripts/plot_growth.py growth.txt growth.pdf

k-mer growth of ecoli
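
A simple way to think about the openness fit is a power law on the number of new items contributed by the m-th genome, new(m) ≈ K · m^(-α). The sketch below fits K and α by least squares in log-log space using only the standard library. It is an assumption-laden illustration: scripts/plot_growth.py may fit a different model or use a different optimizer, the function names are hypothetical, and the input is assumed to be the list of growth values for m = 1..n. A commonly used convention is that α ≤ 1 indicates an open pangenome and α > 1 a closed one.

```python
from math import exp, log


def fit_power_law(ms, values):
    """Fit values ~ K * m**(-alpha) by linear regression on log-log data."""
    points = [(log(m), log(v)) for m, v in zip(ms, values) if m > 0 and v > 0]
    k = len(points)
    mean_x = sum(x for x, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return exp(intercept), -slope  # (K, alpha)


def openness_alpha(growth):
    """Estimate (K, alpha) from growth values given for m = 1..n."""
    new_items = [growth[j] - growth[j - 1] for j in range(1, len(growth))]
    ms = range(2, len(growth) + 1)  # new_items[0] belongs to m = 2
    return fit_power_law(ms, new_items)


# Synthetic growth curve generated from an exact power law (not real data).
synthetic = [100.0]
for m in range(2, 21):
    synthetic.append(synthetic[-1] + 100.0 * m ** -0.5)
K, alpha = openness_alpha(synthetic)
print(round(K, 3), round(alpha, 3))
```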

We can again pass multiple growth files to scripts/plot_growth.py to compare with other species.

python scripts/plot_growth.py growth.txt data/growth_ecoli_n200.txt growth_multiple.pdf

k-mer growth of multiple ecoli

Pangenome core from histogram (or pan-matrix)

./pangrowth core -h data/hist_ecoli_n50.txt > core.txt
#./pangrowth core -p data/panmatrix_ecoli_n50.txt > core.txt
./pangrowth core -h data/hist_ecoli_n50.txt -q 0.9 > core_q90.txt
  • The -q option takes the quorum, i.e. the fraction of genomes in which an item must be present to be considered part of the core (default 1.0).
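
To make the effect of the quorum concrete, the sketch below computes an expected core size for every subset size m from a histogram. It assumes that position i-1 of the histogram holds the number of items present in exactly i of the n genomes, and that an item counts as core in a subset of m genomes when it occurs in at least ceil(q · m) of them. The rounding rule, the function name, and the toy histogram are assumptions, and this straightforward version is slower than the quadratic computation the tool advertises.

```python
from math import ceil, comb


def expected_core(hist, quorum=1.0):
    """Expected core size in a random subset of m genomes, for m = 1..n."""
    n = len(hist)
    core = []
    for m in range(1, n + 1):
        # Small epsilon guards against floating-point noise in quorum * m.
        need = max(1, ceil(quorum * m - 1e-9))
        subsets = comb(n, m)
        expected = 0.0
        for i, count in enumerate(hist, start=1):
            # Ways to pick m genomes so that at least `need` of them carry the item.
            hits = sum(
                comb(i, j) * comb(n - i, m - j)
                for j in range(need, min(i, m) + 1)
            )
            expected += count * hits / subsets
        core.append(expected)
    return core


toy_hist = [2, 1, 1]  # made-up histogram for 3 genomes
print(expected_core(toy_hist))        # strict core, quorum 1.0
print(expected_core(toy_hist, 0.5))   # relaxed core, quorum 0.5
```

With quorum 1.0 only items present in every selected genome are counted, so the curve decreases as m grows; lowering the quorum admits more items.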

To fit the core genome and report the percentage of core items over the expected genome size:

python scripts/plot_core.py core_q90.txt data/core_q90_ecoli_n200.txt core.pdf

The expected genome size is calculated as the total sum of the histogram divided by the number of genomes.

k-mer core size of multiple ecoli

Hill numbers from k-mer histogram

Hill numbers measure pangenome diversity (species richness, exponential entropy, inverse Simpson index) from a k-mer frequency histogram:

./pangrowth hill -p 30 data/hist_ecoli_n50.txt
  • -p INT sets the number of sample points (default: 30); use -p 0 to output all points
  • -f FILE reads sample points from a file (one integer per line), overriding -p

The output is a tab-separated table with columns: fit, m, richness, exp_entropy, inv_gini_simp, where fit is int (interpolation), obs (observed), or ext (extrapolation) and m is the number of genomes.
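
For readers unfamiliar with the three diversity columns, the sketch below evaluates the standard Hill number of order q on a vector of relative abundances: order 0 gives richness, order 1 the exponential of Shannon entropy, and order 2 the inverse Simpson index. How pangrowth derives abundances from the k-mer histogram, and how it interpolates or extrapolates over m, is not shown here; the function name and the example vectors are hypothetical.

```python
from math import exp, log


def hill_number(proportions, order):
    """Hill number of the given order for relative abundances summing to 1."""
    p = [x for x in proportions if x > 0]
    if order == 1:
        # Limit of the general formula as order -> 1: exp of Shannon entropy.
        return exp(-sum(x * log(x) for x in p))
    return sum(x ** order for x in p) ** (1 / (1 - order))


uneven = [0.5, 0.25, 0.25]  # made-up relative abundances
for q in (0, 1, 2):
    print(q, hill_number(uneven, q))
```

For a perfectly even community all three orders coincide with the number of items, and the values diverge as abundances become more uneven.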

Colored compacted de Bruijn graph

The colored compacted de Bruijn graph (cdbg) compacts non-branching paths of k-mers into unitigs. Its diversity can be estimated by combining a k-mer histogram with an infix equivalents histogram.

Step 1: generate the k-mer histogram and the infix equivalents histogram from fasta files:

./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt
./pangrowth hist_infix -k 17 -t 12 -T data/fa/*.fna.gz > hist_infix.txt

Options for hist_infix are the same as for hist:

  • -k INT k-mer size (default: 17)
  • -t INT number of worker threads (default: 4)
  • -i PATH file containing a list of fasta files (one per line)
  • -b turn off canonical k-mer transformation
  • -T account for telomeres breaking unitigs
  • -c INT minimum k-mer count to consider a (k+1)-mer (default: 1)

Step 2: compute Hill numbers for the cdbg using both histograms:

./pangrowth hill_cdbg hist.txt hist_infix.txt

The output format is identical to hill: a tab-separated table with columns fit, m, richness, exp_entropy, inv_gini_simp.

Publication

Parmigiani, L., Wittler, R., Stoye, J.: Revisiting pangenome openness with k-mers. PCI Comp & Biol. (2024).

Contact

For any questions, feedback or problems, please feel free to file an issue on GitHub or contact me here and I will get back to you as soon as possible.

Pangrowth is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appreciate it if you would participate in the evaluation of Pangrowth by completing this very short survey.