Pangrowth

pangrowth is an efficient tool for genomic researchers to predict the openness of a pangenome and to estimate the core genome size and the pangenome diversity using Hill numbers.

This tool can analyze fasta sequences using k-mers, as well as any other genomic elements such as genes, CDS, or ORFs, as long as they are provided as either a frequency histogram or a pan-matrix (with columns representing genomes and rows representing items; see panmatrix_ecoli_n50.txt for an example).
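The relation between a pan-matrix and a frequency histogram can be sketched as follows. This is a minimal illustration on an in-memory matrix, not pangrowth's own parser: the function name panmatrix_to_hist is hypothetical, and the on-disk layout of panmatrix_ecoli_n50.txt (delimiter, header) is not assumed here.

```python
def panmatrix_to_hist(matrix):
    """Return hist where hist[i-1] is the number of items (rows)
    present in exactly i genomes (columns)."""
    n_genomes = len(matrix[0])
    hist = [0] * n_genomes
    for row in matrix:
        # An item counts as present in a genome if its entry is non-zero.
        present = sum(1 for value in row if value > 0)
        if present > 0:
            hist[present - 1] += 1
    return hist


# Toy pan-matrix: 4 items (rows) x 3 genomes (columns).
toy = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 2, 1],
    [0, 0, 1],
]
print(panmatrix_to_hist(toy))  # [2, 1, 1]
```

Two items occur in one genome, one item in two genomes and one item in all three, so the histogram is [2, 1, 1].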

Key features

k-mer counting: uses a modified version of yak to count k-mers
growth/core calculation: computes the exact expected genomic growth/core size in time quadratic in the number of genomes
Hill numbers: computes Hill numbers from a frequency list
colored compacted de Bruijn graph (cdbg): estimates the pangenome diversity of the cdbg using Hill numbers, by combining the k-mer and infix equivalents histograms

Install

git clone https://github.com/gi-bielefeld/pangrowth
cd pangrowth
mkdir build
cd build
cmake ..
make

To plot the results we need the following Python libraries: numpy, pandas, matplotlib, scipy and seaborn. You can install them with:

pip install -r requirements.txt

Usage

Histogram from fasta files

./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt

pangrowth also accepts a file containing a list of fasta files (one per line), passed with the parameter -i fasta_list.txt
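To make the meaning of the histogram concrete, here is a small pure-Python sketch of the idea: count in how many genomes each canonical k-mer occurs, then tally those counts. It is only an illustration under that assumption; pangrowth itself relies on a modified version of yak, and the names canonical and kmer_hist are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_hist(genomes, k):
    """genomes: list of genomes, each a list of sequences (strings).
    Returns hist where hist[i-1] is the number of canonical k-mers
    found in exactly i genomes."""
    seen_in = defaultdict(set)
    for genome_id, sequences in enumerate(genomes):
        for seq in sequences:
            seq = seq.upper()
            for start in range(len(seq) - k + 1):
                kmer = seq[start:start + k]
                # Skip k-mers containing ambiguous bases such as N.
                if set(kmer) <= set("ACGT"):
                    seen_in[canonical(kmer)].add(genome_id)
    hist = [0] * len(genomes)
    for genome_ids in seen_in.values():
        hist[len(genome_ids) - 1] += 1
    return hist


toy_genomes = [["ACGTA"], ["ACGTT"], ["GGGTA"]]
print(kmer_hist(toy_genomes, 3))  # [3, 2, 0]
```

In the toy example, ACG and GTA each occur in two genomes, while AAC, CCC and ACC occur in one genome each.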

To visualize the histogram:

python scripts/plot_hist.py hist.txt hist.pdf

If you want to compare multiple histograms built from different numbers of genomes, you can use:

python scripts/plot_hist.py --norm_x --norm_y=both hist.txt data/hist_ecoli_n50.txt data/hist_ecoli_n200.txt hist_multiple.pdf

The flag --norm_x normalizes the x-axis to the interval (0,1].
The flag --norm_y allows two types of normalization:
- multiplicity, which multiplies each histogram value h[i] by its index i (i.e., h[i] * i: values appearing once remain the same, values appearing twice are doubled, and so on)
- percentage, which divides each value h[i] by the total sum of h (so that the normalized values sum to 1)

The option --norm_y=both applies both in series.
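The normalizations described above can be sketched in a few lines. This is an illustration only, not the code of scripts/plot_hist.py: the function name normalize_hist is hypothetical, and for "both" the order multiplicity first, then percentage, is an assumption.

```python
def normalize_hist(hist, norm_y=None, norm_x=False):
    """hist[i-1] is the number of items present in exactly i genomes.
    Returns the x and y values after the requested normalization."""
    n = len(hist)
    # --norm_x: map indices 1..n onto the interval (0, 1].
    xs = [(i + 1) / n if norm_x else i + 1 for i in range(n)]
    ys = list(hist)
    if norm_y in ("multiplicity", "both"):
        # Multiply each value h[i] by its index i.
        ys = [y * (i + 1) for i, y in enumerate(ys)]
    if norm_y in ("percentage", "both"):
        # Divide by the total so that the values sum to 1.
        total = sum(ys)
        ys = [y / total for y in ys]
    return xs, ys


xs, ys = normalize_hist([2, 1, 1], norm_y="multiplicity")
print(xs, ys)  # [1, 2, 3] [2, 2, 3]
```

Normalizing the x-axis makes histograms with different numbers of genomes comparable, since each of them then ends at 1.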

Pangenome growth from histogram (or pan-matrix)

./pangrowth growth -h data/hist_ecoli_n50.txt > growth.txt
#./pangrowth growth -p data/panmatrix_ecoli_n50.txt > growth.txt

To fit the openness and visualize the growth:

python scripts/plot_growth.py growth.txt growth.pdf

We can again pass multiple growth files to scripts/plot_growth.py to compare with other species.

python scripts/plot_growth.py growth.txt data/growth_ecoli_n200.txt growth_multiple.pdf
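For intuition, the expected pangenome size for m out of n genomes can be computed from the histogram with a standard combinatorial argument: an item present in i genomes is missed by a random subset of m genomes with probability C(n-i, m) / C(n, m). The sketch below follows that argument and runs in time quadratic in the number of genomes; it is an assumption that this matches pangrowth's computation, and expected_growth is a hypothetical name.

```python
from math import comb


def expected_growth(hist):
    """hist[i-1] is the number of items present in exactly i of the n genomes.
    Returns a list whose entry m-1 is the expected number of distinct
    items seen in a uniformly random subset of m genomes."""
    n = len(hist)
    growth = []
    for m in range(1, n + 1):
        subsets = comb(n, m)
        size = 0.0
        for i, h in enumerate(hist, start=1):
            # Probability that none of the i genomes carrying the item is sampled.
            p_missed = comb(n - i, m) / subsets
            size += h * (1.0 - p_missed)
        growth.append(size)
    return growth


for m, size in enumerate(expected_growth([2, 1, 1]), start=1):
    print(m, round(size, 3))
```

For m = n the result equals the total number of items, and for m = 1 it equals the average number of items per genome.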

Pangenome core from histogram (or pan-matrix)

./pangrowth core -h data/hist_ecoli_n50.txt > core.txt
#./pangrowth core -p data/panmatrix_ecoli_n50.txt > core.txt
./pangrowth core -h data/hist_ecoli_n50.txt -q 0.9 > core_q90.txt

The -q option sets the quorum, i.e. the fraction of genomes in which an item must be present to be considered part of the core (default 1.0).

To fit the core genome and report the percentage of core items over the expected genome size:

python scripts/plot_core.py core_q90.txt data/core_q90_ecoli_n200.txt core.pdf

The expected genome size is calculated as the total sum of the histogram divided by the number of genomes.
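A rough sketch of the expected core size and the expected genome size follows. Several points are assumptions rather than documented behaviour: the quorum is read as "present in at least ceil(q * m) of the m sampled genomes", the "total sum of the histogram" is read as the multiplicity-weighted sum of h[i] * i, and the names expected_core and expected_genome_size are hypothetical. For q below 1 this naive version is slower than quadratic.

```python
from math import ceil, comb


def expected_core(hist, q=1.0):
    """hist[i-1] is the number of items present in exactly i of the n genomes.
    Returns a list whose entry m-1 is the expected number of items present
    in at least ceil(q * m) genomes of a random subset of m genomes."""
    n = len(hist)
    core = []
    for m in range(1, n + 1):
        # Small epsilon guards against floating-point noise in q * m.
        need = max(1, ceil(q * m - 1e-9))
        subsets = comb(n, m)
        size = 0.0
        for i, h in enumerate(hist, start=1):
            # Hypergeometric tail: at least `need` of the m genomes carry the item.
            ways = sum(comb(i, x) * comb(n - i, m - x)
                       for x in range(need, min(i, m) + 1))
            size += h * ways / subsets
        core.append(size)
    return core


def expected_genome_size(hist):
    """Average number of items per genome (assumed reading of the text above)."""
    return sum(i * h for i, h in enumerate(hist, start=1)) / len(hist)


hist = [2, 1, 1]
core = expected_core(hist)
print(core[-1] / expected_genome_size(hist) * 100)  # percentage of core items
```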

Hill numbers from k-mer histogram

Hill numbers measure pangenome diversity (species richness, exponential entropy, inverse Simpson index) from a k-mer frequency histogram:

./pangrowth hill -p 30 data/hist_ecoli_n50.txt

-p INT sets the number of sample points (default: 30); use -p 0 to output all points
-f FILE reads sample points from a file (one integer per line), overriding -p

The output is a tab-separated table with columns: fit, m, richness, exp_entropy, inv_gini_simp, where fit is int (interpolation), obs (observed), or ext (extrapolation) and m is the number of genomes.
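As an illustration of the three measures, the sketch below computes them for the observed data only (no interpolation or extrapolation). It assumes an incidence-based convention in which an item present in i genomes has relative frequency i / U, with U the sum of h[i] * i; whether pangrowth uses exactly this convention is an assumption, and observed_hill is a hypothetical name.

```python
from math import exp, log


def observed_hill(hist):
    """hist[i-1] is the number of items present in exactly i genomes.
    Returns (richness, exponential Shannon entropy, inverse Simpson index)."""
    total = sum(i * h for i, h in enumerate(hist, start=1))
    richness = sum(hist)
    entropy = 0.0
    simpson = 0.0
    for i, h in enumerate(hist, start=1):
        if h == 0:
            continue
        p = i / total  # relative frequency of one item seen in i genomes
        entropy -= h * p * log(p)
        simpson += h * p * p
    return richness, exp(entropy), 1.0 / simpson


print(observed_hill([2, 1, 1]))
```

When all items have the same frequency the three values coincide; the more uneven the frequencies, the further the second and third values fall below the richness.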

Colored compacted de Bruijn graph

The colored compacted de Bruijn graph (cdbg) compacts non-branching paths of k-mers into unitigs. Its diversity can be estimated by combining a k-mer histogram with an infix equivalents histogram.

Step 1: generate the k-mer histogram and the infix equivalents histogram from fasta files:

./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt
./pangrowth hist_infix -k 17 -t 12 -T data/fa/*.fna.gz > hist_infix.txt

Options for hist_infix are the same as for hist:

-k INT k-mer size (default: 17)
-t INT number of worker threads (default: 4)
-i PATH file containing a list of fasta files (one per line)
-b turn off canonical k-mer transformation
-T account for telomeres breaking unitigs
-c INT minimum k-mer count to consider a (k+1)-mer (default: 1)

Step 2: compute Hill numbers for the cdbg using both histograms:

./pangrowth hill_cdbg hist.txt hist_infix.txt

The output format is identical to hill: a tab-separated table with columns fit, m, richness, exp_entropy, inv_gini_simp.
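For downstream analysis, the table can be read with a few lines of Python. This is a sketch under assumptions: the column order is taken from the description above, rows whose first field is not int, obs or ext (for example a possible header line) are skipped, and the numbers in the sample text are made-up placeholders, not real pangrowth output.

```python
COLUMNS = ("fit", "m", "richness", "exp_entropy", "inv_gini_simp")
FIT_LABELS = ("int", "obs", "ext")


def parse_hill_table(text):
    """Parse tab-separated hill / hill_cdbg output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS) or fields[0] not in FIT_LABELS:
            continue  # skip header, blank or malformed lines
        row = {"fit": fields[0], "m": int(float(fields[1]))}
        for name, value in zip(COLUMNS[2:], fields[2:]):
            row[name] = float(value)
        rows.append(row)
    return rows


# Placeholder values for illustration only.
sample = (
    "fit\tm\trichness\texp_entropy\tinv_gini_simp\n"
    "int\t25\t100.0\t80.0\t60.0\n"
    "obs\t50\t150.0\t110.0\t75.0\n"
    "ext\t100\t200.0\t130.0\t82.0\n"
)
rows = parse_hill_table(sample)
observed = [row for row in rows if row["fit"] == "obs"]
print(observed[0]["m"], observed[0]["richness"])  # 50 150.0
```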

Publication

Parmigiani, L., Wittler, R., Stoye, J.: Revisiting pangenome openness with k-mers. PCI Comp & Biol. (2024).

Contact

For any question, feedback or problem, please feel free to file an issue on GitHub or contact me here and I will get back to you as soon as possible.

Pangrowth is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appreciate it if you would participate in the evaluation of Pangrowth by completing this very short survey.
