Phytohormones-Fungi: a reproducible phylogenomic pipeline for cross-kingdom homolog detection and analysis

This repository contains the complete, reproducible bioinformatics pipeline used in the study of phytohormone-associated gene homologs across the fungal tree of life. The workflow integrates a taxonomically curated custom protein database, cross-kingdom homology detection with domain architecture validation, phylogeny-aware sequence alignment, and maximum-likelihood phylogenetic inference.

Associated publication: García-Estrada, D.A., et al. (2026). Phytohormones in Fungi: Inter-Kingdom Modulators or Fungal Self-Controlling Elements? ASM Microbiology Society. (Manuscript submitted)

Overview

The pipeline is divided into two sequential phases. Phase 1 constructs a non-redundant protein database from 232 organisms spanning the full fungal tree of life and key outgroups (Viridiplantae, Metazoa, Bacteria, Archaea). Phase 2 performs homology detection, curation, and phylogenetic inference using that database.

The Trichoderma atroviride v3 proteome (the primary study organism, currently unpublished) was incorporated manually into the database and is not downloaded by the automated scripts.

flowchart TD
    A["**Phase 1**
    organisms.tsv 232 taxa — Fungi, Bacteria, 
    Viridiplantae, Metazoa"] --> B["download_proteomes.sh UniProt ref → reviewed → complete → NCBI RefSeq fallback"]
    B --> C["prepare_blastdb.sh Header standardization TaxID-prefix + makeblastdb → hormoneDB"]
    M["*T. atroviride* v3 unpublished —
    added manually"] --> C

    D["Plant seed sequences
    Experimentally characterized
    SwissProt / literature"] --> E0{"**Phase 2A**
    Seed expansion
    Manual curation"}
    
    E0 --> F["BLASTp → NCBI
    Initial cross-kingdom search"]

    E0 --> E["hmmscan_domains.sh
    hmmscan --cut_tc vs Pfam-A
    Discover domain architecture
    per seed protein"]
    E --> G["search_hmm_ids.sh
    hmmsearch against *T. atroviride* v3"]
    G --> H["domain_search_script.sh
    Complete domain architecture
    validation — all domains required"]
    F --> I["Expanded and validated seeds
    multi-kingdom representatives"]
    H --> I

    C --> J["**Phase 2B**
    blastp_batch.sh
    BLOSUM45 · e≤1×10⁻⁵
    word_size=3 · SEG=yes"]
    I --> J
    J --> K["blastp-analysis.Rmd
    Best hit per Query–Organism
    bitscore → qcovs → pident → evalue"]
    K --> L["extract_sequences_batch.sh
    + GetSeqsFromFasta.py"]

    L --> N["**Phase 2C**
    phylo_pipeline.sh
    seqkit rmdup → CD-HIT 99%
    MAFFT → FastTree → PRANK
    trimAl → IQ-TREE MFP+LG+C60"]

    N --> O["**Phase 2D**
    phylogenetic-tree.Rmd
    phylogenetic-tree_batch.R
    ggtree + Kingdom / Phylum metadata"]
    O --> P["Publication figures
    SVG · PNG"]
    
    N --> Q["**Phase 3**
    AlphaFold v3
    Structural modeling"]
    Q --> R["ChimeraX
    RMSD structural comparison"]

Phase 2A uses two independent methods (BLASTp and profile HMM search) to find cross-kingdom homologs of plant seed sequences. A candidate was accepted only if it possessed the complete domain architecture of the reference protein — sequences missing any expected domain were excluded. Candidate selection in this phase was performed manually by the authors.

Phase 2B–D applies the expanded, validated seed set against the custom database, curates one representative sequence per organism per gene family, and infers maximum-likelihood phylogenies with statistical support.

Repository structure

Phytohormones-Fungi/
│
├── README.md                        # This file
├── LICENSE                          # MIT License
├── CITATION.cff                     # Citation metadata
├── env/
│   ├── environment.yml              # Conda environment
│   └── environment.lock.linux-64.yml # Exact locked environment used in the study
├── scripts/                         # All pipeline scripts
│   ├── README.md                    # Script-level documentation (inputs, outputs, rationale)
│   │
│   ├── # Phase 1 — Database construction
│   ├── 01-download_proteomes.sh
│   ├── 02-prepare_blastdb.sh
│   │
│   ├── # Phase 2A — Seed expansion (HMM-based)
│   ├── 03-hmmscan_domains.sh
│   ├── 04-extract_hmm_ids.sh
│   ├── 05-generate_hmm_consensus.sh
│   ├── 06-search_hmm_ids.sh
│   ├── 07-domain_search_script.sh
│   │
│   ├── # Phase 2B — BLASTp and curation
│   ├── 08-blastp_batch.sh
│   ├── 09-blastp-analysis.Rmd
│   ├── 10-extract_sequences_batch.sh
│   ├── GetSeqsFromFasta.py
│   │
│   ├── # Phase 2C–D — Phylogenetics and visualization
│   ├── 11-phylo_pipeline.sh
│   ├── 12-phylogenetic-tree.Rmd
│   └── 12-phylogenetic-tree_batch.R
│
├── data/
│   ├── organisms.tsv                # List of organisms used to build the DB
│   └── metadata.tsv                 # Organism, TaxIDs, Kingdom, Phylum, Early, etc.
│
└── docker/
    └── Dockerfile                   # Container with all command-line dependencies

Note on large files: Protein databases (hormoneDB.fasta, proteome FASTA files) and analysis outputs (alignments, tree files) are not tracked in Git due to size.

Installation

Option 1 — Conda (recommended)

Two environment files are provided to balance reproducibility and portability.

Cross-platform (recommended for new users):

git clone https://github.com/DavidAlberto/Phytohormones-Fungi.git
cd Phytohormones-Fungi

conda env create -f env/environment.yml
conda activate hormone-phylo

Exact environment used in the study (Linux x86-64 only):

conda env create --name hormone-phylo --file env/environment.lock.linux-64.yml
conda activate hormone-phylo

Verify key tools:

blastp -version
mafft --version
iqtree3 --version
trimal --version
hmmscan -h

R packages

install.packages(c("tidyverse", "svglite", "here", "ape", "phangorn", "gridExtra"))

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("ggtree", "treeio"))

Option 2 — Docker

docker pull davidalbertoge/hormone-analysis:latest
docker run -v "$(pwd)":/home/ -it davidalbertoge/hormone-analysis:latest

The Docker image covers all command-line tools. R and R packages require separate installation or the Conda environment above.

Conda path configuration

scripts/11-phylo_pipeline.sh initializes Conda from the path set in the CONDA_BASE variable at the top of the script. The default is ~/miniconda3. If your installation is elsewhere (e.g., ~/anaconda3 or /opt/conda), edit that variable before running:

# At the top of phylo_pipeline.sh — USER CONFIGURATION section
readonly CONDA_BASE="$HOME/miniconda3"   # adjust as needed

Pipeline

Phase 1 — Custom database construction

1.1 Download proteomes

Edit data/organisms.tsv to define the target organisms (one scientific name per line), then run:

bash scripts/01-download_proteomes.sh

Downloads one proteome per organism using a hierarchical fallback strategy (UniProt reference → reviewed → complete → NCBI RefSeq). Outputs are written to proteomes/ and download manifests to manifests/.
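
The fallback order described above can be pictured as a loop over sources that stops at the first success. The sketch below is illustrative only and does not reproduce the logic of 01-download_proteomes.sh: `fetch_tier`, `download_with_fallback`, the tier labels, and `AVAILABLE_TIER` are hypothetical names, and the stub performs no network access.

```shell
set -euo pipefail

# Hypothetical stand-in for a real download: it succeeds only for the tier
# named in AVAILABLE_TIER, so the fallback order can be demonstrated offline.
fetch_tier() {
    local tier="$1" organism="$2"
    [ "$tier" = "${AVAILABLE_TIER:-}" ] && printf '%s\t%s\n' "$organism" "$tier"
}

# Try each source in priority order and stop at the first one that succeeds.
download_with_fallback() {
    local organism="$1" tier
    for tier in uniprot_reference uniprot_reviewed uniprot_complete ncbi_refseq; do
        if fetch_tier "$tier" "$organism"; then
            return 0
        fi
    done
    # No tier produced a proteome: report it and signal failure.
    printf '%s\tNOT_FOUND\n' "$organism"
    return 1
}

AVAILABLE_TIER=uniprot_complete
result="$(download_with_fallback "Candida albicans")"
echo "$result"
```

Recording the tier that succeeded for each organism, as the demo does, is one simple way to build a download manifest.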

Important: The T. atroviride v3 proteome must be added manually to proteomes/ before the next step, named as <TaxID>.fasta following the same convention.

1.2 Build the BLAST database

bash scripts/02-prepare_blastdb.sh

# Then index for BLASTp:
makeblastdb -in data/hormoneDB.fasta -dbtype prot -parse_seqids \
            -out data/blastDB

Standardizes all FASTA headers to >TaxID|original_header format and merges all proteomes into hormoneDB.fasta.
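
As a rough illustration of the >TaxID|original_header convention, the sketch below prefixes every header of a proteome named <TaxID>.fasta with its TaxID. It is not the code of 02-prepare_blastdb.sh; the function name `prefix_headers` and the two demo records are made up for the example.

```shell
set -euo pipefail

# Prefix every FASTA header in a file with "<TaxID>|".
# Assumes the file is named <TaxID>.fasta, as in proteomes/.
prefix_headers() {
    local fasta="$1"
    local taxid
    taxid="$(basename "$fasta" .fasta)"
    # Header lines start with ">"; sequence lines pass through unchanged.
    awk -v id="$taxid" '/^>/ { print ">" id "|" substr($0, 2); next } { print }' "$fasta"
}

# Demo on a throwaway two-record proteome (hypothetical content).
workdir="$(mktemp -d)"
printf '>sp|P12345|TEST_CANAL example protein\nMKT\n>tr|Q99999|Q99999_CANAL\nMSA\n' > "$workdir/5476.fasta"
prefix_headers "$workdir/5476.fasta" > "$workdir/hormoneDB.fasta"
cat "$workdir/hormoneDB.fasta"
```

Keeping the TaxID as the first pipe-delimited field means the source organism of any hit can later be recovered from the subject ID alone.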

Phase 2A — Seed expansion (cross-kingdom search)

This phase expands and validates the initial plant seed sequences using two independent methods before the main database search. Candidate selection was performed manually by the authors.

2A.0 Make HMMER database

# Create the target directory, download Pfam-A into it, then decompress and index
mkdir -p data/hmmerDB
wget -P data/hmmerDB https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip data/hmmerDB/Pfam-A.hmm.gz
hmmpress data/hmmerDB/Pfam-A.hmm

2A.1 Discover domain architecture of seed proteins

bash scripts/03-hmmscan_domains.sh data/hmmerDB/Pfam-A.hmm \
      input/hmmer/seed_proteins.fasta results/hmmer/hmmscan_domains

2A.2 Fetch Pfam HMM profiles

bash scripts/04-extract_hmm_ids.sh data/hmmerDB/Pfam-A.hmm \
      input/hmmer/pfam_ids.txt results/hmmer/hmm_profiles/

2A.3 Generate consensus sequences (optional)

bash scripts/05-generate_hmm_consensus.sh \
      results/hmmer/hmm_profiles/ results/hmmer/hmm_consensus/

2A.4 Search HMMs against T. atroviride v3

bash scripts/06-search_hmm_ids.sh results/hmmer/hmm_profiles/ \
      data/proteomes/<TaxID_Tatroviride>.fasta results/hmmer/hmm_search/

2A.5 Validate complete domain architecture

bash scripts/07-domain_search_script.sh <PFAM_ID> input/hmmer/candidate_sequences.fasta \
      results/hmmer/hmm_domains/<PFAM_ID>

Candidates passing both BLASTp similarity criteria and complete domain validation were used as expanded seeds in Phase 2B.
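
The "all domains required" rule can be expressed as a set-containment check. The sketch below assumes a simplified two-column table of domain hits (sequence ID, Pfam accession) rather than real HMMER output, and it is not the code of 07-domain_search_script.sh; `complete_architecture`, the file names, and the accessions PF00001/PF00002 are arbitrary placeholders unrelated to the gene families in the study.

```shell
set -euo pipefail

# Print IDs of sequences whose domain hits include every required Pfam accession.
# $1: file with one required accession per line
# $2: two-column TSV of hits (sequence ID, Pfam accession)
complete_architecture() {
    awk -F'\t' '
        NR == FNR { required[$1] = 1; n_required++; next }   # first file: required set
        ($2 in required) && !seen[$1, $2]++ { found[$1]++ }  # count distinct required domains
        END { for (id in found) if (found[id] == n_required) print id }
    ' "$1" "$2" | sort
}

workdir="$(mktemp -d)"
printf 'PF00001\nPF00002\n' > "$workdir/required.txt"
printf 'seqA\tPF00001\nseqA\tPF00002\nseqB\tPF00001\nseqC\tPF00002\nseqC\tPF00002\n' > "$workdir/hits.tsv"
accepted="$(complete_architecture "$workdir/required.txt" "$workdir/hits.tsv")"
echo "$accepted"
```

Only seqA carries both required domains here; seqC is rejected even though it has two hits, because both are to the same domain.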

Phase 2B — BLASTp against the custom database

2B.1 Run BLASTp in batch

Place expanded seed FASTA files in input/hormone/ (one file per gene family), then:

bash scripts/08-blastp_batch.sh

Parameters: BLOSUM45 matrix, e-value ≤ 1×10⁻⁵, word size 3, SEG filter enabled, post-search filters ≥30% identity and ≥50% query coverage.
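
To make the post-search thresholds concrete, the sketch below applies them to tabular BLAST output with awk. It assumes the hits were written with the column order qseqid, sseqid, pident, qcovs, evalue, bitscore; the actual column layout used by 08-blastp_batch.sh may differ, and `filter_hits` and the three demo rows are hypothetical.

```shell
set -euo pipefail

# Keep hits with >=30% identity and >=50% query coverage.
# Assumed column order: qseqid sseqid pident qcovs evalue bitscore
filter_hits() {
    awk -F'\t' -v min_id=30 -v min_cov=50 '$3 >= min_id && $4 >= min_cov' "$1"
}

workdir="$(mktemp -d)"
printf 'q1\t5476|A\t45.2\t80\t1e-50\t200\n'  > "$workdir/hits.tsv"   # passes both filters
printf 'q1\t5476|B\t28.0\t90\t1e-20\t90\n'  >> "$workdir/hits.tsv"   # identity too low
printf 'q1\t9606|C\t35.0\t40\t1e-10\t60\n'  >> "$workdir/hits.tsv"   # coverage too low
kept="$(filter_hits "$workdir/hits.tsv")"
echo "$kept"
```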

2B.2 Curate best hits per organism

Open and knit scripts/09-blastp-analysis.Rmd in RStudio, or render it from the command line, for example with Rscript -e 'rmarkdown::render("scripts/09-blastp-analysis.Rmd")'.

Selects one best hit per Query–Organism pair (priority: bit score → query coverage → percent identity → e-value). Outputs a consolidated TSV and individual TSVs per gene family.
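
The priority order above amounts to a multi-key sort followed by keeping the first row of each Query–Organism group. The analysis itself is done in R; the shell sketch below only illustrates the ranking logic, assuming the column order qseqid, sseqid, pident, qcovs, evalue, bitscore and taking the organism from the TaxID prefix of the subject ID. `best_hits` and the demo rows are hypothetical.

```shell
set -euo pipefail

# Rank hits by bit score (desc), query coverage (desc), identity (desc),
# e-value (asc), then keep the first row per query/organism pair.
# Assumed columns: qseqid sseqid pident qcovs evalue bitscore,
# with the organism taken as the TaxID prefix of sseqid ("TaxID|...").
best_hits() {
    LC_ALL=C sort -t "$(printf '\t')" -k6,6nr -k4,4nr -k3,3nr -k5,5g "$1" |
        awk -F'\t' '{ split($2, s, "|"); key = $1 FS s[1] } !(key in seen) { seen[key] = 1; print }'
}

workdir="$(mktemp -d)"
{
    printf 'q1\t5476|A\t45.2\t80\t1e-50\t200\n'
    printf 'q1\t5476|B\t60.0\t95\t1e-60\t200\n'   # ties on bit score, wins on coverage
    printf 'q1\t9606|C\t35.0\t60\t1e-10\t60\n'
    printf 'q2\t5476|A\t50.0\t70\t1e-30\t150\n'
} > "$workdir/hits.tsv"
best="$(best_hits "$workdir/hits.tsv")"
echo "$best"
```

The `g` sort modifier is used for the e-value column because plain numeric sorting does not interpret scientific notation such as 1e-50.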

2B.3 Extract sequences

bash scripts/10-extract_sequences_batch.sh

Retrieves FASTA sequences from hormoneDB.fasta using the curated subject IDs.
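
For readers who want to see what ID-based extraction involves, the following awk sketch pulls records out of a FASTA file given a list of IDs. It is a stand-in written for illustration, not the repository's GetSeqsFromFasta.py; `extract_by_id`, the file names, and the demo records are assumptions, and the ID is taken to be the header text up to the first space.

```shell
set -euo pipefail

# Print FASTA records whose ID (header text up to the first space, without ">")
# appears in the ID list.
# $1: file with one ID per line    $2: FASTA file to search
extract_by_id() {
    awk '
        NR == FNR { wanted[$1] = 1; next }                 # first file: one ID per line
        /^>/ { id = substr($1, 2); keep = (id in wanted) } # new record: decide once
        keep                                               # print header and sequence lines
    ' "$1" "$2"
}

workdir="$(mktemp -d)"
printf '5476|sp|P1|A_CANAL\n' > "$workdir/ids.txt"
printf '>5476|sp|P1|A_CANAL first protein\nMKTA\nGGLV\n>9606|sp|P2|B_HUMAN second protein\nMSSS\n' > "$workdir/db.fasta"
extract_by_id "$workdir/ids.txt" "$workdir/db.fasta" > "$workdir/subset.fasta"
cat "$workdir/subset.fasta"
```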

Phase 2C — Phylogenetic inference

Place the sequences from Phase 2B into 01_sequences/ (one FASTA per gene family), then:

bash scripts/11-phylo_pipeline.sh

The pipeline runs the following steps with checkpoints (completed steps are skipped on re-run):

| Step | Tool | Purpose |
| --- | --- | --- |
| Header rename | `awk` | Standardize to `Genus_Species_TaxID_UniprotID_GeneName` |
| Deduplication | `seqkit rmdup` | Remove exact sequence duplicates |
| Clustering | `cd-hit -c 0.99` | Remove sequences ≥99% identical |
| Alignment | `mafft --auto --reorder` | Multiple sequence alignment |
| Guide tree | `FastTree -lg -gamma` | Approximate ML tree for PRANK |
| Alignment refinement | `prank -protein -iterate=3` | Phylogeny-aware alignment |
| Trimming | `trimal -automated1` | Remove poorly aligned columns |
| Phylogenetic inference | `iqtree3 MFP + LG+C60` | ML tree with bootstrap support |

Outputs are written to numbered directories (02_filtering/ through 07_iqtree/).
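
The checkpoint behavior mentioned above (completed steps are skipped on re-run) can be sketched with marker files. This is only one common way to implement it and is an assumption about the general idea, not the mechanism used in 11-phylo_pipeline.sh; `run_step`, `CHECKPOINT_DIR`, and the `.done` suffix are hypothetical names.

```shell
set -euo pipefail

# Run a named step once; a marker file records completion so re-runs skip it.
# Usage: run_step <name> <command> [args...]
run_step() {
    local name="$1"
    shift
    local marker="$CHECKPOINT_DIR/$name.done"
    if [ -e "$marker" ]; then
        echo "skip $name"
        return 0
    fi
    if "$@"; then
        touch "$marker"      # record success only after the command exits 0
        echo "done $name"
    else
        echo "failed $name" >&2
        return 1
    fi
}

CHECKPOINT_DIR="$(mktemp -d)"
first="$(run_step align true)"    # "true" stands in for a real command such as mafft
second="$(run_step align true)"   # second call finds the marker and skips
echo "$first / $second"
```

Because the marker is written only after a successful exit, an interrupted or failed step is retried on the next run instead of being skipped.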

IQ-TREE model note: MFP (ModelFinder Plus) explores standard substitution models. -madd LG+C60,LG+F+C60 additionally tests profile mixture models, which better capture compositional heterogeneity across distantly related taxa spanning multiple kingdoms.

Computational note: The LG+C60 models are substantially slower than standard models. For a dataset of ~150 sequences, IQ-TREE may require 1–12 hours per gene family, depending on the available CPU cores. The -T AUTO option (already set) lets IQ-TREE choose the number of threads automatically.

Phase 2D — Tree visualization

Single tree (interactive):

Open scripts/12-phylogenetic-tree.Rmd in RStudio, set the path to your .treefile and metadata file in the configuration section, then knit.

Batch processing (all gene families):

Rscript scripts/12-phylogenetic-tree_batch.R

Generates cladogram and phylogram figures (SVG + PNG) for all trees in the IQ-TREE output directory. Tip labels are colored by Kingdom and shaped by Phylum using the taxonomy metadata file.

Phase 3 — Structural comparison

Representative sequences from major phylogenetic clades were modeled with AlphaFold v3, and structural comparisons were performed in ChimeraX using RMSD as the similarity metric. This phase was conducted manually and is not automated by a script.

AlphaFold web server: https://alphafold.ebi.ac.uk/
ChimeraX download: https://www.cgl.ucsf.edu/chimerax/

Data requirements

organisms.tsv

Defines the 232 organisms used to build the custom database. Covers:

Fungi: Ascomycota, Basidiomycota, Chytridiomycota, Mucoromycota, Zoopagomycota, Blastocladiomycota, Microsporidia, and early-diverging lineages
Outgroups: Viridiplantae (embryophytes and algae), Metazoa (vertebrates and invertebrates), Bacteria (Proteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Thermotogae), Archaea, and unicellular eukaryotes

The taxonomic breadth was deliberately designed to place fungal genes in a broad evolutionary context and assess phytohormone-associated gene distribution across the tree of life.

taxonomy_metadata.tsv

Required by the visualization scripts. Tab-separated with the following columns:

| Column | Description | Example |
| --- | --- | --- |
| TaxID | NCBI Taxonomy ID | 5476 |
| Organism | Scientific name | Candida albicans |
| Kingdom | Taxonomic kingdom | Fungi |
| Phylum | Taxonomic phylum | Ascomycota |
| EarlyDivergent | Basal lineage flag | TRUE / FALSE |
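
Before running the visualization scripts, it can help to sanity-check the metadata file. The sketch below assumes the columns appear in the order listed in the table above and that the first line is a header with exactly those names. Both points are assumptions, as are the function name `check_metadata` and the demo rows (the EarlyDivergent value shown is illustrative).

```shell
set -euo pipefail

# Check that a metadata TSV has the expected header and five fields per row.
check_metadata() {
    awk -F'\t' '
        NR == 1 {
            if ($0 != "TaxID\tOrganism\tKingdom\tPhylum\tEarlyDivergent") { print "bad header"; bad = 1 }
            next
        }
        # TaxID must be an integer and the flag must be TRUE or FALSE.
        NF != 5 || $1 !~ /^[0-9]+$/ || ($5 != "TRUE" && $5 != "FALSE") {
            print "bad row " NR; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

workdir="$(mktemp -d)"
printf 'TaxID\tOrganism\tKingdom\tPhylum\tEarlyDivergent\n5476\tCandida albicans\tFungi\tAscomycota\tFALSE\n' > "$workdir/good.tsv"
printf 'TaxID\tOrganism\tKingdom\tPhylum\tEarlyDivergent\n5476\tCandida albicans\tFungi\tAscomycota\n' > "$workdir/bad.tsv"
check_metadata "$workdir/good.tsv" && echo "good.tsv ok"
check_metadata "$workdir/bad.tsv" || echo "bad.tsv rejected"
```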

Note on T. atroviride v3

This proteome is currently unpublished and not available in public databases. It was incorporated manually into the custom database. Researchers wishing to reproduce the exact analysis may request it from the corresponding author.

Software requirements

Command-line tools

| Tool | Version tested | Purpose |
| --- | --- | --- |
| BLAST+ | 2.17.0 | Homology searches |
| MAFFT | 7.526 | Multiple sequence alignment |
| FastTree | 2.2.0 | Guide tree generation |
| PRANK | 170427 | Phylogeny-aware alignment refinement |
| trimAl | 1.5.0 | Alignment trimming |
| IQ-TREE | 3.0.1 | Phylogenetic inference |
| CD-HIT | 4.8.1 | Sequence clustering |
| SeqKit | 2.10.1 | Sequence deduplication |
| HMMER | 3.4 | Profile HMM searches |
| NCBI E-Direct | 24.0 | Proteome download from NCBI |
| Python | 3.12 | Sequence extraction helper |
| BioPython | 1.86 | FASTA parsing |

All versions are pinned in env/environment.lock.linux-64.yml and available via Conda.

R packages

| Package | Source | Purpose |
| --- | --- | --- |
| tidyverse | CRAN | Data wrangling and plotting |
| ggplot2 | CRAN | Graphics |
| gridExtra | CRAN | Multi-panel figures |
| svglite | CRAN | SVG export |
| here | CRAN | Project-relative paths |
| ape | CRAN | Phylogenetic data structures |
| phangorn | CRAN | Phylogenetic analysis |
| ggtree | Bioconductor | Phylogenetic tree visualization |
| treeio | Bioconductor | Tree I/O (IQ-TREE format support) |
| rmarkdown | CRAN | Reproducible report generation |

Reproducibility

Locked environment: env/environment.lock.linux-64.yml pins the exact build string of every dependency used in the published analysis (Linux x86-64).
Docker image: davidalbertoge/hormone-analysis:latest provides a pre-configured container with all command-line tools.
Checkpointed pipeline: phylo_pipeline.sh skips completed steps on re-run, allowing safe interruption and resumption.
Zenodo archive: A snapshot of this repository including all analysis outputs (alignments, tree files, figures) is permanently archived at https://doi.org/10.5281/zenodo.20115767.
Database versions: Proteomes were downloaded from UniProt and NCBI RefSeq in October 2025. The exact download manifest is available in manifests/proteomes.tsv in the Zenodo archive.

Citation

If you use this pipeline, please cite both the article and the software using the following formats (APA 7th Edition):

Article

García-Estrada, D. A., Ruvalcaba-Villagrán, M. L., Sánchez-Fonseca, A. G., Duran-Palmerin, J., Ornelas-Paz, J., Vargas-Gasca, F., Olmedo-Monfil, V., & Herrera-Estrella, A. (2026). Phytohormones in Fungi: Inter-Kingdom Modulators or Fungal Self-Controlling Elements? ASM Microbiology Society. (Manuscript submitted).

Software

García-Estrada, D. A. (2026). Phytohormones-Fungi: a reproducible phylogenomic pipeline for cross-kingdom homolog detection and analysis (Version 1.1.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.20115767

BibTeX:

@software{garcia2026pipeline,
  author    = {García-Estrada, David Alberto},
  title     = {Phytohormones-Fungi: A reproducible phylogenomic pipeline for cross-kingdom homolog detection and analysis},
  year      = {2026},
  version   = {1.1.0},
  doi       = {10.5281/zenodo.20115767},
  url       = {https://doi.org/10.5281/zenodo.20115767},
  publisher = {Zenodo},
  keywords  = {phylogenomics, phytohormones, fungi, pipeline}
}

@article{garcia2026fungi,
  author    = {García-Estrada, David Alberto and Ruvalcaba-Villagrán, Melanie L. and Sánchez-Fonseca, Axel G. and Duran-Palmerin, Jonathan and Ornelas-Paz, Juan and Vargas-Gasca, Francisco and Olmedo-Monfil, Vianey and Herrera-Estrella, Alfredo},
  title     = {Phytohormones in Fungi: Inter-Kingdom Modulators or Fungal Self-Controlling Elements?},
  journal   = {ASM Microbiology Society},
  year      = {2026},
  note      = {Manuscript submitted for publication}
}

License

This project is licensed under the MIT License. See LICENSE for details.

Contact

David Alberto García Estrada, Researcher, Center for Research in Advanced Materials (CIMAV), Chihuahua, México

For technical questions, please open an issue rather than emailing directly — this keeps solutions visible to other users.
