This repository contains the pipeline, datasets and scripts used in the following paper:
- Y. Tabatabaee, S. Claramunt, S. Mirarab (2026). Coalescent-based branch length estimation improves dating of species trees. Systematic Biology, syag038. https://doi.org/10.1093/sysbio/syag038
For experiments in this study, we generated three sets of simulated datasets with gene tree discordance due to incomplete lineage sorting (ILS) and analyzed two avian biological datasets from Harvey et. al. (2020) and Stiller et. al. (2024). The simulated datasets have model species trees with substitution-unit, generation-unit, and time-unit branch lengths. All datasets can be accessed from Dryad https://doi.org/10.5061/dryad.hmgqnk9xv.
We provide an end-to-end pipeline for coalescent-aware divergence time estimation from gene trees. The pipeline has the following steps:
- Infers a species tree topology using ASTRAL/ASTER.
- Estimates coalescent-aware branch lengths using CASTLES-Pro.
- Converts substitution-unit branch lengths into time units using an ML-based dating method.
To run the dating pipeline, follow these commands.
git clone https://github.com/ytabatabaee/coalescent-based-dating.git
cd coalescent-based-dating
conda create -n cbdating python=3.10
conda activate cbdating
conda install -c bioconda astral-tree newick_utilsIn addition, you need to install the ML-based dating methods you plan to use. The pipeline currently supports TreePL, MD-Cat, wLogDate, and LSD2. To suggest additional dating methods, please submit an issue.
The pipeline can be used to infer the species tree and date it, or use a user-provided species tree. To run either pipeline, use the following command:
python run_pipeline.py \
--gene-trees <gene-tree-file> \
[--species-tree <species-tree-file>] \
--calibrations <calibration-file> \
--method <dating-method> \
[--outgroup <outgroup-name>] \
[--CI "<num-samples> <lower-quantile> <upper-quantile>"] \
--output <output-directory>--gene-treesgene trees in Newick format--species-treeoptional user-provided species tree topology in Newick format--calibrationscalibration file--methoddating method (currently supportstreepl,mdcat,wlogdate,lsd2)--outgroupoptional outgroup taxon name for rooting the species tree before dating--outputoutput directory--CIMD-Cat confidence intervals (num_samples lower_quantile upper_quantile)--seq-lengthoptional sequence length passed to MD-Cat--mdcat-pMD-Cat-pvalue (default:10)
The three --CI values are user-controlled. For example, --CI "1000 0.025 0.975" computes 95% confidence intervals from 1000 posterior samples, while --CI "100 0.05 0.95" uses 100 samples and the 5th/95th percentiles (90% confidence intervals). Confidence intervals are currently supported only for --method mdcat.
Different ML-based dating methods use different calibration formats for pre-specified times. TreePL can use minimum/maximum bounds, such as
mrca = calib1 taxonA taxonB
min = calib1 66
max = calib1 72
mrca = calib2 taxonC taxonD
min = calib2 120
max = calib2 135
MD-Cat, wLogDate, LSD2, and TreePL also accept fixed node ages using either internal node labels or the MRCA of two terminal nodes:
i1 8.055
i7 2.812
i16 3.393
mrca(24,3) 7.534
mrca(taxonA,taxonB) 66.0
For mrca(taxonA,taxonB) entries, the pipeline labels the corresponding
internal node before running dating methods that require node labels. For
TreePL, the pipeline converts fixed-age entries to equivalent minimum and
maximum bounds internally.
The output directory includes the following files
results/
├── species_tree_su.tre
├── dated_tree.tre
├── logs/
└── intermediate/
where
species_tree_su.treis the species tree with CASTLES-Pro substitution-unit branch lengths.dated_tree.treis the time-calibrated species tree dated using the selected method.
When --CI is used with MD-Cat, dated_tree.tre includes confidence interval annotations. logs and intermediate directories include config and intermediate files generated by the dating methods.
The example/ directory includes one sample 30-taxon dataset.
python run_pipeline.py \
--gene-trees example/gene_trees.tre \
--species-tree example/species_tree.tre \
--calibrations example/calibrations.txt \
--method mdcat \
--CI "100 0.05 0.95" \
--output results/example/This dataset has six model conditions with varying deviation from the molecular clock and inclusion of an outgroup, each with 100 replicates. The model conditions are specified as outgroup.[has-OG].species.[DEV].genes.[DEV] where [has-OG] is 1 when the dataset has an outgroup and 0 otherwise, and [DEV] shows the level of deviation from the clock (parameter α of the gamma distribution) that is set to 5 (low), 1.5 (medium), or 0.15 (high). Original dataset is from Mai at al. (2017) and available at https://uym2.github.io/MinVar-Rooting/. Below is a description of files in each directory.
truegenetrees: true gene treesestimatedgenetre.gtr: gene trees estimated under GTR evolution models_tree.trees: true species tree in substitution unitss_tree_gu.trees: true species tree in generation unitss_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 yearss_tree_unit_ultrametric.trees: unit ultrametric true species trees_tree_tu_5_calibrations_n[num-calib]_[root_unfixed].txt: calibration information (given in the format of (node, time) pairs)s_tree_tu_5_calib_n[num-calib]_[root_unfixed]_mcmctree.ctl: control file for MCMCTrees_tree_tu_5_calib_n[num-calib]_[root_unfixed]_mcmctree.date.nwk[.normalized]: tree dated with MCMCTree with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree.RAxML_result.concat_align_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengthscastlespro_estimatedgenetre.gtr_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengths[dating-method]_n[num-calib]_[root_unfixed]_RAxML_result.concat_align_s_tree.trees.rooted.labeled.[normalized]: RAxML SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a.date.nwkextension.[dating-method]_n[num-calib]_[root_unfixed]_castlespro_estimatedgenetre.gtr_s_tree.trees.rooted.labeled.[normalized]: CASTLES-Pro SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a.date.nwkextension.[dating-method]_CI_n[num-calib]_[root_unfixed]_castlespro_estimatedgenetre.gtr_s_tree.trees.rooted.labeled.[normalized]: same as above but with each branch annotated with confidence intervals for branch lengths and node ages.ad.txt: average RF distance between the model species tree and true gene treesgtee_gtr.txt: average RF distance between true and estimated gene trees
This dataset has four model conditions with varying sequence lengths (1600bp, 800bp, 400bp, 200bp) corresponding to different levels of gene tree estimation error (23%, 31%, 42%, and 55%). The original dataset is from Zhang et. al. (2018) and available at https://gitlab.com/esayyari/ASTRALIII/. Below is a description of files in each directory.
truegenetrees: true gene treesfasttree_genetrees_[seq-len]_non.[num-genes]: [num-genes] gene trees estimated from sequence alignments with length [seq-len]bps_tree.trees: true species tree in substitution unitss_tree_gu.trees: true species tree in generation unitss_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 yearss_tree_unit_ultrametric.trees: unit ultrametric true species trees_tree_tu_5_calibrations_n[num-calib]_[root_unfixed].txt: calibration information (given in the format of (node, time) pairs)RAxML_result.concat_for_fasttree_[seq-len].[num-genes]_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengths given a concatenation of [seq-len]bp sequences for the first [num-genes] genescastlespro_fasttree_genetrees_[seq-len]_non.[num-genes]_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengths given [num-genes] FastTree gene trees estimated from [seq-len]bp sequencesad.txt: average RF distance between the model species tree and true gene trees[dating-method]_n[num-calib]_[root_unfixed]_castlespro_fasttree_genetrees_[seq-len]_non.[num-genes]_s_tree.trees.rooted.labeled.[normalized]: CASTLES-Pro SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points for the model condition corresponding to [num-genes] genes and [seq-len]bp sequences. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a.date.nwkextension. Files with additional extensions include config files and log files for each method.[dating-method]_n[num-calib]_[root_unfixed]_RAxML_result.concat_for_fasttree_[seq-len].[num-genes]_s_tree.trees.rooted.labeled.[normalized]: RAxML SU tree dated with [dating-method] (can be treepl, wlogdate, mdcat, and lsd2) with [num-calib] calibration points for the model condition corresponding to [num-genes] genes and [seq-len]bp sequences. The .[normalized] flag specifies the unit-ultrametric version of the dated tree. Trees dated with lsd2 have a.date.nwkextension. Files with additional extensions include config files and log files for each method.
This dataset has 8 model conditions with 50, 100, 200, 500, 1K, 2K, 5K, and 10K-taxon trees with 20 replicates in each condition. Below is a description of files in each directory large/[num-taxa]/[rep-num]/ where [num-taxa] is the model condition (number of taxa) and [rep-num] is the replicate index.
truegenetrees: true gene treesestimatedgenetre: estimated gene treess_tree.trees: true species tree in substitution unitss_tree_gu.trees: true species tree in generation unitss_tree_tu_5.trees: true species tree in time units (million years), assuming an average generation time of 5 yearss_tree_unit_ultrametric.trees: unit ultrametric true species trees_tree_tu_5_calibrations_n[num-calib].txt: calibration informationRAxML_result.concat_s_tree.trees.rooted.labeled: true species tree furnished with RAxML SU branch lengthscastlespro_estimatedgenetre_s_tree.trees.rooted.labeled: true species tree furnished with CASTLES-Pro SU branch lengthstreepl_n[num-calib]_RAxML_result.concat_s_tree.trees.rooted.labeled: RAxML SU tree dated with TreePL with [num-calib] calibration pointstreepl_n[num-calib]_RAxML_result.concat_s_tree.trees.rooted.labeled.config: config file for TreePL for the RAxML dated treetreepl_n[num-calib]_castlespro_estimatedgenetre_s_tree.trees.rooted.labeled: CASTLES-Pro SU tree dated with TreePL with [num-calib] calibration pointstreepl_n[num-calib]_castlespro_estimatedgenetre_s_tree.trees.rooted.labeled.config: config file for TreePL for the CASTLES-Pro dated treead.txt: average RF distance between the model species tree and true gene treesgtee.txt: average RF distance between true and estimated gene trees
-
Neoavian: 363-taxon neoavian dataset from Stiller et al. (2024) with 63,430 single-copy genes. The original data is available here. Results from the analysis in this study is available at /biological/avian-stiller. Below is a description of files in this directory.
-
mdcat_[CI]_median_40Kl_caml_stiller.rooted.tre: ASTRAL tree furnished with ConBL branch lengths dated with MD-Cat. TheCIflag denotes confidence intervals. -
mdcat_[CI]_median_40Kl_astral4_stiller_nolabel.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with MD-Cat. TheCIflag denotes confidence intervals. -
wlogdate_median_astral4_stiller.rooted.no_label.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with wLogDate -
wlogdate_median_astral_63K_concat_bl.rooted.no_label.tre: ASTRAL tree furnished with ConBL branch lengths dated with wLogDate -
treepl_median_astral_63K_concat_bl.rooted.tre: ASTRAL tree furnished with ConBL branch lengths dated with TreePL -
treepl_median_castlespro_stiller.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with TreePL -
genera_castlespro_concat.csv: Age of genera estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology -
families_castlespro_concat.csv: Age of families estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology -
orders_castlespro_concat.csv: Age of orders estimated using different dating methods with CASTLES-Pro or ConBL branch lengths on concatenation or ASTRAL topology -
ltt_stiller.csv: Lineage-through-time information for different dated trees -
fossil_list_median.txt: Fossil calibration information for median (50% quantiles) -
Suboscines: 1683-taxon suboscines dataset from Harvey et. al. (2020) with 2,389 single-copy genes. The original data is available at https://github.com/mgharvey/tyranni. Results from the analysis in this study is available at /biological/suboscines-harvey. Below is a description of files in this directory.
-
concat_T400F.examl.rooted.tre: Concatenation tree furnished with CAML branch lengths -
castlespro_T400F.examl.rooted.tre: Concatenation tree furnished with CASTLES-Pro branch lengths -
treepl_castlespro_T400F.astral.rooted.tre: ASTRAL tree furnished with CASTLES-Pro branch lengths dated with TreePL -
treepl_examl_T400F.astral.rooted.tre: ASTRAL tree furnished with CAML branch lengths dated with TreePL -
treepl_castlespro_T400F.examl.rooted.tre: Concatenation tree furnished with CASTLES-Pro branch lengths dated with TreePL -
treepl_concat_T400F.examl.rooted.tre: Concatenation tree furnished with CAML branch lengths dated with TreePL -
genera_treepl_castlespro_concat.csv: Age of different genera estimated using TreePL+CASLTES-Pro andTreePL+ConBL on the concatenation topology -
families_treepl_castlespro_concat.csv: Age of different families estimated using TreePL+CASLTES-Pro andTreePL+ConBL on the concatenation topology -
suboscines_ltt.csv: Lineage-through-time information for the four different dated trees (ASTRAL or CAML furnished with CASTLES-Pro or ConBL)
