Multi-task classification of ancient DNA samples using unitig abundances as features. A unitig is a maximal non-branching path in a de Bruijn graph: it compacts overlapping k-mers into a single sequence, reducing redundancy while preserving genomic diversity.
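
To make the unitig definition concrete, here is a minimal sketch that compacts a set of k-mers into unitigs. It is an illustration only, not DIANA's implementation: it works on the forward strand (real tools also handle reverse complements), and the function names `kmers_of` and `compact_unitigs` are made up for this example.

```python
def kmers_of(seq, k):
    """Return the set of all k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer, kmers):
    """k-mers that overlap kmer by k-1 bases on the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """k-mers that overlap kmer by k-1 bases on the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def compact_unitigs(kmers):
    """Merge k-mers along non-branching paths of the de Bruijn graph."""
    kmers = set(kmers)

    def right_merge(u):
        # u merges with v only if u->v is the sole edge out of u and into v.
        succ = successors(u, kmers)
        if len(succ) == 1 and succ[0] != u and len(predecessors(succ[0], kmers)) == 1:
            return succ[0]
        return None

    def left_merge(v):
        pred = predecessors(v, kmers)
        if len(pred) == 1 and pred[0] != v and len(successors(pred[0], kmers)) == 1:
            return pred[0]
        return None

    seen, unitigs = set(), []
    for kmer in sorted(kmers):
        if kmer in seen:
            continue
        # Walk left to the first k-mer of the non-branching path.
        start, visited = kmer, {kmer}
        while (prev := left_merge(start)) is not None and prev not in visited:
            start = prev
            visited.add(prev)
        # Walk right from the start, collecting the whole path.
        path, visited = [start], {start}
        while (nxt := right_merge(path[-1])) is not None and nxt not in visited:
            path.append(nxt)
            visited.add(nxt)
        seen.update(path)
        # Each further k-mer contributes only its last base.
        unitigs.append(path[0] + "".join(k[-1] for k in path[1:]))
    return sorted(unitigs)


# Two sequences that share TGGC and then diverge: the branch at GGC splits the path.
example = kmers_of("ATGGCGT", 3) | kmers_of("TGGCA", 3)
print(compact_unitigs(example))  # ['ATGGC', 'GCA', 'GCGT']
```

Seven distinct 3-mers collapse into three unitigs, and the split occurs exactly where the graph branches.
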
Given raw sequencing reads, DIANA quantifies the abundance of each reference unitig in the sample and feeds the resulting abundance vector into a multi-task neural network that simultaneously predicts:
| Task | Labels |
|---|---|
| Sample type | ancient metagenome, modern metagenome |
| Community type | gut, oral, plant tissue, skeletal tissue, soft tissue, not applicable (env sample) |
| Sample host | Homo sapiens, Homo sapiens neanderthalensis, Pan troglodytes, Gorilla sp., Ursus arctos, Canis lupus, Mammuthus primigenius, Rangifer tarandus, Ambrosia artemisiifolia, Arabidopsis thaliana, other mammal, not applicable (env sample) |
| Material | bone, dental calculus, digestive tract contents, leaf, midden, permafrost, plaque, saliva, sediment, shell, skin, soil, tissue, tooth |
Trained on 2,597 samples from the AncientMetagenomeDir database.
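
As a rough illustration of what "multi-task" means here, the sketch below turns one set of raw scores (logits) per task into a predicted label and probability. The architecture of DIANA's network is not described in this README, so everything in the snippet is an assumption: the task keys, the use of a softmax per task, and the example logits are hypothetical. Only the label names come from the table above.

```python
import math

# Label names copied from the task table; the dictionary keys are hypothetical.
TASK_LABELS = {
    "sample_type": ["ancient metagenome", "modern metagenome"],
    "community_type": [
        "gut", "oral", "plant tissue", "skeletal tissue",
        "soft tissue", "not applicable (env sample)",
    ],
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_all_tasks(task_logits):
    """Pick the most probable label independently for every task."""
    predictions = {}
    for task, logits in task_logits.items():
        labels = TASK_LABELS[task]
        if len(logits) != len(labels):
            raise ValueError(f"{task}: expected {len(labels)} logits, got {len(logits)}")
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions[task] = (labels[best], probs[best])
    return predictions


# Made-up logits, as if produced by one forward pass over an abundance vector.
result = predict_all_tasks({
    "sample_type": [2.0, 0.0],
    "community_type": [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
})
for task, (label, prob) in result.items():
    print(f"{task}: {label} ({prob:.2f})")
```

The point of the sketch is that each task has its own label set and its own probability distribution, all computed from the same input sample.
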
- OS: Linux (tested on Red Hat Enterprise Linux 8.10; expected to work on any modern Linux distribution with Conda/Mamba)
- DIANA version: v0.1.0
- Mamba (recommended) or Conda installed and initialised
- At least 10 GB free disk space
- Internet connection for downloading models
- All software dependencies and version numbers are listed in `environment.yml`
| Step | Typical time |
|---|---|
| `git clone --recurse-submodules` | ~30 seconds |
| `mamba env create -f environment.yml -p ./env` | ~10 minutes |
| `bash install.sh` (builds Rust binaries + downloads ~560 MB) | ~1 minute |
git clone --recurse-submodules https://github.com/CamilaDuitama/DIANA.git
cd DIANA
mamba env create -f environment.yml -p ./env
mamba activate ./env
bash install.sh

A small bundled test sample is included in test_data/: a 1% random subsample (seed 42, ~182 k read pairs, 9 MB each) of ERR3609654, an ancient oral metagenome. Use it to verify the installation without downloading the full 1.6 GB dataset. Expected run time: ~20 seconds on a standard desktop/laptop.
diana-predict \
--sample test_data/ERR3609654_1_small.fastq.gz test_data/ERR3609654_2_small.fastq.gz \
--model results/training/best_model.pth \
--training-matrix training_matrix \
--output test_results

View the predictions:
cat test_results/ERR3609654/ERR3609654_predictions.json

diana-predict writes results to test_results/ERR3609654/:
- `ERR3609654_predictions.json`: predicted class and probability for each task
- `plots/ERR3609654_*_barplot.{html,png}`: one interactive bar chart per task
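
The predictions file can also be checked programmatically. The sketch below flags tasks whose top probability falls under a threshold. The JSON layout is an assumption: this README only says the file holds a predicted class and probability per task, so the keys `predicted_class` and `probability`, the task names, and the example values are all hypothetical. Inspect a real file with `cat` first and adjust the keys.

```python
import json


def low_confidence_tasks(predictions, threshold=0.8):
    """Return {task: (class, probability)} for tasks below the threshold.

    Assumes a hypothetical layout:
    {task: {"predicted_class": str, "probability": float}}.
    """
    flagged = {}
    for task, result in predictions.items():
        if result["probability"] < threshold:
            flagged[task] = (result["predicted_class"], result["probability"])
    return flagged


# Made-up content standing in for a real *_predictions.json file.
example = json.loads("""
{
  "sample_type": {"predicted_class": "ancient metagenome", "probability": 0.97},
  "material": {"predicted_class": "dental calculus", "probability": 0.55}
}
""")
print(low_confidence_tasks(example))
```

For a real run, replace the inline string with `json.load(open(path))` on the file written by diana-predict, once its actual structure has been confirmed.
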
Each bar chart shows every class on the y-axis and its predicted probability on the x-axis; the most probable class is highlighted. The .html version is fully interactive (hover for exact values). Below are the four charts produced for ERR3609654:
- Sample type: Is the sample ancient or modern?
- Community type: What microbial community does the sample come from?
- Sample host: Which host species does the sample originate from?
- Material: What physical material was the sample extracted from?
# Single-end
diana-predict --sample sample.fastq.gz \
--model results/training/best_model.pth \
--training-matrix training_matrix \
--output results/predictions
# Paired-end
diana-predict --sample sample_R1.fastq.gz sample_R2.fastq.gz \
--model results/training/best_model.pth \
--training-matrix training_matrix \
--output results/predictions

| Argument | Description |
|---|---|
| `--sample` | Gzipped FASTQ or FASTA (`*.fastq.gz`, `*.fq.gz`, `*.fasta.gz`, `*.fa.gz`, `*.fna.gz`). Provide two files for paired-end. |
| `--model` | Path to `best_model.pth` |
| `--training-matrix` | Directory containing `unitigs.fa` and `reference_kmers.fasta` |
| `--output` | Output directory |
| `--threads` | Number of threads (default: 10) |
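
When many samples need to be classified, the command lines can be generated from a list of read files. The sketch below uses only the arguments documented in the table above. The helper names and the `_R1`/`_R2` pairing rule are assumptions made for this example; adapt the pattern to your own file naming.

```python
import re
from pathlib import Path

# Assumed naming convention for paired-end mates: <stem>_R1.<ext> / <stem>_R2.<ext>
MATE_PATTERN = re.compile(r"(?P<stem>.+)_R[12]\.")


def group_reads(paths):
    """Group mate files of one sample together; other files stay single-end."""
    groups = {}
    for path in sorted(Path(p) for p in paths):  # sorting puts R1 before R2
        match = MATE_PATTERN.match(path.name)
        key = match.group("stem") if match else path.name
        groups.setdefault(key, []).append(path)
    return list(groups.values())


def build_command(files, model, training_matrix, output, threads=10):
    """Build a diana-predict argument list for one sample."""
    return [
        "diana-predict",
        "--sample", *(str(f) for f in files),
        "--model", str(model),
        "--training-matrix", str(training_matrix),
        "--output", str(output),
        "--threads", str(threads),
    ]


reads = ["reads/b_R2.fastq.gz", "reads/a.fastq.gz", "reads/b_R1.fastq.gz"]
for files in group_reads(reads):
    cmd = build_command(files, "results/training/best_model.pth",
                        "training_matrix", "results/predictions")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it makes it easy to review the commands before launching anything.
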
diana-project is an optional companion tool that projects a sample onto the training PCA space, finds its nearest neighbours among the 2,597 training samples, and saves interactive HTML + PNG scatter plots.
diana-project --sample results/predictions/sample_id/

For each prediction task it produces `pca_projection_<task>.html` and `pca_projection_<task>.png`: training samples are coloured by label, the five nearest neighbours are highlighted in yellow, and the new sample is shown as a red star.
PCA projection (sample type): ERR3609654 (red star) lands among ancient samples. Its five nearest neighbours (yellow diamonds) are all ancient oral metagenomes.

Species abundance: Top microbial species detected in the sample's unitigs, giving a quick taxonomic overview.
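
The projection and nearest-neighbour lookup that diana-project performs can be sketched in a few lines. This is a conceptual illustration under stated assumptions, not the tool's code: it assumes the PCA reference consists of a feature mean and a set of component vectors, and that neighbours are ranked by Euclidean distance in the projected space. All names and numbers are hypothetical.

```python
import math


def project(features, mean, components):
    """Centre a feature vector and project it onto each principal component."""
    centred = [x - m for x, m in zip(features, mean)]
    return [sum(c * v for c, v in zip(component, centred)) for component in components]


def nearest_neighbours(point, training_points, k=5):
    """Return the names of the k training samples closest to point."""
    ranked = sorted(training_points.items(),
                    key=lambda item: math.dist(point, item[1]))
    return [name for name, _ in ranked[:k]]


# Toy PCA reference: 3 features reduced to 2 components.
mean = [1.0, 1.0, 1.0]
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy training samples, already projected.
training = {"a": [1.0, 2.0], "b": [5.0, 5.0], "c": [0.0, 0.0]}

new_point = project([2.0, 3.0, 4.0], mean, components)
print(new_point)                                     # [1.0, 2.0]
print(nearest_neighbours(new_point, training, k=2))  # ['a', 'c']
```

Because the new sample is projected with the training mean and components, it lands in the same coordinate system as the 2,597 training samples, which is what makes the distances comparable.
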
Make sure the environment is activated (mamba activate ./env). The diana-predict and diana-project commands are registered as entry points when the environment is created.
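
A quick way to confirm that both entry points are visible from the current shell is to look them up on `PATH`. The helper name below is made up for this example; `shutil.which` is part of the Python standard library.

```python
import shutil


def missing_commands(names=("diana-predict", "diana-project")):
    """Return the commands that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_commands()
if missing:
    print("Not found on PATH:", ", ".join(missing))
    print("Activate the environment first: mamba activate ./env")
else:
    print("Both DIANA commands are available.")
```
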
Out-of-memory (OOM) errors during k-mer counting are common for high-diversity samples (dental calculus, oral metagenomes). Retry with more RAM (for example, --mem=32G on SLURM). Calculus samples can require >256 GB.
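
On a SLURM cluster, one way to organise retries is to prepare the same job at increasing memory requests and resubmit at the next level only if the previous one ran out of memory. The sketch below only builds the `sbatch` argument lists; it does not submit anything or detect failures. The doubling schedule and the helper names are assumptions for this example, while `--mem`, `--cpus-per-task`, and `--wrap` are standard `sbatch` options.

```python
import shlex


def memory_ladder(start_gb=32, limit_gb=256):
    """Yield memory requests in GB, doubling from start_gb up to limit_gb."""
    mem = start_gb
    while mem <= limit_gb:
        yield mem
        mem *= 2


def sbatch_command(diana_args, mem_gb, threads=10):
    """Wrap a diana-predict call in an sbatch submission with a memory request."""
    wrapped = shlex.join(["diana-predict", *diana_args, "--threads", str(threads)])
    return [
        "sbatch",
        f"--mem={mem_gb}G",
        f"--cpus-per-task={threads}",
        f"--wrap={wrapped}",
    ]


args = ["--sample", "sample.fastq.gz",
        "--model", "results/training/best_model.pth",
        "--training-matrix", "training_matrix",
        "--output", "results/predictions"]
for mem in memory_ladder():
    print(shlex.join(sbatch_command(args, mem)))
```

The environment must be active in the shell that submits the job (or activated inside the wrapped command) so that diana-predict is found on the compute node.
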
If mamba run is unavailable or broken, activate the environment first and call the command directly:
mamba activate ./env # or: source activate ./env
diana-predict ...

Apache 2.0 (see LICENSE).
If you use DIANA in your research, please cite:
@article{diana2026,
title = {{DIANA}: Deep Learning Identification and Assessment of Ancient {DNA}},
author = {Duitama Gonz{\'{a}}lez, Camila and Lopopolo, Maria and Nishimura, Luca
and Faure, Roland and Duchene, Sebastian},
year = {2026},
note = {Correspondence: cduitama@pasteur.fr}
}

The trained model weights and PCA reference are hosted on Hugging Face: cduitamag/DIANA.
Reference k-mers and unitig BLAST annotations are deposited on Zenodo: 10.5281/zenodo.18157419.







