This Snakemake-based pipeline processes sequencing reads from epicPCR experiments to link antimicrobial resistance (AMR) or other target genes with genus-specific 16S rRNA gene sequences from prokaryotic cells. It orchestrates all workflow steps, including database downloads, sequence alignment, read filtering, and the generation of visual reports, ensuring reproducibility and streamlined analysis.
- Auto-downloads and prepares SILVA and CARD databases for use in the analysis
- Auto-downloads user-specified targets (UniRef) and adds them to the CARD database
- Performs Diamond/Usearch sequence alignments against both databases
- Integrates similarity search results to identify linked AMR and microbial markers
- Generates filtered results based on alignment quality and similarity thresholds
- Produces graphical outputs such as genus abundance plots, boxplots, and tables
- Generates an HTML report summarizing the analysis
This pipeline can be deployed with snakedeploy. We recommend this route only for users who already have experience with Snakemake and snakedeploy. Otherwise, we recommend following the full installation guide.
For usage:
- Make sure snakedeploy and snakemake≥v9 are installed
- Deploy the minimal workflow distribution (with the desired destination)
snakedeploy deploy-workflow https://github.com/IKIM-Essen/ERMA --dest . --branch snakedeploy
- Prepare the setup
cp "files-to-analyze" data/fastq
- Start the pipeline (change the config if necessary)
snakemake --cores all --sdm conda

The pipeline requires Snakemake (version ≥ 9.0) with support for conda environments. You can install it using conda/mamba (via Miniconda or Anaconda):
mamba create -n snakemake9 -c conda-forge -c bioconda snakemake=9 --yes
mamba activate snakemake9
Alternatively, follow the official Snakemake installation guide for more documentation and options.
Install Dependencies: This pipeline uses conda environments to manage its dependencies. Snakemake automatically creates and manages these environments when run with the --sdm conda option (the successor of the --use-conda flag; enabled by default in the provided profile).
ERMA relies on two primary reference databases: SILVA for taxonomic assignment and CARD for antimicrobial resistance gene detection. Both databases are handled directly through the workflow configuration and can be fetched automatically or provided locally, depending on user requirements.
By default, ERMA downloads both databases using the URLs defined in the configuration file:
download_silva: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz"
download_card: "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2"
When these entries remain set to URLs, the workflow retrieves and processes the databases automatically during execution. No additional preparation steps are required. Once downloaded, the files are reused in subsequent runs.
If the user prefers handling the downloads manually or already has the required databases available on disk, the workflow can operate entirely from local paths. To do so, replace the URLs in the configuration file with absolute file paths:
download_silva: "/path/to/local/SILVA_138.2.fasta.gz"
download_card: "/path/to/local/broadstreet-v4.0.1.tar.bz2"
Once local paths are set, ERMA skips the remote downloads and uses the supplied files without any change to the pipeline structure.
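The distinction between a URL and a local path can be illustrated with a minimal Python sketch. This is not ERMA's actual implementation; the function names `is_remote_source` and `describe_source` are hypothetical and only show how such a config entry could be classified.

```python
from urllib.parse import urlparse


def is_remote_source(entry: str) -> bool:
    """Return True if a config entry looks like a downloadable URL.

    Hypothetical helper: ERMA's own check may work differently.
    """
    return urlparse(entry).scheme in {"http", "https", "ftp"}


def describe_source(entry: str) -> str:
    """Describe how a database entry would be handled under this assumption."""
    return "download" if is_remote_source(entry) else "use local file"


if __name__ == "__main__":
    for value in (
        "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2",
        "/path/to/local/broadstreet-v4.0.1.tar.bz2",
    ):
        print(f"{value}: {describe_source(value)}")
```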
ERMA provides an optional mechanism to incorporate UniRef clusters representing additional genes or functions relevant to the analysis. This process is fully automated and controlled through the following section of the configuration file:
add_uniref_targets: False
uniprot_cluster: "100"
uniprot_targets: ["int1", "inti1", "class_1_integron"]
max_entry_count: 1000
low_freq_threshold: 0.01
To activate UniRef integration, set:
add_uniref_targets: True
Then adjust the uniprot_targets list to include any desired search terms or gene names. ERMA downloads the corresponding UniRef entries, filters them according to the configured uniprot_cluster level (e.g., UniRef100), and integrates them directly into the CARD-derived reference database used for similarity searches. This option enables users to extend the AMR and functional screening capabilities of ERMA without requiring manual reference curation.
Users who prefer to add custom targets manually can follow this approach:
- Retrieve the protein FASTA sequences for the target gene(s) from UniProt, CARD, or any external source
- Append the new sequences to the CARD protein FASTA file used by ERMA (protein_fasta_protein_homolog_model.fasta)
- Recompress the previously extracted files into the same archive format as before
- Modify the workflow configuration to point to the updated local database files
This approach offers full control over the source, format, and curation of custom targets while keeping the surrounding pipeline unchanged.
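The append step of the manual approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `append_fasta` is hypothetical, it expects both files to be uncompressed plain-text FASTA, and the extraction and recompression of the CARD archive are left to the user.

```python
from pathlib import Path


def append_fasta(custom_fasta: str, card_fasta: str) -> int:
    """Append all records from custom_fasta to card_fasta.

    Returns the number of appended records. Hypothetical helper,
    not part of ERMA.
    """
    custom_text = Path(custom_fasta).read_text()
    # Count FASTA headers to verify that the input contains records.
    n_records = sum(1 for line in custom_text.splitlines() if line.startswith(">"))
    if n_records == 0:
        raise ValueError(f"No FASTA records found in {custom_fasta}")

    card_path = Path(card_fasta)
    existing = card_path.read_text()
    with card_path.open("a") as handle:
        # Make sure the new records start on their own line.
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write(custom_text if custom_text.endswith("\n") else custom_text + "\n")
    return n_records
```

A call could look like `append_fasta("my_targets.fasta", "protein_fasta_protein_homolog_model.fasta")`, where `my_targets.fasta` is a placeholder for the user's own file.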
First, clone the pipeline repository to your local machine:
git clone https://github.com/IKIM-essen/ERMA.git
cd ERMA
Prepare Data Folder: Place your raw sequencing files (fastq.gz format) in the data/fastq/ directory, or change this path to the desired directory.
Modify the Config File: Open the config/config.yaml file and change the base_dir parameter to the base directory where the pipeline is located. The most relevant parameters for the standard user are:
runname: "runname123"
min_similarity: "0.9" # threshold to filter blast hits by percentage identity
num_parts: 1 # number of subfiles the fastqs are split into (prevents crashing with very large input)
max_threads: 16
seq_tech: "Illumina" # Set to "Illumina" or "ONT" according to the technology used (only relevant for rule prepare_fastqs)

Skip this when analyzing single-end or already merged paired-end reads.
Paired-end reads must be merged before starting the pipeline. ERMA provides a module for this. Follow these steps:
- Copy the paired-end fastq.gz files to be merged into data/fastq
- Set the seq_tech parameter to "Illumina" in config/config.yaml (if not already)
- Run from the ERMA root folder:
snakemake prepare_fastqs
This executes a bash script whose essential step is merging the R1 and R2 reads with NGmerge.
Skip this when the input files are already in "one fastq.gz file per sample" format.
When starting with raw ONT output, this routine can be used to demultiplex the samples according to their barcodes. The following must be prepared:
- Enter the desired sample name - barcode combinations in the file "barcode-rename.csv", which can be found in data/ONT
- Copy the fastq_pass folder of the targeted ONT run into data/ONT or change the respective path in the config file
- Define the target fragment lengths parameter in the config file; reads outside the filter interval parameter are removed beforehand
- Set the seq_tech parameter to "ONT" in config/config.yaml (if not already)
- Run from the ERMA root folder:
snakemake prepare_fastqs --cores N
This executes a Python script that filters and demultiplexes the ONT output into one fastq file per sample.
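The length filter can be pictured with the short Python sketch below. It assumes that the filter interval acts as a symmetric tolerance around each target fragment length. This interpretation, the function name `keep_read`, and the parameter names are assumptions made for illustration, not ERMA's actual code or config keys.

```python
def keep_read(read_length: int, target_lengths: list[int], interval: int) -> bool:
    """Keep a read if its length lies within +/- interval of any target length.

    Assumption: the interval is symmetric around each target length.
    """
    return any(abs(read_length - target) <= interval for target in target_lengths)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the behavior.
    targets = [1000, 1500]
    for length in (960, 1010, 1200, 1549):
        status = "kept" if keep_read(length, targets, interval=50) else "removed"
        print(length, status)
```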
To keep the pipeline running robustly, sample names may contain only:
- Uppercase and/or lowercase letters
- Digits
- Underscores
Sample names containing other special characters ("-" "," "." "/" "$" "@" etc.) will result in an error.
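Sample names can be checked against this rule before starting a run. The following Python sketch is a minimal pre-check, not ERMA's own validation; `is_valid_sample_name` and `find_invalid_names` are hypothetical helper names.

```python
import re

# Letters, digits, and underscores only, as required above.
SAMPLE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_sample_name(name: str) -> bool:
    """Return True if the whole name consists of allowed characters."""
    return SAMPLE_NAME_PATTERN.fullmatch(name) is not None


def find_invalid_names(names: list[str]) -> list[str]:
    """Return the names that would violate the naming rule."""
    return [name for name in names if not is_valid_sample_name(name)]


if __name__ == "__main__":
    print(find_invalid_names(["Sample_01", "sample-02", "run.3", "ABC123"]))
```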
To run the pipeline in the default mode, a profile with Snakemake-specific parameters is provided. The only necessary command to start is therefore:
snakemake --profile profile
For testing the workflow, the user can copy the provided dummy data:
cp .github/data/fastq/test_epic_data.fastq.gz data/fastq/
In this case, the similarity search mode in the config file can be changed to "test", searching only one of the two strands.
The pipeline is designed to handle large sequencing datasets in parallel, so it is recommended to run it on a machine with sufficient computational resources. To run the pipeline on machines with fewer resources, it is recommended to split the fastq files or the tables into smaller chunks to prevent RAM overflow. This can be done by increasing num_parts in the config file. If any errors occur during the pipeline run, Snakemake provides detailed logs that help you debug and troubleshoot. You are most welcome to create an Issue when running into problems.
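The idea behind num_parts can be shown with a small Python sketch that distributes FASTQ records (four lines each) over several parts, so that each part can be processed with a smaller memory footprint. The round-robin scheme and the function name `split_fastq_lines` are assumptions for illustration; ERMA's own splitting may be implemented differently.

```python
from itertools import islice
from typing import Iterable


def split_fastq_lines(lines: Iterable[str], num_parts: int) -> list[list[str]]:
    """Distribute 4-line FASTQ records round-robin over num_parts lists."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    parts: list[list[str]] = [[] for _ in range(num_parts)]
    iterator = iter(lines)
    index = 0
    while True:
        record = list(islice(iterator, 4))  # header, sequence, '+', qualities
        if not record:
            break
        if len(record) != 4:
            raise ValueError("Truncated FASTQ record at end of input")
        parts[index % num_parts].extend(record)
        index += 1
    return parts


if __name__ == "__main__":
    demo = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "TTGA", "+", "IIII",
        "@r3", "GGCC", "+", "IIII",
    ]
    for number, part in enumerate(split_fastq_lines(demo, num_parts=2), start=1):
        print(f"part {number}: {len(part) // 4} record(s)")
```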
We acknowledge that many users perform a regular 16S experiment in addition to the epicPCR experiment. However, we have decided not to include an extra analysis feature for this in ERMA. Instead, we encourage users to try RiboSnake (DOI: 10.46471/gigabyte.132), a validated, automated, reproducible QIIME2-based pipeline implemented in Snakemake for analysing 16S rRNA gene amplicon sequencing data.
This project is licensed under the MIT License.