Home
This repository includes several workflows written in Common Workflow Language (CWL) for human resequencing data analysis. Before running these workflows, please read the GATK3 license, especially if you are a non-academic user.
Server recommendations
- Memory ≥ 64GB
- Number of threads ≥ 16
- Disk space ≥ ~100GB / sample (depends on read depth)
Install software
- Install a CWL execution engine on your server. We tested our workflows with cwltool (version 1.0.20181217162649).
Data required to run workflows
- Reference genome data and index files ... The following files can be downloaded from DDBJ FTP site.
- hs37d5.fa
- hs37d5.fa.amb
- hs37d5.fa.ann
- hs37d5.fa.bwt
- hs37d5.fa.pac
- hs37d5.fa.sa
- hs37d5.fa.dict
- hs37d5.fa.fai
- hs37d5.fa.autosomal.interval_list
- hs37d5.fa.chrX.interval_list
- hs37d5.fa.chrY.interval_list
- GATK Resource Bundle data (required only for running Workflows/bams2gvcf.wBQSR.cwl) ... The following files can be downloaded from GATK Resource Bundle web site.
- dbsnp_138.b37.vcf
- dbsnp_138.b37.vcf.idx
- Mills_and_1000G_gold_standard.indels.b37.vcf
- Mills_and_1000G_gold_standard.indels.b37.vcf.idx
- 1000G_phase1.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf.idx
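Before launching a workflow, it can help to confirm that every file listed above is in place. The following is a minimal sketch, assuming that the files of each group sit together in one directory. The function name `missing_files` and the directory layout are illustrative and are not part of this repository.

```python
from pathlib import Path

# File names copied from the two lists above.
REFERENCE_FILES = [
    "hs37d5.fa",
    "hs37d5.fa.amb",
    "hs37d5.fa.ann",
    "hs37d5.fa.bwt",
    "hs37d5.fa.pac",
    "hs37d5.fa.sa",
    "hs37d5.fa.dict",
    "hs37d5.fa.fai",
    "hs37d5.fa.autosomal.interval_list",
    "hs37d5.fa.chrX.interval_list",
    "hs37d5.fa.chrY.interval_list",
]
BUNDLE_FILES = [
    "dbsnp_138.b37.vcf",
    "dbsnp_138.b37.vcf.idx",
    "Mills_and_1000G_gold_standard.indels.b37.vcf",
    "Mills_and_1000G_gold_standard.indels.b37.vcf.idx",
    "1000G_phase1.indels.b37.vcf",
    "1000G_phase1.indels.b37.vcf.idx",
]


def missing_files(directory, names):
    """Return the entries of `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]
```

For example, `missing_files("../../human-reseq.REF", REFERENCE_FILES)` returns an empty list when the reference data is complete (the directory path here is only the one used in the job file examples).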
Small open data for testing
We downloaded the original paired-end WGS data in CRAM format for 8 males from the 1000 Genomes Project, converted it from CRAM to BAM, downsampled the reads with a probability of 0.001, and converted the result from BAM to FASTQ. The following file can be downloaded from the DBCLS web site.
- 20200924-test-data.tar.gz (422.8MB)
Workflows/fastqPE2bam.cwl
- This workflow takes paired-end (PE) FASTQ files as input and outputs a BAM file.
- The PE FASTQ files are mapped onto a human reference genome using BWA MEM (version 0.7.12), which outputs a SAM file.
- The SAM file is then sorted and converted into a BAM file using the picard SortSam command (version 2.10.6).
Usage example
$ cwltool Workflows/fastqPE2bam.cwl <job-file>
Job file example
reference:
  class: File
  format: http://edamontology.org/format_1929
  path: ../../human-reseq.REF/hs37d5.fa
RG_ID: ERR034597 # type "string"
RG_PL: ILLUMINA # type "string"
RG_PU: ERR034597 # type "string"
RG_LB: ERR034597 # type "string"
RG_SM: NA18942 # type "string"
fq1:
  class: File
  path: ../../human-reseq.DATA/ERR034597_1.fastq
  format: http://edamontology.org/format_1930
fq2:
  class: File
  path: ../../human-reseq.DATA/ERR034597_2.fastq
  format: http://edamontology.org/format_1930
nthreads: 16 # type "int"
outprefix: ERR034597 # type "string"
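When many runs have to be mapped, job files like the one above can be generated rather than written by hand. The sketch below builds the same structure as a Python dictionary and prints it as JSON, which cwltool also accepts as a job file. The helper names `cwl_file` and `make_pe_job` are hypothetical, and reusing the run accession for RG_ID, RG_PU, RG_LB, and outprefix simply mirrors the example above; it is not a requirement of the workflow.

```python
import json

# EDAM format identifiers used in the job file example above.
FASTA = "http://edamontology.org/format_1929"
FASTQ = "http://edamontology.org/format_1930"


def cwl_file(path, fmt):
    """Build a CWL File object with a path and a format."""
    return {"class": "File", "path": path, "format": fmt}


def make_pe_job(reference, fq1, fq2, run_id, sample,
                nthreads=16, platform="ILLUMINA"):
    """Return a job dictionary for the paired-end mapping workflow."""
    return {
        "reference": cwl_file(reference, FASTA),
        "RG_ID": run_id,
        "RG_PL": platform,
        "RG_PU": run_id,
        "RG_LB": run_id,
        "RG_SM": sample,
        "fq1": cwl_file(fq1, FASTQ),
        "fq2": cwl_file(fq2, FASTQ),
        "nthreads": nthreads,
        "outprefix": run_id,
    }


job = make_pe_job(
    reference="../../human-reseq.REF/hs37d5.fa",
    fq1="../../human-reseq.DATA/ERR034597_1.fastq",
    fq2="../../human-reseq.DATA/ERR034597_2.fastq",
    run_id="ERR034597",
    sample="NA18942",
)
print(json.dumps(job, indent=2))
```

Writing the printed JSON to a file gives a job file that can be passed to cwltool in place of the YAML version.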
Input parameters specified in job file
-
reference (type "File")
- Specify the FASTA file of the reference genome sequence that is used in mapping. The index files created by BWA (.amb / .ann / .bwt / .pac / .sa) are needed.
-
RG_ID (type "string")
- Specify read group identifier.
-
RG_PL (type "string")
- Specify read group platform/technology.
-
RG_PU (type "string")
- Specify read group platform unit.
-
RG_LB (type "string")
- Specify read group library identifier.
-
RG_SM (type "string")
- Specify read group sample identifier.
-
fq1 (type "File")
- Specify the FASTQ file containing the first reads (read 1) of the paired-end data.
-
fq2 (type "File")
- Specify the FASTQ file containing the second reads (read 2) of the paired-end data.
-
nthreads (type "int")
- Specify the number of threads to be used.
-
outprefix (type "string")
- Specify output prefix.
See Sequence Alignment/Map Format Specification for details of read group description
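To make the role of the five RG_* parameters concrete, the sketch below assembles them into an `@RG` header line as defined in the SAM specification (tab-separated `TAG:VALUE` fields). How the workflow itself passes these values to BWA is not shown on this page, so the function `read_group_line` is only an illustration of the resulting header.

```python
def read_group_line(rg_id, rg_pl, rg_pu, rg_lb, rg_sm):
    """Build a SAM @RG header line from the five read group parameters."""
    fields = [
        ("ID", rg_id),  # read group identifier
        ("PL", rg_pl),  # platform/technology
        ("PU", rg_pu),  # platform unit
        ("LB", rg_lb),  # library identifier
        ("SM", rg_sm),  # sample identifier
    ]
    return "\t".join(["@RG"] + [f"{tag}:{value}" for tag, value in fields])


line = read_group_line("ERR034597", "ILLUMINA", "ERR034597", "ERR034597", "NA18942")
print(line)
```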
Steps in this workflow
- bwa mem (Version 0.7.12)
- picard SortSam (Version 2.10.6)
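The two steps above correspond roughly to the following command lines, written here as Python argument lists. This is a sketch under assumptions: the exact options used inside the CWL tool definitions are not shown on this page, and `picard.jar` is a placeholder path. The flags themselves (`bwa mem -t/-R`, and Picard `SortSam` with `INPUT`, `OUTPUT`, `SORT_ORDER`, `CREATE_INDEX`) are standard options of those tools.

```python
def bwa_mem_argv(reference, fastqs, read_group, nthreads):
    """Argument list for bwa mem; the SAM output is written to stdout."""
    # bwa expects tabs in the -R string written as the two characters "\t".
    return ["bwa", "mem", "-t", str(nthreads), "-R", read_group,
            reference, *fastqs]


def sort_sam_argv(sam, bam, picard_jar="picard.jar"):
    """Argument list for picard SortSam (coordinate sort plus BAM index)."""
    return ["java", "-jar", picard_jar, "SortSam",
            f"INPUT={sam}", f"OUTPUT={bam}",
            "SORT_ORDER=coordinate", "CREATE_INDEX=true"]


rg = r"@RG\tID:ERR034597\tPL:ILLUMINA\tPU:ERR034597\tLB:ERR034597\tSM:NA18942"
print(" ".join(bwa_mem_argv("hs37d5.fa",
                            ["ERR034597_1.fastq", "ERR034597_2.fastq"],
                            rg, 16)))
print(" ".join(sort_sam_argv("ERR034597.sam", "ERR034597.bam")))
```

Passing a single FASTQ file in `fastqs` gives the single-end form of the same command.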
Output files
Output of this workflow includes the following files:
-
outprefix.sam
- Results of mapping onto human reference genome in SAM format.
-
outprefix.bam
- Results of mapping onto human reference genome in BAM format.
-
outprefix.bai
- Index file for the mapping results.
Workflows/fastqSE2bam.cwl
- This workflow takes a single-end (SE) FASTQ file as input and outputs a BAM file.
- The SE FASTQ file is mapped onto a human reference genome using BWA MEM (version 0.7.12), which outputs a SAM file.
- The SAM file is then sorted and converted into a BAM file using the picard SortSam command (version 2.10.6).
Usage example
$ cwltool Workflows/fastqSE2bam.cwl <job-file>
Job file example
reference:
  class: File
  format: http://edamontology.org/format_1929
  path: ../../human-reseq.REF/hs37d5.fa
RG_ID: ERR034597 # type "string"
RG_PL: ILLUMINA # type "string"
RG_PU: ERR034597 # type "string"
RG_LB: ERR034597 # type "string"
RG_SM: NA18942 # type "string"
fq:
  class: File
  path: ../../human-reseq.DATA/ERR034597_1.fastq
  format: http://edamontology.org/format_1930
nthreads: 16 # type "int"
outprefix: ERR034597 # type "string"
Input parameters specified in job file
-
reference (type "File")
- Specify the FASTA file of the reference genome sequence that is used in mapping. The index files created by BWA (.amb / .ann / .bwt / .pac / .sa) are needed.
-
RG_ID (type "string")
- Specify read group identifier.
-
RG_PL (type "string")
- Specify read group platform/technology.
-
RG_PU (type "string")
- Specify read group platform unit.
-
RG_LB (type "string")
- Specify read group library identifier.
-
RG_SM (type "string")
- Specify read group sample identifier.
-
fq (type "File")
- Specify FASTQ file of single-end reads.
-
nthreads (type "int")
- Specify the number of threads to be used.
-
outprefix (type "string")
- Specify output prefix.
See Sequence Alignment/Map Format Specification for details of read group description
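A job file can be checked against the parameter list above before it is handed to the CWL runner. The following sketch works on a job that has already been loaded into a Python dictionary. It assumes that every listed parameter must be present (the workflow definition may provide defaults for some of them), and the function name `check_se_job` is hypothetical.

```python
# Parameter names from the list above.
REQUIRED_SE_KEYS = {
    "reference", "RG_ID", "RG_PL", "RG_PU", "RG_LB", "RG_SM",
    "fq", "nthreads", "outprefix",
}


def check_se_job(job):
    """Return a list of problems found in a single-end job dictionary."""
    problems = [f"missing parameter: {key}"
                for key in sorted(REQUIRED_SE_KEYS - set(job))]
    # File-typed parameters must be CWL File objects.
    for key in ("reference", "fq"):
        value = job.get(key)
        if value is not None and not (
                isinstance(value, dict) and value.get("class") == "File"):
            problems.append(f"{key} must be a File object")
    if "nthreads" in job and not isinstance(job["nthreads"], int):
        problems.append("nthreads must be an int")
    return problems


example = {"reference": {"class": "File", "path": "hs37d5.fa"},
           "fq": "ERR034597_1.fastq", "nthreads": "16"}
for problem in check_se_job(example):
    print(problem)
```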
Steps in this workflow
- bwa mem (Version 0.7.12)
- picard SortSam (Version 2.10.6)
Output files
Output of this workflow includes the following files:
-
outprefix.sam
- Results of mapping onto human reference genome in SAM format.
-
outprefix.bam
- Results of mapping onto human reference genome in BAM format.
-
outprefix.bai
- Index file for the mapping results.
Workflows/bams2gvcf.woBQSR_female.cwl
- This workflow takes a list of BAM files as input and outputs a genomic VCF file. This workflow does not perform base quality recalibration. This workflow assumes that the target sample is female.
- PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
- Metrics for the merged BAM file are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
- Variants are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file is output.
Usage example
$ cwltool Workflows/bams2gvcf.woBQSR_female.cwl <job-file>
Job file example
bam_files: # type "File[]"
  - class: File
    path: ../NA18939.SRR768162.bam
    format: http://edamontology.org/format_2572
  - class: File
    path: ../NA18939.SRR768163.bam
    format: http://edamontology.org/format_2572
nthreads: 16 # type "int"
outprefix: NA18939.woBQSR_female # type "string"
reference: # type "File"
  class: File
  format: http://edamontology.org/format_1929
  path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list
Input parameters specified in job file
-
bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
-
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
-
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
-
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
-
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
-
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
-
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
-
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
-
nthreads (type "int")
- Specify the number of threads to be used.
-
outprefix (type "string")
- Specify output prefix.
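The interval list files named above follow the Picard interval_list layout: SAM-style header lines starting with `@`, followed by tab-separated records of sequence name, start, end, strand, and interval name, with 1-based inclusive coordinates. As a sketch of how such a file can be inspected, the function below sums the number of bases it covers. The function name and the toy records are illustrative, not data from this repository.

```python
def interval_list_bases(lines):
    """Sum the bases covered by the records of a Picard interval_list."""
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        _name, start, end = line.split("\t")[:3]
        total += int(end) - int(start) + 1  # 1-based, inclusive
    return total


toy = [
    "@HD\tVN:1.5",
    "@SQ\tSN:1\tLN:1000",
    "1\t1\t100\t+\tregion1",
    "1\t201\t250\t+\tregion2",
]
print(interval_list_bases(toy))
```

For a real file, pass an open file handle, for example `interval_list_bases(open(path))`.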
Steps in this workflow
- picard MarkDuplicates (Version 2.10.6)
- picard CollectMultipleMetrics (Version 2.18.23)
- samtools flagstat (Version 1.6)
- samtools idxstats (Version 1.6)
- picard CollectWgsMetrics (Version 2.10.6)
- gatk3 HaplotypeCaller (Version 3.7.0)
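The final step can be pictured as a GATK3 command line along the following lines, written as a Python argument list. This is a sketch, not the command stored in the CWL tool definition: `GenomeAnalysisTK.jar` is a placeholder path, and the choice to pass the thread count via `-nct` is an assumption. The options themselves (`-T`, `-R`, `-I`, `-o`, `-L`, `--emitRefConfidence GVCF`, `--sample_ploidy`) are standard GATK3 HaplotypeCaller options.

```python
def haplotype_caller_argv(reference, bam, out_gvcf, ploidy=2, nthreads=1,
                          intervals=None, gatk_jar="GenomeAnalysisTK.jar"):
    """Argument list for a GATK3 HaplotypeCaller run that emits a gVCF."""
    argv = ["java", "-jar", gatk_jar,
            "-T", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "--emitRefConfidence", "GVCF",
            "--sample_ploidy", str(ploidy),
            "-nct", str(nthreads),
            "-o", out_gvcf]
    if intervals is not None:
        argv += ["-L", intervals]  # restrict calling to the given intervals
    return argv


print(" ".join(haplotype_caller_argv(
    "hs37d5.fa", "NA18939.woBQSR_female.bam",
    "NA18939.woBQSR_female.g.vcf.gz", ploidy=2, nthreads=16)))
```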
Output files
Output of this workflow includes the following files:
-
outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
-
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
-
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
-
outprefix.g.vcf.gz
- Results of genotype calling in genomic VCF format.
-
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling.
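The output names above are patterns: `outprefix` and the `reference_interval_name_*` parts stand for the values given in the job file. The sketch below expands them, assuming the values are substituted literally; the function name `bams2gvcf_outputs` is hypothetical.

```python
def bams2gvcf_outputs(outprefix,
                      interval_names=("autosome", "chrX", "chrY")):
    """Expand the output name patterns for a given job."""
    metrics = [f"{outprefix}.bam.{name}.wgs_metrics"
               for name in interval_names]
    return metrics + [f"{outprefix}.g.vcf.gz", f"{outprefix}.g.vcf.gz.tbi"]


for name in bams2gvcf_outputs("NA18939.woBQSR_female"):
    print(name)
```

With the default interval names from the job file example, the first entry is `NA18939.woBQSR_female.bam.autosome.wgs_metrics`.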
Workflows/bams2gvcf.woBQSR_male.cwl
- This workflow takes a list of BAM files as input and outputs two genomic VCF files (one for autosome variants and one for sex chromosome variants). This workflow does not perform base quality recalibration. This workflow assumes that the target sample is male.
- PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
- Metrics for the merged BAM file are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
- Variants on autosomes are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file for autosome variants is output.
- Variants on sex chromosomes are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option for PAR regions and with the ploidy=1 option for non-PAR regions. A genomic VCF file for sex chromosome (X/Y) variants is output.
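The ploidy rule for a male sample can be written down as a small lookup. The sketch below uses the pseudoautosomal region (PAR) coordinates of GRCh37, on which hs37d5 is based, with contig names `1` to `22`, `X`, and `Y`. The intervals that the workflow itself uses are not listed on this page, so treat these boundaries and the function name `male_ploidy` as assumptions for illustration.

```python
# GRCh37 PAR1 and PAR2 coordinates (1-based, inclusive).
PAR_GRCH37 = {
    "X": [(60001, 2699520), (154931044, 155260560)],
    "Y": [(10001, 2649520), (59034050, 59363566)],
}
AUTOSOMES = {str(i) for i in range(1, 23)}


def male_ploidy(chrom, pos):
    """Return the ploidy used for calling at a position in a male sample."""
    if chrom in AUTOSOMES:
        return 2
    if chrom in PAR_GRCH37:
        in_par = any(start <= pos <= end for start, end in PAR_GRCH37[chrom])
        return 2 if in_par else 1  # PAR is diploid, the rest is haploid
    raise ValueError(f"unsupported contig: {chrom}")


print(male_ploidy("X", 1000000))    # inside PAR1 of chromosome X
print(male_ploidy("X", 100000000))  # outside the PARs
```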
Usage example
$ cwltool Workflows/bams2gvcf.woBQSR_male.cwl <job-file>
Job file example
bam_files: # type "File[]"
  - class: File
    path: ../NA18966.ERR234335.bam
    format: http://edamontology.org/format_2572
nthreads: 16 # type "int"
outprefix: NA18966.woBQSR_male # type "string"
chrXY_outprefix: NA18966.woBQSR_male.chrXY # type "string"
reference: # type "File"
  class: File
  format: http://edamontology.org/format_1929
  path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list
Input parameters specified in job file
-
bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
-
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
-
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
-
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
-
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
-
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
-
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
-
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
-
nthreads (type "int")
- Specify the number of threads to be used.
-
outprefix (type "string")
- Specify output prefix for autosomal variants.
-
chrXY_outprefix (type "string")
- Specify output prefix for sex-chromosome variants.
Steps in this workflow
- picard MarkDuplicates (Version 2.10.6)
- picard CollectMultipleMetrics (Version 2.18.23)
- samtools flagstat (Version 1.6)
- samtools idxstats (Version 1.6)
- picard CollectWgsMetrics (Version 2.10.6)
- gatk3 HaplotypeCaller (Version 3.7.0)
- gatk3 SelectVariants (Version 3.7.0)
- bcftools concat (Version 1.6)
- bcftools index (Version 1.6)
Output files
Output of this workflow includes the following files:
-
outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
-
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
-
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
-
outprefix.g.vcf.gz
- Results of genotype calling for autosomes in genomic VCF format.
-
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling for autosomes.
-
chrXY_outprefix.g.vcf.gz
- Results of genotype calling for sex-chromosomes in genomic VCF format.
-
chrXY_outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling for sex-chromosomes.
Workflows/bams2gvcf.wBQSR.cwl
- This workflow takes a list of BAM files as input and outputs a genomic VCF file. This workflow performs base quality recalibration. If the target sample is male, variant call results for sex chromosomes are unreliable, because this workflow calls all variants with the ploidy=2 option.
- PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
- Base quality recalibration is performed using gatk3 BaseRecalibrator (version 3.7.0).
- Metrics for the merged BAM file after base quality recalibration are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
- Variants are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file is output.
Usage example
$ cwltool Workflows/bams2gvcf.wBQSR.cwl <job-file>
Job file example
bam_files: # type "File[]"
  - class: File
    path: ../NA18939.SRR768162.bam
    format: http://edamontology.org/format_2572
  - class: File
    path: ../NA18939.SRR768163.bam
    format: http://edamontology.org/format_2572
nthreads: 16 # type "int"
outprefix: NA18939.wBQSR_female # type "string"
reference: # type "File"
  class: File
  format: http://edamontology.org/format_1929
  path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
  class: File
  path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list
dbsnp:
  class: File
  path: ../../human-reseq.DATA/bundle.b37/dbsnp_138.b37.vcf
mills_indel:
  class: File
  path: ../../human-reseq.DATA/bundle.b37/Mills_and_1000G_gold_standard.indels.b37.vcf
onek_indel:
  class: File
  path: ../../human-reseq.DATA/bundle.b37/1000G_phase1.indels.b37.vcf
Input parameters specified in job file
-
bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
-
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
-
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
-
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
-
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
-
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
-
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
-
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
-
dbsnp (type "File")
- Specify dbSNP database file. Used for base quality recalibration. An index file (.idx) is needed.
-
mills_indel (type "File")
- Specify Mills_and_1000G_gold_standard indel database file. Used for base quality recalibration. An index file (.idx) is needed.
-
onek_indel (type "File")
- Specify 1000G_phase1 indel database file. Used for base quality recalibration. An index file (.idx) is needed.
-
nthreads (type "int")
- Specify the number of threads to be used.
-
outprefix (type "string")
- Specify output prefix.
Steps in this workflow
- picard MarkDuplicates (Version 2.10.6)
- gatk3 BaseRecalibrator
- gatk3 AnalyzeCovariates
- gatk3 PrintReads
- picard CollectMultipleMetrics (Version 2.18.23)
- samtools flagstat (Version 1.6)
- samtools idxstats (Version 1.6)
- picard CollectWgsMetrics (Version 2.10.6)
- gatk3 HaplotypeCaller (Version 3.7.0)
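The recalibration steps in this list follow the usual GATK3 pattern: BaseRecalibrator builds a recalibration table from the known-sites files (here dbsnp, mills_indel, and onek_indel), and PrintReads applies that table to write a recalibrated BAM file. The sketch below shows the two command lines as Python argument lists. `GenomeAnalysisTK.jar` and the table and BAM file names are placeholders, the AnalyzeCovariates step is left out, and the options actually set in the CWL tool definitions may differ.

```python
GATK_JAR = "GenomeAnalysisTK.jar"  # placeholder path


def base_recalibrator_argv(reference, bam, known_sites, table):
    """Argument list for GATK3 BaseRecalibrator."""
    argv = ["java", "-jar", GATK_JAR, "-T", "BaseRecalibrator",
            "-R", reference, "-I", bam, "-o", table]
    for vcf in known_sites:
        argv += ["-knownSites", vcf]  # one flag per known-sites VCF
    return argv


def print_reads_argv(reference, bam, table, out_bam):
    """Argument list for GATK3 PrintReads applying a recalibration table."""
    return ["java", "-jar", GATK_JAR, "-T", "PrintReads",
            "-R", reference, "-I", bam, "-BQSR", table, "-o", out_bam]


known = ["dbsnp_138.b37.vcf",
         "Mills_and_1000G_gold_standard.indels.b37.vcf",
         "1000G_phase1.indels.b37.vcf"]
print(" ".join(base_recalibrator_argv("hs37d5.fa", "in.bam", known,
                                      "recal.table")))
print(" ".join(print_reads_argv("hs37d5.fa", "in.bam", "recal.table",
                                "recal.bam")))
```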
Output files
Output of this workflow includes the following files:
-
outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
-
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
-
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
-
outprefix.g.vcf.gz
- Results of genotype calling in genomic VCF format.
-
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling.