Home

Human Re-sequencing Data Analysis

This repository includes several workflows written in Common Workflow Language (CWL) for human resequencing data analysis. Before running these workflows, please read the GATK3 license, especially if you are a non-academic user.

Preparation for running workflows

Server recommendations

Memory ≥ 64GB
Threads ≥ 16
Disk space ≥ ~100GB / sample (depends on read depth)
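
The recommendations above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the memory probe assumes a Linux host where the sysconf names SC_PAGE_SIZE and SC_PHYS_PAGES are available.

```python
import os
import shutil

# Thresholds taken from the server recommendations above.
MIN_MEMORY_GB = 64
MIN_THREADS = 16
MIN_DISK_GB_PER_SAMPLE = 100  # approximate; depends on read depth


def unmet_requirements(memory_gb, threads, free_disk_gb, n_samples=1):
    """Return a message for each recommendation the machine does not meet."""
    problems = []
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"memory {memory_gb:.0f} GB < {MIN_MEMORY_GB} GB")
    if threads < MIN_THREADS:
        problems.append(f"threads {threads} < {MIN_THREADS}")
    needed = MIN_DISK_GB_PER_SAMPLE * n_samples
    if free_disk_gb < needed:
        problems.append(f"free disk {free_disk_gb:.0f} GB < {needed} GB")
    return problems


def probe_local_machine(path="."):
    """Measure memory, threads and free disk (memory probe assumes Linux)."""
    memory_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    threads = os.cpu_count() or 1
    free_disk_gb = shutil.disk_usage(path).free / 1024**3
    return memory_gb, threads, free_disk_gb
```

Calling unmet_requirements(*probe_local_machine("/path/to/workdir")) returns an empty list when the server meets all three recommendations for one sample.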

Install software

Install a CWL execution engine on your server. We tested our workflows with cwltool (version 1.0.20181217162649).

Data required to run workflows

Reference genome data and index files ... The following files can be downloaded from DDBJ FTP site.
- hs37d5.fa
- hs37d5.fa.amb
- hs37d5.fa.ann
- hs37d5.fa.bwt
- hs37d5.fa.pac
- hs37d5.fa.sa
- hs37d5.fa.dict
- hs37d5.fa.fai
- hs37d5.fa.autosomal.interval_list
- hs37d5.fa.chrX.interval_list
- hs37d5.fa.chrY.interval_list
GATK Resource Bundle data (required only for running Workflows/bams2gvcf.wBQSR.cwl) ... The following files can be downloaded from GATK Resource Bundle web site.
- dbsnp_138.b37.vcf
- dbsnp_138.b37.vcf.idx
- Mills_and_1000G_gold_standard.indels.b37.vcf
- Mills_and_1000G_gold_standard.indels.b37.vcf.idx
- 1000G_phase1.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf.idx
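
Because a missing index file typically surfaces only after a workflow has started, it may help to verify the downloads first. The sketch below lists which of the files named above are absent from a directory; the function name and the directory layout are assumptions, not something the repository provides.

```python
from pathlib import Path

# File names copied from the lists above.
REFERENCE_FILES = [
    "hs37d5.fa", "hs37d5.fa.amb", "hs37d5.fa.ann", "hs37d5.fa.bwt",
    "hs37d5.fa.pac", "hs37d5.fa.sa", "hs37d5.fa.dict", "hs37d5.fa.fai",
    "hs37d5.fa.autosomal.interval_list", "hs37d5.fa.chrX.interval_list",
    "hs37d5.fa.chrY.interval_list",
]
BUNDLE_FILES = [  # needed only for Workflows/bams2gvcf.wBQSR.cwl
    "dbsnp_138.b37.vcf", "dbsnp_138.b37.vcf.idx",
    "Mills_and_1000G_gold_standard.indels.b37.vcf",
    "Mills_and_1000G_gold_standard.indels.b37.vcf.idx",
    "1000G_phase1.indels.b37.vcf", "1000G_phase1.indels.b37.vcf.idx",
]


def missing_files(directory, names):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]
```

For example, missing_files("../../human-reseq.REF", REFERENCE_FILES) should return an empty list once the reference download is complete.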

Small open data for testing

We downloaded the original paired-end WGS data in CRAM format for 8 males from the 1000 Genomes Project, converted it from CRAM to BAM, downsampled it with a probability of 0.001, and converted it from BAM to FastQ. The following file can be downloaded from the DBCLS web site.

20200924-test-data.tar.gz (422.8MB)

From FastQ file(s) to a BAM file

Workflows/fastqPE2bam.cwl

This workflow takes paired-end (PE) FASTQ files as input and outputs a BAM file.
The PE FASTQ files are mapped onto a human reference genome using BWA MEM (version 0.7.12), which outputs a SAM file.
Then, the SAM file is sorted and converted into a BAM file using the picard SortSam command (version 2.10.6).

Usage example

$ cwltool Workflows/fastqPE2bam.cwl <job-file>
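
When several samples need to be processed, the same command can be assembled in a loop. This is a sketch only: the job file names and the results directory are hypothetical, and --outdir is a standard cwltool option that is optional here.

```python
import shlex


def cwltool_command(workflow, job_file, outdir=None):
    """Build the argument list for one cwltool invocation."""
    cmd = ["cwltool"]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd += [workflow, job_file]
    return cmd


# Hypothetical job files, one per sequencing run.
job_files = ["ERR034597.job.yaml", "ERR034598.job.yaml"]
for job_file in job_files:
    cmd = cwltool_command("Workflows/fastqPE2bam.cwl", job_file, outdir="results")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.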

Job file example

reference:
    class: File
    format: http://edamontology.org/format_1929
    path: ../../human-reseq.REF/hs37d5.fa
RG_ID: ERR034597 # type "string"
RG_PL: ILLUMINA # type "string"
RG_PU: ERR034597 # type "string"
RG_LB: ERR034597 # type "string"
RG_SM: NA18942 # type "string"
fq1:
    class: File
    path: ../../human-reseq.DATA/ERR034597_1.fastq
    format: http://edamontology.org/format_1930
fq2:
    class: File
    path: ../../human-reseq.DATA/ERR034597_2.fastq
    format: http://edamontology.org/format_1930
nthreads: 16 # type "int"
outprefix: ERR034597 # type "string"
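
A job file like the one above can also be generated from a few per-run values. The sketch below writes the same keys in the same layout using plain string formatting; the function names are hypothetical, and reusing the run accession for RG_ID, RG_PU, RG_LB and outprefix simply mirrors the example and is an assumption about how you want to name things.

```python
FASTA_FORMAT = "http://edamontology.org/format_1929"
FASTQ_FORMAT = "http://edamontology.org/format_1930"


def file_entry(key, path, fmt):
    """Render one File input in the layout used by the example job file."""
    return (
        f"{key}:\n"
        f"    class: File\n"
        f"    path: {path}\n"
        f"    format: {fmt}\n"
    )


def pe_job_text(run_id, sample, fq1, fq2, reference, nthreads=16, platform="ILLUMINA"):
    """Return the text of a job file for Workflows/fastqPE2bam.cwl."""
    parts = [
        file_entry("reference", reference, FASTA_FORMAT),
        f"RG_ID: {run_id}\n",
        f"RG_PL: {platform}\n",
        f"RG_PU: {run_id}\n",
        f"RG_LB: {run_id}\n",
        f"RG_SM: {sample}\n",
        file_entry("fq1", fq1, FASTQ_FORMAT),
        file_entry("fq2", fq2, FASTQ_FORMAT),
        f"nthreads: {nthreads}\n",
        f"outprefix: {run_id}\n",
    ]
    return "".join(parts)


print(pe_job_text(
    "ERR034597", "NA18942",
    "../../human-reseq.DATA/ERR034597_1.fastq",
    "../../human-reseq.DATA/ERR034597_2.fastq",
    "../../human-reseq.REF/hs37d5.fa",
))
```

Values are written unquoted, so identifiers that YAML would read as numbers, or paths containing characters such as ":" or "#", would need quoting before this sketch is used on them.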

Input parameters specified in job file

reference (type "File")
- Specify the FASTA file of the reference genome sequence that is used in mapping. Index files (.amb / .ann / .bwt / .pac / .sa) created by BWA are needed.
RG_ID (type "string")
- Specify read group identifier.
RG_PL (type "string")
- Specify read group platform/technology.
RG_PU (type "string")
- Specify read group platform unit.
RG_LB (type "string")
- Specify read group library identifier.
RG_SM (type "string")
- Specify read group sample identifier.
fq1 (type "File")
- Specify the FASTQ file containing the first reads of the pairs (e.g. the _1.fastq file).
fq2 (type "File")
- Specify the FASTQ file containing the second reads of the pairs (e.g. the _2.fastq file).
nthreads (type "int")
- Specify the number of threads to be used.
outprefix (type "string")
- Specify output prefix.

See the Sequence Alignment/Map Format Specification for details of the read group description.
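
To make the mapping between the RG_* parameters and the SAM header concrete, the sketch below assembles an @RG header line from the five values. The tag names (ID, PL, PU, LB, SM) come from the SAM specification; how the workflow itself passes these values to BWA is not shown on this page, so treat the function as an illustration rather than a description of the workflow internals.

```python
def read_group_line(rg_id, rg_pl, rg_pu, rg_lb, rg_sm):
    """Build a tab-separated SAM @RG header line from the RG_* job parameters."""
    fields = [("ID", rg_id), ("PL", rg_pl), ("PU", rg_pu), ("LB", rg_lb), ("SM", rg_sm)]
    for tag, value in fields:
        # Header fields are tab-delimited, so a value must be non-empty and tab-free.
        if not value or "\t" in value:
            raise ValueError(f"invalid value for read group tag {tag}: {value!r}")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


line = read_group_line("ERR034597", "ILLUMINA", "ERR034597", "ERR034597", "NA18942")
print(line)
```

With the values from the example job file, RG_SM (NA18942) identifies the sample, while RG_ID, RG_PU and RG_LB all carry the run accession.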

Steps in this workflow

bwa mem (Version 0.7.12)
picard SortSam (Version 2.10.6)

Output files

Output of this workflow includes the following files:

outprefix.sam
- Results of mapping onto human reference genome in SAM format.
outprefix.bam
- Results of mapping onto human reference genome in BAM format.
outprefix.bai
- Index file for the mapping results.

Workflows/fastqSE2bam.cwl

This workflow takes a single-end (SE) FASTQ file as input and outputs a BAM file.
The SE FASTQ file is mapped onto a human reference genome using BWA MEM (version 0.7.12), which outputs a SAM file.
Then, the SAM file is sorted and converted into a BAM file using the picard SortSam command (version 2.10.6).

Usage example

$ cwltool Workflows/fastqSE2bam.cwl <job-file>

Job file example

reference:
    class: File
    format: http://edamontology.org/format_1929
    path: ../../human-reseq.REF/hs37d5.fa
RG_ID: ERR034597 # type "string"
RG_PL: ILLUMINA # type "string"
RG_PU: ERR034597 # type "string"
RG_LB: ERR034597 # type "string"
RG_SM: NA18942 # type "string"
fq:
    class: File
    path: ../../human-reseq.DATA/ERR034597_1.fastq
    format: http://edamontology.org/format_1930
nthreads: 16 # type "int"
outprefix: ERR034597 # type "string"

Input parameters specified in job file

reference (type "File")
- Specify the FASTA file of the reference genome sequence that is used in mapping. Index files (.amb / .ann / .bwt / .pac / .sa) created by BWA are needed.
RG_ID (type "string")
- Specify read group identifier.
RG_PL (type "string")
- Specify read group platform/technology.
RG_PU (type "string")
- Specify read group platform unit.
RG_LB (type "string")
- Specify read group library identifier.
RG_SM (type "string")
- Specify read group sample identifier.
fq (type "File")
- Specify FASTQ file of single-end reads.
nthreads (type "int")
- Specify the number of threads to be used.
outprefix (type "string")
- Specify output prefix.

See the Sequence Alignment/Map Format Specification for details of the read group description.

Steps in this workflow

bwa mem (Version 0.7.12)
picard SortSam (Version 2.10.6)

Output files

Output of this workflow includes the following files:

outprefix.sam
- Results of mapping onto human reference genome in SAM format.
outprefix.bam
- Results of mapping onto human reference genome in BAM format.
outprefix.bai
- Index file for the mapping results.

From BAM file(s) to a genomic VCF file

Workflows/bams2gvcf.woBQSR_female.cwl

This workflow takes a list of BAM files as input and outputs a genomic VCF file. This workflow does not perform base quality recalibration. This workflow assumes that the target sample is female.
PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
Metrics for the merged BAM file are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
Variants are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file is output.

Usage example

$ cwltool Workflows/bams2gvcf.woBQSR_female.cwl <job-file>

Job file example

bam_files:  # type "File[]"
  - class: File
    path: ../NA18939.SRR768162.bam
    format: http://edamontology.org/format_2572
  - class: File
    path: ../NA18939.SRR768163.bam
    format: http://edamontology.org/format_2572

nthreads: 16 # type "int"

outprefix: NA18939.woBQSR_female # type "string"

reference:  # type "File"
    class: File
    format: http://edamontology.org/format_1929
    path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list

Input parameters specified in job file

bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
nthreads (type "int")
- Specify the number of threads to be used.
outprefix (type "string")
- Specify output prefix.

Steps in this workflow

picard MarkDuplicates (Version 2.10.6)
picard CollectMultipleMetrics (Version 2.18.23)
samtools flagstat (Version 1.6)
samtools idxstats (Version 1.6)
picard CollectWgsMetrics (Version 2.10.6)
gatk3 HaplotypeCaller (Version 3.7.0)

Output files

Output of this workflow includes the following files:

outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
outprefix.g.vcf.gz
- Results of genotype calling in genomic VCF format.
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling.
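
The names above are templates: outprefix and the reference_interval_name_* parts stand for the values given in the job file. The sketch below expands them, assuming the values are substituted literally; the function name is hypothetical.

```python
def expected_outputs(outprefix, interval_names=("autosome", "chrX", "chrY")):
    """Expand the output file name templates for one sample."""
    names = [f"{outprefix}.bam.{name}.wgs_metrics" for name in interval_names]
    names += [f"{outprefix}.g.vcf.gz", f"{outprefix}.g.vcf.gz.tbi"]
    return names


for name in expected_outputs("NA18939.woBQSR_female"):
    print(name)
```

With the example job file, the autosomal metrics file would therefore be expected as NA18939.woBQSR_female.bam.autosome.wgs_metrics.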

Workflows/bams2gvcf.woBQSR_male.cwl

This workflow takes a list of BAM files as input and outputs two genomic VCF files (one for autosomal variants and one for sex chromosome variants). This workflow does not perform base quality recalibration. This workflow assumes that the target sample is male.
PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
Metrics for the merged BAM file are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
Variants on autosomes are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file for autosomal variants is output.
Variants on sex chromosomes are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option for PAR regions and with the ploidy=1 option for non-PAR regions. A genomic VCF file for sex chromosome (X/Y) variants is output.
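
The ploidy rule for a male sample can be summarised as a small lookup. The sketch below uses the commonly cited GRCh37 pseudoautosomal (PAR) boundaries; whether the workflow uses exactly these coordinates is an assumption, and the function name is hypothetical. Contig names follow hs37d5 ("1" to "22", "X", "Y").

```python
# Commonly cited GRCh37 PAR boundaries (1-based, inclusive); assumed, not read from the workflow.
PAR_REGIONS = {
    "X": [(60001, 2699520), (154931044, 155260560)],
    "Y": [(10001, 2649520), (59034050, 59363566)],
}
AUTOSOMES = {str(i) for i in range(1, 23)}


def male_ploidy(chrom, pos):
    """Return the ploidy used for calling at chrom:pos in a male sample."""
    if chrom in AUTOSOMES:
        return 2
    if chrom not in PAR_REGIONS:
        raise ValueError(f"unsupported contig: {chrom}")
    in_par = any(start <= pos <= end for start, end in PAR_REGIONS[chrom])
    return 2 if in_par else 1


print(male_ploidy("X", 1_000_000))   # inside PAR1 on X
print(male_ploidy("X", 50_000_000))  # outside the PARs on X
```

Positions inside a PAR are treated as diploid because males carry homologous copies there on X and Y; elsewhere on X and Y a male has a single copy.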

Usage example

$ cwltool Workflows/bams2gvcf.woBQSR_male.cwl <job-file>

Job file example

bam_files:  # type "File[]"
  - class: File
    path: ../NA18966.ERR234335.bam
    format: http://edamontology.org/format_2572

nthreads: 16 # type "int"

outprefix: NA18966.woBQSR_male # type "string"
chrXY_outprefix: NA18966.woBQSR_male.chrXY # type "string"

reference:  # type "File"
    class: File
    format: http://edamontology.org/format_1929
    path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list

Input parameters specified in job file

bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
nthreads (type "int")
- Specify the number of threads to be used.
outprefix (type "string")
- Specify output prefix for autosomal variants.
chrXY_outprefix (type "string")
- Specify output prefix for sex-chromosome variants.

Steps in this workflow

picard MarkDuplicates (Version 2.10.6)
picard CollectMultipleMetrics (Version 2.18.23)
samtools flagstat (Version 1.6)
samtools idxstats (Version 1.6)
picard CollectWgsMetrics (Version 2.10.6)
gatk3 HaplotypeCaller (Version 3.7.0)
gatk3 SelectVariants (Version 3.7.0)
bcftools concat (Version 1.6)
bcftools index (Version 1.6)

Output files

Output of this workflow includes the following files:

outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
outprefix.g.vcf.gz
- Results of genotype calling for autosomes in genomic VCF format.
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling for autosomes.
chrXY_outprefix.g.vcf.gz
- Results of genotype calling for sex-chromosomes in genomic VCF format.
chrXY_outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling for sex-chromosomes.

Workflows/bams2gvcf.wBQSR.cwl

This workflow takes a list of BAM files as input and outputs a genomic VCF file. This workflow performs base quality recalibration. If the target sample is male, variant call results for sex chromosomes are unreliable, because all variants are called with ploidy=2.
PCR duplicates are removed using picard MarkDuplicates (version 2.10.6), which outputs a BAM file.
Base quality recalibration is performed using gatk3 BaseRecalibrator (version 3.7.0).
Metrics for the merged BAM file after base quality recalibration are calculated using samtools (version 1.6), picard CollectMultipleMetrics (version 2.18.23), and picard CollectWgsMetrics (version 2.10.6).
Variants are called using gatk3 HaplotypeCaller (version 3.7.0) with the ploidy=2 option. A genomic VCF file is output.

Usage example

$ cwltool Workflows/bams2gvcf.wBQSR.cwl <job-file>

Job file example

bam_files:  # type "File[]"
  - class: File
    path: ../NA18939.SRR768162.bam
    format: http://edamontology.org/format_2572
  - class: File
    path: ../NA18939.SRR768163.bam
    format: http://edamontology.org/format_2572

nthreads: 16 # type "int"

outprefix: NA18939.wBQSR_female # type "string"

reference:  # type "File"
    class: File
    format: http://edamontology.org/format_1929
    path: ../../human-reseq.REF/hs37d5.fa
reference_interval_name_autosome: autosome # type "string"
reference_interval_list_autosome:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.autosomal.interval_list
reference_interval_name_chrX: chrX # type "string"
reference_interval_list_chrX:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrX.interval_list
reference_interval_name_chrY: chrY # type "string"
reference_interval_list_chrY:
    class: File
    path: ../../human-reseq.REF/hs37d5.fa.chrY.interval_list

dbsnp:
    class: File
    path: ../../human-reseq.DATA/bundle.b37/dbsnp_138.b37.vcf
mills_indel:
    class: File
    path: ../../human-reseq.DATA/bundle.b37/Mills_and_1000G_gold_standard.indels.b37.vcf
onek_indel:
    class: File
    path: ../../human-reseq.DATA/bundle.b37/1000G_phase1.indels.b37.vcf

Input parameters specified in job file

bam_files (type "File[]" -- array of files)
- Specify BAM file(s) used for genotype calling.
reference (type "File")
- Specify FASTA file for reference genome sequence. Index files (.fai / ^.dict) are needed.
reference_interval_name_autosome (type "string") (default value = autosome)
- Specify interval name for autosomes. Used for determining output metrics file name.
reference_interval_list_autosome (type "File")
- Specify interval list file for autosomes. Used for calculating WGS metrics.
reference_interval_name_chrX (type "string") (default value = chrX)
- Specify interval name for chromosome X. Used for determining output metrics file name.
reference_interval_list_chrX (type "File")
- Specify interval list file for chromosome X. Used for calculating WGS metrics.
reference_interval_name_chrY (type "string") (default value = chrY)
- Specify interval name for chromosome Y. Used for determining output metrics file name.
reference_interval_list_chrY (type "File")
- Specify interval list file for chromosome Y. Used for calculating WGS metrics.
dbsnp (type "File")
- Specify dbSNP database file. Used for base quality recalibration. An index file (.idx) is needed.
mills_indel (type "File")
- Specify Mills_and_1000G_gold_standard indel database file. Used for base quality recalibration. An index file (.idx) is needed.
onek_indel (type "File")
- Specify 1000G_phase1 indel database file. Used for base quality recalibration. An index file (.idx) is needed.
nthreads (type "int")
- Specify the number of threads to be used.
outprefix (type "string")
- Specify output prefix.
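
Several of the inputs above need an index file next to them. The sketch below checks for the .fai index of the reference and the .idx index of each known-sites VCF, assuming each index sits beside its data file with the suffix appended. The .dict file is left out of the check on purpose, since its exact expected name is not pinned down here. Function names are hypothetical.

```python
import os


def companion_paths(reference, known_sites):
    """List the index files expected beside the reference and known-sites VCFs."""
    companions = [reference + ".fai"]
    companions += [vcf + ".idx" for vcf in known_sites]
    return companions


def missing_companions(reference, known_sites):
    """Return the expected index files that do not exist on disk."""
    return [p for p in companion_paths(reference, known_sites) if not os.path.isfile(p)]
```

For the example job file, known_sites would be the three paths given for dbsnp, mills_indel and onek_indel.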

Steps in this workflow

picard MarkDuplicates (Version 2.10.6)
gatk3 BaseRecalibrator
gatk3 AnalyzeCovariates
gatk3 PrintReads
picard CollectMultipleMetrics (Version 2.18.23)
samtools flagstat (Version 1.6)
samtools idxstats (Version 1.6)
picard CollectWgsMetrics (Version 2.10.6)
gatk3 HaplotypeCaller (Version 3.7.0)

Output files

Output of this workflow includes the following files:

outprefix.bam.reference_interval_name_autosome.wgs_metrics
- WGS metrics for autosomes.
outprefix.bam.reference_interval_name_chrX.wgs_metrics
- WGS metrics for chromosome X.
outprefix.bam.reference_interval_name_chrY.wgs_metrics
- WGS metrics for chromosome Y.
outprefix.g.vcf.gz
- Results of genotype calling in genomic VCF format.
outprefix.g.vcf.gz.tbi
- Index file for the results of genotype calling.
