FastQC Documentation for ATHENA Pipeline

Overview

FastQC is a quality control tool for high-throughput sequencing data, and it is integrated into the ATHENA bioinformatics pipeline. It assesses the quality of raw sequence data so that researchers can identify potential issues before downstream analysis.

Features in ATHENA

  • Automated Quality Control: Performs comprehensive quality analysis on FASTQ files
  • Parallel Processing: Uses multi-threading for faster processing of multiple files
  • Report Generation: Creates both HTML reports and terminal summaries
  • Pipeline Integration: Seamlessly integrated with ATHENA's start command for automated workflows
  • Metadata Extraction: Extracts and processes quality metrics from FastQC output

Command-Line Usage

Standalone FastQC Command

./athena fastqc [OPTIONS]

Required Parameters

  • -1, --input1 FILE: Input FASTQ file (the first mate for paired-end data, or the only file for single-end data)
  • -o, --output DIR: Output directory for FastQC results

Optional Parameters

  • -2, --input2 FILE: Second input FASTQ file (paired-end data only)
  • -r, --report: Generate and display a comprehensive quality report in the terminal
  • -v, --verbose: Enable verbose output for debugging purposes

Example Usage

Single-End Analysis

./athena fastqc -1 sample.fastq -o results/

Paired-End Analysis

./athena fastqc -1 sample_R1.fastq -2 sample_R2.fastq -o results/

With Report Generation

./athena fastqc -1 sample_R1.fastq -2 sample_R2.fastq -o results/ -r

With Verbose Output

./athena fastqc -1 sample_R1.fastq -2 sample_R2.fastq -o results/ -v
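
For scripted runs, the invocations above can be assembled programmatically. The following is a minimal sketch, assuming the ./athena launcher sits in the working directory; the helper names build_fastqc_command and run_fastqc are hypothetical and not part of ATHENA. Only the flags documented above are used.

```python
import subprocess
from typing import List, Optional


def build_fastqc_command(
    input1: str,
    output_dir: str,
    input2: Optional[str] = None,
    report: bool = False,
    verbose: bool = False,
    athena: str = "./athena",  # assumed launcher location
) -> List[str]:
    """Assemble an `athena fastqc` argument list from the documented flags."""
    cmd = [athena, "fastqc", "-1", input1]
    if input2 is not None:
        cmd += ["-2", input2]  # second mate, paired-end data only
    cmd += ["-o", output_dir]
    if report:
        cmd.append("-r")
    if verbose:
        cmd.append("-v")
    return cmd


def run_fastqc(*args, **kwargs) -> int:
    """Run the assembled command and return its exit code (hypothetical wrapper)."""
    return subprocess.run(build_fastqc_command(*args, **kwargs)).returncode
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.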

Integration with ATHENA Pipeline

FastQC is automatically executed in the ATHENA pipeline through the start command:

./athena start -1 input_R1.fastq -2 input_R2.fastq -o output_directory/

Pipeline Workflow

  1. Initial Quality Assessment: FastQC runs on raw input files
  2. Trimming: Trimmomatic processes the data based on initial QC results
  3. Final Quality Assessment: FastQC runs again on trimmed data
  4. Report Generation: Comprehensive quality reports are generated for comparison
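
The comparison in step 4 amounts to checking which modules changed status between the raw and the trimmed run. The sketch below illustrates that idea only; it is not ATHENA's implementation. The function name changed_modules and the dictionary layout (module name mapped to status) are assumptions.

```python
from typing import Dict, List, Tuple


def changed_modules(
    raw: Dict[str, str], trimmed: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (module, raw_status, trimmed_status) for modules whose status differs."""
    changes = []
    for module, before in raw.items():
        after = trimmed.get(module)
        # Only report modules present in both runs with a different status.
        if after is not None and after != before:
            changes.append((module, before, after))
    return changes
```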

Quality Control Modules

FastQC analyzes several quality aspects of your sequencing data:

Basic Statistics

  • Total sequences
  • Sequence length
  • GC content percentage
  • Sequences flagged as poor quality

Per Base Sequence Quality

  • Quality scores across all bases at each position
  • Identifies regions of poor quality
  • Helps determine trimming parameters
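
As an illustration of how this module can inform trimming, the sketch below scans the module's data rows as they appear in FastQC's fastqc_data.txt (tab-separated, base position or range first, mean quality second) and reports the first position whose mean quality falls below a threshold. The threshold of 20 and the function name first_low_quality_base are assumptions for illustration, not the rule ATHENA applies.

```python
from typing import Iterable, Optional


def first_low_quality_base(rows: Iterable[str], min_mean: float = 20.0) -> Optional[int]:
    """Return the first base position whose mean quality is below min_mean.

    Each row is a tab-separated data line: the base position (or a binned
    range such as "10-14") followed by the mean quality.
    """
    for row in rows:
        if not row.strip() or row.startswith(("#", ">>")):
            continue  # skip the column header and module delimiter lines
        fields = row.rstrip("\n").split("\t")
        base, mean = fields[0], float(fields[1])
        if mean < min_mean:
            return int(base.split("-")[0])  # start of a binned range
    return None
```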

Per Sequence Quality Scores

  • Distribution of quality scores across all sequences
  • Identifies overall data quality

Per Base N Content

  • Percentage of bases called as N at each position
  • High N content indicates sequencing problems
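
The metric itself is simple to reproduce. The following sketch computes the percentage of N calls at each position from a collection of read sequences; it illustrates the calculation and is not FastQC's own code. The function name n_content_per_position is hypothetical.

```python
from typing import Iterable, List


def n_content_per_position(reads: Iterable[str]) -> List[float]:
    """Return the percentage of N calls at each read position."""
    n_counts: List[int] = []
    totals: List[int] = []
    for read in reads:
        # Grow the counters when a longer read is encountered.
        if len(read) > len(totals):
            grow = len(read) - len(totals)
            n_counts.extend([0] * grow)
            totals.extend([0] * grow)
        for i, base in enumerate(read):
            totals[i] += 1
            if base in "Nn":
                n_counts[i] += 1
    return [100.0 * n / t for n, t in zip(n_counts, totals)]
```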

Sequence Length Distribution

  • Distribution of sequence lengths in the file
  • Important for variable-length sequencing technologies

Sequence Duplication Levels

  • Level of duplication in the sequence file
  • High duplication may indicate PCR artifacts

Overrepresented Sequences

  • Sequences that make up more than 0.1% of the total
  • Often indicates contamination or adapter sequences
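
The 0.1% rule can be expressed in a few lines. This is a simplified sketch that counts every sequence in memory, which FastQC itself may not do; the function name overrepresented is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(sequences: Iterable[str], threshold: float = 0.001) -> Dict[str, float]:
    """Map each sequence whose share of the total exceeds threshold to that share."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}
```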

Adapter Content

  • Presence of adapter sequences
  • Critical for determining trimming requirements

Output Files

Directory Structure

output_directory/
├── sample_R1_fastqc.html    # HTML quality report for read 1
├── sample_R1_fastqc.zip     # Detailed results archive for read 1
├── sample_R2_fastqc.html    # HTML quality report for read 2 (if paired-end)
└── sample_R2_fastqc.zip     # Detailed results archive for read 2 (if paired-end)
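
To locate results from a script, the output names can be derived from the input name. The sketch below assumes inputs end in .fastq or .fq, optionally followed by .gz; other extensions are not handled. The function name expected_outputs is hypothetical.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(fastq: str, output_dir: str) -> Tuple[Path, Path]:
    """Return the (html, zip) paths expected in output_dir for one input file."""
    name = Path(fastq).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    for ext in (".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    out = Path(output_dir)
    return out / f"{name}_fastqc.html", out / f"{name}_fastqc.zip"
```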

HTML Reports

  • Interactive quality control reports
  • Visual representations of quality metrics
  • Pass/Warning/Fail status for each module
  • Detailed explanations of potential issues

ZIP Archives

  • Raw data files used to generate reports
  • Tab-delimited data files
  • Can be processed programmatically for custom analysis
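
As an example of programmatic processing, the sketch below reads the per-module status from the summary.txt file inside a FastQC ZIP archive, where each line holds the status, the module name, and the input file name separated by tabs. It is a minimal illustration, not the parser ATHENA uses; the function name read_module_statuses is hypothetical.

```python
import zipfile
from typing import Dict


def read_module_statuses(zip_path: str) -> Dict[str, str]:
    """Return {module name: status} from the summary.txt inside a FastQC ZIP."""
    with zipfile.ZipFile(zip_path) as archive:
        # The archive holds one top-level folder, so locate summary.txt by suffix.
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(member).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            status, module = parts[0], parts[1]
            statuses[module] = status
    return statuses
```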

Report Generation Features

ATHENA's FastQC integration includes advanced report generation:

Terminal Report

When using the -r flag, ATHENA processes FastQC ZIP files and generates:

  • Summary Statistics: Key metrics for each input file
  • Quality Assessment: Pass/Warning/Fail status for each module
  • Comparative Analysis: Side-by-side comparison for paired-end reads
  • Recommendations: Suggested parameters for downstream processing
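
A side-by-side comparison of this kind can be sketched as follows. The layout is an assumption for illustration and does not reproduce ATHENA's actual terminal output; the function name side_by_side is hypothetical.

```python
from typing import Dict, List


def side_by_side(r1: Dict[str, str], r2: Dict[str, str]) -> List[str]:
    """Format the module statuses of two mates as aligned text rows."""
    # Keep R1's module order, then append modules seen only in R2.
    modules = list(r1) + [m for m in r2 if m not in r1]
    width = max([len("Module")] + [len(m) for m in modules])
    rows = [f"{'Module':<{width}}  {'R1':<4}  {'R2':<4}".rstrip()]
    for module in modules:
        s1 = r1.get(module, "-")  # "-" marks a module missing from one mate
        s2 = r2.get(module, "-")
        rows.append(f"{module:<{width}}  {s1:<4}  {s2:<4}".rstrip())
    return rows
```

Printing the result with print("\n".join(rows)) gives one line per module with the two statuses in fixed-width columns.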

Threading and Performance

  • Utilizes multi-threading for ZIP file processing
  • Parallel analysis of multiple input files
  • Significantly faster report generation for large datasets
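
The general pattern behind threaded ZIP processing can be sketched with a thread pool. This is an illustration under assumptions, not ATHENA's code: process_in_parallel is a hypothetical name, and worker stands for whatever function parses one archive.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def process_in_parallel(
    paths: Iterable[str], worker: Callable[[str], T], max_workers: int = 4
) -> Dict[str, T]:
    """Apply worker to every path on a thread pool and collect results by path."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(worker, paths)  # map preserves the input order
        return dict(zip(paths, results))
```

Threads suit this task because reading archives is mostly I/O-bound; CPU-heavy parsing would benefit more from separate processes.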

Best Practices

Input Preparation

  • Ensure FASTQ files are properly formatted
  • Use absolute paths to avoid file location issues
  • Verify file integrity before analysis
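
A basic format check can catch truncated or malformed files before a run. The sketch below assumes the common four-line record layout (header, sequence, separator, quality) and does not handle multi-line sequences; the function name find_fastq_error is hypothetical.

```python
from typing import Iterable, Optional


def find_fastq_error(lines: Iterable[str]) -> Optional[str]:
    """Return a description of the first format problem, or None if none is found."""
    record = []
    count = 0
    for raw in lines:
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue  # keep collecting until a full record is available
        count += 1
        header, seq, sep, qual = record
        if not header.startswith("@"):
            return f"record {count}: header does not start with '@'"
        if not sep.startswith("+"):
            return f"record {count}: separator line does not start with '+'"
        if len(seq) != len(qual):
            return f"record {count}: sequence and quality lengths differ"
        record = []
    if record:
        return f"truncated record after {count} complete record(s)"
    return None
```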

Output Organization

  • Use descriptive output directory names
  • Keep FastQC results organized by sample/experiment
  • Archive results for future reference

Parameter Selection

  • Use -r flag for detailed terminal reports
  • Enable -v for troubleshooting
  • Process paired-end files together for comparative analysis

Troubleshooting

Common Issues

File Not Found

Error: Could not find input file

Solution: Check file paths and ensure files exist

Permission Denied

Error: Cannot write to output directory

Solution: Verify write permissions for output directory

Memory Issues

Error: Out of memory

Solution: Process smaller files or increase available memory
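
The first two issues can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not part of ATHENA; it assumes that a missing output directory will be created later, so it tests the parent directory for write access in that case.

```python
import os
from typing import List, Optional


def preflight(input1: str, output_dir: str, input2: Optional[str] = None) -> List[str]:
    """Return human-readable problems; an empty list means none were detected."""
    problems = []
    for path in (input1, input2):
        if path is not None and not os.path.isfile(path):
            problems.append(f"input file not found: {path}")
    # If the output directory does not exist yet, check its parent instead.
    if os.path.isdir(output_dir):
        target = output_dir
    else:
        target = os.path.dirname(os.path.abspath(output_dir))
    if not os.access(target, os.W_OK):
        problems.append(f"cannot write to: {target}")
    return problems
```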

Verbose Mode

Enable verbose mode (-v) to see detailed execution information:

  • FastQC command construction
  • File processing status
  • Threading information
  • Error details

Performance Considerations

Threading

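  • FastQC parallelizes across input files (its -t/--threads option sets how many files are processed at once), so a single file does not run faster on more cores
  • ATHENA's implementation adds its own threading for report processing
  • Total run time improves with the number of available cores mainly when several files are processed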

Memory Usage

  • Memory requirements scale with file size
  • Typical usage: 1-2GB RAM for standard FASTQ files
  • Large files may require additional memory

Processing Time

  • Depends on file size and system specifications
  • Typical processing: 1-5 minutes for standard FASTQ files
  • Parallel processing significantly reduces total time for multiple files

Integration Notes

Version Compatibility

  • Tested with FastQC v0.11.9 and newer
  • Compatible with standard FASTQ formats
  • Supports both compressed (.gz) and uncompressed files
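
Custom scripts that read the same inputs can handle both cases with a small helper. This sketch selects the reader by file extension, which is an assumption; the function name open_fastq is hypothetical.

```python
import gzip
from typing import TextIO


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file as text, decompressing on the fly when it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" yields decoded text lines
    return open(path, "r")
```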

Dependencies

  • Java Runtime Environment (JRE) required for FastQC
  • FastQC executable must be available in system PATH
  • ATHENA handles command construction and execution
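
Whether the dependencies are reachable can be verified before a run. The sketch below looks up the executables on PATH; the executable names java and fastqc are the usual ones and are assumed here, and the function name missing_tools is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("java", "fastqc"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

The which parameter exists only so the lookup can be replaced in tests; normal callers can use missing_tools() with no arguments.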

Pipeline Coordination

  • Results are automatically processed by subsequent pipeline steps
  • Quality metrics inform Trimmomatic parameter selection
  • Output organization facilitates downstream analysis

Support and Further Information

This documentation covers FastQC as integrated within the ATHENA pipeline. For general FastQC usage outside of ATHENA, and for detailed explanations of each analysis module, refer to the official FastQC documentation.